An early preview of my new book ‘Data Badass’ is available using this link. It has not yet been thoroughly edited and will be added to over time, but I am keen to gather some feedback. The book covers the data basics, data platforms (including Hadoop, Kafka, Flume, Hive and Spark) and data science (including statistics and machine learning). More machine learning topics will be added before the final release.
Data is all around us and we are generating it constantly. When you browse the web, you’re generating logs of what you do, which can be collected by your internet service provider. When you place an order on a website, you’re creating new records in the retailer’s database, and when you walk around the neighbourhood or go for a drive, your location is likely being tracked and stored by many of the apps installed on your device. Below is a simple sample flow showing how data is generated.
These are just a few of the occasions on which you generate data. In reality, you are almost always generating data, often without knowing it.
Businesses are increasingly becoming aware of the value of data and it has become a core component of many company strategies in recent years.
As a retailer, you will have a list of transactions. Without analysis, these transaction lists are of little benefit to your business, but with some analysis we can determine which products are usually purchased together. For example, we would probably find that a customer who buys hot dogs is also highly likely to buy bread rolls.
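To make the hot dog example a little more concrete, here is a minimal Python sketch of that kind of analysis. The transaction list is invented purely for illustration, and the helper names `pair_counts` and `confidence` are my own rather than part of any library. It counts how often each pair of products appears in the same basket, then estimates how likely a customer is to buy one product given that they bought another.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction list, invented purely for illustration.
transactions = [
    {"hot dogs", "bread rolls", "ketchup"},
    {"hot dogs", "bread rolls"},
    {"milk", "bread rolls"},
    {"hot dogs", "bread rolls", "mustard"},
    {"milk", "eggs"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def confidence(baskets, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    both = sum(1 for b in with_antecedent if consequent in b)
    return both / len(with_antecedent)


counts = pair_counts(transactions)
print(counts.most_common(1))
print(confidence(transactions, "hot dogs", "bread rolls"))
print(confidence(transactions, "bread rolls", "hot dogs"))
```

In this made-up sample, every basket with hot dogs also contains bread rolls, while only three of the four baskets with bread rolls contain hot dogs. A real retailer would run the same idea over millions of transactions, which is one reason the storage and processing tools matter.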
Now that businesses have this additional focus on data-driven decisions, we face additional challenges around the data. How do we store and process such an enormous volume of data?
This book will take you through the core concepts of data types, data structures and data modelling. With this core knowledge, we’ll then discuss the toolsets we can use to extract maximum value from our datasets, and we’ll even cover machine learning theory.
The goal of this book is to give you a good grounding in all things data, so let’s begin.