Python

Generate mock data to test your pipeline

Quite often, we need to test that our pipelines work at scale without having access to production systems. To help solve this, we can generate mock data using the Python library ‘Faker’. Faker is a comprehensive fake-data library. It can generate data for customers, addresses, bank details, company names, credit card details, currencies, cryptocurrencies, files, domain […]

Read more
Data

What’s all this about Flume + Kafka (Flafka)?

There is a concept called ‘Flafka’, through which we exploit the strengths of both Kafka and Flume to create a more robust ingestion process. By deploying Flafka, we take advantage of the out-of-the-box connectors available in Flume and use them to connect to all of our data sources. The data is then […]

Read more
Data

An introduction to Kafka

Kafka is a low-latency messaging system. It takes data from one location, for example application log files, and makes it available to other systems with very little delay (latency). That may be useful in the example below, where Kafka makes our CRM and Google Analytics data available to Spark Streaming, which will carry out some […]
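As a rough sketch of the producer side of that pipeline, here is a minimal helper using the third-party kafka-python package; the function, topic name and broker address are my own assumptions, and a running broker is needed before actually calling it:

```python
def publish_log_line(line, topic="app-logs", servers="localhost:9092"):
    """Send one log line to Kafka.

    Assumes a broker is reachable at `servers` and that kafka-python
    is installed (pip install kafka-python).
    """
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=servers)
    producer.send(topic, line.encode("utf-8"))  # topic name + bytes payload
    producer.flush()  # block until the message has really been sent
    producer.close()
```

Downstream systems (Spark Streaming in the example) would then subscribe to the same topic and consume the messages with very little delay.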

Read more
Python

Extracting and cleaning data from the Twitter API in Python

I’ve heard from a tonne of people that they don’t read the news because it’s negative and the constant presence of articles about assault, killings and other terrible things brings their mood down. So I started thinking: how could we filter the news to only show us things that we want to see? The way […]
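The post's actual approach is truncated above, but the "cleaning" half of the title can be illustrated with a small stdlib sketch; the regexes and function name here are mine:

```python
import re

def clean_tweet(text):
    """Strip URLs and @mentions, drop the '#' from hashtags, tidy whitespace."""
    text = re.sub(r"https?://\S+", "", text)  # remove links
    text = re.sub(r"@\w+", "", text)          # remove mentions
    text = text.replace("#", "")              # keep hashtag words, lose the '#'
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

print(clean_tweet("Great news from @nasa! #space https://t.co/abc123"))
# → "Great news from ! space"
```

Once the text is cleaned like this, it is much easier to run filtering or sentiment logic over it.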

Read more
Data

An introduction to Flume

Flume is similar to Kafka in many ways. It takes data from its source and distributes it to its destination. The key difference is that Flume is designed for high-throughput log streaming into Hadoop (HDFS or HBase) – it is not designed to ship out to a large number of consumers. Flume, like Kafka, is […]

Read more
Python

A simple intro to sentiment analysis in Python

Sentiment analysis can provide key insight into the feelings of your customers towards your company and is hence becoming an increasingly important part of data analysis. Building a machine learning model to identify positive and negative sentiments is pretty complex, but luckily for us, there is a Python library that can help us out. It’s […]
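The name of the library the post recommends is cut off above, so rather than guess it, here is a toy lexicon-based scorer that illustrates the general idea; the word lists and thresholds are invented purely for illustration:

```python
# Toy lexicon-based sentiment scorer -- an illustration of the idea, not the library
POSITIVE = {"great", "love", "excellent", "happy", "good"}
NEGATIVE = {"terrible", "hate", "awful", "sad", "bad"}

def sentiment(text):
    """Return 'positive', 'negative' or 'neutral' by counting matched words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this, the service was excellent"))  # two positive hits
```

A real library replaces these hand-picked word sets with a much richer lexicon or a trained model, but the input/output shape is the same.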

Read more
ML, Python

Working with date time features in your ML models

Having a date in our dataset can really improve our predictions. Imagine an ice cream company: their data has a seasonal aspect (more sales in summer), so predicting sales without dates in our data would be inaccurate. The problem, though, is that dates are horrible to work with. The first of June 2020 could be […]
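One common trick for seasonal dates like the ice cream example (not necessarily the post's own method) is cyclical encoding of the month with sine and cosine, so that December ends up numerically close to January:

```python
import math
from datetime import date

def month_features(d):
    """Encode the month cyclically so December (12) sits next to January (1)."""
    angle = 2 * math.pi * (d.month - 1) / 12
    return round(math.sin(angle), 3), round(math.cos(angle), 3)

# The first of June 2020, turned into model-ready numeric features
print(month_features(date(2020, 6, 1)))  # (0.5, -0.866)
```

With a plain `month` integer, a model would see December (12) and January (1) as far apart; the sine/cosine pair removes that artificial jump.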

Read more
ML

So, what is a Random Forest?

A random forest is a grouping of decision trees. Data is run through each tree within the forest, and each tree provides a classification. It’s important that we do this because, while decision trees are great and easy to read, they’re inaccurate. They’re good with their training data, but aren’t flexible enough to accurately predict outcomes for […]
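The voting described above can be sketched in a few lines; the three hard-coded "trees" below stand in for trained decision trees and are purely illustrative:

```python
from statistics import mode

# Three toy "decision trees" -- hard-coded rules standing in for trained trees
def tree_a(x): return "spam" if x["exclamations"] > 2 else "ham"
def tree_b(x): return "spam" if x["links"] > 0 else "ham"
def tree_c(x): return "spam" if x["length"] < 20 else "ham"

def forest_predict(x, trees=(tree_a, tree_b, tree_c)):
    """Run the sample through every tree and take the majority vote."""
    return mode(t(x) for t in trees)

email = {"exclamations": 5, "links": 1, "length": 120}
print(forest_predict(email))  # two of three trees vote "spam"
```

A real forest also trains each tree on a different random sample of rows and features, which is what makes the ensemble more flexible than any single tree.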

Read more
ML

ML Scenario Part 2: Tuning our model hyperparameters

In the last article, we built a simple machine learning model which predicted the quality of wine based on a number of features. In this article, we’re going to tune that model. Our starting point was an accuracy of 0.68 (68%), so we aim to get better than that. The way we make models more […]

Read more
ML

Simple Machine Learning Scenario: predicting quality

In this article, I am going to walk through a really simple machine learning scenario, using data from Kaggle. In future articles, I will look at tuning this model to improve its accuracy and see how different algorithms produce differing results. The purpose of this article is to give an introduction to a machine learning model with […]

Read more
ML

Using GridSearch for Hyperparameter Tuning

Hyperparameter tuning can be quite the task. It can be heavily manual, relying on gut feel as to roughly what value each parameter should take. If we did it in a completely manual way, I imagine that depression rates among data scientists would go through the roof, but luckily, we have GridSearch. […]
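Under the hood, a grid search simply tries every combination of parameter values and keeps the best one. A minimal sketch of that loop, with a hypothetical scoring function standing in for cross-validated model accuracy:

```python
from itertools import product

# Hypothetical scorer: in practice this would be cross-validated accuracy
def score(max_depth, n_estimators):
    return 0.68 + 0.01 * max_depth + 0.001 * n_estimators  # toy stand-in

grid = {"max_depth": [2, 4, 8], "n_estimators": [50, 100]}

# Try every (max_depth, n_estimators) combination, keep the best scorer
best = max(
    product(grid["max_depth"], grid["n_estimators"]),
    key=lambda combo: score(*combo),
)
print(best)  # (8, 100) -- the combination with the highest toy score
```

scikit-learn's `GridSearchCV` wraps exactly this loop, adding cross-validation and refitting on the winning combination.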

Read more
ML

Validating error estimation of our models with Cross Validation

Let’s say we have the dataset below. It’s a list of patients, with some health KPIs and a boolean field which says whether or not they have diabetes. If you had thousands of patients, you would probably find some trends in this data which would allow you to predict if someone had diabetes, based on […]
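The core of cross validation is splitting the data into k folds so every sample is used for testing exactly once. A simplified splitter (sklearn's `KFold` is the usual tool, and it also handles shuffling and uneven folds):

```python
def k_fold_splits(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once.

    Simplified: assumes n_samples divides evenly by k.
    """
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        test = indices[i * fold_size : (i + 1) * fold_size]
        train = [j for j in indices if j not in test]
        yield train, test

# 10 patients, 5 folds: train on 8, test on the held-out 2, five times over
for train, test in k_fold_splits(10, 5):
    print("train:", train, "test:", test)
```

Averaging the model's error across all five held-out folds gives a much more honest error estimate than a single train/test split.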

Read more