Python

Extracting data from the YouTube API

In this article, we’re going to look at extracting data from the YouTube Data API. For this, we need to set up an API key, which you can get via the Google Developer Console. Next, you need to pick a channel to pull data from. In this case, I have chosen the rapper Hopsin’s channel. To […]

Data

What’s all this about Flume + Kafka (Flafka)?

There is a concept called ‘Flafka’ through which we exploit the strengths of both Kafka and Flume to create a more robust ingestion process. By deploying Flafka, we take advantage of the out-of-the-box connectors available in Flume and use them to connect to all of our data sources. The data is then […]

Data

An introduction to Kafka

Kafka is a low-latency messaging system. It takes data from one location, for example application log files, and makes it available to other systems with very little delay (latency). That may be useful in the example below, where Kafka makes our CRM and Google Analytics data available to Spark Streaming, which will carry out some […]

Python

Pulling comment sentiment from Twitter API

In my last article, I spoke about extracting tweet data from Twitter, filtering it and cleaning it up. Today, I want to look at sentiment analysis of comments / replies to tweets using the API. For this, I am going to use Vader Sentiment, which is a library specifically tuned for social media comments. We […]

Python

Extracting and cleaning data from the Twitter API in Python

I’ve heard from a tonne of people that they don’t read the news because it’s negative, and the constant presence of articles about assault, killings and other terrible things brings their mood down. So I started thinking: how could we filter the news to only show us things that we want to see? The way […]

Data

An introduction to Flume

Flume is similar to Kafka in many ways. It takes data from its source and distributes it to its destination. The difference here is that Flume is designed for high-throughput log streaming into Hadoop (HDFS or HBase); it is not designed to ship out to a large number of consumers. Flume, like Kafka, is […]

Python

A simple intro to sentiment analysis in Python

Sentiment analysis can provide key insight into how your customers feel about your company, and hence is becoming an increasingly important part of data analysis. Building a machine learning model to identify positive and negative sentiments is pretty complex, but luckily for us, there is a Python library that can help us out. It’s […]

Spark

A crash course in UDFs in PySpark

Okay, so the first thing I should note is that you should avoid UDFs (User Defined Functions) like the plague unless you absolutely have to; I have spoken about why that is here. Now that I’ve warned you, let’s talk about what a UDF is. A UDF is a User Defined Function; it’s a function […]

ML, Python

Working with date time features in your ML models

Having a date in our dataset can really improve our predictions. Imagine an ice cream company: their data has a seasonal aspect (more sales in summer), so predicting sales without dates in our data would be inaccurate. The problem, though, is that dates are horrible to work with. The first of June 2020 could be […]
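As a rough illustration of the idea, here is a minimal sketch of turning a date string into numeric features a model could use. It is not taken from the full article: the `date_features` helper, the feature names and the two date formats are all assumptions made for this example.

```python
from datetime import datetime


def date_features(text, fmt="%Y-%m-%d"):
    """Parse a date string and split it into simple numeric features.

    Hypothetical helper: the function name and feature names are
    assumptions for illustration only.
    """
    d = datetime.strptime(text, fmt)
    return {
        "year": d.year,
        "month": d.month,               # captures seasonality, e.g. summer sales
        "day": d.day,
        "day_of_week": d.weekday(),     # Monday = 0, Sunday = 6
        "is_weekend": d.weekday() >= 5,
    }


# The same date written in two different formats parses to the same features.
features = date_features("2020-06-01")
features_uk = date_features("01/06/2020", fmt="%d/%m/%Y")
print(features)
```

Telling the parser which format to expect is the important part: "01/06/2020" is the first of June in day-first notation but the sixth of January in month-first notation.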

Python

Statistical methods of handling outliers

Outliers can be a real pain and can skew our datasets. In the example below, I’ve added an obvious outlier on row 3 (company name = Talane). If we were to take an average of InvoiceValue, this outlier would skew the result massively. Extract values within x standard deviations from the mean. Statistically, 99.7% of normally […]
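The "within x standard deviations" filter can be sketched with nothing but the standard library. This is a minimal sketch and not the code from the full article: the `within_std` helper and the invoice values are made up for illustration.

```python
from statistics import mean, stdev


def within_std(values, x=3.0):
    """Keep only the values within x sample standard deviations of the mean.

    Hypothetical helper written for this example.
    """
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) <= x * sigma]


# Made-up invoice values; the last one is an obvious outlier.
invoice_values = [120.0, 135.5, 110.0, 128.0, 142.0, 119.5,
                  131.0, 125.0, 138.0, 116.0, 9500.0]

filtered = within_std(invoice_values, x=2.0)
print(filtered)
```

One thing to watch with small samples: a single extreme value inflates the standard deviation itself, so a threshold of 3 may only just catch it, or miss it entirely. That is why this sketch uses x = 2.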

ML

ML Scenario Part 2: Tuning our model hyperparameters

In the last article, we built a simple machine learning model which predicted the quality of wine based on a number of features. In this article, we’re going to tune that model. Our starting point was an accuracy of 0.68 (68%), so we aim to get better than that. The way we make models more […]
