Python

Generate mock data to test your pipeline

Quite often, we need to test that our pipelines work at scale without having access to production systems. To help solve this, we can generate mock data using the Python library ‘Faker’. Faker is a comprehensive fake data library. It provides data covering customers, addresses, bank details, company names, credit card details, currencies, cryptocurrencies, files, domain […]

Data, ML

The data scientist learning plan for 2021

When you look online at what it takes to become a data scientist, it’s enough to make your brain melt. You have people telling you that you need to be an expert statistician / mathematician; you need to be a top-level coder in 15 different languages; be proficient with every type of SQL/NoSQL database on […]

ML

Can we successfully implement Agile in data science?

Agile is about iterative development and delivering tangible products/features quickly, which provides the business with value and ROI faster than a traditional waterfall project. Consider the example of a piece of accounting software. Overall, it’s going to have 50 features to support the accounts team. To deliver all of the features in a waterfall fashion, […]

ML

The Data Scientist Statistics Learning Plan For 2021

As data scientists, we need to be comfortable with mathematics. If you Google what you need to know, you’ll find answers stating you need to fully understand linear algebra, calculus and how to calculate all of the algorithms we use by hand. I’m not going to downplay the importance of understanding how the algorithm works, […]

Data

An introduction to Awk

Project 1 (Print file): This is a simple use case: we are just going to print the contents of a text file. To do that, we use: awk '{print}' filename.csv Here, the single quotes denote the start of the program and the curly braces show the start of an action within that program. In […]

Python

Pulling comment sentiment from Twitter API

In my last article, I spoke about extracting tweet data from Twitter, filtering it and cleaning it up. Today, I want to look at sentiment analysis of comments / replies to tweets using the API. For this, I am going to use Vader Sentiment, which is a library specifically tuned for social media comments. We […]

Python

Extracting and cleaning data from the Twitter API in Python

I’ve heard from a tonne of people that they don’t read the news because it’s negative and the constant presence of articles about assault, killings and other terrible things brings their mood down. So I started thinking: how could we filter the news to only show us things that we want to see? The way […]

ML

K-Means Clustering Explained

K-Means clustering is an unsupervised machine learning model. Unsupervised learning is where we do not provide the model with the actual outputs from the data. Unsupervised learning aims to model the underlying structure or distribution in the data to learn more about the data. The most popular use-cases of unsupervised learning are association rules – […]
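
To make the idea of learning structure without labels concrete, here is a minimal sketch of K-Means (Lloyd's algorithm) in plain Python. It is an illustration only, not necessarily how the full article implements it: the function name `k_means`, the sample points and the hand-picked starting centroids are all assumptions made for this example.

```python
import math


def k_means(points, centroids, max_iter=100):
    """Cluster 2-D points, starting from the given centroids (hypothetical helper)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(old)

        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Made-up data: two visually obvious groups, with no labels supplied.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.6), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
initial = [(1.0, 1.0), (8.0, 8.0)]  # assumed starting centroids

centroids, clusters = k_means(points, initial)
print(centroids)
print(clusters)
```

Note that the model is never told which group a point belongs to; the grouping emerges purely from the distances between points. In practice the starting centroids are usually chosen at random, so results can vary between runs.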

Python

A simple intro to sentiment analysis in Python

Sentiment analysis can provide key insight into the feelings of your customers towards your company & hence is becoming an increasingly important part of data analysis. Building a machine learning model to identify positive and negative sentiments is pretty complex, but luckily for us, there is a Python library that can help us out. It’s […]

ML, Python

Working with date time features in your ML models

Having a date in our dataset can really improve our predictions. Imagine an ice cream company: their data has a seasonal aspect (more sales in summer), so predicting sales information without dates in our data would be inaccurate. The problem, though, is that dates are horrible to work with. The first of June 2020 could be […]
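
The excerpt is cut off before it shows how the dates are handled, so the following is only a sketch of one common approach: splitting a date into simple numeric features that a model can consume. The function name `date_features` and the choice of features are assumptions for illustration.

```python
from datetime import date


def date_features(d):
    """Break a date into numeric features (hypothetical helper)."""
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "day_of_week": d.weekday(),          # Monday = 0, Sunday = 6
        "quarter": (d.month - 1) // 3 + 1,   # 1 to 4
        "is_weekend": d.weekday() >= 5,
    }


# The first of June 2020, the date mentioned in the text.
features = date_features(date(2020, 6, 1))
print(features)
```

Features such as `month` or `quarter` give a model a way to pick up the seasonal pattern described above, which a raw date string does not.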

Python

Statistical methods of handling outliers

Outliers can be a real pain and can skew our datasets. In the example below, I’ve added an obvious outlier on row 3 (company name = Talane). If we were to take an average of InvoiceValue, this outlier would skew the result massively. Extract values within x standard deviations from the mean. Statistically, 99.7% of normally […]
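
As a rough sketch of the standard deviation approach, the snippet below keeps only the values within x standard deviations of the mean. The invoice values are made up for illustration (the article's dataset is not shown here), and `within_std` is a hypothetical helper name. The threshold is set to 2 rather than 3 because, in a sample this small, a single extreme value inflates the standard deviation so much that it can never sit 3 deviations from the mean.

```python
from statistics import mean, stdev


def within_std(values, x=2.0):
    """Keep values within x sample standard deviations of the mean (hypothetical helper)."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) <= x * sigma]


# Made-up InvoiceValue column with one obvious outlier (5000).
invoice_values = [120, 135, 128, 5000, 142, 131, 125, 138]

filtered = within_std(invoice_values, x=2.0)
print(filtered)
print(mean(invoice_values), mean(filtered))
```

Comparing the two means shows how strongly one outlier can skew an average of the whole column.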
