When you look online at what it takes to become a data scientist, it’s enough to make your brain melt. You have people telling you that you need to be an expert statistician / mathematician; you need to be a top-level coder in 15 different languages; be proficient with every type of SQL/NoSQL database on […]

## Can we successfully implement Agile in data science?

Agile is about iterative development and delivering tangible products/features quickly, which provides the business with value and ROI faster than a traditional waterfall project. Consider the example of a piece of accounting software. Overall, it’s going to have 50 features to support the accounts team. To deliver all of the features in a waterfall fashion, […]

## The Data Scientist Statistics Learning Plan For 2021

As data scientists, we need to be comfortable with mathematics. If you Google what you need to know, you’ll find answers stating you need to fully understand linear algebra, calculus, and how to calculate all of the algorithms we use by hand. I’m not going to downplay the importance of understanding how the algorithm works, […]

## A Guide To Basic Linear Algebra Notation For Machine Learning

Often, you’ll be looking around on the web for an answer to a question you have about an algorithm and you are presented with a formula-heavy answer on a forum. If you don’t know the notation, this is going to give you a headache. So this article aims to cover much of the common […]

## The Ultimate Guide To Linear Regression For Aspiring Data Scientists

Regression is about finding the relationship between some input variable(s) and an outcome. Let’s think about a simple example of height and weight. We need to understand the relationship between the two – intuition can tell us that as height increases, so does weight. The idea of regression is to create a mathematical formula which […]
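To make the height and weight example concrete, here is a minimal sketch of fitting a straight line by ordinary least squares in plain Python. The numbers are made up for illustration and `fit_line` is a hypothetical helper name, not something taken from the article.

```python
# Hypothetical height (cm) and weight (kg) pairs, invented for illustration only.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
weights = [52.0, 58.0, 66.0, 75.0, 84.0]


def fit_line(xs, ys):
    """Ordinary least squares with one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-style and variance-style sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(heights, weights)

# Use the fitted formula to estimate the weight of someone 175 cm tall.
predicted = slope * 175.0 + intercept
print(f"weight = {slope:.2f} * height + {intercept:.2f}")
print(f"predicted weight at 175 cm: {predicted:.2f} kg")
```

A positive slope matches the intuition that weight increases with height; the intercept has no physical meaning here and only positions the line.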

## How Do Data Scientists Deal With Outliers In Timeseries Analysis To Reveal Trends and Patterns?

Welcome back to the series on timeseries analysis. In this article, we’re going to discuss plotting timeseries data and smoothing the data to handle outliers and make finding a trend a little bit easier. Below, I have ingested the timeseries data into a dataframe called df. From that dataframe, I have then set […]
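As a rough illustration of smoothing to tame outliers, the sketch below applies a centred rolling median to a plain Python list. This is an assumption about the kind of smoothing meant: the excerpt does not say which method the article uses, and it works on a pandas dataframe `df` rather than a list. The `rolling_median` helper and the sample values are hypothetical.

```python
from statistics import median


def rolling_median(values, window=3):
    """Centred rolling median; the window shrinks at the edges of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - half)
        end = min(len(values), i + half + 1)
        smoothed.append(median(values[start:end]))
    return smoothed


# Made-up series with a steady upward trend and one obvious outlier (95).
series = [10, 11, 12, 95, 13, 14, 15]
smoothed = rolling_median(series, window=3)
print(smoothed)
```

A median is less sensitive to a single extreme value than a mean, so the spike disappears while the upward trend remains visible. With pandas, a similar result should be available through `Series.rolling(window=3, center=True).median()`.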

## The Basics of Timeseries Analysis – Stationarity and Autocorrelation For Data Scientists

The purpose of time series analysis is to analyse timeseries data and extrapolate the patterns you identify into the future, enabling us to make predictions. It’s important to understand when we can use timeseries analysis. If the timeseries has a clear pattern, trend and seasonality then you can accurately model a forecast. If a dataset […]
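One way to check whether a series has a repeating pattern is to measure its autocorrelation, the correlation between the series and a lagged copy of itself. The sketch below is a minimal plain-Python version using the standard sample autocorrelation formula; the `autocorrelation` helper and the toy seasonal series are hypothetical, not taken from the article.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(values)
    mean = sum(values) / n
    # Total variation of the series around its mean.
    denom = sum((v - mean) ** 2 for v in values)
    # Co-variation between each point and the point `lag` steps later.
    num = sum(
        (values[t] - mean) * (values[t + lag] - mean) for t in range(n - lag)
    )
    return num / denom


# Made-up series that repeats every 4 steps (a simple "seasonality").
series = [1, 2, 3, 2] * 3

for lag in (1, 2, 4):
    print(lag, round(autocorrelation(series, lag), 3))
```

For this toy series, the autocorrelation is strongly positive at lag 4 (the seasonal period) and negative at lag 2 (half a period), which is the kind of signature that suggests a forecastable pattern.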

## Pattern Matching In Large Bodies Of Text Using Spacy’s NLP Capabilities For Data Scientists

Pattern matching can be very useful in large bodies of text. Yes, you could, for the most part, use a regular expression to achieve the same thing, but I tend to find the Spacy functionality a lot more straightforward – after all, nobody enjoys the syntax of a regular expression. Further to this, Spacy makes […]

## How Do Data Scientists Use NLP Topic Modeling To Understand Key Document Topics?

In this article, we are going to look into an unsupervised machine learning model which aims to cluster documents into topics – the algorithm is called Latent Dirichlet Allocation (LDA). Essentially, we can assume that documents which are written about similar things use a similar group of words. LDA tries to map all the documents […]

## Using A Support Vector Machine Learning Model (SVM) To Classify Emails As Spam

Text classification is all about classifying a body of text without actually reading it. It’s a supervised machine learning task which uses term frequency (how often words occur) to classify the document. The classic example is to determine whether an email is spam or not. To do this, we can ingest a dataframe which includes […]

## The Six Basic Building Blocks of NLP (Natural Language Processing) For Data Scientists

Recently, I have been presented with some problems which require Natural Language Processing to solve. Natural Language Processing (NLP) is a field of Artificial Intelligence which enables a machine to interpret human text for the purpose of analysis – for example, understanding the main topics discussed in a document or how positive customer reviews are […]

## Top Five Ways To Avoid Data Leakage As A Data Scientist

Data leakage is where we accidentally share data between the test and training sets, so that our predictions look more accurate than they will ever be in the real world. If you’ve run your model against your test dataset and have a 99% accuracy, you probably have data leakage. Maybe you don’t, but you probably do. […]
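A common source of leakage is computing preprocessing statistics (for example the mean and standard deviation used for scaling) on the full dataset before splitting it. The sketch below shows one widely used safeguard: split first, then derive the statistics from the training set only. This is an assumption about one typical fix, not necessarily one of the article's five, and the data and the `scale` helper are hypothetical.

```python
import random
from statistics import mean, pstdev

random.seed(0)

# Made-up feature values, invented purely for illustration.
values = [random.gauss(50, 10) for _ in range(100)]

# Split FIRST, so nothing about the test set can influence preprocessing.
train, test = values[:80], values[80:]

# Scaling statistics come from the training set only.
train_mean = mean(train)
train_std = pstdev(train)


def scale(xs):
    """Standardise values using the training-set mean and standard deviation."""
    return [(x - train_mean) / train_std for x in xs]


train_scaled = scale(train)
test_scaled = scale(test)  # the test set reuses the training statistics

print(f"train mean after scaling: {mean(train_scaled):.3f}")
print(f"test mean after scaling:  {mean(test_scaled):.3f}")
```

The scaled training data is centred on zero by construction, while the test data generally is not. That small mismatch is expected and reflects what the model will face on genuinely unseen data.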
