ML

NLP Series: Pattern Matching With spaCy

Pattern matching can be very useful when working with large bodies of text. Yes, you could, for the most part, use a regular expression to achieve the same thing, but I tend to find the spaCy functionality a lot more straightforward – after all, nobody enjoys the syntax of a regular expression. Further to this, spaCy makes […]

ML

NLP Series: Using Topic Modeling To Understand Key Document Topics

In this article, we are going to look into an unsupervised machine learning model which aims to cluster documents into topics – the algorithm is called Latent Dirichlet Allocation (LDA). Essentially, we can assume that documents written about similar things use a similar group of words. LDA tries to map all the documents […]

ML

NLP Series: Is It Spam? Text Classification

Text classification is all about classifying a body of text without actually reading it. It’s a supervised machine learning problem; here we use term frequency (how often words occur) to classify the document. The classic example is to determine whether an email is spam or not. To do this, we can ingest a dataframe which includes […]

ML

NLP Series: The Basic Building Blocks of NLP

Recently, I have been presented with some problems which require Natural Language Processing to solve. Natural Language Processing (NLP) is a field of Artificial Intelligence which enables a machine to interpret human text for the purpose of analysis – for example, understanding the main topics discussed in a document or how positive customer reviews are […]

ML

ML Series: Top Ways To Avoid Data Leakage

Data leakage is where we accidentally share data between the test and training sets, so that our predictions appear more accurate than they will ever be in the real world. If you’ve run your model against your test dataset and have 99% accuracy, you probably have data leakage. Maybe you don’t, but you probably do. […]

ML

ML Series: How Likely Is Y, Given X. Association Rule Mining

Association rule mining is an unsupervised learning algorithm which finds patterns in our data, telling us how likely it is that Y occurs given that X has occurred. In other words, it finds features (or dimensions) which appear together frequently. Note: just because someone buys burgers with ketchup does not mean someone who buys […]

ML

ML Series: A Random Forest Implementation With An Imbalanced Dataset

For this article, I wanted to demonstrate an end-to-end model implementation, including tuning for an imbalanced dataset. First off, let’s ingest our data: Now, we can start to explore our dataset. First off, are there any null values that we need to worry about? There are no nulls! A rare pleasure. Let’s now […]

ML

ML Series: Methods For Increasing Model Accuracy

Tuning our machine learning model is, of course, critical to its accuracy. There are two methods for tuning models: tuning the data and tuning the parameters. We ‘tune’ the data to clean it up, select the right features and make sure we don’t have a class imbalance – because after all, if you pump garbage into […]

ML

ML Series: Accuracy Metrics To Prove Or Improve Our Diabetes Algorithm

In the last article, we started to look at the implemented algorithm and looked at the model score along with the confusion matrix. In this article, I am going to give a high-level overview of some of the methods we could use to evaluate our model performance. Classification Accuracy: Classification accuracy is the […]
