Python

Quick Tip: Image Analysis With Python

Let’s say we’re working for a retailer. They have two distribution centres in Australia – the geographic area covered by each distribution centre is coloured in orange. Our task is to work out, from this image, how many square kilometres of Australia we can deliver to, and what percentage of the country that is. Right, […]
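As a rough sketch of the idea (not the article's own code), the covered share can be estimated by counting orange pixels relative to all land pixels. The tiny hand-made pixel grid, the colour values, the `is_orange` tolerance and the land area figure are all assumptions made for illustration.

```python
# Hypothetical 3x4 "image" as rows of (R, G, B) tuples; a real image would
# be loaded with an imaging library instead.
WHITE = (255, 255, 255)   # assumed background (sea) colour
GREY = (128, 128, 128)    # assumed colour of land we cannot deliver to
ORANGE = (255, 165, 0)    # assumed colour of the delivery areas

image = [
    [WHITE, GREY, GREY, WHITE],
    [GREY, ORANGE, ORANGE, GREY],
    [WHITE, ORANGE, GREY, WHITE],
]

def is_orange(pixel, tolerance=40):
    """Return True when every channel is within `tolerance` of ORANGE."""
    return all(abs(c - t) <= tolerance for c, t in zip(pixel, ORANGE))

def coverage(image, land_area_km2):
    """Return (covered km^2, covered % of land) from pixel counts."""
    pixels = [p for row in image for p in row]
    land = [p for p in pixels if p != WHITE]       # drop the background
    covered = [p for p in land if is_orange(p)]
    share = len(covered) / len(land)
    return share * land_area_km2, share * 100

# Assumed land area of Australia, roughly 7.69 million square kilometres.
AUSTRALIA_KM2 = 7_690_000
covered_km2, covered_pct = coverage(image, AUSTRALIA_KM2)
print(f"{covered_km2:,.0f} km^2 ({covered_pct:.1f}% of the country)")
```

Pixel counting like this only translates to area if the map uses an equal-area projection, so treat the result as an estimate.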

Read more
ML

NLP Series: Pattern Matching With spaCy

Pattern matching can be very useful in large bodies of text. Yes, you could, for the most part, use a regular expression to achieve the same thing, but I find the spaCy functionality a lot more straightforward – after all, nobody enjoys the syntax of a regular expression. Further to this, spaCy makes […]

Read more
ML

NLP Series: Using Topic Modeling To Understand Key Document Topics

In this article, we are going to look into an unsupervised machine learning model which aims to cluster documents into topics – the algorithm is called Latent Dirichlet Allocation (LDA). Essentially, we can assume that documents written about similar things use a similar group of words. LDA tries to map all the documents […]

Read more
ML

NLP Series: Is It Spam? Text Classification

Text classification is all about classifying a body of text without actually reading it. It’s a supervised machine learning task; the approach here uses term frequency (how often words occur) to classify the document. The classic example is to determine whether an email is spam or not. To do this, we can ingest a dataframe which includes […]
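To make the term-frequency idea concrete, here is a minimal sketch using only the standard library. It counts words per label and scores a new message with add-one smoothing (a simple naive Bayes classifier, which is my choice of technique and not necessarily the one the article uses). The training messages and the `train` and `classify` helpers are made up for illustration.

```python
import math
from collections import Counter

# Made-up training data: (text, label) pairs.
training = [
    ("win a free prize now", "spam"),
    ("free cash win now", "spam"),
    ("are we still meeting for lunch", "ham"),
    ("see you at the meeting tomorrow", "ham"),
]

def train(rows):
    """Count how often each word occurs per label (the term frequencies)."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in rows:
        word_counts[label].update(text.lower().split())
        label_counts[label] += 1
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the highest smoothed log-probability."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        # Prior: share of training messages carrying this label.
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

word_counts, label_counts = train(training)
print(classify("win a free lunch", word_counts, label_counts))
print(classify("meeting tomorrow", word_counts, label_counts))
```

With real data the labelled messages would come from the dataframe, and a library implementation would normally replace these hand-rolled helpers.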

Read more
ML

NLP Series: The Basic Building Blocks of NLP

Recently, I have been presented with some problems which require Natural Language Processing to solve. Natural Language Processing (NLP) is a field of Artificial Intelligence which enables a machine to interpret human text for the purpose of analysis – for example, understanding the main topics discussed in a document or how positive customer reviews are […]

Read more
Python

Bamboolib: The Most Flexible Pandas GUI?

As data engineers and data scientists, we spend a lot of time exploring data. When you’re working with huge datasets, you may find you need to utilise Apache Spark or similar to conduct this exploratory analysis, but for the majority of use cases, Pandas is the de facto tool we choose. The Pandas library has been so […]

Read more
Python

Another Pandas GUI: Pandas Profiling

Another GUI for Pandas! YES – they’re coming out of the woodwork now! But this one is a little bit different to the Pandas GUI library I discussed previously. Pandas GUI let us slice and dice our data, restructure it and visualize it. Pandas Profiling isn’t as feature-rich but provides a different way of looking […]

Read more
Python

A GUI For Pandas! Is This A Game Changer?

Pandas is the de facto library for data analysis in Python for good reason. It’s highly flexible, relatively performant and easy to learn. That said, quite a lot of what we, as data scientists and engineers, do is trial and error and exploration. This exploratory process can sometimes be a bit tedious. We create a dataframe; […]

Read more
ML

ML Series: Top Ways To Avoid Data Leakage

Data leakage is where we accidentally share data between the test and training sets, so that our predictions look more accurate than they will ever be in the real world. If you’ve run your model against your test dataset and have 99% accuracy, you probably have data leakage. Maybe you don’t, but you probably do. […]
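One common way this happens is fitting a preprocessing step on all rows before splitting. The sketch below, which uses made-up values and a hypothetical `scale` helper, splits first and then derives the scaling statistics from the training rows only. It illustrates one form of leakage under those assumptions and is not a complete checklist.

```python
import random
import statistics

random.seed(0)  # fixed seed so the split is repeatable
values = [float(v) for v in range(1, 21)]  # made-up feature values
random.shuffle(values)

# Split first, so the test rows play no part in any fitted statistic.
train, test = values[:15], values[15:]

# Fit the scaling parameters on the training rows only.
mean = statistics.mean(train)
stdev = statistics.pstdev(train)

def scale(rows):
    """Standardise rows using the training-set mean and deviation."""
    return [(v - mean) / stdev for v in rows]

train_scaled = scale(train)
test_scaled = scale(test)  # reuses the training statistics; nothing is refit

# Leaky alternative, shown only for contrast: statistics from ALL rows.
leaky_mean = statistics.mean(values)
print(f"train-only mean: {mean:.2f}, leaky mean: {leaky_mean:.2f}")
```

The same rule applies to any fitted step, such as imputation, encoding or feature selection: fit on the training set, then apply to the test set.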

Read more
ML

ML Series: How Likely Is Y, Given X. Association Rule Mining

Association rule mining is an unsupervised learning technique which finds patterns in our data, telling us how likely it is that Y occurs given that X has occurred. In other words, it finds features (or dimensions) which appear together frequently. Note: just because someone buys burgers with ketchup does not mean someone who buys […]
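"How likely is Y, given X" is usually measured as the confidence of a rule: the support of X and Y together divided by the support of X. A minimal sketch follows, using made-up baskets and hypothetical `support` and `confidence` helpers. It is not a full rule-mining algorithm such as Apriori.

```python
# Made-up shopping baskets; each set is one transaction.
baskets = [
    {"burger", "ketchup"},
    {"burger", "ketchup"},
    {"ketchup", "chips"},
    {"ketchup"},
    {"cola"},
]

def support(items):
    """Share of baskets that contain every item in `items`."""
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(x, y):
    """Estimated likelihood of Y given X: support(X and Y) / support(X)."""
    return support(x | y) / support(x)

burger, ketchup = {"burger"}, {"ketchup"}
print("burger -> ketchup:", confidence(burger, ketchup))
print("ketchup -> burger:", confidence(ketchup, burger))
```

In this toy data the two directions give different values, which shows that confidence is directional: a rule X -> Y says nothing by itself about Y -> X.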

Read more