A guide to windowing functions in Hive

Windowing functions in Hive are super useful. They make analysis that would otherwise be challenging much easier. Let’s look at examples of the key windowing functions in Hive. We will use the dataset below for our analysis; this is the ‘payments’ table.

customerid  department  payment_amount
1           1           10000
2           1           9000
3           2           12000
4           […]


The Hive SQL Crash Course For Data Analysts

SQL is one of the most in-demand data skills. The language has been adopted by many database platforms, including Apache Hive. This article will serve as a crash course in the key functionality of HiveQL. Throughout this article, we will use two sample tables as the basis for our code. THE CUSTOMERS TABLE […]

What skills do you actually need to become a data scientist?

The data scientist learning plan for 2021

When you look online at what it takes to become a data scientist, it’s enough to make your brain melt. You have people telling you that you need to be an expert statistician / mathematician; you need to be a top-level coder in 15 different languages; be proficient with every type of SQL/NoSQL database on […]

Can we successfully implement Agile in data science?

Agile is about iterative development and delivering tangible products/features quickly, which provides the business with value and ROI faster than a traditional waterfall project. Consider the example of a piece of accounting software. Overall, it’s going to have 50 features to support the accounts team. To deliver all of the features in a waterfall fashion, […]

The Data Scientist Statistics Learning Plan For 2021

As data scientists, we need to be comfortable with mathematics. If you Google what you need to know, you’ll find answers stating you need to fully understand linear algebra, calculus and how to calculate all of the algorithms we use by hand. I’m not going to downplay the importance of understanding how the algorithm works, […]

A Guide To Basic Linear Algebra Notation For Machine Learning

Often, you’ll be looking around on the web for an answer to a question you have about an algorithm and you are presented with a formula-heavy answer on a forum. If you don’t know the notation, this is going to give you a headache. So this article aims to cover much of the common […]

The Ultimate Guide To Linear Regression For Aspiring Data Scientists

Regression is about finding the relationship between some input variable(s) and an outcome. Let’s think about a simple example of height and weight. We need to understand the relationship between the two – intuition can tell us that as height increases, so does weight. The idea of regression is to create a mathematical formula which […]

Using A Support Vector Machine Learning Model (SVM) To Classify Emails As Spam

Text classification is all about classifying a body of text without actually reading it. It’s a supervised machine learning approach which uses term frequency (how often words occur) to classify the document. The classic example is to determine whether an email is spam or not. To do this, we can ingest a dataframe which includes […]

The Six Basic Building Blocks of NLP (Natural Language Processing) For Data Scientists

Recently, I have been presented with some problems which require Natural Language Processing to solve. Natural Language Processing (NLP) is a field of Artificial Intelligence which enables a machine to interpret human text for the purpose of analysis – for example, understanding the main topics discussed in a document or how positive customer reviews are […]


Bamboolib: The Most Flexible Pandas GUI?

As data engineers and data scientists, we spend a lot of time exploring data. When you’re working with huge datasets, you may find you need to utilise Apache Spark or similar to conduct this exploratory analysis, but for the majority of use cases, Pandas is the de facto tool we choose. The Pandas library has been so […]




About Me

Hi, I’m Lillie. Previously a magazine editor, I became a full-time mother and freelance writer in 2017. I spend most of my time with my kids and husband over at The Brown Bear Family but this blog is for my love of food and sharing my favorites with you!
