ML

Ensemble modelling to improve your model performance

In my last article, I spoke about auto-sklearn. I said that the library would train several models and then use them in conjunction with one another to make a final prediction. This is what we call ensemble modelling (or meta-algorithms). It’s the process of incorporating multiple models into the prediction process with the goal of […]

Read more
ML

Getting started with Sci-Kit Learn AutoML

An AutoML workflow should be able to preprocess data, select the right model to use, tune the hyperparameters and provide us with the best possible model as a result. One such AutoML library is auto-sklearn. This library automatically finds the right algorithm for the dataset you have provided and automatically tunes the hyperparameters […]

Read more
ML

Using Shapley Values to explain your ML models

You work for a fitness centre. Let’s say you’ve recently deployed a machine learning model to predict whether a customer will churn at the end of their current contract. Your input features to the model are: average times visited per week over the last month; average times visited per week over the last 6 months […]

Read more
Python

Speeding up Pandas apply functions using Swifter

Pandas is an excellent library for data analytics. However, when you get to work with really huge datasets, it just can’t hack it – the Pandas apply function runs on a single core, which constrains your computational efficiency. You’d usually be playing around with multiprocessing & Dask to try and optimise execution time – which, […]

Read more
ML

Using PySpark & SKLearn to deploy a machine learning model

Recently, I’ve been working to deploy a new machine learning model into a production environment. This is the first time I’ve had to deploy a model that runs across such huge datasets. The requirement is to make 30,000,000 predictions each time the model runs. The pipeline has three distinct phases. The first, […]

Read more
Python

Generate mock data to test your pipeline

Quite often, we need to test that our pipelines work at scale without having access to production systems. To help solve this, we can generate mock data using the Python library ‘Faker’. Faker is a comprehensive fake data library. It has data covering: customers, addresses, bank details; company names; credit card details; currencies; cryptocurrencies; files; domain […]

Read more
Blockchain

How does Bitcoin work?

Bitcoin is a cryptocurrency built on blockchain technology, which I discussed here. It’s completely digital and has no central bank controlling it. Based on the previous article, we now understand what blockchain is & how it works. But it’s probably not clear how that technology relates to a cryptocurrency. The components of a block We […]

Read more
Blockchain

An introduction to blockchain – hashes, pointers & blocks

Right now, cloud infrastructure and big data are two of the most in-demand technology areas. I think that blockchain will be next. Let’s look at how blockchain works & some potential use-cases. Before we get into what blockchain is exactly, we need to understand a few terms. First, a hash. A hash function converts […]

Read more
Python

An introduction to digital signatures & asymmetric encryption in Python

This article will give you an introduction to asymmetric encryption using RSA. Asymmetric encryption uses two keys: a public key and a private key. The public key can be provided to anyone; the private key should be kept secret. Asymmetric encryption is used heavily online. It’s used to send sensitive information from […]

Read more