Daily photo blog, reviews, photography tips, gear buying guides, and more.
Established in June of 2010, Edison is a blog focusing on the world of photography. We partner with some of the best vendors in the photography space to provide you with informative gear, tools, and other products you might need to succeed. If you like our content, don’t forget to subscribe at the bottom of the page.
In my last article, I spoke about auto-sklearn. I said that, the library would train several models and then use them in conjunction with one another to make a final prediction. This is what we call ensemble modelling (or meta algorithms). It’s the process of including multiple models into the prediction process with the goal of […]Read more
An automl workflow should be able to preprocess data; select the right model to use; tune the hyper parameters and provide us with the best possible model as a result. One such automl library is auto-sklearn. This library automatically finds the right algorithm for the dataset you have provided and automatically tunes the hyper parameters […]Read more
You work for a fitness centre. Let’s say you’ve recently deployed a machine learning model to predict whether a customer will churn at the end of their current contract. Your input features to the model are: Average times visited per week over the last month Average times visited per week over the last 6 months […]Read more
Pandas is an excellent library for data analytics. However, when you get to work with really huge datasets, it just can’t hack it – the Pandas apply function runs on a single core, which constrains your computational efficiency. You’d usually be playing around with multiprocessing & Dask to try and optimise execution time – which, […]Read more
Recently, I’ve been working to deploy a new machine learning model into a production environment. This is the first time I’ve had to deploy a model that runs across such huge datasets. The requirement is to make 30,000,000 predictions each time the model runs. In terms of the pipeline, it’s three distinct phases. The first, […]Read more
Quite often, we need to test our pipelines work at scale without having access to production systems. To help solve this, we can generate mock data using the Python library ‘Faker’. Faker is a comprehensive fake data library. They have data surrounding: customers, addresses, bank details; company names; credit card details; currencies; cryptocurencies; files; domain […]Read more
Bitcoin is a cryptocurrency using the blockchain technology, which I discussed here. It’s completely digital and has no central bank controlling it. Based on the previous article, we now understand what blockchain is & how it works. But it’s probably not clear how that technology relates to a cryptocurrency. The components of a block We […]Read more
Right now, Cloud infrastructure and big data are two of the most in demand technology areas. I think that blockchain will be next. Let’s look at how blockchain works & some potential use-cases. Before we get into what blockchain is exactly, we need to understand a few terms. First, a hash. A hash function converts […]Read more
This article will give you an introduction into asymmetric encryption using RSA. Asymmetric encryption uses two keys: a public key and a private key. The public key, can be provided to anyone, the private key should be kept for your own records. Asymmetric encryption is used heavily online. It’s used to send sensitive information from […]Read more
Hive sits on top of HDFS (the Hadoop Distributed File System). It reads the files that reside onto HDFS into a specified schema (column names & types), which we can then query and interact with. One such way we may interact with Hive is using Apache Spark. When we write to HDFS, we often end […]Read more
In this article, we are going to go through some of the key steps to implementing a random forest machine learning model. The data we will be using is from Kaggle – it’s a dataset describing whether patients have diabetes or not. Luckily, this dataset is nice and clean, so we don’t need to worry […]Read more
As data scientists, it is important that we have a method of sharing the insight from our models. In this post, I am going to show you how to create a super simple API, whereby the customer can pass URL parameters to extract data, generated by a Python function. Before we get started, make sure […]Read more
Hive, in my experience, is a platform which can have extremely variable performance, which can make it difficult to predict how long your jobs are going to take to run. Below are a few key optimisation techniques we can use to help make our lives a bit better! Choose the right file type First of […]Read more
This is a super short & quick article about keeping your queries as readable and performant as possible by using CTEs. When we’re working with a number of different datasets it is really very temptying to use subqueries. However, when your queries start to get very large, this can become difficult to manage with a […]Read more
Working with dates is one of those tedious things we frequently come across as data engineers. The frustration is that there are simply tonnes of date formats. Let’s list a few: Format Example MM/dd/yy 11/01/21 dd/MM/yy 01/11/21 yy/MM/dd 21/11/01 d/MM/yy 1/11/21 (no leading zeros) MMddyy 110121 ddMMyy 011121 yyyyMMdd 20211101 yyyy-MM-dd HH:mm:ss.SSS 01-11-2021 10:45:12.084 yyyy-MM-dd […]Read more
Performance is a major concern when you’re working in a distributed environment with a massive amount of data. I’ve discussed Spark performance in quite a lot of detail before here and here. Today I am going to talk specifically about percentiles. Because, calculating percentiles over large distributed datasets is a mammoth task. You will likely […]Read more
Timeseries Decomposition is a mathematical procedure which allows us to transform our single timeseries into multiple series. These help us to extract seasonality information and trend easily. Doing this in Python is quite a simple task, I have outlined it below. However, before we get into that, we need to understand the difference between additive […]Read more
Timeseries forecasting is quite a big topic to cover. I’ve spoken about key terminology and exponential smoothing in this article and I’ve spoken about how we might remove timeseries outliers here. In this post, I am going to discuss the different components of the ARIMA model (AR and MA), in addition to the ARIMA model […]Read more
Something went wrong. Please refresh the page and/or try again.
Get new content delivered directly to your inbox.