Daily photo blog, reviews, photography tips, gear buying guides, and more.

Established in June of 2010, Edison is a blog focusing on the world of photography. We partner with some of the best vendors in the photography space to provide you with informative gear, tools, and other products you might need to succeed. If you like our content, don’t forget to subscribe at the bottom of the page.

Latest Posts

How does Bitcoin work?

Bitcoin is a cryptocurrency built on blockchain technology, which I discussed here. It’s completely digital and has no central bank controlling it. Based on the previous article, we now understand what blockchain is & how it works. But it’s probably not clear how that technology relates to a cryptocurrency.

The components of a block

We […]

Read more

An introduction to blockchain – hashes, pointers & blocks

Right now, cloud infrastructure and big data are two of the most in-demand technology areas. I think that blockchain will be next. Let’s look at how blockchain works & some potential use-cases. Before we get into what blockchain is exactly, we need to understand a few terms. First, a hash. A hash function converts […]
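
To make the terms in the title concrete, here is a minimal sketch of a hash and of blocks linked by hash pointers. It assumes SHA-256 as the hash function, and the names `sha256_hex`, `make_block` and `chain_is_valid` are hypothetical helpers invented for this illustration, not code from the article.

```python
import hashlib
import json


def sha256_hex(data: str) -> str:
    """Return the SHA-256 digest of a string as 64 hex characters."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def make_block(payload: str, prev_hash: str) -> dict:
    """Build a block whose hash covers its payload and the previous block's hash."""
    body = {"payload": payload, "prev_hash": prev_hash}
    return {**body, "hash": sha256_hex(json.dumps(body, sort_keys=True))}


def chain_is_valid(chain: list) -> bool:
    """Recompute every hash and check that each block points at its predecessor."""
    for i, block in enumerate(chain):
        body = {"payload": block["payload"], "prev_hash": block["prev_hash"]}
        if block["hash"] != sha256_hex(json.dumps(body, sort_keys=True)):
            return False  # block contents no longer match the stored hash
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False  # hash pointer to the previous block is broken
    return True


# The first block has no predecessor, so it points at a placeholder hash.
genesis = make_block("genesis", prev_hash="0" * 64)
second = make_block("alice pays bob 5", prev_hash=genesis["hash"])
chain = [genesis, second]
```

Because each block stores the hash of the one before it, changing an earlier payload without recomputing every later hash makes `chain_is_valid` return `False`.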

Read more

An introduction to digital signatures & asymmetric encryption in Python

This article will give you an introduction to asymmetric encryption using RSA. Asymmetric encryption uses two keys: a public key and a private key. The public key can be given to anyone; the private key should be kept to yourself. Asymmetric encryption is used heavily online. It’s used to send sensitive information from […]
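
As a rough illustration of the two-key idea, the sketch below runs textbook RSA on deliberately tiny primes using only built-in Python. The primes, the exponent and the helper name `apply_key` are assumptions chosen for readability; real code would use a vetted cryptography library with padding and much larger keys.

```python
# Toy RSA with tiny primes: illustration only, far too small to be secure.
p, q = 61, 53
n = p * q                  # modulus shared by both keys
phi = (p - 1) * (q - 1)    # Euler's totient of n
e = 17                     # public exponent, chosen coprime with phi
d = pow(e, -1, phi)        # private exponent: inverse of e mod phi (Python 3.8+)

public_key = (e, n)    # safe to hand to anyone
private_key = (d, n)   # must stay secret


def apply_key(value: int, key: tuple) -> int:
    """Raise value to the key's exponent modulo n (value must be below n)."""
    exponent, modulus = key
    return pow(value, exponent, modulus)


message = 65
ciphertext = apply_key(message, public_key)     # anyone can encrypt
recovered = apply_key(ciphertext, private_key)  # only the key owner can decrypt

signature = apply_key(message, private_key)     # only the key owner can sign
verified = apply_key(signature, public_key) == message  # anyone can verify
```

Encryption and signing are the same modular exponentiation with the keys swapped, which is why one helper covers both directions here.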

Read more

Managing small file issues when writing to Hive with Spark SQL

Hive sits on top of HDFS (the Hadoop Distributed File System). It reads the files that reside on HDFS into a specified schema (column names & types), which we can then query and interact with. One way we may interact with Hive is through Apache Spark. When we write to HDFS, we often end […]

Read more

An end to end Random Forest Classifier implementation guide

In this article, we are going to go through some of the key steps to implementing a random forest machine learning model. The data we will be using is from Kaggle – it’s a dataset describing whether patients have diabetes or not. Luckily, this dataset is nice and clean, so we don’t need to worry […]

Read more

Expose your ML model via a simple API in Python

As data scientists, it is important that we have a method of sharing the insight from our models. In this post, I am going to show you how to create a super simple API, whereby the customer can pass URL parameters to extract data, generated by a Python function. Before we get started, make sure […]

Read more

Parameters to make your Hive queries perform better

Hive, in my experience, is a platform whose performance can be extremely variable, which makes it difficult to predict how long your jobs are going to take to run. Below are a few key optimisation techniques we can use to help make our lives a bit better!

Choose the right file type

First of […]

Read more

Keeping your Hive queries clean with CTEs

This is a super short & quick article about keeping your queries as readable and performant as possible by using CTEs. When we’re working with a number of different datasets, it is very tempting to use subqueries. However, when your queries start to get very large, this can become difficult to manage with a […]
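
To show the readability difference, here is a small sketch that runs the same aggregation once as a nested subquery and once as a CTE. It uses Python's built-in SQLite as a stand-in for Hive, and the `orders` table and its rows are made up for the example; both engines accept the `WITH ... AS (...)` form used here.

```python
import sqlite3

# In-memory SQLite database standing in for a Hive table (illustrative data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('ann', 10), ('ann', 30), ('bob', 5), ('cat', 50);
""")

# Nested subquery: the aggregation is buried inside the FROM clause.
subquery_sql = """
    SELECT customer, total
    FROM (SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer) t
    WHERE total > 20
    ORDER BY customer
"""

# CTE: the same aggregation is named up front and then read top to bottom.
cte_sql = """
    WITH customer_totals AS (
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
    )
    SELECT customer, total
    FROM customer_totals
    WHERE total > 20
    ORDER BY customer
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()
```

Both queries return the same rows; the CTE version simply gives the intermediate result a name, which matters more as the number of intermediate steps grows.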

Read more

Working with dates in Apache Hive

Working with dates is one of those tedious things we frequently come across as data engineers. The frustration is that there are simply tonnes of date formats. Let’s list a few:

Format                     Example
MM/dd/yy                   11/01/21
dd/MM/yy                   01/11/21
yy/MM/dd                   21/11/01
d/MM/yy                    1/11/21 (no leading zeros)
MMddyy                     110121
ddMMyy                     011121
yyyyMMdd                   20211101
yyyy-MM-dd HH:mm:ss.SSS    2021-11-01 10:45:12.084
yyyy-MM-dd […]
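
As a quick sanity check that the date-only formats above all describe the same day, the sketch below parses each example with Python's `datetime.strptime`. The mapping from the Java-style patterns to Python directives (for instance `MM/dd/yy` to `%m/%d/%y`) is my own assumption for illustration; it is not Hive syntax.

```python
from datetime import datetime

# Java-style pattern -> (example string, assumed Python strptime directive)
EXAMPLES = {
    "MM/dd/yy": ("11/01/21", "%m/%d/%y"),
    "dd/MM/yy": ("01/11/21", "%d/%m/%y"),
    "yy/MM/dd": ("21/11/01", "%y/%m/%d"),
    "d/MM/yy": ("1/11/21", "%d/%m/%y"),   # %d also accepts a day without a leading zero
    "MMddyy": ("110121", "%m%d%y"),
    "ddMMyy": ("011121", "%d%m%y"),
    "yyyyMMdd": ("20211101", "%Y%m%d"),
}


def to_iso(text: str, directive: str) -> str:
    """Parse text with the given directive and return an ISO yyyy-MM-dd string."""
    return datetime.strptime(text, directive).date().isoformat()


# Normalise every example to a single, unambiguous representation.
normalised = {
    pattern: to_iso(text, directive)
    for pattern, (text, directive) in EXAMPLES.items()
}
```

Every entry in `normalised` comes out as the same ISO date, which is the point: the string alone is ambiguous until you know which pattern produced it.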

Read more

Improving performance when calculating percentiles in Spark

Performance is a major concern when you’re working in a distributed environment with a massive amount of data. I’ve discussed Spark performance in quite a lot of detail before here and here. Today I am going to talk specifically about percentiles, because calculating percentiles over large distributed datasets is a mammoth task. You will likely […]

Read more

An introduction to timeseries models (AR, MA, ARMA and ARIMA)

Timeseries forecasting is quite a big topic to cover. I’ve spoken about key terminology and exponential smoothing in this article and I’ve spoken about how we might remove timeseries outliers here. In this post, I am going to discuss the different components of the ARIMA model (AR and MA), in addition to the ARIMA model […]

Read more

Overcoming Futures Timeout & Read Timeout errors in PySpark

This article covers a less-than-orthodox method for handling resource constraints in your PySpark code. Consider the below scenario. Here, we are loading data from four sources, doing some joins & aggregations, and producing an output. The problem is that we keep getting timeout errors because the data is just so large. After tuning the […]

Read more

MAE | RMSE | MAPE : Measures of model accuracy for data scientists

Mean Absolute Error (MAE)

This simply takes the difference between the predicted value and the actual value for every prediction and averages the results. However, to avoid values cancelling one another out, it first takes the absolute value of each difference (that is, it makes every value positive). Let’s consider an example. In the below, […]
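
The cancelling-out problem is easy to see in a few lines of Python. This is a minimal sketch with made-up numbers; `mean_error` and `mean_absolute_error` are hypothetical helper names written for this illustration.

```python
def mean_error(actual, predicted):
    """Average of the signed differences: positive and negative errors cancel."""
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences: every error counts as positive."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative values only: the predictions are off by +2, -2, +3 and -3.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

signed = mean_error(actual, predicted)        # errors cancel to 0.0
mae = mean_absolute_error(actual, predicted)  # (2 + 2 + 3 + 3) / 4 = 2.5
```

The signed average suggests a perfect model, while the MAE shows that predictions are off by 2.5 on average.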

Read more

Data Badass Early Preview

An early view of my new book ‘Data Badass’ is available to view using this link. It’s not yet been through thorough editing and will be added to over time but I am keen to gather some feedback. It’s a book that covers the data basics; data platforms (including Hadoop, Kafka, Flume, Hive, Spark) and […]

Read more

Using ROC Curves & AUC

This is a snippet from my upcoming book ‘Data Badass’ (pictured below): The ROC curve & the Area Under the Curve (AUC) are used for binary classification problems. The ROC curve chart plots the True Positive Rate against the False Positive Rate. Ideally, you want to reduce the number of false positives as much as […]
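
To make the True Positive Rate and False Positive Rate concrete, here is a small pure-Python sketch that builds ROC points by sweeping a threshold over model scores and then measures the area with the trapezoid rule. The labels, scores and function names are illustrative assumptions, and the sketch assumes both classes are present in the data.

```python
def tpr_fpr(labels, scores, threshold):
    """Return (TPR, FPR) when scores at or above the threshold are predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


def roc_points(labels, scores):
    """Sweep thresholds from strictest to loosest, collecting (FPR, TPR) points."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tpr, fpr = tpr_fpr(labels, scores, threshold)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the ROC points using the trapezoid rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Illustrative data: 1 = positive class, 0 = negative class.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
area = auc(points)
```

An area of 1.0 would mean every positive is scored above every negative, while 0.5 is what random scoring would give on average.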

Read more

An introduction to structured data modelling

This is an introductory chapter of my upcoming book ‘Data Badass’ (pictured below): Data modelling is all about designing the way that your data is going to be organized. Think about it like building a house; you wouldn’t start laying bricks without a plan. How would you know that the end result would meet your […]

Read more



Follow Me

Get new content delivered directly to your inbox.