Daily photo blog, reviews, photography tips, gear buying guides, and more.
Established in June of 2010, Edison is a blog focusing on the world of photography. We partner with some of the best vendors in the photography space to provide you with informative gear, tools, and other products you might need to succeed. If you like our content, don’t forget to subscribe at the bottom of the page.
Performance is a major concern when you’re working in a distributed environment with a massive amount of data. I’ve discussed Spark performance in quite a lot of detail before here and here. Today I am going to talk specifically about percentiles, because calculating percentiles over large distributed datasets is a mammoth task. You will likely […]Read more
Timeseries Decomposition is a mathematical procedure which allows us to transform our single timeseries into multiple series. These help us to extract seasonality and trend information easily. Doing this in Python is quite a simple task, and I have outlined it below. However, before we get into that, we need to understand the difference between additive […]Read more
Timeseries forecasting is quite a big topic to cover. I’ve spoken about key terminology and exponential smoothing in this article and I’ve spoken about how we might remove timeseries outliers here. In this post, I am going to discuss the different components of the ARIMA model (AR and MA), in addition to the ARIMA model […]Read more
Timeseries analysis is incredibly powerful but can get quite confusing. There is a lot of terminology which we need to understand before we can really progress with making a forecast. Ultimately, timeseries analysis is all about analysing and forecasting data that is indexed in equally spaced increments of time; e.g. minutes, seconds, days, weeks, months, […]Read more
This article covers a less than orthodox method for handling resource constraints in your PySpark code. Consider the below scenario. Here, we are loading data from four sources, doing some joins and aggregations, and producing an output. The problem is, we keep getting timeout errors because the data is just so large. After tuning the […]Read more
Mean Absolute Error (MAE): This takes the difference between the predicted value and the actual value for every prediction and averages the results. However, to avoid positive and negative differences cancelling one another out, it averages the absolute value of each difference (which means it treats every difference as positive). Let’s consider an example. In the below, […]Read more
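The MAE description above can be sketched in a few lines of plain Python. This is a minimal illustration and not the example from the full article; the function name and the sample values are made up for demonstration.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must be the same length")
    if not actual:
        raise ValueError("at least one prediction is required")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical values, chosen only to show the cancelling-out effect.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 39]

# Without abs(), the signed errors (-2, +2, -3, +1) partly cancel out.
signed_mean = sum(a - p for a, p in zip(actual, predicted)) / len(actual)

mae = mean_absolute_error(actual, predicted)
print(signed_mean)  # -0.5
print(mae)          # 2.0
```

The signed average suggests the model is off by only half a unit, while the MAE shows the typical miss is two units.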
An early view of my new book ‘Data Badass’ is available to view using this link. It’s not yet been through thorough editing and will be added to over time but I am keen to gather some feedback. It’s a book that covers the data basics; data platforms (including Hadoop, Kafka, Flume, Hive, Spark) and […]Read more
This is a snippet from my upcoming book ‘Data Badass’ (pictured below): The ROC Curve and the Area Under Curve (AUC) are used for binary classification problems. The ROC curve chart plots the True Positive Rate against the False Positive Rate. Ideally, you want to reduce the number of false positives as much as […]Read more
This is an introductory chapter of my upcoming book ‘Data Badass’ (pictured below): Data modelling is all about designing the way that your data is going to be organized. Think about it like building a house; you wouldn’t start laying bricks without a plan. How would you know that the end result would meet your […]Read more
This is an introductory chapter of my upcoming book ‘Data Badass’ (pictured below): Data Types: Data types are ways to classify data. For example, integer, string and boolean. These assignments / classifications determine what kind of values we can store in our data structures (outlined below). A string is simply a text value. An example […]Read more
This article gives an overview of how to extract data from a website through screen scraping. Specifically, this will focus on the UK government open data website. Scraping websites is a contentious topic – while some websites don’t mind you doing it; some really would rather you didn’t and put in measures to try and stop you. […]Read more
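To make the two rates on the ROC chart concrete, here is a small plain-Python sketch that computes one point of the curve for a given threshold. The function name, labels and scores are hypothetical and are not taken from the book; it assumes the data contains at least one positive and one negative example.

```python
def roc_point(labels, scores, threshold):
    """Return (true positive rate, false positive rate) at one threshold.

    A record is predicted positive when its score is >= threshold.
    Assumes labels contain at least one 1 and at least one 0.
    """
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn)  # share of actual positives that were caught
    fpr = fp / (fp + tn)  # share of actual negatives wrongly flagged
    return tpr, fpr


# Hypothetical labels (1 = positive) and model scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.7, 0.2, 0.4, 0.1]

# Each threshold gives one point; sweeping thresholds traces the curve.
for threshold in (0.0, 0.5, 1.0):
    print(threshold, roc_point(labels, scores, threshold))
```

Lowering the threshold catches more true positives but also lets in more false positives, which is the trade-off the curve visualises.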
Data Skew is a real problem in Spark. It seems like the sort of thing which should be solved automatically by the job manager but unfortunately, it isn’t. Skew is where we have a given partition which contains far more data than the others – leaving one executor to process a lot of data, […]Read more
Apache Spark provides us with a framework to crunch a huge amount of data efficiently by leveraging parallelism, which is great! However, with great power comes great responsibility, because optimising your scripts to run efficiently is not so easy. Within our scripts, we need to look to minimize the data we bring in; avoid UDFs […]Read more
Window functions are incredibly useful. Within a single query, you can find out things which may have otherwise been tricky. In this article, I will cover all of the key window functions in PySpark. First off, we need to define our dataframe – you can get the data to play along here. Now, we have […]Read more
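As a rough illustration of the skew described above, the plain-Python sketch below counts rows per key and compares the largest group with the average. It does not use Spark; the keys and the `skew_ratio` helper are hypothetical. The assumption is that rows sharing a key end up in the same partition when Spark shuffles by that key, so one very large key means one very large partition.

```python
from collections import Counter


def skew_ratio(counts):
    """Largest group size divided by the mean group size (1.0 = perfectly even)."""
    mean_size = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean_size


# Hypothetical join/group keys: one key dominates the dataset.
keys = ["uk"] * 8 + ["fr", "de"]

counts = Counter(keys)
ratio = skew_ratio(counts)

print(counts.most_common())
print(round(ratio, 2))
```

A ratio well above 1 suggests that whichever executor receives the dominant key will do most of the work while the others sit idle.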
Windowing functions in Hive are super useful. They make analysis that would otherwise be challenging much easier. Let’s look at examples of the key windowing functions in Hive. We will use the below dataset, the ‘payments’ table, for our analysis.
customerid  department  payment_amount
1           1           10000
2           1           9000
3           2           12000
4 […]Read more
SQL is one of the most in-demand data skills. The language has been adopted by many database platforms, including Apache Hive. This article will serve as a crash course in the key functionality of HiveQL. Throughout this article, we will use the two sample tables as the basis for our code. THE CUSTOMERS TABLE […]Read more
When you look online at what it takes to become a data scientist, it’s enough to make your brain melt. You have people telling you that you need to be an expert statistician / mathematician; you need to be a top-level coder in 15 different languages; be proficient with every type of SQL/NoSQL database on […]Read more
Agile is about iterative development and delivering tangible products/features quickly, which provides the business with value and ROI faster than a traditional waterfall project. Consider the example of a piece of accounting software. Overall, it’s going to have 50 features to support the accounts team. To deliver all of the features in a waterfall fashion, […]Read more