Today I am going to talk specifically about percentiles, because calculating percentiles over large distributed datasets is a mammoth task. You will likely find that a MapReduce or Spark job takes a very long time to calculate these measures. This is because the data is distributed across many nodes, and a significant amount of data shuffling has to occur in order to calculate the percentile values.
What are percentiles & why are they computationally expensive?
Percentiles are a way of ranking data within a set of values. It is a measure used in statistics to describe the relative standing of an individual value within a dataset. Percentiles indicate the percentage of values that fall below or above a certain point. For example, if a value is at the 25th percentile, it indicates that 25% of the values in the dataset are lower than that value.
Calculating percentiles can be computationally expensive because it requires sorting a large amount of data to determine the desired percentile, which takes considerable time and resources on large datasets. Furthermore, calculating percentiles is non-trivial and involves several steps: building a cumulative view of the sorted data, finding the rank for each data point, and using that rank to identify the percentile corresponding to each data point. On a distributed dataset, the global sort behind these steps is exactly the shuffle-heavy work that makes the computation slow.
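To make the cost concrete, here is a minimal sketch of the exact approach in plain Python (function and variable names are illustrative, not from any library): sort the full dataset, then index into it by rank. On a cluster, that first sort is the expensive global shuffle.

```python
import math

def exact_percentile(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method:
    the smallest value such that at least p% of the data is <= it."""
    ordered = sorted(values)                      # O(n log n) over the whole dataset
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # nearest-rank position
    return ordered[rank - 1]

data = [15, 20, 35, 40, 50]
exact_percentile(data, 25)  # -> 20
```

This is fine for a list in memory; the problem the rest of this post addresses is that `sorted(values)` has no cheap equivalent when `values` is spread across many nodes.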
Use an Approximate Percentile to improve performance
Approximate percentile (or approx_percentile) is a statistical function in Apache PySpark that provides an efficient way to calculate percentiles for large datasets. It is particularly useful when exact percentiles are too time-consuming or costly to calculate. Under the hood, Spark uses a compact quantile sketch (a variant of the Greenwald-Khanna algorithm) to approximate percentiles with a controllable degree of accuracy.
The approximate percentile calculation process is highly efficient and can be used to calculate approximate percentile values for very large datasets in a fraction of the time required to calculate exact percentile values. Approximate percentile is especially useful for datasets that have a high degree of variation, such as those with outliers or non-normal distributions.
How do we calculate approximate percentiles in PySpark?
Apache PySpark provides a range of powerful functions for analyzing data and gaining insight from it. One such function is the percentile_approx calculation.
In Apache PySpark, calculating an approximate percentile is a simple three-step process:
1. Create a dataframe from the dataset: using Apache Spark's DataFrame API, we can create a dataframe from our dataset with a single command. This dataframe is then used in subsequent operations.
2. Prepare the column: percentile_approx operates on a numeric column, and Spark handles the ordering internally, so an explicit sort is not strictly required. If you do want ordered output for other reasons, Apache Spark's sort function will do it.
3. Calculate the percentile: finally, we use Apache Spark's percentile_approx function to calculate the percentile value. It takes the column to aggregate and a percentage expressed as a fraction between 0 and 1. For example, to calculate the 80th percentile for our data, we would pass 0.8 (not 80) as the second parameter.
The percentile_approx function returns the approximate percentile value for our dataframe. With this, we can quickly and easily calculate percentiles for any dataset.
The approx functions operate on numeric columns. If your columns are not already numeric, or you simply do not need the fractional part, you can cast them to integers, which removes everything after the decimal point.
# select the fields we need, limited to the given social catids
df1 = df1.select(
    df1.fieldname.cast("integer").alias("somealias"),
    df1.fieldname2.cast("integer").alias("somealias2"))
PySpark offers two methods: approxQuantile and approx_percentile (exposed in SQL as percentile_approx). approxQuantile takes three arguments: the field you want to calculate percentiles of; a list of the percentiles you want (e.g. 0.25 for the 25th percentile); and the relative error you are willing to accept, between 0 and 1, where 0 gives the exact value. It returns a plain Python list rather than a dataframe, which is not all that helpful, so in most cases approx_percentile is preferable.
df.approxQuantile('fieldname', [quantiles to calculate], accepted_error)
df.approxQuantile('usage', [0.5], 0.25)
To use approx_percentile, we have a number of options. First, we could simply use the function by running SQL on our dataframes. This is a super useful approach, especially if you're writing back to a database table.
spark.sql("SELECT percentile_approx(x, 0.5) FROM df")
We could also run the snippet below, which gives us a new column called '%10' containing the result of the percentile calculation. This again has its advantages: it keeps all your code in a single language, rather than mixing and matching between Spark functions and SQL.
from pyspark.sql import functions as sqlfunc

df \
    .groupby('grp') \
    .agg(sqlfunc.round(sqlfunc.expr('percentile_approx(fieldname, 0.10)'), 2).alias('%10')) \
    .show()
These simple changes to your script will result in a huge improvement in performance, at the cost of a slight dip in absolute accuracy. Unless you're chasing accuracy to an extreme level (for example in financial trading or medical use-cases), this will probably be sufficient.