Mastering PySpark Window Functions: Cumulative Calculations (Running Totals and Averages)

As data grows larger, it becomes more difficult to process, which is why big data frameworks like PySpark exist. One common task in data processing is the calculation of running totals and averages. In PySpark, we can accomplish this task using window functions.

In this article, we will explore PySpark window functions for cumulative calculations. We will explain what they are, when to use them, and provide real-world examples.

What are PySpark Window Functions?

PySpark window functions let you perform a calculation across a set of rows that are related to the current row. They work much like SQL window functions: you can aggregate over a group of rows without collapsing them the way groupBy does, so every input row is kept and annotated with the result. This makes them a natural fit for running totals, running averages, and other cumulative or per-group statistics over a subset of the data.
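
To make the "without grouping" point concrete, here is a minimal sketch (the tiny store/sales dataframe below is hypothetical and not part of the examples that follow): a groupBy aggregation collapses the rows, while a window aggregate keeps every row and attaches the group total to each one.

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import sum

spark = SparkSession.builder.getOrCreate()

# hypothetical per-store sales
rows = [("A", 10), ("A", 20), ("B", 5)]
demo_df = spark.createDataFrame(rows, ["store", "sales"])

# groupBy collapses the data to one row per store
demo_df.groupBy("store").agg(sum("sales").alias("total_sales")).show()

# a window aggregate keeps all three rows and attaches each store's total to every row
storeWindow = Window.partitionBy("store")
demo_df.withColumn("store_total", sum("sales").over(storeWindow)).show()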

Example 1: Calculating Running Totals

In this example, we will calculate the running total of sales for each day in our dataset, using the sum function over a cumulative window.

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import sum

# get (or reuse) the active SparkSession; in the PySpark shell or a notebook it already exists as `spark`
spark = SparkSession.builder.getOrCreate()

# create example dataframe
data = [("2022-01-01", 10),
        ("2022-01-02", 15),
        ("2022-01-03", 20),
        ("2022-01-04", 30),
        ("2022-01-05", 25)]

df = spark.createDataFrame(data, ["date", "sales"])

# define window specification: order by date, with a frame from the first row
# up to the current row (0 is equivalent to Window.currentRow)
windowSpec = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, 0)

# create running total column
df = df.withColumn("running_total", sum("sales").over(windowSpec))

df.show()

Output:

+----------+-----+-------------+
|      date|sales|running_total|
+----------+-----+-------------+
|2022-01-01|   10|           10|
|2022-01-02|   15|           25|
|2022-01-03|   20|           45|
|2022-01-04|   30|           75|
|2022-01-05|   25|          100|
+----------+-----+-------------+

As we can see, the new running_total column holds the cumulative sum of sales up to and including each day. The frame rowsBetween(Window.unboundedPreceding, 0) is what makes the sum cumulative: for each row it covers every row from the start of the ordered data through the current row.
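
Because PySpark window functions mirror SQL window functions, the same running total can also be expressed in Spark SQL. Here is a small sketch; the temp view name is just an illustrative choice:

df.createOrReplaceTempView("daily_sales")

spark.sql("""
    SELECT date, sales,
           SUM(sales) OVER (ORDER BY date
                            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
    FROM daily_sales
""").show()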

Example 2: Running Average of a Column

In this example, we will calculate the running average of the “sales” column. Instead of the daily data above, we will use a “sales_df” dataframe that holds one row of sales per month.

from pyspark.sql.window import Window
from pyspark.sql.functions import avg

# monthly sales data: (month number, sales)
sales_data = [(1, 100), (2, 80), (3, 120), (4, 140), (5, 90), (6, 110),
              (7, 100), (8, 70), (9, 150), (10, 130), (11, 120), (12, 110)]
sales_df = spark.createDataFrame(sales_data, ["month", "sales"])

# same cumulative frame as before, now ordered by month
windowSpec = Window.orderBy("month").rowsBetween(Window.unboundedPreceding, Window.currentRow)

sales_df.withColumn("running_avg", avg("sales").over(windowSpec)).show()

Output:

+-----+-----+------------------+
|month|sales|       running_avg|
+-----+-----+------------------+
|    1|  100|             100.0|
|    2|   80|              90.0|
|    3|  120|             100.0|
|    4|  140|             110.0|
|    5|   90|             106.0|
|    6|  110|106.66666666666667|
|    7|  100|105.71428571428571|
|    8|   70|            101.25|
|    9|  150|106.66666666666667|
|   10|  130|             109.0|
|   11|  120|             110.0|
|   12|  110|             110.0|
+-----+-----+------------------+

Here, the window specification uses the same cumulative frame as in the previous example, but ordered by the “month” column. The “avg” function computes the average of the “sales” values for the rows within the window, and “withColumn” adds the result as a new “running_avg” column. Because “avg” returns a double, Spark prints the full floating-point value; wrap the expression in round() from pyspark.sql.functions if you prefer fewer decimal places.

Example 3: Running Total of a Column

In this example, we will calculate the running total of the “sales” column, using the monthly “sales_df” dataframe and the window specification from Example 2.

from pyspark.sql.functions import sum

# reuse the month-ordered cumulative window defined in the previous example
sales_df.withColumn("running_total", sum("sales").over(windowSpec)).show()

Output:

+-----+-----+-------------+
|month|sales|running_total|
+-----+-----+-------------+
|    1|  100|          100|
|    2|   80|          180|
|    3|  120|          300|
|    4|  140|          440|
|    5|   90|          530|
|    6|  110|          640|
|    7|  100|          740|
|    8|   70|          810|
|    9|  150|          960|
|   10|  130|         1090|
|   11|  120|         1210|
|   12|  110|         1320|
+-----+-----+-------------+

Here, we reuse the window specification defined in Example 2. The “sum” function adds up the “sales” values for the rows within the window, and “withColumn” adds the result as a new “running_total” column.
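
Both cumulative columns can also be produced in one pass by chaining withColumn calls over the same window (a small sketch reusing sales_df and windowSpec from above):

from pyspark.sql.functions import avg, sum

sales_df.withColumn("running_avg", avg("sales").over(windowSpec)) \
        .withColumn("running_total", sum("sales").over(windowSpec)) \
        .show()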

When to use PySpark Window Functions for Cumulative Calculations?

Cumulative calculations can be used to understand trends over time. Some examples of cumulative calculations include running totals, running averages, year-to-date sales, etc. PySpark window functions can be used to easily calculate these metrics without the need for complex groupings or subqueries.
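
For the year-to-date case, adding a partitionBy clause restarts the running total at the start of each year. Here is a minimal sketch; the multi-year dataframe below is hypothetical and not part of the earlier examples:

from pyspark.sql.window import Window
from pyspark.sql.functions import sum

# hypothetical multi-year data: (year, month, sales)
ytd_data = [(2021, 1, 100), (2021, 2, 80), (2022, 1, 120), (2022, 2, 140)]
ytd_df = spark.createDataFrame(ytd_data, ["year", "month", "sales"])

# the partition restarts the cumulative sum for every year
ytd_window = (Window.partitionBy("year")
              .orderBy("month")
              .rowsBetween(Window.unboundedPreceding, Window.currentRow))

ytd_df.withColumn("ytd_sales", sum("sales").over(ytd_window)).show()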

Closing Thoughts

In conclusion, PySpark window functions are incredibly powerful tools for performing advanced data manipulations and calculations on large datasets. In this blog post, we explored the concept of cumulative calculations and how PySpark window functions can be used to create running totals and averages. We discussed several examples and provided detailed code explanations, along with real-world use cases.

It’s important to note that while window functions can be incredibly useful, they may not always be the best solution for every problem. Understanding the limitations and when not to use window functions is also critical in effectively using them. With the knowledge gained from this post, you can confidently leverage PySpark window functions in your data processing workflows and produce valuable insights for your organization.
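
One limitation worth calling out concretely: a window that has an orderBy but no partitionBy, like the ones in the examples above, forces Spark to move all rows into a single partition in order to sort them, and Spark logs a performance warning when this happens on real data. Below is a minimal sketch of the usual mitigation, assuming your data has a natural partitioning key such as the hypothetical region column used here:

from pyspark.sql.window import Window
from pyspark.sql.functions import sum

# hypothetical regional sales
regional_data = [("east", "2022-01-01", 10), ("east", "2022-01-02", 15),
                 ("west", "2022-01-01", 20), ("west", "2022-01-02", 30)]
regional_df = spark.createDataFrame(regional_data, ["region", "date", "sales"])

# partitioning by region lets each region's cumulative sum be computed in parallel
region_window = (Window.partitionBy("region")
                 .orderBy("date")
                 .rowsBetween(Window.unboundedPreceding, Window.currentRow))
regional_df.withColumn("running_total", sum("sales").over(region_window)).show()

# if you only need one total per region (not one per row), a plain groupBy is simpler and cheaper
regional_df.groupBy("region").agg(sum("sales").alias("total_sales")).show()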

