Originally published on Medium
Typical data engineering tasks follow a fairly procedural style of code. Taking a sample dataset, for example, we might:
- Read and parse the file
- Perform some aggregations on the parsed data
- Persist the aggregated results to a table or print them to the console
In Spark, we can do this in multiple ways, but let's look at a couple of organized approaches.
Approach #1: Chained Transformations
This approach relies on two simple functions, each handling a particular task: an aggregation and a filter. In the main class, ChainedTransformationMain, we read an open source Spotify dataset into a DataFrame and then apply those transformations on top of it. Chaining them keeps the code easy to understand and maintain.
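The class itself isn't reproduced here, so here is a minimal sketch of what it could look like. The dataset path and the column names (album_popularity, count) are assumptions made purely for illustration:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ChainedTransformationMain {

  // On Databricks, getOrCreate() returns the notebook's existing session
  val spark: SparkSession = SparkSession.builder().getOrCreate()

  // Load the open source Spotify albums dataset (the path is illustrative)
  def loadSpotifyAlbum: DataFrame =
    spark.read.option("header", "true").csv("/data/spotify_albums.csv")

  // Aggregation: number of albums per popularity score (column name assumed)
  def numAlbumPopularity(df: DataFrame): DataFrame =
    df.groupBy("album_popularity").count()

  // Filter: keep only the popularity groups with more than 6,000 albums
  def greaterThan6kAlbums(df: DataFrame): DataFrame =
    df.filter(col("count") > 6000)

  def main(args: Array[String]): Unit = {
    // chain the transformations and print the result to the console
    loadSpotifyAlbum
      .transform(numAlbumPopularity)
      .transform(greaterThan6kAlbums)
      .show()
  }
}

The interesting part is the chain itself: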
val result = loadSpotifyAlbum
.transform(numAlbumPopularity)
.transform(greaterThan6kAlbums)
Here we have used the transform function. transform is defined on Spark DataFrames (and Datasets in general); it applies a user-supplied function to the whole DataFrame and returns that function's result, which is what makes it a natural fit for chaining custom transformations:
def transform[U](func: Dataset[T] => Dataset[U]): Dataset[U]
The func argument is a function that accepts the Dataset (for a DataFrame, that is Dataset[Row]), performs some transformations on it, and returns another Dataset. When transform is called, it simply passes the DataFrame to func and returns whatever func produces, so each step in the chain works on the output of the previous one.
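A nice side effect of this signature is that transformations can take their own parameters through currying and still be handed to transform. For example, a hypothetical helper (not part of the original code) that makes the 6,000 threshold configurable, reusing the assumed count column from the aggregation:

// Hypothetical parameterized transformation: applying the first argument list
// leaves a DataFrame => DataFrame function that transform can accept
def albumsWithCountAbove(threshold: Long)(df: DataFrame): DataFrame =
  df.filter(col("count") > threshold)

val parameterized = loadSpotifyAlbum
  .transform(numAlbumPopularity)
  .transform(albumsWithCountAbove(6000))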
I would like to highlight that chaining transformations is a good practice in Spark development. It helps with:
- Readability: small, well-named transformations chained together read almost like a description of the pipeline
- Performance: the chained transformations are still lazy, so the whole plan can be optimized and executed in a single pass, which can reduce the time and resources needed for the processing steps
- Maintainability: if each transformation is a separate method, it can be reused in other parts of the code, which makes the code easier to maintain and can result in fewer bugs
Let's take this a step further with another approach.
Approach #2: Implicit Classes
An implicit class is syntactic sugar in Scala that allows implicit conversion from the original class to a new, enriched class. In this case, ClassDataFrame enriches instances of the DataFrame class with two additional methods: numAlbumPopularity and greaterThan6kAlbums.
Looking at the readability of the result variable below, it clearly states the intent: load the Spotify albums, count them by popularity, and keep only the groups with more than 6k albums. With appropriately named methods, it reads almost like a simple English sentence.
val result = loadSpotifyAlbum.numAlbumPopularity.greaterThan6kAlbums
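As with the first approach, the enriched class isn't reproduced here, so here is a minimal sketch of what it could look like, reusing the same assumed column names:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object DataFrameImplicits {

  // Enriches DataFrame so the two domain-specific methods can be called
  // as if they were defined on DataFrame itself
  implicit class ClassDataFrame(df: DataFrame) {

    def numAlbumPopularity: DataFrame =
      df.groupBy("album_popularity").count()

    def greaterThan6kAlbums: DataFrame =
      df.filter(col("count") > 6000)
  }
}

With import DataFrameImplicits._ in scope, the compiler silently wraps the DataFrame in ClassDataFrame whenever one of these methods is called, which is what makes the one-liner above compile. (The enclosing object name DataFrameImplicits is an assumption; in Scala 2 an implicit class simply has to live inside an object, class, or trait.)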
Execution: I ran this in a Databricks notebook, so there was no need to create an explicit SparkSession as a separate variable. The code was executed something like this:
ImplicitClassMain.main(Array("Hey Implicit Class!"))
ChainedTransformationMain.main(Array("Hey Chained Transform!"))
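When running outside Databricks, for example locally or via spark-submit, the session would have to be set up explicitly first, roughly like this (the appName and master values are illustrative):

val spark = SparkSession.builder()
  .appName("SpotifyTransformations")  // illustrative application name
  .master("local[*]")                 // only for local runs; a cluster manager sets this otherwise
  .getOrCreate()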
Conclusion
To sum up, we have multiple approaches to organize Spark transformations: the transform function and implicit classes. I have been using transform for a while, but I found the implicit class approach especially interesting and useful because of how readable it makes the code.
Code
The code is maintained in our DataTribe community GitHub account.