Originally published on Medium - Data AI Stockholm
.. and stop crushing it!
Background and problem statement
When working with data analysts or data scientists, we often see opportunities to improve the code. They bring deep domain expertise in specific areas, which adds real value on top of the datasets.
In most use cases, the data platform’s responsibility ends once the data is available in structured or raw/bronze layers. From there, it’s expected that Data Scientists use the distributed platform and follow engineering best practices. But a lot of time ends up going into EDA, and less into engineering the solution — things like CI/CD, writing test cases, adding docstrings, or following the team’s coding guidelines.
Then, when it’s time to release those changes, a lot of effort goes into refactoring the code, cleaning things up, and making it production-ready. And because the engineer handling this doesn’t always have full context of the business logic, some things get overlooked, like the use of pandas instead of Spark. That decision may have been intentional, based on how the pipeline behaves at that checkpoint, but the nuance gets lost.
Let’s explore the pandas vs Spark topic with a use case, in more detail, and see how it affects the program.
Use case: Customer Churn Analysis
Let’s say you’re a data scientist working on customer churn analysis. You have 50GB of customer transaction data in Delta Lake with 100M+ records, and you need to create features for ML model training.
Solution
To get started, I created a sample Spark program which generates the data, with output like below.
Generating sample data...
Generated 500000 transactions for testing
DataFrame shape: 500000 rows
Unique customers: 10000
Date range: 2023-01-01 00:00:00 to 2024-12-31 00:00:00
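The generator itself isn’t reproduced here, but a minimal sketch of a Spark program that produces data of this shape could look like the following (column names match the analysis functions below; the category values, amount ranges and date logic are assumptions, not the original code):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

num_transactions = 500_000
num_customers = 10_000

# Sketch: synthesize transactions with the columns used by the analysis below
df_spark = (
    spark.range(num_transactions)
    .withColumn("customer_id", (F.rand(seed=42) * num_customers).cast("int"))
    .withColumn("transaction_amount", F.round(F.rand(seed=1) * 500, 2))
    .withColumn("quantity", ((F.rand(seed=2) * 10) + 1).cast("int"))
    .withColumn("product_category", F.concat(F.lit("category_"), (F.col("id") % 5).cast("string")))
    .withColumn("purchase_date",
                F.expr("date_add(date'2023-01-01', cast(rand(4) * 730 as int))").cast("timestamp"))
    .withColumn("days_since_last_purchase", (F.rand(seed=5) * 180).cast("int"))
    .drop("id")
)

print(f"Generated {df_spark.count()} transactions for testing")
print(f"Unique customers: {df_spark.select('customer_id').distinct().count()}")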
And here is how the example data looks:
[Image showing sample data - reference original article for visual]
Now, let’s use this customer transaction data to understand customer behavior and find out who has likely churned.
Approach #1
Let’s call it the pandas-based approach.
def analyze_churn_pandas_way(df_spark):
    import time

    print("Testing Pandas Approach...")
    start_time = time.time()

    # Convert entire dataset to pandas
    df = df_spark.toPandas()

    # Feature engineering with pandas
    customer_features = df.groupby('customer_id').agg({
        'transaction_amount': ['sum', 'mean', 'std', 'count'],
        'days_since_last_purchase': 'min',
        'product_category': lambda x: x.nunique(),
        'quantity': 'sum'
    })

    # Flatten column names
    customer_features.columns = ['_'.join(col).strip() for col in customer_features.columns.values]
    customer_features = customer_features.reset_index()

    # To find out churn
    avg_days_between = df.groupby('customer_id')['purchase_date'].apply(
        lambda x: x.diff().dt.days.mean() if len(x) > 1 else 0
    ).reset_index()
    avg_days_between.columns = ['customer_id', 'avg_days_between_purchases']
    customer_features = customer_features.merge(avg_days_between, on='customer_id')

    # Create churn labels
    customer_features['is_churned'] = (customer_features['days_since_last_purchase_min'] > 90).astype(int)

    end_time = time.time()
    print(f"Pandas approach took: {end_time - start_time:.2f} seconds")
    print(f"Features created for {len(customer_features)} customers")

    return customer_features
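Calling it on the generated DataFrame is straightforward (a usage sketch, assuming df_spark is the Spark DataFrame created earlier):
# Usage sketch: run the pandas-based analysis on the Spark DataFrame
customer_features_pd = analyze_churn_pandas_way(df_spark)
print(customer_features_pd.head())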
There is no obvious problem with this solution, because it just works. Here is what the output looks like:
Testing Pandas Approach...
Pandas approach took: 5.96 seconds
Features created for 10000 customers
OK, but this timing depends on the cluster setup, which looks like this:
cluster_info:
{'driver_memory': 'None',
'driver_cores': None,
'executor_memory': '3157m',
'num_executors': 2}
# by default, we don't get the driver details programmatically; those are abstracted away in the Databricks environment.
# we can find the details in the Spark configuration
driver_memory: 8GB
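For completeness, a small sketch of how such values can be read programmatically (the keys are standard Spark properties; on Databricks some driver settings may simply be absent from the conf):
# Sketch: read executor/driver settings from the Spark configuration
conf = spark.sparkContext.getConf()
cluster_info = {
    "driver_memory": conf.get("spark.driver.memory", None),
    "driver_cores": conf.get("spark.driver.cores", None),
    "executor_memory": conf.get("spark.executor.memory", None),
    "num_executors": conf.get("spark.executor.instances", None),
}
print(cluster_info)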
The churn results:
[Image showing churn results - reference original article for visual]
OK! But what happened under the hood? How did the process actually run behind the scenes while executing the above function?
[Image showing driver process doing all the heavy-lifting - reference original article for visual]
So, when analyze_churn_pandas_way is called, the line df = df_spark.toPandas() kicks off three phases:
Phase 1: Spark execution. Spark creates two partitions, one for each executor, and each executor fetches its share of the data (executors working).
Phase 2: The driver collects the data from the executors (network transfer).
Phase 3: Pandas processing. The remaining steps, like the groupby on customer data, run single-threaded on the driver, leaving the executors sitting idle (driver only).
Now imagine GBs of data: collecting it all and grouping customer transactions via pandas on the driver is a serious memory issue, leading to out-of-memory errors!
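If toPandas() is really needed during EDA, one defensive pattern is a size check before collecting; a minimal sketch (the row threshold is an assumption you would tune to your driver memory):
# Guardrail sketch: refuse to collect when the dataset is too large for the driver
max_rows_for_driver = 1_000_000  # assumed threshold, tune to your driver memory
row_count = df_spark.count()
if row_count > max_rows_for_driver:
    raise ValueError(f"{row_count} rows is too large for toPandas(); keep the work in Spark")
df = df_spark.toPandas()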
OK! How do we actually solve this issue? Use Spark APIs as much as possible and leverage distributed compute!
Approach #2
This is the Spark-based approach.
def analyze_churn_spark_way(df_spark):
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window
    import time

    print("Testing Spark Approach...")
    start_time = time.time()

    # Feature engineering with Spark operations (distributed)
    customer_features = df_spark.groupBy('customer_id').agg(
        F.sum('transaction_amount').alias('transaction_amount_sum'),
        F.mean('transaction_amount').alias('transaction_amount_mean'),
        F.stddev('transaction_amount').alias('transaction_amount_std'),
        F.count('transaction_amount').alias('transaction_amount_count'),
        F.min('days_since_last_purchase').alias('days_since_last_purchase_min'),
        F.countDistinct('product_category').alias('unique_categories'),
        F.sum('quantity').alias('quantity_sum')
    )

    # To find out churn
    window_spec = Window.partitionBy('customer_id').orderBy('purchase_date')
    df_with_lag = df_spark.withColumn('prev_purchase',
                                      F.lag('purchase_date').over(window_spec))
    avg_days_between = df_with_lag.withColumn('days_diff',
        F.datediff('purchase_date', 'prev_purchase')
    ).groupBy('customer_id').agg(
        F.mean('days_diff').alias('avg_days_between_purchases')
    )

    # Join features
    customer_features = customer_features.join(avg_days_between, 'customer_id', 'left')

    # Create churn labels
    customer_features = customer_features.withColumn('is_churned',
        F.when(F.col('days_since_last_purchase_min') > 90, 1).otherwise(0)
    )

    # Force computation (equivalent to pandas operations above)
    result_count = customer_features.count()

    end_time = time.time()
    print(f"Spark approach took: {end_time - start_time:.2f} seconds")
    print(f"Features created for {result_count} customers")

    return customer_features
And it produces the following output:
Testing Spark Approach...
Spark approach took: 2.26 seconds
Features created for 10000 customers
What happens under the hood?
[Image showing only one executor doing all the work - reference original article for visual]
These details are obtained from the Spark UI, which offers even more verbose information. The point here is that we would expect both executors to share the processing work, but in fact only one executor performed the grouping by customer and all the other transformations. This is due to Spark’s AQE (Adaptive Query Execution) capability, which adapts the query plan at runtime: given the data volume, it decided that a single task could handle the entire dataset and that there was no need for another task to perform the aggregation. So we learnt an additional thing about AQE: it looks at the data volume, reduces overhead, and coalesces partitions into one, which led to a single executor taking all the work (a small config-check sketch follows the comparison below). If we now compare in phases:
Phase 1: Data collection. The driver coordinates, executor 1 processes all the data and returns the results, and executor 2 sits idle.
In comparison:
- Spark uses a single task, while pandas uses 2 tasks plus the driver process.
- Spark was ~2.6x faster when compared in the same notebook on the same cluster.
- Using Spark APIs gives the best distribution, leveraging the Catalyst optimizer to choose the most efficient plan.
- Spark avoided unnecessary shuffling between the executors! Shuffle is one of the serious problems in the Spark world: when data volumes grow enormously, the same Spark code doesn’t work any longer and needs a revisit!
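To verify whether AQE and partition coalescing are in play on your cluster, a quick check like the following can help (a sketch using standard Spark configuration keys; defaults vary by Spark version):
# Sketch: inspect AQE-related settings and the partitioning of the source DataFrame
print(spark.conf.get("spark.sql.adaptive.enabled", "not set"))
print(spark.conf.get("spark.sql.adaptive.coalescePartitions.enabled", "not set"))
print(spark.conf.get("spark.sql.shuffle.partitions", "not set"))
print(df_spark.rdd.getNumPartitions())  # how many partitions the input actually has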
Key Takeaway
Our testing revealed that Spark outperformed pandas by 2.6x even on relatively small data (500k records, 10k customers). This challenges the common assumption that pandas is always better for “small” datasets in distributed environments.
The real issue isn’t dataset size — it’s workflow context:
- Pandas excels during exploratory data analysis and rapid prototyping
- Spark excels when the same logic needs to run reliably in production
- The transition between these phases is where problems occur
What gets overlooked in code reviews is that pandas code written for EDA often carries forward to production unchanged. While it “works” on development datasets, it creates performance bottlenecks and scalability risks that could have been avoided.
The solution isn’t to ban pandas, but to be intentional about when and why you choose it, and plan the transition to distributed processing early in the development cycle.
pandas API on Spark has created effective hybrid approaches that preserve pandas’ development advantages while enabling seamless scaling.
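For example, a minimal sketch using the pandas API on Spark (bundled with Spark 3.2+, earlier available as Koalas), assuming the same df_spark as above:
# Sketch: pandas-style syntax that still runs as distributed Spark jobs
psdf = df_spark.pandas_api()  # wrap the Spark DataFrame; nothing is collected to the driver

customer_features = psdf.groupby('customer_id').agg({
    'transaction_amount': 'sum',
    'days_since_last_purchase': 'min',
    'quantity': 'sum',
})
customer_features['is_churned'] = (customer_features['days_since_last_purchase'] > 90).astype(int)
print(customer_features.head())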
If teams rely on a non-Databricks or non-Spark environment, running pandas code in Jupyter notebooks, the model development lifecycle tooling changes, and the focus will naturally be on pandas!
Many data science teams come from a Python and pandas background, and learning or adapting their code to the data platform’s Spark ways has a steep learning curve.
During production releases, these “small” architectural decisions around pandas vs Spark usage often get overlooked in code reviews, but can become critical bottlenecks as data volumes grow…
This was written by Chanukya Pekala for Data AI Stockholm. More about DAIS. If you would like to contribute or write for us, please reach out to me or DAIS on LinkedIn.