Problem Description
An upstream process writes time-series datasets into Parquet files separated by frequency. In the hourly-frequency datasets the timestamp column has very high cardinality: each row carries a unique timestamp. This results in:
- Large compressed and uncompressed column sizes
- Bloated Parquet file metadata
- Slow query (read) performance
To understand the root cause, we needed to examine the Parquet file metadata more closely.
Inspecting Parquet File Metadata
While parquet-tools is a common option, I found DuckDB to be simpler. Download a Parquet file and open it with the DuckDB CLI:
SELECT * FROM parquet_schema('my_file.snappy.parquet');
This shows the physical and logical type of each column. The per-column encodings, compression codec, and compressed/uncompressed sizes are exposed by parquet_metadata rather than parquet_schema.
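The same inspection can also be scripted with DuckDB’s Python package. A minimal sketch (the file name is a placeholder, and the selected columns follow DuckDB’s documented parquet_metadata output):
import duckdb
# Per-column-chunk view: encodings, compression codec, and on-disk sizes
duckdb.sql("""
    SELECT path_in_schema, type, encodings, compression,
           total_compressed_size, total_uncompressed_size
    FROM parquet_metadata('my_file.snappy.parquet')
""").show()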
Understanding Parquet Encoding
To understand the optimization, we need to cover two key Parquet encoding concepts:
1. Run Length Encoding (RLE)
RLE is a compression technique that stores sequences of repeated values efficiently by replacing consecutive identical values with a single value and a count.
Example:
- Original: CCCDDAABB
- RLE encoded: 3C2D2A2B
In Parquet’s columnar format, RLE works well when columns have repeated values across many rows. However, for high-cardinality columns (like unique timestamps), RLE provides minimal benefit.
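As a toy illustration of that trade-off (count-value RLE, not Parquet’s actual RLE/bit-packing hybrid), consider this sketch:
from itertools import groupby
def rle_encode(values):
    # Collapse consecutive runs of identical values into (count, value) pairs
    return [(len(list(run)), value) for value, run in groupby(values)]
print(rle_encode("CCCDDAABB"))
# [(3, 'C'), (2, 'D'), (2, 'A'), (2, 'B')] -> shorter than the input
print(rle_encode(["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00"]))
# every run has length 1, so the "compressed" form is larger than the input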
2. Dictionary Encoding
Dictionary encoding creates a dictionary of unique values and stores indices instead of the actual values. This works excellently for low-cardinality columns (like country codes, status fields) but fails for high-cardinality data where nearly every value is unique.
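The same contrast shows up in a rough sketch of dictionary encoding (real Parquet dictionary pages are binary, but the trade-off is identical):
def dict_encode(values):
    # Build a dictionary of unique values and replace each value with its index
    dictionary = list(dict.fromkeys(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]
print(dict_encode(["US", "DE", "US", "US", "DE"]))
# (['US', 'DE'], [0, 1, 0, 0, 1]) -> tiny dictionary, cheap integer indices
print(dict_encode(["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00"]))
# the dictionary holds every value once, so it is as large as the column itself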
The Root Cause
Before optimization, our timestamp column metadata showed:
- Physical Type: INT96
- Encoding: PLAIN_DICTIONARY and RLE
Why This Is Inefficient
Spark’s v1 writer (default) uses INT96 format for timestamps with PLAIN_DICTIONARY and RLE encoding. When each timestamp is nearly unique:
- Dictionary encoding fails: The dictionary becomes almost as large as the data itself
- RLE fails: With a different timestamp in every row there are no runs to collapse, so every run has length 1
- INT96 is wasteful: Uses 12 bytes per timestamp vs. 8 bytes for INT64
The Ideal Solution
For sorted timestamp data (increasing by 1 hour per row), DELTA_BINARY_PACKED encoding would be optimal. This encoding stores:
- The first value
- Deltas (differences) between consecutive values
Unfortunately, Spark doesn’t let you specify encoding per column. However, the v2 writer can automatically choose better encodings including DELTA_BINARY_PACKED.
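To see why delta encoding fits hourly data so well, here is an illustrative sketch (not Parquet’s exact block format; the epoch value is just an example start time):
epoch_ms = [1704067200000 + i * 3_600_000 for i in range(5)]  # hourly timestamps in epoch milliseconds
first = epoch_ms[0]
deltas = [b - a for a, b in zip(epoch_ms, epoch_ms[1:])]
print(first, deltas)
# 1704067200000 [3600000, 3600000, 3600000, 3600000]
# Every delta is identical, so after subtracting the block's minimum delta the
# remaining values are all zero and bit-pack down to almost nothing per row.
Compared with five 12-byte INT96 values (60 bytes), the encoded column body is roughly the first value plus a small header.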
The Solution
Write timestamps as INT64 (TIMESTAMP_MILLIS) instead of INT96 and switch to Spark’s v2 Parquet writer, which together enable better encoding and compression.
Configuration
Add these settings to your job cluster:
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
spark.conf.set("spark.hadoop.parquet.writer.version", "v2")
Or in cluster configuration:
spark.sql.parquet.outputTimestampType TIMESTAMP_MILLIS
spark.hadoop.parquet.writer.version v2
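Putting it together in PySpark, here is a minimal end-to-end sketch. The paths, app name, and timestamp column are placeholders, and Hadoop-level properties such as parquet.writer.version are most reliably set at session creation or in the cluster configuration rather than mid-session:
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("parquet-v2-timestamps")
    .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")  # write INT64 timestamps
    .config("spark.hadoop.parquet.writer.version", "v2")                  # allow v2 encodings such as DELTA_BINARY_PACKED
    .getOrCreate()
)
df = spark.read.parquet("s3://my-bucket/hourly_dataset/")   # placeholder input path
(
    df.sort("event_ts")                                     # sorted timestamps maximize the delta-encoding win
      .write.mode("overwrite")
      .parquet("s3://my-bucket/hourly_dataset_v2/")          # placeholder output path
)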
Results After Optimization
After applying these settings, the Parquet file metadata shows:
- Physical Type: INT64
- Logical Type: TIMESTAMP_MILLIS
- Encoding: DELTA_BINARY_PACKED (automatically chosen by the v2 writer)
Re-running the inspection query confirms the new schema:
SELECT * FROM parquet_schema('my_prod_file.snappy.parquet');
The timestamp column is now stored as INT64 with TIMESTAMP_MILLIS precision, using only 8 bytes instead of 12.
Performance Impact
After implementing this optimization:
- File size: Reduced by 20-40% for timestamp-heavy datasets
- Query performance: Significant improvement in read operations
- Metadata overhead: Reduced due to better encoding
Key Learnings
- Spark v1 writer limitations: Uses INT96 and PLAIN_DICTIONARY by default for compatibility, which is inefficient for high-cardinality timestamps
- Encoding is file-level: You cannot choose encoding per column in Spark; the writer version determines available encodings
- v2 writer benefits: Automatically selects better encodings (like DELTA_BINARY_PACKED) based on data patterns
- INT64 > INT96: For timestamps, INT64 is more efficient (8 bytes vs 12 bytes) and enables better encoding options
Recommendations
- Use the v2 writer for high-cardinality timestamps: If your dataset has unique or sequential timestamps, the v2 writer with INT64 timestamps will perform significantly better
- Works across platforms: Both Apache Spark and Databricks Spark support v2 writer
- Backwards compatibility: Ensure downstream tools can read v2 Parquet files (most modern tools support it)
- Test first: Benchmark your specific workload to confirm performance improvements
- Not a silver bullet: This optimization specifically helps high-cardinality timestamp columns; low-cardinality columns benefit from dictionary encoding
When NOT to Use This
Stick with the default (v1 writer, INT96) if:
- You need compatibility with very old Parquet readers
- Your timestamps have low cardinality (many repeated values)
- You’re using Impala or other tools with limited INT64 timestamp support
Conclusion
For time-series datasets with high-cardinality timestamps, switching timestamps from INT96 to INT64 and enabling Spark’s v2 writer provides substantial improvements in both storage efficiency and query performance. This simple configuration change can yield a 20-40% reduction in file size and faster reads, a significant win for minimal effort.