Problem Description
An upstream process writes time-series datasets into Parquet files separated by frequency. In the hourly-frequency datasets the timestamp column has very high cardinality: each row carries a unique timestamp. This results in:
- Large compressed and uncompressed column sizes
- Bloated Parquet file metadata
- Slow query (read) performance
To understand the root cause, we needed to examine the Parquet file metadata more closely.
Inspecting Parquet File Metadata
While parquet-tools is a common option, I found DuckDB to be simpler. Download a Parquet file and open it with the DuckDB CLI:
SELECT * FROM parquet_schema('my_file.snappy.parquet');
This shows the physical and logical type of each column. The per-column encodings, compression codec, and compressed/uncompressed sizes are exposed by parquet_metadata rather than parquet_schema.
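The same inspection can also be scripted with DuckDB’s Python package. A minimal sketch (the file name is a placeholder, and the selected columns follow DuckDB’s documented parquet_metadata output):
import duckdb
# Per-column-chunk view: encodings, compression codec, and on-disk sizes
duckdb.sql("""
    SELECT path_in_schema, type, encodings, compression,
           total_compressed_size, total_uncompressed_size
    FROM parquet_metadata('my_file.snappy.parquet')
""").show()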
Understanding Parquet Encoding
To understand the optimization, we need to cover two key Parquet encoding concepts:
1. Run Length Encoding (RLE)
RLE is a compression technique that stores sequences of repeated values efficiently by replacing consecutive identical values with a single value and a count.
Example:
- Original: CCCDDAABB
- RLE encoded: 3C2D2A2B
In Parquet’s columnar format, RLE works well when columns have repeated values across many rows. However, for high-cardinality columns (like unique timestamps), RLE provides minimal benefit.
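As a toy illustration of that trade-off (count-value RLE, not Parquet’s actual RLE/bit-packing hybrid), consider this sketch:
from itertools import groupby
def rle_encode(values):
    # Collapse consecutive runs of identical values into (count, value) pairs
    return [(len(list(run)), value) for value, run in groupby(values)]
print(rle_encode("CCCDDAABB"))
# [(3, 'C'), (2, 'D'), (2, 'A'), (2, 'B')] -> shorter than the input
print(rle_encode(["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00"]))
# every run has length 1, so the "compressed" form is larger than the input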
2. Dictionary Encoding
Dictionary encoding creates a dictionary of unique values and stores indices instead of the actual values. This works excellently for low-cardinality columns (like country codes, status fields) but fails for high-cardinality data where nearly every value is unique.
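The same contrast shows up in a rough sketch of dictionary encoding (real Parquet dictionary pages are binary, but the trade-off is identical):
def dict_encode(values):
    # Build a dictionary of unique values and replace each value with its index
    dictionary = list(dict.fromkeys(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]
print(dict_encode(["US", "DE", "US", "US", "DE"]))
# (['US', 'DE'], [0, 1, 0, 0, 1]) -> tiny dictionary, cheap integer indices
print(dict_encode(["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00"]))
# the dictionary holds every value once, so it is as large as the column itself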
The Root Cause
Before optimization, our timestamp column metadata showed:
- Physical Type: INT96
- Encoding: PLAIN_DICTIONARY and RLE
Why This Is Inefficient
Spark’s v1 writer (default) uses INT96 format for timestamps with PLAIN_DICTIONARY and RLE encoding. When each timestamp is nearly unique:
- Dictionary encoding fails: The dictionary becomes almost as large as the data itself
- RLE fails: With a different timestamp in every row there are no runs to collapse, so every run has length 1
- INT96 is wasteful: Uses 12 bytes per timestamp vs. 8 bytes for INT64
The Ideal Solution
For sorted timestamp data (increasing by 1 hour per row), DELTA_BINARY_PACKED encoding would be optimal. This encoding stores:
- The first value
- Deltas (differences) between consecutive values
Unfortunately, Spark doesn’t let you specify encoding per column. However, the v2 writer can automatically choose better encodings including DELTA_BINARY_PACKED.
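To see why delta encoding fits hourly data so well, here is an illustrative sketch (not Parquet’s exact block format; the epoch value is just an example start time):
epoch_ms = [1704067200000 + i * 3_600_000 for i in range(5)]  # hourly timestamps in epoch milliseconds
first = epoch_ms[0]
deltas = [b - a for a, b in zip(epoch_ms, epoch_ms[1:])]
print(first, deltas)
# 1704067200000 [3600000, 3600000, 3600000, 3600000]
# Every delta is identical, so after subtracting the block's minimum delta the
# remaining values are all zero and bit-pack down to almost nothing per row.
Compared with five 12-byte INT96 values (60 bytes), the encoded column body is roughly the first value plus a small header.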
The Solution
Write timestamps as INT64 (TIMESTAMP_MILLIS) instead of INT96 and switch to Spark’s v2 Parquet writer, which together enable better encoding and compression.
Configuration
Add these settings to your job cluster:
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
spark.conf.set("spark.hadoop.parquet.writer.version", "v2")
Or in cluster configuration:
spark.sql.parquet.outputTimestampType TIMESTAMP_MILLIS
spark.hadoop.parquet.writer.version v2
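Putting it together in PySpark, here is a minimal end-to-end sketch. The paths, app name, and timestamp column are placeholders, and Hadoop-level properties such as parquet.writer.version are most reliably set at session creation or in the cluster configuration rather than mid-session:
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("parquet-v2-timestamps")
    .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")  # write INT64 timestamps
    .config("spark.hadoop.parquet.writer.version", "v2")                  # allow v2 encodings such as DELTA_BINARY_PACKED
    .getOrCreate()
)
df = spark.read.parquet("s3://my-bucket/hourly_dataset/")   # placeholder input path
(
    df.sort("event_ts")                                     # sorted timestamps maximize the delta-encoding win
      .write.mode("overwrite")
      .parquet("s3://my-bucket/hourly_dataset_v2/")          # placeholder output path
)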
Results After Optimization
After applying these settings, the Parquet file metadata shows:
- Physical Type: INT64
- Logical Type: TIMESTAMP_MILLIS
- Encoding: DELTA_BINARY_PACKED (automatically chosen by the v2 writer)
Re-running the inspection query confirms the new schema:
SELECT * FROM parquet_schema('my_prod_file.snappy.parquet');
The timestamp column is now stored as INT64 with TIMESTAMP_MILLIS precision, using only 8 bytes instead of 12.
Performance Impact
After implementing this optimization:
- File size: Reduced by 20-40% for timestamp-heavy datasets
- Query performance: Significant improvement in read operations
- Metadata overhead: Reduced due to better encoding
Key Learnings
- Spark v1 writer limitations: Uses INT96 and PLAIN_DICTIONARY by default for compatibility, which is inefficient for high-cardinality timestamps
- Encoding is file-level: You cannot choose encoding per column in Spark; the writer version determines available encodings
- v2 writer benefits: Automatically selects better encodings (like DELTA_BINARY_PACKED) based on data patterns
- INT64 > INT96: For timestamps, INT64 is more efficient (8 bytes vs 12 bytes) and enables better encoding options
Recommendations
- Use the v2 writer for high-cardinality timestamps: If your dataset has unique or sequential timestamps, the v2 writer with INT64 timestamps will perform significantly better
- Works across platforms: Both Apache Spark and Databricks Spark support v2 writer
- Backwards compatibility: Ensure downstream tools can read v2 Parquet files (most modern tools support it)
- Test first: Benchmark your specific workload to confirm performance improvements
- Not a silver bullet: This optimization specifically helps high-cardinality timestamp columns; low-cardinality columns benefit from dictionary encoding
When NOT to Use This
Stick with the default (v1 writer, INT96) if:
- You need compatibility with very old Parquet readers
- Your timestamps have low cardinality (many repeated values)
- You’re using Impala or other tools with limited INT64 timestamp support
Conclusion
For time-series datasets with high-cardinality timestamps, switching timestamps from INT96 to INT64 and enabling Spark’s v2 writer provides substantial improvements in both storage efficiency and query performance. This simple configuration change can yield a 20-40% reduction in file size and faster reads, a significant win for minimal effort.