Problem Description

An upstream process writes time-series datasets into Parquet files separated by frequency. In the hourly-frequency datasets, the timestamp column has very high cardinality: each row carries a unique timestamp. This results in:

  • Large compressed and uncompressed column sizes
  • Bloated Parquet file metadata
  • Slow query (read) performance

To understand the root cause, we needed to examine the Parquet file metadata more closely.

Inspecting Parquet File Metadata

While parquet-tools is a common option, I found DuckDB to be simpler. Download a Parquet file and open it with the DuckDB CLI:

SELECT * FROM parquet_schema('my_file.snappy.parquet');

This shows each column's name, physical type, and logical/converted type. The per-column encodings and compression details live in the row-group metadata, which DuckDB exposes through the companion parquet_metadata function.
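
Here is a sketch of that check using DuckDB's Python API (the column names are those exposed by recent DuckDB releases, and the file name is the same placeholder as above):

import duckdb

# Row-group / column-chunk level details: encodings, codec, and sizes per column.
duckdb.sql("""
    SELECT path_in_schema, encodings, compression,
           total_compressed_size, total_uncompressed_size
    FROM parquet_metadata('my_file.snappy.parquet')
""").show()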

Understanding Parquet Encoding

To understand the optimization, we need to cover two key Parquet encoding concepts:

1. Run Length Encoding (RLE)

RLE is a compression technique that stores sequences of repeated values efficiently by replacing consecutive identical values with a single value and a count.

Example:

  • Original: CCCDDAABB
  • RLE encoded: 3C2D2A2B

In Parquet’s columnar format, RLE works well when columns have repeated values across many rows. However, for high-cardinality columns (like unique timestamps), RLE provides minimal benefit.
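
As a toy illustration of the idea (not Parquet's actual hybrid RLE/bit-packing implementation), a run-length encoder fits in a few lines of Python:

from itertools import groupby

def rle_encode(values):
    """Collapse consecutive identical values into (count, value) pairs."""
    return [(len(list(group)), value) for value, group in groupby(values)]

print(rle_encode("CCCDDAABB"))
# [(3, 'C'), (2, 'D'), (2, 'A'), (2, 'B')]

# High-cardinality input: every run has length 1, so nothing is saved.
print(rle_encode(["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00"]))
# [(1, '2024-01-01 00:00'), (1, '2024-01-01 01:00'), (1, '2024-01-01 02:00')]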

2. Dictionary Encoding

Dictionary encoding creates a dictionary of unique values and stores indices instead of the actual values. This works excellently for low-cardinality columns (like country codes, status fields) but fails for high-cardinality data where nearly every value is unique.
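
The same toy style shows why cardinality matters: build a dictionary of distinct values and store only each value's index (again an illustration, not the parquet-mr implementation):

def dict_encode(values):
    """Map each distinct value to an index and store indices instead of values."""
    dictionary = {}
    indices = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        indices.append(dictionary[v])
    return list(dictionary), indices

# Low cardinality: eight rows collapse to a three-entry dictionary plus tiny indices.
print(dict_encode(["US", "DE", "US", "US", "IN", "DE", "US", "IN"]))
# (['US', 'DE', 'IN'], [0, 1, 0, 0, 2, 1, 0, 2])

# High cardinality (unique timestamps): the dictionary is as large as the data itself.
print(dict_encode(["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00"]))
# (['2024-01-01 00:00', '2024-01-01 01:00', '2024-01-01 02:00'], [0, 1, 2])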

The Root Cause

Before optimization, our timestamp column metadata showed:

  • Physical Type: INT96
  • Encoding: PLAIN_DICTIONARY and RLE

Why This Is Inefficient

Spark’s v1 writer (default) uses INT96 format for timestamps with PLAIN_DICTIONARY and RLE encoding. When each timestamp is nearly unique:

  1. Dictionary encoding fails: The dictionary becomes almost as large as the data itself
  2. RLE fails: Each row has a different timestamp, so there are no runs of repeated values to collapse and the encoder gains nothing
  3. INT96 is wasteful: Uses 12 bytes per timestamp vs. 8 bytes for INT64
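
As a rough illustration of point 3: for one year of hourly data (8,760 rows), the raw timestamp values alone take 8,760 × 12 ≈ 105 KB as INT96 versus 8,760 × 8 ≈ 70 KB as INT64, before any encoding is applied.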

The Ideal Solution

For sorted timestamp data (increasing by 1 hour per row), DELTA_BINARY_PACKED encoding would be optimal. This encoding stores:

  • The first value
  • Deltas (differences) between consecutive values

Unfortunately, Spark doesn’t let you specify encoding per column. However, the v2 writer can automatically choose better encodings including DELTA_BINARY_PACKED.
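
For hourly data the deltas are all identical (3,600,000 ms), which is trivially compressible. A toy sketch of the idea (not the actual DELTA_BINARY_PACKED bit-packing format):

def delta_encode(values):
    """Store the first value plus the differences between consecutive values."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    return values[0], deltas

# Hourly epoch-millisecond timestamps for one day (2024-01-01 00:00:00 UTC onwards).
start = 1704067200000
ts = [start + i * 3_600_000 for i in range(24)]

first, deltas = delta_encode(ts)
print(first)        # 1704067200000
print(set(deltas))  # {3600000} -- a single repeated delta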

The Solution

Store timestamps as INT64 instead of INT96 and switch to Spark's v2 Parquet writer, which can then pick better encodings and achieve better compression.

Configuration

Add these settings to your job cluster:

spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
spark.conf.set("spark.hadoop.parquet.writer.version", "v2")

Or in cluster configuration:

spark.sql.parquet.outputTimestampType TIMESTAMP_MILLIS
spark.hadoop.parquet.writer.version v2
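
A minimal end-to-end sketch in PySpark (assuming Spark 3.1+; the output path, column names, and generated dataset are hypothetical, for illustration only):

from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("parquet-v2-demo")
         .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
         .config("spark.hadoop.parquet.writer.version", "v2")
         .getOrCreate())

# Hypothetical hourly time series: one row per hour for a year, every timestamp unique.
df = (spark.range(0, 24 * 365)
      .withColumn("event_ts", F.timestamp_seconds(F.lit(1704067200) + F.col("id") * 3600))
      .withColumn("value", F.rand()))

df.write.mode("overwrite").parquet("/tmp/hourly_v2")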

Results After Optimization

After applying these settings, the Parquet file metadata shows:

  • Physical Type: INT64
  • Logical Type: TIMESTAMP_MILLIS
  • Encoding: DELTA_BINARY_PACKED (automatically chosen by the v2 writer)

This was verified by re-running the same DuckDB inspection against the optimized file:

SELECT * FROM parquet_schema('my_prod_file.snappy.parquet');

The timestamp column is now stored as INT64 with TIMESTAMP_MILLIS precision, using only 8 bytes instead of 12.

Performance Impact

After implementing this optimization:

  • File size: Reduced by 20-40% for timestamp-heavy datasets
  • Query performance: Significant improvement in read operations
  • Metadata overhead: Reduced due to better encoding
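
These numbers will vary with schema, row count, and compression codec; a quick way to check per-column sizes before and after the change is parquet_metadata again (a sketch using DuckDB's Python API; the file names are the placeholders from earlier):

import duckdb

# Compare per-column compressed/uncompressed sizes of the old and new files.
for label, path in [("v1 / INT96", "my_file.snappy.parquet"),
                    ("v2 / INT64", "my_prod_file.snappy.parquet")]:
    print(label)
    duckdb.sql(f"""
        SELECT path_in_schema,
               sum(total_compressed_size)   AS compressed_bytes,
               sum(total_uncompressed_size) AS uncompressed_bytes
        FROM parquet_metadata('{path}')
        GROUP BY path_in_schema
        ORDER BY compressed_bytes DESC
    """).show()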

Key Learnings

  1. Spark v1 writer limitations: Uses INT96 and PLAIN_DICTIONARY by default for compatibility, which is inefficient for high-cardinality timestamps
  2. Encoding isn't user-selectable: You cannot choose an encoding per column in Spark; the writer version determines which encodings the writer may pick
  3. v2 writer benefits: Automatically selects better encodings (like DELTA_BINARY_PACKED) based on data patterns
  4. INT64 > INT96: For timestamps, INT64 is more efficient (8 bytes vs 12 bytes) and enables better encoding options

Recommendations

  1. Use the v2 writer for high-cardinality timestamps: If your dataset has unique or sequential timestamps, the v2 writer with INT64 timestamps will perform significantly better
  2. Works across platforms: Both Apache Spark and Databricks Spark support the v2 writer
  3. Backwards compatibility: Ensure downstream tools can read v2 Parquet files (most modern tools support it)
  4. Test first: Benchmark your specific workload to confirm performance improvements
  5. Not a silver bullet: This optimization specifically helps high-cardinality timestamp columns; low-cardinality columns benefit from dictionary encoding

When NOT to Use This

Stick with the default (v1 writer, INT96) if:

  • You need compatibility with very old Parquet readers
  • Your timestamps have low cardinality (many repeated values)
  • You’re using Impala or other tools with limited INT64 timestamp support

Conclusion

For time-series datasets with high-cardinality timestamps, switching from INT96 to INT64 timestamps with Spark's v2 writer provides substantial improvements in both storage efficiency and query performance. This simple configuration change can yield a 20-40% reduction in file size and faster reads, a significant win for minimal effort.