Building a Modern Data Pipeline: Scala 3 + Kafka + Flink + Spark
Itβs been a while since Iβve written serious Scala code. I moved into data engineering β Spark jobs, Delta Lake, Databricks β and the language became a tool I used rather than something I was actively exploring.
Then Scala 3.5+ shipped. Enums with real parameters. Opaque types for zero-cost type safety. Extension methods. given/using replacing the old implicit soup. I wanted to actually use these in a real project, not just read the release notes.
The goal: build a full data pipeline entirely in Scala. Event generation β Kafka β real-time streaming β batch analytics. The kind of thing Iβd deploy at work.
The catch: Spark doesnβt support Scala 3. Neither does Databricks β stuck on 2.13, and thatβs not changing anytime soon. So I canβt just write everything in Scala 3 and deploy. The Spark ecosystem is firmly 2.13-only.
This became the real engineering problem: how do you get Scala 3βs type system for the parts that benefit from it, while keeping Spark happy?
The multi-version trick
It comes down to how Scala bytecode compatibility works:
- Scala 3 can read Scala 2.13 bytecode (forward compatible)
- Scala 2.13 cannot read Scala 3 bytecode
So I put all shared types in a Scala 2.13 module that everything depends on:
shared (2.13) β all modules depend on this
βββ preprocessing (3.5.2) β
βββ flink-streaming (3.5.2) β
βββ etl (2.13) β
The shared module holds common schemas, Kafka message types, and configs. Itβs the lowest common denominator β compiled with 2.13 so every module can read it. The Scala 3 modules import from shared without issue. The Spark ETL module stays on 2.13 and also imports from shared. No cross-compilation hacks.
In build.sbt, each module sets its own scalaVersion and declares dependsOn(shared). sbt compiles shared first, then the dependents see its classes on the classpath.
The discipline: keep shared minimal. Only types and serialization. All the interesting domain modelling lives in the Scala 3 modules.
Architecture
Lambda architecture β streaming for speed, batch for accuracy:
βββββββββββββββββββββββββββββββ
β Streaming Generator β Scala 3.5.2
β - Enums, Opaque Types β
β - Extension Methods β
ββββββββββββ¬βββββββββββββββββββ
β Kafka (people-events topic)
βΌ
βββββββββββββββββββββββββββββββ
β Apache Kafka β Event backbone
β - Decoupled services β
β - Reliable messaging β
ββββββββββββ¬βββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββ
β Flink Streaming β Scala 3.5.2
β - Real-time processing β
β - Write to Data Lake β
ββββββββββββ¬βββββββββββββββββββ
β Parquet (partitioned by date/hour)
βΌ
βββββββββββββββββββββββββββββββ
β Spark Batch Analytics β Scala 2.13
β - Aggregations β
β - Complex analytics β
βββββββββββββββββββββββββββββββ
| Module | Scala Version | What it does |
|---|---|---|
| shared | 2.13.12 | Common schemas, Kafka messages, configs (bridge layer) |
| preprocessing | 3.5.2 | Event generator with Scala 3 features |
| flink-streaming | 3.5.2 | Stream processing: Kafka β Parquet |
| etl | 2.13.12 | Spark batch analytics |
| orchestrator | 2.13 | ZIO-based web dashboard |
14 Scala files, 2,522 lines of code. Lean.
Event generation (preprocessing)
This is where Scala 3 shines. The preprocessing module generates PersonEvent records that flow through the entire pipeline.
Enums with parameters β not just labels:
enum PersonStatus:
case Active
case Inactive
case Suspended(reason: String, until: LocalDate)
In Scala 2 this is a sealed trait + case classes β 15 lines of boilerplate. Now itβs 4, with exhaustiveness checking for free.
Opaque types β the feature I was most curious about.
In a data pipeline, everything is strings at the wire level. An email is a string. A city is a string. An age is an int. Without discipline, these bleed together β you pass a raw string into a function expecting a validated email, and nothing stops you until runtime.
In Scala 2, the fix is value classes or case class wrappers. But those allocate objects. In a pipeline generating thousands of events, thatβs boxing overhead you donβt need.
Opaque types give you both: zero-cost at runtime (itβs literally a String in the bytecode), but full type distinction at compile time.
object types:
opaque type Email = String
object Email:
def apply(value: String): Validated[Email] =
val emailRegex = "^[A-Za-z0-9+_.-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$"
if value.matches(emailRegex) then Right(value)
else Left(s"Invalid email format: $value")
extension (e: Email)
def value: String = e
def domain: String = e.split("@")(1)
def username: String = e.split("@")(0)
opaque type Age = Int
object Age:
def apply(value: Int): Validated[Age] =
if value >= MIN_AGE && value < MAX_AGE then Right(value)
else Left(s"Invalid age: $value")
extension (a: Age)
def value: Int = a
def isAdult: Boolean = a >= 18
def category: String = a match
case a if a < 13 => "child"
case a if a < 20 => "teen"
case a if a < 65 => "adult"
case _ => "senior"
Then the Person case class enforces it:
case class Person(
email: types.Email,
age: types.Age,
city: types.City,
...
)
You canβt construct a Person without going through the validated constructors. And when you need the raw value back β say, to publish to Kafka β you explicitly call .value:
PersonEvent(
age = person.age.value, // Extract Int from opaque type
email = person.email.value,
...
)
That .value call is a deliberate boundary crossing. It forces you to acknowledge: βIβm leaving the type-safe domain and entering the raw wire format.β Without opaque types, thereβs no boundary β strings just flow everywhere and you hope for the best.
Extension methods, given/using, union types, top-level definitions, @main annotation β all used naturally throughout the module. Not forced in for demo purposes; they genuinely made the code better.
Flink streaming
The Flink job consumes PersonEvent records from the people-events Kafka topic, processes them in real-time, and writes partitioned Parquet to the data lake.
Key detail: Flinkβs API is Java-based. It doesnβt care what Scala version you compile with β it just needs JVM bytecode. So I get to write the streaming logic in Scala 3.5.2 while using Flinkβs Java API directly. Best of both worlds.
The output lands in data/streaming/people/dt=2025-12-27/hour=14/ β partitioned by date and hour. This matters for the downstream Spark job: it can prune entire partitions when querying date ranges instead of scanning everything.
Spark batch ETL
The ETL module reads the Parquet data lake and computes aggregations that are too expensive for the streaming path. City-level stats, age group categorization (Minor/Adult/Senior), email domain extraction, active status breakdowns.
This module stays on Scala 2.13 because Spark requires it. Imports from shared (also 2.13) with no friction. Compiles to a fat JAR via sbt-assembly that runs on Databricks without modification.
spark-submit --class etl.SparkETLPipeline etl/target/scala-2.13/etl-assembly.jar
The orchestrator
The ZIO-based orchestrator provides a web dashboard at localhost:9090 that ties the pipeline together. It runs all tasks and shows execution status, metrics, and logs in real-time. Not a mock UI β it actually executes pipeline stages and tracks run history.
What I trimmed
The project started messy β 25+ files, scattered shell scripts, duplicate docs. I went through it aggressively:
- Removed 7 unused orchestrator files (DAGVisualizer, PipelineDAG, TaskRunner, etc.)
- Removed 8 duplicate markdown docs
- Eliminated ~72 LOC of dead code across modules
- Replaced 4 shell scripts with a single Makefile (20+ targets)
Result: 14 files, 2,522 LOC, 2 docs. Every file earns its place.
Running it
# Check prerequisites (Java 17+, sbt 1.10+)
make setup
# Quickest path β start the dashboard (no Docker needed)
make orchestrator
# open http://localhost:9090
# Or explore the code interactively
make learn
# Or run the full stack with Kafka + Flink
make full-stack
# Kafka UI: http://localhost:8080
# Flink UI: http://localhost:8081
Individual modules:
make preprocessing-run # Scala 3 event generator
make etl-run # Spark batch analytics
make compile # compile everything
make test # run all tests
Why Scala 2.13 (not 2.12)?
Spark 3.5.x supports both 2.12 and 2.13. I chose 2.13 for better forward-compatibility with Scala 3. When Spark 4.x eventually supports Scala 3 directly, the migration from this architecture will be straightforward β just move types from shared into the Scala 3 modules and drop the bridge.
What I took away from this
Multi-version Scala works if youβre disciplined about module boundaries. The shared layer stays dumb β just types. All business logic lives in the modules that can use the language features they need.
Opaque types are the feature Iβll miss most when Iβm back in 2.13 at work. Zero-cost type safety is exactly what you want in a data pipeline where everything is strings at the wire level but meaningfully different types in your domain.
And the Flink + Java API trick is underrated. Since Flink doesnβt have a Scala-specific API anymore, youβre just calling Java methods β which means any JVM language works. Scala 3, Kotlin, whatever. The version lock-in is purely a Spark problem.