Building a Modern Data Pipeline: Scala 3 + Kafka + Flink + Spark

It’s been a while since I’ve written serious Scala code. I moved into data engineering — Spark jobs, Delta Lake, Databricks — and the language became a tool I used rather than something I was actively exploring.

Then I looked at where Scala 3 had got to by 3.5. Enums with real parameters. Opaque types for zero-cost type safety. Extension methods. given/using replacing the old implicit soup. All of these date back to 3.0, and I wanted to actually use them in a real project, not just read the release notes.

The goal: build a full data pipeline entirely in Scala. Event generation → Kafka → real-time streaming → batch analytics. The kind of thing I’d deploy at work.

The catch: Spark doesn’t support Scala 3. Neither does Databricks: its runtimes are built on Scala 2, and that’s not changing anytime soon. So I can’t just write everything in Scala 3 and deploy. The Spark ecosystem is firmly Scala 2, and for this project that means 2.13.

This became the real engineering problem: how do you get Scala 3’s type system for the parts that benefit from it, while keeping Spark happy?

The multi-version trick

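It comes down to how compatibility between Scala versions works:

Scala 3 can consume libraries compiled with Scala 2.13, as long as they don’t expose Scala 2 macros
Scala 2.13 cannot consume Scala 3 artifacts in general (the -Ytasty-reader flag helps, but only for the subset of Scala 3 features it understands)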

So I put all shared types in a Scala 2.13 module that everything depends on:

shared (2.13) ← all modules depend on this
  ├── preprocessing (3.5.2)     ✅
  ├── flink-streaming (3.5.2)   ✅
  └── etl (2.13)                ✅

The shared module holds common schemas, Kafka message types, and configs. It’s the lowest common denominator — compiled with 2.13 so every module can read it. The Scala 3 modules import from shared without issue. The Spark ETL module stays on 2.13 and also imports from shared. No cross-compilation hacks.

In build.sbt, each module sets its own scalaVersion and declares dependsOn(shared). sbt compiles shared first, then the dependents see its classes on the classpath.
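
As a rough illustration, the layout described above could look like the following build.sbt. This is a minimal sketch, not the project's actual build file: the directory names are taken from the module table, while the Spark dependency and its version are assumptions. Mixing 2.13 and 3.x projects through dependsOn relies on sbt 1.5 or newer.

```scala
// build.sbt (sketch; settings other than the Scala versions are assumptions)

// Bridge layer: compiled with 2.13 so every other module can consume it.
lazy val shared = (project in file("shared"))
  .settings(scalaVersion := "2.13.12")

// Scala 3 modules depend on the 2.13 bridge directly.
lazy val preprocessing = (project in file("preprocessing"))
  .dependsOn(shared)
  .settings(scalaVersion := "3.5.2")

lazy val flinkStreaming = (project in file("flink-streaming"))
  .dependsOn(shared)
  .settings(scalaVersion := "3.5.2")

// Spark stays on 2.13; %% resolves the _2.13 Spark artifacts.
lazy val etl = (project in file("etl"))
  .dependsOn(shared)
  .settings(
    scalaVersion := "2.13.12",
    // Assumed Spark version, marked Provided because the cluster supplies it.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1" % Provided
  )
```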

The discipline: keep shared minimal. Only types and serialization. All the interesting domain modelling lives in the Scala 3 modules.

Architecture

Lambda architecture — streaming for speed, batch for accuracy:

┌─────────────────────────────┐
│ Streaming Generator         │  Scala 3.5.2
│ - Enums, Opaque Types       │
│ - Extension Methods         │
└──────────┬──────────────────┘
           │ Kafka (people-events topic)
           ▼
┌─────────────────────────────┐
│ Apache Kafka                │  Event backbone
│ - Decoupled services        │
│ - Reliable messaging        │
└──────────┬──────────────────┘
           │
           ▼
┌─────────────────────────────┐
│ Flink Streaming             │  Scala 3.5.2
│ - Real-time processing      │
│ - Write to Data Lake        │
└──────────┬──────────────────┘
           │ Parquet (partitioned by date/hour)
           ▼
┌─────────────────────────────┐
│ Spark Batch Analytics       │  Scala 2.13
│ - Aggregations              │
│ - Complex analytics         │
└─────────────────────────────┘

Module           Scala Version  What it does
shared           2.13.12        Common schemas, Kafka messages, configs (bridge layer)
preprocessing    3.5.2          Event generator with Scala 3 features
flink-streaming  3.5.2          Stream processing: Kafka → Parquet
etl              2.13.12        Spark batch analytics
orchestrator     2.13           ZIO-based web dashboard

14 Scala files, 2,522 lines of code. Lean.

Event generation (preprocessing)

This is where Scala 3 shines. The preprocessing module generates PersonEvent records that flow through the entire pipeline.

Enums with parameters — not just labels:

import java.time.LocalDate

enum PersonStatus:
  case Active
  case Inactive
  case Suspended(reason: String, until: LocalDate)

In Scala 2 this is a sealed trait plus two case objects and a case class, usually tucked into a companion object. The enum says the same thing in 4 lines and keeps the exhaustiveness checking.
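
To make the exhaustiveness point concrete, here is a minimal sketch of matching on that enum. The describe function and its messages are made up for illustration and are not from the project.

```scala
import java.time.LocalDate

enum PersonStatus:
  case Active
  case Inactive
  case Suspended(reason: String, until: LocalDate)

// Hypothetical helper: delete any case below and the compiler warns
// that the match may not be exhaustive.
def describe(status: PersonStatus): String = status match
  case PersonStatus.Active              => "active"
  case PersonStatus.Inactive            => "inactive"
  case PersonStatus.Suspended(why, end) => s"suspended until $end ($why)"
```

The parameterized case carries its own data, so the match can destructure it without a separate lookup.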

Opaque types — the feature I was most curious about.

In a data pipeline, everything is strings at the wire level. An email is a string. A city is a string. An age is an int. Without discipline, these bleed together — you pass a raw string into a function expecting a validated email, and nothing stops you until runtime.

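In Scala 2, the fix is value classes or case class wrappers. But case class wrappers allocate an object per value, and value classes (extends AnyVal) still box whenever they’re used generically: in collections, as type arguments, or in pattern matches. In a pipeline generating thousands of events, that’s overhead you don’t need.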

Opaque types give you both: zero-cost at runtime (it’s literally a String in the bytecode), but full type distinction at compile time.

object types:

  // Assumed alias, inferred from the Right/Left results below
  type Validated[A] = Either[String, A]

  opaque type Email = String
  object Email:
    def apply(value: String): Validated[Email] =
      val emailRegex = "^[A-Za-z0-9+_.-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$"
      if value.matches(emailRegex) then Right(value)
      else Left(s"Invalid email format: $value")

    extension (e: Email)
      def value: String = e
      def domain: String = e.split("@")(1)
      def username: String = e.split("@")(0)

  opaque type Age = Int
  object Age:
    // MIN_AGE and MAX_AGE are constants that are not shown here
    def apply(value: Int): Validated[Age] =
      if value >= MIN_AGE && value < MAX_AGE then Right(value)
      else Left(s"Invalid age: $value")

    extension (a: Age)
      def value: Int = a
      def isAdult: Boolean = a >= 18
      def category: String = a match
        case a if a < 13 => "child"
        case a if a < 20 => "teen"
        case a if a < 65 => "adult"
        case _ => "senior"

Then the Person case class enforces it:

case class Person(
  email: types.Email,
  age: types.Age,
  city: types.City,
  ...
)

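Outside the types object, the compiler doesn’t know that an Email is a String or that an Age is an Int, so you can’t construct a Person without going through the validated constructors. And when you need the raw value back (say, to publish to Kafka), you explicitly call .value: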

PersonEvent(
  age = person.age.value,  // Extract Int from opaque type
  email = person.email.value,
  ...
)

That .value call is a deliberate boundary crossing. It forces you to acknowledge: “I’m leaving the type-safe domain and entering the raw wire format.” Without opaque types, there’s no boundary — strings just flow everywhere and you hope for the best.
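
Here is a self-contained sketch of that round trip: raw input is validated once on the way in, and unwrapped explicitly on the way out. It is a trimmed-down stand-in for the project's code, under a few assumptions: Validated is taken to be an alias for Either[String, A], the age bounds are placeholders, Person and PersonEvent are reduced to two fields, and parsePerson and toEvent are hypothetical names.

```scala
// Assumed alias, inferred from the Right/Left usage in the article.
type Validated[A] = Either[String, A]

object types:
  opaque type Email = String
  object Email:
    def apply(value: String): Validated[Email] =
      if value.matches("^[A-Za-z0-9+_.-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$") then Right(value)
      else Left(s"Invalid email format: $value")
    extension (e: Email) def value: String = e

  opaque type Age = Int
  object Age:
    // Placeholder bounds; the real MIN_AGE / MAX_AGE values are not shown in the article.
    private val MinAge = 0
    private val MaxAge = 120
    def apply(value: Int): Validated[Age] =
      if value >= MinAge && value < MaxAge then Right(value)
      else Left(s"Invalid age: $value")
    extension (a: Age) def value: Int = a

// Two-field stand-ins for the article's Person and PersonEvent.
case class Person(email: types.Email, age: types.Age)
case class PersonEvent(email: String, age: Int)

// Validation happens once, at the edge; the first failure short-circuits.
def parsePerson(rawEmail: String, rawAge: Int): Validated[Person] =
  for
    email <- types.Email(rawEmail)
    age   <- types.Age(rawAge)
  yield Person(email, age)

// The explicit .value calls mark the boundary back to the wire format.
def toEvent(p: Person): PersonEvent =
  PersonEvent(email = p.email.value, age = p.age.value)
```

Writing Person("ada@example.com", 36) directly does not compile outside the types object, which is the whole point: the only way to obtain an Email or an Age is through the validating apply methods.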

Extension methods, given/using, union types, top-level definitions, @main annotation — all used naturally throughout the module. Not forced in for demo purposes; they genuinely made the code better.

Flink streaming

The Flink job consumes PersonEvent records from the people-events Kafka topic, processes them in real-time, and writes partitioned Parquet to the data lake.

Key detail: Flink’s API is Java-based. It doesn’t care what Scala version you compile with — it just needs JVM bytecode. So I get to write the streaming logic in Scala 3.5.2 while using Flink’s Java API directly. Best of both worlds.
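
A minimal sketch of what calling Flink's Java API from Scala 3 looks like, assuming the flink-streaming-java and flink-connector-kafka dependencies are on the classpath. The broker address, consumer group id, job name, and the streamPeopleEvents entry point are placeholders, events are treated as plain strings, and the Parquet sink is left out; the project's real job will differ.

```scala
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.functions.MapFunction
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.common.typeinfo.Types
import org.apache.flink.connector.kafka.source.KafkaSource
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

@main def streamPeopleEvents(): Unit =
  val env = StreamExecutionEnvironment.getExecutionEnvironment

  // Plain Java builder API, called from Scala 3.
  val source = KafkaSource.builder[String]()
    .setBootstrapServers("localhost:9092")   // assumed broker address
    .setTopics("people-events")
    .setGroupId("flink-streaming-sketch")    // assumed group id
    .setStartingOffsets(OffsetsInitializer.earliest())
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .build()

  env
    .fromSource(source, WatermarkStrategy.noWatermarks[String](), "people-events")
    // An explicit MapFunction plus returns(...) keeps Flink's type extraction unambiguous.
    .map(new MapFunction[String, String] {
      override def map(json: String): String = json.trim
    })
    .returns(Types.STRING)
    .print() // stand-in for the Parquet sink

  env.execute("people-events-sketch")
```

Nothing here depends on a Scala-specific Flink artifact, which is why the module's Scala version is free to move.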

The output lands in data/streaming/people/dt=2025-12-27/hour=14/ — partitioned by date and hour. This matters for the downstream Spark job: it can prune entire partitions when querying date ranges instead of scanning everything.
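
The partition layout can be sketched as a pure function from event time to directory. The partitionPath helper is hypothetical, and UTC is an assumption, since the post doesn't say which time zone the job partitions by.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// Quoted segments are literals, so this renders e.g. dt=2025-12-27/hour=14.
// UTC is an assumption.
private val partitionFormat =
  DateTimeFormatter.ofPattern("'dt='yyyy-MM-dd'/hour='HH").withZone(ZoneOffset.UTC)

// Hypothetical helper: maps an event timestamp to its partition directory.
def partitionPath(basePath: String, eventTime: Instant): String =
  s"$basePath/${partitionFormat.format(eventTime)}/"
```

Because the date and hour are encoded in the directory names as key=value pairs, a reader that understands this layout can decide which directories to open from the path alone.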

Spark batch ETL

The ETL module reads the Parquet data lake and computes aggregations that are too expensive for the streaming path. City-level stats, age group categorization (Minor/Adult/Senior), email domain extraction, active status breakdowns.
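
A minimal sketch of that kind of batch job, written in Scala 2.13 syntax and assuming spark-sql is on the classpath. The column names (city, age, email), the age thresholds, the date range, and the CityStatsSketch object are all assumptions for illustration; this is not the project's SparkETLPipeline.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, count, lit, substring_index, when}

object CityStatsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("city-stats-sketch").getOrCreate()

    // dt is a partition column discovered from the dt=.../hour=... directories,
    // so filtering on it lets Spark skip whole directories.
    val people = spark.read
      .parquet("data/streaming/people")
      .filter(col("dt") >= "2025-12-01" && col("dt") <= "2025-12-27")

    val enriched = people
      .withColumn(
        "age_group",
        when(col("age") < 18, "Minor")   // thresholds are assumptions
          .when(col("age") < 65, "Adult")
          .otherwise("Senior")
      )
      // Everything after the last "@" in the address.
      .withColumn("email_domain", substring_index(col("email"), "@", -1))

    // City-level stats per age group.
    enriched
      .groupBy("city", "age_group")
      .agg(count(lit(1)).as("people"), avg(col("age")).as("avg_age"))
      .show()

    // Email domain breakdown.
    enriched.groupBy("email_domain").count().show()

    spark.stop()
  }
}
```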

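This module stays on Scala 2.13 because Spark requires it. It imports from shared (also 2.13) with no friction and compiles to a fat JAR via sbt-assembly that runs on Databricks without modification, as long as the cluster runtime matches the Spark and Scala versions the JAR was built against.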

spark-submit --class etl.SparkETLPipeline etl/target/scala-2.13/etl-assembly.jar

The orchestrator

The ZIO-based orchestrator provides a web dashboard at localhost:9090 that ties the pipeline together. It runs all tasks and shows execution status, metrics, and logs in real-time. Not a mock UI — it actually executes pipeline stages and tracks run history.

What I trimmed

The project started messy — 25+ files, scattered shell scripts, duplicate docs. I went through it aggressively:

Removed 7 unused orchestrator files (DAGVisualizer, PipelineDAG, TaskRunner, etc.)
Removed 8 duplicate markdown docs
Eliminated ~72 LOC of dead code across modules
Replaced 4 shell scripts with a single Makefile (20+ targets)

Result: 14 files, 2,522 LOC, 2 docs. Every file earns its place.

Running it

# Check prerequisites (Java 17+, sbt 1.10+)
make setup

# Quickest path — start the dashboard (no Docker needed)
make orchestrator
# open http://localhost:9090

# Or explore the code interactively
make learn

# Or run the full stack with Kafka + Flink
make full-stack
# Kafka UI: http://localhost:8080
# Flink UI: http://localhost:8081

Individual modules:

make preprocessing-run    # Scala 3 event generator
make etl-run              # Spark batch analytics
make compile              # compile everything
make test                 # run all tests

Why Scala 2.13 (not 2.12)?

Spark 3.5.x supports both 2.12 and 2.13. I chose 2.13 because it is the only Scala 2 line whose artifacts Scala 3 can consume; a 2.12 shared module would not be usable from the Scala 3 modules. If a future Spark release supports Scala 3 directly, the migration from this architecture will be straightforward: just move types from shared into the Scala 3 modules and drop the bridge.

What I took away from this

Multi-version Scala works if you’re disciplined about module boundaries. The shared layer stays dumb — just types. All business logic lives in the modules that can use the language features they need.

Opaque types are the feature I’ll miss most when I’m back in 2.13 at work. Zero-cost type safety is exactly what you want in a data pipeline where everything is strings at the wire level but meaningfully different types in your domain.

And the Flink + Java API trick is underrated. Since Flink has deprecated its Scala-specific API, you’re just calling Java methods, which means any JVM language works. Scala 3, Kotlin, whatever. The version lock-in is purely a Spark problem.

Source: github.com/chanukyapekala/spark-with-scala3