Prepared by SCOUT | 2026-04-26 | ~2-week scopes each
---
1. Stream-to-Iceberg in a day — the Flink Dynamic Iceberg Sink (GA in Iceberg 1.11 / Flink 1.20) handles schema evolution automatically; a 2-week POC on your most volatile Kafka topic produces a concrete latency and freshness benchmark.
2. Compaction is your fastest demo win — CDC workloads generate delete files exponentially; showing a 5–10x query speedup after scheduled bin-pack compaction is a slide that writes itself.
3. WAP (Write-Audit-Publish) is the "safe deploys for data" story — Iceberg branches let you gate data quality like a PR gate; this maps directly to compliance asks and resonates with infra engineers.
4. DuckDB + Iceberg as a local dev loop — engineers querying production Iceberg tables on a laptop in <30 s replaces a class of ad-hoc Spark jobs; the demo is visceral and the win is instant.
5. dbt + Iceberg incremental models cut Spark costs — replacing full-scan batch jobs with merge incremental models is measurable in compute hours and shows up on a cost dashboard within the 2-week window.
---
Production Kafka topics change schemas constantly — new fields, renamed columns, type widening — and today every schema change requires a job restart, a schema migration meeting, and at least one incident. The Flink Dynamic Iceberg Sink (shipped in November 2025 on Apache Flink, promoted to @PublicEvolving in Iceberg 1.11) can consume Avro/JSON events from a Confluent Schema Registry and automatically evolve the target Iceberg table schema in flight, with zero downtime. Your team likely has at least one "fragile topic" that blocks onboarding new fields. This POC makes that topic self-healing.
Week 1: Stand up a local Flink cluster (Docker Compose), connect it to your team's dev Kafka cluster, and write a first-version Flink job using IcebergDynamicSink targeting a REST catalog. Confirm basic ingestion works and check Iceberg metadata after first commit.
Week 2: Deliberately evolve the Avro schema (add a nullable field, widen an int to long), observe the sink handle it without restart. Measure commit latency, file count per checkpoint interval, and metadata overhead vs. a static sink. Document the before/after story.
What does NOT fit: multi-tenant fan-out (more than 2–3 tables), production traffic — dev topic only.
Live standup demo: two terminal panes side by side. Left: a Python producer publishing events with schema v1, then v2, then v3 over 90 seconds. Right: a Trino query showing the Iceberg table schema update in real time. No restarts. The room goes quiet.
dlq) for records that fail schema validationSchema evolution is the #1 operational pain in streaming pipelines. Demonstrating you can solve it at the sink layer — not by freezing schemas in place — signals senior data engineer judgment. This also hands-on covers Flink SinkV2 internals, which is increasingly tested in staff-level DE interviews.
---
CDC pipelines (Postgres, MySQL, or Oracle to Kafka via Debezium) produce equality-delete files at a rate of roughly 1 per update event. At 1,000 updates/second that is 86 million delete files per day. Without compaction, a query that takes 2 seconds on day 1 takes 45 seconds by week 3. Your team either has this problem now or will have it within a quarter of scaling CDC. This POC instruments the problem, demonstrates the degradation, and then automates the fix: scheduled bin-pack compaction with Spark's RewriteDataFiles action, measuring query latency before and after on a real table.
RewriteDataFiles compaction actionWeek 1: Set up a Postgres → Debezium → Kafka → Iceberg CDC pipeline on a dev table. Deliberately run 24–48 hours of CDC traffic without compaction. Export Trino query latency metrics and Iceberg metadata (manifest file count, delete file count) at regular intervals to show the degradation curve.
Week 2: Implement scheduled compaction (Spark job, cron, or Amoro catalog-level). Run the same query benchmarks post-compaction. Chart the before/after: manifest count, delete file count, and p50/p95 query latency. Add expire_snapshots to keep metadata lean.
What does NOT fit: automated adaptive compaction (Amoro ML-based scheduling) — that is a month-long project.
A single slide with two charts: (1) query latency over time without compaction (slopes upward), and (2) the same metric after compaction fires (drops back to baseline). Title it "We fixed the ticking clock." Optional: show the Iceberg metadata diff before/after in Trino's $manifests and $files system tables.
Iceberg table maintenance (compaction, snapshot expiry, orphan file cleanup) is an operational gap at most companies. Owning the "compaction story" makes you the person who keeps the lakehouse healthy. This skill is directly transferable to any team running Iceberg at scale.
---
When a pipeline pushes bad data to a production table, rollback is painful, downstream dashboards are poisoned, and the RCA takes days. Iceberg's branching model (introduced in Iceberg 1.2, matured in 1.7) enables a Write-Audit-Publish (WAP) workflow: new data lands on an audit branch, dbt data quality tests (or Great Expectations assertions) run against it, and only after a clean test run does the branch fast-forward to main. This is the data equivalent of a PR gate, and it can be demonstrated end-to-end in two weeks using tooling your team already runs.
Week 1: Pick one existing dbt model that touches a high-stakes table. Modify the write path to target wap-branch instead of main. Add 3–5 dbt tests (not-null, accepted-values, row-count delta check). Wire a simple orchestrator that runs tests after each load and only calls fast_forward('main', 'wap-branch') on a green run.
Week 2: Deliberately inject bad data (null primary keys, out-of-range values). Show the branch catches it before main is affected. Query main during the bad run — it returns the last clean snapshot. Document the rollback story: time-to-recovery drops from "hours of debugging" to "branch rollback in seconds."
What does NOT fit: multi-branch fan-out per environment, row-level lineage — keep scope to one table and one branch.
Two Trino queries running in parallel during the demo: one reads main (always clean), one reads wap-branch (shows the bad batch). After the orchestrator rejects the bad batch, the branch resets. Show the Iceberg snapshot log — the failed run leaves zero trace on main.
WAP is mentioned in virtually every modern data quality talk (Data + AI Summit, Current, dbt Coalesce). Knowing the implementation details — Iceberg branch API, fast_forward, snapshot isolation — puts you well ahead of candidates who only know the concept from a slide deck.
---
Your team's analysts and engineers currently run ad-hoc Spark jobs or stand up a Trino cluster to query production Iceberg tables — a process that takes 5–15 minutes just to get a cluster warm. DuckDB 1.4 (released late 2025) added native Iceberg write support and can read Iceberg REST catalogs directly from a laptop, running queries 10–100x faster than Spark on datasets under ~500 GB. The POC shows engineers how to swap their entire "spin up a Spark shell" workflow for a DuckDB Python script that returns results in under 30 seconds, directly from your production catalog.
iceberg extensionWeek 1: Install DuckDB locally, configure the iceberg extension to point at your catalog's REST endpoint. Write 10 representative ad-hoc queries that engineers currently run on Spark. Benchmark wall-clock time: Spark (cluster cold) vs DuckDB (local). Capture both p50 and worst-case times.
Week 2: Add Polars as a downstream layer for anything that needs Python-native transformation (joins against local CSVs, output to Excel for stakeholders). Write a one-page "DuckDB Quick-Start" cheatsheet. Present the benchmark table at sharedown.
What does NOT fit: writing back to production Iceberg (read-only POC), datasets >500 GB (Spark's turf), cluster-scale aggregations.
Open a Jupyter notebook live. Run the same query against the Spark cluster (start the clock while it warms), then run it in DuckDB. DuckDB finishes before Spark's cluster is even ready. The benchmark table shows 10 queries: Spark average 4.5 min, DuckDB average 22 s. That slide goes into every "why we're modernizing" deck for the next year.
pip install titan-lakehouse-client internal library with catalog config baked inpg_duckdb to let engineers query Iceberg via standard Postgres clientsDuckDB is rapidly displacing Spark for interactive workloads. Understanding where each tool fits — DuckDB for interactive/exploratory, Spark for cluster-scale ETL, Flink for streaming — is exactly the mental model senior engineers are expected to have in 2026 interviews.
---
Your team almost certainly has at least one dbt model that runs a full table scan every hour or every night, even though only 0.1% of the rows changed. On a large Iceberg table this is both slow and expensive. The dbt-spark and dbt-trino adapters support incremental_strategy='merge' for Iceberg tables using MERGE INTO, which only processes new or changed rows identified by a unique_key. This POC takes one high-cost full-scan model, converts it to an incremental Iceberg model, measures the compute reduction, and documents the pattern for the team.
Week 1: Identify the top 3 most expensive full-scan dbt models (by Spark job duration or Trino query cost). Pick one to convert. Write the incremental version with unique_key, incremental_strategy='merge', and on_schema_change='append_new_columns'. Run both versions on the same dataset and capture runtime, bytes scanned, and cluster-hours consumed.
Week 2: Validate data correctness (compare row counts, checksums). Handle the edge case of late-arriving data (use a lookback window: where updated_at >= current_timestamp - interval '2' hour). Document the pattern and present a cost projection for converting the other high-cost models.
What does NOT fit: converting all models at once, type-2 SCD logic (more complex merge semantics required).
Side-by-side runtime comparison: full-scan model = 38 minutes, 2.1 TB scanned. Incremental model = 4 minutes, 180 GB scanned. If your team tracks cloud compute costs, show the dollar figure. The cost delta is the demo.
Cost-aware data engineering is a first-class interview topic in 2026. Being able to say "I reduced our Spark compute cost by 8x by migrating full-scan models to Iceberg incremental merge" with a concrete number is worth three rounds of LeetCode prep.
---
When a Flink or Spark Structured Streaming job reads from Kafka and writes to Iceberg, the link between a Kafka offset and an Iceberg snapshot is invisible. If a consumer reports missing data, you cannot answer: "which Iceberg snapshot corresponds to Kafka offset X on partition Y?" Iceberg 1.7's snapshot properties allow storing arbitrary metadata per commit, including the Kafka offset ranges consumed. This POC instruments your Flink-to-Iceberg pipeline to write per-snapshot Kafka offset metadata, then builds a simple Trino query or Grafana panel that shows the offset-to-snapshot map in real time.
SnapshotSummary properties)$snapshots system table)Week 1: Extend your existing Flink-to-Iceberg job (or build a toy pipeline) to write Kafka topic, partition, min_offset, and max_offset as snapshot summary properties on every Iceberg commit. Query the $snapshots table in Trino to confirm properties are landing correctly.
Week 2: Build a Grafana panel (or Streamlit script) that joins Kafka consumer group lag (from Kafka JMX/REST) with Iceberg snapshot timestamps to answer: "as of 14:32 UTC, we had processed offsets 0–41,823 on partition 0." Add a simple alert threshold for lag > N minutes.
What does NOT fit: full data lineage (column-level) — that is a month-long OpenLineage project.
Show the audit query: "find all Iceberg snapshots that consumed Kafka messages between 14:00 and 14:05 UTC." Return snapshot IDs, offsets, commit times, row counts. This answers a question that currently requires manual log-digging. The person on-call in the room will immediately want this.
Observability is the gap between "it works" and "it works at 3 AM." Building offset-to-snapshot lineage demonstrates that you think about operational reliability, not just correctness. It signals readiness for a staff or principal engineer role.
---
Most data teams have data in at least three places: a legacy Hive metastore, a modern Iceberg REST catalog, and operational Postgres databases. Analysts currently write three separate queries and join in Python — which is slow, error-prone, and opaque to governance tools. Trino's multi-catalog connector model can join across all three in a single SQL query, with Iceberg partition pruning on the lakehouse side. This POC configures a Trino cluster (or the free Starburst Galaxy trial) to federate across your catalogs and demonstrates a cross-catalog join that replaces a multi-step pipeline.
iceberg catalog in TrinoWeek 1: Stand up Trino locally (Docker) or use Starburst Galaxy. Configure 3 catalogs: iceberg, hive, postgres. Write a 3-way join query that pulls from all three. Measure query execution plan — verify Iceberg partition pruning is working (check $partitions in the explain plan).
Week 2: Wrap the federated query as a dbt-trino model that materializes the result as a new Iceberg table. Schedule it in Airflow. Benchmark: federated Trino query vs. the current "extract to S3, join in Spark" approach. Document the cases where federation wins and where it does not (network I/O to Postgres at scale).
What does NOT fit: replacing your Postgres OLTP with Trino writes, row-level security (multi-week project).
A single SQL query in the Trino UI that joins iceberg.prod.orders, hive.legacy.customers, and postgres.ops.support_tickets and returns a unified view. No Python. No intermediate S3 dump. The stakeholders who maintain those three systems will immediately understand the value.
ANALYZE TABLE) to improve the join cardinality estimatesMERGE into Iceberg from the federated result (upsert pattern)Federation is an increasingly common architecture pattern as organizations move from monolithic warehouses to multi-system lakehouses. Knowing the practical limits (Postgres connector is not partition-aware; Hive connector cannot prune below partition level) makes your architecture recommendations credible.
---
Object storage is cheap but not free, and access patterns are highly skewed — last 30 days of data gets 95% of queries, everything older is cold. Right now your team either keeps everything in the same hot-tier bucket (expensive) or manually archives to Glacier (operationally painful, invisible to query engines). Iceberg's partition spec combined with a scheduled Spark job can automatically migrate partitions older than N days from S3 Standard to S3 Glacier Instant Retrieval (or MinIO tiered storage locally), while keeping the table queryable without any schema change. This POC demonstrates automated hot/cold tiering with measurable storage cost reduction.
rewrite_position_delete_files)RewriteDataFiles with target storage config)Week 1: Instrument a test Iceberg table with day-level partitioning. Write a Spark job that identifies partitions older than 30 days and rewrites them with an S3 storage class tag (GLACIER_IR). Verify Trino can still query both tiers seamlessly (it can — the metadata layer is unchanged).
Week 2: Schedule the tiering job in Airflow with a configurable cold_after_days parameter. Measure storage cost before and after for the test table. Build a simple cost projection for the full dataset. Document the runbook.
What does NOT fit: automated query routing by tier (Trino does not natively reroute based on storage class), deep archive (restore latency makes it unsuitable for ad-hoc queries).
Show an S3 console (or MinIO dashboard) with two storage classes side by side. Run a Trino query that spans both hot and cold partitions — same result, same latency for recent data. Show the projected monthly storage cost reduction (even 20–30% on a large table is a meaningful number).
FinOps for data platforms is a growing discipline. Being able to design and implement storage tiering — rather than just talking about it — is a differentiator, especially at companies with large historical datasets. This also deepens your understanding of Iceberg's physical layout vs. logical metadata.
---
Business dashboards today either query a slow Iceberg table directly (stale by hours) or maintain a separate Redis/ClickHouse store for real-time metrics (duplicated schema, dual write bugs). Flink's streaming aggregation can maintain a pre-aggregated Iceberg table (e.g., hourly sales by region) that Trino queries in milliseconds, with freshness measured in seconds rather than hours. This POC builds a mini pipeline: Kafka events → Flink aggregation → Iceberg aggregate table → Trino query, and benchmarks freshness (time from event to queryable result) vs. the current batch approach.
Week 1: Choose a business metric your team cares about (order counts, event rates, error rates — anything with a natural aggregation). Write a Flink job with a 1-minute tumbling window that outputs aggregates to an Iceberg table. Confirm Trino can query the table and see updates within 2–3 minutes of event production.
Week 2: Tune the Flink checkpoint interval and Iceberg commit interval to reduce freshness to <60 seconds. Set up a Grafana dashboard reading from Trino. Document the latency profile: p50 and p95 time-to-queryable from Kafka produce timestamp. Compare to the current batch job's freshness.
What does NOT fit: sub-second freshness (use ClickHouse or Pinot for that), complex CEP patterns — keep to simple aggregations.
A live Grafana dashboard that ticks every 60 seconds with real event counts, fed by Trino queries on an Iceberg table that Flink is writing to. Point at the Kafka producer and say "I just increased the event rate" — the dashboard updates within 90 seconds. The current batch job would take 2 hours to show the same change.
corrections Iceberg table)Real-time aggregation on open table formats is the core use case that justifies the Kafka + Flink + Iceberg stack. Understanding the trade-offs (freshness vs. cost, tumbling vs. sliding windows, checkpoint interval vs. file count) is what separates an architect from an implementer.
---
Your team probably has informal agreements about what each Kafka topic should contain, but no automated enforcement. When a producer team changes a field without notice, consumers break silently and data quality degrades invisibly. Data contracts — formalized schemas with machine-checked assertions — are the solution, but implementing them end-to-end (Confluent Schema Registry for Avro enforcement on the producer side, Great Expectations for semantic checks at the Iceberg sink, and Iceberg tags to mark "certified" snapshots) has not been done as a unified proof of concept at most companies. This POC builds that pipeline.
certified, quarantine)certified)Week 1: Set up Schema Registry with FULL_TRANSITIVE compatibility on one Kafka topic. Write a Flink job that rejects events failing the registry check and routes them to a quarantine Iceberg branch. Confirm the happy-path producer and a bad producer both behave correctly.
Week 2: Add Great Expectations checks on the Iceberg table after each Flink commit (row count, null rates, value distributions). On pass: apply Iceberg tag certified_<timestamp>. On fail: snapshot stays untagged and an alert fires. Document the contract (schema + semantic assertions) as a YAML file checked into version control.
What does NOT fit: full Soda Core or Monte Carlo integration (multi-week), producer-side SDK (separate project).
Deploy a "bad producer" that sends events with a null required field. Show the Schema Registry reject it at the wire level. Then deploy a "schema-valid but semantically wrong" producer (all prices = 0). Show Great Expectations catch it at the Iceberg layer and withhold the certified tag. Query main — only certified snapshots are visible. Data quality is enforced at three independent layers.
Data contracts are the hottest topic in the data engineering space in 2025–2026 (Data + AI Summit, dbt Coalesce, Current conference all led with it). Implementing one end-to-end — not just talking about it — is a strong signal for staff and principal roles.
---
| Priority | POC | Rationale |
|----------|-----|-----------|
| 1 | POC 4 — DuckDB Local Lakehouse | Fastest to ship (1 day of setup), immediate team value, visceral demo |
| 2 | POC 2 — CDC Compaction Watchdog | Fixes a likely-existing problem, measurable win, operationally credible |
| 3 | POC 5 — Iceberg Incremental dbt | Direct cost impact, uses existing dbt models, low infrastructure risk |
| 4 | POC 3 — WAP Data Quality Gate | Architecture story, compliance angle, moderate complexity |
| 5 | POC 1 — Zero-Restart Schema Evolution | Highest technical depth, best resume signal, requires Flink familiarity |
---