Data Engineer Tool Selection
A combination of reading O'Reilly books, working with Google Gemini Pro, and my own thinking about business and data-system concerns in tool selection.
January 2026
01
Data Latency: How fast do you need it?
- Sub-second / Real-time: You need a stream processor.
- Tooling: Apache Flink, Spark Streaming, or Confluent (Kafka).
- Minutes to Hours (Micro-batch/Batch): Standard ETL/ELT is fine.
- Tooling: dbt, Apache Airflow, or Dagster.
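The latency split above can be sketched in plain Python. This is an illustrative toy, not Flink or Airflow: a streaming handler fires on every event, while a micro-batch handler buffers events and flushes them in groups.

```python
def stream_process(events, handler):
    """Streaming: handle each event the moment it arrives (sub-second latency)."""
    for event in events:
        handler(event)

def micro_batch_process(events, handler, batch_size=3):
    """Micro-batch: buffer events and flush periodically (minutes-to-hours latency)."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            handler(batch)
            batch = []
    if batch:  # flush any remainder
        handler(batch)

events = ["click", "view", "purchase", "click", "view"]

seen = []
stream_process(events, seen.append)          # one handler call per event
print(len(seen))                             # → 5

batches = []
micro_batch_process(events, batches.append)  # two handler calls: a full batch and a remainder
print([len(b) for b in batches])             # → [3, 2]
```

The trade-off is the same at any scale: streaming minimizes time-to-result per event, micro-batching amortizes overhead across many events.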
02
Data Volume & Complexity
- Small to Medium (< 1TB, Structured): Don't over-engineer. Use a cloud warehouse.
- Tooling: Snowflake, BigQuery, or PostgreSQL (if very small).
- Massive / Unstructured (> 10TB, Logs, JSON, Images): You need a Data Lakehouse.
- Tooling: Databricks (Spark), Starburst (Trino), or AWS Athena.
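The volume/complexity rules above can be encoded as a small decision helper. The thresholds and return strings are just the ones from this checklist, not hard industry limits:

```python
def pick_platform(volume_tb: float, structured: bool) -> str:
    """Toy encoding of the volume-and-complexity decision rules (illustrative thresholds)."""
    if volume_tb < 1 and structured:
        return "cloud warehouse (Snowflake / BigQuery / PostgreSQL)"
    if volume_tb > 10 or not structured:
        return "lakehouse (Databricks / Starburst / AWS Athena)"
    return "either; decide on TCO and team skillset"

print(pick_platform(0.2, structured=True))   # small + structured → warehouse
print(pick_platform(50, structured=False))   # massive + unstructured → lakehouse
```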
03
Storage Format (The Lakehouse Strategy)
If you went the Lakehouse route, you must pick a table format to handle ACID transactions on files:
- Heavy Upserts/Deletes: Apache Hudi.
- High Performance / Ecosystem Support: Apache Iceberg (The current industry favorite for interoperability).
- Deep Databricks Integration: Delta Lake.
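What these table formats add on top of raw files is, at its core, the ability to merge rows by key and atomically publish a new table version. A hugely simplified copy-on-write upsert over a JSON-lines file shows the idea (real formats add manifests, snapshots, and concurrency control):

```python
import json
import os
import tempfile

def upsert(table_path, updates, key="id"):
    """Copy-on-write upsert: read the current file, merge rows by key,
    write the merged result back. Simplified sketch of what Hudi/Iceberg/
    Delta do with far more machinery."""
    rows = {}
    if os.path.exists(table_path):
        with open(table_path) as f:
            for line in f:
                row = json.loads(line)
                rows[row[key]] = row
    for row in updates:
        rows[row[key]] = row  # existing key = update, new key = insert
    with open(table_path, "w") as f:
        for row in rows.values():
            f.write(json.dumps(row) + "\n")
    return len(rows)

path = os.path.join(tempfile.mkdtemp(), "users.jsonl")
upsert(path, [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}])
n = upsert(path, [{"id": 2, "name": "Grace H."}, {"id": 3, "name": "Edsger"}])
print(n)  # → 3 (one update, one insert)
```

If your workload is dominated by this merge-by-key pattern, that is exactly the "heavy upserts/deletes" case where Hudi shines.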
04
Transformation Logic
- SQL-heavy / Analyst Friendly: dbt is the gold standard for modular SQL.
- Complex Python / ML Logic: Apache Spark or Ray.
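The dbt idea of modular, layered SQL (raw → staging → mart) can be sketched with the standard-library `sqlite3` module. The table and column names are invented for illustration; this mimics dbt's layering convention, it is not dbt itself:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)")
con.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                [(1, 1000, "paid"), (2, 250, "refunded"), (3, 4000, "paid")])

# staging model: clean and rename, one concern per layer
con.execute("""CREATE VIEW stg_orders AS
               SELECT id, amount_cents / 100.0 AS amount_usd, status
               FROM raw_orders""")

# mart model: business aggregate built only on the staging layer
con.execute("""CREATE VIEW mart_revenue AS
               SELECT SUM(amount_usd) AS revenue
               FROM stg_orders
               WHERE status = 'paid'""")

revenue = con.execute("SELECT revenue FROM mart_revenue").fetchone()[0]
print(revenue)  # → 50.0
```

Each model depends only on the layer below it, which is what makes SQL-heavy pipelines testable and analyst-friendly.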
01
Cross-Check Before Takeoff
Before finalizing a tool, ask yourself these three non-technical questions:
Total Cost of Ownership (TCO):
Is the licensing + engineering hours to maintain it cheaper than a managed service?
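That TCO comparison is simple arithmetic worth actually doing. A minimal sketch, with placeholder numbers rather than real vendor pricing:

```python
def annual_tco(license_cost, eng_hours_per_month, hourly_rate):
    """Rough annual TCO: licensing plus the engineering hours to keep it running."""
    return license_cost + eng_hours_per_month * 12 * hourly_rate

# hypothetical: "free" self-hosted tool vs a paid managed service
self_hosted = annual_tco(license_cost=0, eng_hours_per_month=40, hourly_rate=100)
managed = annual_tco(license_cost=36_000, eng_hours_per_month=5, hourly_rate=100)
print(self_hosted, managed)  # → 48000 42000
```

With these placeholder numbers the "free" tool is the more expensive option once maintenance hours are priced in, which is the trap the question is guarding against.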
02
Team Skillset:
Does my team know Scala/Java, or are they strictly SQL/Python? (Don't buy a Ferrari if no one can drive manual).
03
Vendor Lock-in:
How hard is it to migrate our metadata if this vendor hikes prices by 30% next year?
