My Interpretation of the Modern Data Stack

Data Engineering Architecture:

This is a post with information I have gathered and my interpretation of what I have learned and used.

Modern Data Stack:

A data stack is a collection of various technologies that allow for raw data to be processed before it can be used. A modern data stack (MDS) consists of the specific tools that are used to organize, store, and transform data.

Legacy data stack:

  • Technically heavy
  • Requires lots of infrastructure
  • Time-consuming

Modern data stack:

  • Cloud configured
  • Ease of use
  • Suite of tools designed with non-technical employees in mind (Low Code / No Code)
  • Saves time

Simplified example of a modern data stack (diagram of the ELT flow from sources into the destination).

Tools typically used, in my experience:

| Function | Managed Services | Open Source | GCP Services |
| --- | --- | --- | --- |
| Extract | Fivetran or Stitch | Debezium, Airbyte, Meltano, or a SQL script to pull the data | Cloud Data Fusion / Datastream |
| Transform | Fivetran (limited) and/or dbt Cloud | Open-source SQL / Apache Spark | Dataflow |
| Load | Fivetran or Stitch | SQL script, Airbyte Community | Pub/Sub / Cloud Data Fusion / Datastream |
| Dashboard | Looker, Tableau | Metabase / Superset | Looker Studio |
| Observability | dbt Cloud, Cloud Composer (Airflow), New Relic | Airflow + custom logic, Grafana | Cloud Composer |
| Schedule | dbt Cloud | Airflow | Cloud Composer |
| Data Warehouse | BigQuery, Snowflake, Redshift Spectrum | ClickHouse, Druid, Kylin, DuckDB (serverless) | BigQuery |
| Data Catalog | Dataplex, Atlan, Castor | DataHub, OpenMetadata, Open Data Discovery | Dataplex |
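To show how the Schedule and Observability pieces tie the other functions together, here is a minimal orchestration sketch. It assumes Airflow 2.x; the EL trigger command and the dbt project path are placeholders, not a specific production setup.

```python
# Minimal Airflow 2.x DAG sketch: extract/load first, then transform with dbt.
# The bash commands are placeholders for a real Fivetran/Airbyte trigger and
# a real dbt project path.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_elt",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # EL step: in practice this would trigger a Fivetran/Airbyte sync or run an extract script.
    extract_load = BashOperator(
        task_id="extract_load",
        bash_command="echo 'trigger the EL sync here'",
    )

    # T step: run the dbt models inside the warehouse (ELT).
    transform = BashOperator(
        task_id="transform",
        bash_command="cd /opt/dbt_project && dbt run",
    )

    extract_load >> transform
```

Cloud Composer is the managed GCP flavor of the same thing: the same DAG file gets dropped into the Composer environment's DAG bucket.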

APIs:

APIs are a data source that continues to grow in importance and popularity. A typical organization may have hundreds of external data sources such as SaaS platforms or partner companies.

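Below is a rough sketch of what pulling from one of those external APIs can look like in the extract step. The endpoint, credential, and pagination scheme are made up for illustration; a real SaaS API will differ.

```python
# Hypothetical API extract: page through an endpoint and land the raw JSON untouched.
import json

import requests

API_URL = "https://api.example.com/v1/orders"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"                       # placeholder credential


def extract_orders(page_size: int = 100) -> list:
    """Page through the API and collect the raw records."""
    records, page = [], 1
    while True:
        resp = requests.get(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            params={"page": page, "per_page": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        records.extend(batch)
        page += 1
    return records


if __name__ == "__main__":
    # Land the payload as-is; any cleanup happens later in the warehouse (ELT).
    with open("orders_raw.json", "w") as f:
        json.dump(extract_orders(), f)
```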

ETL vs ELT:


Typical Examples:

ETL: Data is extracted from the source, the raw data is staged in file storage, transformed via Python, Spark, Scala, or another language, and then loaded into the tables used by the end user.

ELT: Data is extracted from the source, the raw data is loaded directly into the data warehouse, and transformed using SQL into the final tables used by the end user. This model has been popularized by dbt.

In a good portion of use cases ELT is the better fit; however, if the transformations need to support streaming, or are complex enough to need the power of a Spark cluster, then ETL is better suited.
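To make the difference concrete, here is a toy sketch of both patterns, using SQLite as a stand-in for the warehouse and pandas for the out-of-warehouse transformation; the table and column names are invented for the example.

```python
# Toy ETL vs ELT comparison. SQLite stands in for the data warehouse.
import sqlite3

import pandas as pd

warehouse = sqlite3.connect("warehouse.db")

raw = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "amount_cents": [1999, 2599, 999],
        "country": ["us", "ca", "us"],
    }
)

# ETL: transform outside the warehouse (here with pandas), then load the finished table.
transformed = raw.assign(amount_usd=raw["amount_cents"] / 100).drop(columns="amount_cents")
transformed.to_sql("orders_etl", warehouse, if_exists="replace", index=False)

# ELT: load the raw data first, then transform it inside the warehouse with SQL
# (the pattern dbt is built around).
raw.to_sql("orders_raw", warehouse, if_exists="replace", index=False)
warehouse.execute("DROP TABLE IF EXISTS orders_elt")
warehouse.execute(
    """
    CREATE TABLE orders_elt AS
    SELECT order_id, amount_cents / 100.0 AS amount_usd, country
    FROM orders_raw
    """
)
warehouse.commit()
```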

Simple comparison:

| Criteria | Notes | ELT | ETL |
| --- | --- | --- | --- |
| Running cost | Depending on your data warehouse, performing transformations with k8s tasks or serverless functions (lambdas) can be much cheaper than transforming data in your data warehouse. | | x |
| Engineering expertise | Both require a good grasp of distributed systems, coding, maintenance, debugging, and SQL. | x | x |
| Development time | Depending on existing tools this may vary. If you have tools like Fivetran or dbt, ELT is a breeze. | x | |
| Transformation capabilities | Programming languages/frameworks like Python, Scala, and Spark enable complex transformations (enriching with data from an external API, running an ML model, etc.). SQL is not as powerful. | | x |
| Latency between data generation and availability to the end user | In general, the T of ELT runs as a batch job since SQL does not support streaming transformations. ETL can do batching, micro-batching, or streaming. | | x |
| SaaS tools | ELT allows for faster data feature delivery due to the availability of EL tools (Fivetran, Airbyte, etc.). | x | |

The table above is adapted from the great breakdown at https://www.startdataengineering.com/post/elt-vs-etl/#3-differences-between-etl--elt


Cloud ETL Tools:

What qualifies as an ETL tool?

  • Facilitate extract, transform, and load processes
  • Transform data for quality or visualization
  • Audit or record integration data
  • Archive data for backup, future reference, or analysis

| Tool | Description | Have I seen it in action? |
| --- | --- | --- |
| dbt | dbt is a transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation. Anyone who knows SQL can build production-grade data pipelines. | Yes |
| Fivetran | Fivetran is an automated data integration service. | Yes |
| Data Fusion | Google's visual point-and-click interface enabling code-free deployment of ETL/ELT data pipelines. | No |
| Hevo Data | Hevo Data is an automated data pipeline that helps users migrate, consolidate, and transform data from different sources into their data warehouse. | No |
| Keboola | Keboola enables data engineers, analysts, and engineers to collaborate on analytics and automation as a team, from extraction, transformation, data management, and pipeline orchestration to reverse ETL. | No |

Storage Beyond the Traditional Database:

Data Warehouse / Data Lakes / Data Lakehouse:

Traditional Database:

  • OLTP
  • Stores real-time, relational, structured data

Data Warehouse:

  • OLAP: for analyzing archived structured data
  • Optimized for analytics, organized for reading/aggregating data
  • Usually read-only (many can merge data)
  • Massively Parallel Processing (MPP)
  • Typically uses a denormalized schema and dimensional modeling
  • Examples: GCP BigQuery, AWS Redshift, Azure SQL Data Warehouse, Snowflake (works on all cloud platforms)

Data Marts:

  • Subset of a data warehouse
  • Dedicated to a specific topic

Data Lakes:

  • Catch-all storage for processing: stores ALL types of data at a lower cost
  • Loose structure gives flexibility and scalability
  • Can process and analyze a lot of data
  • Retains all data and can grow to petabytes
  • Schema-on-read as opposed to schema-on-write
  • Data needs to be cataloged, otherwise it becomes a data swamp
  • Runs big data analytics engines such as Spark, Beam, and so on

Data Lakehouse:

  • Catch-all storage for processing: the flexibility and scalability of a lake, plus the ability to archive structured data
  • Can process and analyze a lot of data
  • More cost effective than traditional data warehouse solutions, with:
  • Increased scalability and flexibility
  • Support for multiple data sources and formats
  • Faster development and deployment of applications
  • Streamlined data analysis and reporting processes
  • Easy integration with other systems
  • Improved data quality and governance
  • Examples: GCP BigQuery, Snowflake, Databricks
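The "schema-on-read as opposed to schema-on-write" point is easier to see in code. Below is a small sketch with made-up file and field names: raw JSON lines land in the lake untouched, and a schema is only imposed when the files are read for analysis.

```python
# Schema-on-read sketch: write raw, loosely structured events, apply structure at read time.
import pandas as pd

# "Write" side: events land as-is, with no enforced schema (note the differing fields).
with open("events.jsonl", "w") as f:
    f.write('{"user_id": 1, "event": "click", "ts": "2024-01-01T00:00:00"}\n')
    f.write('{"user_id": 2, "event": "purchase", "ts": "2024-01-01T00:05:00", "amount": 19.99}\n')

# "Read" side: columns, types, and missing values are resolved only when the data is queried.
events = pd.read_json("events.jsonl", lines=True)
events["ts"] = pd.to_datetime(events["ts"])
print(events.dtypes)  # the first event has no "amount", so that column is simply NaN there
```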

A brief overview of the similarities and differences between Snowflake, BigQuery and Databricks.

 

When to use:

Snowflake:

  • You need a multi-cloud data warehouse / data lake.
  • You need support for structured and semi-structured data.
  • Data sharing options are required.
  • You need comprehensive data governance and compliance features.

BigQuery:

  • You want integration with popular GCP services like Dataproc, Dataflow, and Cloud Storage.
  • You want to query large datasets using SQL and leverage machine learning algorithms (see the sketch after this list).
  • ISO/IEC 27001 and SOC 2 compliance is needed.

Databricks:

  • You need a multi-cloud data warehouse / data lake (data lakehouse).
  • Using an optimized Spark environment is preferred.
  • Databricks is best suited for complex data processing operations and streaming data.
  • Databricks provides a variety of security best practices and certifications, such as ISO/IEC 27001 and SOC 2 compliance.
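As a small illustration of the BigQuery "query large datasets using SQL" point, here is a sketch using the google-cloud-bigquery Python client; the project, dataset, and table names are placeholders, and credentials are assumed to be configured in the environment.

```python
# Hypothetical BigQuery query from Python; project/dataset/table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # picks up project and credentials from the environment

sql = """
    SELECT country, SUM(amount_usd) AS revenue
    FROM `my-project.analytics.orders`
    GROUP BY country
    ORDER BY revenue DESC
"""

for row in client.query(sql).result():
    print(row["country"], row["revenue"])
```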

When not to use:

  • Snowflake may not be suited for real-time or complicated analysis, very large datasets, or organizations with extremely strict security requirements.

  • BigQuery is not well suited for low-latency, real-time applications that need sub-second response times, random reads/writes, or data streaming.

  • Databricks should not be used when the data processing needs are minimal, when the data size is too small, or when there is no need for distributed computing.

Costs?

Depending on the size and complexity of the data analysis, BigQuery may be more cost-effective than either Snowflake or Databricks. Generally, Snowflake tends to be pricier than Databricks for most applications; however, there are scenarios where Databricks can be more expensive than Snowflake. Ultimately, the cost of each platform depends on which of its features and services you use.