My Interpretation of the Modern Data Stack

Data Engineering Architecture:

This is a post with information I have gathered and my interpretation of what I have learned and used.

Modern Data Stack:

A data stack is a collection of various technologies that allow for raw data to be processed before it can be used. A modern data stack (MDS) consists of the specific tools that are used to organize, store, and transform data.

Legacy data stack:

  • Technically heavy
  • Requires lots of infrastructure
  • Time-consuming

Modern data stack:

  • Cloud configured
  • Ease of use
  • Suite of tools designed with non-technical employees in mind (Low Code / No Code)
  • Saves time

Simplified example of a modern data stack (diagram of the ELT flow from sources into the destination).

Tools typically used, in my experience:

| Function | Managed Services | Open Source | GCP Services |
| --- | --- | --- | --- |
| Extract | Fivetran or Stitch | Debezium, Airbyte, Meltano, or a SQL script to pull the data | Cloud Data Fusion / Datastream |
| Transform | Fivetran (limited) and/or dbt Cloud | Open-source SQL / Apache Spark | Dataflow |
| Load | Fivetran or Stitch | SQL script, Airbyte Community | Pub/Sub / Cloud Data Fusion / Datastream |
| Dashboard | Looker, Tableau | Metabase / Superset | Looker Studio |
| Observability | dbt Cloud, Cloud Composer (Airflow), New Relic | Airflow + custom logic, Grafana | Cloud Composer |
| Schedule | dbt Cloud | Airflow | Cloud Composer |
| Data Warehouse | BigQuery, Snowflake, Redshift Spectrum | ClickHouse, Druid, Kylin, DuckDB (serverless) | BigQuery |
| Data Catalog | Dataplex, Atlan, Castor | DataHub, OpenMetadata, Open Data Discovery | Dataplex |
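To show how the Schedule and Observability pieces tie the other functions together, here is a minimal orchestration sketch. It assumes Airflow 2.x; the EL trigger command and the dbt project path are placeholders, not a specific production setup.

```python
# Minimal Airflow 2.x DAG sketch: extract/load first, then transform with dbt.
# The bash commands are placeholders for a real Fivetran/Airbyte trigger and
# a real dbt project path.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_elt",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # EL step: in practice this would trigger a Fivetran/Airbyte sync or run an extract script.
    extract_load = BashOperator(
        task_id="extract_load",
        bash_command="echo 'trigger the EL sync here'",
    )

    # T step: run the dbt models inside the warehouse (ELT).
    transform = BashOperator(
        task_id="transform",
        bash_command="cd /opt/dbt_project && dbt run",
    )

    extract_load >> transform
```

Cloud Composer is the managed GCP flavor of the same thing: the same DAG file gets dropped into the Composer environment's DAG bucket.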

APIs:

APIs are a data source that continues to grow in importance and popularity. A typical organization may have hundreds of external data sources such as SaaS platforms or partner companies.

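Below is a rough sketch of what pulling from one of those external APIs can look like in the extract step. The endpoint, credential, and pagination scheme are made up for illustration; a real SaaS API will differ.

```python
# Hypothetical API extract: page through an endpoint and land the raw JSON untouched.
import json

import requests

API_URL = "https://api.example.com/v1/orders"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"                       # placeholder credential


def extract_orders(page_size: int = 100) -> list:
    """Page through the API and collect the raw records."""
    records, page = [], 1
    while True:
        resp = requests.get(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            params={"page": page, "per_page": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        records.extend(batch)
        page += 1
    return records


if __name__ == "__main__":
    # Land the payload as-is; any cleanup happens later in the warehouse (ELT).
    with open("orders_raw.json", "w") as f:
        json.dump(extract_orders(), f)
```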

ETL vs ELT:


Typical Examples:

ETL: Data is extracted from the source, the raw data is staged in file storage, transformed via Python, Spark, Scala, or another language, and then loaded into the tables used by the end user.

ELT: Data is extracted from the source, the raw data is loaded directly into the data warehouse, and transformed using SQL into the final tables used by the end user. This model has been popularized by dbt.

In a good portion of use cases ELT is the better fit; however, if the transformations need to support streaming, or are complex enough to need the power of a Spark cluster, then ETL is better suited.
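To make the difference concrete, here is a toy sketch of both patterns, using SQLite as a stand-in for the warehouse and pandas for the out-of-warehouse transformation; the table and column names are invented for the example.

```python
# Toy ETL vs ELT comparison. SQLite stands in for the data warehouse.
import sqlite3

import pandas as pd

warehouse = sqlite3.connect("warehouse.db")

raw = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "amount_cents": [1999, 2599, 999],
        "country": ["us", "ca", "us"],
    }
)

# ETL: transform outside the warehouse (here with pandas), then load the finished table.
transformed = raw.assign(amount_usd=raw["amount_cents"] / 100).drop(columns="amount_cents")
transformed.to_sql("orders_etl", warehouse, if_exists="replace", index=False)

# ELT: load the raw data first, then transform it inside the warehouse with SQL
# (the pattern dbt is built around).
raw.to_sql("orders_raw", warehouse, if_exists="replace", index=False)
warehouse.execute("DROP TABLE IF EXISTS orders_elt")
warehouse.execute(
    """
    CREATE TABLE orders_elt AS
    SELECT order_id, amount_cents / 100.0 AS amount_usd, country
    FROM orders_raw
    """
)
warehouse.commit()
```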

Simple comparison:

| Criteria | Notes | ELT | ETL |
| --- | --- | --- | --- |
| Running cost | Depending on your data warehouse, performing transformations with k8s tasks or serverless functions (lambdas) can be much cheaper than transforming data in your data warehouse. | | x |
| Engineering expertise | Both require a good grasp of distributed systems, coding, maintenance, debugging, and SQL. | x | x |
| Development time | Depending on existing tools this may vary. If you have tools like Fivetran or dbt, ELT is a breeze. | x | |
| Transformation capabilities | Programming languages/frameworks like Python, Scala, and Spark enable complex transformations (enriching with data from an external API, running an ML model, etc.). SQL is not as powerful. | | x |
| Latency between data generation and availability to the end user | In general, the T of ELT runs as a batch job since SQL does not support streaming transformations. ETL can do batching, micro-batching, or streaming. | | x |
| SaaS tools | ELT allows for faster data feature delivery due to the availability of EL tools (Fivetran, Airbyte, etc.). | x | |

The table above is adapted from the great breakdown at https://www.startdataengineering.com/post/elt-vs-etl/#3-differences-between-etl--elt


Cloud ETL Tools:

What qualifies as an ETL tool?

  • Facilitate extract, transform, and load processes
  • Transform data for quality or visualization
  • Audit or record integration data
  • Archive data for backup, future reference, or analysis

| Tool | Description | Have I seen it in action? |
| --- | --- | --- |
| dbt | dbt is a transformation workflow that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation. Anyone who knows SQL can build production-grade data pipelines. | Yes |
| Fivetran | Fivetran is an automated data integration service. | Yes |
| Data Fusion | Google's visual point-and-click interface enabling code-free deployment of ETL/ELT data pipelines. | No |
| Hevo Data | Hevo Data is an automated data pipeline that helps users migrate, consolidate, and transform data from different sources into their data warehouse. | No |
| Keboola | Keboola enables data engineers, analysts, and engineers to collaborate on analytics and automation as a team, from extraction, transformation, data management, and pipeline orchestration to reverse ETL. | No |

Storage Beyond the Traditional Database:

Data Warehouse / Data Lakes / Data Lakehouse:

Traditional Database:

  • OLTP
  • Stores real-time, relational, structured data

Data Warehouse:

  • OLAP: for analyzing archived structured data
  • Optimized for analytics, organized for reading/aggregating data
  • Usually read-only (many can merge data)
  • Massively Parallel Processing (MPP)
  • Typically uses a denormalized schema and dimensional modeling
  • Examples: GCP BigQuery, AWS Redshift, Azure SQL Data Warehouse, Snowflake (works on all cloud platforms)

Data Marts:

  • Subset of a data warehouse
  • Dedicated to a specific topic

Data Lakes:

  • Catch-all storage for processing: stores ALL types of data at a lower cost
  • Loose structure gives flexibility and scalability
  • Can process and analyze a lot of data
  • Retains all data and can grow to petabytes
  • Schema-on-read as opposed to schema-on-write
  • Data needs to be cataloged, otherwise it becomes a data swamp
  • Runs big data analytics engines such as Spark, Beam, and so on

Data Lakehouse:

  • Catch-all storage for processing: the flexibility and scalability of a lake, plus the ability to archive structured data
  • Can process and analyze a lot of data
  • More cost effective than traditional data warehouse solutions, with:
  • Increased scalability and flexibility
  • Support for multiple data sources and formats
  • Faster development and deployment of applications
  • Streamlined data analysis and reporting processes
  • Easy integration with other systems
  • Improved data quality and governance
  • Examples: GCP BigQuery, Snowflake, Databricks
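The "schema-on-read as opposed to schema-on-write" point is easier to see in code. Below is a small sketch with made-up file and field names: raw JSON lines land in the lake untouched, and a schema is only imposed when the files are read for analysis.

```python
# Schema-on-read sketch: write raw, loosely structured events, apply structure at read time.
import pandas as pd

# "Write" side: events land as-is, with no enforced schema (note the differing fields).
with open("events.jsonl", "w") as f:
    f.write('{"user_id": 1, "event": "click", "ts": "2024-01-01T00:00:00"}\n')
    f.write('{"user_id": 2, "event": "purchase", "ts": "2024-01-01T00:05:00", "amount": 19.99}\n')

# "Read" side: columns, types, and missing values are resolved only when the data is queried.
events = pd.read_json("events.jsonl", lines=True)
events["ts"] = pd.to_datetime(events["ts"])
print(events.dtypes)  # the first event has no "amount", so that column is simply NaN there
```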

A brief overview of the similarities and differences between Snowflake, BigQuery and Databricks.

 

When to use:

Snowflake:

  • You need a multi-cloud data warehouse / data lake.
  • You need support for structured and semi-structured data.
  • Data sharing options are required.
  • You need comprehensive data governance and compliance features.

BigQuery:

  • You want integration with popular GCP services like Dataproc, Dataflow, and Cloud Storage.
  • You want to query large datasets using SQL and leverage machine learning algorithms (see the sketch after this list).
  • ISO/IEC 27001 and SOC 2 compliance is needed.

Databricks:

  • You need a multi-cloud data warehouse / data lake (data lakehouse).
  • Using an optimized Spark environment is preferred.
  • Databricks is best suited for complex data processing operations and streaming data.
  • Databricks provides a variety of security best practices and certifications, such as ISO/IEC 27001 and SOC 2 compliance.
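As a small illustration of the BigQuery "query large datasets using SQL" point, here is a sketch using the google-cloud-bigquery Python client; the project, dataset, and table names are placeholders, and credentials are assumed to be configured in the environment.

```python
# Hypothetical BigQuery query from Python; project/dataset/table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # picks up project and credentials from the environment

sql = """
    SELECT country, SUM(amount_usd) AS revenue
    FROM `my-project.analytics.orders`
    GROUP BY country
    ORDER BY revenue DESC
"""

for row in client.query(sql).result():
    print(row["country"], row["revenue"])
```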

When not to use:

  • Snowflake may not be suited for real-time or complicated analysis, very large datasets, or organizations with extremely strict security requirements.

  • BigQuery is not well suited for low-latency, real-time applications that need sub-second response times, random reads/writes, or data streaming.

  • Databricks should not be used when the data processing needs are minimal, when the data size is too small, or when there is no need for distributed computing.

Costs?

Depending on the size and complexity of the data analysis, BigQuery may be more cost-effective than either Snowflake or Databricks. Generally, Snowflake tends to be pricier than Databricks for most applications; however, there are scenarios where Databricks can be more expensive than Snowflake. Ultimately, the cost of each platform depends on which of its features and services you use.