Books and Courses for Learning Data Engineering Basics
Data Engineering:
I recently embarked on a journey to become a Data Engineer, and I have been tracking the books and courses that have been most helpful to me in this pursuit. I thought it would be useful to document my findings in a journal post for others who are looking to do the same.
Basic Skills to Acquire:
The skills you need depend on the needs of the industry or company you work for or plan to work for.
I believe the ones below are the most common.
- Python
- SQL
- Understanding of Distributed Systems
- DevOps (for data engineering, maybe focus on concepts and common tools)
- Data Engineering Fundamentals
- Data Warehouse / Data Models
- Data Architecture
- Cloud
Programming: Focus on Python and SQL
No matter what you currently know, I believe it's important to understand Python and SQL well. Another language commonly used in data work is Java, but I have left it out for now.
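To show how the two languages tend to work together, here is a minimal sketch that uses only Python's built-in sqlite3 module. The table name, column names, and the top_customers function are made up for illustration; a real pipeline would likely point at a production database rather than an in-memory one.

```python
import sqlite3


def top_customers(rows, limit=2):
    """Load (customer, amount) rows into an in-memory SQLite table and
    return the customers with the highest total spend."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical table, created only for this example.
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        query = """
            SELECT customer, SUM(amount) AS total
            FROM orders
            GROUP BY customer
            ORDER BY total DESC, customer ASC
            LIMIT ?
        """
        return conn.execute(query, (limit,)).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("ana", 30.0), ("ben", 12.5), ("ana", 20.0), ("cy", 45.0)]
    print(top_customers(sample))
```

Python handles moving the data in and out, while SQL does the grouping and sorting, which is roughly the split you see in a lot of day-to-day data work.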
Datacamp:
Datacamp has a good, comprehensive beginner programming course for Data Engineering. It covers topics like:
- Python for data manipulation using libraries like Pandas, NumPy, and PySpark, and handling formats like CSV, Parquet, and JSON
- Object-Oriented Programming in Python
- Writing efficient code in Python
- Unit testing
- Orchestrate data flows with Airflow
- Data Ingestion
- Scala
- Relational design
- NoSQL
- And more
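As a small taste of the file-format and unit-testing topics above, here is a minimal sketch that converts CSV text to JSON lines and checks the result. It uses only the Python standard library instead of Pandas so it runs anywhere; the function names and the sample columns (id, amount) are my own assumptions, not something taken from the course.

```python
import csv
import io
import json


def csv_to_records(csv_text):
    """Parse CSV text into a list of dicts, casting 'amount' to float."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        row["amount"] = float(row["amount"])
        records.append(row)
    return records


def records_to_json_lines(records):
    """Serialize each record as one JSON object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


def test_csv_to_records():
    """A tiny unit test; a test runner such as pytest would pick this up."""
    records = csv_to_records("id,amount\n1,9.99\n2,20\n")
    assert records == [
        {"id": "1", "amount": 9.99},
        {"id": "2", "amount": 20.0},
    ]


if __name__ == "__main__":
    test_csv_to_records()
    print(records_to_json_lines(csv_to_records("id,amount\n1,9.99\n2,20\n")))
```

Parquet is left out because it needs a third-party library, but the shape of the code (read, cast types, write, test) stays much the same.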
Meta's Database Engineer Professional Certificate on Coursera:
There are many courses on Coursera, but a well-rounded basic course in data management is Meta's Database Engineer program. It is important to understand that a "Database Engineer" and a "Data Engineer" do not have the same role; however, there is some overlap, and a good grasp of data management is needed either way.
I found the courses to have a good amount of information, with hands-on labs and extra resources. To clarify, I did not take the full specialization and mostly audited the classes as needed.
You may not need to take all of the classes unless you're looking for the certificate or a job through the program, but even as a refresher I found the following courses in the specialization pretty good.
- Database Structures and Management with MySQL
- Advanced Data Modeling
Other, more basic but still useful topics include:
- Database Clients
- Version Control
- Python
I am not suggesting you need the full specialization, but if you're new to data, it might be useful to take the full program.
If you need a pure beginner's resource on SQL and Python, there are many free tutorials. Here are a few:
SQL:
- SQL Tutorial (free): https://www.w3schools.com/sql/default.asp
- Learn SQL (free): https://www.learnsql.com/
- SQL Bolt (free): https://sqlbolt.com
Python:
- Python Tutorial (free): https://www.w3schools.com/python/default.asp
- Learn Python (free): https://www.learnpython.org
- Learn Python (free): https://www.codecademy.com/learn/learn-python
- Python Basics (free): https://developers.google.com/edu/python
Understanding Data Engineering Fundamentals:
A good book I found is O'Reilly's "Fundamentals of Data Engineering".
Fundamentals of Data Engineering is an excellent book for those interested in gaining a high-level but comprehensive understanding of the field of data engineering. It covers topics ranging from data modeling and architecture to machine learning and analytics. The authors provide a clear and concise overview of the subject matter, making it easy to understand even for those who are new to data engineering. Additionally, the book is filled with helpful examples and diagrams to illustrate key concepts. All in all, Fundamentals of Data Engineering is an essential guide for anyone looking to learn the basics of data engineering.
It begins by outlining the various stages of the data life cycle, including data collection, storage, processing, analysis, and presentation. It then dives into the different technologies involved in each stage and how they are used to achieve the desired outcomes. Finally, the book provides various examples of how the data life cycle is applied in different industries and organizations.
This is a great book for understanding the fundamentals.
Designing Data-Intensive Applications:
For a more intermediate/professional resource on reliable and scalable data systems, this book is great for understanding the more technical aspects behind data systems. Written by Martin Kleppmann, it explores the design principles and trade-offs involved in building large-scale, distributed, and reliable data systems. If you're new to Data Engineering, I don't think this is a book you need to read cover to cover like the book above, but as a reference and for understanding concepts it is great.
What about Projects?
I recently stumbled upon a great website called StartDataEngineering.com that offers free tutorials on a variety of data engineering topics. From data structures and databases to ETL pipelines and machine learning, this website is a great starting point for anyone looking to break into data engineering and start some projects.
Great information like:
- Data Pipeline Design Patterns - Part1
- Data Pipeline Design Patterns - Part2 Python Code
- How to add tests to your data pipeline
- Patterns to load data into a data warehouse
Project examples like:
- DBT Data Build Tool Tutorial
- Build Data Engineering Project Template
- End to End Data Engineering Project
- How to - dbt cloud and Snowflake data-ops workflow
There are also great insights, like the author's posts on data pipeline design patterns and approaches to landing Data Engineering jobs.
One downside is that I wish it had a search feature. Still, it's free, so we should not complain.
Open Source Vendor Specific Tools:
Data Orchestration with Astronomer - Airflow:
Learning Directed Acyclic Graphs (DAGs) and using Airflow doesn't have to be difficult! Astro provides a great, simple local development environment to help you get started. Additionally, Astronomer has tutorials available in their documentation that range from the basics all the way up to more advanced topics, with links to further explore related concepts.
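If the DAG idea is new to you, the sketch below may help. It is not Airflow code and does not use Airflow's API; it only uses Python's standard graphlib module (Python 3.9+) to show the core concept an orchestrator relies on: each task declares its upstream tasks, and tasks run in an order that respects those dependencies. The task names and the run_pipeline function are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run tasks in dependency order.

    tasks: dict mapping task name -> zero-argument callable
    dependencies: dict mapping task name -> set of upstream task names
    Returns the execution order and each task's return value.
    """
    # static_order() yields every task after all of its upstream tasks.
    order = list(TopologicalSorter(dependencies).static_order())
    results = [tasks[name]() for name in order]
    return order, results


if __name__ == "__main__":
    # Hypothetical three-step pipeline: extract -> transform -> load.
    tasks = {
        "extract": lambda: "extracted",
        "transform": lambda: "transformed",
        "load": lambda: "loaded",
    }
    dependencies = {"transform": {"extract"}, "load": {"transform"}}
    print(run_pipeline(tasks, dependencies))

    # A cycle is not a valid DAG, so it is rejected.
    try:
        run_pipeline(tasks, {"extract": {"load"}, "load": {"extract"}})
    except CycleError:
        print("cycle detected")
```

Airflow adds scheduling, retries, logging, and a UI on top of this idea, which is why working through the Astronomer tutorials is still worth the time.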
Run Transformations with dbt:
A popular and useful tool is dbt. dbt helps keep your analytics code maintainable, testable, and reliable, allowing your data team to focus on the most impactful tasks.
- Fundamental Courses: https://courses.getdbt.com/collections
- Best Practice Guides: https://docs.getdbt.com/guides/best-practices
Cloud:
GCP Data Engineering Course via Coursera
This course is a great resource for those who are already using GCP data products, or those looking to learn more about them. Although Google is the third-largest cloud provider, the "Google Cloud Professional Data Engineer" certificate is still one of the most sought-after certifications. These courses are designed to help prepare you for the Google Professional Data Engineer Certification exam, but they are useful even if you don't intend to take the exam. It's a great way to get a comprehensive overview of Google's data products and how to deploy and use them in the cloud.
Since my company uses GCP and I already have some experience with GCP data products, this course was the most appealing option for me.
What's to come?
Data Engineering is a complex and ever-evolving field. I'm always looking for ways to stay up to date with the best resources available and I'd like to share my findings with you. I'll be updating this post periodically with any new resources I come across, so keep an eye out! I hope you find this helpful.