Books and Courses for Learning Data Engineering Basics

Data Engineering:

I recently embarked on a journey to become a Data Engineer, and I have been tracking the books and courses that have been most helpful to me in this pursuit. I thought it would be useful to document my findings in a journal post for others who are looking to do the same.

Basic Skills to Acquire:

The skills you need depend on the industry and on the company you work for or plan to work for.

I believe the ones below are the most common.

  • Python
  • SQL
  • Understanding of Distributed Systems
  • DevOps (as a data engineer, maybe focus on concepts and common tools)
  • Data Engineering Fundamentals
  • Data Warehouse / Data Models
  • Data Architecture
  • Cloud

Programming: Focus on Python and SQL

No matter what you currently know, I believe it's important to understand Python and SQL well. Another programming language common in data work is Java, but I have left it out for now.

DataCamp:

DataCamp has a good, comprehensive beginner programming course for Data Engineering. It covers topics like:

  • Python for data manipulation using libraries like Pandas, NumPy, and PySpark, and handling formats like CSV, Parquet, and JSON
  • Object-Oriented Programming in Python
  • Writing efficient code in Python
  • Unit testing
  • Orchestrate data flows with Airflow
  • Data Ingestion
  • Scala
  • Relational design
  • NoSQL
  • And more
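
To make the "handling formats" topic a bit more concrete, here is a minimal sketch of reading CSV data, aggregating it, and writing the result as JSON. It uses only the Python standard library rather than Pandas, and the sample data, column names, and function names are all made up for illustration.

```python
import csv
import io
import json

# Hypothetical sample data; in practice this would come from a file.
CSV_TEXT = """order_id,customer,amount
1,alice,19.99
2,bob,5.00
3,alice,42.50
"""


def csv_to_records(text):
    """Parse CSV text into a list of dicts, casting each column to a real type."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "order_id": int(row["order_id"]),
            "customer": row["customer"],
            "amount": float(row["amount"]),
        }
        for row in reader
    ]


def total_by_customer(records):
    """Sum the amount column per customer."""
    totals = {}
    for record in records:
        name = record["customer"]
        totals[name] = totals.get(name, 0.0) + record["amount"]
    return totals


records = csv_to_records(CSV_TEXT)
totals = total_by_customer(records)

# Serialize the aggregated result as JSON.
print(json.dumps(totals, indent=2, sort_keys=True))
```

Keeping the parsing and the aggregation in small separate functions also makes them easy to cover with unit tests, which is another topic from the list above.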

Meta's Database Engineer Professional Certificate on Coursera:

There are many courses on Coursera, but a well-rounded basic course in data management is Meta's Database Engineer. It is important to understand that a "Database Engineer" and a "Data Engineer" do not have the same role; however, there is some overlap, and a good grasp of data management is needed.

I found the courses to have a good amount of information, with hands-on labs and extra resources. To clarify, I did not take the full specialization and mostly audited the classes as needed.

You may not need to take all of the classes unless you're looking for the certificate or a job through the program, but even as a refresher I found the following courses in the specialization pretty good.

  • Database Structures and Management with MySQL
  • Advanced Data Modeling

Other simpler but useful topics include:

  • Database Clients
  • Version Control
  • Python

I am not suggesting you need the full specialization, but if you're new to data it might be useful to take the full program.

If you need a pure beginner's resource on SQL and Python, there are many free tutorials. Here are a few:

SQL:

Python:

Understanding Data Engineering Fundamentals:

A good book I found is O'Reilly's "Fundamentals of Data Engineering".

Fundamentals of Data Engineering is an excellent book for those interested in gaining a high-level but comprehensive understanding of the field of data engineering. It covers topics ranging from data modeling and architecture to machine learning and analytics. The authors provide a clear and concise overview of the subject matter, making it easy to understand even for those who are new to data engineering. Additionally, the book is filled with helpful examples and diagrams to illustrate key concepts. All in all, Fundamentals of Data Engineering is an essential guide for anyone looking to learn the basics of data engineering.

From the book "Fundamentals of Data Engineering":

It begins by outlining the various stages of the data life cycle, including data collection, storage, processing, analysis, and presentation. It then dives into the different technologies involved in each stage and how they are used to achieve the desired outcomes. Finally, the book provides various examples of how the data life cycle is applied in different industries and organizations.

This is a great book for understanding the fundamentals.

Designing Data-Intensive Applications

For a more intermediate/professional resource on reliable and scalable data systems, this book is great for understanding the more technical aspects behind data systems. Written by Martin Kleppmann, it explores the design principles and trade-offs involved in building large-scale, distributed, and reliable data systems. I don't think you need to read it cover to cover like the book above if you're new to Data Engineering, but as a reference for understanding concepts it is great.

What about Projects?

I recently stumbled upon a great website called StartDataEngineering.com that offers free tutorials on a variety of data engineering topics. From data structures and databases to ETL pipelines and machine learning, it is a great starting point for anyone looking to break into data engineering and start some projects.

Great information like:

Project examples like:

There are also great insights, like his posts on data pipeline design patterns and on approaches to landing Data Engineering jobs.

One downside is that I wish it had a search feature, but it's free, so we should not complain.

Open Source / Vendor-Specific Tools:

Data Orchestration with Astronomer - Airflow:

Learning Directed Acyclic Graphs (DAGs) and using Airflow doesn't have to be difficult! Astro provides a great, simple local development environment to help you get started. Additionally, Astronomer has tutorials available in their documentation that range from the basics all the way up to more advanced topics, with links to further explore related concepts.
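
If the DAG idea itself is new to you, here is a minimal sketch of it in plain Python. This is not Airflow code and does not use any Airflow API: it only uses the standard library's graphlib module (Python 3.9+) to show how task dependencies determine a valid run order, and why a cycle makes a pipeline impossible to schedule. The task names are made up for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "quality_check": {"transform"},
    "load": {"transform"},
    "report": {"load", "quality_check"},
}


def run_order(dag):
    """Return one valid execution order where every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(pipeline)
print(order)

# A graph with a cycle is not a DAG, so no valid order exists.
cyclic = {"a": {"b"}, "b": {"a"}}
try:
    run_order(cyclic)
    has_cycle = False
except CycleError:
    has_cycle = True
print("cycle detected:", has_cycle)
```

An orchestrator does much more than this (scheduling, retries, logging), but the ordering rule is the core of what "acyclic" buys you.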

Run Transformations with dbt:

A popular and useful tool is dbt. It helps keep your analytics code maintainable, testable, and reliable, allowing your data team to focus on the most impactful tasks.
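
To give a rough feel for what "testable" means here, the sketch below imitates the idea in plain Python with an in-memory SQLite database. It is not dbt and does not use dbt's API; in dbt, models are SELECT statements in SQL files and tests are declared in YAML. The shared idea is that a model is a SELECT statement, and a test is a query that should return zero rows. The table, view, and helper names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount REAL);
    INSERT INTO raw_orders VALUES
        (1, 'alice', 19.99),
        (2, 'bob', 5.00),
        (3, 'alice', 42.50);

    -- The "model": a SELECT statement materialized as a view.
    CREATE VIEW customer_totals AS
        SELECT customer, SUM(amount) AS total_amount
        FROM raw_orders
        GROUP BY customer;
    """
)


def failing_rows(sql):
    """Run a test query; any rows returned are rows that violate the rule."""
    return conn.execute(sql).fetchall()


# "not null" style test: no customer in the model may be NULL.
not_null_failures = failing_rows(
    "SELECT * FROM customer_totals WHERE customer IS NULL"
)

# "unique" style test: each customer may appear only once in the model.
unique_failures = failing_rows(
    "SELECT customer FROM customer_totals GROUP BY customer HAVING COUNT(*) > 1"
)

print("not_null passed:", not not_null_failures)
print("unique passed:", not unique_failures)
```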

Cloud:

GCP Data Engineering Course via Coursera

This course is a great resource for those who are already using GCP data products, or those looking to learn more about them. Although Google is the third-largest cloud provider, the "Google Cloud Professional Data Engineer" certificate is still one of the most sought-after certifications. These courses are designed to help prepare you for the Google Professional Data Engineer Certification exam, but they are useful even if you don't intend to take the exam. It's a great way to get a comprehensive overview of Google's data products and how to deploy and use them in the cloud.

The course is "Preparing for Google Cloud Certification: Cloud Data Engineer", offered by Google Cloud.

Since my company is using GCP and I already have some experience with GCP data products, this course was more desirable to me.

What's to come?

Data Engineering is a complex and ever-evolving field. I'm always looking for ways to stay up to date with the best resources available and I'd like to share my findings with you. I'll be updating this post periodically with any new resources I come across, so keep an eye out! I hope you find this helpful.