A hands-on technical and industry roadmap for aspiring data engineers
In Data Engineering for Beginners, big data expert Chisom Nwokwu delivers a beginner-friendly handbook for everyone interested in the fundamentals of data engineering. Whether you're starting a rewarding new career as a data analyst, data engineer, or data scientist, or seeking to expand your skillset in an existing engineering role, Nwokwu offers the technical and industry knowledge you need to succeed.
The book explains:
Database fundamentals, including relational and NoSQL databases
Data warehouses and data lakes
Data pipelines, including batch and stream processing
Data quality dimensions
Data security principles, including data encryption
Data governance principles and data framework
Big data and distributed systems concepts
Data engineering on the cloud
Essential skills and tools for data engineering interviews and jobs
Data Engineering for Beginners offers an easy-to-read roadmap on a seemingly complicated and intimidating subject. It addresses the topics most likely to cause a beginning data engineer to stumble, clearly explaining key concepts in an accessible way. You'll also find:
A comprehensive glossary of data engineering terms
Common and practical career paths in the data engineering industry
An introduction to key cloud technologies and services you may encounter early in your data engineering career
Perfect for practicing and aspiring data analysts, data scientists, and data engineers, Data Engineering for Beginners is an effective and reliable starting point for learning an in-demand skill. It's a powerful resource for everyone hoping to expand their data engineering skillset and upskill in the big data era.
By: Chisom Nwokwu
Imprint: John Wiley & Sons Inc
Country of Publication: United States
ISBN: 9781394325412
ISBN 10: 139432541X
Series: Tech Today
Pages: 384
Publication Date: 28 October 2025
Audience: Professional and scholarly, Undergraduate
Format: Paperback
Publisher's Status: Forthcoming
Foreword
Introduction

Chapter 1 Understanding Data
A Brief History of Data; Data in 19,000 BCE: The Great Baboon and Abacus; Data in the 1600s: Public Health Statistics; Data in the 1800s: The U.S. Census; Data in the 1900s: The Concept of Storage; Data in the 1990s: Data and the Internet; Types of Data; Structured Data; Unstructured Data; Semi-structured Data; Why Is Data Important?; Healthcare; Supply Chain; Transportation and Logistics; Artificial Intelligence; Data and Information; Summary; Notes

Chapter 2 Introduction to Data Engineering
Data Engineering Explained Using an Oil Refinery Analogy; An Overview of the Data Engineering Life Cycle; Data Storage; Data Ingestion; Data Transformation; Data Serving; Navigating Project Requirements, Engaging Stakeholders, and Delivering Business Value; Requirements Gathering; Understanding Stakeholders; Understanding System Requirements; Delivering Business Value; The Current State of Data Engineering; The Importance of Data Engineering; Summary

Chapter 3 Database Fundamentals
Key Concepts of Databases; Rows; Columns; Schema; Keys; Types of Databases; Relational Databases; NoSQL Databases; Choosing Between Relational and NoSQL Databases; Start With Your Data’s Structure; Think About the Relationships in Your Data; How Fast Do You Need to Move?; How Do You Need to Query Your Data?; Scaling and Performance; Transaction and Strong Consistency Needs; Summary

Chapter 4 SQL Fundamentals
Introduction to SQL; Basic SQL Clauses; Comparison Operators; LIKE Statement; IN Statement; BETWEEN Statement; AND Statement; OR Statement; NOT Statement; IS NULL and IS NOT NULL Statements; Sorting and Limiting; Aggregate Functions; Sum(); Avg(); MAX() and MIN(); Group by; Having; Understanding Joins; Inner Join; Left Join; Right Join; Full Outer Join; Subqueries; Common Table Expressions (CTEs); Set Operations; Window Functions; Lab: Setting Up SQL Server and Running SQL Queries; Best Practices for Writing Efficient SQL Queries; Summary

Chapter 5 Database Design
Data Modeling; Why Do We Need to Model Data?; Types of Data Modeling; Normalization; Rules of Normalization; Downsides of Normalization; Denormalization; Data Modeling Best Practices; Define the Grain; Normalize Now, Denormalize Later; Choose the Right Data Types; Proper Naming Conventions; Database Optimization; Indexing; Partitioning; Sharding; Views; Summary

Chapter 6 Data Warehouses, Data Lakes, and Data Lakehouses
Data Warehouses; Extract, Transform, and Load (ETL); Schema Design; Snowflake Schema; Slowly Changing Dimensions; Data Marts; Benefits of a Data Mart; Challenges with Data Marts; Data Lakes; How Do Data Lakes Work?; Challenges of Data Lakes; Data Lakehouse; Features of a Data Lakehouse; Data Lakehouse Architecture; The Key Differences Between a Database, Data Warehouse, Data Lake, and Data Lakehouse; Summary

Chapter 7 Data Pipelines
Batch Pipelines; Components of a Batch Pipeline; ETL Pipelines vs. ELT Pipelines; Stream Pipelines; How Would This Work?; Components of a Streaming Data Pipeline; Lambda Architecture; Components of the Lambda Architecture; Advantages of the Lambda Architecture; Challenges and Trade-offs; Data Orchestration; Directed Acyclic Graphs (DAGs); Scheduling and Automation; Monitoring; Alerts; Lab: Building an ETL Pipeline and Automating with Apache Airflow; Requirements; Set Up Your Development Environment; Extracting Data from CSV; Transforming the Data; Load the New CSV File into a Postgres Database Instance; Schedule ETL Pipeline with Apache Airflow; Summary

Chapter 8 Data Quality
Bad Data; Dimensions of Data Quality; Accuracy; Completeness; Consistency; Validity; Uniqueness; Timeliness; Accessibility; Relevance; Data Quality Hierarchy; Data Quality Best Practices; Summary

Chapter 9 Data Security
What Is Data Security?; Common Threats to Data Security; Core Principles of Data Security; Confidentiality; Integrity; Availability; Data Encryption; Symmetric Encryption; Asymmetric Encryption; Data Masking; Understanding Network Security; Access Control; Authentication; Authorization; The Principle of Least Privilege; Access Levels; Secrets Management; Data Security and Data Privacy; Summary

Chapter 10 Data Governance
How to Think About Data Governance; Data Governance Framework; Policies; Regulatory Compliance Policy; Data Classification Policy; Data Retention and Disposal Policy; Data Sharing Policy; Processes; Metadata Management; Data Lineage; Incident Management; Master Data Management; Roles in the Data Governance Framework; Data Owner; Data Steward; Data Custodian; Chief Data Officer (CDO); Data Management and Data Governance; Summary

Chapter 11 Big Data and Distributed Systems
The Five V’s of Big Data; Volume; Velocity; Variety; Veracity; Value; Distributed Systems; Scalability; Fault Tolerance; Reliability; Concurrency; Resource Management; Consistency; Availability; Load Balancing; Latency; Distributed Data Processing; Apache Hadoop; Big Data File Types; Avro; Parquet; Optimized Row Columnar (ORC); Choosing the File Type; Summary

Chapter 12 Data Engineering on the Cloud
Cloud Computing; On-Premises; Cloud; Making the Right Choice; Core Cloud Concepts; Storage; Compute; Networking; Cloud Service Models; Infrastructure as a Service; Platform as a Service; Software as a Service; Choosing Between IaaS, PaaS, and SaaS; A Hybrid Approach; Cloud Management Models; Serverless; Managed; Self-Managed; Putting It All Together; Cost Optimization; Understanding Cloud Pricing Models; Rightsizing Resources; Smart Job Scheduling; Storage Optimization; Shutting Down Idle Resources; Use Serverless Where Possible; Monitoring and Alerting; Summary

Chapter 13 Building a Career in Data Engineering
Types of Data Engineering Roles; Types of Data Engineers; Platform Data Engineer; Analytics Data Engineer; AI/ML Data Engineers; Landing Your First Data Engineering Role; A Typical Data Engineering Job Description; How to Build a Winning Résumé; Preparing for a Data Engineering Interview; Thinking Like a Data Engineer; Think in Systems; Learn to Prioritize Data Quality; Design for Failure; Balance Business Context with Technical Choices; Optimize for Clarity, Then Speed; Think Beyond the Tool; Master Automation; Summary

Appendix Sample Interview Questions
SQL; Data Modeling; Data Pipelines; Apache Spark; System Design

Data Engineering Glossary
Index
CHISOM NWOKWU is a big data engineer, multi-published author, and creator specialising in the design and development of scalable data platforms for teams. She’s an Azure Certified Data Engineer Associate who has worked with large international firms, including Microsoft and Bank of America.