Data Engineering free courses
Linked Data Engineering
🎬 Video Lessons
Rating ⭐️: 5 out of 5
Students 👨‍🎓: 9,973
Duration ⏰: 8 weeks
Source: openHPI
🔗 Course Link
Data Engineering
Credits ⏳: 15
Duration ⏰: 4 hours
🏃‍♂️ Self-paced
Source: Google Cloud
🔗 Course Link
Data Engineering Essentials using Spark, Python and SQL
🎬 402 video lessons
🏃‍♂️ Self-paced
Teacher: itversity
Resource: YouTube
🔗 Course Link
Data engineering with Azure Databricks
Modules ⏳: 5
Duration ⏰: 4-5 hours of material
🏃‍♂️ Self-paced
Source: Microsoft Ignite
🔗 Course Link
Perform data engineering with Azure Synapse Apache Spark Pools
Modules ⏳: 5
Duration ⏰: 2-3 hours of material
🏃‍♂️ Self-paced
Source: Microsoft Learn
🔗 Course Link
Books
Data Engineering
The Data Engineer's Guide to Apache Spark
All the best 👍👍
🚀 Mastering Spark: 20 Interview Questions Demystified!
1️⃣ MapReduce vs. Spark: Learn how Spark achieves up to 100x faster performance than MapReduce.
2️⃣ RDD vs. DataFrame: Unravel the key differences between RDD and DataFrame, and discover what makes DataFrame unique.
3️⃣ DataFrame vs. Dataset: Delve into the distinctions between DataFrame and Dataset in Spark.
4️⃣ RDD Operations: Explore the various RDD operations that power Spark.
5️⃣ Narrow vs. Wide Transformations: Understand the differences between narrow and wide transformations in Spark.
6️⃣ Shared Variables: Discover the shared variables (broadcast variables and accumulators) that facilitate distributed computing in Spark.
7️⃣ Persist vs. Cache: Differentiate between the persist and cache functionalities in Spark.
8️⃣ Spark Checkpointing: Learn about Spark checkpointing and how it differs from persisting to disk.
9️⃣ SparkSession vs. SparkContext: Understand the roles of SparkSession and SparkContext in Spark applications.
🔟 spark-submit Parameters: Explore the parameters to specify in the spark-submit command.
1️⃣1️⃣ Cluster Managers in Spark: Familiarize yourself with the different types of cluster managers available in Spark.
1️⃣2️⃣ Deploy Modes: Learn about the deploy modes in Spark (client and cluster) and their significance.
1️⃣3️⃣ Executor vs. Executor Core: Distinguish between executors and executor cores in the Spark ecosystem.
1️⃣4️⃣ Shuffling Concept: Gain insights into the shuffling concept in Spark and its importance.
1️⃣5️⃣ Number of Stages in a Spark Job: Understand how the number of stages created in a Spark job is determined.
1️⃣6️⃣ Spark Job Execution Internals: Get a peek into how Spark internally executes a program.
1️⃣7️⃣ Direct Output Storage: Explore the possibility of storing output directly without sending it back to the driver.
1️⃣8️⃣ Coalesce and Repartition: Learn about the applications of coalesce and repartition in Spark.
1️⃣9️⃣ Physical and Logical Plan Optimization: Uncover the optimization techniques employed in Spark's physical and logical plans.
2️⃣0️⃣ treeReduce and treeAggregate: Discover why treeReduce and treeAggregate are preferred over reduceByKey and aggregateByKey in certain scenarios.
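To make a couple of these concrete, here is a minimal, hypothetical PySpark sketch of points 7 (persist vs. cache) and 18 (coalesce vs. repartition). It assumes a local Spark session; the DataFrames are generated on the fly purely for illustration.
# Sketch only: persist vs. cache and coalesce vs. repartition.
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("spark-interview-sketch").getOrCreate()

# cache() is shorthand for persist() at the default storage level
# (MEMORY_AND_DISK for DataFrames); persist() lets you pick the level.
df_cached = spark.range(1_000_000).cache()
df_persisted = spark.range(1_000_000).persist(StorageLevel.DISK_ONLY)

# repartition(n) does a full shuffle and can raise or lower the partition
# count; coalesce(n) only merges existing partitions (no full shuffle),
# so it can only reduce the count -- cheaper when narrowing output files.
wide = df_cached.repartition(200)
narrow = wide.coalesce(10)
print(narrow.rdd.getNumPartitions())  # 10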
Data Engineering Interview Preparation Resources: https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C
We are now on WhatsApp as well
Follow for more data engineering resources: 👉 https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C
Data Engineer Interview Questions for Entry-Level Data Engineers 🔥
1. What are the core responsibilities of a data engineer?
2. Explain the ETL process
3. How do you handle large datasets in a data pipeline?
4. What is the difference between a relational & a non-relational database?
5. Describe how data partitioning improves performance in distributed systems
6. What is a data warehouse & how is it different from a database?
7. How would you design a data pipeline for real-time data processing?
8. Explain the concept of normalization & denormalization in database design
9. What tools do you commonly use for data ingestion, transformation & storage?
10. How do you optimize SQL queries for better performance in data processing?
11. What is the role of Apache Hadoop in big data?
12. How do you implement data security & privacy in data engineering?
13. Explain the concept of data lakes & their importance in modern data architectures
14. What is the difference between batch processing & stream processing?
15. How do you manage & monitor data quality in your pipelines?
16. What are your preferred cloud platforms for data engineering & why?
17. How do you handle schema changes in a production data pipeline?
18. Describe how you would build a scalable & fault-tolerant data pipeline
19. What is Apache Kafka & how is it used in data engineering?
20. What techniques do you use for data compression & storage optimization?
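As a companion to question 19, here is a minimal, hypothetical Kafka sketch in Python. It assumes the kafka-python package and a broker on localhost:9092; the topic name "clickstream" is made up for illustration.
# Sketch only: a tiny producer/consumer pair with kafka-python.
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish one small event to a (hypothetical) topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", key=b"user-42", value=b'{"page": "/home"}')
producer.flush()  # block until buffered records reach the broker

# Consumer: read events back, starting from the earliest offset
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.key, message.value)
    break  # stop after one record in this sketch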
Here are three PySpark questions:
Scenario 1: Data Aggregation
Interviewer: "How would you aggregate data by category and calculate the sum of sales, handling missing values and grouping by multiple columns?"
Candidate:
from pyspark.sql.functions import sum, col

# Load the DataFrame
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)
# Replace missing values with 0 so they do not drop out of the aggregation
df_filled = df.fillna(0)
# Group by multiple columns and sum the sales
df_aggregated = df_filled.groupBy("category", "region").agg(sum(col("sales")).alias("total_sales"))
# Sort the results, highest total first
df_aggregated_sorted = df_aggregated.orderBy("total_sales", ascending=False)
# Save the aggregated DataFrame
df_aggregated_sorted.write.csv("path/to/aggregated/data.csv", header=True)
Scenario 2: Data Transformation
Interviewer: "How would you transform a DataFrame by converting a column to timestamp, handling invalid dates and extracting specific date components?"
Candidate:
from pyspark.sql.functions import to_timestamp, col, year, month, dayofmonth

# Load the DataFrame
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)
# Parse the string column into a timestamp; values that do not match the
# pattern become null
df_transformed = df.withColumn("date_column", to_timestamp(col("date_column"), "yyyy-MM-dd"))
# Drop rows whose dates failed to parse
df_transformed_filtered = df_transformed.filter(col("date_column").isNotNull())
# Extract year, month, and day components
df_transformed_extracted = (df_transformed_filtered
    .withColumn("year", year(col("date_column")))
    .withColumn("month", month(col("date_column")))
    .withColumn("day", dayofmonth(col("date_column"))))
# Save the transformed DataFrame
df_transformed_extracted.write.csv("path/to/transformed/data.csv", header=True)
Scenario 3: Data Partitioning
Interviewer: "How would you partition a large DataFrame by date and save it to parquet format, handling data skewness and optimizing storage?"
Candidate:
# Load the DataFrame
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)
# Range-partition by date so skewed dates are spread more evenly across tasks
df_partitioned = df.repartitionByRange("date_column")
# Save to Parquet, partitioned by date, with snappy compression to optimize
# storage (a single write; writing the same path twice would fail)
(df_partitioned.write
    .mode("overwrite")
    .option("compression", "snappy")
    .parquet("path/to/partitioned/data.parquet", partitionBy=["date_column"]))
Here, you can find Data Engineering Resources 👇
https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C
All the best 👍👍
fundamentals-of-data-engineering.pdf
7.6 MB
📘 A good book to start learning Data Engineering.
✅ You can download it for free here.
“With this practical #book, you'll learn how to plan and build systems to serve the needs of your organization and your customers by evaluating the best technologies available through the framework of the #data #engineering lifecycle.”
Life of a Data Engineer...
Business user: Can we add a filter on this dashboard? It will help us track a critical metric.
Me: Sure, this should be a quick one.
Next day:
I quickly opened the dashboard to find the column in the dashboard's existing data sources -- column not found.
Spent a couple of hours identifying the data source and working out how to bring the column into the existing data pipeline that feeds the dashboard (table granularity, join conditions, etc.).
Then come the pipeline changes, data model changes, dashboard changes, and validation/testing.
Finally, deploying to production and a simple email to the user that the filter has been added.
A small change in the front end, but a lot of work in the backend to bring that column to life.
Never underestimate data engineers and data pipelines 💪
Don't aim for this:
SQL - 100%
Python - 0%
PySpark - 0%
Cloud - 0%
Aim for this:
SQL - 25%
Python - 25%
PySpark - 25%
Cloud - 25%
You don't need to know everything straight away.
Here, you can find Data Engineering Resources 👇
https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C
All the best 👍👍
🔥 ETL vs. ELT: What's the Difference?
When it comes to data processing, two key approaches stand out: ETL and ELT. Both involve transforming data, but the processes differ significantly!
🔹 ETL (Extract, Transform, Load)
- Extract data from various sources (databases, APIs, etc.)
- Transform data before loading it into storage (cleaning, aggregating, formatting)
- Load the transformed data into the data warehouse (DWH)
✔️ Key point: Data is transformed before being loaded into storage.
🔹 ELT (Extract, Load, Transform)
- Extract data from sources
- Load raw data into the data warehouse
- Transform the data after it's loaded, using the data warehouse's computational resources
✔️ Key point: Data is loaded into storage first, and transformation happens afterward.
🎯 When to use which?
- ETL is ideal for structured data and traditional systems where pre-processing is crucial.
- ELT is better suited for handling large volumes of data in modern cloud-based architectures.
Which one works best for your project? 🤔
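To make the contrast concrete, here is a minimal, hypothetical PySpark sketch of both flows. All paths, database, and table names are placeholders.
# Sketch only: the same aggregation done ETL-style and ELT-style.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as sum_

spark = SparkSession.builder.appName("etl-vs-elt").getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS dwh")  # placeholder warehouse schema

raw = spark.read.csv("path/to/raw_orders.csv", header=True, inferSchema=True)

# ETL: transform first, then load only the curated result into the DWH
curated = (raw.dropna(subset=["order_id"])
              .groupBy("region")
              .agg(sum_(col("amount")).alias("total_amount")))
curated.write.mode("overwrite").saveAsTable("dwh.orders_by_region")

# ELT: load the raw data as-is, then transform inside the warehouse
raw.write.mode("overwrite").saveAsTable("dwh.raw_orders")
spark.sql("""
    CREATE OR REPLACE VIEW dwh.orders_by_region_elt AS
    SELECT region, SUM(amount) AS total_amount
    FROM dwh.raw_orders
    WHERE order_id IS NOT NULL
    GROUP BY region
""")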
Join our WhatsApp channel for more data engineering resources
👇👇
https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C