Data Engineers
Free Data Engineering Ebooks & Courses
Free Data Engineering Courses

Linked Data Engineering
🎬 Video Lessons
Rating ⭐️: 5 out of 5     
Students πŸ‘¨β€πŸŽ“: 9,973
Duration ⏰:  8 weeks long
Source: openHPI
πŸ”— Course Link  

Data Engineering
Credits ⏳: 15
Duration ⏰: 4 hours
πŸƒβ€β™‚οΈ Self paced       
Source: Google Cloud
πŸ”— Course Link

Data Engineering Essentials using Spark, Python and SQL  
🎬 402 video lessons
πŸƒβ€β™‚οΈ Self paced
Teacher: itversity
Source: YouTube
πŸ”— Course Link  
 
Data engineering with Azure Databricks      
Modules ⏳: 5
Duration ⏰:  4-5 hours worth of material
πŸƒβ€β™‚οΈ Self paced       
Source: Microsoft Ignite
πŸ”— Course Link

Perform data engineering with Azure Synapse Apache Spark Pools      
Modules ⏳: 5
Duration ⏰:  2-3 hours worth of material
πŸƒβ€β™‚οΈ Self paced       
Source:  Microsoft Learn
πŸ”— Course Link

Books
Data Engineering
The Data Engineer's Guide to Apache Spark

All the best πŸ‘πŸ‘
πŸ‘4❀2
πŸ” Mastering Spark: 20 Interview Questions Demystified!

1️⃣ MapReduce vs. Spark: Learn how Spark achieves up to 100x faster performance than MapReduce for in-memory workloads.
2️⃣ RDD vs. DataFrame: Unravel the key differences between RDD and DataFrame, and discover what makes DataFrame unique.
3️⃣ DataFrame vs. Datasets: Delve into the distinctions between DataFrame and Datasets in Spark.
4️⃣ RDD Operations: Explore the various RDD operations that power Spark.
5️⃣ Narrow vs. Wide Transformations: Understand the differences between narrow and wide transformations in Spark.
6️⃣ Shared Variables: Discover the shared variables that facilitate distributed computing in Spark.
7️⃣ Persist vs. Cache: Differentiate between the persist and cache functionalities in Spark.
8️⃣ Spark Checkpointing: Learn about Spark checkpointing and how it differs from persisting to disk.
9️⃣ SparkSession vs. SparkContext: Understand the roles of SparkSession and SparkContext in Spark applications.
πŸ”Ÿ spark-submit Parameters: Explore the parameters to specify in the spark-submit command.
1️⃣1️⃣ Cluster Managers in Spark: Familiarize yourself with the different types of cluster managers available in Spark.
1️⃣2️⃣ Deploy Modes: Learn about the deploy modes in Spark and their significance.
1️⃣3️⃣ Executor vs. Executor Core: Distinguish between executor and executor core in the Spark ecosystem.
1️⃣4️⃣ Shuffling Concept: Gain insights into the shuffling concept in Spark and its importance.
1️⃣5️⃣ Number of Stages in Spark Job: Understand how to decide the number of stages created in a Spark job.
1️⃣6️⃣ Spark Job Execution Internals: Get a peek into how Spark internally executes a program.
1️⃣7️⃣ Direct Output Storage: Explore the possibility of directly storing output without sending it back to the driver.
1️⃣8️⃣ Coalesce and Repartition: Learn about the applications of coalesce and repartition in Spark.
1️⃣9️⃣ Physical and Logical Plan Optimization: Uncover the optimization techniques employed in Spark's physical and logical plans.
2️⃣0️⃣ treeReduce and treeAggregate: Discover why treeReduce and treeAggregate are preferred over reduce and aggregate in certain scenarios.
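The shuffle behind wide transformations (questions 5 and 14) can be sketched in plain Python: records are redistributed so that all records with the same key land in the same partition, which is what makes operations like groupBy expensive. This is an illustrative toy, not Spark's actual implementation; `shuffle_by_key` and the sample data are invented for the example.

```python
# Toy sketch of the shuffle idea behind wide transformations:
# records are redistributed across partitions by hashing their key,
# so all records sharing a key end up in the same partition.

def shuffle_by_key(records, num_partitions):
    """Assign each (key, value) record to a partition by hash of its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

sales = [("east", 10), ("west", 5), ("east", 7), ("north", 3)]
parts = shuffle_by_key(sales, 2)
# Both "east" records land in the same partition, whichever one that is.
```

A narrow transformation (like map or filter) needs no such redistribution, which is why wide transformations create stage boundaries in a Spark job.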

Data Engineering Interview Preparation Resources: https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C
πŸ‘7
The four V's of big data: Volume, Velocity, Variety, Veracity
πŸ₯°7πŸ‘1
Pandas Data Cleaning.pdf
14.9 MB
Data Pipeline Overview
❀7πŸ‘2
We are now on WhatsApp as well

Follow for more data engineering resources: πŸ‘‡ https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C
πŸ‘4❀1πŸ”₯1
πŸ”₯1
Data Engineer Interview Questions for Entry-Level Data Engineers πŸ”₯


1. What are the core responsibilities of a data engineer?

2. Explain the ETL process

3. How do you handle large datasets in a data pipeline?

4. What is the difference between a relational & a non-relational database?

5. Describe how data partitioning improves performance in distributed systems

6. What is a data warehouse & how is it different from a database?

7. How would you design a data pipeline for real-time data processing?

8. Explain the concept of normalization & denormalization in database design

9. What tools do you commonly use for data ingestion, transformation & storage?

10. How do you optimize SQL queries for better performance in data processing?

11. What is the role of Apache Hadoop in big data?

12. How do you implement data security & privacy in data engineering?

13. Explain the concept of data lakes & their importance in modern data architectures

14. What is the difference between batch processing & stream processing?

15. How do you manage & monitor data quality in your pipelines?

16. What are your preferred cloud platforms for data engineering & why?

17. How do you handle schema changes in a production data pipeline?

18. Describe how you would build a scalable & fault-tolerant data pipeline

19. What is Apache Kafka & how is it used in data engineering?

20. What techniques do you use for data compression & storage optimization?
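For question 8, normalization can be shown with a tiny pure-Python sketch: a flat (denormalized) orders table is split into a customers table and an orders table that references it by id, so customer attributes are stored once. All table and column names here are invented for the example.

```python
# Hypothetical denormalized input: customer attributes repeated per order.
flat_orders = [
    {"order_id": 1, "customer": "Ada",  "city": "London", "amount": 30},
    {"order_id": 2, "customer": "Ada",  "city": "London", "amount": 15},
    {"order_id": 3, "customer": "Alan", "city": "Leeds",  "amount": 20},
]

# Normalize: store each customer once, keyed by a surrogate id,
# and have orders reference customers via customer_id.
customers = {}
orders = []
for row in flat_orders:
    entry = customers.setdefault(
        row["customer"], {"id": len(customers) + 1, "city": row["city"]}
    )
    orders.append({
        "order_id": row["order_id"],
        "customer_id": entry["id"],
        "amount": row["amount"],
    })
```

Denormalization is the reverse trade-off: re-joining these tables into one wide table duplicates data but avoids join cost at read time, which is common in analytical warehouses.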
Here are three PySpark questions:


Scenario 1: Data Aggregation


Interviewer: "How would you aggregate data by category and calculate the sum of sales, handling missing values and grouping by multiple columns?"


Candidate:


# Imports for the aggregation
from pyspark.sql.functions import sum, col

# Load the DataFrame
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Handle missing values
df_filled = df.fillna(0)

# Aggregate data: total sales per (category, region)
df_aggregated = df_filled.groupBy("category", "region").agg(sum(col("sales")).alias("total_sales"))

# Sort the results
df_aggregated_sorted = df_aggregated.orderBy("total_sales", ascending=False)

# Save the aggregated DataFrame
df_aggregated_sorted.write.csv("path/to/aggregated/data.csv", header=True)


Scenario 2: Data Transformation


Interviewer: "How would you transform a DataFrame by converting a column to timestamp, handling invalid dates and extracting specific date components?"


Candidate:


# Imports for the transformation
from pyspark.sql.functions import to_timestamp, col, year, month, dayofmonth

# Load the DataFrame
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Convert the string column to a timestamp
df_transformed = df.withColumn("date_column", to_timestamp(col("date_column"), "yyyy-MM-dd"))

# Handle invalid dates (to_timestamp yields null when parsing fails)
df_transformed_filtered = df_transformed.filter(col("date_column").isNotNull())

# Extract date components
df_transformed_extracted = (
    df_transformed_filtered
    .withColumn("year", year(col("date_column")))
    .withColumn("month", month(col("date_column")))
    .withColumn("day", dayofmonth(col("date_column")))
)

# Save the transformed DataFrame
df_transformed_extracted.write.csv("path/to/transformed/data.csv", header=True)

Scenario 3: Data Partitioning


Interviewer: "How would you partition a large DataFrame by date and save it to parquet format, handling data skewness and optimizing storage?"


Candidate:


# Load the DataFrame
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Redistribute rows by range of the date column to reduce skew
df_partitioned = df.repartitionByRange("date_column")

# Save to Parquet, partitioned by date, with snappy compression
# (a single write handles both partitioning and storage optimization)
(
    df_partitioned.write
    .partitionBy("date_column")
    .option("compression", "snappy")
    .parquet("path/to/partitioned/data.parquet")
)

Here, you can find Data Engineering Resources πŸ‘‡
https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C

All the best πŸ‘πŸ‘
πŸ‘6❀5
fundamentals-of-data-engineering.pdf
7.6 MB
πŸš€ A good book to start learning Data Engineering.

⚠ You can download it for free here

βš™ With this practical #book, you'll learn how to plan and build systems that serve the needs of your organization and your customers by evaluating the best technologies available through the framework of the #data #engineering lifecycle.
πŸ‘5❀2
Life of a Data Engineer.....


Business user: Can we add a filter on this dashboard? It will help us track a critical metric.
Me: Sure, this should be a quick one.

Next day:

I opened the dashboard to find the column in its existing data sources. Column not found.

Spent a couple of hours identifying the source of the column and how to bring it into the existing data pipeline that feeds the dashboard (table granularity, join conditions, etc.).

Then came the pipeline changes, data model changes, dashboard changes, and validation/testing.

Finally, deploying to production and a short email to the user that the filter has been added.

A small change in the front end, but a lot of work in the backend to bring that column to life.

Never underestimate data engineers and data pipelines πŸ’ͺ
πŸ‘5πŸ”₯1
Don't aim for this:

SQL - 100%
Python - 0%
PySpark - 0%
Cloud - 0%

Aim for this:

SQL - 25%
Python - 25%
PySpark - 25%
Cloud - 25%

You don't need to know everything straight away.

Here, you can find Data Engineering Resources πŸ‘‡
https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C

All the best πŸ‘πŸ‘
❀9πŸ‘4
πŸ”₯ ETL vs ELT: What's the Difference?

When it comes to data processing, two key approaches stand out: ETL and ELT. Both involve transforming data, but the processes differ significantly!

πŸ”Ή ETL (Extract, Transform, Load)
- Extract data from various sources (databases, APIs, etc.)
- Transform data before loading it into the storage (cleaning, aggregating, formatting)
- Load the transformed data into the data warehouse (DWH)

✏️ Key point: Data is transformed before being loaded into the storage.

πŸ”Ή ELT (Extract, Load, Transform)
- Extract data from sources
- Load raw data into the data warehouse
- Transform the data after it's loaded, using the power of the data warehouse’s computational resources

✏️ Key point: Data is loaded into the storage first, and transformation happens afterward.

🎯 When to use which?
- ETL is ideal for structured data and traditional systems where pre-processing is crucial.
- ELT is better suited for handling large volumes of data in modern cloud-based architectures.

Which one works best for your project? πŸ€”
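The contrast above can be sketched in a few lines of Python, using the stdlib sqlite3 module as a stand-in "warehouse". This is a toy illustration, not a production pattern; the table names and sample data are invented for the example.

```python
import sqlite3

# Invented raw source data: amounts arrive as strings.
raw = [("2024-01-01", "10"), ("2024-01-02", "5"), ("2024-01-02", "7")]

# --- ETL: transform in application code, THEN load the clean result ---
transformed = [(day, int(amount)) for day, amount in raw]  # clean before loading
etl_db = sqlite3.connect(":memory:")
etl_db.execute("CREATE TABLE sales (day TEXT, amount INTEGER)")
etl_db.executemany("INSERT INTO sales VALUES (?, ?)", transformed)

# --- ELT: load raw data FIRST, transform inside the warehouse with SQL ---
elt_db = sqlite3.connect(":memory:")
elt_db.execute("CREATE TABLE raw_sales (day TEXT, amount TEXT)")
elt_db.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw)
elt_db.execute(
    "CREATE TABLE sales AS "
    "SELECT day, CAST(amount AS INTEGER) AS amount FROM raw_sales"
)
```

Both paths end with the same clean table; the difference is where the transformation runs: in your pipeline code (ETL) or in the warehouse's own engine (ELT).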
πŸ‘4πŸ”₯4πŸ₯°1
Join our WhatsApp channel for more data engineering resources
πŸ‘‡πŸ‘‡
https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C
πŸ‘6
Importance of ETL.pdf
3.2 MB
DevOps Engineering
πŸ‘5πŸ”₯2❀1