Data Engineers
Free Data Engineering Ebooks & Courses
SQL is composed of five key components:

๐ƒ๐ƒ๐‹ (๐ƒ๐š๐ญ๐š ๐ƒ๐ž๐Ÿ๐ข๐ง๐ข๐ญ๐ข๐จ๐ง ๐‹๐š๐ง๐ ๐ฎ๐š๐ ๐ž): Commands like CREATE, ALTER, DROP for defining and modifying database structures.
๐ƒ๐๐‹ (๐ƒ๐š๐ญ๐š ๐๐ฎ๐ž๐ซ๐ฒ ๐‹๐š๐ง๐ ๐ฎ๐š๐ ๐ž): Commands like SELECT for querying and retrieving data.
๐ƒ๐Œ๐‹ (๐ƒ๐š๐ญ๐š ๐Œ๐š๐ง๐ข๐ฉ๐ฎ๐ฅ๐š๐ญ๐ข๐จ๐ง ๐‹๐š๐ง๐ ๐ฎ๐š๐ ๐ž): Commands like INSERT, UPDATE, DELETE for modifying data.
๐ƒ๐‚๐‹ (๐ƒ๐š๐ญ๐š ๐‚๐จ๐ง๐ญ๐ซ๐จ๐ฅ ๐‹๐š๐ง๐ ๐ฎ๐š๐ ๐ž): Commands like GRANT, REVOKE for managing access permissions.
๐“๐‚๐‹ (๐“๐ซ๐š๐ง๐ฌ๐š๐œ๐ญ๐ข๐จ๐ง ๐‚๐จ๐ง๐ญ๐ซ๐จ๐ฅ ๐‹๐š๐ง๐ ๐ฎ๐š๐ ๐ž): Commands like COMMIT, ROLLBACK for managing transactions.

If you're an engineer, you'll likely need a solid understanding of all these components. If you're a data analyst, focusing on DQL will be more relevant. Tailor your learning to the topics that best fit your role.
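
To make the five components concrete, here is a minimal sketch using Python's built-in sqlite3 module (the table and column names are made up; SQLite has no user accounts, so the DCL example is shown only as a comment in PostgreSQL syntax):

import sqlite3

# Throwaway in-memory database for illustration
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the structure
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

# DML: modify the data
cur.execute("INSERT INTO employees (name, salary) VALUES (?, ?)", ("Asha", 75000.0))
cur.execute("UPDATE employees SET salary = salary * 1.1 WHERE name = ?", ("Asha",))

# TCL: commit the transaction (conn.rollback() would undo it instead)
conn.commit()

# DQL: query the data
for row in cur.execute("SELECT id, name, salary FROM employees"):
    print(row)

# DCL: GRANT/REVOKE are engine-specific and SQLite has no users;
# in PostgreSQL it would look like: GRANT SELECT ON employees TO analyst_role;

conn.close()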
๐Ÿ‘5๐Ÿ”ฅ4
โค8๐Ÿ”ฅ1
Data Engineering free courses

Linked Data Engineering
🎬 Video Lessons
Rating ⭐️: 5 out of 5
Students 👨‍🎓: 9,973
Duration ⏰: 8 weeks long
Source: openHPI
🔗 Course Link

Data Engineering
Credits ⏳: 15
Duration ⏰: 4 hours
🏃‍♂️ Self-paced
Source: Google Cloud
🔗 Course Link

Data Engineering Essentials using Spark, Python and SQL
🎬 402 video lessons
🏃‍♂️ Self-paced
Teacher: itversity
Source: YouTube
🔗 Course Link

Data Engineering with Azure Databricks
Modules ⏳: 5
Duration ⏰: 4-5 hours worth of material
🏃‍♂️ Self-paced
Source: Microsoft Ignite
🔗 Course Link

Perform data engineering with Azure Synapse Apache Spark Pools
Modules ⏳: 5
Duration ⏰: 2-3 hours worth of material
🏃‍♂️ Self-paced
Source: Microsoft Learn
🔗 Course Link

Books
Data Engineering
The Data Engineer's Guide to Apache Spark

All the best 👍👍
๐Ÿ‘4โค2
๐Ÿ” Mastering Spark: 20 Interview Questions Demystified!

1๏ธโƒฃ MapReduce vs. Spark: Learn how Spark achieves 100x faster performance compared to MapReduce.
2๏ธโƒฃ RDD vs. DataFrame: Unravel the key differences between RDD and DataFrame, and discover what makes DataFrame unique.
3๏ธโƒฃ DataFrame vs. Datasets: Delve into the distinctions between DataFrame and Datasets in Spark.
4๏ธโƒฃ RDD Operations: Explore the various RDD operations that power Spark.
5๏ธโƒฃ Narrow vs. Wide Transformations: Understand the differences between narrow and wide transformations in Spark.
6๏ธโƒฃ Shared Variables: Discover the shared variables that facilitate distributed computing in Spark.
7๏ธโƒฃ Persist vs. Cache: Differentiate between the persist and cache functionalities in Spark.
8๏ธโƒฃ Spark Checkpointing: Learn about Spark checkpointing and how it differs from persisting to disk.
9๏ธโƒฃ SparkSession vs. SparkContext: Understand the roles of SparkSession and SparkContext in Spark applications.
๐Ÿ”Ÿ spark-submit Parameters: Explore the parameters to specify in the spark-submit command.
1๏ธโƒฃ1๏ธโƒฃ Cluster Managers in Spark: Familiarize yourself with the different types of cluster managers available in Spark.
1๏ธโƒฃ2๏ธโƒฃ Deploy Modes: Learn about the deploy modes in Spark and their significance.
1๏ธโƒฃ3๏ธโƒฃ Executor vs. Executor Core: Distinguish between executor and executor core in the Spark ecosystem.
1๏ธโƒฃ4๏ธโƒฃ Shuffling Concept: Gain insights into the shuffling concept in Spark and its importance.
1๏ธโƒฃ5๏ธโƒฃ Number of Stages in Spark Job: Understand how to decide the number of stages created in a Spark job.
1๏ธโƒฃ6๏ธโƒฃ Spark Job Execution Internals: Get a peek into how Spark internally executes a program.
1๏ธโƒฃ7๏ธโƒฃ Direct Output Storage: Explore the possibility of directly storing output without sending it back to the driver.
1๏ธโƒฃ8๏ธโƒฃ Coalesce and Repartition: Learn about the applications of coalesce and repartition in Spark.
1๏ธโƒฃ9๏ธโƒฃ Physical and Logical Plan Optimization: Uncover the optimization techniques employed in Spark's physical and logical plans.
2๏ธโƒฃ0๏ธโƒฃ Treereduce and Treeaggregate: Discover why treereduce and treeaggregate are preferred over reduceByKey and aggregateByKey in certain scenarios.

Data Engineering Interview Preparation Resources: https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C
๐Ÿ‘7
The four V's of big data: Volume, Velocity, Variety, Veracity
๐Ÿฅฐ7๐Ÿ‘1
Pandas Data Cleaning.pdf (14.9 MB)
โค7๐Ÿ‘1
Data Pipeline Overview
โค7๐Ÿ‘2
We are now on WhatsApp as well

Follow for more data engineering resources: 👇 https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C
๐Ÿ‘4โค1๐Ÿ”ฅ1
๐Ÿ”ฅ1
Interview Questions for Entry-Level Data Engineers 🔥


1. What are the core responsibilities of a data engineer?

2. Explain the ETL process

3. How do you handle large datasets in a data pipeline?

4. What is the difference between a relational & a non-relational database?

5. Describe how data partitioning improves performance in distributed systems

6. What is a data warehouse & how is it different from a database?

7. How would you design a data pipeline for real-time data processing? (see the sketch after this list)

8. Explain the concept of normalization & denormalization in database design

9. What tools do you commonly use for data ingestion, transformation & storage?

10. How do you optimize SQL queries for better performance in data processing?

11. What is the role of Apache Hadoop in big data?

12. How do you implement data security & privacy in data engineering?

13. Explain the concept of data lakes & their importance in modern data architectures

14. What is the difference between batch processing & stream processing?

15. How do you manage & monitor data quality in your pipelines?

16. What are your preferred cloud platforms for data engineering & why?

17. How do you handle schema changes in a production data pipeline?

18. Describe how you would build a scalable & fault-tolerant data pipeline

19. What is Apache Kafka & how is it used in data engineering?

20. What techniques do you use for data compression & storage optimization?
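
For question 7, a minimal Structured Streaming sketch (the broker address, topic name, and paths are placeholders; assumes a live SparkSession named spark with the Kafka connector on the classpath):

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder address
    .option("subscribe", "events")                      # placeholder topic
    .load()
)

# Parse and write out micro-batches; the checkpoint directory is what makes
# the pipeline fault-tolerant (relevant to question 18 as well).
query = (
    stream.selectExpr("CAST(value AS STRING) AS raw_event")
    .writeStream
    .format("parquet")
    .option("path", "path/to/sink")
    .option("checkpointLocation", "path/to/checkpoints")
    .start()
)
# query.awaitTermination() would block until the stream is stopped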
โค4
Here are three PySpark interview scenarios:


Scenario 1: Data Aggregation


Interviewer: "How would you aggregate data by category and calculate the sum of sales, handling missing values and grouping by multiple columns?"


Candidate:


# Imports first (note: pyspark's sum shadows the Python builtin)
from pyspark.sql.functions import sum, col

# Load the DataFrame
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Replace missing values with 0 (applies to numeric columns)
df_filled = df.fillna(0)

# Group by multiple columns and sum the sales
df_aggregated = df_filled.groupBy("category", "region").agg(sum(col("sales")).alias("total_sales"))

# Sort the results, highest total first
df_aggregated_sorted = df_aggregated.orderBy("total_sales", ascending=False)

# Save the aggregated DataFrame
df_aggregated_sorted.write.csv("path/to/aggregated/data.csv", header=True)


Scenario 2: Data Transformation


Interviewer: "How would you transform a DataFrame by converting a column to timestamp, handling invalid dates and extracting specific date components?"


Candidate:


from pyspark.sql.functions import to_timestamp, col, year, month, dayofmonth

# Load the DataFrame
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Convert the column to timestamp; values that don't match the format become null
df_transformed = df.withColumn("date_column", to_timestamp(col("date_column"), "yyyy-MM-dd"))

# Handle invalid dates by dropping rows that failed to parse
df_transformed_filtered = df_transformed.filter(col("date_column").isNotNull())

# Extract year, month, and day components
df_transformed_extracted = (
    df_transformed_filtered
    .withColumn("year", year(col("date_column")))
    .withColumn("month", month(col("date_column")))
    .withColumn("day", dayofmonth(col("date_column")))
)

# Save the transformed DataFrame
df_transformed_extracted.write.csv("path/to/transformed/data.csv", header=True)

Scenario 3: Data Partitioning


Interviewer: "How would you partition a large DataFrame by date and save it to parquet format, handling data skewness and optimizing storage?"


Candidate:


# Load the DataFrame
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Range-partition by date so skewed dates are spread more evenly across partitions
df_partitioned = df.repartitionByRange("date_column")

# Save to parquet, partitioned on disk by date, with snappy compression
# (a single write: writing twice to the same path would fail)
(
    df_partitioned.write
    .option("compression", "snappy")
    .parquet("path/to/partitioned/data.parquet", partitionBy=["date_column"])
)

Here, you can find Data Engineering Resources 👇
https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C

All the best 👍👍
๐Ÿ‘6โค5
fundamentals-of-data-engineering.pdf (7.6 MB)

🚀 A good book to start learning Data Engineering.

You can download it for free here.

⚙️ With this practical #book, you'll learn how to plan and build systems that serve the needs of your organization and your customers, by evaluating the best technologies available through the framework of the #data #engineering lifecycle.
๐Ÿ‘5โค2
Life of a Data Engineer...

Business user: Can we add a filter to this dashboard? It will help us track a critical metric.
Me: Sure, this should be a quick one.

Next day:

I opened the dashboard to find the column in the existing dashboard's data sources -- column not found.

Spent a couple of hours identifying the source of the column and how to bring it into the existing data pipeline that feeds the dashboard (table granularity, join conditions, etc.).

Then came the pipeline changes, data model changes, dashboard changes, and validation/testing.

Finally, deploying to production and a short email to the user that the filter has been added.

A small change on the front end, but a lot of work on the back end to bring that column to life.

Never underestimate data engineers and data pipelines 💪
๐Ÿ‘5๐Ÿ”ฅ1
Don't aim for this:

SQL - 100%
Python - 0%
PySpark - 0%
Cloud - 0%

Aim for this:

SQL - 25%
Python - 25%
PySpark - 25%
Cloud - 25%

You don't need to know everything straight away.

Here, you can find Data Engineering Resources 👇
https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C

All the best 👍👍
โค9๐Ÿ‘4
🔥 ETL vs ELT: What's the Difference?

When it comes to data processing, two key approaches stand out: ETL and ELT. Both involve transforming data, but the processes differ significantly!

🔹 ETL (Extract, Transform, Load)
- Extract data from various sources (databases, APIs, etc.)
- Transform the data before loading it (cleaning, aggregating, formatting)
- Load the transformed data into the data warehouse (DWH)

✔️ Key point: Data is transformed before it reaches the warehouse.

🔹 ELT (Extract, Load, Transform)
- Extract data from sources
- Load the raw data into the data warehouse
- Transform the data after it's loaded, using the warehouse's own computational resources

✔️ Key point: Data is loaded first, and transformation happens afterward.

🎯 When to use which?
- ETL is ideal for structured data and traditional systems where pre-processing is crucial.
- ELT is better suited for handling large volumes of data in modern cloud-based architectures.
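
A rough side-by-side in PySpark (a minimal sketch; the paths and table names are made up, and the "warehouse/" directory is just a stand-in for your DWH):

raw = spark.read.csv("path/to/raw.csv", header=True, inferSchema=True)

# ETL: transform in Spark first, then load the finished table into the warehouse
cleaned = raw.dropna().groupBy("category").sum("sales")
cleaned.write.parquet("warehouse/sales_by_category")

# ELT: land the raw data first, then transform inside the warehouse with SQL
raw.write.parquet("warehouse/raw_sales")
spark.read.parquet("warehouse/raw_sales").createOrReplaceTempView("raw_sales")
transformed = spark.sql(
    "SELECT category, SUM(sales) AS total_sales FROM raw_sales GROUP BY category"
)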

Which one works best for your project? 🤔
๐Ÿ‘4๐Ÿ”ฅ4๐Ÿฅฐ1
Join our WhatsApp channel for more data engineering resources
👇👇
https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C
๐Ÿ‘6
Importance of ETL.pdf (3.2 MB)
๐Ÿ‘11