Data Engineers
Free Data Engineering Ebooks & Courses
Data Engineers pinned «Kavitha's Journey to become a Data Engineer 👇👇 1. Startup to Dream Job Journey: - Started at a startup in India, transitioned to Infosys, then grabbed UK opportunity. - Shifted from legacy Mainframe to AWS Cloud, pursued Master's from illinoisstateu, and…»
Frequently asked SQL interview questions for Data Analyst/Data Engineer roles:

1 - What is SQL and what are its main features?
2 - In what order do you write the clauses of a SQL query?
3 - In what order are the clauses of a SQL query executed?
4 - What are some of the most common SQL commands?
5 - What are a primary key and a foreign key?
6 - What are the types of joins, and what output does each produce?
7 - Explain the window functions and the differences between them.
8 - What is a stored procedure?
9 - What is the difference between a stored procedure and a function in SQL?
10 - What is a trigger in SQL?
11 - What is the difference between WHERE and HAVING?
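
For question 11, here is a minimal sketch using Python's built-in sqlite3 module. The orders table and its rows are made up for illustration. It shows WHERE filtering single rows before grouping and HAVING filtering whole groups after aggregation, which also hints at the execution order asked about in question 3.

```python
import sqlite3

# In-memory database with a hypothetical `orders` table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 50), ("alice", 200), ("bob", 30), ("bob", 40), ("carol", 500)],
)

# WHERE filters individual rows before grouping;
# HAVING filters whole groups after aggregation.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount >= 40          -- row-level filter: drops bob's 30
    GROUP BY customer
    HAVING SUM(amount) > 100    -- group-level filter: drops bob (total 40)
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

Moving the `amount >= 40` condition into HAVING would fail the intent here, because HAVING sees only grouped results, not the individual rows.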
โค2๐Ÿ‘1
DevOps Tech Stack
image_2024-05-30_10-00-48.png
2.6 MB
For all Data Engineers out there, here is The State of Data Engineering 2024

Some of the highlights:

✅ More and more, data observability tools are used not just to monitor data sources, but also the infrastructure, pipelines, and systems after data is collected.

✅ Companies are now seeing data observability as essential for their AI projects. Gartner has called it a must-have for AI-ready data.

✅ Like in 2023, Monte Carlo is leading in this area, with G2 naming them the #1 Data Observability Platform. Big organizations like Cisco, American Airlines, and NASDAQ use Monte Carlo to make their AI systems more reliable.
Spark Book.pdf
2.1 MB
Data Engineer Interview Questions.pdf
2.4 MB
Data Engineering Interview Questions 🔥🔥🔥

React ❤️ if you want more content like this
โค13๐Ÿ‘2
Learning SQL is actually a really good skill. It's not just learning SQL the language, but learning the concepts of relational algebra and how to think about data sets, designing schemas, and organizing data.

...

It is about learning file formats and the basics of data storage, data partitioning, and the relationship between the execution engines. All of these things will make you a better dbt user, Snowflake user, or Databricks user.
The number one thing to do as a data engineer? Create high-quality data that people can trust.
Life of a Data Engineer.....


Business user: Can we add a filter to this dashboard? It will help us track a critical metric.
Me: Sure, this should be a quick one.

Next day:

I quickly opened the dashboard to find the column in the existing dashboard's data sources. Column not found.

I spent a couple of hours identifying the data source and working out how to bring the column into the existing data pipeline that feeds the dashboard (table granularity, join conditions, etc.).

Then come the pipeline changes, data model changes, dashboard changes, and validation/testing.

Finally, deployment to production and a simple email to the user saying the filter has been added.

A small change in the front end but a lot of work in the backend to bring that column to life.

Never underestimate data engineers and data pipelines 💪
Data Engineering is not Excel. Not writing ML models. Not “please can you do this quick? I need it asap”.
Complete Python topics required for the Data Engineer role:

➤ Basics of Python:

- Python Syntax
- Data Types
- Lists
- Tuples
- Dictionaries
- Sets
- Variables
- Operators
- Control Structures:
-> if-elif-else
-> Loops
-> Break & Continue
-> try-except block
- Functions
- Modules & Packages
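
Several of the basics above (functions, loops, break and continue, try-except) tend to show up together in small data-cleaning helpers. Here is a minimal sketch; the function name `parse_amounts` and the sample input are made up for illustration.

```python
def parse_amounts(raw_values, limit=3):
    """Convert strings to ints, skipping bad values and stopping at `limit` results."""
    parsed = []
    for value in raw_values:
        if value.strip() == "":
            continue              # skip blank entries
        try:
            parsed.append(int(value))
        except ValueError:
            continue              # skip values that are not integers
        if len(parsed) == limit:
            break                 # stop early once enough values are collected
    return parsed


result = parse_amounts(["10", "", "abc", "20", "30", "40"])
print(result)
```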

➤ Pandas:

- What is Pandas & imports?
- Pandas Data Structures (Series, DataFrame, Index)
- Working with DataFrames:
-> Creating DFs
-> Accessing Data in DFs
-> Filtering & Selecting Data
-> Adding & Removing Columns
-> Merging & Joining in DFs
-> Grouping and Aggregating Data
-> Pivot Tables

- Input/Output Operations with Pandas:
-> Reading & Writing CSV Files
-> Reading & Writing Excel Files
-> Reading & Writing SQL Databases
-> Reading & Writing JSON Files
-> Reading & Writing Text & Binary Files

➤ NumPy:

- What is NumPy & imports?
- NumPy Arrays
- NumPy Array Operations:
-> Creating Arrays
-> Accessing Array Elements
-> Slicing & Indexing
-> Reshaping & Combining Arrays
-> Arithmetic Operations
-> Broadcasting
-> Mathematical Functions
-> Statistical Functions

➤ Basics of Python, Pandas, and NumPy are more than enough for the Data Engineer role.

Data Engineering Interview Preparation Resources: https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C

All the best 👍👍
Preparing for a Spark Interview? Here are 20 Key Differences You Should Know!

1. Repartition vs. Coalesce: Repartition can increase or decrease the number of partitions and always performs a full shuffle, while Coalesce only reduces the number of partitions and avoids a full shuffle.

2. Sort By vs. Order By: Sort By sorts data within each partition, so the final result may be only partially ordered if multiple reducers are used. Order By guarantees total order across all partitions in the final output.

3. RDD vs. Datasets vs. DataFrames: RDDs are the low-level basic abstraction, DataFrames organize structured data into named columns that the Catalyst optimizer can work on, and Datasets add compile-time type safety (Scala and Java).

4. Broadcast Join vs. Shuffle Join vs. Sort Merge Join: Broadcast Join sends a small table to every executor so the large table is not shuffled, Shuffle Join redistributes both sides by join key, and Sort Merge Join shuffles and sorts both sides by key before merging them.

5. Spark Session vs. Spark Context: Spark Session is the entry point in Spark 2.0+, combining the functionality of Spark Context and SQL Context.

6. Executor vs. Executor Core: An Executor is a process on a worker node that runs tasks and holds cached data, while an Executor Core is a slot inside an executor that runs one task at a time, so the core count sets how many tasks an executor runs in parallel.

7. DAG vs. Lineage: The DAG (Directed Acyclic Graph) is the execution plan of stages and tasks, while Lineage records the chain of transformations that produced an RDD so lost partitions can be recomputed for fault tolerance.

8. Transformation vs. Action: A Transformation lazily defines a new RDD/Dataset/DataFrame from an existing one, while an Action triggers execution and returns results to the driver or writes them out.

9. Narrow Transformation vs. Wide Transformation: In a Narrow transformation each input partition feeds at most one output partition (e.g., map, filter), while a Wide transformation involves shuffling data across partitions (e.g., groupByKey, join).

10. Lazy Evaluation vs. Eager Evaluation: Spark delays execution until an action is called (Lazy), which lets it optimize the whole plan. Eager evaluation would run each operation as soon as it is defined.

11. Window Functions vs. Group By: Window Functions compute a value for every row over a range of related rows and keep all rows, while Group By collapses each group into a single summary row.

12. Partitioning vs. Bucketing: Partitioning splits data into separate directories by the values of a column, while Bucketing hashes a column into a fixed number of buckets, which are not necessarily equal in size.

13. Avro vs. Parquet vs. ORC: Avro is row-based with schema, Parquet and ORC are columnar formats optimized for query speed.

14. Client Mode vs. Cluster Mode: Client mode runs the driver in the client process that submitted the application, while Cluster mode deploys the driver to the cluster.

15. Serialization vs. Deserialization: Serialization converts data to a byte stream, while Deserialization reconstructs data from a byte stream.

16. DAG Scheduler vs. Task Scheduler: The DAG Scheduler divides a job into stages, while the Task Scheduler assigns the tasks of each stage to executors.

17. Accumulators vs. Broadcast Variables: Accumulators aggregate values from workers to the driver, while Broadcast Variables efficiently ship read-only variables to the workers.

18. Cache vs. Persist: Cache stores an RDD/Dataset/DataFrame at the default storage level (memory only for RDDs, memory and disk for Datasets/DataFrames), while Persist allows choosing the storage level (memory, disk, etc.).

19. Internal Table vs. External Table: For an Internal (managed) table Spark manages both the metadata and the data, so dropping the table deletes the data. For an External table Spark manages only the metadata, and the data stays at its external location when the table is dropped.

20. Executor vs. Driver: Executors run tasks on worker nodes, while the Driver plans and coordinates job execution.
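
The lazy evaluation point above can be illustrated without a cluster. The sketch below is a plain-Python analogy using generators, not Spark API: building the pipeline plays the role of transformations, and consuming it with `sum` plays the role of an action. The `log` list and `read_numbers` function are made up for illustration.

```python
log = []  # records when work actually happens


def read_numbers():
    """Hypothetical data source that notes every element it produces."""
    for n in range(1, 6):
        log.append(f"read {n}")
        yield n


# "Transformations": building the pipeline does no work yet.
squared = (n * n for n in read_numbers())
evens = (n for n in squared if n % 2 == 0)
work_before_action = list(log)  # snapshot taken before anything is consumed

# "Action": consuming the pipeline triggers the whole chain.
total = sum(evens)
print(work_before_action, total, len(log))
```

The snapshot taken before the action is empty, which is the same behavior Spark shows when transformations are chained without an action.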

Here are 20 real-time Spark scenario-based questions

1. Data Processing Optimization: How would you optimize a Spark job that processes 1 TB of data daily to reduce execution time and cost?

2. Handling Skewed Data: In a Spark job, one partition is taking significantly longer to process due to skewed data. How would you handle this situation?

3. Streaming Data Pipeline: Describe how you would set up a real-time data pipeline using Spark Structured Streaming to process and analyze clickstream data from a website.

4. Fault Tolerance: How does Spark handle node failures during a job, and what strategies would you use to ensure data processing continues smoothly?

5. Data Join Strategies: You need to join two large datasets in Spark, but you encounter memory issues. What strategies would you employ to handle this?

6. Checkpointing: Explain the role of checkpointing in Spark Streaming and how you would implement it in a real-time application.

7. Stateful Processing: Describe a scenario where you would use stateful processing in Spark Streaming and how you would implement it.

8. Performance Tuning: What are the key parameters you would tune in Spark to improve the performance of a real-time analytics application?

9. Window Operations: How would you use window operations in Spark Streaming to compute rolling averages over a sliding window of events?

10. Handling Late Data: In a Spark Streaming job, how would you handle late-arriving data to ensure accurate results?

11. Integration with Kafka: Describe how you would integrate Spark Streaming with Apache Kafka to process real-time data streams.

12. Backpressure Handling: How does Spark handle backpressure in a streaming application, and what configurations can you use to manage it?

13. Data Deduplication: How would you implement data deduplication in a Spark Streaming job to ensure unique records?

14. Cluster Resource Management: How would you manage cluster resources effectively to run multiple concurrent Spark jobs without contention?

15. Real-Time ETL: Explain how you would design a real-time ETL pipeline using Spark to ingest, transform, and load data into a data warehouse.

16. Handling Large Files: You have a Spark job that needs to process very large files (e.g., 100 GB). How would you optimize the job to handle such files efficiently?

17. Monitoring and Debugging: What tools and techniques would you use to monitor and debug a Spark job running in production?

18. Delta Lake: How would you use Delta Lake with Spark to manage real-time data lakes and ensure data consistency?

19. Partitioning Strategy: How would you design an effective partitioning strategy for a large dataset?

20. Data Serialization: What serialization formats would you use in Spark for real-time data processing, and why?
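
For question 2 (skewed data), one commonly discussed technique is key salting. The sketch below is plain Python rather than Spark code, and everything in it is an assumption for illustration: `toy_partition` is a toy stand-in for a hash partitioner, and the salts are assigned round-robin to keep the example deterministic (random salts are also common).

```python
from collections import Counter

NUM_PARTITIONS = 4
NUM_SALTS = 8


def toy_partition(key: str) -> int:
    """Toy stand-in for a hash partitioner (deterministic, for illustration only)."""
    return sum(ord(ch) for ch in key) % NUM_PARTITIONS


# Skewed input: one hot key dominates the dataset.
keys = ["hot"] * 90 + [f"k{i}" for i in range(10)]

# Without salting, every "hot" record lands in the same partition.
plain_load = Counter(toy_partition(k) for k in keys)

# With salting, a suffix spreads the hot key over several partitions.
salted_keys = [f"{k}_{i % NUM_SALTS}" for i, k in enumerate(keys)]
salted_load = Counter(toy_partition(k) for k in salted_keys)

print(max(plain_load.values()), max(salted_load.values()))
```

In a real job the salt has to be undone afterwards: an aggregation needs a second pass that combines the per-salt partial results, and a join needs the other side replicated once per salt value.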

Cisco Kafka interview questions for Data Engineers 2024.

➤ How do you create a topic in Kafka using the Confluent CLI?
➤ Explain the role of the Schema Registry in Kafka.
➤ How do you register a new schema in the Schema Registry?
➤ What is the importance of key-value messages in Kafka?
➤ Describe a scenario where using a random key for messages is beneficial.
➤ Provide an example where using a constant key for messages is necessary.
➤ Write simple Kafka producer code that sends JSON messages to a topic.
➤ How do you serialize a custom object before sending it to a Kafka topic?
➤ Describe how you can handle serialization errors in Kafka producers.
➤ Write Kafka consumer code that reads messages from a topic and deserializes them from JSON.
➤ How do you handle deserialization errors in Kafka consumers?
➤ Explain the process of deserializing messages into custom objects.
➤ What is a consumer group in Kafka, and why is it important?
➤ Describe a scenario where multiple consumer groups are used for a single topic.
➤ How does Kafka ensure load balancing among consumers in a group?
➤ How do you send JSON data to a Kafka topic and ensure it is properly serialized?
➤ Describe the process of consuming JSON data from a Kafka topic and converting it to a usable format.
➤ Explain how you can work with CSV data in Kafka, including serialization and deserialization.
➤ Write a Kafka producer code snippet that sends CSV data to a topic.
➤ Write a Kafka consumer code snippet that reads and processes CSV data from a topic.
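
Several of the questions above come down to turning records into bytes and back. Here is a minimal sketch of that part only, using the Python standard library; no broker or Kafka client is involved, and the function names are made up for illustration. With a client such as kafka-python, functions like these are typically passed as `value_serializer` and `value_deserializer`.

```python
import csv
import io
import json


def serialize_json(record: dict) -> bytes:
    """Encode a dict as UTF-8 JSON bytes, the form a message value takes on the wire."""
    return json.dumps(record).encode("utf-8")


def deserialize_json(payload: bytes):
    """Decode JSON bytes; return None for malformed payloads instead of raising."""
    try:
        return json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None  # a real consumer might log this or route it to a dead-letter topic


def serialize_csv(row: list) -> bytes:
    """Encode one row as a single CSV line, quoting fields where needed."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(row)
    return buffer.getvalue().encode("utf-8")


def deserialize_csv(payload: bytes) -> list:
    """Decode a single CSV line back into a list of strings."""
    return next(csv.reader(io.StringIO(payload.decode("utf-8"))))


event = {"user_id": 42, "action": "click"}
json_round_trip = deserialize_json(serialize_json(event))
csv_round_trip = deserialize_csv(serialize_csv([42, "click", "a,b"]))
bad_payload = deserialize_json(b"{not json")
print(json_round_trip, csv_round_trip, bad_payload)
```

Note that CSV carries no type information, so the integer 42 comes back as the string "42". That is one reason schema-based formats and the Schema Registry appear earlier in this list.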

ETL Using Pyspark.pdf
2.2 MB