Data Engineers – Telegram

Data Engineers

9.27K subscribers

314 photos

76 files

298 links

Free Data Engineering Ebooks & Courses

Download Telegram

About

Blog

Apps

Platform

9.27K subscribers

Working with PySpark Aggregations

What are Aggregations?

Aggregations in PySpark allow you to transform large datasets by computing statistics across specified groups. PySpark offers built-in functions for common aggregations, such as sum, avg, min, max, count, and more.

Common Aggregation Methods in PySpark

1. groupBy(): Groups data by one or more columns and allows applying aggregation functions on each group.

2. agg(): Lets you apply multiple aggregation functions simultaneously.

3. count(): Counts the number of non-null entries.

4. sum(): Adds up the values in a column.

5. avg(): Computes the average of a column.

Example: Using groupBy() and Aggregations

Let’s say you have a DataFrame with sales data and want to calculate the total and average sales per salesperson.

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, avg

# Create Spark session
spark = SparkSession.builder.appName("AggregationExample").getOrCreate()

# Sample data
data = [("Alice", 100), ("Alice", 150), ("Bob", 200), ("Bob", 300)]
df = spark.createDataFrame(data, ["Salesperson", "Sales_Amount"])

# Aggregating data
agg_df = df.groupBy("Salesperson").agg(
sum("Sales_Amount").alias("Total_Sales"),
avg("Sales_Amount").alias("Avg_Sales")
)

agg_df.show()

In this example, we used groupBy("Salesperson") to group the data by each salesperson, and agg() to calculate the total and average sales for each.

Real-World Example: Aggregating Product Sales Data

Imagine you're analyzing sales data for a retail store. You might want to know the total sales per product category, the highest and lowest sales amounts, or the average sales per transaction. Aggregations allow you to gain these insights quickly:

# Group by product category and calculate total and average sales
sales_df.groupBy("Product_Category").agg(
sum("Sales_Amount").alias("Total_Sales"),
avg("Sales_Amount").alias("Avg_Sales")
).show()

Advanced Aggregation Functions

countDistinct(): Counts unique values in a column.

df.groupBy("Salesperson").agg(countDistinct("Product_ID").alias("Unique_Products_Sold")).show()

approx_count_distinct(): Uses an approximate algorithm to count distinct values, useful for very large datasets.

from pyspark.sql.functions import approx_count_distinct
df.agg(approx_count_distinct("Product_ID")).show()

Windowed Aggregations

Sometimes, aggregations are performed over a “window” rather than over the entire dataset or specific groups. We’ve covered window functions, but it’s useful to know they can be combined with aggregations for tasks like rolling averages.

👍6

3.52K views09:22

Working with PySpark Aggregations What are Aggregations? Aggregations in PySpark allow you to transform large datasets by computing statistics across specified groups. PySpark offers built-in functions for common aggregations, such as sum, avg, min, max…

Interview Questions

1. What are some common aggregation functions in PySpark, and how are they used?

2. Explain the difference between groupBy() and agg() in PySpark.

👍3

2.93K views08:43

Life of a Data Engineer.....

Business user : Can we add a filter on this dashboard. This will help us track a critical metric.
me : sure this should be a quick one.

Next day :

I quickly opened the dashboard to find the column in the existing dashboard's data sources. -- column not found

Spent a couple of hours to identify the data source and how to bring the column into the existence data pipeline which feeds the dashboard( table granularity , join condition etc..).

Then comes the pipeline changes , data model changes , dashboard changes , validation/testing.

Finally deploying to production and a simple email to the user that the filter has been added.

A small change in the front end but a lot of work in the backend to bring that column to life.

Never underestimate data engineers and data pipelines 💪

👍4❤2

2.95K views12:25

Top 100 SQL Interview Questions.pdf

Top 100 SQL Interview Questions.pdf

👍6

3.21K views12:57

SQL ASSIGNMENT

#Check your fundamental knowledge

👏4

9.82K viewsedited 07:11

𝗠𝗮𝘀𝘁𝗲𝗿 𝗦𝗤𝗟 𝗳𝗼𝗿 𝗜𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄𝘀, 𝗙𝗮𝘀𝘁!

Here are 10 must-know SQL concepts:

● Stored Procedure vs. Function
Procedures allow DML; functions handle calculations only.

● Clustered vs. Non-Clustered Index
Clustered sorts data physically; non-clustered creates pointers.

● DELETE vs. TRUNCATE
DELETE is row-specific; TRUNCATE clears all rows fast.

● WHERE vs. HAVING
WHERE filters rows; HAVING filters after GROUP BY.

● Primary Key vs. Unique Key
Primary is unique & non-null; Unique allows one null.

● JOIN Types
INNER, LEFT, RIGHT, FULL JOIN—combine tables in different ways.

● Normalization Forms
Minimizes redundancy and improves data integrity.

● ACID Properties
Ensures reliable transactions with Atomicity, Consistency, Isolation, Durability.

● Indexes
Speeds up data retrieval; careful use is key.

● Subqueries
Nest queries within queries for flexible data retrieval.

Master these, and you’re SQL-interview ready!

👍10❤2

2.76K views13:21

📌10 intermediate-level SQL interview questions

1. How would you find the nth highest salary in a table?
2. What is the difference between JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN?
3. How would you calculate cumulative sum in SQL?
4. How do you identify duplicate records in a table?
5. Explain the concept of a window function and give examples.
6. How would you retrieve records between two dates in SQL?
7. What is the difference between UNION and UNION ALL?
8. How can you pivot data in SQL?
9. Explain the use of CASE statements in SQL.
10. How do you use common table expressions (CTEs)?

#sql

👍2

2.66K views02:31

The purpose of Data Normalisation in a database.

Anonymous Quiz

To reduce data redundancy and improve data integrity

To combine multiple tables into one

To speed up data retrieval without using indexes

To create a backup of the data

❤7

308 voters2.63K views10:41

📝 Interview Tip of the Day!

💡 Know the Basics of SQL: Expect SQL questions on joins, group by, and subqueries.

✅ Pro Tip: Practice writing clean, efficient SQL code.

🗣 Prepare: Be ready to walk through SQL logic verbally!

❤5

2.33K views19:24

What is CRUD?

CRUD stands for Create, Read, Update, and Delete. It represents the basic operations that can be performed on data in a database.

Examples in SQL:

1. Create:
Adding new records to a table.

    INSERT INTO students (id, name, age)
    VALUES (1, 'John Doe', 20);

2. Read:
Retrieving data from a table.

    SELECT * FROM students;

3. Update:

Modifying existing records.

    UPDATE students
    SET age = 21
    WHERE id = 1;

4. Delete:

Removing records.

DELETE FROM students
WHERE id = 1;

👍11❤4

2.56K views09:04

⏰ MySQL Data Types

MySQL provides a variety of data types to store different kinds of data. These are categorized into three main groups:

1. Numeric Data Types:
- INT, BIGINT, SMALLINT, TINYINT: For whole numbers.
- DECIMAL, FLOAT, DOUBLE: For real numbers with decimal points.
- BIT: For binary values.
- Example:

            CREATE TABLE numeric_example (
                id INT,
                amount DECIMAL(10, 2)
            );

1. String Data Types:
- CHAR, VARCHAR: For fixed and variable-length strings.
- TEXT: For large text.
- BLOB: For binary large objects like images.
- Example:

            CREATE TABLE string_example (
                name VARCHAR(100),
                description TEXT
            );

1. Date and Time Data Types:
- DATE, DATETIME, TIMESTAMP: For date and time values.
- YEAR: For storing a year.
- Example:

                CREATE TABLE datetime_example (
                    created_at DATETIME,
                    year_of_joining YEAR
                );

Interview Questions:

- Q1: What is the difference between CHAR and VARCHAR?
A1: CHAR has a fixed length, while VARCHAR has a variable length. VARCHAR is more storage-efficient for varying-length data.
- Q2: When should you use DECIMAL instead of FLOAT?
A2: Use DECIMAL for precise calculations (e.g., financial data) and FLOAT for approximate values where precision is less critical.

👍2

2.85K viewsedited 10:35

SQL Basics to Advanced Q&A.pdf

👍2🔥2

2.51K views09:23

📊 How to Present Data Projects Effectively!

💡 Start with a Clear Objective: Clearly define the purpose of your presentation at the outset to set expectations and context.

✅ Pro Tip: A strong opening statement can grab your audience's attention right away!

2.63K views14:21

🔍 Quick Note of the Day!

💡 Python: Basic Data Types

Familiarize yourself with basic data types in Python: integers, floats, strings, and booleans.

✅ Pro Tip: Understanding data types is crucial for effective data manipulation!

👍1

2.35K views14:21

⏰ VIEWS in SQL

Definition

A view is a virtual table based on the result of a SELECT query.

Features

- Does not store data; it retrieves data from underlying tables.
- Simplifies complex queries.

Syntax

CREATE VIEW view_name AS
SELECT columns
FROM table_name
WHERE condition;

Example

Create a view to show high-salaried employees:

CREATE VIEW HighSalaryEmployees AS
SELECT name, salary
FROM employees
WHERE salary > 100000;

Use the view:

SELECT * FROM HighSalaryEmployees;

Interview Questions

1. What is the difference between a table and a view?
- A table stores data physically; a view does not.
2. Can you update data through a view?
- Yes, if the view is updatable (no joins, no aggregate functions, etc.).
3. What are the advantages of using views?
- Simplifies complex queries, enhances security, and provides abstraction.

👍4🔥1

2.5K viewsedited 11:59

During Interview: Spark, Hadoop, Kafka, Airflow, SQL, Python, Azure, Data Modeling, etc..

Actual Job: Mostly filtering data with SQL and writing ETL scripts

Still we have to keep up-skill because competition is growing and now in-depth knowledge is in demand.

👏5

2.34K views04:20

🎯 Master the Math & Stats for Data Engineering Success!

📊 Mathematics and statistics are the backbone of data analytics, powering pattern recognition, predictions, and problem-solving in interviews. Let’s make your prep easy and effective!

💡 Why it Matters?

Key concepts ensure precision and help you tackle complex analytical challenges like a pro.

📚 Syllabus Snapshot

🔢 Basic Statistics:

✅ Mean, Median, Mode

✅ Standard Deviation & Variance

✅ Normal Distribution

✅ Percentile & Quintiles

✅ Correlation & Regression Analysis

➕ Basic Math:

✅ Arithmetic (Sum, Subtraction, Division, Multiplication)

✅ Probability, Percentages & Ratios

✅ Weighted Average & Cumulative Sum

✅ Linear Equations & Matrices

✨ Quick Tip: Focus on these concepts, and you'll ace any data analytics interview!

📌 Save this post & start practicing today!

#MathForData #StatisticsForData #DataInterviewTips

3K views09:53

Data Engineering Top Interview Question.pdf

Data Engineering Top Interview Question.pdf

#dataengineering

👍5

2.61K viewsedited 09:33

SQL Essentials for Quick Revision

🚀 SELECT
Retrieve data from one or more tables.

🎯 WHERE Clause
Filter records based on specific conditions.

🔄 ORDER BY
Sort query results in ascending (ASC) or descending (DESC) order.

📊 Aggregation Functions

MIN, MAX, AVG, COUNT: Summarize data.

Window Functions: Perform calculations across a dataset without grouping rows.

🔑 GROUP BY
Group data based on one or more columns and apply aggregate functions.

🔗 JOINS

INNER JOIN: Fetch matching rows from both tables.

LEFT JOIN: All rows from the left table and matching rows from the right.

RIGHT JOIN: All rows from the right table and matching rows from the left.

FULL JOIN: Combine rows when there is a match in either table.

SELF JOIN: Join a table with itself.

🧩 Common Table Expressions (CTE)
Simplify complex queries with temporary result sets.

Quick SQL Revision Notes 📌
Master these concepts for interviews and projects!

#SQL #DataEngineer #QuickNotes

👍2

3.16K viewsedited 11:27

Which SQL statement is used to retrieve data from a database?

Anonymous Quiz

👍4

438 voters3.19K views15:14