Exploratory Data Analysis (EDA)
EDA is the process of analyzing datasets to summarize key patterns, detect anomalies, and gain insights before applying machine learning or reporting.
1️⃣ Descriptive Statistics
Descriptive statistics help summarize and understand data distributions.
In SQL:
Calculate Mean (Average):
Find Mode (Most Frequent Value)
Calculate Variance & Standard Deviation
In Python (Pandas):
Mean, Median, Mode
Variance & Standard Deviation
2️⃣ Data Visualization
Visualizing data helps identify trends, outliers, and patterns.
In SQL (For Basic Visualization in Some Databases Like PostgreSQL):
Create Histogram (Approximate in SQL)
In Python (Matplotlib & Seaborn):
Bar Chart (Category-Wise Sales)
Histogram (Salary Distribution)
Box Plot (Outliers in Sales Data)
Heatmap (Correlation Between Variables)
3️⃣ Detecting Anomalies & Outliers
Outliers can skew results and should be identified.
In SQL:
Find records with unusually high salaries
In Python (Pandas & NumPy):
Using Z-Score (Values Beyond 3 Standard Deviations)
Using IQR (Interquartile Range)
4️⃣ Key EDA Steps
Understand the Data → Check missing values, duplicates, and column types
Summarize Statistics → Mean, Median, Standard Deviation, etc.
Visualize Trends → Histograms, Box Plots, Heatmaps
Detect Outliers & Anomalies → Z-Score, IQR
Feature Engineering → Transform variables if needed
Mini Task for You: Write an SQL query to find employees whose salaries are above two standard deviations from the mean salary.
Here you can find the roadmap for data analyst: https://t.me/sqlspecialist/1159
Like this post if you want me to continue covering all the topics! ❤️
Share with credits: https://t.me/sqlspecialist
Hope it helps :)
#sql
EDA is the process of analyzing datasets to summarize key patterns, detect anomalies, and gain insights before applying machine learning or reporting.
1️⃣ Descriptive Statistics
Descriptive statistics help summarize and understand data distributions.
In SQL:
Calculate Mean (Average):
SELECT AVG(salary) AS average_salary FROM employees;
Find Median (Using Window Functions) SELECT salary FROM ( SELECT salary, ROW_NUMBER() OVER (ORDER BY salary) AS row_num, COUNT(*) OVER () AS total_rows FROM employees ) subquery WHERE row_num = (total_rows / 2);
Find Mode (Most Frequent Value)
SELECT department, COUNT(*) AS count FROM employees GROUP BY department ORDER BY count DESC LIMIT 1;
Calculate Variance & Standard Deviation
SELECT VARIANCE(salary) AS salary_variance, STDDEV(salary) AS salary_std_dev FROM employees;
In Python (Pandas):
Mean, Median, Mode
df['salary'].mean() df['salary'].median() df['salary'].mode()[0]
Variance & Standard Deviation
df['salary'].var() df['salary'].std()
2️⃣ Data Visualization
Visualizing data helps identify trends, outliers, and patterns.
In SQL (For Basic Visualization in Some Databases Like PostgreSQL):
Create Histogram (Approximate in SQL)
SELECT salary, COUNT(*) FROM employees GROUP BY salary ORDER BY salary;
In Python (Matplotlib & Seaborn):
Bar Chart (Category-Wise Sales)
import matplotlib.pyplot as plt
import seaborn as sns
df.groupby('category')['sales'].sum().plot(kind='bar')
plt.title('Total Sales by Category')
plt.xlabel('Category')
plt.ylabel('Sales')
plt.show()
Histogram (Salary Distribution)
sns.histplot(df['salary'], bins=10, kde=True)
plt.title('Salary Distribution')
plt.show()
Box Plot (Outliers in Sales Data)
sns.boxplot(y=df['sales'])
plt.title('Sales Data Outliers')
plt.show()
Heatmap (Correlation Between Variables)
sns.heatmap(df.corr(), annot=True, cmap='coolwarm') plt.title('Feature Correlation Heatmap') plt.show() 3️⃣ Detecting Anomalies & Outliers
Outliers can skew results and should be identified.
In SQL:
Find records with unusually high salaries
SELECT * FROM employees WHERE salary > (SELECT AVG(salary) + 2 * STDDEV(salary) FROM employees);
In Python (Pandas & NumPy):
Using Z-Score (Values Beyond 3 Standard Deviations)
from scipy import stats df['z_score'] = stats.zscore(df['salary']) df_outliers = df[df['z_score'].abs() > 3]
Using IQR (Interquartile Range)
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
df_outliers = df[(df['salary'] < (Q1 - 1.5 * IQR)) | (df['salary'] > (Q3 + 1.5 * IQR))]
4️⃣ Key EDA Steps
Understand the Data → Check missing values, duplicates, and column types
Summarize Statistics → Mean, Median, Standard Deviation, etc.
Visualize Trends → Histograms, Box Plots, Heatmaps
Detect Outliers & Anomalies → Z-Score, IQR
Feature Engineering → Transform variables if needed
Mini Task for You: Write an SQL query to find employees whose salaries are above two standard deviations from the mean salary.
Here you can find the roadmap for data analyst: https://t.me/sqlspecialist/1159
Like this post if you want me to continue covering all the topics! ❤️
Share with credits: https://t.me/sqlspecialist
Hope it helps :)
#sql
❤5👍3🆒1
🔥 Step-by-step Data Analysis Projects with SQL
Below are popular data projects from Kaggle, GitHub and Medium and YouTube. They will:
- Help you gain skills in working with real data
- Introduce you to SQL for data analysis
- Inspire you to undertake your own data analysis projects
🗺 Real World Fake Data Analysis
🏠 Housing sales in Nashville
🛒 Walmart Sales Analysis SQL Project
🧳 Alex the Analyst SQL Project
🤑 Superstore Sales Analysis using SQL
💸 International Debt Analysis using SQL
⚽️ Soccer Game Analysis using SQL
🌍 World Population Analysis 2015 using SQL
📉 SQL Project for Data Analysis
🚍 Public Transportation Data Analysis using SQL
📸 Instagram User Data Analysis using SQL
🙌 HR Data Analysis using SQL
🎬 Data Analyst Project: Step-by-step analysis with SQL
🎼 Music Store Data Analysis Project Using SQL
✅ Top 10 SQL Projects with Datasets
✅ Roadmap to Master SQL
#DataAnalyst #DataAnalytics #DataAnalysis #data_analyst #sql
If you find this useful, give it a👍
Below are popular data projects from Kaggle, GitHub and Medium and YouTube. They will:
- Help you gain skills in working with real data
- Introduce you to SQL for data analysis
- Inspire you to undertake your own data analysis projects
🗺 Real World Fake Data Analysis
🏠 Housing sales in Nashville
🛒 Walmart Sales Analysis SQL Project
🧳 Alex the Analyst SQL Project
🤑 Superstore Sales Analysis using SQL
💸 International Debt Analysis using SQL
⚽️ Soccer Game Analysis using SQL
🌍 World Population Analysis 2015 using SQL
📉 SQL Project for Data Analysis
🚍 Public Transportation Data Analysis using SQL
📸 Instagram User Data Analysis using SQL
🙌 HR Data Analysis using SQL
🎬 Data Analyst Project: Step-by-step analysis with SQL
🎼 Music Store Data Analysis Project Using SQL
✅ Top 10 SQL Projects with Datasets
✅ Roadmap to Master SQL
#DataAnalyst #DataAnalytics #DataAnalysis #data_analyst #sql
If you find this useful, give it a👍
❤4
Top 10 SQL interview questions with solutions by @sqlspecialist
1. What is the difference between WHERE and HAVING?
Solution:
WHERE filters rows before aggregation.
HAVING filters rows after aggregation.
2. Write a query to find the second-highest salary.
Solution:
3. How do you fetch the first 5 rows of a table?
Solution:
For SQL Server:
4. Write a query to find duplicate records in a table.
Solution:
5. How do you find employees who don’t belong to any department?
Solution:
6. What is a JOIN, and write a query to fetch data using INNER JOIN.
Solution:
A JOIN combines rows from two or more tables based on a related column.
7. Write a query to find the total number of employees in each department.
Solution:
8. How do you fetch the current date in SQL?
Solution:
9. Write a query to delete duplicate rows but keep one.
Solution:
10. What is a Common Table Expression (CTE), and how do you use it?
Solution:
A CTE is a temporary result set defined within a query.
Hope it helps :)
#sql #dataanalysts
1. What is the difference between WHERE and HAVING?
Solution:
WHERE filters rows before aggregation.
HAVING filters rows after aggregation.
SELECT department, AVG(salary)
FROM employees
WHERE salary > 3000
GROUP BY department
HAVING AVG(salary) > 5000;
2. Write a query to find the second-highest salary.
Solution:
SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
3. How do you fetch the first 5 rows of a table?
Solution:
SELECT * FROM employees
LIMIT 5; -- (MySQL/PostgreSQL)
For SQL Server:
SELECT TOP 5 * FROM employees;
4. Write a query to find duplicate records in a table.
Solution:
SELECT column1, column2, COUNT(*)
FROM table_name
GROUP BY column1, column2
HAVING COUNT(*) > 1;
5. How do you find employees who don’t belong to any department?
Solution:
SELECT *
FROM employees
WHERE department_id IS NULL;
6. What is a JOIN, and write a query to fetch data using INNER JOIN.
Solution:
A JOIN combines rows from two or more tables based on a related column.
SELECT e.name, d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.id;
7. Write a query to find the total number of employees in each department.
Solution:
SELECT department_id, COUNT(*) AS total_employees
FROM employees
GROUP BY department_id;
8. How do you fetch the current date in SQL?
Solution:
SELECT CURRENT_DATE; -- MySQL/PostgreSQL
SELECT GETDATE(); -- SQL Server
9. Write a query to delete duplicate rows but keep one.
Solution:
WITH CTE AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS rn
FROM table_name
)
DELETE FROM CTE WHERE rn > 1;
10. What is a Common Table Expression (CTE), and how do you use it?
Solution:
A CTE is a temporary result set defined within a query.
WITH EmployeeCTE AS (
SELECT department_id, COUNT(*) AS total_employees
FROM employees
GROUP BY department_id
)
SELECT * FROM EmployeeCTE WHERE total_employees > 10;
Hope it helps :)
#sql #dataanalysts
❤2