Data Analysis Books | Python | SQL | Excel | Artificial Intelligence | Power BI | Tableau | AI Resources
47.7K subscribers
226 photos
1 video
36 files
390 links
Download Telegram
Exploratory Data Analysis (EDA)

EDA is the process of analyzing datasets to summarize key patterns, detect anomalies, and gain insights before applying machine learning or reporting.

1️⃣ Descriptive Statistics
Descriptive statistics help summarize and understand data distributions.

In SQL:

Calculate Mean (Average):

SELECT AVG(salary) AS average_salary FROM employees; 
Find Median (Using Window Functions) SELECT salary FROM ( SELECT salary, ROW_NUMBER() OVER (ORDER BY salary) AS row_num, COUNT(*) OVER () AS total_rows FROM employees ) subquery WHERE row_num = (total_rows / 2);


Find Mode (Most Frequent Value)

SELECT department, COUNT(*) AS count FROM employees GROUP BY department ORDER BY count DESC LIMIT 1; 


Calculate Variance & Standard Deviation

SELECT VARIANCE(salary) AS salary_variance, STDDEV(salary) AS salary_std_dev FROM employees; 


In Python (Pandas):

Mean, Median, Mode

df['salary'].mean() df['salary'].median() df['salary'].mode()[0]



Variance & Standard Deviation

df['salary'].var() df['salary'].std()


2️⃣ Data Visualization

Visualizing data helps identify trends, outliers, and patterns.

In SQL (For Basic Visualization in Some Databases Like PostgreSQL):

Create Histogram (Approximate in SQL)

SELECT salary, COUNT(*) FROM employees GROUP BY salary ORDER BY salary; 


In Python (Matplotlib & Seaborn):

Bar Chart (Category-Wise Sales)

import matplotlib.pyplot as plt 
import seaborn as sns
df.groupby('category')['sales'].sum().plot(kind='bar')
plt.title('Total Sales by Category')
plt.xlabel('Category')
plt.ylabel('Sales')
plt.show()


Histogram (Salary Distribution)

sns.histplot(df['salary'], bins=10, kde=True) 
plt.title('Salary Distribution')
plt.show()


Box Plot (Outliers in Sales Data)

sns.boxplot(y=df['sales']) 
plt.title('Sales Data Outliers')
plt.show()


Heatmap (Correlation Between Variables)

sns.heatmap(df.corr(), annot=True, cmap='coolwarm') plt.title('Feature Correlation Heatmap') plt.show() 


3️⃣ Detecting Anomalies & Outliers

Outliers can skew results and should be identified.

In SQL:

Find records with unusually high salaries

SELECT * FROM employees WHERE salary > (SELECT AVG(salary) + 2 * STDDEV(salary) FROM employees); 

In Python (Pandas & NumPy):

Using Z-Score (Values Beyond 3 Standard Deviations)

from scipy import stats df['z_score'] = stats.zscore(df['salary']) df_outliers = df[df['z_score'].abs() > 3] 

Using IQR (Interquartile Range)

Q1 = df['salary'].quantile(0.25) 
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
df_outliers = df[(df['salary'] < (Q1 - 1.5 * IQR)) | (df['salary'] > (Q3 + 1.5 * IQR))]


4️⃣ Key EDA Steps

Understand the Data → Check missing values, duplicates, and column types

Summarize Statistics → Mean, Median, Standard Deviation, etc.

Visualize Trends → Histograms, Box Plots, Heatmaps

Detect Outliers & Anomalies → Z-Score, IQR

Feature Engineering → Transform variables if needed

Mini Task for You: Write an SQL query to find employees whose salaries are above two standard deviations from the mean salary.

Here you can find the roadmap for data analyst: https://t.me/sqlspecialist/1159

Like this post if you want me to continue covering all the topics! ❤️

Share with credits: https://t.me/sqlspecialist

Hope it helps :)

#sql
5👍3🆒1
Top 10 SQL interview questions with solutions by @sqlspecialist

1. What is the difference between WHERE and HAVING?

Solution:

WHERE filters rows before aggregation.

HAVING filters rows after aggregation.

SELECT department, AVG(salary)
FROM employees
WHERE salary > 3000
GROUP BY department
HAVING AVG(salary) > 5000;


2. Write a query to find the second-highest salary.

Solution:

SELECT MAX(salary) AS second_highest_salary
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);



3. How do you fetch the first 5 rows of a table?

Solution:

SELECT * FROM employees
LIMIT 5; -- (MySQL/PostgreSQL)

For SQL Server:

SELECT TOP 5 * FROM employees;



4. Write a query to find duplicate records in a table.

Solution:

SELECT column1, column2, COUNT(*)
FROM table_name
GROUP BY column1, column2
HAVING COUNT(*) > 1;



5. How do you find employees who don’t belong to any department?

Solution:

SELECT * 
FROM employees
WHERE department_id IS NULL;


6. What is a JOIN, and write a query to fetch data using INNER JOIN.

Solution:
A JOIN combines rows from two or more tables based on a related column.

SELECT e.name, d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.id;


7. Write a query to find the total number of employees in each department.

Solution:

SELECT department_id, COUNT(*) AS total_employees
FROM employees
GROUP BY department_id;


8. How do you fetch the current date in SQL?

Solution:

SELECT CURRENT_DATE; -- MySQL/PostgreSQL
SELECT GETDATE(); -- SQL Server


9. Write a query to delete duplicate rows but keep one.

Solution:

WITH CTE AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS rn
FROM table_name
)
DELETE FROM CTE WHERE rn > 1;


10. What is a Common Table Expression (CTE), and how do you use it?

Solution:
A CTE is a temporary result set defined within a query.

WITH EmployeeCTE AS (
SELECT department_id, COUNT(*) AS total_employees
FROM employees
GROUP BY department_id
)
SELECT * FROM EmployeeCTE WHERE total_employees > 10;



Hope it helps :)

#sql #dataanalysts
2