Machine Learning
39.4K subscribers
4.35K photos
40 videos
50 files
1.42K links
Real Machine Learning โ€” simple, practical, and built on experience.
Learn step by step with clear explanations and working code.

Admin: @HusseinSheikho || @Hussein_Sheikho
Download Telegram
๐Ÿ“Œ Long Short Term Memory (LSTM)- Improving RNNs

๐Ÿ—‚ Category: DEEP LEARNING

๐Ÿ•’ Date: 2024-05-31 | โฑ๏ธ Read time: 9 min read

How state of the art RNNs work
๐Ÿ“Œ Orchestrating a Dynamic Time-series Pipeline in Azure

๐Ÿ—‚ Category: DATA ENGINEERING

๐Ÿ•’ Date: 2024-05-31 | โฑ๏ธ Read time: 9 min read

Explore how to build, trigger, and parameterize a time-series data pipeline with ADF and Databricks,โ€ฆ
๐Ÿ“Œ Data Science Portfolios, Speeding Up Python, KANs, and Other May Must-Reads

๐Ÿ—‚ Category: DATA SCIENCE

๐Ÿ•’ Date: 2024-05-30 | โฑ๏ธ Read time: 4 min read

The stories that resonated the most with our community in the past month
Top 30 Pandas Functions & Methods
๐Ÿ‘‡๐Ÿ‘‡๐Ÿ‘‡๐Ÿ‘‡
Please open Telegram to view this post
VIEW IN TELEGRAM
#Pandas #DataAnalysis #Python #DataScience #Tutorial

Top 30 Pandas Functions & Methods

This lesson covers 30 essential Pandas functions for data manipulation and analysis, each with a standalone example and its output.

---

1. pd.DataFrame()
Creates a new DataFrame (a 2D labeled data structure) from various inputs like dictionaries or lists.

import pandas as pd
data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data)
print(df)

col1  col2
0 1 3
1 2 4


---

2. pd.Series()
Creates a new Series (a 1D labeled array).

import pandas as pd
s = pd.Series([10, 20, 30, 40], name='MyNumbers')
print(s)

0    10
1 20
2 30
3 40
Name: MyNumbers, dtype: int64


---

3. pd.read_csv()
Reads data from a CSV file into a DataFrame. (Assuming a file data.csv exists).

# Create a dummy csv file first
with open('data.csv', 'w') as f:
f.write('Name,Age\nAlice,25\nBob,30')

df = pd.read_csv('data.csv')
print(df)

Name  Age
0 Alice 25
1 Bob 30


---

4. df.to_csv()
Writes a DataFrame to a CSV file.

import pandas as pd
df = pd.DataFrame({'Name': ['Charlie'], 'Age': [35]})
# index=False prevents writing the DataFrame index to the file
df.to_csv('output.csv', index=False)
# You can check that 'output.csv' has been created.
print("File 'output.csv' created.")

File 'output.csv' created.

#PandasIO #DataFrame #Series

---

5. df.head()
Returns the first n rows of the DataFrame (default is 5).

import pandas as pd
data = {'Name': ['A', 'B', 'C', 'D', 'E', 'F'], 'Value': [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data)
print(df.head(3))

Name  Value
0 A 1
1 B 2
2 C 3


---

6. df.tail()
Returns the last n rows of the DataFrame (default is 5).

import pandas as pd
data = {'Name': ['A', 'B', 'C', 'D', 'E', 'F'], 'Value': [1, 2, 3, 4, 5, 6]}
df = pd.DataFrame(data)
print(df.tail(2))

Name  Value
4 E 5
5 F 6


---

7. df.info()
Provides a concise summary of the DataFrame, including data types and non-null values.

import pandas as pd
import numpy as np
data = {'col1': [1, 2, 3], 'col2': [4.0, 5.0, np.nan], 'col3': ['A', 'B', 'C']}
df = pd.DataFrame(data)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 col1 3 non-null int64
1 col2 2 non-null float64
2 col3 3 non-null object
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes


---

8. df.shape
Returns a tuple representing the dimensionality (rows, columns) of the DataFrame.

import pandas as pd
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
print(df.shape)

(2, 3)

#DataInspection #PandasBasics

---

9. df.describe()
Generates descriptive statistics for numerical columns (count, mean, std, min, max, etc.).

import pandas as pd
df = pd.DataFrame({'Age': [22, 38, 26, 35, 29]})
print(df.describe())
โค4
Age
count 5.000000
mean 30.000000
std 6.363961
min 22.000000
25% 26.000000
50% 29.000000
75% 35.000000
max 38.000000


---

10. df.columns
Returns the column labels of the DataFrame.

import pandas as pd
df = pd.DataFrame({'Name': [], 'Age': [], 'City': []})
print(df.columns)

Index(['Name', 'Age', 'City'], dtype='object')


---

11. df.dtypes
Returns the data type of each column.

import pandas as pd
df = pd.DataFrame({'Name': ['Alice'], 'Age': [25], 'Salary': [75000.50]})
print(df.dtypes)

Name       object
Age int64
Salary float64
dtype: object


---

12. Selecting a Column
Select a single column, which returns a Pandas Series.

import pandas as pd
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
ages = df['Age']
print(ages)

0    25
1 30
Name: Age, dtype: int64

#DataSelection #Indexing #Statistics

---

13. df.loc[]
Access a group of rows and columns by label(s) or a boolean array.

import pandas as pd
data = {'Age': [25, 30, 35], 'City': ['NY', 'LA', 'CH']}
df = pd.DataFrame(data, index=['Alice', 'Bob', 'Charlie'])
print(df.loc['Bob'])

Age     30
City LA
Name: Bob, dtype: object


---

14. df.iloc[]
Access a group of rows and columns by integer position(s).

import pandas as pd
data = {'Age': [25, 30, 35], 'City': ['NY', 'LA', 'CH']}
df = pd.DataFrame(data, index=['Alice', 'Bob', 'Charlie'])
print(df.iloc[1]) # Get the second row (index 1)

Age     30
City LA
Name: Bob, dtype: object


---

15. df.isnull()
Returns a DataFrame of the same shape with boolean values indicating if a value is missing (NaN).

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, np.nan], 'B': [3, 4]})
print(df.isnull())

A      B
0 False False
1 True False


---

16. df.dropna()
Removes missing values.

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, 6]})
cleaned_df = df.dropna()
print(cleaned_df)

A  B
0 1.0 4
2 3.0 6

#DataCleaning #MissingData

---

17. df.fillna()
Fills missing (NaN) values with a specified value or method.

import pandas as pd
import numpy as np
df = pd.DataFrame({'Score': [90, 85, np.nan, 92]})
filled_df = df.fillna(0)
print(filled_df)

Score
0 90.0
1 85.0
2 0.0
3 92.0


---

18. df.drop_duplicates()
Removes duplicate rows from the DataFrame.

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Alice'], 'Age': [25, 30, 25]}
df = pd.DataFrame(data)
unique_df = df.drop_duplicates()
print(unique_df)

Name  Age
0 Alice 25
1 Bob 30


---

19. df.rename()
Alters axes labels (e.g., column names).

import pandas as pd
df = pd.DataFrame({'A': [1], 'B': [2]})
renamed_df = df.rename(columns={'A': 'Column_A', 'B': 'Column_B'})
print(renamed_df)

Column_A  Column_B
0 1 2


---

20. series.value_counts()
Returns a Series containing counts of unique values.
import pandas as pd
s = pd.Series(['A', 'B', 'A', 'C', 'A', 'B'])
print(s.value_counts())

A    3
B 2
C 1
dtype: int64

#DataManipulation #Transformation

---

21. series.unique()
Returns an array of unique values in a Series.

import pandas as pd
s = pd.Series(['A', 'B', 'A', 'C', 'A', 'B'])
print(s.unique())

['A' 'B' 'C']


---

22. df.sort_values()
Sorts a DataFrame by the values of one or more columns.

import pandas as pd
data = {'Name': ['Charlie', 'Alice', 'Bob'], 'Age': [35, 25, 30]}
df = pd.DataFrame(data)
sorted_df = df.sort_values(by='Age')
print(sorted_df)

Name  Age
1 Alice 25
2 Bob 30
0 Charlie 35


---

23. df.groupby()
Groups a DataFrame using a mapper or by a Series of columns for aggregation.

import pandas as pd
data = {'Dept': ['HR', 'IT', 'HR', 'IT'], 'Salary': [70, 85, 75, 90]}
df = pd.DataFrame(data)
grouped = df.groupby('Dept').mean()
print(grouped)

Salary
Dept
HR 72.5
IT 87.5


---

24. df.agg()
Applies one or more aggregations over the specified axis.

import pandas as pd
data = {'Dept': ['HR', 'IT', 'HR', 'IT'], 'Salary': [70, 85, 75, 90]}
df = pd.DataFrame(data)
agg_results = df.groupby('Dept')['Salary'].agg(['mean', 'sum'])
print(agg_results)

mean  sum
Dept
HR 72.5 145
IT 87.5 175

#Aggregation #Grouping #Sorting

---

25. df.apply()
Applies a function along an axis of the DataFrame.

import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
# Apply a function to double each value in column 'A'
df['A_doubled'] = df['A'].apply(lambda x: x * 2)
print(df)

A   B  A_doubled
0 1 10 2
1 2 20 4
2 3 30 6


---

26. pd.merge()
Merges two DataFrames based on a common column or index, similar to a SQL join.

import pandas as pd
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [1, 2], 'Role': ['Engineer', 'Analyst']})
merged_df = pd.merge(df1, df2, on='ID')
print(merged_df)

ID   Name      Role
0 1 Alice Engineer
1 2 Bob Analyst


---

27. pd.concat()
Concatenates (stacks) pandas objects along a particular axis.

import pandas as pd
df1 = pd.DataFrame({'A': ['A0'], 'B': ['B0']})
df2 = pd.DataFrame({'A': ['A1'], 'B': ['B1']})
concatenated_df = pd.concat([df1, df2])
print(concatenated_df)

A   B
0 A0 B0
0 A1 B1


---

28. df.pivot_table()
Creates a spreadsheet-style pivot table as a DataFrame.
โค2
import pandas as pd
data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02'],
'City': ['NY', 'LA', 'NY'],
'Sales': [100, 150, 120]}
df = pd.DataFrame(data)
pivot = df.pivot_table(values='Sales', index='Date', columns='City')
print(pivot)

City             LA     NY
Date
2023-01-01 150.0 100.0
2023-01-02 NaN 120.0

#CombiningData #PivotTable

---

29. df.set_index()
Sets one or more existing columns as the DataFrame index.

import pandas as pd
data = {'ID': ['a1', 'a2'], 'Name': ['Alice', 'Bob']}
df = pd.DataFrame(data)
df_indexed = df.set_index('ID')
print(df_indexed)

Name
ID
a1 Alice
a2 Bob


---

30. df.reset_index()
Resets the index of the DataFrame, making the old index a new column.

import pandas as pd
data = {'Name': ['Alice', 'Bob']}
df = pd.DataFrame(data, index=['a1', 'a2'])
df_reset = df.reset_index()
print(df_reset)

index   Name
0 a1 Alice
1 a2 Bob

#Indexing #PandasTips

โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
By: @DataScienceM โœจ
โค8
๐Ÿ“Œ The History of Bodybuilding Through Network Visualization

๐Ÿ—‚ Category: DATA SCIENCE

๐Ÿ•’ Date: 2024-05-30 | โฑ๏ธ Read time: 5 min read

Constructing the Shared Podium Graph of Mr. Olympia Winners (1965-2023) using Python and Gephi.
Top 30 MATLAB Image Processing Functions

#MATLAB #ImageProcessing #Basics

๐Ÿ‘‡๐Ÿ‘‡๐Ÿ‘‡๐Ÿ‘‡๐Ÿ‘‡
Please open Telegram to view this post
VIEW IN TELEGRAM
Top 30 MATLAB Image Processing Functions

#MATLAB #ImageProcessing #Basics

#1. imread()
Reads an image from a file into a matrix.

img = imread('peppers.png');
disp('Image "peppers.png" loaded into variable "img".');

Image "peppers.png" loaded into variable "img".


#2. imshow()
Displays an image in a figure window.

img = imread('peppers.png');
imshow(img);
title('Peppers Image');

Output: A new figure window opens, displaying the 'peppers.png' image with the title "Peppers Image".


#3. imwrite()
Writes an image matrix to a file.

img = imread('cameraman.tif');
imwrite(img, 'my_cameraman.jpg');
disp('Image saved as my_cameraman.jpg');

Image saved as my_cameraman.jpg


#4. size()
Returns the dimensions of the image matrix (rows, columns, color channels).

rgb_img = imread('peppers.png');
gray_img = imread('cameraman.tif');
size_rgb = size(rgb_img);
size_gray = size(gray_img);
disp(['Size of RGB image: ', num2str(size_rgb)]);
disp(['Size of grayscale image: ', num2str(size_gray)]);

Size of RGB image: 384   512     3
Size of grayscale image: 256 256


#5. rgb2gray()
Converts an RGB color image to a grayscale intensity image.

rgb_img = imread('peppers.png');
gray_img = rgb2gray(rgb_img);
imshow(gray_img);
title('Grayscale Peppers');

Output: A figure window displays the grayscale version of the peppers image.

---
#MATLAB #ImageProcessing #Conversion #Transformation

#6. im2double()
Converts an image to double-precision format, scaling data to the range [0, 1].

img_uint8 = imread('cameraman.tif');
img_double = im2double(img_uint8);
disp(['Max value of original image: ', num2str(max(img_uint8(:)))]);
disp(['Max value of double image: ', num2str(max(img_double(:)))]);

Max value of original image: 253
Max value of double image: 0.99216


#7. imresize()
Resizes an image to a specified size.

img = imread('cameraman.tif');
resized_img = imresize(img, 0.5); % Resize to 50% of original size
imshow(resized_img);
title('Resized Cameraman');

Output: A figure window displays the cameraman image at half its original size.


#8. imrotate()
Rotates an image by a specified angle.

img = imread('cameraman.tif');
rotated_img = imrotate(img, 30, 'bilinear', 'crop');
imshow(rotated_img);
title('Rotated 30 Degrees');

Output: A figure window displays the cameraman image rotated by 30 degrees, cropped to the original size.


#9. imcrop()
Crops an image to a specified rectangle.

img = imread('peppers.png');
% [xmin ymin width height]
cropped_img = imcrop(img, [100 80 250 200]);
imshow(cropped_img);
title('Cropped Image');

Output: A figure window displays only the rectangular section specified from the peppers image.


#10. rgb2hsv()
Converts an RGB image to the Hue-Saturation-Value (HSV) color space.

rgb_img = imread('peppers.png');
hsv_img = rgb2hsv(rgb_img);
hue_channel = hsv_img(:,:,1); % Extract the Hue channel
imshow(hue_channel);
title('Hue Channel of Peppers Image');

Output: A figure window displays the Hue channel of the peppers image as a grayscale image.

---
#MATLAB #ImageProcessing #Enhancement

#11. imhist()
Displays the histogram of an image, showing the distribution of pixel intensity values.
โค1
gray_img = imread('pout.tif');
imhist(gray_img);
title('Histogram of a Low-Contrast Image');

Output: A figure window with a bar chart showing the intensity distribution of the 'pout.tif' image.


#12. histeq()
Enhances contrast using histogram equalization.

low_contrast_img = imread('pout.tif');
high_contrast_img = histeq(low_contrast_img);
imshow(high_contrast_img);
title('Histogram Equalized Image');

Output: A figure window displays a higher contrast version of the 'pout.tif' image.


#13. imadjust()
Adjusts image intensity values or colormap by mapping intensity values to new values.

img = imread('cameraman.tif');
adjusted_img = imadjust(img, [0.3 0.7], []);
imshow(adjusted_img);
title('Intensity Adjusted Image');

Output: A figure window showing a high-contrast version of the cameraman image, where intensities between 0.3 and 0.7 are stretched to the full [0, 1] range.


#14. imtranslate()
Translates (shifts) an image horizontally and vertically.

img = imread('cameraman.tif');
translated_img = imtranslate(img, [25, 15]); % Shift 25 pixels right, 15 pixels down
imshow(translated_img);
title('Translated Image');

Output: A figure window shows the cameraman image shifted to the right and down.


#15. imsharpen()
Sharpens an image using the unsharp masking method.

img = imread('peppers.png');
sharpened_img = imsharpen(img);
imshow(sharpened_img);
title('Sharpened Image');

Output: A figure window displays a crisper, more detailed version of the peppers image.

---
#MATLAB #ImageProcessing #Filtering #Noise

#16. imnoise()
Adds a specified type of noise to an image.

img = imread('cameraman.tif');
noisy_img = imnoise(img, 'salt & pepper', 0.02);
imshow(noisy_img);
title('Image with Salt & Pepper Noise');

Output: A figure window displays the cameraman image with random white and black pixels (noise).


#17. fspecial()
Creates a predefined 2-D filter kernel (e.g., for averaging, Gaussian blur, Laplacian).

h = fspecial('motion', 20, 45); % Create a motion blur filter
disp('Generated a 2D motion filter kernel.');
disp(h);

Generated a 2D motion filter kernel.
(Output is a matrix representing the filter kernel)


#18. imfilter()
Filters a multidimensional image with a specified filter kernel.

img = imread('cameraman.tif');
h = fspecial('motion', 20, 45);
motion_blur_img = imfilter(img, h, 'replicate');
imshow(motion_blur_img);
title('Motion Blurred Image');

Output: A figure window shows the cameraman image with a motion blur effect applied at a 45-degree angle.


#19. medfilt2()
Performs 2-D median filtering, which is excellent for removing 'salt & pepper' noise.

noisy_img = imnoise(imread('cameraman.tif'), 'salt & pepper', 0.02);
denoised_img = medfilt2(noisy_img);
imshow(denoised_img);
title('Denoised with Median Filter');

Output: A figure window shows the noisy image significantly cleaned up, with most salt & pepper noise removed.


#20. edge()
Finds edges in an intensity image using various algorithms (e.g., Sobel, Canny).
img = imread('cameraman.tif');
edges = edge(img, 'Canny');
imshow(edges);
title('Edges found with Canny Detector');

Output: A figure window displays a binary image showing only the detected edges from the original image in white.

---
#MATLAB #ImageProcessing #Segmentation #Morphology

#21. graythresh()
Computes a global image threshold from a grayscale image using Otsu's method.

img = imread('coins.png');
level = graythresh(img);
disp(['Optimal threshold level (Otsu): ', num2str(level)]);

Optimal threshold level (Otsu): 0.49412


#22. imbinarize()
Converts a grayscale image to a binary image based on a threshold.

img = imread('coins.png');
level = graythresh(img); % Find optimal threshold
bw_img = imbinarize(img, level);
imshow(bw_img);
title('Binarized Image (Otsu Method)');

Output: A figure window displays a black and white image of the coins.


#23. strel()
Creates a morphological structuring element (SE), which is used to probe an image in morphological operations.

se = strel('disk', 5);
disp('Created a disk-shaped structuring element with radius 5.');
disp(se);

Created a disk-shaped structuring element with radius 5.
(Output describes the strel object and shows its matrix representation)


#24. imdilate()
Dilates a binary image, making objects larger and filling small holes.

img = imread('text.png');
se = strel('line', 3, 90); % A vertical line SE
dilated_img = imdilate(img, se);
imshow(dilated_img);
title('Dilated Text');

Output: A figure window shows the text characters appearing thicker, especially in the vertical direction.


#25. imerode()
Erodes a binary image, shrinking objects and removing small noise.

img = imread('text.png');
se = strel('line', 3, 0); % A horizontal line SE
eroded_img = imerode(img, se);
imshow(eroded_img);
title('Eroded Text');

Output: A figure window shows the text characters appearing thinner, with horizontal parts possibly disappearing.

---
#MATLAB #ImageProcessing #Analysis

#26. imopen()
Performs morphological opening (erosion followed by dilation). It smooths contours and removes small objects.

original = imread('circbw.tif');
se = strel('disk', 10);
opened_img = imopen(original, se);
imshow(opened_img);
title('Morphologically Opened Image');

Output: A figure window displays the image with small protrusions removed and gaps between objects widened.


#27. bwareaopen()
Removes all connected components (objects) from a binary image that have fewer than a specified number of pixels.

img = imread('text.png');
cleaned_img = bwareaopen(img, 50); % Remove objects with fewer than 50 pixels
imshow(cleaned_img);
title('Image after removing small objects');

Output: A figure window shows the text image with small noise specks or broken parts of characters removed.


#28. bwlabel()
Labels connected components in a binary image.

img = imread('text.png');
[L, num] = bwlabel(img);
disp(['Number of connected objects found: ', num2str(num)]);

Number of connected objects found: 114


#29. regionprops()
Measures a set of properties for each labeled region in an image.
img = imread('coins.png');
bw = imbinarize(img);
stats = regionprops('table', bw, 'Centroid', 'MajorAxisLength', 'MinorAxisLength');
disp('Properties of the first 3 coins:');
disp(stats(1:3,:));

Properties of the first 3 coins:
Centroid MajorAxisLength MinorAxisLength
________ _______________ _______________

52.715 26.839 48.24 47.369
134.5 27.067 48.745 47.534
215.01 29.805 47.854 47.502


#30. hough()
Performs the Hough transform, used to detect lines in an image.

img = imrotate(imread('circuit.tif'), 33, 'crop');
edges = edge(img, 'canny');
[H, T, R] = hough(edges);
imshow(imadjust(rescale(H)), 'XData', T, 'YData', R, ...
'InitialMagnification', 'fit');
xlabel('\theta'), ylabel('\rho');
title('Hough Transform of Circuit');

Output: A figure window displaying the Hough transform as an image, where bright spots correspond to potential lines in the original image.


โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
By: @DataScienceM โœจ
โค2
๐Ÿ“Œ Constructive Heuristics in Discrete Optimization

๐Ÿ—‚ Category:

๐Ÿ•’ Date: 2024-05-30 | โฑ๏ธ Read time: 13 min read

Obtain initial solutions for combinatorial optimization problems with Python examples
Top 100 Data Analyst Interview Questions & Answers

๐Ÿ‘‡๐Ÿ‘‡๐Ÿ‘‡๐Ÿ‘‡๐Ÿ‘‡
Please open Telegram to view this post
VIEW IN TELEGRAM
Top 100 Data Analyst Interview Questions & Answers

#DataAnalysis #InterviewQuestions #SQL #Python #Statistics #CaseStudy #DataScience

Part 1: SQL Questions (Q1-30)

#1. What is the difference between DELETE, TRUNCATE, and DROP?
A:
โ€ข DELETE is a DML command that removes rows from a table based on a WHERE clause. It is slower as it logs each row deletion and can be rolled back.
โ€ข TRUNCATE is a DDL command that quickly removes all rows from a table. It is faster, cannot be rolled back, and resets table identity.
โ€ข DROP is a DDL command that removes the entire table, including its structure, data, and indexes.

#2. Select all unique departments from the employees table.
A: Use the DISTINCT keyword.

SELECT DISTINCT department
FROM employees;


#3. Find the top 5 highest-paid employees.
A: Use ORDER BY and LIMIT.

SELECT name, salary
FROM employees
ORDER BY salary DESC
LIMIT 5;


#4. What is the difference between WHERE and HAVING?
A:
โ€ข WHERE is used to filter records before any groupings are made (i.e., it operates on individual rows).
โ€ข HAVING is used to filter groups after aggregations (GROUP BY) have been performed.

-- Find departments with more than 10 employees
SELECT department, COUNT(employee_id)
FROM employees
GROUP BY department
HAVING COUNT(employee_id) > 10;


#5. What are the different types of SQL joins?
A:
โ€ข (INNER) JOIN: Returns records that have matching values in both tables.
โ€ข LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from the right table.
โ€ข RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records from the left table.
โ€ข FULL (OUTER) JOIN: Returns all records when there is a match in either the left or right table.
โ€ข SELF JOIN: A regular join, but the table is joined with itself.

#6. Write a query to find the second-highest salary.
A: Use OFFSET or a subquery.

-- Method 1: Using OFFSET
SELECT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;

-- Method 2: Using a Subquery
SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);


#7. Find duplicate emails in a customers table.
A: Group by the email column and use HAVING to find groups with a count greater than 1.

SELECT email, COUNT(email)
FROM customers
GROUP BY email
HAVING COUNT(email) > 1;


#8. What is a primary key vs. a foreign key?
A:
โ€ข A Primary Key is a constraint that uniquely identifies each record in a table. It must contain unique values and cannot contain NULL values.
โ€ข A Foreign Key is a key used to link two tables together. It is a field (or collection of fields) in one table that refers to the Primary Key in another table.

#9. Explain Window Functions. Give an example.
A: Window functions perform a calculation across a set of table rows that are somehow related to the current row. Unlike aggregate functions, they do not collapse rows.

-- Rank employees by salary within each department
SELECT
name,
department,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_rank
FROM employees;


#10. What is a CTE (Common Table Expression)?
A: A CTE is a temporary, named result set that you can reference within a SELECT, INSERT, UPDATE, or DELETE statement. It helps improve readability and break down complex queries.
WITH DepartmentSales AS (
SELECT department, SUM(sale_amount) as total_sales
FROM sales
GROUP BY department
)
SELECT department, total_sales
FROM DepartmentSales
WHERE total_sales > 100000;

---
#11. Difference between UNION and UNION ALL?
A:
โ€ข UNION combines the result sets of two or more SELECT statements and removes duplicate rows.
โ€ข UNION ALL also combines result sets but includes all rows, including duplicates. It is faster because it doesn't check for duplicates.

#12. How would you find the total number of employees in each department?
A: Use COUNT() with GROUP BY.

SELECT department, COUNT(employee_id) as number_of_employees
FROM employees
GROUP BY department;


#13. What is the difference between RANK() and DENSE_RANK()?
A:
โ€ข RANK() assigns a rank to each row within a partition. If there are ties, it skips the next rank(s). (e.g., 1, 2, 2, 4)
โ€ข DENSE_RANK() also assigns ranks, but it does not skip any ranks in case of ties. (e.g., 1, 2, 2, 3)

#14. Write a query to get the Nth highest salary.
A: Use DENSE_RANK() in a CTE.

WITH SalaryRanks AS (
SELECT
salary,
DENSE_RANK() OVER (ORDER BY salary DESC) as rnk
FROM employees
)
SELECT salary
FROM SalaryRanks
WHERE rnk = 5; -- For the 5th highest salary


#15. What is COALESCE() used for?
A: The COALESCE() function returns the first non-NULL value in a list of expressions. It's useful for providing default values for nulls.

SELECT name, COALESCE(commission, 0) as commission
FROM employees; -- Replaces NULL commissions with 0

---
#16. How would you select all employees whose name starts with 'A'?
A: Use the LIKE operator with a wildcard (%).

SELECT name
FROM employees
WHERE name LIKE 'A%';


#17. Get the current date and time.
A: This is function-dependent on the SQL dialect.
โ€ข PostgreSQL/MySQL: NOW()
โ€ข SQL Server: GETDATE()

SELECT NOW();


#18. How can you extract the month from a date?
A: Use the EXTRACT function or MONTH().

-- Standard SQL
SELECT EXTRACT(MONTH FROM '2023-10-27');
-- MySQL
SELECT MONTH('2023-10-27');


#19. What is a subquery? What are the types?
A: A subquery is a query nested inside another query.
โ€ข Scalar Subquery: Returns a single value (one row, one column).
โ€ข Multi-row Subquery: Returns multiple rows.
โ€ข Correlated Subquery: An inner query that depends on the outer query for its values. It is evaluated once for each row processed by the outer query.

#20. Write a query to find all employees who work in the 'Sales' department.
A: Use a JOIN or a subquery.

-- Using JOIN (preferred)
SELECT e.name
FROM employees e
JOIN departments d ON e.department_id = d.id
WHERE d.name = 'Sales';

---
#21. How would you calculate the month-over-month growth rate of sales?
A: Use the LAG() window function to get the previous month's sales and then apply the growth formula.

WITH MonthlySales AS (
SELECT
DATE_TRUNC('month', order_date)::DATE as sales_month,
SUM(sale_amount) as total_sales
FROM sales
GROUP BY 1
)
SELECT
sales_month,
total_sales,
(total_sales - LAG(total_sales, 1) OVER (ORDER BY sales_month)) / LAG(total_sales, 1) OVER (ORDER BY sales_month) * 100 as growth_rate
FROM MonthlySales;
#22. What is an index in a database? Why is it useful?
A: An index is a special lookup table that the database search engine can use to speed up data retrieval. It works like an index in the back of a book. It improves the speed of SELECT queries but can slow down data modification (INSERT, UPDATE, DELETE).

#23. Difference between VARCHAR and CHAR?
A:
โ€ข CHAR is a fixed-length string data type. CHAR(10) will always store 10 characters, padding with spaces if necessary.
โ€ข VARCHAR is a variable-length string data type. VARCHAR(10) can store up to 10 characters, but only uses the storage needed for the actual string.

#24. What is a CASE statement?
A: The CASE statement goes through conditions and returns a value when the first condition is met (like an if-then-else statement).

SELECT
name,
salary,
CASE
WHEN salary > 100000 THEN 'High Earner'
WHEN salary > 50000 THEN 'Mid Earner'
ELSE 'Low Earner'
END as salary_category
FROM employees;


#25. Find the cumulative sum of sales over time.
A: Use a SUM() window function.

SELECT
order_date,
sale_amount,
SUM(sale_amount) OVER (ORDER BY order_date) as cumulative_sales
FROM sales;

---
#26. What does GROUP_CONCAT (MySQL) or STRING_AGG (PostgreSQL) do?
A: These functions concatenate strings from a group into a single string with a specified separator.

-- PostgreSQL example
SELECT department, STRING_AGG(name, ', ') as employee_names
FROM employees
GROUP BY department;


#27. What is data normalization? Why is it important?
A: Data normalization is the process of organizing columns and tables in a relational database to minimize data redundancy. It is important because it reduces storage space, eliminates inconsistent data, and simplifies data management.

#28. Write a query to find users who made a purchase in January but not in February.
A: Use LEFT JOIN or NOT IN.

SELECT user_id
FROM sales
WHERE EXTRACT(MONTH FROM order_date) = 1
EXCEPT
SELECT user_id
FROM sales
WHERE EXTRACT(MONTH FROM order_date) = 2;


#29. What is a self-join?
A: A self-join is a join in which a table is joined to itself. This is useful for querying hierarchical data or comparing rows within the same table.

-- Find employees who have the same manager
SELECT e1.name as employee1, e2.name as employee2, e1.manager_id
FROM employees e1
JOIN employees e2 ON e1.manager_id = e2.manager_id AND e1.id <> e2.id;


#30. What is the execution order of a SQL query?
A: The logical processing order is generally:
โ€ข FROM / JOIN
โ€ข WHERE
โ€ข GROUP BY
โ€ข HAVING
โ€ข SELECT
โ€ข DISTINCT
โ€ข ORDER BY
โ€ข LIMIT / OFFSET

---
Part 2: Python (Pandas/NumPy) Questions (Q31-50)

#31. How do you select a column named 'age' from a pandas DataFrame df?
A: There are two common ways.

# Method 1 (preferred, handles column names with spaces)
age_column = df['age']

# Method 2 (dot notation)
age_column = df.age


#32. How do you filter a DataFrame df to get rows where 'age' is greater than 30?
A: Use boolean indexing.

filtered_df = df[df['age'] > 30]


#33. What's the difference between .loc and .iloc?
A:
โ€ข .loc is a label-based indexer. You use row and column names to select data.
โ€ข .iloc is an integer-position-based indexer. You use integer indices (like in Python lists) to select data.
๐Ÿ‘1
# .loc example (select row with index 'a')
df.loc['a']
# .iloc example (select first row)
df.iloc[0]


#34. How do you handle missing values in a DataFrame?
A: Several methods:
โ€ข df.isnull().sum(): To count missing values per column.
โ€ข df.dropna(): To remove rows/columns with missing values.
โ€ข df.fillna(value): To fill missing values with a specific value (e.g., 0, mean, median).

# Fill missing age values with the mean age
mean_age = df['age'].mean()
df['age'].fillna(mean_age, inplace=True)


#35. How would you create a new column 'age_group' based on the 'age' column?
A: Use pd.cut or a custom function with .apply.

bins = [0, 18, 35, 60, 100]
labels = ['Child', 'Young Adult', 'Adult', 'Senior']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)


#36. How do you merge two DataFrames, df1 and df2, on a common column 'user_id'?
A: Use pd.merge().

merged_df = pd.merge(df1, df2, on='user_id', how='inner') # 'how' can be 'left', 'right', 'outer'


#37. How can you group a DataFrame by 'department' and calculate the average 'salary'?
A: Use .groupby() and .agg() or a direct aggregation function.

avg_salary_by_dept = df.groupby('department')['salary'].mean()


#38. What is the purpose of the .apply() method in pandas?
A: .apply() lets you apply a function along an axis of a DataFrame. It is used for complex, custom operations that are not covered by built-in pandas functions.

# Create a new column by applying a custom function to the 'salary' column
def categorize_salary(salary):
if salary > 100000:
return 'High'
return 'Low'

df['salary_category'] = df['salary'].apply(categorize_salary)


#39. How do you remove duplicate rows from a DataFrame?
A: Use the .drop_duplicates() method.

# Keep the first occurrence of each duplicate row
unique_df = df.drop_duplicates()

# Keep the last occurrence
unique_df_last = df.drop_duplicates(keep='last')


#40. Explain the difference between join() and merge() in pandas.
A:
โ€ข merge() is more versatile and is the main entry point for database-style join operations. It can join on columns or indices.
โ€ข join() is a convenience method for joining DataFrames primarily on their indices. It can also join on a column of the calling DataFrame to the index of the other.
In most cases, merge() is the more powerful and flexible choice.
---
#41. How do you convert a column's data type, e.g., 'date_string' to datetime?
A: Use pd.to_datetime().

df['date'] = pd.to_datetime(df['date_string'])


#42. What is a pivot table and how do you create one in pandas?
A: A pivot table is a data summarization tool. It reshapes data by aggregating values based on one or more grouping keys along rows and columns. Use df.pivot_table().

# Create a pivot table to see average sales by region and product
pivot = df.pivot_table(values='sales', index='region', columns='product', aggfunc='mean')


#43. How would you select rows with multiple conditions, e.g., 'age' > 30 and 'city' == 'New York'?
A: Use boolean indexing with & for AND, and | for OR. Wrap each condition in parentheses.
๐Ÿ‘1