Machine Learning
Real Machine Learning: simple, practical, and built on experience.
Learn step by step with clear explanations and working code.

Admin: @HusseinSheikho || @Hussein_Sheikho
📚 Data Cleaning and Exploration with Machine Learning (2022)

1️⃣ Join Channel Download:
https://t.me/+MhmkscCzIYQ2MmM8

2️⃣ Download Book: https://t.me/c/1854405158/119

💬 Tags: #DataCleaning #ML

📚 Python Data Cleaning Cookbook (2023)

1️⃣ Join Channel Download:
https://t.me/+MhmkscCzIYQ2MmM8

2️⃣ Download Book: https://t.me/c/1854405158/866

💬 Tags: #DataCleaning

Pandas Data Cleaning (Guide)

🔑 Tags: #Pandas #DataCleaning #ML

https://t.me/DataScienceM ✅
Pandas.pdf
14.9 MB
Topic: Handling Datasets of All Types – Part 2 of 5: Data Cleaning and Preprocessing

---

1. Importance of Data Cleaning

• Real-world data is often noisy, incomplete, or inconsistent.

• Cleaning improves data quality and model performance.

---

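2. Handling Missing Data

• Detect missing values using isnull() or isna() in pandas (the two are aliases).

• Strategies to handle missing data:

* Remove rows or columns with missing values:

df = df.dropna()  # drops rows by default; pass axis=1 to drop columns


* Impute missing values with mean, median, or mode:

# Assign the result back; calling fillna(inplace=True) on a selected column may not update df
df['column'] = df['column'].fillna(df['column'].mean())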
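
To make the choice between mean, median, and mode more concrete, here is a minimal sketch. The toy DataFrame and its column names are made up for illustration; the idea is simply to count missing values first, then use the median for a numeric column and the mode for a categorical one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    'age': [25.0, np.nan, 35.0, 40.0],
    'city': ['NY', 'LA', None, 'LA'],
})

# Count missing values per column before choosing a strategy.
missing_counts = df.isna().sum()

# Numeric column: the median is less sensitive to outliers than the mean.
df['age'] = df['age'].fillna(df['age'].median())

# Categorical column: fill with the most frequent value (the mode).
df['city'] = df['city'].fillna(df['city'].mode()[0])

print(missing_counts)
print(df)
```

The median is often a safer default than the mean when a column is skewed, but which statistic fits best depends on the data.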


---

3. Handling Outliers

• Outliers can skew analysis and model results.

• Detect outliers using:

* Boxplots
* Z-score method
* IQR (Interquartile Range)

• Handle by removal, capping (winsorizing), or a transformation such as a log transform.
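
The IQR and Z-score rules listed above can be sketched as follows. The sample values, the 1.5 multiplier, and the 2.5 Z-score cutoff are assumptions chosen for this illustration, not fixed rules.

```python
import pandas as pd

# Hypothetical sample with one obvious outlier (120).
values = pd.Series([22, 25, 27, 29, 30, 31, 33, 35, 120], name='age')

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Z-score rule: flag points more than `threshold` standard deviations from the mean.
threshold = 2.5  # assumed cutoff; 3 is another common choice
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > threshold]

# Removal: keep only the values inside the IQR fences.
cleaned = values[(values >= lower) & (values <= upper)]

print(iqr_outliers.tolist(), z_outliers.tolist(), len(cleaned))
```

On very small samples the Z-score is bounded (with 9 points it cannot exceed about 2.67), so a cutoff of 3 would flag nothing here. The IQR rule tends to be more robust in that situation.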

---

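4. Data Normalization and Scaling

• Many ML models (for example k-nearest neighbors, SVMs, and models trained by gradient descent) work best when features are on a similar scale; tree-based models are largely insensitive to it.

• Common techniques:

* Min-Max Scaling (scales values between 0 and 1)

* Standardization (mean = 0, std = 1)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# fit_transform returns a NumPy array, not a DataFrame
df_scaled = scaler.fit_transform(df[['feature1', 'feature2']])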
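
For comparison, a minimal Min-Max scaling sketch with scikit-learn's MinMaxScaler. The feature values are invented for the example, and wrapping the result back into a DataFrame is one possible convention, not a requirement.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on very different scales.
df = pd.DataFrame({
    'feature1': [10.0, 20.0, 30.0],
    'feature2': [1000.0, 1500.0, 3000.0],
})

scaler = MinMaxScaler()  # default feature_range is (0, 1)
scaled = scaler.fit_transform(df[['feature1', 'feature2']])  # NumPy array

# Wrap the array back into a DataFrame to keep column names and index.
df_scaled = pd.DataFrame(scaled, columns=['feature1', 'feature2'], index=df.index)
print(df_scaled)
```

In a real train/test workflow, the scaler is usually fitted on the training split only and then applied to the test split with transform(), so that test data does not leak into the scaling parameters.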


---

5. Encoding Categorical Variables

• Convert categorical data into numerical:

* Label Encoding: Assigns an integer to each category. The integers imply an order, so this suits ordinal categories or tree-based models.

* One-Hot Encoding: Creates one binary column per category.

import pandas as pd

pd.get_dummies(df['category_column'])
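
A short sketch of both encodings side by side, using pandas only. The 'size' column and the explicit ordering are hypothetical, chosen to show how label codes differ from a hand-written ordinal mapping.

```python
import pandas as pd

# Hypothetical column; the category names are illustrative.
df = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# Label encoding via category codes (codes follow the sorted category order).
df['size_label'] = df['size'].astype('category').cat.codes

# If the categories have a natural order, an explicit mapping preserves it.
order = {'small': 0, 'medium': 1, 'large': 2}
df['size_ordinal'] = df['size'].map(order)

# One-hot encoding on the whole frame keeps the other columns in place.
one_hot = pd.get_dummies(df, columns=['size'], dtype=int)
print(one_hot)
```

Note that the category codes are assigned alphabetically (large = 0, medium = 1, small = 2), which is why an explicit mapping is worth the extra line when the order matters.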


---

6. Summary

• Data cleaning is essential for reliable modeling.

• Handling missing values, outliers, scaling, and encoding are key preprocessing steps.

---

Exercise

• Load a dataset, identify missing values, and apply mean imputation.

• Detect outliers using IQR and remove them.

• Normalize numeric features using standardization.

---

#DataCleaning #DataPreprocessing #MachineLearning #Python #DataScience

https://t.me/DataScienceM
             Age
count   5.000000
mean   30.000000
std     6.363961
min    22.000000
25%    26.000000
50%    29.000000
75%    35.000000
max    38.000000


---

10. df.columns
Returns the column labels of the DataFrame.

import pandas as pd
df = pd.DataFrame({'Name': [], 'Age': [], 'City': []})
print(df.columns)

Index(['Name', 'Age', 'City'], dtype='object')


---

11. df.dtypes
Returns the data type of each column.

import pandas as pd
df = pd.DataFrame({'Name': ['Alice'], 'Age': [25], 'Salary': [75000.50]})
print(df.dtypes)

Name       object
Age         int64
Salary    float64
dtype: object


---

12. Selecting a Column
Select a single column, which returns a Pandas Series.

import pandas as pd
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
ages = df['Age']
print(ages)

0    25
1    30
Name: Age, dtype: int64

#DataSelection #Indexing #Statistics

---

13. df.loc[]
Access a group of rows and columns by label(s) or a boolean array.

import pandas as pd
data = {'Age': [25, 30, 35], 'City': ['NY', 'LA', 'CH']}
df = pd.DataFrame(data, index=['Alice', 'Bob', 'Charlie'])
print(df.loc['Bob'])

Age     30
City    LA
Name: Bob, dtype: object
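
The description also mentions boolean arrays. As a small sketch on the same data, .loc can combine a boolean row mask with a column label, and it accepts label slices as well:

```python
import pandas as pd

data = {'Age': [25, 30, 35], 'City': ['NY', 'LA', 'CH']}
df = pd.DataFrame(data, index=['Alice', 'Bob', 'Charlie'])

# Boolean array: keep rows where the condition is True, and only the 'City' column.
older_cities = df.loc[df['Age'] > 28, 'City']

# Label slice: unlike regular Python slices, both endpoints are included.
first_two = df.loc['Alice':'Bob', ['Age']]

print(older_cities)
print(first_two)
```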


---

14. df.iloc[]
Access a group of rows and columns by integer position(s).

import pandas as pd
data = {'Age': [25, 30, 35], 'City': ['NY', 'LA', 'CH']}
df = pd.DataFrame(data, index=['Alice', 'Bob', 'Charlie'])
print(df.iloc[1])  # Get the second row (position 1)

Age     30
City    LA
Name: Bob, dtype: object
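
.iloc also takes slices and row/column pairs. A brief sketch on the same data, to show that positional slices exclude the stop position:

```python
import pandas as pd

data = {'Age': [25, 30, 35], 'City': ['NY', 'LA', 'CH']}
df = pd.DataFrame(data, index=['Alice', 'Bob', 'Charlie'])

# Rows at positions 0 and 1, column at position 0; the stop position is excluded.
subset = df.iloc[0:2, 0]

# A single value: last row, last column.
last_city = df.iloc[-1, -1]

print(subset)
print(last_city)
```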


---

15. df.isnull()
Returns a DataFrame of the same shape with boolean values indicating if a value is missing (NaN).

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, np.nan], 'B': [3, 4]})
print(df.isnull())

       A      B
0  False  False
1   True  False


---

16. df.dropna()
Removes rows that contain missing values (by default; pass axis=1 to drop columns instead).

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, 6]})
cleaned_df = df.dropna()
print(cleaned_df)

     A  B
0  1.0  4
2  3.0  6

#DataCleaning #MissingData

---

17. df.fillna()
Fills missing (NaN) values with a specified value or method.

import pandas as pd
import numpy as np
df = pd.DataFrame({'Score': [90, 85, np.nan, 92]})
filled_df = df.fillna(0)
print(filled_df)

   Score
0   90.0
1   85.0
2    0.0
3   92.0
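
Filling with a constant is only one option. The sketch below shows two alternatives on the same data: forward filling with ffill(), and a per-column fill value passed as a dict. In recent pandas versions the method argument of fillna() is deprecated, so ffill() and bfill() are used here instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Score': [90, 85, np.nan, 92]})

# Forward fill: carry the last valid observation forward.
forward_filled = df.ffill()

# Per-column fill value, here the mean of the non-missing scores.
mean_filled = df.fillna({'Score': df['Score'].mean()})

print(forward_filled)
print(mean_filled)
```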


---

18. df.drop_duplicates()
Removes duplicate rows from the DataFrame.

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Alice'], 'Age': [25, 30, 25]}
df = pd.DataFrame(data)
unique_df = df.drop_duplicates()
print(unique_df)

    Name  Age
0  Alice   25
1    Bob   30
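
By default a row counts as a duplicate only if every column matches. A small sketch of the subset and keep parameters, with the data altered slightly (Alice's second age is 26) so that the difference is visible:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice'], 'Age': [25, 30, 26]}
df = pd.DataFrame(data)

# No row is an exact duplicate here, so the default call keeps all three rows.
all_rows = df.drop_duplicates()

# Deduplicate on 'Name' only, keeping the last occurrence of each name.
by_name = df.drop_duplicates(subset='Name', keep='last')

print(by_name)
```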


---

19. df.rename()
Alters axes labels (e.g., column names).

import pandas as pd
df = pd.DataFrame({'A': [1], 'B': [2]})
renamed_df = df.rename(columns={'A': 'Column_A', 'B': 'Column_B'})
print(renamed_df)

   Column_A  Column_B
0         1         2


---

20. series.value_counts()
Returns a Series containing counts of unique values.
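
The original example for this entry is missing from the post, so here is a minimal sketch in the same style. The city values are made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
cities = pd.Series(['NY', 'LA', 'NY', 'CH', 'NY', 'LA'], name='City')

# Counts per unique value, sorted from most to least frequent.
counts = cities.value_counts()

# Relative frequencies instead of raw counts.
shares = cities.value_counts(normalize=True)

print(counts)
print(shares)
```
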
📌 I Cleaned a Messy CSV File Using Pandas. Here's the Exact Process I Follow Every Time.

🗂 Category: DATA SCIENCE

🕒 Date: 2025-11-26 | ⏱️ Read time: 17 min

Stop guessing when cleaning messy CSV files. This article details a repeatable 5-step workflow using Python's Pandas library to systematically diagnose and fix data quality issues. Learn a structured, practical process to transform your data preparation, moving from haphazard fixes to a reliable methodology for any data professional.

#Python #Pandas #DataCleaning #DataScience