متخصصان علم داده ها data scientists
ML Algorithms2⃣ Logistic Regression Logistic regression is used for binary classification problems, where the outcome is a categorical variable with two possible outcomes (e.g., 0 or 1, true or false). Instead of predicting a continuous value like linear…
Ex: Suppose we have a dataset that records whether a student has passed an exam based on the number of hours they studied using Python:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
# Example data
data = {
'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Passed': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
# Independent variable (feature) and dependent variable (target)
X = df[['Hours_Studied']]
y = df['Passed']
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Creating and training the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Making predictions
y_pred = model.predict(X_test)
y_pred_prob = model.predict_proba(X_test)[:, 1]
# Evaluating the model
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_prob)
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")
print(f"ROC-AUC: {roc_auc}")
# Plotting the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
🌲ML ALGORITHMS3⃣ Decision Trees :
Decision trees are a non-parametric supervised learning method used for both classification and regression tasks. They model decisions and their possible consequences in a tree-like structure, where internal nodes represent tests on features, branches represent the outcome of the test, and leaf nodes represent the final prediction (class label or value).
For classification, decision trees use measures like Gini impurity or entropy to split the data:
- Gini Impurity: Measures the likelihood of an incorrect classification of a randomly chosen element.
- Entropy (Information Gain): Measures the amount of uncertainty or impurity in the data.
For regression, decision trees minimize the variance (mean squared error) in the splits.
Decision trees are a non-parametric supervised learning method used for both classification and regression tasks. They model decisions and their possible consequences in a tree-like structure, where internal nodes represent tests on features, branches represent the outcome of the test, and leaf nodes represent the final prediction (class label or value).
For classification, decision trees use measures like Gini impurity or entropy to split the data:
- Gini Impurity: Measures the likelihood of an incorrect classification of a randomly chosen element.
- Entropy (Information Gain): Measures the amount of uncertainty or impurity in the data.
For regression, decision trees minimize the variance (mean squared error) in the splits.
متخصصان علم داده ها data scientists
🌲ML ALGORITHMS3⃣ Decision Trees : Decision trees are a non-parametric supervised learning method used for both classification and regression tasks. They model decisions and their possible consequences in a tree-like structure, where internal nodes represent…
Ex: Suppose we have a dataset with features like age, income, and student status to predict whether a person buys a computer using Python and its Libraries:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
# Example data
data = {
'Age': [25, 45, 35, 50, 23, 37, 32, 28, 40, 27],
'Income': ['High', 'High', 'High', 'Medium', 'Low', 'Low', 'Low', 'Medium', 'Low', 'Medium'],
'Student': ['No', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
'Buys_Computer': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes']
}
df = pd.DataFrame(data)
# Convert categorical features to numeric
df['Income'] = df['Income'].map({'Low': 1, 'Medium': 2, 'High': 3})
df['Student'] = df['Student'].map({'No': 0, 'Yes': 1})
df['Buys_Computer'] = df['Buys_Computer'].map({'No': 0, 'Yes': 1})
# Independent variables (features) and dependent variable (target)
X = df[['Age', 'Income', 'Student']]
y = df['Buys_Computer']
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Creating and training the decision tree model
model = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
model.fit(X_train, y_train)
# Making predictions
y_pred = model.predict(X_test)
# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")
# Plotting the decision tree
plt.figure(figsize=(12,8))
plot_tree(model, feature_names=['Age', 'Income', 'Student'], class_names=['No', 'Yes'], filled=True)
plt.title('Decision Tree')
plt.show()
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
# Example data
data = {
'Age': [25, 45, 35, 50, 23, 37, 32, 28, 40, 27],
'Income': ['High', 'High', 'High', 'Medium', 'Low', 'Low', 'Low', 'Medium', 'Low', 'Medium'],
'Student': ['No', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
'Buys_Computer': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes']
}
df = pd.DataFrame(data)
# Convert categorical features to numeric
df['Income'] = df['Income'].map({'Low': 1, 'Medium': 2, 'High': 3})
df['Student'] = df['Student'].map({'No': 0, 'Yes': 1})
df['Buys_Computer'] = df['Buys_Computer'].map({'No': 0, 'Yes': 1})
# Independent variables (features) and dependent variable (target)
X = df[['Age', 'Income', 'Student']]
y = df['Buys_Computer']
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Creating and training the decision tree model
model = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
model.fit(X_train, y_train)
# Making predictions
y_pred = model.predict(X_test)
# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")
# Plotting the decision tree
plt.figure(figsize=(12,8))
plot_tree(model, feature_names=['Age', 'Income', 'Student'], class_names=['No', 'Yes'], filled=True)
plt.title('Decision Tree')
plt.show()
ML Algorithms4⃣ : Random Forest 🌴🌳🌲🌵
Random Forest is an ensemble learning method that combines multiple decision trees to improve classification or regression performance. Each tree in the forest is built on a random subset of the data and a random subset of features. The final prediction is made by aggregating the predictions from all individual trees (majority vote for classification, average for regression).
Key advantages of Random Forest include:
- Reduced Overfitting: By averaging multiple trees, Random Forest reduces the risk of overfitting compared to individual decision trees.
- Robustness: Less sensitive to the variability in the data.
Random Forest is an ensemble learning method that combines multiple decision trees to improve classification or regression performance. Each tree in the forest is built on a random subset of the data and a random subset of features. The final prediction is made by aggregating the predictions from all individual trees (majority vote for classification, average for regression).
Key advantages of Random Forest include:
- Reduced Overfitting: By averaging multiple trees, Random Forest reduces the risk of overfitting compared to individual decision trees.
- Robustness: Less sensitive to the variability in the data.
متخصصان علم داده ها data scientists
ML Algorithms4⃣ : Random Forest 🌴🌳🌲🌵 Random Forest is an ensemble learning method that combines multiple decision trees to improve classification or regression performance. Each tree in the forest is built on a random subset of the data and a random subset…
Ex:Suppose we have a dataset that records whether a patient has a heart disease based on features like age, cholesterol level, and maximum heart rate.
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
# Example data
data = {
'Age': [29, 45, 50, 39, 48, 50, 55, 60, 62, 43],
'Cholesterol': [220, 250, 230, 180, 240, 290, 310, 275, 300, 280],
'Max_Heart_Rate': [180, 165, 170, 190, 155, 160, 150, 140, 130, 148],
'Heart_Disease': [0, 1, 1, 0, 1, 1, 1, 1, 1, 0]
}
df = pd.DataFrame(data)
# Independent variables (features) and dependent variable (target)
X = df[['Age', 'Cholesterol', 'Max_Heart_Rate']]
y = df['Heart_Disease']
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Creating and training the random forest model
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
# Making predictions
y_pred = model.predict(X_test)
# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")
# Feature importance
feature_importances = pd.DataFrame(model.feature_importances_, index=X.columns, columns=['Importance']).sort_values('Importance', ascending=False)
print(f"Feature Importances:\n{feature_importances}")
# Plotting the feature importances
sns.barplot(x=feature_importances.index, y=feature_importances['Importance'])
plt.title('Feature Importances')
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.show()
ML ALGORITHMS6⃣:KNN
K-Nearest Neighbors (KNN) is a simple, instance-based learning algorithm used for both classification and regression tasks. The main idea is to predict the value or class of a new sample based on the \( k \) closest samples (neighbors) in the training dataset.
For classification, the predicted class is the most common class among the \( k \) nearest neighbors. For regression, the predicted value is the average (or weighted average) of the values of the \( k \) nearest neighbors.
Key points:
- Distance Metric: Common distance metrics include Euclidean distance, Manhattan distance, and Minkowski distance.
- Choosing \( k \): The value of \( k \) is a crucial hyperparameter that needs to be chosen carefully. Smaller \( k \) values can lead to noise sensitivity, while larger \( k \) values can smooth out the decision boundary
K-Nearest Neighbors (KNN) is a simple, instance-based learning algorithm used for both classification and regression tasks. The main idea is to predict the value or class of a new sample based on the \( k \) closest samples (neighbors) in the training dataset.
For classification, the predicted class is the most common class among the \( k \) nearest neighbors. For regression, the predicted value is the average (or weighted average) of the values of the \( k \) nearest neighbors.
Key points:
- Distance Metric: Common distance metrics include Euclidean distance, Manhattan distance, and Minkowski distance.
- Choosing \( k \): The value of \( k \) is a crucial hyperparameter that needs to be chosen carefully. Smaller \( k \) values can lead to noise sensitivity, while larger \( k \) values can smooth out the decision boundary
متخصصان علم داده ها data scientists
ML ALGORITHMS6⃣:KNN K-Nearest Neighbors (KNN) is a simple, instance-based learning algorithm used for both classification and regression tasks. The main idea is to predict the value or class of a new sample based on the \( k \) closest samples (neighbors)…
Ex: Suppose we have a dataset that records features like sepal length and sepal width to classify the species of iris flowers:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
# Example data (Iris dataset)
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:, :2] # Using sepal length and sepal width as features
y = iris.target
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Creating and training the KNN model with k=5
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
# Making predictions
y_pred = model.predict(X_test)
# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")
# Plotting the decision boundary
def plot_decision_boundary(X, y, model):
h = .02 # step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.8)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, palette='bright', edgecolor='k', s=50)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.title('KNN Decision Boundary')
plt.show()
plot_decision_boundary(X_test, y_test, model)
ML ALGORITHMS7⃣: Naive Bayes
Naive Bayes is a family of probabilistic algorithms based on Bayes' Theorem with the "naive" assumption of independence between every pair of features. Despite this strong assumption, Naive Bayes classifiers have performed surprisingly well in many real-world applications, particularly for text classification.
🔱Types of Naive Bayes Classifiers
1. Gaussian Naive Bayes: Assumes that the features follow a normal distribution.
2. Multinomial Naive Bayes: Typically used for discrete data (e.g., text classification with word counts).
3. Bernoulli Naive Bayes: Used for binary/boolean features
Naive Bayes is a family of probabilistic algorithms based on Bayes' Theorem with the "naive" assumption of independence between every pair of features. Despite this strong assumption, Naive Bayes classifiers have performed surprisingly well in many real-world applications, particularly for text classification.
🔱Types of Naive Bayes Classifiers
1. Gaussian Naive Bayes: Assumes that the features follow a normal distribution.
2. Multinomial Naive Bayes: Typically used for discrete data (e.g., text classification with word counts).
3. Bernoulli Naive Bayes: Used for binary/boolean features
متخصصان علم داده ها data scientists
ML ALGORITHMS7⃣: Naive Bayes Naive Bayes is a family of probabilistic algorithms based on Bayes' Theorem with the "naive" assumption of independence between every pair of features. Despite this strong assumption, Naive Bayes classifiers have performed surprisingly…
Ex: Suppose we have a dataset that records features of different emails, such as word frequencies, to classify them as spam or not spam:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Example data
data = {
'Feature1': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
'Feature2': [5, 4, 3, 2, 1, 5, 4, 3, 2, 1],
'Feature3': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
'Spam': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
# Independent variables (features) and dependent variable (target)
X = df[['Feature1', 'Feature2', 'Feature3']]
y = df['Spam']
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Creating and training the Multinomial Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)
# Making predictions
y_pred = model.predict(X_test)
# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")
ML ALGORITHMS5⃣ :Support Vector Machines
Support Vector Machines (SVM) are supervised learning models used for classification and regression tasks. The goal of SVM is to find the optimal hyperplane that maximally separates the classes in the feature space. The hyperplane is chosen to maximize the margin, which is the distance between the hyperplane and the nearest data points from each class, known as support vectors.
For nonlinear data, SVM uses a kernel trick to transform the input features into a higher-dimensional space where a linear separation is possible. Common kernels include:
- Linear Kernel
- Polynomial Kernel
- Radial Basis Function (RBF) Kernel
- Sigmoid Kernel
Support Vector Machines (SVM) are supervised learning models used for classification and regression tasks. The goal of SVM is to find the optimal hyperplane that maximally separates the classes in the feature space. The hyperplane is chosen to maximize the margin, which is the distance between the hyperplane and the nearest data points from each class, known as support vectors.
For nonlinear data, SVM uses a kernel trick to transform the input features into a higher-dimensional space where a linear separation is possible. Common kernels include:
- Linear Kernel
- Polynomial Kernel
- Radial Basis Function (RBF) Kernel
- Sigmoid Kernel
متخصصان علم داده ها data scientists
ML ALGORITHMS5⃣ :Support Vector Machines Support Vector Machines (SVM) are supervised learning models used for classification and regression tasks. The goal of SVM is to find the optimal hyperplane that maximally separates the classes in the feature space.…
Ex:
Suppose we have a dataset that records features like petal length and petal width to classify the species of iris flowers :
Suppose we have a dataset that records features like petal length and petal width to classify the species of iris flowers :
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
# Example data (Iris dataset)
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:, 2:4] # Using petal length and petal width as features
y = iris.target
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Creating and training the SVM model with RBF kernel
model = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=0)
model.fit(X_train, y_train)
# Making predictions
y_pred = model.predict(X_test)
# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")
# Plotting the decision boundary
def plot_decision_boundary(X, y, model):
h = .02 # step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.8)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, palette='bright', edgecolor='k', s=50)
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.title('SVM Decision Boundary')
plt.show()
plot_decision_boundary(X_test, y_test, model)
ML Algorithms8⃣: K-Means Clustering
k-Means is an unsupervised learning algorithm used for clustering tasks. The goal is to partition a dataset into \( k \) clusters, where each data point belongs to the cluster with the nearest mean. It is an iterative algorithm that aims to minimize the variance within each cluster.
The steps involved in k-Means clustering are:
1. Initialization: Choose \( k \) initial cluster centroids randomly.
2. Assignment: Assign each data point to the nearest cluster centroid.
3. Update: Recalculate the centroids as the mean of all points in each cluster.
4. Repeat: Repeat steps 2 and 3 until the centroids do not change significantly or a maximum number of iterations is reached.
k-Means is an unsupervised learning algorithm used for clustering tasks. The goal is to partition a dataset into \( k \) clusters, where each data point belongs to the cluster with the nearest mean. It is an iterative algorithm that aims to minimize the variance within each cluster.
The steps involved in k-Means clustering are:
1. Initialization: Choose \( k \) initial cluster centroids randomly.
2. Assignment: Assign each data point to the nearest cluster centroid.
3. Update: Recalculate the centroids as the mean of all points in each cluster.
4. Repeat: Repeat steps 2 and 3 until the centroids do not change significantly or a maximum number of iterations is reached.
متخصصان علم داده ها data scientists
ML Algorithms8⃣: K-Means Clustering k-Means is an unsupervised learning algorithm used for clustering tasks. The goal is to partition a dataset into \( k \) clusters, where each data point belongs to the cluster with the nearest mean. It is an iterative…
Ex: Suppose we have a dataset with points in 2D space, and we want to cluster them into \( k = 3 \) clusters.
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
# Example data
np.random.seed(0)
X = np.vstack((np.random.normal(0, 1, (100, 2)),
np.random.normal(5, 1, (100, 2)),
np.random.normal(-5, 1, (100, 2))))
# Applying k-Means clustering
k = 3
kmeans = KMeans(n_clusters=k, random_state=0)
y_kmeans = kmeans.fit_predict(X)
# Plotting the clusters
plt.figure(figsize=(8,6))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y_kmeans, palette='viridis', s=50, edgecolor='k')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('k-Means Clustering')
plt.legend()
plt.show()