Tech Jobs Hub
Jobs is your go-to channel for the latest job opportunities in Data Science, Programming, Web Development, Design, and more.

We bring you handpicked job listings, career tips, and resources to help you learn, grow, and land your dream role.
Question: How can I implement the K-Nearest Neighbors (KNN) algorithm for classification using scikit-learn? Provide a Python example, explain how distance metrics affect predictions, and discuss the impact of choosing different values of k.

Answer:
KNN is a non-parametric algorithm that classifies data points based on the majority class among their k nearest neighbors in feature space.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns

# Load dataset
data = datasets.load_iris()
X = data.data
y = data.target
feature_names = data.feature_names
target_names = data.target_names

# Split and scale data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN model with k=5
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = knn.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=target_names, yticklabels=target_names)
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

# Visualize decision boundaries (for first two features only)
plt.figure(figsize=(8, 6))
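X_plot = X[:, :2]  # Use only first two features for visualization
# Use a separate scaler so the one fitted on the training split is not overwritten
plot_scaler = StandardScaler()
X_plot_scaled = plot_scaler.fit_transform(X_plot)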
knn_visual = KNeighborsClassifier(n_neighbors=5)
knn_visual.fit(X_plot_scaled, y)
h = 0.02
x_min, x_max = X_plot_scaled[:, 0].min() - 1, X_plot_scaled[:, 0].max() + 1
y_min, y_max = X_plot_scaled[:, 1].min() - 1, X_plot_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = knn_visual.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.Paired)
for i, color in enumerate(['red', 'green', 'blue']):
    mask = y == i  # Boolean mask selecting the samples of class i
    plt.scatter(X_plot_scaled[mask, 0], X_plot_scaled[mask, 1], c=color, label=target_names[i], edgecolors='k')
plt.xlabel(f"{feature_names[0]} (standardized)")
plt.ylabel(f"{feature_names[1]} (standardized)")
plt.title('KNN Decision Boundaries (First Two Features)')
plt.legend()
plt.show()


Explanation:
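- Distance Metrics: Common choices include Euclidean, Manhattan, and Minkowski, which generalizes both (p=2 gives Euclidean, p=1 gives Manhattan). scikit-learn defaults to metric='minkowski' with p=2, which is Euclidean distance. The metric decides which training points count as nearest, so changing it can change the neighbor set and therefore the predicted class. Euclidean suits continuous features on comparable scales; Manhattan is less affected by a large difference in a single feature.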
- Choice of k:
  - Small k (e.g., 1 or 3): Sensitive to noise, may overfit.
  - Large k: Smoother decision boundaries, but may underfit.
  - A good k is usually chosen by cross-validation: try several values and keep the one with the best validation score.
- Standardization: Crucial because KNN relies on distances; without scaling, features with larger numeric ranges dominate the distance and the others are effectively ignored.
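
The points above about k, the metric, and scaling can be checked together with a small cross-validated search. This is a minimal sketch, not a tuned setup: the candidate values of k, the two metrics, and the 5-fold split are assumptions chosen for illustration, and the scaler is placed inside a Pipeline so it is re-fit on each training fold.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so it is re-fit on each training fold,
# which keeps validation data out of the scaler's statistics.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate values are illustrative assumptions, not recommendations.
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 9, 11, 15],
    "knn__metric": ["euclidean", "manhattan"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```

On a dataset as small and clean as Iris, several settings tend to score almost the same, so the "best" k here says little about other data; the pattern of searching over k and the metric is what carries over.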

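Time Complexity: O(nm) per prediction with brute-force search, where n is the number of training samples and m is the number of features, since the distance to every training point must be computed. Training itself is essentially just storing the data. scikit-learn's default algorithm='auto' may instead build a KD-tree or ball tree, which can make queries faster on low-dimensional data.
Space Complexity: O(nm) to store the training data.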
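
To see where the O(nm) query cost comes from, here is a minimal brute-force sketch in NumPy. It is not how scikit-learn implements KNN internally; the function name knn_predict_one and the toy data are hypothetical, and it assumes Euclidean distance, non-negative integer labels, and k no larger than the number of training samples.

```python
import numpy as np

def knn_predict_one(X_train, y_train, x, k=3):
    """Classify one point x by majority vote among its k nearest training points."""
    # Distance from x to every training row: n rows times m features, hence O(nm).
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Indices of the k smallest distances (argpartition avoids a full sort).
    nearest = np.argpartition(dists, k - 1)[:k]
    # Count votes per label; on a tie, argmax returns the smallest label.
    votes = np.bincount(y_train[nearest])
    return int(votes.argmax())

# Toy data (hypothetical): two well-separated clusters.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict_one(X_train, y_train, np.array([0.5, 0.5])))  # prints 0
print(knn_predict_one(X_train, y_train, np.array([5.5, 5.5])))  # prints 1
```

Every query touches the whole training set, which is why KNN slows down as the dataset grows and why tree-based neighbor search is worth considering for larger, low-dimensional data.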
Use Case: KNN is simple, effective for small-to-medium datasets, and works well when patterns are localized.

#MachineLearning #KNN #Classification #ScikitLearn #DataScience #PythonProgramming #AlgorithmExplained #DimensionalityReduction #SupervisedLearning

By: @DataScienceQ 🚀