Machine Learning And AI
1.65K subscribers
198 photos
1 video
19 files
351 links
Hi All and Welcome Join our channel for Jobs,latest Programming Blogs, machine learning blogs.
In case any doubt regarding ML/Data Science please reach out to me @ved1104 subscribe my channel
https://youtube.com/@geekycodesin?si=JzJo3WS5E_VFmD1k
Download Telegram
Amazon Data Science Interview Question:
In a linear regression model, what are the key assumptions that need to be satisfied for the model to be valid? How would you evaluate whether these assumptions hold in your dataset?

This is also, the most common question I see across companies!

So the assumptions are -

๐—Ÿ๐—ถ๐—ป๐—ฒ๐—ฎ๐—ฟ๐—ถ๐˜๐˜†
The relationship between the independent variables (predictors) and the dependent variable is linear. This means that the effect of each predictor on the outcome is constant and additive.
How to evaluate? - Scatter plots of predictors vs. the dependent variable and residual vs. fitted value plots. You can also use polynomial regression or transformations (log, square root) if non-linearity is detected.
How to fix? - Apply feature transformations (e.g., log, square root, polynomial) or use non-linear models.

๐—ก๐—ผ๐—ฟ๐—บ๐—ฎ๐—น๐—ถ๐˜๐˜† ๐—ผ๐—ณ ๐—˜๐—ฟ๐—ฟ๐—ผ๐—ฟ๐˜€
The residuals are normally distributed, especially for the purpose of conducting statistical tests and constructing confidence intervals.
How to evaluate - Residual autocorrelation plots or the Durbin-Watson test for time-series data. For non-time-series data, this assumption can often be assumed to be satisfied if the data is randomly sampled.
How to fix - Transform the dependent variable (log, box-cox) and/or check for outliers.

๐—›๐—ผ๐—บ๐—ผ๐˜€๐—ฐ๐—ฒ๐—ฑ๐—ฎ๐˜€๐˜๐—ถ๐—ฐ๐—ถ๐˜๐˜† (๐—–๐—ผ๐—ป๐˜€๐˜๐—ฎ๐—ป๐˜ ๐—ฉ๐—ฎ๐—ฟ๐—ถ๐—ฎ๐—ป๐—ฐ๐—ฒ ๐—ผ๐—ณ ๐—˜๐—ฟ๐—ฟ๐—ผ๐—ฟ๐˜€)
The variance of the residuals (errors) is constant across all levels of the independent variables. In other words, the spread of residuals should not increase or decrease as the predicted values increase.
How to evaluate - Plot the residuals against fitted values. If the plot shows a "fan" shape (i.e., increasing or decreasing spread of residuals), you may need to address heteroscedasticity using robust standard errors or a transformation (e.g., log-transformation).
How to fix - Transformation of dependent variable (log, box-cox) or weighted least squares regression can help

๐—ก๐—ผ ๐— ๐˜‚๐—น๐˜๐—ถ๐—ฐ๐—ผ๐—น๐—น๐—ถ๐—ป๐—ฒ๐—ฎ๐—ฟ๐—ถ๐˜๐˜†
The independent variables (predictors) are not highly correlated with each other. High correlation between predictors can lead to multicollinearity, which makes it difficult to determine the individual effect of each predictor on the dependent variable.
How to evaluate - Calculate the Variance Inflation Factor (VIF) for each predictor. If VIF is high, consider removing highly correlated predictors or combining them into a single predictor (e.g., using Principal Component Analysis).
How to fix - Remove or combine correlated predictors, or use regularized regression models like Ridge or Lasso regression.
๐——๐—ฎ๐˜๐—ฎ ๐—ฆ๐—ฐ๐—ถ๐—ฒ๐—ป๐—ฐ๐—ฒ ๐—œ๐—ป๐˜๐—ฒ๐—ฟ๐˜ƒ๐—ถ๐—ฒ๐˜„ ๐—ค๐˜‚๐—ฒ๐˜€๐˜๐—ถ๐—ผ๐—ป:
How does outliers impact kNN?

Outliers can significantly impact the performance of kNN, leading to inaccurate predictions due to the model's reliance on proximity for decision-making. Hereโ€™s a breakdown of how outliers influence kNN:

๐—›๐—ถ๐—ด๐—ต ๐—ฉ๐—ฎ๐—ฟ๐—ถ๐—ฎ๐—ป๐—ฐ๐—ฒ
The presence of outliers can increase the model's variance, as predictions near outliers may fluctuate unpredictably depending on which neighbors are included. This makes the model less reliable for regression tasks with scattered or sparse data.

๐——๐—ถ๐˜€๐˜๐—ฎ๐—ป๐—ฐ๐—ฒ ๐— ๐—ฒ๐˜๐—ฟ๐—ถ๐—ฐ ๐—ฆ๐—ฒ๐—ป๐˜€๐—ถ๐˜๐—ถ๐˜ƒ๐—ถ๐˜๐˜†
kNN relies on distance metrics, which can be significantly affected by outliers. In high-dimensional spaces, outliers can increase the range of distances, making it harder for the algorithm to distinguish between nearby points and those farther away. This issue can lead to an overall reduction in accuracy as the modelโ€™s ability to effectively measure "closeness" degrades.

๐—ฅ๐—ฒ๐—ฑ๐˜‚๐—ฐ๐—ฒ ๐—ฝ๐—ฒ๐—ฟ๐—ณ๐—ผ๐—ฟ๐—บ๐—ฎ๐—ป๐—ฐ๐—ฒ ๐—ถ๐—ป ๐—–๐—น๐—ฎ๐˜€๐˜€๐—ถ๐—ณ๐—ถ๐—ฐ๐—ฎ๐˜๐—ถ๐—ผ๐—ป/๐—ฅ๐—ฒ๐—ด๐—ฟ๐—ฒ๐˜€๐˜€๐—ถ๐—ผ๐—ป ๐—ง๐—ฎ๐˜€๐—ธ๐˜€
Outliers near class boundaries can pull the decision boundary toward them, potentially misclassifying nearby points that should belong to a different class. This is particularly problematic if k is small, as individual points (like outliers) have a greater influence. The same happens in regression tasks as well.

๐—™๐—ฒ๐—ฎ๐˜๐˜‚๐—ฟ๐—ฒ ๐—œ๐—ป๐—ณ๐—น๐˜‚๐—ฒ๐—ป๐—ฐ๐—ฒ ๐——๐—ถ๐˜€๐—ฝ๐—ฟ๐—ผ๐—ฝ๐—ผ๐—ฟ๐˜๐—ถ๐—ผ๐—ป
If certain features contain outliers, they can dominate the distance calculations and overshadow the impact of other features. For example, an outlier in a high-magnitude feature may cause distances to be determined largely by that feature, affecting the quality of the neighbor selection.
Company Name : Amazon
Role : Cloud Support Associate
Batch : 2024/2023 passouts

Link : https://www.amazon.jobs/en/jobs/2676989/cloud-support-associate
๐Ÿ‘1
Company Name : Swiggy
Role : Associate Software Engineer
Batch : 2024/2023/2022 passouts

Link : https://docs.google.com/forms/d/1E029cjZV8Em6zPC0YJYAMDDP_NjPtDkwufqHfvkVG2E/viewform?edit_requested=true&pli=1