Amazon Data Science Interview Question:
In a linear regression model, what are the key assumptions that need to be satisfied for the model to be valid? How would you evaluate whether these assumptions hold in your dataset?
This is also the most common question I see across companies!
So the assumptions are -
Linearity
The relationship between the independent variables (predictors) and the dependent variable is linear. This means that the effect of each predictor on the outcome is constant and additive.
How to evaluate? - Scatter plots of predictors vs. the dependent variable, and residual vs. fitted value plots; a curved or systematic pattern in the residuals signals non-linearity (see the sketch below).
How to fix? - Apply feature transformations (e.g., log, square root, polynomial) or use non-linear models.
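As a quick illustration, here is a minimal sketch of the residuals-vs-fitted check on synthetic data (the DataFrame `df` and the columns `X1`, `X2`, `y` are made up for the example):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Synthetic data standing in for a real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({"X1": rng.normal(size=200), "X2": rng.normal(size=200)})
df["y"] = 2 * df["X1"] - df["X2"] + rng.normal(scale=0.5, size=200)

X = sm.add_constant(df[["X1", "X2"]])
model = sm.OLS(df["y"], X).fit()

# Residuals vs. fitted values: a shapeless cloud around zero is
# consistent with linearity; a curve suggests a non-linear effect.
plt.scatter(model.fittedvalues, model.resid, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```

The sketches for the remaining assumptions reuse this fitted `model` and design matrix `X`.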
Normality of Errors
The residuals are normally distributed. This matters mainly for hypothesis tests on the coefficients and for constructing confidence intervals; the coefficient estimates themselves do not require it.
How to evaluate? - A Q-Q plot of the residuals (points should hug the 45-degree line), a histogram of residuals, or a formal test such as Shapiro-Wilk. (Note: residual autocorrelation plots and the Durbin-Watson test check a different assumption - independence of errors - which is mainly a concern for time-series data.)
How to fix? - Transform the dependent variable (log, Box-Cox) and/or check for outliers.
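A minimal sketch of the normality checks, reusing the fitted `model` from the linearity sketch above:

```python
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Q-Q plot: residual quantiles vs. normal quantiles; points hugging
# the 45-degree line indicate approximately normal residuals.
sm.qqplot(model.resid, line="45", fit=True)
plt.show()

# Shapiro-Wilk test: a small p-value (e.g., < 0.05) is evidence
# that the residuals deviate from normality.
stat, p_value = stats.shapiro(model.resid)
print(f"Shapiro-Wilk: statistic={stat:.3f}, p-value={p_value:.3f}")
```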
Homoscedasticity (Constant Variance of Errors)
The variance of the residuals (errors) is constant across all levels of the independent variables. In other words, the spread of residuals should not increase or decrease as the predicted values increase.
How to evaluate? - Plot the residuals against fitted values; a "fan" or "cone" shape (spread that grows or shrinks with the fitted values) indicates heteroscedasticity. A formal test such as Breusch-Pagan can confirm it (see the sketch below).
How to fix? - Transform the dependent variable (log, Box-Cox), use weighted least squares regression, or report heteroscedasticity-robust standard errors.
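A minimal sketch of the Breusch-Pagan test, again reusing the fitted `model`:

```python
from statsmodels.stats.diagnostic import het_breuschpagan

# H0: constant residual variance. A small p-value suggests the
# variance changes with the predictors (heteroscedasticity).
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(
    model.resid, model.model.exog
)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.3f}")
```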
No Multicollinearity
The independent variables (predictors) are not highly correlated with each other. High correlation between predictors can lead to multicollinearity, which makes it difficult to determine the individual effect of each predictor on the dependent variable.
How to evaluate? - Calculate the Variance Inflation Factor (VIF) for each predictor; values above roughly 5-10 are a common red flag. A correlation matrix of the predictors is a quick first check.
How to fix? - Remove or combine highly correlated predictors (e.g., via Principal Component Analysis), or use regularized regression models like Ridge or Lasso.
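A minimal sketch of the VIF calculation, reusing the design matrix `X` (with constant) from the first sketch:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Rule of thumb: VIF above roughly 5-10 flags problematic
# multicollinearity. The constant term's VIF can be ignored.
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)
```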
Data Science Interview Question:
How do outliers impact kNN?
Outliers can significantly degrade the performance of kNN, leading to inaccurate predictions, because the model relies entirely on proximity for decision-making. Here's a breakdown of how outliers influence kNN:
High Variance
The presence of outliers can increase the model's variance, as predictions near outliers may fluctuate unpredictably depending on which neighbors are included. This makes the model less reliable for regression tasks with scattered or sparse data.
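A minimal sketch (entirely synthetic data) of how a single target outlier makes small-k kNN regression predictions swing in its neighborhood:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 10, 50)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=50)
y[25] += 5.0  # inject one target outlier

query = X[23:28]  # query points in the outlier's neighborhood
for k in (1, 3, 15):
    preds = KNeighborsRegressor(n_neighbors=k).fit(X, y).predict(query)
    # Small k lets the outlier dominate its neighborhood; larger k
    # averages it away, trading variance for bias.
    print(f"k={k}: predictions near the outlier -> {np.round(preds, 2)}")
```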
Distance Metric Sensitivity
kNN relies on distance metrics, which can be significantly affected by outliers. In high-dimensional spaces, outliers can increase the range of distances, making it harder for the algorithm to distinguish between nearby points and those farther away. This issue can lead to an overall reduction in accuracy as the model's ability to effectively measure "closeness" degrades.
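A minimal sketch (synthetic data) of how one extreme point stretches the distance scale kNN works with:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                  # well-behaved points
X_out = np.vstack([X, 50 * np.ones((1, 5))])  # add one extreme outlier

for data, label in ((X, "without outlier"), (X_out, "with outlier")):
    d = pairwise_distances(data)
    # The outlier inflates the maximum pairwise distance by an order
    # of magnitude, compressing the "ordinary" distances together.
    print(f"{label}: max pairwise distance = {d.max():.1f}")
```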
Reduced Performance in Classification/Regression Tasks
Outliers near class boundaries can pull the decision boundary toward them, potentially misclassifying nearby points that should belong to a different class. This is particularly problematic if k is small, as individual points (like outliers) have a greater influence. The same happens in regression tasks as well.
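A minimal sketch (hand-crafted toy data) of an outlier flipping a nearby classification when k is small:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Class 0 clustered near the origin, class 1 far away, plus one
# class-1 outlier sitting inside the class-0 cluster.
X = np.array([
    [0.0, 0.0], [0.4, 0.1], [0.1, 0.4], [0.5, 0.5], [0.3, 0.6],  # class 0
    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1], [5.1, 5.3], [4.9, 4.8],  # class 1
    [0.2, 0.2],                                                  # class-1 outlier
])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

query = np.array([[0.25, 0.25]])  # clearly inside the class-0 region
for k in (1, 5):
    pred = KNeighborsClassifier(n_neighbors=k).fit(X, y).predict(query)
    print(f"k={k}: predicted class = {pred[0]}")
# k=1 follows the outlier (class 1); k=5 votes with the cluster (class 0).
```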
Feature Influence Disproportion
If certain features contain outliers, they can dominate the distance calculations and overshadow the impact of other features. For example, an outlier in a high-magnitude feature may cause distances to be determined largely by that feature, affecting the quality of the neighbor selection.
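A minimal sketch (synthetic data) of how scaling restores the informative feature. RobustScaler (median/IQR based) is one reasonable choice here since it is itself resistant to outliers; in this example a high-magnitude noise feature stands in for an outlier-dominated one:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(7)
# Feature 0 carries the signal; feature 1 is high-magnitude noise,
# so raw Euclidean distances are dominated by feature 1.
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = (X[:, 0] > 0).astype(int)

raw = KNeighborsClassifier(n_neighbors=5)
scaled = make_pipeline(RobustScaler(), KNeighborsClassifier(n_neighbors=5))
print("raw accuracy:   ", cross_val_score(raw, X, y, cv=5).mean().round(2))
print("scaled accuracy:", cross_val_score(scaled, X, y, cv=5).mean().round(2))
```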
Company Name : Amazon
Role : Cloud Support Associate
Batch : 2024/2023 passouts
Link : https://www.amazon.jobs/en/jobs/2676989/cloud-support-associate
Company Name : Swiggy
Role : Associate Software Engineer
Batch : 2024/2023/2022 passouts
Link : https://docs.google.com/forms/d/1E029cjZV8Em6zPC0YJYAMDDP_NjPtDkwufqHfvkVG2E/viewform?edit_requested=true&pli=1