Data Science Interview Question:
How do outliers impact kNN?
Outliers can significantly impact the performance of kNN, leading to inaccurate predictions because of the model's reliance on proximity for decision-making. Here's a breakdown of how outliers influence kNN:
High Variance
The presence of outliers can increase the model's variance, as predictions near outliers may fluctuate unpredictably depending on which neighbors are included. This makes the model less reliable for regression tasks with scattered or sparse data.
Distance Metric Sensitivity
kNN relies on distance metrics, which can be significantly affected by outliers. In high-dimensional spaces, outliers can increase the range of distances, making it harder for the algorithm to distinguish between nearby points and those farther away. This issue can lead to an overall reduction in accuracy as the model's ability to effectively measure "closeness" degrades.
Reduced Performance in Classification/Regression Tasks
Outliers near class boundaries can pull the decision boundary toward them, potentially misclassifying nearby points that should belong to a different class. This is particularly problematic if k is small, as individual points (like outliers) have a greater influence. The same happens in regression tasks as well.
Feature Influence Disproportion
If certain features contain outliers, they can dominate the distance calculations and overshadow the impact of other features. For example, an outlier in a high-magnitude feature may cause distances to be determined largely by that feature, affecting the quality of the neighbor selection.
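The effect above can be sketched with a toy example (all numbers are made up): without scaling, a high-magnitude feature like salary decides the Euclidean distance almost by itself, and min-max scaling restores the influence of the other feature.

```python
from math import dist

# Hypothetical 2-feature points: (salary in $, age in years).
query = (50_000, 30)
a = (50_100, 65)   # salary close, age very different
b = (51_000, 31)   # salary farther, age almost identical

# Raw distances: salary dominates, so a looks "nearer" than b.
print(dist(query, a))  # ~106
print(dist(query, b))  # ~1000

# Min-max scale both features (assumed ranges: salary 0-100k, age 0-100).
def scale(p):
    return (p[0] / 100_000, p[1] / 100)

# After scaling, the large age gap makes a the farther neighbor.
print(dist(scale(query), scale(a)))
print(dist(scale(query), scale(b)))
```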
Company Name : Amazon
Role : Cloud Support Associate
Batch : 2024/2023 passouts
Link : https://www.amazon.jobs/en/jobs/2676989/cloud-support-associate
Company Name : Swiggy
Role : Associate Software Engineer
Batch : 2024/2023/2022 passouts
Link : https://docs.google.com/forms/d/1E029cjZV8Em6zPC0YJYAMDDP_NjPtDkwufqHfvkVG2E/viewform?edit_requested=true&pli=1
Data science interview questions :
Position : Data Scientist
There were 3 rounds of interview followed by 1 HR discussion.
Coding related questions :
1. You are given 2 lists
l1 = [1,2,2,3,4,5,6,6,7]
l2 = [1,2,4,5,5,6,6,7]
Return all the elements from list l2 which were spotted in l1, as per their frequency of occurrence.
For eg: elements in l2 which occurred in l1 are: 1, 2 (only once), 4, 5, 6, 7
Expected Output : [1,2,4,5,6,6,7]
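A possible solution (function name is my own) using collections.Counter to cap each element of l2 at its l1 frequency:

```python
from collections import Counter

def common_by_frequency(l1, l2):
    # Keep each element of l2 at most as many times as it appears in l1.
    avail = Counter(l1)
    out = []
    for x in l2:
        if avail[x] > 0:
            out.append(x)
            avail[x] -= 1
    return out

l1 = [1, 2, 2, 3, 4, 5, 6, 6, 7]
l2 = [1, 2, 4, 5, 5, 6, 6, 7]
print(common_by_frequency(l1, l2))  # [1, 2, 4, 5, 6, 6, 7]
```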
2.
text = ' I am a data scientist working in Paypal'
Return the longest word along with its count of letters.
Expected Output : ('scientist', 9)
(In case of ties, sort the words in alphabetical order and return the 1st one.)
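One way to handle both the length requirement and the alphabetical tie-break with a single sort key (function name is mine):

```python
def longest_word(text):
    # min() with key (-length, word) picks the longest word,
    # breaking ties alphabetically.
    best = min(text.split(), key=lambda w: (-len(w), w))
    return (best, len(best))

text = ' I am a data scientist working in Paypal'
print(longest_word(text))        # ('scientist', 9)
print(longest_word('cc aa bb'))  # all length 2 -> alphabetical first: ('aa', 2)
```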
3. You are given 2 tables in SQL. Table 1 contains one column which has only a single value '3' repeated 5 times; table 2 contains only one column which has 2 values: '2' repeated 3 times & '3' repeated 4 times. Tell me the number of records in case of inner, left, right, outer, and cross joins.
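The expected counts can be reasoned out (only the '3's match, giving 5 x 4 = 20 pairs) and brute-force checked in Python:

```python
# table 1: '3' x 5; table 2: '2' x 3 and '3' x 4
t1 = ['3'] * 5
t2 = ['2'] * 3 + ['3'] * 4

inner = sum(1 for a in t1 for b in t2 if a == b)   # matching pairs: 5 * 4
left  = inner + sum(1 for a in t1 if a not in t2)  # plus unmatched t1 rows
right = inner + sum(1 for b in t2 if b not in t1)  # plus unmatched t2 rows
outer = inner + (left - inner) + (right - inner)   # unmatched from both sides
cross = len(t1) * len(t2)                          # every pair

print(inner, left, right, outer, cross)  # 20 20 23 23 35
```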
4. You are given a transaction table which has txn_id (primary key), cust_id (foreign key), txn_date (datetime), txn_amt as its 4 columns; one cust_id has multiple txn_ids. Your job is to find out all the cust_ids for which there were a minimum of 2 txns made within 10 seconds of each other. (This might help to identify fraudulent patterns.)
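Before writing the SQL (e.g. a self-join or LAG over txn_date), the logic can be prototyped in Python; the rows below are hypothetical:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical (txn_id, cust_id, txn_date, txn_amt) rows.
txns = [
    (1, 'C1', datetime(2024, 1, 1, 10, 0, 0), 100),
    (2, 'C1', datetime(2024, 1, 1, 10, 0, 7),  50),  # 7s after txn 1
    (3, 'C2', datetime(2024, 1, 1, 11, 0, 0), 200),
    (4, 'C2', datetime(2024, 1, 1, 11, 5, 0),  75),  # 5 min gap
]

def flag_rapid_customers(rows, window=timedelta(seconds=10)):
    by_cust = defaultdict(list)
    for _, cust, ts, _ in rows:
        by_cust[cust].append(ts)
    flagged = set()
    for cust, times in by_cust.items():
        times.sort()
        # two consecutive txns within the window is enough to flag
        if any(t2 - t1 <= window for t1, t2 in zip(times, times[1:])):
            flagged.add(cust)
    return flagged

print(flag_rapid_customers(txns))  # {'C1'}
```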
Case study questions :
1. Tell me the business model and revenue sources of Paypal.
Tell me the general consequences when you change the pricing of any product.
2. You are a data scientist who studies the impact of pricing changes. Suppose the business team comes to you and asks what will happen if they increase the merchant fee per txn by 5%.
What will be your recommendation & strategy?
What factors will you consider, and what will the proposed ML solution look like?
3. You see that some merchants are heavily misusing the refund facility (merchants get refunds for incorrect/disputed txns); they are claiming reimbursements by doing fake txns.
List possible scenarios and ways to identify such merchants.
4. How will you decide the pricing of a premier product like the iPhone in India vs., say, South Africa? What factors will you consider?
Statistics Questions:
1. What is multicollinearity?
2. What are Type I and Type II errors? (Explain from a pricing experimentation POV.)
3. What is the Weibull distribution?
4. What is the CLT? What is the difference between the t and normal distributions?
5. What is Wald's test?
6. What is the Ljung-Box test? Explain the null hypothesis of the ADF test in time series.
7. What is causality?
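For the CLT question, a quick simulation helps (distribution and sample sizes chosen arbitrarily): means of samples from a skewed exponential population cluster around the population mean, with spread roughly sigma/sqrt(n).

```python
import random
import statistics

random.seed(0)

# Means of n=40 draws from Exp(1) (population mean 1, sd 1),
# repeated 2000 times.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(40))
    for _ in range(2000)
]

print(statistics.fmean(sample_means))   # close to 1.0
print(statistics.stdev(sample_means))   # close to 1/sqrt(40) ~ 0.158
```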
ML Questions:
1. What is logistic regression? What is deviance?
2. What is the difference between R-squared & adjusted R-squared?
3. How does Random Forest work?
4. What is the difference between bagging and boosting?
On the bias-variance paradigm, what do bagging/boosting attempt to solve?
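For the last question, a toy illustration (the setup is mine, not from the interview): treat each "model" as a bootstrap sample mean, and averaging B of them, as bagging does, shrinks the variance of the prediction.

```python
import random
import statistics

random.seed(1)

data = [random.gauss(0, 1) for _ in range(50)]

def single_model(data):
    # one bootstrap "model": the mean of a resample
    return statistics.fmean(random.choices(data, k=len(data)))

def bagged_model(data, B=25):
    # bagging: average B bootstrap models
    return statistics.fmean(single_model(data) for _ in range(B))

singles = [single_model(data) for _ in range(300)]
bagged = [bagged_model(data) for _ in range(300)]

# Bagged predictions vary far less across runs than single models.
print(statistics.stdev(singles) > statistics.stdev(bagged))  # True
```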
I struggled with Data Science interviews until...
I followed this roadmap:
Python
👉🏼 Master the basics: syntax, loops, functions, and data structures (lists, dictionaries, sets, tuples)
👉🏼 Learn Pandas & NumPy for data manipulation
👉🏼 Matplotlib & Seaborn for data visualization
Statistics & Probability
👉🏼 Descriptive statistics: mean, median, mode, standard deviation
👉🏼 Probability theory: distributions, Bayes' theorem, conditional probability
👉🏼 Hypothesis testing & A/B testing
Machine Learning
👉🏼 Supervised vs. unsupervised learning
👉🏼 Key algorithms: Linear & Logistic Regression, Decision Trees, Random Forest, KNN, SVM
👉🏼 Model evaluation metrics: accuracy, precision, recall, F1 score, ROC-AUC
👉🏼 Cross-validation & hyperparameter tuning
Deep Learning
👉🏼 Neural Networks & their architecture
👉🏼 Working with Keras & TensorFlow/PyTorch
👉🏼 CNNs for image data and RNNs for sequence data
Data Cleaning & Feature Engineering
👉🏼 Handling missing data, outliers, and data scaling
👉🏼 Feature selection techniques (e.g., correlation, mutual information)
NLP (Natural Language Processing)
👉🏼 Tokenization, stemming, lemmatization
👉🏼 Bag-of-Words, TF-IDF
👉🏼 Sentiment analysis & topic modeling
Cloud and Big Data
👉🏼 Understanding cloud services (AWS, GCP, Azure) for data storage & computing
👉🏼 Working with distributed data using Spark
👉🏼 SQL for querying large datasets
Don't get overwhelmed by the breadth of topics. Start small: master one concept, then move to the next. 🚀
You've got this! 💪🏼
1. Data Validation Queries
✅ Count Rows:
SELECT COUNT(*) FROM source_table;
SELECT COUNT(*) FROM target_table;
🔹 Verify that the number of rows in the source and target tables match
✅ Check Duplicates:
SELECT column_name, COUNT(*)
FROM table_name
GROUP BY column_name
HAVING COUNT(*) > 1;
🔹 Identify duplicate records in a table
✅ Compare Data Between Source and Target:
SELECT *
FROM source_table
MINUS -- Oracle; use EXCEPT in PostgreSQL/SQL Server
SELECT *
FROM target_table;
🔹 Check for discrepancies between the source and target tables
✅ Validate Data Transformation:
SELECT column_name,
CASE
WHEN condition THEN 'Valid'
ELSE 'Invalid'
END AS validation_status
FROM table_name;
2. Performance and Integrity Checks
✅ Check for Null Values:
SELECT *
FROM table_name
WHERE column_name IS NULL;
✅ Data Type Validation:
SELECT column_name,
CASE
WHEN column_name LIKE '%[^0-9]%' THEN 'Invalid' -- [^...] in LIKE is SQL Server only
ELSE 'Valid'
END AS data_status
FROM table_name;
✅ Primary Key Uniqueness:
SELECT primary_key_column, COUNT(*)
FROM table_name
GROUP BY primary_key_column
HAVING COUNT(*) > 1;
3. Data Aggregation and Summarization
✅ Aggregate Functions:
SELECT SUM(column_name), AVG(column_name), MAX(column_name), MIN(column_name)
FROM table_name;
✅ Group Data:
SELECT column_name, COUNT(*)
FROM table_name
GROUP BY column_name;
4. Data Sampling and Extracting Subsets
✅ Retrieve Top Records:
SELECT *
FROM table_name
WHERE ROWNUM <= 10; -- Oracle
SELECT *
FROM table_name
LIMIT 10; -- MySQL, PostgreSQL
✅ Fetch Specific Columns:
SELECT column1, column2
FROM table_name;
5. Joins for Data Comparison
✅ Inner Join:
SELECT a.column1, b.column2
FROM table_a a
INNER JOIN table_b b
ON a.common_column = b.common_column;
✅ Left Join:
SELECT a.column1, b.column2
FROM table_a a
LEFT JOIN table_b b
ON a.common_column = b.common_column;
✅ Full Outer Join:
SELECT a.column1, b.column2
FROM table_a a
FULL OUTER JOIN table_b b
ON a.common_column = b.common_column;
6. Data Cleaning
✅ Remove Duplicates:
DELETE FROM table_name
WHERE rowid NOT IN (
SELECT MIN(rowid)
FROM table_name
GROUP BY column_name); -- Oracle (rowid)
✅ Update Data:
UPDATE table_name
SET column_name = new_value
WHERE condition;
7. ETL-Specific Validations
✅ ETL Data Loading Validation:
SELECT COUNT(*)
FROM target_table
WHERE TRUNC(load_date) = TRUNC(SYSDATE); -- Oracle: rows loaded today (SYSDATE alone includes the time part)
✅ Compare Aggregates Between Source and Target:
SELECT SUM(amount) AS source_sum
FROM source_table;
SELECT SUM(amount) AS target_sum
FROM target_table;
✅ Check for Missing Records:
SELECT source_key
FROM source_table
WHERE source_key NOT IN (
SELECT target_key
FROM target_table); -- caution: NOT IN returns no rows if target_key contains NULLs
8. Metadata Queries
✅ View Table Structure:
DESCRIBE table_name; -- Oracle
SHOW COLUMNS FROM table_name; -- MySQL
✅ View Indexes:
SELECT *
FROM user_indexes
WHERE table_name = 'TABLE_NAME'; -- Oracle
✅ View Constraints:
SELECT constraint_name, constraint_type
FROM user_constraints
WHERE table_name = 'TABLE_NAME'; -- Oracle