Data science interview questions :
Position : Data Scientist
There were 3 rounds of interview followed by 1 HR discussion.
Coding related questions :
1. You are given 2 lists
l1 = [1,2,2,3,4,5,6,6,7]
l2= [1,2,4,5,5,6,6,7]
Return all the elements from list l2 that are also present in l1, respecting their frequency of occurrence in l1.
For example: the elements in l2 that occurred in l1 are: 1, 2 (only once), 4, 5, 6, 7
Expected Output : [1,2,4,5,6,6,7]
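One possible Python solution (a sketch using collections.Counter; other approaches would also be accepted):
from collections import Counter

def common_by_frequency(l1, l2):
    # Multiset intersection: keep each element of l2 only as many times
    # as it also appears in l1 (the minimum of the two counts).
    common = Counter(l2) & Counter(l1)
    return sorted(common.elements())

l1 = [1, 2, 2, 3, 4, 5, 6, 6, 7]
l2 = [1, 2, 4, 5, 5, 6, 6, 7]
print(common_by_frequency(l1, l2))  # [1, 2, 4, 5, 6, 6, 7]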
2.
text = ' I am a data scientist working in Paypal'
Return the longest word along with its letter count.
Expected Output : ('scientist', 9)
(In case of ties, sort the words in alphabetical order and return the first one.)
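A possible Python sketch for this one (sorting by length descending and then alphabetically handles the tie-break rule):
def longest_word(text):
    words = text.split()
    # Longest first; alphabetical order breaks ties.
    best = sorted(words, key=lambda w: (-len(w), w))[0]
    return best, len(best)

text = ' I am a data scientist working in Paypal'
print(longest_word(text))  # ('scientist', 9)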
3. You are given 2 tables in SQL. Table 1 contains one column with only the single value '3' repeated 5 times; table 2 contains one column with 2 values: '2' repeated 3 times and '3' repeated 4 times. Tell me the number of records returned for an inner, left, right, outer, and cross join.
4. You are given a transaction table with 4 columns: txn_id (primary key), cust_id (foreign key), txn_date (datetime), and txn_amt; one cust_id can have multiple txn_ids. Your job is to find all cust_ids with at least 2 transactions made within 10 seconds of each other. (This might help to identify fraudulent patterns.)
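This is usually solved in SQL with a self-join or a LAG window function; an equivalent pandas sketch is shown below (column names follow the question, the toy data is made up):
import pandas as pd

def rapid_txn_customers(txns):
    # Sort each customer's transactions by time, then look at the gap
    # between consecutive transactions of the same customer.
    txns = txns.sort_values(['cust_id', 'txn_date'])
    gap = txns.groupby('cust_id')['txn_date'].diff().dt.total_seconds()
    return sorted(txns.loc[gap <= 10, 'cust_id'].unique().tolist())

txns = pd.DataFrame({
    'txn_id': [1, 2, 3, 4],
    'cust_id': [101, 101, 102, 102],
    'txn_date': pd.to_datetime(['2024-01-01 10:00:00', '2024-01-01 10:00:07',
                                '2024-01-01 11:00:00', '2024-01-01 11:05:00']),
    'txn_amt': [50, 60, 20, 30],
})
print(rapid_txn_customers(txns))  # [101]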
Case study questions :
1. Tell me the business model and revenue sources of PayPal.
Tell me the general consequences of changing the pricing of any product.
2. You are a data scientist who studies the impact of pricing changes. Suppose the business team comes to you and asks what will happen if they increase the merchant fee per txn by 5%.
What will be your recommendation & strategy?
What factors will you think of, and what will the proposed ML solution look like?
3. You see that some merchants are heavily misusing the refund facility (merchants get a refund for incorrect or disputed txns): they are claiming reimbursements by making fake txns.
List possible scenarios and ways to identify such merchants.
4. How will you decide the pricing of a premium product like the iPhone in India vs., say, South Africa? What factors will you consider?
Statistics Questions:
1. What is multicollinearity?
2. What are Type I and Type II errors? (Explain from a pricing-experimentation point of view.)
3. What is the Weibull distribution?
4. What is the CLT? What is the difference between the t-distribution and the normal distribution?
5. What is Wald's test?
6. What is the Ljung-Box test? Explain the null hypothesis of the ADF test in time series.
7. What is causality?
ML Questions:
1. What is logistic regression? What is deviance?
2. What is the difference between R-squared & adjusted R-squared?
3. How does Random Forest work?
4. What is the difference between bagging and boosting?
In terms of the bias-variance tradeoff, what do bagging and boosting attempt to solve?
I struggled with Data Science interviews until...
I followed this roadmap:
Python
→ Master the basics: syntax, loops, functions, and data structures (lists, dictionaries, sets, tuples)
→ Learn Pandas & NumPy for data manipulation
→ Matplotlib & Seaborn for data visualization
Statistics & Probability
→ Descriptive statistics: mean, median, mode, standard deviation
→ Probability theory: distributions, Bayes' theorem, conditional probability
→ Hypothesis testing & A/B testing
Machine Learning
→ Supervised vs. unsupervised learning
→ Key algorithms: Linear & Logistic Regression, Decision Trees, Random Forest, KNN, SVM
→ Model evaluation metrics: accuracy, precision, recall, F1 score, ROC-AUC
→ Cross-validation & hyperparameter tuning
Deep Learning
→ Neural Networks & their architecture
→ Working with Keras & TensorFlow/PyTorch
→ CNNs for image data and RNNs for sequence data
Data Cleaning & Feature Engineering
→ Handling missing data, outliers, and data scaling
→ Feature selection techniques (e.g., correlation, mutual information)
NLP (Natural Language Processing)
→ Tokenization, stemming, lemmatization
→ Bag-of-Words, TF-IDF
→ Sentiment analysis & topic modeling
Cloud and Big Data
→ Understanding cloud services (AWS, GCP, Azure) for data storage & computing
→ Working with distributed data using Spark
→ SQL for querying large datasets
Don't get overwhelmed by the breadth of topics. Start small: master one concept, then move to the next.
You've got this! 💪
✅ Count Rows:
SELECT COUNT(*) FROM source_table;
SELECT COUNT(*) FROM target_table;
🔹 Verify if the number of rows in the source and target tables match
✅ Check Duplicates:
SELECT column_name, COUNT(*)
FROM table_name
GROUP BY column_name
HAVING COUNT(*) > 1;
🔹 Identify duplicate records in a table
✅ Compare Data Between Source and Target:
SELECT *
FROM source_table
MINUS
SELECT *
FROM target_table;
🔹 Check for discrepancies between the source and target tables
✅ Validate Data Transformation:
SELECT column_name,
CASE
WHEN condition THEN 'Valid'
ELSE 'Invalid'
END AS validation_status
FROM table_name;
2. Performance and Integrity Checks
✅ Check for Null Values:
SELECT *
FROM table_name
WHERE column_name IS NULL;
✅ Data Type Validation:
SELECT column_name,
CASE
WHEN column_name LIKE '%[^0-9]%' THEN 'Invalid'
ELSE 'Valid'
END AS data_status
FROM table_name;
✅ Primary Key Uniqueness:
SELECT primary_key_column, COUNT(*)
FROM table_name
GROUP BY primary_key_column
HAVING COUNT(*) > 1;
3. Data Aggregation and Summarization:
✅ Aggregate Functions:
SELECT SUM(column_name), AVG(column_name), MAX(column_name), MIN(column_name)
FROM table_name;
✅ Group Data:
SELECT column_name, COUNT(*)
FROM table_name
GROUP BY column_name;
4. Data Sampling and Extracting Subsets
✅ Retrieve Top Records:
SELECT *
FROM table_name
WHERE ROWNUM <= 10; -- Oracle
SELECT *
FROM table_name
LIMIT 10; -- MySQL, PostgreSQL
✅ Fetch Specific Columns:
SELECT column1, column2
FROM table_name;
5. Joins for Data Comparison
✅ Inner Join:
SELECT a.column1, b.column2
FROM table_a a
INNER JOIN table_b b
ON a.common_column = b.common_column;
✅ Left Join:
SELECT a.column1, b.column2
FROM table_a a
LEFT JOIN table_b b
ON a.common_column = b.common_column;
✅ Full Outer Join:
SELECT a.column1, b.column2
FROM table_a a
FULL OUTER JOIN table_b b
ON a.common_column = b.common_column;
6. Data Cleaning
✅ Remove Duplicates:
DELETE FROM table_name
WHERE rowid NOT IN (
SELECT MIN(rowid)
FROM table_name
GROUP BY column_name); -- Oracle (rowid is Oracle-specific)
✅ Update Data:
UPDATE table_name
SET column_name = new_value
WHERE condition;
7. ETL-Specific Validations
✅ ETL Data Loading Validation:
SELECT COUNT(*)
FROM target_table
WHERE TRUNC(load_date) = TRUNC(SYSDATE); -- For today's loaded data (Oracle)
✅ Compare Aggregates Between Source and Target:
SELECT SUM(amount) AS source_sum
FROM source_table;
SELECT SUM(amount) AS target_sum
FROM target_table;
✅ Check for Missing Records:
SELECT source_key
FROM source_table
WHERE source_key NOT IN (
SELECT target_key
FROM target_table);
8. Metadata Queries
✅ View Table Structure:
DESCRIBE table_name; -- Oracle
SHOW COLUMNS FROM table_name; -- MySQL
✅ View Indexes:
SELECT *
FROM user_indexes
WHERE table_name = 'TABLE_NAME'; -- Oracle
✅ View Constraints:
SELECT constraint_name, constraint_type
FROM user_constraints
WHERE table_name = 'TABLE_NAME'; -- Oracle
💡 How does an LLM (large language model) actually learn?
It's a journey through 3 key phases:
1. Self-Supervised Learning (Understanding Language)
The model is trained on massive text datasets (Wikipedia, blogs, websites). This is where the transformer architecture comes into the picture; you can think of it simply as a neural network that sees words and predicts what comes next.
For example:
"A flash flood watch will be in effect all _____."
The model ranks possible answers like "night," "day," or even "giraffe." Over time, it gets really good at picking the right one.
2. Supervised Learning (Understanding Instructions)
Next, we teach it how humans like their answers. Thousands of examples of questions and well-crafted responses are fed to the model. This step is smaller but crucial: it's where the model learns to align with human intent.
3. Reinforcement Learning (Improving Behavior)
Finally, the model learns to improve its behavior based on feedback. Humans rate its answers (thumbs up or thumbs down), and the model adjusts.
This helps it avoid harmful or wrong answers and focus on being helpful, honest, and safe.
Through this process, the model learns patterns and relationships in language, which are stored as numerical weights. These weights are then compressed into the parameter file, the core of what makes the model function.
So what happens when you ask a question?
The model breaks your question into tokens (small pieces of text, turned into numbers). It processes these numbers through its neural networks and predicts the most likely response.
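A tiny illustration of this step, assuming the tiktoken library (the tokenizer used by several OpenAI models); other models use different tokenizers, so the exact IDs will differ:
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("What should I eat today?")
print(tokens)              # a short list of integer token IDs
print(enc.decode(tokens))  # "What should I eat today?"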
For example:
"What should I eat today?" might turn into numbers like [123, 11, 45, 78], which the model uses to calculate the next best words to give you the answer.
But here's something important: every model has a token limit, a maximum number of tokens it can handle at once. This can vary between small and larger models. Once it reaches that limit, it forgets the earlier context and focuses only on the most recent tokens.
Finally, you can imagine an LLM as just two files:
→ Parameter file – This is the big file, where all the knowledge lives. Think of it like a giant zip file containing everything the model has learned about language.
→ Run file – This is the set of instructions needed to use the parameter file. It defines the model's architecture, handles text tokenization, and manages how the model generates outputs.
That's a very simple way to break down how LLMs work!
These models are the backbone of AI agents, so let's not forget about them.
Q: How would you scale a dense retrieval system for billions of documents while ensuring query efficiency?
Scaling a dense retrieval system to handle billions of documents can be tricky, but here are some strategies to ensure both efficiency and accuracy:
1. Leverage ANN (Approximate Nearest Neighbor) Search:
Use algorithms like HNSW and IVF-PQ. These algorithms provide fast, scalable searches by narrowing down the search space, allowing for quicker retrieval without compromising on quality (see the sketch after this list).
2. Compress Vectors for Efficiency:
Reduce the size of vectors using techniques like PCA or Product Quantization. This helps save memory and speeds up similarity searches, making it easier to scale.
3. Multi-Stage Retrieval for Better Accuracy:
First, apply ANN to retrieve a large candidate pool. Then, refine the results with a more precise model (like a cross-encoder) to improve the relevance of the results.
4. Implement Caching to Speed Up Responses:
Cache frequently queried results and precomputed embeddings. This reduces redundant processing and ensures faster query response times.
5. Distribute the Load:
Use sharding to divide the document index across multiple nodes, enabling parallel processing. Also, replicate the index to handle heavy traffic and ensure reliability.
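A minimal ANN sketch using FAISS's HNSW index, as mentioned in point 1 (FAISS is an assumption here; ScaNN, Annoy, or hnswlib would work similarly). The embeddings are random stand-ins for real document and query vectors:
import numpy as np
import faiss

dim = 128
docs = np.random.rand(100_000, dim).astype('float32')   # document embeddings
queries = np.random.rand(5, dim).astype('float32')      # query embeddings

index = faiss.IndexHNSWFlat(dim, 32)   # 32 = number of graph neighbours (M)
index.add(docs)

distances, ids = index.search(queries, 10)  # top-10 candidates per query
print(ids.shape)  # (5, 10)
In a real multi-stage pipeline, these candidates would then be re-scored with a cross-encoder as described in point 3.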
Missing data occurs when no value is stored for a variable in an observation, which is common in real-world datasets. Incomplete data can skew analyses, reduce the validity of conclusions, and hinder machine learning model performance. The goal of this blog is to cover how to identify, understand, and handle missing values effectively to maintain data integrity.
Impact: Oftentimes, missing data can lead to:
Biased results, especially if missingness is related to the data itself.
Reduced sample size, leading to less robust analysis.
Poor model generalization, if handled improperly.
This article is divided into the following 4 sections:
Identifying missing data
Types of missing data
Methods to handle missing data
Best Practices and Considerations
1. Identifying Missing Data:
Data Profiling:
We can use profiling tools like pandas-profiling or Sweetviz that generate automated reports with insights on missing values.
Example: Use pandas-profiling to generate a profile report in Python.
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_csv("data.csv")
profile = ProfileReport(df, minimal=True)
profile.to_file("data_report.html")
Visualization Techniques:
Use a library like missingno to create heatmaps and bar plots showing missing-data patterns.
import missingno as msno
msno.matrix(df) # Matrix plot to view missing values pattern
msno.bar(df) # Bar plot of total missing values per column
Custom Exploratory Functions:
This is my favorite: write a custom function to display missing-data counts and percentages.
def missing_data_summary(df):
    missing_data = df.isnull().sum()
    missing_percentage = (missing_data / len(df)) * 100
    return pd.DataFrame({'Missing Count': missing_data, 'Missing Percentage': missing_percentage})
print(missing_data_summary(df))
2. Types of Missing Data:
Missing Completely at Random (MCAR):
Definition: The missing values are independent of both observed and unobserved data.
Example: Survey data where respondents skipped random questions unrelated to their characteristics.
Can be detected with Statistical tests (e.g., Little's MCAR test)
Missing at Random (MAR):
Definition: Missing values depend on observed data but are not related to the missing values themselves.
Example: Income data may be missing but is related to age or education level.
It can be addressed effectively with conditional imputation, as we can predict missingness based on related variables.
Missing Not at Random (MNAR):
Definition: Missingness depends on unobserved data or the missing values themselves.
Example: People may be unwilling to disclose income if it is very high or low.
Addressing MNAR is challenging, and solutions may require domain knowledge, assumptions, or advanced modeling techniques.
3. Methods To Handle Missing Data:
Listwise Deletion:
Removing rows with missing data may be appropriate in some cases (e.g., when the proportion of missing values is very low or the data is MCAR).
Example: If a dataset has <5% missing values, deletion may suffice, though it's risky for larger proportions.
Mean/Median/Mode Imputation:
For numerical data, impute using mean or median values; for categorical data, use mode.
Pros & Cons: Simple but can introduce bias and reduce variance in the data. If not handled properly, the model will be impacted by biased data.
df['column'] = df['column'].fillna(df['column'].mean())
Advanced Imputation Techniques:
K-Nearest Neighbors (KNN):
Use neighbors to fill missing values based on similarity.
Pros & Cons: Maintains relationships between variables, but can be computationally expensive.
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df_imputed = imputer.fit_transform(df)
Multivariate Imputation by Chained Equations (MICE):
Iteratively imputes missing values by treating each column as a function of others.
Best suited for MAR, though it can also be computationally expensive.
Packages like fancyimpute or scikit-learn's IterativeImputer can be used to apply MICE.
from sklearn.experimental import enable_iterative_imputer  # required to enable the experimental IterativeImputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer()
df_imputed = imputer.fit_transform(df)
Use Machine Learning Models:
Impute missing values by training a model on non-missing values.
Example: Predict missing income based on age, education, and occupation.
Alternatively, build groups of similar data points: find a group of similar people and use their non-missing salaries to impute the median/mean for the missing ones.
Pros: Potentially accurate; Cons: Requires careful validation to prevent overfitting.
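A hedged sketch of this idea with scikit-learn (the column names 'income', 'age', etc. are hypothetical; df follows the earlier examples):
from sklearn.ensemble import RandomForestRegressor

features = ['age', 'education_years', 'occupation_code']  # assumed numeric predictor columns
known = df[df['income'].notnull()]
missing = df[df['income'].isnull()]

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(known[features], known['income'])
df.loc[df['income'].isnull(), 'income'] = model.predict(missing[features])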
Data Augmentation:
For small sample sizes, consider generating synthetic data points.
Use models like GANs for more advanced augmentation.
4. Best Practices and Considerations:
Assess Method Effectiveness: Compare imputed values with actual values when possible or test multiple imputation methods to evaluate which yields better model performance.
Use Domain Knowledge: Understanding the domain can be crucial for addressing missing values, for example when dealing with MNAR data, identifying appropriate imputation techniques, or deciding to drop records.
Monitor Impact: It is important to track model accuracy before and after handling missing data, and to weigh the gain against the implementation cost and challenges.
Data Imbalance: We have to be very careful when using simple methods (e.g., upsampling) on imbalanced datasets, as they may not accurately reflect minority-class values.
To summarize, missing values are an inevitable challenge, but addressing them effectively is key to successful data science projects. Understanding the type of missing data and choosing the right handling method, whether simple imputation or advanced techniques like MICE or KNN, is crucial. There's no one-size-fits-all solution, so leveraging domain knowledge and validating your approach can ensure data integrity and reliable outcomes.
You will be 10x better at Machine Learning
If you cover these 5 key topics:
1. Fundamentals of Machine Learning
↳ Supervised vs. Unsupervised Learning
↳ Regression, Classification, Clustering
↳ Overfitting, Underfitting, and Regularization
↳ Bias-Variance Tradeoff
2. Data Preparation and Feature Engineering
↳ Data Cleaning (Handling Missing Values and Outliers)
↳ Feature Scaling (Normalization, Standardization)
↳ Feature Selection and Dimensionality Reduction
↳ Encoding Categorical Variables
3. Algorithms and Model Training
↳ Linear Models (Linear Regression, Logistic Regression)
↳ Tree-Based Models (Random Forest, XGBoost)
↳ Neural Networks (ANN, CNN, RNN)
↳ Optimization Techniques
4. Model Evaluation and Tuning
↳ Cross-Validation (K-Fold, Leave-One-Out)
↳ Metrics (Accuracy, Precision, Recall, F1-Score, AUC-ROC)
↳ Hyperparameter Tuning (GridSearch, RandomSearch, etc.)
↳ Handling Imbalanced Datasets
5. Machine Learning in Production - The most important one!
↳ Model Deployment (Streamlit, Azure ML, AWS SageMaker)
↳ Monitoring and Maintaining Models
↳ Pipelines for MLOps
↳ Interpretability and Explainability (SHAP, LIME, Feature Importance)
Why is this important?
Models in notebooks are cool... until they're not!
Deploying and scaling them into real, production-ready applications is where most ML practitioners face their biggest roadblocks. If your models are still sitting in Jupyter notebooks, you're missing out on a HUGE opportunity!
As machine learning continues to transform industries and drive innovation, mastering both fundamental concepts and advanced techniques has become essential for aspiring data scientists and machine learning engineers. Whether you're preparing for a job interview or simply looking to deepen your understanding, it's important to be ready for a range of questions that test your knowledge across various levels of complexity.
In this article, we'll explore ten common interview questions that cover the breadth and depth of machine learning, providing not only answers but also real-world examples to help contextualize these concepts. This guide aims to equip you with the insights needed to confidently tackle your next machine learning interview.
Disclaimer: This is not a comprehensive guide, but it can serve as a quick 15-minute refresher that gives you wider coverage of machine learning.
General Breadth Questions
Explain the curse of dimensionality. How does it affect machine learning models?
Answer: The curse of dimensionality refers to the phenomenon where the feature space becomes increasingly sparse as the number of dimensions (features) increases. This sparsity makes it difficult for models to generalize well because the volume of the space grows exponentially, requiring more data to maintain the same level of accuracy. It can lead to overfitting and increased computational complexity.
Example: In a k-Nearest Neighbors (k-NN) classifier, as the number of features increases, the distance between points becomes less meaningful, leading to poor classification performance. Techniques like dimensionality reduction (PCA, t-SNE) are often used to mitigate this issue.
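A quick NumPy illustration of one facet of this (distance concentration): as the number of dimensions grows, the gap between the nearest and farthest point shrinks relative to the average distance, which is exactly what hurts k-NN:
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((500, d))
    dists = np.linalg.norm(points - points[0], axis=1)[1:]  # distances to one reference point
    print(d, round((dists.max() - dists.min()) / dists.mean(), 3))  # relative contrast shrinks as d grows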
What is the difference between a generative and a discriminative model?
Answer: A generative model learns the joint probability distribution and can be used to generate new instances of data, while a discriminative model learns the conditional probability and focuses on predicting the output labels given the inputs. Generative models are generally used for unsupervised tasks like data generation, while discriminative models are commonly used for supervised learning tasks like classification.
Example: Naive Bayes is a generative model that can generate new data points based on the learned distribution, while Logistic Regression is a discriminative model used to predict binary outcomes.
Large Language Models (LLMs) are also examples of generative models, as they generate tokens and predict the next word in a sentence. While generative models are not the best fit for classification, they can be used for it.
One of the best papers to learn about generative vs. discriminative models covers this comparison in depth.
Describe how you would handle an imbalanced dataset.
Answer: Handling an imbalanced dataset can be done through several methods:
Resampling techniques: Oversampling the minority class or under-sampling the majority class.
Synthetic data generation: Techniques like SMOTE (Synthetic Minority Over-sampling Technique) create synthetic examples of the minority class.
Algorithmic adjustments: Using algorithms that account for class imbalance, such as adjusting class weights in models like SVM or Random Forest.
Evaluation metrics: Using metrics like Precision-Recall AUC instead of accuracy to better reflect model performance on imbalanced data.
Example: In fraud detection, where fraudulent transactions are rare, oversampling or SMOTE can be used to create a balanced dataset, and the model's performance is evaluated using the Precision-Recall curve rather than accuracy.
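A short sketch of the SMOTE idea using the imbalanced-learn library on a synthetic imbalanced dataset (the 95/5 class split is illustrative):
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after: ", Counter(y_res))  # classes are now balanced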
What is the role of a loss function in machine learning, and can you provide an example of how different loss functions are used?
Answer: A loss function measures how well the model's predictions match the true labels. It guides the optimization process during training by quantifying the error, which the algorithm then minimizes. Different tasks require different loss functions:
Mean Squared Error (MSE): Used for regression tasks, penalizing larger errors more heavily.
Cross-Entropy Loss: Used for classification tasks, particularly in deep learning, measuring the difference between predicted probabilities and actual labels.
Hinge Loss: Used in Support Vector Machines, encouraging a margin between classes.
Example: In a binary classification problem using logistic regression, cross-entropy loss helps adjust the model weights to minimize the difference between predicted probabilities and the actual binary labels.
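A quick numeric illustration of two of these losses using scikit-learn's metric functions (toy values only):
from sklearn.metrics import mean_squared_error, log_loss

# Regression: MSE penalizes the larger error (3.0 vs 0.5) much more heavily
print(mean_squared_error([2.0, 7.0], [2.5, 4.0]))   # 4.625

# Binary classification: cross-entropy (log loss) on predicted probabilities
print(log_loss([1, 0, 1], [0.9, 0.2, 0.6]))         # roughly 0.28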
Explain what a convolutional neural network (CNN) is and how it differs from a traditional fully connected network.
Answer: A Convolutional Neural Network (CNN) is a type of deep learning model specifically designed to process structured grid data, like images. Unlike fully connected networks where each neuron is connected to every neuron in the previous layer, CNNs use convolutional layers that apply filters to capture spatial hierarchies in data. This makes them efficient for tasks like image recognition, as they reduce the number of parameters and take advantage of the local structure of the data.
Example: In image classification, a CNN can automatically learn to detect edges, textures, and shapes through its convolutional layers, leading to highly accurate models for tasks like object detection.
Some Depth Questions
Describe how you would implement and evaluate a machine learning model for time series forecasting.
Answer: Implementing a time series forecasting model involves several steps:
Data preparation: Handle missing data, decompose the series into trend, seasonality, and residual components, and create lag features.
Model selection: Choose models that are well-suited for temporal data, such as ARIMA, SARIMA, or LSTM (Long Short-Term Memory) networks.
Evaluation: Use metrics like Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE) and evaluate the model using cross-validation methods like Time Series Cross-Validation to ensure that the temporal order is preserved.
Example: In predicting electricity demand, using LSTM can capture long-term dependencies in the data, and the model can be evaluated on unseen data to ensure that it generalizes well over time.
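A small sketch of the evaluation point with scikit-learn's TimeSeriesSplit, which preserves temporal order (each training fold always precedes its test fold):
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

series = np.arange(24)  # e.g. 24 months of a demand series
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(series):
    print("train ends at", train_idx[-1], "-> test:", test_idx)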
What is the significance of eigenvectors and eigenvalues in machine learning? How are they used in Principal Component Analysis (PCA)?
Answer: Eigenvectors and eigenvalues are fundamental in linear algebra and are used in PCA to reduce the dimensionality of data. Eigenvectors represent directions in the data space, and eigenvalues indicate the magnitude of variance in those directions. In PCA, the eigenvectors corresponding to the largest eigenvalues are selected as principal components, which capture the most significant variance in the data, reducing dimensionality while preserving as much information as possible.
Example: In facial recognition, PCA can be used to reduce the dimensionality of the image data by projecting the faces onto the principal components, making the model more efficient without losing essential features.
For simplicity of understanding, imagine a 3-D to 2-D conversion. In reality, the input features are high-dimensional (often in the hundreds). When we do clustering and want to visualize the data, we can apply PCA to bring the features down to 2-D for visualization.
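A minimal scikit-learn sketch of exactly that use case, projecting 64-dimensional data down to 2-D for visualization (the digits dataset is just an example):
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)          # shape (1797, 64)
X_2d = PCA(n_components=2).fit_transform(X)  # shape (1797, 2)
print(X_2d.shape)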
Explain the concept of reinforcement learning and the difference between Q-learning and deep Q-learning.
Answer: Reinforcement learning is a type of machine learning where an agent learns to make decisions by interacting with an environment to maximize cumulative rewards. Q-learning is a value-based method where the agent learns a Q-function, which estimates the value of taking a particular action in a given state. Deep Q-learning extends this by using a deep neural network to approximate the Q-function, allowing it to handle more complex and high-dimensional state spaces.
Example: In a self-driving car simulation, reinforcement learning can be used to train the car to navigate roads by maximizing the reward (e.g., staying on the road, avoiding collisions), with deep Q-learning enabling it to learn from high-dimensional inputs like images from cameras.
How would you approach a situation where your model performs well on the training data but poorly on the test data?
Answer: This situation indicates overfitting. The approach to address it includes:
Cross-validation: To ensure that the model generalizes well to unseen data.
Regularization: Techniques like L1 or L2 regularization to penalize complex models and reduce overfitting.
Simplifying the model: Reducing the complexity by decreasing the number of features or layers in the model.
Gathering more data: To help the model learn a broader range of patterns.
Ensembling: Using methods like bagging or boosting to improve model robustness.
Example: If a neural network overfits on a small dataset, adding dropout layers during training or using a simpler model might help it generalize better to the test data.
What is transfer learning, and when would you use it?
Answer: Transfer learning is a technique where a model developed for a particular task is reused as the starting point for a model on a second task. It is particularly useful when the second task has limited data. Instead of training a model from scratch, the pre-trained model's knowledge is transferred, and only the final layers are fine-tuned.
Example: In image classification, using a pre-trained model like VGG16 on ImageNet and fine-tuning it for a specific task like identifying specific animals can lead to good performance even with a small dataset.
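A hedged Keras sketch of that VGG16 example: reuse the ImageNet weights, freeze them, and train only a small new head (the 5-class head and the input size are illustrative assumptions):
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained convolutional layers

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(5, activation="softmax"),  # new head for the target task
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, ...)  # fine-tune on the small target dataset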
To Summarize
Preparing for a machine learning interview requires a balance of theoretical knowledge and practical understanding. The questions covered in this article highlight key areas of machine learning, from foundational principles like the curse of dimensionality and loss functions to more complex topics like reinforcement learning and transfer learning. By familiarizing yourself with these questions and their corresponding answers, you'll be better positioned to demonstrate your expertise and critical thinking skills during an interview.
DoorDash now uses RAG to solve complex issues for their Dashers accurately.
Here's how --
The first step is to identify the core issue the Dasher is facing based on their messages, so that the system can retrieve articles on previous similar cases.
They break down complex tasks in the Dasher's request, and use chain-of-thought prompting to generate an answer for it.
The most interesting part is their LLM Guardrail system:
1. First it applies a semantic similarity check between responses and knowledge base articles.
2. If the response fails, an LLM-powered evaluation checks for grounding, coherence, and compliance, so that it can prevent hallucinations and escalate problematic cases.
The knowledge base is continuously updated for completeness and accuracy, guided by the LLM quality assessments. For regression prevention they benchmark prompt changes before deployment as well.
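A toy sketch of the semantic-similarity check in that guardrail, assuming the sentence-transformers library; the model name, example strings, and 0.7 threshold are all illustrative, not DoorDash's actual implementation:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

response = "You can reschedule the delivery from the app settings."
kb_article = "Dashers can reschedule deliveries in the app under settings."

score = util.cos_sim(model.encode(response), model.encode(kb_article)).item()
if score < 0.7:  # threshold would be tuned on labeled examples
    print("Escalate: response not grounded in the knowledge base", score)
else:
    print("Pass", score)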
Suppose you're tasked with building a recommendation system for Instagram's Stories. How would you design the architecture?
This is a system-design question you get for Applied Data Science or Machine Learning Engineer rounds.
These rounds are typically interactive, where you first want to understand what specific problem the interviewer wants you to solve.
A system like Stories needs to be highly scalable and handle real-time updates.
Keeping this in mind, a few architecture aspects you could dive into are:
Data Ingestion Layer
Needs to handle streaming data from multiple sources:
Event Streams: User interactions (views, likes, skips, etc.) and creator activity.
Content Metadata: Story metadata like geotags, captions, etc.
Real-Time Features: Temporal features like time since Story creation.
User Features: Long-term (preferences, demographics) and short-term (session-based) behavioral data.
Kafka for streaming, Spark Streaming for processing, and a NoSQL database like DynamoDB for fast feature lookups could be a good tech stack.
Candidate Generation
Narrow down the pool of millions of Stories to hundreds of candidates.
→ Compute embeddings for users and Stories using models like two-tower architectures - Deep Average Network (DAN) for user embeddings based on interaction history, Story embeddings derived from metadata and visual features (using models like CLIP or Vision Transformers).
→ Use ANN (Approximate Nearest Neighbors) methods (e.g., ScaNN) to retrieve top-N candidates efficiently.
Ranking Layer
Score and rank the candidates for personalization.
→ Gradient Boosted Decision Trees (e.g., LightGBM) or BERT4Rec to capture sequence-level dependencies in user interactions.
→ Use bandit algorithms (e.g., Thompson Sampling or CMABs) to explore less popular Stories while exploiting known preferences.
Evaluation
→ Offline Metrics - NDCG (Normalized Discounted Cumulative Gain), and diversity scores.
→ Online Metrics - View-through rate, engagement time, and session duration.