Company Name : Swiggy
Role : Associate Software Engineer
Batch : 2024/2023/2022 passouts
Link : https://docs.google.com/forms/d/1E029cjZV8Em6zPC0YJYAMDDP_NjPtDkwufqHfvkVG2E/viewform?edit_requested=true&pli=1
Data science interview questions :
Position : Data Scientist
There were 3 interview rounds followed by 1 HR discussion.
Coding related questions :
1. You are given 2 lists:
l1 = [1,2,2,3,4,5,6,6,7]
l2 = [1,2,4,5,5,6,6,7]
Return all the elements of l2 that also appear in l1, keeping each value only as many times as it occurs in both lists (the minimum of its two frequencies).
For example: 1, 2, 4 and 7 each appear once, 5 appears only once (l1 has a single 5), and 6 appears twice.
Expected Output: [1,2,4,5,6,6,7]
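A possible approach (not the interviewer's reference solution) is a multiset intersection with collections.Counter:
from collections import Counter
l1 = [1,2,2,3,4,5,6,6,7]
l2 = [1,2,4,5,5,6,6,7]
common = Counter(l1) & Counter(l2)   # keeps each value min(count in l1, count in l2) times
print(sorted(common.elements()))     # [1, 2, 4, 5, 6, 6, 7]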
2. Given the string
text = ' I am a data scientist working in Paypal'
return the longest word along with its letter count.
Expected Output: ('scientist', 9)
(In case of a tie, sort the tied words alphabetically and return the first one.)
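One possible solution, using the stated tie-break rule:
text = ' I am a data scientist working in Paypal'
words = text.split()
longest = sorted(words, key=lambda w: (-len(w), w))[0]   # longest first, ties alphabetical
print((longest, len(longest)))                           # ('scientist', 9)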
3. You are given 2 tables in SQL. Table 1 has a single column containing only the value '3' repeated 5 times; table 2 has a single column containing two values, '2' repeated 3 times and '3' repeated 4 times. How many records result from an inner, left, right, full outer and cross join?
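(One way to reason about it, assuming an equi-join on those columns: inner join = 5 x 4 = 20 rows; left join with table 1 on the left = 20 rows, since every '3' finds a match; right join = 20 matched + 3 unmatched '2' rows = 23; full outer join = 23; cross join = 5 x 7 = 35 rows.)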
4. You are given a transaction table with 4 columns: txn_id (primary key), cust_id (foreign key), txn_date (datetime) and txn_amt; one cust_id can have multiple txn_ids. Your job is to find all cust_id values that have at least 2 txns made within 10 seconds of each other. (This might help identify fraudulent patterns.)
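A possible pandas sketch (assuming the table is loaded into a DataFrame named txns; because the closest pair of transactions per customer is always adjacent in time order, checking consecutive gaps is enough):
import pandas as pd
# txns: DataFrame with columns txn_id, cust_id, txn_date (datetime64), txn_amt
txns = txns.sort_values(['cust_id', 'txn_date'])
gap_seconds = txns.groupby('cust_id')['txn_date'].diff().dt.total_seconds()
suspicious_cust_ids = txns.loc[gap_seconds <= 10, 'cust_id'].unique()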
Case study questions :
1. Tell me the business model and revenue sources of PayPal.
Tell me the general consequences when you change the pricing of any product.
2. You are a data scientist who studies the impact of pricing changes. Suppose the business team comes to you and asks what will happen if they increase the merchant fee per txn by 5%.
What will be your recommendation and strategy?
What factors will you think of, and what would the proposed ML solution look like?
3. You see that some merchants are heavily misusing the refund facility (merchants get a refund for incorrect or disputed txns); they are claiming reimbursements by doing fake txns.
List possible scenarios and ways to identify such merchants.
4. How will you decide the pricing of a premium product like the iPhone in India vs., say, South Africa? What factors will you consider?
Statistics Questions:
1. What is multicollinearity?
2. What are Type 1 and Type 2 errors? (Explain from a pricing experimentation point of view.)
3. What is the Weibull distribution?
4. What is the CLT? What is the difference between the t-distribution and the normal distribution?
5. What is the Wald test?
6. What is the Ljung-Box test? Explain the null hypothesis of the ADF test in time series.
7. What is causality?
ML Questions:
1. What is logistic regression? What is deviance?
2. What is the difference between R-squared and adjusted R-squared?
3. How does Random Forest work?
4. What is the difference between bagging and boosting?
In terms of the bias-variance tradeoff, what do bagging and boosting attempt to solve?
I struggled with Data Science interviews until...
I followed this roadmap:
𝗣𝘆𝘁𝗵𝗼𝗻
👉🏼 Master the basics: syntax, loops, functions, and data structures (lists, dictionaries, sets, tuples)
👉🏼 Learn Pandas & NumPy for data manipulation
👉🏼 Matplotlib & Seaborn for data visualization
𝗦𝘁𝗮𝘁𝗶𝘀𝘁𝗶𝗰𝘀 & 𝗣𝗿𝗼𝗯𝗮𝗯𝗶𝗹𝗶𝘁𝘆
👉🏼 Descriptive statistics: mean, median, mode, standard deviation
👉🏼 Probability theory: distributions, Bayes' theorem, conditional probability
👉🏼 Hypothesis testing & A/B testing
𝗠𝗮𝗰𝗵𝗶𝗻𝗲 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴
👉🏼 Supervised vs. unsupervised learning
👉🏼 Key algorithms: Linear & Logistic Regression, Decision Trees, Random Forest, KNN, SVM
👉🏼 Model evaluation metrics: accuracy, precision, recall, F1 score, ROC-AUC
👉🏼 Cross-validation & hyperparameter tuning
𝗗𝗲𝗲𝗽 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴
👉🏼 Neural Networks & their architecture
👉🏼 Working with Keras & TensorFlow/PyTorch
👉🏼 CNNs for image data and RNNs for sequence data
𝗗𝗮𝘁𝗮 𝗖𝗹𝗲𝗮𝗻𝗶𝗻𝗴 & 𝗙𝗲𝗮𝘁𝘂𝗿𝗲 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴
👉🏼 Handling missing data, outliers, and data scaling
👉🏼 Feature selection techniques (e.g., correlation, mutual information)
𝗡𝗟𝗣 (𝗡𝗮𝘁𝘂𝗿𝗮𝗹 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗣𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴)
👉🏼 Tokenization, stemming, lemmatization
👉🏼 Bag-of-Words, TF-IDF
👉🏼 Sentiment analysis & topic modeling
𝗖𝗹𝗼𝘂𝗱 𝗮𝗻𝗱 𝗕𝗶𝗴 𝗗𝗮𝘁𝗮
👉🏼 Understanding cloud services (AWS, GCP, Azure) for data storage & computing
👉🏼 Working with distributed data using Spark
👉🏼 SQL for querying large datasets
Don’t get overwhelmed by the breadth of topics. Start small—master one concept, then move to the next. 📈
You’ve got this! 💪🏼
✅Count Rows:
SELECT COUNT(*) FROM source_table;
SELECT COUNT(*) FROM target_table;
🔹Verify if the number of rows in the source and target tables match
✅Check Duplicates:
SELECT column_name, COUNT(*)
FROM table_name
GROUP BY column_name
HAVING COUNT(*) > 1;
🔹Identify duplicate records in a table
✅Compare Data Between Source and Target:
SELECT *
FROM source_table
MINUS
SELECT *
FROM target_table;
🔹Check for discrepancies between the source and target tables
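🔹Note: MINUS is Oracle syntax; PostgreSQL and SQL Server use EXCEPT instead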
✅Validate Data Transformation:
SELECT column_name,
CASE
WHEN condition THEN 'Valid'
ELSE 'Invalid'
END AS validation_status
FROM table_name;
2. Performance and Integrity Checks
✅Check for Null Values:
SELECT *
FROM table_name
WHERE column_name IS NULL;
✅Data Type Validation:
SELECT column_name,
CASE
WHEN column_name LIKE '%[^0-9]%' THEN 'Invalid'
ELSE 'Valid'
END AS data_status
FROM table_name;
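🔹Note: the [^0-9] wildcard in LIKE is SQL Server-specific; MySQL uses REGEXP and Oracle uses REGEXP_LIKE for similar checks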
✅Primary Key Uniqueness:
SELECT primary_key_column, COUNT(*)
FROM table_name
GROUP BY primary_key_column
HAVING COUNT(*) > 1;
3. Data Aggregation and Summarization:
✅Aggregate Functions:
SELECT SUM(column_name), AVG(column_name), MAX(column_name), MIN(column_name)
FROM table_name;
✅Group Data:
SELECT column_name, COUNT(*)
FROM table_name
GROUP BY column_name;
4. Data Sampling and Extracting Subsets
✅Retrieve Top Records:
SELECT *
FROM table_name
WHERE ROWNUM <= 10; -- Oracle
SELECT *
FROM table_name
LIMIT 10; -- MySQL, PostgreSQL
✅Fetch Specific Columns:
SELECT column1, column2
FROM table_name;
5. Joins for Data Comparison
✅Inner Join:
SELECT a.column1, b.column2
FROM table_a a
INNER JOIN table_b b
ON a.common_column = b.common_column;
✅Left Join:
SELECT a.column1, b.column2
FROM table_a a
LEFT JOIN table_b b
ON a.common_column = b.common_column;
✅Full Outer Join:
SELECT a.column1, b.column2
FROM table_a a
FULL OUTER JOIN table_b b
ON a.common_column = b.common_column;
6. Data Cleaning
✅Remove Duplicates:
DELETE FROM table_name
WHERE rowid NOT IN (
SELECT MIN(rowid)
FROM table_name
GROUP BY column_name);
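🔹Note: rowid is an Oracle pseudo-column; in other databases, deduplicate with ROW_NUMBER() in a CTE or a self-join on a unique key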
✅Update Data:
UPDATE table_name
SET column_name = new_value
WHERE condition;
7. ETL-Specific Validations
✅ETL Data Loading Validation:
SELECT COUNT(*)
FROM target_table
WHERE TRUNC(load_date) = TRUNC(SYSDATE); -- For today's loaded data (Oracle; TRUNC strips the time portion)
✅Compare Aggregates Between Source and Target:
SELECT SUM(amount) AS source_sum
FROM source_table;
SELECT SUM(amount) AS target_sum
FROM target_table;
✅Check for Missing Records:
SELECT source_key
FROM source_table
WHERE source_key NOT IN (
SELECT target_key
FROM target_table);
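🔹Note: if target_key can contain NULLs, NOT IN returns no rows; NOT EXISTS is the safer pattern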
8. Metadata Queries
✅View Table Structure:
DESCRIBE table_name; -- Oracle
SHOW COLUMNS FROM table_name; -- MySQL
✅View Indexes:
SELECT *
FROM user_indexes
WHERE table_name = 'TABLE_NAME'; -- Oracle
✅View Constraints:
SELECT constraint_name, constraint_type
FROM user_constraints
WHERE table_name = 'TABLE_NAME'; -- Oracle
💡 𝗛𝗼𝘄 𝗱𝗼𝗲𝘀 𝗮𝗻 𝗟𝗟𝗠 (𝗹𝗮𝗿𝗴𝗲 𝗹𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗺𝗼𝗱𝗲𝗹) 𝗮𝗰𝘁𝘂𝗮𝗹𝗹𝘆 𝗹𝗲𝗮𝗿𝗻?
It’s a journey through 3 key phases:
1️⃣ 𝗦𝗲𝗹𝗳-𝗦𝘂𝗽𝗲𝗿𝘃𝗶𝘀𝗲𝗱 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 (𝗨𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱𝗶𝗻𝗴 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲)
The model is trained on massive text datasets (Wikipedia, blogs, websites). This is where the transformer architecture comes into the picture; you can think of it simply as a neural network that sees words and predicts what comes next.
For example:
“A flash flood watch will be in effect all _____.”
The model ranks possible answers like “night,” “day,” or even “giraffe.” Over time, it gets really good at picking the right one.
2️⃣ 𝗦𝘂𝗽𝗲𝗿𝘃𝗶𝘀𝗲𝗱 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 (𝗨𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱𝗶𝗻𝗴 𝗜𝗻𝘀𝘁𝗿𝘂𝗰𝘁𝗶𝗼𝗻𝘀)
Next, we teach it how humans like their answers. Thousands of examples of questions and well-crafted responses are fed to the model. This step is smaller but crucial: it’s where the model learns to align with human intent.
3️⃣ 𝗥𝗲𝗶𝗻𝗳𝗼𝗿𝗰𝗲𝗺𝗲𝗻𝘁 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 (𝗜𝗺𝗽𝗿𝗼𝘃𝗶𝗻𝗴 𝗕𝗲𝗵𝗮𝘃𝗶𝗼𝗿)
Finally, the model learns to improve its behavior based on feedback. Humans rate its answers (thumbs up or thumbs down), and the model adjusts.
This helps it avoid harmful or wrong answers and focus on being helpful, honest, and safe.
Through this process, the model learns patterns and relationships in language, which are stored as numerical weights. These weights are then compressed into the parameter file, the core of what makes the model function.
⚙️ So what happens when you ask a question?
The model breaks your question into tokens (small pieces of text, turned into numbers). It processes these numbers through its neural networks and predicts the most likely response.
For example:
“What should I eat today?” might turn into numbers like [123, 11, 45, 78], which the model uses to calculate the next best words to give you the answer.
❗️But here’s something important: every model has a token limit -> a maximum number of tokens it can handle at once. This can vary between small and larger models. Once it reaches that limit, it forgets the earlier context and focuses only on the most recent tokens.
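As a small illustration (assuming the tiktoken tokenizer library; other models use different tokenizers, so the actual ids differ from the example above):
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("What should I eat today?")
print(tokens)              # a list of integer token ids
print(enc.decode(tokens))  # back to the original text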
Finally, you can imagine an LLM as just two files:
➡️ 𝗣𝗮𝗿𝗮𝗺𝗲𝘁𝗲𝗿 𝗳𝗶𝗹𝗲 – This is the big file, where all the knowledge lives. Think of it like a giant zip file containing everything the model has learned about language.
➡️ 𝗥𝘂𝗻 𝗳𝗶𝗹𝗲 – This is the set of instructions needed to use the parameter file. It defines the model’s architecture, handles text tokenization, and manages how the model generates outputs.
That’s a very simple way to break down how LLMs work!
These models are the backbone of AI agents, so let’s not forget about them 😉
Q: How would you scale a dense retrieval system for billions of documents while ensuring query efficiency?
Scaling a dense retrieval system to handle billions of documents can be tricky, but here are some strategies to ensure both efficiency and accuracy:
1️⃣ Leverage ANN (Approximate Nearest Neighbor) Search:
Use algorithms like HNSW and IVF-PQ. These algorithms provide fast, scalable searches by narrowing down the search space, allowing quicker retrieval without compromising on quality (see the sketch after this list).
2️⃣ Compress Vectors for Efficiency:
Reduce the size of vectors using techniques like PCA or Product Quantization. This helps save memory and speeds up similarity searches, making it easier to scale.
3️⃣ Multi-Stage Retrieval for Better Accuracy:
First, apply ANN to retrieve a large candidate pool. Then, refine the results with a more precise model (like a cross-encoder) to improve the relevance of the results.
4️⃣ Implement Caching to Speed Up Responses:
Cache frequently queried results and precomputed embeddings. This reduces redundant processing and ensures faster query response times.
5️⃣ Distribute the Load:
Use sharding to divide the document index across multiple nodes, enabling parallel processing. Also, replicate the index to handle heavy traffic and ensure reliability.
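For point 1, a minimal sketch of an HNSW index built with the faiss library (random vectors stand in for document embeddings; parameter values are illustrative):
import numpy as np
import faiss
d = 128                                                    # embedding dimension
doc_vectors = np.random.rand(100000, d).astype('float32')  # stand-in document embeddings
index = faiss.IndexHNSWFlat(d, 32)                         # HNSW graph with 32 links per node
index.hnsw.efSearch = 64                                   # higher = better recall, slower queries
index.add(doc_vectors)
query = np.random.rand(1, d).astype('float32')
distances, ids = index.search(query, 10)                   # top-10 approximate nearest neighbours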
Missing data occurs when no value is stored for a variable in an observation, which is common in real-world datasets. Incomplete data can skew analyses, reduce the validity of conclusions, and hinder machine learning model performance. The goal of this blog is to cover how to identify, understand, and handle missing values effectively to maintain data integrity.
Impact: Oftentimes, missing data can lead to:
Biased results, especially if missingness is related to the data itself.
Reduced sample size, leading to less robust analysis.
Poor model generalization, if handled improperly.
This article is divided into the following 4 sections:
Identifying missing data
Types of missing data
Methods to handle missing data
Best Practices and Considerations
1. Identifying Missing Data:
Data Profiling:
We can use profiling tools like pandas-profiling or Sweetviz that generate automated reports with insights on missing values.
Example: Use pandas-profiling to generate a profile report in Python.
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_csv("data.csv")
profile = ProfileReport(df, minimal=True)
profile.to_file("data_report.html")
Visualization Techniques:
Use a library like missingno to create heatmaps and bar plots showing missing data patterns.
import missingno as msno
msno.matrix(df) # Matrix plot to view missing values pattern
msno.bar(df) # Bar plot of total missing values per column
Custom Exploratory Functions:
This is my favorite: we write a custom function to display missing data counts and percentages.
def missing_data_summary(df):
    missing_data = df.isnull().sum()
    missing_percentage = (missing_data / len(df)) * 100
    return pd.DataFrame({'Missing Count': missing_data, 'Missing Percentage': missing_percentage})
print(missing_data_summary(df))
2. Types of Missing Data:
Missing Completely at Random (MCAR):
Definition: The missing values are independent of both observed and unobserved data.
Example: Survey data where respondents skipped random questions unrelated to their characteristics.
Can be detected with Statistical tests (e.g., Little's MCAR test)
Missing at Random (MAR):
Definition: Missing values depend on observed data but are not related to the missing values themselves.
Example: Income data may be missing but is related to age or education level.
It can be addressed effectively with conditional imputation, as we can predict missingness based on related variables.
Missing Not at Random (MNAR):
Definition: Missingness depends on unobserved data or the missing values themselves.
Example: People may be unwilling to disclose income if it is very high or low.
Addressing MNAR is challenging, and solutions may require domain knowledge, assumptions, or advanced modeling techniques.
3. Methods To Handle Missing Data:
Listwise Deletion:
Removing rows with missing data may be appropriate when missing values are very few or when the data is MCAR.
Example: If a dataset has <5% missing values, deletion may suffice, though it’s risky for larger proportions.
Mean/Median/Mode Imputation:
For numerical data, impute using mean or median values; for categorical data, use mode.
Pros & Cons: Simple but can introduce bias and reduce variance in the data. If not handled properly, the model will be impacted by biased data.
df['column'] = df['column'].fillna(df['column'].mean())
Advanced Imputation Techniques:
K-Nearest Neighbors (KNN):
Use neighbors to fill missing values based on similarity.
Pros & Cons: Maintains relationships between variables, but it can be computationally expensive.
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df_imputed = imputer.fit_transform(df)
Multivariate Imputation by Chained Equations (MICE):
Iteratively imputes missing values by treating each column as a function of others.
Best suited for MAR, though it can also be computationally expensive.
scikit-learn provides IterativeImputer (and the separate fancyimpute package offers similar functionality) to apply MICE-style imputation.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer()
df_imputed = imputer.fit_transform(df)
Use Machine Learning Models:
Impute missing values by training a model on the non-missing values (see the sketch below).
Example: Predict missing income based on age, education, and occupation.
Alternatively, build groups of similar data points: find a group of similar people and use their non-missing salaries to get the median/mean for the missing ones.
Pros: Potentially accurate. Cons: Requires careful validation to prevent overfitting.
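A minimal sketch of this idea, assuming hypothetical columns income (with gaps) and already-encoded numeric features age, education and occupation:
from sklearn.ensemble import RandomForestRegressor
features = ['age', 'education', 'occupation']        # hypothetical, already numerically encoded
known = df[df['income'].notna()]
missing = df[df['income'].isna()]
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(known[features], known['income'])
df.loc[df['income'].isna(), 'income'] = model.predict(missing[features])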
Data Augmentation:
For small sample sizes, consider generating synthetic data points.
Use models like GANs for more advanced augmentation.
4. Best Practices and Considerations:
Assess Method Effectiveness: Compare imputed values with actual values when possible or test multiple imputation methods to evaluate which yields better model performance.
Use Domain Knowledge: Understanding the domain can be crucial for addressing missing values, for example with MNAR data, when identifying appropriate imputation techniques, or when deciding to drop records.
Monitor Impact: It is important to track model accuracy before and after handling missing data. We also need to weigh the gain against the implementation cost and challenges.
Data Imbalance: We have to be very careful when using simple methods (e.g., upsampling) on imbalanced datasets, as they may not accurately reflect minority-class values.
To summarize, missing values are an inevitable challenge, but addressing them effectively is key to successful data science projects. Understanding the type of missing data and choosing the right handling method—whether simple imputation or advanced techniques like MICE or KNN—is crucial. There’s no one-size-fits-all solution, so leveraging domain knowledge and validating your approach can ensure data integrity and reliable outcomes.
You will be 10x better at Machine Learning
𝐈𝐟 𝐲𝐨𝐮 𝐜𝐨𝐯𝐞𝐫 𝐭𝐡𝐞𝐬𝐞 5 𝐤𝐞𝐲 𝐭𝐨𝐩𝐢𝐜𝐬:
1. 𝐅𝐮𝐧𝐝𝐚𝐦𝐞𝐧𝐭𝐚𝐥𝐬 𝐨𝐟 𝐌𝐚𝐜𝐡𝐢𝐧𝐞 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠
↳ Supervised vs. Unsupervised Learning
↳ Regression, Classification, Clustering
↳ Overfitting, Underfitting, and Regularization
↳ Bias-Variance Tradeoff
2. 𝐃𝐚𝐭𝐚 𝐏𝐫𝐞𝐩𝐚𝐫𝐚𝐭𝐢𝐨𝐧 𝐚𝐧𝐝 𝐅𝐞𝐚𝐭𝐮𝐫𝐞 𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠
↳ Data Cleaning (Handling Missing Values and outliers)
↳ Feature Scaling (Normalization, Standardization)
↳ Feature Selection and Dimensionality Reduction
↳ Encoding Categorical Variables
3. 𝐀𝐥𝐠𝐨𝐫𝐢𝐭𝐡𝐦𝐬 𝐚𝐧𝐝 𝐌𝐨𝐝𝐞𝐥 𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠
↳ Linear Models (Linear Regression, Logistic Regression)
↳ Tree-Based Models (Random Forest, XGBoost)
↳ Neural Networks (ANN, CNN, RNN)
↳ Optimization Techniques
4. 𝐌𝐨𝐝𝐞𝐥 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧 𝐚𝐧𝐝 𝐓𝐮𝐧𝐢𝐧𝐠
↳ Cross-Validation (K-Fold, Leave-One-Out)
↳ Metrics (Accuracy, Precision, Recall, F1-Score, AUC-ROC)
↳ Hyperparameter Tuning (GridSearch, RandomSearch, etc.)
↳ Handling Imbalanced Datasets
5. 𝐌𝐚𝐜𝐡𝐢𝐧𝐞 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐢𝐧 𝐏𝐫𝐨𝐝𝐮𝐜𝐭𝐢𝐨𝐧 - 𝘛𝘩𝘦 𝘔𝘰𝘴𝘵 𝘪𝘮𝘱𝘰𝘳𝘵𝘢𝘯𝘵 𝘰𝘯𝘦!
↳ Model Deployment (Streamlit, Azure ML, AWS SageMaker)
↳ Monitoring and Maintaining Models
↳ Pipelines for MLOps
↳ Interpretability and Explainability (SHAP, LIME, Feature Importance)
Why is this important?
Models in notebooks are cool... until they’re not!
💥 Deploying and scaling them into real, production-ready applications is where most ML practitioners face their biggest roadblocks. If your models are still sitting in Jupyter notebooks, you’re missing out on a HUGE opportunity!