Essential Topics to Master Data Science Interviews:
SQL:
1. Foundations
- Craft SELECT statements with WHERE, ORDER BY, GROUP BY, HAVING
- Embrace Basic JOINS (INNER, LEFT, RIGHT, FULL)
- Navigate through simple databases and tables
2. Intermediate SQL
- Utilize Aggregate functions (COUNT, SUM, AVG, MAX, MIN)
- Embrace Subqueries and nested queries
- Master Common Table Expressions (WITH clause)
- Implement CASE statements for logical queries
3. Advanced SQL
- Explore Advanced JOIN techniques (self-join, non-equi join)
- Dive into Window functions (OVER, PARTITION BY, ROW_NUMBER, RANK, DENSE_RANK, LEAD, LAG); see the sketch after this list
- Optimize queries with indexing
- Execute Data manipulation (INSERT, UPDATE, DELETE)
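To make the CTE and window-function items concrete, here is a minimal sketch run from Python against an in-memory SQLite database. The sales table and its values are invented for illustration, and window functions require SQLite 3.25+:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (emp TEXT, region TEXT, amount INTEGER);
INSERT INTO sales VALUES
  ('amy', 'east', 100), ('bob', 'east', 300),
  ('cat', 'west', 200), ('dan', 'west', 200);
""")

query = """
WITH regional AS (               -- CTE: a reusable named subquery
    SELECT emp, region, amount FROM sales
)
SELECT emp, region, amount,
       RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk,
       LAG(amount) OVER (PARTITION BY region ORDER BY amount DESC) AS prev_amount
FROM regional;
"""
for row in conn.execute(query):
    print(row)   # e.g. ('bob', 'east', 300, 1, None)

RANK restarts within each region because of PARTITION BY, and LAG pulls the previous row's amount inside the same window.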
Python:
1. Python Basics
- Grasp Syntax, variables, and data types
- Apply Control structures (if-else, for and while loops)
- Understand Basic data structures (lists, dictionaries, sets, tuples)
- Master Functions, lambda functions, and error handling (try-except)
- Explore Modules and packages
2. Pandas & Numpy
- Create and manipulate DataFrames and Series
- Perfect Indexing, selecting, and filtering data
- Handle missing data (fillna, dropna)
- Aggregate and summarize data with groupby
- Merge, join, and concatenate datasets (see the sketch after this list)
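As a quick illustration of the items above, here is a minimal pandas sketch with invented toy data covering fillna, groupby, and a merge:

import pandas as pd

orders = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "c"],
    "amount": [10.0, None, 25.0, 5.0, None],
})
orders["amount"] = orders["amount"].fillna(0)               # handle missing values

totals = orders.groupby("customer", as_index=False)["amount"].sum()

regions = pd.DataFrame({"customer": ["a", "b"], "region": ["east", "west"]})
report = totals.merge(regions, on="customer", how="left")   # keeps 'c' with NaN region
print(report)

Using how="left" mirrors a SQL LEFT JOIN: customers without a matching region row are kept rather than dropped.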
3. Data Visualization with Python
- Plot with Matplotlib (line plots, bar plots, histograms)
- Visualize with Seaborn (scatter plots, box plots, pair plots)
- Customize plots (sizes, labels, legends, color palettes); see the sketch after this list
- Introduction to interactive visualizations (e.g., Plotly)
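Here is a minimal plotting sketch tying these items together. It assumes Matplotlib and Seaborn are installed; load_dataset fetches Seaborn's small "tips" demo dataset over the network on first use:

import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")   # small demo dataset from Seaborn's examples

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Matplotlib: histogram with basic customization
ax1.hist(tips["total_bill"], bins=20, color="steelblue")
ax1.set(title="Total bill", xlabel="USD", ylabel="Count")

# Seaborn: scatter plot with a hue-based legend
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time", ax=ax2)
ax2.set_title("Tip vs. bill")
ax2.legend(title="Time")

plt.tight_layout()
plt.show()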
Excel:
1. Excel Essentials
- Perform cell operations and basic formulas (SUMIFS, COUNTIFS, AVERAGEIFS, IF, AND, OR, NOT, and nested functions)
- Dive into charts and basic data visualization
- Sort and filter data, use Conditional formatting
2. Intermediate Excel
- Master Advanced formulas (VLOOKUP/XLOOKUP, INDEX-MATCH, nested IF)
- Leverage PivotTables and PivotCharts for summarizing data
- Utilize data validation tools
- Employ What-if analysis tools (Data Tables, Goal Seek)
3. Advanced Excel
- Harness Array formulas and advanced functions
- Dive into Data Model & Power Pivot
- Explore Advanced Filter, Slicers, and Timelines in Pivot Tables
- Create dynamic charts and interactive dashboards
Power BI:
1. Data Modeling in Power BI
- Import data from various sources
- Establish and manage relationships between datasets
- Grasp Data modeling basics (star schema, snowflake schema)
2. Data Transformation in Power BI
- Use Power Query for data cleaning and transformation
- Apply advanced data shaping techniques
- Create Calculated columns and measures using DAX
3. Data Visualization and Reporting in Power BI
- Craft interactive reports and dashboards
- Utilize Visualizations (bar, line, pie charts, maps)
- Publish and share reports, schedule data refreshes
Statistics Fundamentals:
- Mean, Median, Mode
- Standard Deviation, Variance
- Probability Distributions, Hypothesis Testing
- P-values, Confidence Intervals (see the sketch after this list)
- Correlation, Simple Linear Regression
- Normal Distribution, Binomial Distribution, Poisson Distribution
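A quick worked example of a hypothesis test and a confidence interval with SciPy; the two groups are simulated, so the exact numbers are illustrative only:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=100, scale=15, size=50)   # e.g. control group
b = rng.normal(loc=108, scale=15, size=50)   # e.g. treatment group

# Two-sample t-test: H0 says the group means are equal
t_stat, p_value = stats.ttest_ind(a, b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")   # p < 0.05 -> reject H0 at the 5% level

# 95% confidence interval for the mean of b, via the t distribution
ci = stats.t.interval(0.95, df=len(b) - 1, loc=b.mean(), scale=stats.sem(b))
print("95% CI for mean of b:", ci)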
Show some ❤️ if you're ready to elevate your data science game!
ENJOY LEARNING!
Data Science Interview Question:
How would you extend SVM for multi-class classification?
Two common ways are -
One-vs-Rest (OvR), or One-vs-All
Each classifier is trained to separate one class from all others. For K classes, OvR builds K SVM models, where each model is trained with the class of interest labeled as positive and all other classes labeled as negative. For a new instance, each classifier outputs a score, and the class with the highest score is chosen as the predicted class.
Pros of OvR -
- Computationally efficient, especially when there are many classes, as it requires fewer classifiers.
- Works well when the dataset is large and class overlap isn't significant.
Cons of OvR -
- The negative class for each classifier can be a mix of very different classes, which can make the boundary between classes less distinct.
- May struggle with overlapping classes, as it requires each classifier to make broad distinctions between one class and all others.
One-vs-One (OvO)
This method involves building a separate binary classifier for each pair of classes, resulting in K(K−1)/2 classifiers for K classes. Each classifier learns to distinguish between just two classes. For classification, each binary classifier votes for a class, and the class with the most votes is selected.
Pros of OvO -
- Creates simpler decision boundaries, as each classifier only has to separate two classes.
- Often yields higher accuracy for complex, overlapping classes, since it doesn't force each classifier to distinguish between all classes.
Cons of OvO -
- Computationally intensive for large numbers of classes, due to the higher number of classifiers.
- Prediction time can be slower, as it requires voting among all classifiers, which can be significant if there are many classes.
Choosing Between OvR and OvO
The choice between OvR and OvO depends largely on the specific dataset characteristics and computational constraints:
- If computational resources are limited and the number of classes is high, OvR may be preferred, as it requires fewer classifiers and is faster to train and predict with.
- If accuracy is critical and the classes overlap significantly, OvO often performs better, since it learns more specialized decision boundaries for each pair of classes (a short scikit-learn sketch follows).
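A minimal scikit-learn sketch that makes both strategies concrete on the three-class iris dataset (the dataset and the linear kernel are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X_tr, y_tr)  # K classifiers
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X_tr, y_tr)   # K(K-1)/2 classifiers

print("OvR accuracy:", ovr.score(X_te, y_te))
print("OvO accuracy:", ovo.score(X_te, y_te))
print("OvR models trained:", len(ovr.estimators_))   # 3 for K=3
print("OvO models trained:", len(ovo.estimators_))   # 3*2/2 = 3 for K=3

Note that SVC already uses OvO internally for multi-class input; wrapping it explicitly just makes the strategy visible and easy to swap.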
So what should an entry-level interview experience look like?
Having been on both sides of the process, this format, IMO, is the most effective one:
Round 1:
⭐️ 30 minutes LeetCode, 30 minutes SQL
The goal? Understand how the candidate approaches the problem - clarifying ambiguity, addressing edge cases, and writing code.
Passing a few test cases is required, but not all of them.
A better-than-brute-force solution is required; the optimal solution is not.
Round 2:
⭐️ Machine Learning/Statistics and resume-based questions
The goal? Make sure they understand basic concepts - bias vs. variance, hypothesis testing, cleaning data, etc. - and how they have approached ML formulation, metric selection, and modelling in the past.
Round 3:
⭐️ Hiring Manager (+ senior team member) to review work on resume + culture fit
The goal? For the HM and senior team members to assess whether the candidate is a culture fit with the team, to review prior work, and to judge whether the way they think about solving a data/ML problem would work on the team (or whether the person is coachable).
Join our channel for more information like this
Amazon Data Science Interview Question:
In a linear regression model, what are the key assumptions that need to be satisfied for the model to be valid? How would you evaluate whether these assumptions hold in your dataset?
This is also the most common question I see across companies!
So the assumptions are -
Linearity
The relationship between the independent variables (predictors) and the dependent variable is linear. This means that the effect of each predictor on the outcome is constant and additive.
How to evaluate? - Scatter plots of predictors vs. the dependent variable, and residual vs. fitted value plots; a curved pattern in the residuals signals non-linearity.
How to fix? - Apply feature transformations (e.g., log, square root, polynomial) or use non-linear models.
Normality of Errors
The residuals are normally distributed. This matters mainly for conducting statistical tests and constructing confidence intervals, rather than for the coefficient estimates themselves.
How to evaluate - Q-Q plots or histograms of the residuals, or a formal test such as Shapiro-Wilk.
How to fix - Transform the dependent variable (log, Box-Cox) and/or check for outliers.
Independence of Errors
The residuals are not correlated with one another.
How to evaluate - Residual autocorrelation plots or the Durbin-Watson test for time-series data. For non-time-series data, this assumption is usually satisfied if the data is randomly sampled.
How to fix - For time series, model the autocorrelation directly (e.g., add lagged terms or use generalized least squares).
Homoscedasticity (Constant Variance of Errors)
The variance of the residuals (errors) is constant across all levels of the independent variables. In other words, the spread of residuals should not increase or decrease as the predicted values increase.
How to evaluate - Plot the residuals against fitted values; a "fan" shape (increasing or decreasing spread of residuals) indicates heteroscedasticity. A formal test such as Breusch-Pagan can confirm it.
How to fix - Transform the dependent variable (log, Box-Cox), use weighted least squares regression, or report robust (heteroscedasticity-consistent) standard errors.
No Multicollinearity
The independent variables (predictors) are not highly correlated with each other. High correlation between predictors can lead to multicollinearity, which makes it difficult to determine the individual effect of each predictor on the dependent variable.
How to evaluate - Calculate the Variance Inflation Factor (VIF) for each predictor; values above roughly 5-10 are commonly treated as a warning sign. Inspecting the pairwise correlation matrix also helps.
How to fix - Remove or combine correlated predictors (e.g., via Principal Component Analysis), or use regularized regression models like Ridge or Lasso.
A short Python sketch pulling these diagnostics together follows.
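Here is a sketch of how these checks might look in Python with statsmodels and SciPy; the synthetic data and the VIF threshold are illustrative, not universal rules:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson
from scipy import stats

rng = np.random.default_rng(1)
X = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
y = 2 * X["x1"] - X["x2"] + rng.normal(scale=0.5, size=200)

Xc = sm.add_constant(X)
model = sm.OLS(y, Xc).fit()
resid = model.resid

# Linearity: plot model.fittedvalues against resid and look for curvature

# Normality: Shapiro-Wilk on the residuals (also inspect a Q-Q plot)
print("Shapiro p-value:", stats.shapiro(resid).pvalue)

# Independence: Durbin-Watson (values near 2 suggest no autocorrelation)
print("Durbin-Watson:", durbin_watson(resid))

# Homoscedasticity: Breusch-Pagan (a small p-value suggests heteroscedasticity)
bp_stat, bp_pvalue, _, _ = het_breuschpagan(resid, Xc)
print("Breusch-Pagan p-value:", bp_pvalue)

# Multicollinearity: VIF per predictor (above ~5-10 is a common warning sign)
for i, col in enumerate(Xc.columns):
    if col != "const":
        print(col, variance_inflation_factor(Xc.values, i))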