DATA SCIENCE INTERVIEW QUESTIONS
[PART-15]
Q1. How do you deal with unbalanced binary classification?
Ans. Techniques to handle unbalanced data:
1. Use the right evaluation metrics
2. Use K-fold Cross-Validation in the right way
3. Ensemble different resampled datasets
4. Resample with different ratios
5. Design your own models
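The resampling idea in points 3 and 4 can be sketched with a simple random oversampler. This is a minimal pure-Python sketch under my own assumptions; `oversample_minority` is a hypothetical helper, not a library function:

```python
import random
from collections import Counter

def oversample_minority(X, y, seed=42):
    """Randomly duplicate minority-class rows until every class
    matches the majority-class count."""
    random.seed(seed)
    counts = Counter(y)
    majority_class, majority_n = counts.most_common(1)[0]
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        if cls == majority_class:
            continue
        minority_rows = [x for x, label in zip(X, y) if label == cls]
        for _ in range(majority_n - n):
            X_out.append(random.choice(minority_rows))
            y_out.append(cls)
    return X_out, y_out

X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]        # 8:2 imbalance
X_bal, y_bal = oversample_minority(X, y)
print(Counter(y_bal))                      # both classes now have 8 rows
```

In practice you would more often reach for class weights or a library such as imbalanced-learn, but the principle is the same: change the class ratio the model sees during training.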
Q2. What is an activation function?
Ans. Activation functions are mathematical equations that determine the output of a neural network node. An activation function is a non-linear transformation applied to the input before passing it to the next layer of neurons or producing the final output.
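A few common activation functions, written out as plain functions (a minimal stdlib-only sketch):

```python
import math

def sigmoid(x):
    # squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # passes positive inputs through unchanged, zeroes out negatives
    return max(0.0, x)

def tanh(x):
    # squashes input into (-1, 1); zero-centred, unlike sigmoid
    return math.tanh(x)

print(sigmoid(0))   # 0.5
print(relu(-3.0))   # 0.0
print(tanh(0))      # 0.0
```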
Q3. What is dimensionality reduction?
Ans. Dimensionality reduction is used to reduce the feature space under consideration to a smaller set of principal features.
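PCA is the classic technique here; as a simpler, stdlib-only illustration, here is the feature-selection flavour of dimensionality reduction, dropping near-constant columns. `drop_low_variance` is a hypothetical helper of my own, not a library API:

```python
from statistics import pvariance

def drop_low_variance(rows, threshold=0.01):
    """Drop feature columns whose variance is below `threshold`.
    Near-constant features carry almost no information, so a crude
    way to shrink the feature space is to discard them."""
    n_features = len(rows[0])
    keep = [j for j in range(n_features)
            if pvariance([row[j] for row in rows]) > threshold]
    return [[row[j] for j in keep] for row in rows], keep

data = [
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 0.0],
    [3.0, 5.0, 0.0],   # columns 1 and 2 are constant
]
reduced, kept = drop_low_variance(data)
print(kept)            # [0] -> only the first feature survives
```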
Q4. Why is mean squared error a bad measure of model performance?
Ans. Mean Squared Error (MSE) gives a relatively high weight to large errors; as a result, MSE tends to put too much emphasis on large deviations, so a few outliers can dominate the metric.
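A quick numeric check of this: MAE treats four small errors and one big error the same, while MSE punishes the single big miss much harder (a self-contained sketch):

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1.0, 1.0, 1.0, 1.0]
good   = [1.5, 1.5, 1.5, 1.5]   # four small errors of 0.5 each
bad    = [1.0, 1.0, 1.0, 3.0]   # one large error of 2.0

print(mae(y_true, good), mae(y_true, bad))   # 0.5 vs 0.5  -> MAE sees no difference
print(mse(y_true, good), mse(y_true, bad))   # 0.25 vs 1.0 -> MSE punishes the big miss
```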
Q5. How do you remove multicollinearity?
Ans. To remove multicollinearity, we can do one of two things:
1. Create new features (for example, combine the correlated variables into one).
2. Remove one of the correlated features from the data.
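Detecting which features are collinear usually starts with a correlation check. A minimal sketch with hypothetical data (features a and b are near copies, so one of them is a candidate to drop or merge):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.1, 4.0, 6.2, 8.1, 9.9]   # roughly 2*a -> multicollinear with a
c = [5.0, 1.0, 4.0, 2.0, 3.0]   # unrelated

r_ab = pearson(a, b)
r_ac = pearson(a, c)
print(round(r_ab, 3))   # close to 1.0 -> drop a or b, or combine them
print(round(r_ac, 3))   # far from 1  -> keep both
```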
Q6. What is a long-tailed distribution?
Ans. A long-tailed distribution is one in which many occurrences lie far from the "head", or central part, of the distribution. Most occurrences are concentrated at the low end of the x-axis, while a long tail of rare, large values stretches out to the right.
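A quick way to see this: Pareto draws are a classic long-tailed sample, so the mean gets pulled well above the median by the tail (a minimal sketch using the standard library's Pareto generator):

```python
import random
from statistics import mean, median

random.seed(0)
# Most Pareto(1.5) draws are small, but a few are very large.
sample = [random.paretovariate(1.5) for _ in range(10_000)]

print(round(median(sample), 2))  # typical value, near the "head"
print(round(mean(sample), 2))    # pulled above the median by the tail
print(round(max(sample), 2))     # a rare, huge tail value
```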
Q7. What is an outlier? How do you deal with it?
Ans. An outlier is an observation that deviates significantly from the rest of the objects. Outliers can be caused by measurement or execution errors.
Removing outliers is legitimate only for specific reasons, because outliers can be very informative about the subject area and the data collection process. If an outlier does not change the results but does affect assumptions, you may drop it. Alternatively, trim the dataset: replace outliers with the nearest "good" values rather than truncating them completely.
Q8. Give an example where the median is a better measure than the mean.
Ans. If your data contains outliers, you would typically prefer the median, because otherwise the mean would be dominated by the outliers rather than by the typical values. In short: if you are considering the mean, check your data for outliers first; if there are any, the median is the better choice.
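A concrete example with made-up salary data: one extreme value drags the mean far above what anyone typically earns, while the median stays put:

```python
from statistics import mean, median

salaries = [40_000, 42_000, 45_000, 47_000, 50_000, 1_000_000]  # one CEO outlier

print(mean(salaries))    # 204000 -> dragged up by the single outlier
print(median(salaries))  # 46000  -> still reflects a typical salary
```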
ENJOY LEARNING 👍👍
Which of the following methods can be used to handle missing values?
Anonymous Quiz
- Mean Substitution (16%)
- Pairwise deletion (6%)
- Regression imputation (11%)
- All of the above (66%)
Which of the following is not a feature selection technique?
Anonymous Quiz
- Information Gain (21%)
- Forward Selection (13%)
- Regularisation (23%)
- K-means clustering (44%)
Data Science Interview Questions
[PART-16]
Q. How can outlier values be treated?
A. An outlier is an observation in a dataset that differs significantly from the rest of the data. This signifies that an outlier is much larger or smaller than the rest of the data.
Some of the methods of treating outliers are: trimming or removing the outlier, quantile-based flooring and capping, and mean/median imputation.
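The quantile-based flooring and capping idea can be sketched as follows (a stdlib-only sketch; `cap_outliers` and the interpolating `quantile` helper are my own illustrative functions, not library APIs):

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile; assumes the list is sorted."""
    idx = q * (len(sorted_vals) - 1)
    lo, hi = int(idx), min(int(idx) + 1, len(sorted_vals) - 1)
    frac = idx - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def cap_outliers(values, lower_q=0.05, upper_q=0.95):
    """Floor values below the 5th percentile and cap values above the 95th."""
    s = sorted(values)
    lo, hi = quantile(s, lower_q), quantile(s, upper_q)
    return [min(max(v, lo), hi) for v in values]

data = [12, 14, 15, 13, 14, 15, 12, 13, 400]  # 400 is an obvious outlier
capped = cap_outliers(data)
print(max(capped))  # capped at the 95th-percentile value, well below 400
```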
Q. What is root cause analysis?
A. A root cause is a factor that contributed to a nonconformance and should be eradicated permanently through process improvement. The root cause is the most fundamental problem, the most fundamental reason, that sets in motion the entire cause-and-effect chain leading to the problem(s). Root cause analysis (RCA) is a term that refers to a variety of approaches, tools, and procedures used to identify the root causes of problems. Some RCA approaches are more directed toward uncovering true root causes than others; some are general problem-solving procedures, and others simply provide support for the core root cause analysis activity.
Q. What is bias and variance in Data Science?
A. Bias arises from the simplifying assumptions a model makes about the target function to make it easier to estimate; in its most basic form, bias is the difference between the predicted value and the expected value. Variance refers to how much the estimate of the target function will fluctuate given different training data. In contrast to bias, variance occurs when the model fits the fluctuations, or noise, in the data.
Q. What is a confusion matrix?
A. A confusion matrix is a method of summarising a classification algorithm's performance. Calculating a confusion matrix helps you understand what your classification model is getting right and where it is going wrong. It gives us the following counts: "true positive" for event values that were correctly predicted; "false positive" for event values that were incorrectly predicted; "true negative" for no-event values that were correctly predicted; and "false negative" for no-event values that were incorrectly predicted.
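The four counts can be computed directly from the label lists (a minimal sketch for the binary case, where 1 is the event class):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred):
    """Count TP/FP/TN/FN for a binary classifier (1 = event, 0 = no event)."""
    c = Counter()
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            c["TP"] += 1       # event correctly predicted
        elif t == 0 and p == 1:
            c["FP"] += 1       # event predicted, but none occurred
        elif t == 0 and p == 0:
            c["TN"] += 1       # no-event correctly predicted
        else:
            c["FN"] += 1       # event occurred, but none predicted
    return c

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm)   # TP=3, TN=3, FP=1, FN=1
```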
ENJOY LEARNING 👍👍
Which of the following is not a Python library?
Anonymous Quiz
- Pandas (3%)
- Numpy (2%)
- Matplotlib (3%)
- Scikit-learn (10%)
- Array (83%)
Which of the following is not a machine learning algorithm?
Anonymous Quiz
- Linear Regression (5%)
- Random Forest (9%)
- Standard Scaler (77%)
- Decision Tree (6%)
- Logistic Regression (4%)
Which of the following is not a supervised algorithm?
Anonymous Quiz
- Linear Regression (11%)
- Logistic Regression (9%)
- Clustering (64%)
- Decision Tree (16%)
Which of the following tools can be used for data visualization?
Anonymous Quiz
- Tableau (9%)
- Matplotlib (11%)
- Power BI (7%)
- All of the above (74%)
Data Science & Machine Learning
Do you want a daily quiz to enhance your knowledge?
That's an amazing response from you guys ❤️🔥
Which of the following cannot give 10 as an answer?
Anonymous Quiz
- 5*2 (8%)
- 2+5*2-2 (7%)
- 2+5*(2-2) (69%)
- 3*2+9//2 (16%)
Which of the following cannot give 10 as an answer?
Well done, guys!!
Explanation for those who marked the wrong answer:
Read the question again.
The answer to 9//2 is 4, not 4.5, because // is floor (integer) division.
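To see the difference between the two division operators directly:

```python
print(9 / 2)            # 4.5 -> true division always returns a float
print(9 // 2)           # 4   -> floor division discards the fraction
print(3 * 2 + 9 // 2)   # 6 + 4 = 10, since * and // bind tighter than +
print(-9 // 2)          # -5  -> floors toward negative infinity, not toward zero
```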
Mathematics for Machine Learning
Published by Cambridge University Press, April 2020
https://mml-book.com
PDF: https://mml-book.github.io/book/mml-book.pdf
Neural Networks and Learning Machines Third Edition
https://cours.etsmtl.ca/sys843/REFS/Books/ebook_Haykin09.pdf
Which of the following is not an unsupervised algorithm?
Anonymous Quiz
- K-means clustering (13%)
- Hierarchical Clustering (14%)
- Anomaly detection (21%)
- Logistic Regression (52%)
How can a fresher get a job as a data scientist?
The Indian job market is highly resistant to hiring data scientists as freshers. Everyone out there asks for at least 2 years of experience, but then the question is: where will we get those two years of experience from?
The important thing here is to build a portfolio. As a fresher, I would assume you have learnt data science through online courses. They only teach you the basics; the analytical skills required to clean data and apply machine learning algorithms come only from practice.
Do some real-world data science projects and participate in Kaggle competitions. Kaggle provides datasets for practice as well. Whatever projects you do, create a GitHub repository for them. Place all your projects there so that when a recruiter looks at your profile, they know you have hands-on practice and know the basics. This will take you a long way.
Most of the major data science jobs for freshers will only be available through off-campus interviews.
Some companies that hire data scientists are:
Siemens
Accenture
IBM
Cerner
Creating a technical portfolio showcases the knowledge you have already gained, and that is essential when you go out there as a fresher and try to find a data scientist job.
Forwarded from Data Science & Machine Learning
7 Steps of the Machine Learning Process
Data Collection: The process of extracting raw datasets for the machine learning task. This data can come from a variety of places, ranging from open-source online resources to paid crowdsourcing. The first step of the machine learning process is arguably the most important. If the data you collect is poor quality or irrelevant, then the model you train will be poor quality as well.
Data Processing and Preparation: Once youโve gathered the relevant data, you need to process it and make sure that it is in a usable format for training a machine learning model. This includes handling missing data, dealing with outliers, etc.
Feature Engineering: Once youโve collected and processed your dataset, you will likely need to transform some of the features (and sometimes even drop some features) in order to optimize how well a model can be trained on the data.
Model Selection: Based on the dataset, you will choose which model architecture to use. This is one of the main tasks of industry engineers. Rather than attempting to come up with a completely novel model architecture, most tasks can be thoroughly performed with an existing architecture (or combination of model architectures).
Model Training and Data Pipeline: After selecting the model architecture, you will create a data pipeline for training the model. This means creating a continuous stream of batched data observations to efficiently train the model. Since training can take a long time, you want your data pipeline to be as efficient as possible.
Model Validation: After training the model for a sufficient amount of time, you will need to validate the modelโs performance on a held-out portion of the overall dataset. This data needs to come from the same underlying distribution as the training dataset, but needs to be different data that the model has not seen before.
Model Persistence: Finally, after training and validating the modelโs performance, you need to be able to properly save the model weights and possibly push the model to production. This means setting up a process with which new users can easily use your pre-trained model to make predictions.
5_6339144778529113396.pdf (11.1 MB): Machine learning notes in 15 pages