Machine Learning And AI
Hi all, and welcome! Join our channel for jobs, the latest programming blogs, and machine learning blogs.
If you have any doubts about ML/Data Science, please reach out to me @ved1104. Subscribe to my channel:
https://youtube.com/@geekycodesin?si=JzJo3WS5E_VFmD1k
Data Science Interview Question:
How do outliers impact kNN?

Outliers can significantly degrade kNN's performance: because the model relies on proximity to make decisions, a few extreme points can lead to inaccurate predictions. Here's a breakdown of how outliers influence kNN:

High Variance
The presence of outliers can increase the model's variance: predictions near an outlier may fluctuate unpredictably depending on whether it falls among the k nearest neighbors. This makes the model less reliable, particularly for regression on scattered or sparse data.
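
A minimal sketch of this effect, assuming a hypothetical 1-D training set in which one target value (30.0) is an outlier. The function name `knn_regress` and the data are illustrative only and do not come from any library.

```python
def knn_regress(train, query, k):
    """Predict by averaging the targets of the k training points nearest to query (1-D feature)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    return sum(target for _, target in nearest) / k


# (feature, target) pairs; the target 30.0 at x = 3.0 is the assumed outlier.
train = [(1.0, 1.0), (2.0, 2.0), (3.0, 30.0), (4.0, 4.0), (5.0, 5.0)]

# Two queries that are only 0.2 apart.
low_k1 = knn_regress(train, 2.4, k=1)
high_k1 = knn_regress(train, 2.6, k=1)
low_k3 = knn_regress(train, 2.4, k=3)
high_k3 = knn_regress(train, 2.6, k=3)

print("k=1:", low_k1, high_k1)
print("k=3:", low_k3, high_k3)
```

With k=1 the two nearby queries get predictions of 2.0 and 30.0, because only one of them picks up the outlier as its neighbor. With k=3 both include the outlier and the predictions (11.0 and 12.0) are closer together, though still pulled upward.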

Distance Metric Sensitivity
kNN relies on distance metrics, which outliers can significantly affect. In high-dimensional spaces, outliers can widen the range of distances, making it harder for the algorithm to distinguish nearby points from those farther away. As the model's ability to measure "closeness" degrades, overall accuracy can drop.

Reduced Performance in Classification/Regression Tasks
Outliers near class boundaries can pull the decision boundary toward them, potentially misclassifying nearby points that belong to a different class. This is particularly problematic when k is small, because each individual point (such as an outlier) has greater influence on the vote. The same applies to regression, where an outlier's target value skews the averaged prediction.
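
The role of k can be sketched as follows, assuming a hypothetical 2-D dataset with two clusters and a single class-B point sitting inside the class-A cluster. `knn_classify` is an illustrative helper, not a library function.

```python
from collections import Counter
from math import dist


def knn_classify(train, query, k):
    """Return the majority label among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train = [
    ((0.0, 0.0), "A"), ((0.0, 1.0), "A"), ((1.0, 0.0), "A"),
    ((1.0, 1.0), "A"), ((0.2, 0.8), "A"),
    ((5.0, 5.0), "B"), ((5.0, 6.0), "B"), ((6.0, 5.0), "B"), ((6.0, 6.0), "B"),
    ((0.5, 0.5), "B"),  # assumed outlier: a class-B point inside the class-A cluster
]
query = (0.6, 0.6)

small_k = knn_classify(train, query, k=1)
larger_k = knn_classify(train, query, k=5)
print(small_k, larger_k)
```

With k=1 the query takes the outlier's label (B). With k=5 the four surrounding class-A points outvote it and the query is labeled A.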

Disproportionate Feature Influence
If certain features contain outliers, they can dominate the distance calculations and overshadow the impact of other features. For example, an outlier in a high-magnitude feature may cause distances to be determined largely by that feature, degrading the quality of the neighbor selection.
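
A small sketch of this, assuming hypothetical (age, income) rows in which one income value is an outlier. Rescaling each feature by its median and interquartile range is one possible mitigation chosen here for illustration; the post itself does not prescribe a method, and `robust_scaler` is a made-up helper name.

```python
from math import dist
from statistics import median, quantiles


def nearest(train, query):
    """Return the training point with the smallest Euclidean distance to query."""
    return min(train, key=lambda point: dist(point, query))


def robust_scaler(train):
    """Build a function that centers each feature on its median and divides by its IQR."""
    columns = list(zip(*train))
    centers = [median(col) for col in columns]
    spreads = []
    for col in columns:
        q1, _, q3 = quantiles(col, n=4)
        spreads.append(q3 - q1)
    return lambda point: tuple((v - c) / s for v, c, s in zip(point, centers, spreads))


# Hypothetical (age, income) rows; the income of 400_000 is the assumed outlier.
train = [(25, 40_000), (31, 52_000), (45, 61_000), (60, 50_500), (38, 400_000)]
query = (30, 50_000)

raw_neighbor = nearest(train, query)

scale = robust_scaler(train)
scaled_train = [scale(point) for point in train]
scaled_neighbor = train[scaled_train.index(nearest(scaled_train, scale(query)))]

print(raw_neighbor, scaled_neighbor)
```

On the raw values the nearest neighbor is the 60-year-old, because income differences in the hundreds swamp a 30-year age gap. After rescaling, the 31-year-old with a similar income becomes the nearest neighbor.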
Company Name : Amazon
Role : Cloud Support Associate
Batch : 2024/2023 passouts

Link : https://www.amazon.jobs/en/jobs/2676989/cloud-support-associate
Company Name : Swiggy
Role : Associate Software Engineer
Batch : 2024/2023/2022 passouts

Link : https://docs.google.com/forms/d/1E029cjZV8Em6zPC0YJYAMDDP_NjPtDkwufqHfvkVG2E/viewform?edit_requested=true&pli=1
Data science interview questions:

Position: Data Scientist
There were 3 rounds of interviews followed by 1 HR discussion.

Coding-related questions:

1. You are given 2 lists:
l1 = [1, 2, 2, 3, 4, 5, 6, 6, 7]
l2 = [1, 2, 4, 5, 5, 6, 6, 7]
Return all the elements of l2 that also appear in l1, respecting their frequency of occurrence in l1.
For example, the elements of l2 that occur in l1 are: 1, 2 (only once), 4, 5 (only once, since l1 contains a single 5), 6 (twice), 7
Expected Output: [1, 2, 4, 5, 6, 6, 7]

2.
text = ' I am a data scientist working in Paypal'
Return the longest word along with its letter count.
Expected Output: ('scientist', 9)
(In case of ties, sort the words in alphabetical order and return the first one.)

3. You are given 2 tables in SQL. Table 1 contains one column holding the single value '3' repeated 5 times. Table 2 contains one column with 2 values: '2' repeated 3 times and '3' repeated 4 times. Tell me the number of records returned by an inner, left, right, full outer, and cross join.

4. You are given a transaction table with 4 columns: txn_id (primary key), cust_id (foreign key), txn_date (datetime), and txn_amt. One cust_id can have multiple txn_ids. Your job is to find all the cust_ids that made at least 2 txns within 10 seconds of each other. (This might help to identify fraudulent patterns.)
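
One possible way to approach the four coding questions above, sketched in Python. Questions 3 and 4 are SQL questions in the interview; here they are only simulated in Python to check the logic. All function names are made up for illustration, the sample transactions are hypothetical, "within 10 seconds" is assumed to mean a gap of at most 10 seconds, and the join counts assume the column contains no NULLs.

```python
from collections import Counter
from datetime import datetime, timedelta


def common_by_frequency(l1, l2):
    """Q1: keep each element of l2 at most as many times as it occurs in l1."""
    remaining = Counter(l1)
    result = []
    for item in l2:
        if remaining[item] > 0:
            remaining[item] -= 1
            result.append(item)
    return result


def longest_word(text):
    """Q2: longest word and its length; ties go to the alphabetically first word."""
    best = min(text.split(), key=lambda word: (-len(word), word))
    return best, len(best)


def join_counts(left, right):
    """Q3: row counts for equi-joins on a single column (no NULLs assumed)."""
    lc, rc = Counter(left), Counter(right)
    inner = sum(lc[value] * rc[value] for value in lc)
    left_only = sum(n for value, n in lc.items() if value not in rc)
    right_only = sum(n for value, n in rc.items() if value not in lc)
    return {
        "inner": inner,
        "left": inner + left_only,
        "right": inner + right_only,
        "outer": inner + left_only + right_only,
        "cross": len(left) * len(right),
    }


def customers_with_rapid_txns(txns, window=timedelta(seconds=10)):
    """Q4: cust_ids with two consecutive txns at most `window` apart.

    txns is an iterable of (txn_id, cust_id, txn_date, txn_amt) tuples.
    """
    dates_by_customer = {}
    for _txn_id, cust_id, txn_date, _txn_amt in txns:
        dates_by_customer.setdefault(cust_id, []).append(txn_date)
    flagged = set()
    for cust_id, dates in dates_by_customer.items():
        dates.sort()
        # After sorting, the closest pair is always a pair of neighbors.
        if any(later - earlier <= window for earlier, later in zip(dates, dates[1:])):
            flagged.add(cust_id)
    return flagged


q1 = common_by_frequency([1, 2, 2, 3, 4, 5, 6, 6, 7], [1, 2, 4, 5, 5, 6, 6, 7])
q2 = longest_word(" I am a data scientist working in Paypal")
q3 = join_counts(["3"] * 5, ["2"] * 3 + ["3"] * 4)

# Hypothetical sample rows for Q4.
sample_txns = [
    (1, "c1", datetime(2024, 1, 1, 10, 0, 0), 50.0),
    (2, "c1", datetime(2024, 1, 1, 10, 0, 7), 20.0),
    (3, "c2", datetime(2024, 1, 1, 10, 0, 0), 10.0),
    (4, "c2", datetime(2024, 1, 1, 10, 5, 0), 10.0),
]
q4 = customers_with_rapid_txns(sample_txns)
print(q1, q2, q3, q4, sep="\n")
```

For question 3 this gives 20 rows for the inner and left joins (each of the five '3' rows matches four rows), 23 for the right and full outer joins (the three unmatched '2' rows are added), and 35 for the cross join. Note that the tie-break in `longest_word` is case-sensitive, which is a simplification.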

Case study questions:
1. Tell me the business model and revenue sources of Paypal.
Tell me the general consequences of changing the price of any product.

2. You are a data scientist who studies the impact of pricing changes. Suppose the business team comes to you and asks what will happen if they increase the merchant fee per txn by 5%.
What will be your recommendation and strategy?
What factors will you consider, and what will the proposed ML solution look like?

3. You see that some merchants are heavily misusing the refund facility (merchants get a refund for incorrect or disputed txns): they are claiming reimbursements by making fake txns.
List possible scenarios and ways to identify such merchants.

4. How will you decide the pricing of a premium product like the iPhone in India vs., say, South Africa? What factors will you consider?

Statistics Questions:
1. What is multicollinearity?
2. What are Type 1 and Type 2 errors? (Explain from a pricing experimentation point of view.)
3. What is the Weibull distribution?
4. What is the CLT? What is the difference between the t and normal distributions?
5. What is the Wald test?
6. What is the Ljung-Box test? Explain the null hypothesis of the ADF test in time series.
7. What is causality?

ML Questions:
1. What is logistic regression? What is deviance?
2. What is the difference between R-squared and adjusted R-squared?
3. How does Random Forest work?
4. What is the difference between bagging and boosting?
In terms of the bias-variance trade-off, what do bagging and boosting each attempt to solve?
I struggled with Data Science interviews until...

I followed this roadmap:

Python
👉🏼 Master the basics: syntax, loops, functions, and data structures (lists, dictionaries, sets, tuples)
👉🏼 Learn Pandas & NumPy for data manipulation
👉🏼 Matplotlib & Seaborn for data visualization

Statistics & Probability
👉🏼 Descriptive statistics: mean, median, mode, standard deviation
👉🏼 Probability theory: distributions, Bayes' theorem, conditional probability
👉🏼 Hypothesis testing & A/B testing

Machine Learning
👉🏼 Supervised vs. unsupervised learning
👉🏼 Key algorithms: Linear & Logistic Regression, Decision Trees, Random Forest, KNN, SVM
👉🏼 Model evaluation metrics: accuracy, precision, recall, F1 score, ROC-AUC
👉🏼 Cross-validation & hyperparameter tuning

Deep Learning
👉🏼 Neural Networks & their architecture
👉🏼 Working with Keras & TensorFlow/PyTorch
👉🏼 CNNs for image data and RNNs for sequence data

Data Cleaning & Feature Engineering
👉🏼 Handling missing data, outliers, and data scaling
👉🏼 Feature selection techniques (e.g., correlation, mutual information)

NLP (Natural Language Processing)
👉🏼 Tokenization, stemming, lemmatization
👉🏼 Bag-of-Words, TF-IDF
👉🏼 Sentiment analysis & topic modeling

Cloud and Big Data
👉🏼 Understanding cloud services (AWS, GCP, Azure) for data storage & computing
👉🏼 Working with distributed data using Spark
👉🏼 SQL for querying large datasets

Don't get overwhelmed by the breadth of topics. Start small: master one concept, then move to the next. 📈

You've got this! 💪🏼