Forwarded from New Yorko Times (Yury Kashnitsky)
Interview fails: 2023 edition
#career #interviews
The channel description promises fails, so I have to keep up the pace of failing.
The picture is by Borya Zubarev (placement: check out his X-LLM for fine-tuning LLMs, you might like it); he sent it to me after my post about failed interviews in 2022.
Here we go:
- Uber, Senior Applied Scientist – passed one interview, then was told the opening itself had been closed (and indeed, a senior role with no reports did sound suspicious);
- eBay, Principal Applied Scientist (Gen AI) – passed the HM round, but then a mid-level engineer decided to grill me on his own production problem – bi-encoders and cross-encoders for predicting keywords for listings. And he dug deep. I know semantic search, I follow my colleagues' project, but this went really deep... if you haven't hacked on it yourself, you can't answer. Oh, and they also wanted production RLHF experience 😳 Good luck! I hope they found someone.
- LLM researcher at an established startup – great team, lots of GMs; I passed everything, heard plenty of compliments, and then the excuse that "it wouldn't be interesting for me there." I left with the odd feeling that they don't know what they want (another strong candidate heard the same from them).
- 2 hardcore HFT funds – at one, my take-home on debugging PyTorch code didn't land; at the other I solved the algorithmic problem almost perfectly but started stalling on the ML problem by the end of the fourth hour. The money there is huge, of course, but work-life balance suffers, and there's C++… In short, here I'm the one falling short.
- finally, Amazon, Amsterdam again: went through the whole grinder of 7 interviews once more. This time, unlike in 2022, I didn't make it to team matching; I got the standard minimal feedback that the bar raiser didn't like something in one of my answers.
Well, as Bi-2 sings, "I keep moving on." Luckily, my current work is very interesting, and some of my side projects are taking off too. As for interview "successes," so far there's only a response from another big tech company to my cold resume submission; we're gearing up for another seven-interview meat grinder.
I'm expecting snarky comments linking to that post about mentoring. Let me preempt them: my mentees are doing great, better than me 🙂 One mentee landed a job where he ended up interviewing me 😂; a second, a friend of mine, is moving to the Netherlands soon; a third is in the final rounds with that same Amazon; a fourth got an offer in the Valley. Four more are in the pipeline. The cobbler's children have no shoes, but at least I'm candid about it (and I'm not taking new mentees at the moment anyway).
Wishing you good fails, the kind that come with gradients. Any streak of fails ends someday, which is what I wish for all of us. If you flip the coin persistently enough, it will eventually land the right side up.
#books #kagglebook #interviews
Paweł Jankiewicz
I tend to always build a framework for each competition that allows me to create as many experiments as possible.
You should create a framework that allows you to change the most sensitive parts of the pipeline quickly.
I also try to build my own framework so I don't start from scratch each time; for DS, that inevitably becomes AutoML.
What Kaggle competitions taught me is the importance of validation, data leakage prevention, etc. For example, if data leaks happen in so many competitions, when people who prepare them are the best in the field, you can ask yourself what percentage of production models have data leaks in training; personally, I think 80%+ of production models are probably not validated correctly, but don’t quote me on that.
Software engineering skills are probably underestimated a lot. Every competition and problem is slightly different and needs some framework to streamline the solution (look at https://github.com/bestfitting/instance_level_recognition and how well their code is organized). Good code organization helps you to iterate faster and eventually try more things.
Andrew Maranhão
While libraries are great, I also suggest that at some point in your career you take the time to implement it yourself. I first heard this advice from Andrew Ng and then from many others of equal calibre. Doing this creates very in-depth knowledge that sheds new light on what your model does and how it responds to tuning, data, noise, and more.
Over the years, the things I wished I realized sooner the most were:
1. Absorbing all the knowledge at the end of a competition
2. Replication of winning solutions in finished competitions
This is a very powerful idea. Taking it further: you should also learn from finished competitions you did NOT take part in, and even from synthetic ones you create yourself!
In the pressure of a competition drawing to a close, you can see the leaderboard shaking more than ever before. This makes it less likely that you will take risks and take the time to see things in all their detail.
When a competition is over, you don’t have that rush and can take as long as you need; you can also replicate the rationale of the winners who made their solutions known.
If you have the discipline, this will do wonders for your data science skills, so the bottom line is: stop when you are done, not when the competition ends. I have also heard this advice from an Andrew Ng keynote, where he recommended replicating papers as one of his best ways to develop yourself as an AI practitioner.
Martin Henze
In many cases, after those first few days we’re more than 80% on the way to the ultimate winner’s solution, in terms of scoring metric. Of course, the fun and the challenge of Kaggle are to find creative ways to get those last few percent of, say, accuracy. But in an industry job, your time is often more efficiently spent in tackling a new project instead.
I don’t know how often a hiring manager would actually look at those resources, but I frequently got the impression that my Grandmaster title might have opened more doors than my PhD did. Or maybe it was a combination of the two. In any case, I can very much recommend having a portfolio of public Notebooks.
Even if you’re a die-hard Python aficionado, it pays off to have a look beyond pandas and friends every once in a while. Different tools often lead to different viewpoints and more creativity.
Andrada Olteanu
I believe the most overlooked aspect of Kaggle is the community. Kaggle has the biggest pool of people, all gathered in one convenient place, from which one could connect, interact, and learn from. The best way to leverage this is to take, for example, the first 100 people from each Kaggle section (Competitions, Datasets, Notebooks – and if you want, Discussions), and follow on Twitter/LinkedIn everybody that has this information shared on their profile. This way, you can start interacting on a regular basis with these amazing people, who are so rich in insights and knowledge.
#books #kagglebook #interviews
Yifan Xie
In terms of techniques, I have built up a solid pipeline of machine learning modules that allow me to quickly apply typical techniques and algorithms on most data problems. I would say this is a kind of competitive advantage for me: a focus on standardizing, both in terms of work routine and technical artifacts over time. This allows for quicker iteration and in turn helps improve efficiency when conducting data experiments, which is a core component of Kaggle.
I am a very active participant on Numerai. For me, based on my four reasons to do data science, it is more for profit, as they provide a payout via their cryptocurrency. It is more of a solitary effort, as there is not really an advantage to teaming; they don’t encourage or forbid it, but it is just that more human resources don’t always equate to better profit on a trading competition platform like Numerai.
Ryan Chesler
For me, error analysis is one of the most illuminating processes; understanding where the model is failing and trying to find some way to improve the model or input data representation to address the weakness.
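Ryan's error-analysis habit can be sketched in a few lines. The data and the `segment` feature below are invented for illustration; the point is just to compare error rates across slices to find where the model fails.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 2000
x1 = rng.normal(size=n)
segment = (rng.random(n) < 0.2).astype(int)    # minority data source
# The label rule flips in segment 1, so a model trained on the pooled
# data will systematically fail there.
y = ((x1 > 0) ^ (segment == 1)).astype(int)
X = np.column_stack([x1, rng.normal(size=n)])  # `segment` is hidden from the model

model = LogisticRegression().fit(X, y)
errors = model.predict(X) != y
rates = {s: float(errors[segment == s].mean()) for s in (0, 1)}
print(rates)  # segment 1's error rate is far higher
```

An aggregate accuracy number hides this entirely; slicing errors by a metadata column is often what points to the fix (here, adding `segment` or an interaction term to the input representation).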
I started from very little knowledge and tried out a Kaggle competition without much success at first. I went to a local meetup and found people to team up with and learn from. At the time, I got to work with people of a much higher skill level than me and we did really well in a competition, 3rd/4500+ teams. After this, the group stopped being as consistent and I wanted to keep the community going, so I made my own group and started organizing my own events. I’ve been doing that for almost 4 years and I get to be on the opposite side of the table teaching people and helping them get started.
Bojan Tunguz
For a while I was really into the NLP competitions, but those have always been rare on Kaggle. One constant over the years, though, has been my interest in tabular data problems. Those used to be the quintessential Kaggle competition problems but have unfortunately become extinct. I am still very interested in that area of ML and have moved into doing some basic research in this domain. Compared to the other areas of ML/DL, there has been very little progress on improving ML for tabular data, and I believe there is a lot of opportunity here.
Some of Kaggle techniques are also applicable to my day-to-day modeling, but there one important aspect is missing – and that’s the support and feedback from the community and the leaderboard. When you are working on your own or with a small team, you never know if what you are building is the best that can be done, or if a better solution is possible.
So I think the ML model lifecycle (or even the business problem's lifecycle) should have one more stage: putting the problem up on Kaggle or a similar platform, with good prizes, to learn the limits of what's possible.
The single biggest impact on your model’s performance will come from very good features. Unfortunately, feature engineering is more of an art than a science and is usually very model- and dataset-dependent. Most of the more interesting feature engineering tricks and practices are rarely, if ever, taught in standard ML courses or resources. Many of them cannot be taught and are dependent on some special problem-specific insights. But the mindset of looking into feature engineering as default is something that can be cultivated. It will usually take many years of practice to get good at it.
#books #kagglebook #interviews #todo
Jean-François Puget
I like competitions with a scientific background, or a background I can relate to. I dislike anonymous data and synthetic data, unless the data is generated via a very precise physics simulation. More generally, I like Kaggle competitions on domains I don’t know much about, as this is where I will learn the most. It is not the most effective way to get ranking points, but it is the one I entertain most.
What I often do is plot samples using two features or derived features on the x and y axis, and a third feature for color coding samples. One of the three features can be the target. I use lots of visualization, as I believe that human vision is the best data analysis tool there is.
This advice deserves to be taken seriously. I've seen it myself: in one competition, Giba noticed something odd about repeating values in far-off columns of a huge matrix, which led to uncovering a leak. Who at work ever actually looks at the raw data table, let alone a huge one?
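A minimal sketch of the kind of plot Puget describes – two features on the x and y axes, the target as color. The data and feature names below are synthetic, invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
n = 500
feat_a = rng.normal(size=n)
feat_b = 0.5 * feat_a + rng.normal(size=n)
target = (feat_a + feat_b > 0).astype(int)   # stand-in for a real label

fig, ax = plt.subplots(figsize=(6, 5))
points = ax.scatter(feat_a, feat_b, c=target, cmap="coolwarm", s=12, alpha=0.7)
ax.set_xlabel("feat_a")
ax.set_ylabel("feat_b")
ax.set_title("Samples colored by target")
fig.colorbar(points, label="target")
fig.savefig("scatter_by_target.png")
```

Ten minutes of cycling different feature pairs through a plot like this often reveals clusters, duplicated rows, or suspicious stripes that no summary statistic would show.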
Hyperparameter tuning is one of the best ways to overfit, and I fear overfitting a lot.
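That fear is easy to reproduce in a toy sketch (not from the interview): on pure-noise data, the best of a series of tuning trials still "wins" on cross-validation, while the held-out score stays at chance.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))
y = rng.integers(0, 2, 300)          # pure-noise target: nothing to learn
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

best_cv, best_model = -1.0, None
for depth in range(1, 15):           # 14 "tuning trials"
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=0)
    score = cross_val_score(candidate, X_tr, y_tr, cv=5).mean()
    if score > best_cv:
        best_cv, best_model = score, candidate

best_model.fit(X_tr, y_tr)
test_acc = best_model.score(X_te, y_te)
print(f"best CV score during tuning: {best_cv:.2f}")  # optimistic
print(f"held-out test accuracy:      {test_acc:.2f}") # near chance
```

Selecting the maximum over many trials biases the validation score upward even when there is no signal at all, which is why the best CV number found during tuning should never be reported as the model's expected performance.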
Kazuki Onodera
-In your experience, what do inexperienced Kagglers often overlook?
-Target analysis.
-What mistakes have you made in competitions in the past?
-Target analysis. Top teams always analyze the target better than others.
Xavier Conort
-Tell us about a particularly challenging competition you entered, and what insights you used to tackle the task.
-My favorite competition is GE Flight Quest, a competition organised by GE where competitors had to predict arrival time of domestic flights in the US.
I was very careful to exclude the name of the airport from my primary feature lists. Indeed, some airports hadn’t experienced bad weather conditions during the few months of history. So, I was very concerned that my favorite ML algorithm, GBM, would use the name of the airport as a proxy for good weather and then fail to predict well for those airports in the private leaderboard. To capture the fact that some airports are better managed than others and improve my leaderboard score slightly, I eventually did use the name of the airport, but as a residual effect only. It was a feature of my second layer of models that used as an offset the predictions of my first layer of models. This approach can be considered a two-step boosting, where you censor some information during the first step. I learnt it from actuaries applying this approach in insurance to capture geospatial residual effects.
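A hedged sketch of that two-step idea on synthetic data (the feature names and effect sizes are invented, and plain `GradientBoostingRegressor` stands in for his GBM): a first model that never sees the sensitive categorical feature, then a second model fit on its residuals with that feature included.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
n = 2000
weather = rng.normal(size=n)              # generalizable signal
airport = rng.integers(0, 10, n)          # categorical, partly spurious
airport_effect = rng.normal(0, 0.3, 10)   # small true residual effect
y = 2.0 * weather + airport_effect[airport] + rng.normal(0, 0.5, n)

# Step 1: a model that never sees the airport feature.
X1 = weather.reshape(-1, 1)
step1 = GradientBoostingRegressor(random_state=0).fit(X1, y)
residual = y - step1.predict(X1)

# Step 2: model the residuals using the airport feature alone, so its
# contribution is censored to whatever step 1 could not explain.
X2 = airport.reshape(-1, 1)
step2 = GradientBoostingRegressor(max_depth=2, random_state=0).fit(X2, residual)

final_pred = step1.predict(X1) + step2.predict(X2)
rmse_step1 = float(np.sqrt(np.mean((y - step1.predict(X1)) ** 2)))
rmse_final = float(np.sqrt(np.mean((y - final_pred) ** 2)))
print(f"RMSE, step 1 only: {rmse_step1:.3f}")
print(f"RMSE, two-step:    {rmse_final:.3f}")
```

The censoring in step 1 is the point: the airport feature can only pick up the residual effect, not act as a proxy for weather it happened to miss in the training months.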
I would advise inexperienced Kagglers not to look at the solutions posted during the competition but to try to find good solutions on their own. I am happy that competitors didn’t share code during the early days of Kaggle. It forced me to learn the hard way.
-What mistakes have you made in competitions in the past?
-One mistake is to keep on competing in competitions that are badly designed with leaks. It is just a waste of time. You don’t learn much from those competitions.
-What’s the most important thing someone should keep in mind or do when they’re entering a competition?
-Compete to learn. Compete to connect with other passionate data scientists. Don’t compete only to win.
Chris Deotte
I enjoy competitions with fascinating data and competitions that require building creative novel models.
My specialty is analyzing trained models to determine their strengths and weaknesses. Feature engineering and quick experimentation are important when optimizing tabular data models. In order to accelerate the cycle of experimentation and validation, using NVIDIA RAPIDS cuDF and cuML on GPU is essential.
Laura Fink
My favorite competitions are those that want to yield something good to humanity. I especially like all healthcare-related challenges.
#books #kagglebook #interviews
Shotaro Ishihara
Kaggle assumes the use of advanced machine learning, but this is not the case in business. In practice, I try to find ways to avoid using machine learning. Even when I do use it, I prefer working with classical methods such as TF-IDF and linear regression rather than advanced methods such as BERT.
Gilberto Titericz
I always end a competition with at least one Gradient Boosted Tree model and one deep learning-based approach. A blend of such diverse approaches is very important to increase diversity in the predictions and boost the competition metric.
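A minimal sketch of such a blend (`MLPRegressor` stands in for a deep model here, and the 50/50 weights are arbitrary – in practice they'd be tuned on a validation fold):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Two structurally different model families.
gbt = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
mlp = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000,
                                 random_state=0)).fit(X_tr, y_tr)

pred_gbt = gbt.predict(X_te)
pred_mlp = mlp.predict(X_te)
pred_blend = 0.5 * pred_gbt + 0.5 * pred_mlp   # equal weights, untuned

mse_gbt = mean_squared_error(y_te, pred_gbt)
mse_mlp = mean_squared_error(y_te, pred_mlp)
mse_blend = mean_squared_error(y_te, pred_blend)
print(f"GBT {mse_gbt:.0f} | MLP {mse_mlp:.0f} | blend {mse_blend:.0f}")
```

By convexity of squared error, the blend's MSE is never worse than the average of the two models' MSEs; the practical gain comes from the two families making different kinds of mistakes.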
Headhunters all around the world look at Kaggle to find good matches for their positions and the knowledge and experience gained from competitions can boost any career.
The number of medals I have in Kaggle competitions is my portfolio; up to now (11/2021) it’s 58 gold and 47 silver, which summarizes well the ML experience I got from Kaggle. Taking into account that each competition runs for at least 1 month, this is more than 105 consecutive months of experience doing competitive ML.
The data scientist must take into account how the model is going to be used in the future, and make the validation as close as possible to that.
Keep in mind that what makes a competition winner is not just replicating what everyone else is doing, but thinking out of the box and coming up with novel ideas, strategies, architectures, and approaches.
Gabriel Preda
There are also potential employers that make very clear that they do not consider Kaggle relevant. I disagree with this view; personally, before interviewing candidates, I normally check their GitHub and Kaggle profiles. I find them extremely relevant.
A good Kaggle profile will demonstrate not only technical skills and experience with certain languages, tools, techniques, or problem-solving skills, but also how well someone is able to communicate through discussions and Notebooks. This is a very important quality for a data scientist.
Jeong-Yoon Lee
In 2012, I used Factorization Machine, which was introduced by Steffen Rendle at KDD Cup 2012, and improved on prediction performance by 30% over an existing SVM model in a month after I joined a new company. At a start-up I co-founded, our main pitch was the ensemble algorithm to beat the market-standard linear regression. At Uber, I introduced adversarial validation to address covariate shifts in features in the machine learning pipelines.
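Adversarial validation itself is simple to sketch (the data and the 1.5σ shift in one feature below are synthetic): train a classifier to distinguish train rows from test rows. AUC near 0.5 means the two distributions match; AUC well above 0.5 flags covariate shift.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
train_rows = rng.normal(size=(n, 5))
test_rows = rng.normal(size=(n, 5))
test_rows[:, 0] += 1.5              # feature 0 drifts between train and test

# Label each row by its origin and ask a classifier to tell them apart.
X = np.vstack([train_rows, test_rows])
is_test = np.concatenate([np.zeros(n), np.ones(n)])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(clf, X, is_test, cv=5, scoring="roc_auc").mean()
print(f"adversarial AUC: {auc:.2f}")  # well above 0.5, so shift is detected
```

In a pipeline, the feature importances of `clf` then point at which features drifted, so they can be dropped, rescaled, or monitored.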
The more ideas you try, the better chance you have to do well in a competition. The principle applies to my day-to-day work as well.