Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
fastText trained on a random mix of Russian Wikipedia / Taiga / Common Crawl

On our benchmarks it was marginally better than the fastText model trained on Araneum from RusVectores.

Download link
https://goo.gl/g6HmLU

Params
Standard params: character n-grams of length 3-6, vector dimensionality 300.

Usage:
import fastText as ft  # bindings from the fastText repo; newer releases use `import fasttext`
ft_model_big = ft.load_model('model')
And then just refer to
https://github.com/facebookresearch/fastText/blob/master/python/fastText/FastText.py
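fastText handles out-of-vocabulary words by summing the vectors of a word's character n-grams. A toy sketch of the (3,6) n-gram extraction (with the `<` / `>` boundary markers fastText adds), just to illustrate what the model indexes — not the library's actual implementation:

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams as fastText extracts them, with boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(w) - n + 1)]

# char_ngrams("cat") -> ['<ca', 'cat', 'at>', '<cat', 'cat>', '<cat>']
```

This is why the model can still produce a sensible vector for a rare or misspelled word: it shares n-grams with words seen in training.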

#nlp
Playing with Transformer

TLDR - use only pre-trained Transformers.

On classification tasks it performed the same as classic models.

On seq2seq it is much worse time- and memory-wise when training. Inference is faster, though.

#nlp
Towards Data Science

Our article was accepted to their publication:
- https://towardsdatascience.com/building-client-routing-semantic-search-in-the-wild-14db04687c7e

Also, once you have published there, you can keep publishing your work on TDS on a recurring basis =)

I doubt that this will be properly distributed to all 130k of their subscribers, but nevertheless this is a milestone.

#data_science
An intro to RL

Though published by OpenAI with TF, this is simply amazing:
- https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

#rl
Forwarded from Karim Iskakov - канал (karfly_bot)
"80 years of AI research. Epic battle between connectionist (~neural networks) and symbolic (~rule based) methods. Who will win?"
👤 @OriolVinyalsML (twitter)
📉 @loss_function_porn
Problems with GPUs in DL box

Cryptic messages like:
GPU is lost

Usually this is either:
- a failing PSU;
- bad PCIe contact;
- too much load on the PCIe bus;
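A few first-aid commands that usually narrow it down (the Xid/NVRM grep patterns and the nvidia-smi power fields below are the usual suspects, not a definitive checklist):

```shell
# Grep the kernel log for NVIDIA Xid / NVRM errors, which are often logged
# right before the "GPU is lost" message; dmesg may need root, hence the guard.
(dmesg 2>/dev/null || true) | grep -iE 'xid|nvrm' || echo "no Xid/NVRM lines found"

# If nvidia-smi is available, compare power draw against the limit under load --
# total draw close to the PSU rating points at the power supply.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index,power.draw,power.limit --format=csv
fi
```

Reseating the card and moving it to another PCIe slot is the cheapest test for the contact hypothesis.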

#deep_learning
Our victory in CFT-2018 competition

TLDR
- Multi-task learning + seq2seq models rule;
- The domain seems easy, but it is not;
- You can also build a pipeline based on manual features, but it will not be task-agnostic;
- Loss weighting is crucial for such tasks;
- The Transformer trains ~10x longer;
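A minimal, hypothetical sketch of what loss weighting means in a multi-task setup (the task names, losses, and weights below are illustrative, not the values we actually used):

```python
def combined_loss(task_losses, task_weights):
    """Weighted sum of per-task losses; the weights control which task
    dominates the shared gradient signal during multi-task training."""
    assert set(task_losses) == set(task_weights), "every task needs a weight"
    return sum(task_weights[t] * task_losses[t] for t in task_losses)

# Example: down-weight an auxiliary classification head relative to seq2seq.
loss = combined_loss(
    {"seq2seq": 2.0, "classification": 1.0},
    {"seq2seq": 1.0, "classification": 0.5},
)  # -> 2.5
```

In practice the weights are tuned so that no single head's loss swamps the others.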

https://spark-in.me/post/cft-spelling-2018

#nlp
#deep_learning
#data_science
Jupyter extensions

Looks like they are nearing the end of their support.
Alas.
On a fresh build you will need to pin the notebook version to keep using them:
conda install notebook=5.6

Will need to invest some time into making Jupyter Lab actually usable.

#data_science
Getting your public key from Github ... with wget!

I first saw this when installing Ubuntu 18 from scratch, and it is super awesome!

wget -O - https://github.com/snakers4.keys >> test

Just replace test with your authorized_keys file and profit!

#linux
Creating a new user

With the above hack, user creation can be done as easily as:

USER="YOUR_USER" && \
GROUP="YOUR_GROUP" && \
sudo useradd -m -s /bin/bash "$USER" && \
sudo adduser "$USER" "$GROUP" && \
sudo mkdir -p /home/$USER/.ssh/ && \
sudo touch /home/$USER/.ssh/authorized_keys && \
sudo wget -O - https://github.com/$USER.keys | sudo tee -a /home/$USER/.ssh/authorized_keys && \
sudo chown -R $USER:$USER /home/$USER/.ssh/ && \
sudo chmod 700 /home/$USER/.ssh && \
sudo chmod 600 /home/$USER/.ssh/authorized_keys

Note that this assumes the GitHub handle matches the local user name; the chmod lines matter because sshd refuses group/world-readable authorized_keys.
#linux
Article about the reality of CV in Russia / CIS

(RU)
http://cv-blog.ru/?p=253

Also a bit on how to handle the various types of "customers" who want to commission CV systems from you.
Warning: a lot of harsh reality =)

#deep_learning
A cheeky ML/DS themed sticker pack for our channel

Thanks to @birdborn for his art.

You are welcome to use it:
https://t.me/addstickers/ML_spark_in_me_by_BB

If you would like to contribute / create your own stickers - please ask around in our channel chat.

#data_science