DS/ML digest 28
Google open sources pre-trained BERT ... with 102 languages ...
https://spark-in.me/post/2018_ds_ml_digest_28
#digest
#deep_learning
#data_science
Fast-text trained on a random mix of Russian Wikipedia / Taiga / Common Crawl
On our benchmarks, it was marginally better than fast-text trained on Araneum from Rusvectors.
Download link
https://goo.gl/g6HmLU
Params
Standard params: character n-grams of length 3 to 6, vector dimensionality 300.
Usage:
import fastText as ft
And then just refer to
ft_model_big = ft.load_model('model')
https://github.com/facebookresearch/fastText/blob/master/python/fastText/FastText.py
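A minimal usage sketch for querying the loaded model (the model path follows the snippet above; the query word is just an example):
import fastText as ft

ft_model = ft.load_model('model')            # path to the downloaded .bin file
vec = ft_model.get_word_vector('привет')     # 300-dim vector; OOV words are assembled from the (3,6) char n-grams
print(ft_model.get_dimension())              # prints 300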
#nlp
Playing with Transformer
TL;DR - use it only pre-trained.
On classification tasks it performed the same as classic models.
On seq2seq it is much worse time- and memory-wise; inference is faster, though.
#nlp
Towards Data Science
Our article was accepted to their publication:
- https://towardsdatascience.com/building-client-routing-semantic-search-in-the-wild-14db04687c7e
Also, once you have published there, you can keep publishing your work on TDS on a recurring basis =)
I doubt that this will be properly distributed to all 130k of their subscribers, but nevertheless this is a milestone.
#data_science
A small saga about keeping GPUs cool
(1) 1-2 GPUs with blower fans (or turbo fans) in a full tower - idle 40-45C, full load 80-85C
(2) 3-4 GPUs with blower fans (or turbo fans) in a full tower - idle 45-55C, full load 85-95C
Also with 3-4+…
When it is colder, under full load GPUs run at 70C
An intro to RL
Though published by OpenAI with TF, this is simply amazing:
- https://spinningup.openai.com/en/latest/spinningup/rl_intro.html
#rl
Forwarded from Karim Iskakov - канал (karfly_bot)
"80 years of AI research. Epic battle between connectionist (~neural networks) and symbolic (~rule based) methods. Who will win?"
👤 @OriolVinyalsML (twitter)
📉 @loss_function_porn
Problems with GPUs in DL box
Cryptic messages like:
GPU is lost
Usually this is either:
- the PSU;
- bad PCIe contact;
- or too much load on the PCIe bus;
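A quick way to see which case you are dealing with (a sketch; assumes a standard NVIDIA driver install):
nvidia-smi                    # a lost GPU usually shows up as ERR! or disappears from the list
sudo dmesg | grep -i xid      # NVIDIA Xid errors in the kernel log often point at PCIe / power problems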
#deep_learning
Our victory in CFT-2018 competition
TL;DR:
- Multi-task learning + seq2seq models rule;
- The domain seems easy, but it is not;
- You can also build a pipeline based on manual features, but it will not be task-agnostic;
- Loss weighting is crucial for such tasks (see the sketch below);
- The Transformer trains 10x longer;
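A toy sketch of the loss-weighting idea for such a multi-task setup (the losses and weights below are made up purely for illustration, not our actual pipeline):
import torch

params = torch.randn(4, requires_grad=True)       # stand-in for shared model parameters
loss_seq2seq = (params ** 2).mean()                # stand-in for the seq2seq (spelling correction) loss
loss_clf = params.abs().mean()                     # stand-in for an auxiliary classification loss

# an unweighted sum lets one task dominate the gradients, so the weights have to be tuned
w_seq2seq, w_clf = 1.0, 0.5
loss = w_seq2seq * loss_seq2seq + w_clf * loss_clf
loss.backward()                                    # gradients now reflect the chosen task weights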
https://spark-in.me/post/cft-spelling-2018
#nlp
#deep_learning
#data_science
TDS article follow-up
TDS also accepted a reprint of the article
https://towardsdatascience.com/winning-a-cft-2018-spelling-correction-competition-b771d0c1b9f6
#nlp
Jupyter extensions
Looks like they are near the end of their support.
Alas.
On a fresh build you will need this to keep using them:
conda install notebook=5.6
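For reference, a typical sequence to get the extensions themselves installed on top of the pinned notebook (package names assume conda-forge; adjust as needed):
conda install -c conda-forge jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
jupyter nbextension enable toc2/main      # enable a specific extension, e.g. the table of contents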
Will need to invest some time into making Jupyter Lab actually usable.
#data_science
Getting your public key from Github ... with wget!
I kind of saw it when installing Ubuntu 18 from scratch. But it is super awesome!
wget -O - https://github.com/snakers4.keys >> test
Just replace test with your authorized_keys file and profit!
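For example, with your own GitHub login and the standard key location (the path is the usual default, adjust if yours differs):
wget -O - https://github.com/YOUR_GITHUB_USER.keys >> ~/.ssh/authorized_keys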
#linux
Creating a new user
With the above hack, user creation can be done as easy as:
USER="YOUR_USER" && \#linux
GROUP="YOUR_GROUP" && \
sudo useradd $USER && \
sudo adduser $USER $GROUP && \
sudo mkdir -p /home/$USER/.ssh/ && \
sudo touch /home/$USER/.ssh/authorized_keys && \
sudo chown -R $USER:$USER /home/$USER/.ssh/ && \
sudo wget -O - https://github.com/$USER.keys | sudo tee -a /home/$USER/.ssh/authorized_keys
Article about the reality of CV in Russia / CIS
(RU)
http://cv-blog.ru/?p=253
Also a bit on how to handle the various types of "customers" who want to contract CV systems from you.
Warning - too much harsh reality)
#deep_learning
A cheeky ML/DS themed sticker pack for our channel
Thanks to @birdborn for his art.
You are welcome to use it:
https://t.me/addstickers/ML_spark_in_me_by_BB
If you would like to contribute / create your own stickers - please ask around in our channel chat.
#data_science