Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
Practical creepiness

Now Google Photos explicitly shows that it knows faces of your family members.

#deep_learning
(My GPUs are ~70C under full load xD)
Environment setup for DS / ML / DL

Some time ago I made a small guide for setting up an environment on a blank Ubuntu machine.

It works both for CV and NLP.

If you like this, please tell me and I will add newer things:
- nvtop;
- CUDA 10 with PyTorch 1.0;
- scripts for managing GPU fan speed;

http://github.com/snakers4/gpu-box-setup/

#deep_learning
#linux
Spark in me 2018 annual retrospective

TLDR:
- My personal progress and some views;
- ML is still amazing, but there are no illusions anymore;
- Telegram is still amazing, but commercialization looms;
- FAIR is an inspiration;
- lmcinnes with UMAP and HDBSCAN as well;

https://spark-in.me/post/2018

PS
I also wrote a bit in Russian, with some local specifics, in case that is more convenient for you

https://tinyletter.com/snakers41/letters/spark-in-me-2018

#data_science
#deep_learning
Happy holidays to everyone)
Linux subsystem in Windows 10

It works and installs in literally two clicks (run one command in PowerShell and then just one-click install your Linux distro of choice from the Windows Store - yes, this is very funny indeed)!

Why would you need this?
To create and back up files with one command, for example =)

Something like this becomes reality on Windows:
cd /mnt/d/ && \
TIME=`date +%b-%d-%y` && \
FILENAME=working_files_tar-$TIME.tar.gz && \
INCREMENTAL_FILE=backup_data.snar && \
FOLDERS=$(<folders_backup.txt) && \
echo 'Using folder list' $FOLDERS && \
tar -cz --listed-incremental=$INCREMENTAL_FILE --verbose -f $FILENAME $FOLDERS

Also, you may add rsync or scp and you are good to go!

Also other potential use cases:

- You are somehow vendor-locked (I depend on proprietary drivers for my Thunderbolt port to attach an external GPU), or you are just used to Windows' windows (or too lazy to install Linux);
- You need one particular Linux program, or you need to quickly test something and do not want to bother replicating your environment under Windows (yes, you can also run Docker, but there will be some learning curve);
- You run all of your programs remotely and use your Windows machine as a thin client, but sometimes you need git / bash / rsync - e.g. to download movies from your personal NAS;

#linux
Using nargs

I wrote about this a year ago.
I forgot about it, but a friend reminded me.
You can pass lists as Python command-line arguments.

parser.add_argument('--classifier_conf', default=[512, 2048, 5005], nargs='+', type=int)

and then just add params to your call as follows
--classifier_conf 512 2048 5005
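
A minimal self-contained sketch of the same idea (the argument name and defaults are taken from the snippet above):

import argparse

parser = argparse.ArgumentParser()
# nargs='+' collects one or more space-separated values into a Python list
parser.add_argument('--classifier_conf', default=[512, 2048, 5005], nargs='+', type=int)

args = parser.parse_args()
print(args.classifier_conf)  # e.g. [512, 2048, 5005]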

#deep_learning
New amazing video by 3B1B

https://youtu.be/jsYwFizhncE
Pre-trained BERT in PyTorch

https://github.com/huggingface/pytorch-pretrained-BERT

(1)
The model code here is just awesome.
The integrated DataParallel / DDP wrappers / FP16 wrappers are also awesome.

FP16 precision training from APEX just works (no idea about convergence yet, though).

(2)
As for model weights - I cannot really tell, there is no dedicated Russian model.
The only problem I am facing now is that with large embedding bags the batch size is literally 1-4, even for the smaller models.

And training models with sentencepiece is kind of feasible for rich languages, but you will always worry about generalization.

(3)
I did not try the generative pre-training (or the sentence-prediction pre-training); I hope that properly initializing the embeddings will also work for a closed domain with a smaller model (they pre-train for 4 days on 4+ TPUs, lol).

(4)
Why even tackle such models?
Chat / dialogue / machine comprehension models are complex / require one-off feature engineering.
Being able to tune something like BERT on publicly available benchmarks and then on your domain can provide a good way to embed complex situations (like questions in dialogues).
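
To make the point about tuning concrete, here is a minimal sketch of loading one of the published pre-trained checkpoints with this library, following its README; the exact multilingual model name is my assumption, not something from the post:

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

# assuming the multilingual cased checkpoint shipped with the library
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained('bert-base-multilingual-cased')
model.eval()

tokens = tokenizer.tokenize('[CLS] Hello world ! [SEP]')
token_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    # encoded_layers is a list of per-layer hidden states, pooled is the [CLS] summary vector
    encoded_layers, pooled = model(token_ids)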

#nlp
#deep_learning
NLP - Highlight of the week - LASER

- Hm, a new sentence embedding tool?
- Plain PyTorch 1.0 / numpy / FAISS based;
- [Release](https://code.fb.com/ai-research/laser-multilingual-sentence-embeddings/), [library](https://github.com/facebookresearch/LASER);
- Looks like an off-shoot of their "unsupervised" NMT project;

LASER's vector representations of sentences are generic with respect to both the input language and the NLP task. The tool maps a sentence in any language to a point in a high-dimensional space with the goal that the same statement in any language will end up in the same neighborhood. This representation could be seen as a universal language in a semantic vector space. We have observed that the distance in that space correlates very well to the semantic closeness of the sentences.
- Alleged pros:
  - It delivers extremely fast performance, processing up to 2,000 sentences per second on GPU;
  - The sentence encoder is implemented in PyTorch with minimal external dependencies;
  - Languages with limited resources can benefit from joint training over many languages;
  - The model supports the use of multiple languages in one sentence;
  - Performance improves as new languages are added, as the system learns to recognize characteristics of language families;
They essentially trained an NMT model with a shared encoder for many languages.

I tried training something similar, but it quickly over-fitted into just memorizing the indexes of words.

#nlp
#deep_learning

Neat PyTorch hack

(1) If possible, implement your complex loss / logic within your model.forward();
(2) Enjoy the multi-GPU / multi-node training wrappers from APEX, PyTorch DataParallel, DistributedDataParallel, etc.

=)
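
A minimal sketch of the trick with hypothetical names (not from the post): with the loss inside forward(), nn.DataParallel scatters the loss computation across replicas, and only a scalar per GPU is gathered back:

import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    # hypothetical wrapper - the backbone and loss are illustrative
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, inputs, targets):
        logits = self.backbone(inputs)
        # the loss is computed inside forward(), i.e. on each GPU replica
        return self.criterion(logits, targets)

model = nn.DataParallel(ModelWithLoss(nn.Linear(128, 10)).cuda())
inputs = torch.randn(32, 128).cuda()
targets = torch.randint(0, 10, (32,)).cuda()
# DataParallel gathers one scalar per replica - average them before backward()
loss = model(inputs, targets).mean()
loss.backward()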

#deep_learning
Downsides of using Common Crawl

I took a look at the Common Crawl data I pre-processed myself last year and could not find abstracts - only sentences.

I also took a look at these archives - http://data.statmt.org/ngrams/deduped/ - again only sentences, though they sometimes seem to be in logical order.

You can use any form of CC - but only to learn word representations. Not sentences.
Sad.

#nlp