Data Science by ODS.ai 🦜
First Telegram Data Science channel. Covering all technical and popular stuff about anything related to Data Science: AI, Big Data, Machine Learning, Statistics, general Math and the applications of the former. To reach the editors, contact: @haarrp
Lectures on computer architecture

Videos and slides about computer architecture by Professor Onur Mutlu

Channel: https://www.youtube.com/channel/UCIwQ8uOeRFgOEvBLYc3kc3g/featured
Professor: https://people.inf.ethz.ch/omutlu/

#hardware #lectures
ODS breakfast in Paris! See you this Saturday (9th of November) at 10:30 at Malongo Café, 50 Rue Saint-André des Arts.
Generalization through Memorization: Nearest Neighbor Language Models

The authors introduce kNN-LMs, which extend LMs with nearest neighbor search in embedding space, achieving a new SOTA perplexity on Wikitext-103 without additional training!
They also show that kNN-LM can efficiently scale LMs to larger training sets and enables effective domain adaptation by simply varying the nearest neighbor datastore, without further training. It seems especially helpful for predicting long-tail patterns, such as factual knowledge!
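
To make the interpolation concrete, here is a toy numpy sketch of the kNN-LM idea: the LM distribution is mixed with a distribution built from the nearest stored (context embedding, next token) pairs. The datastore, the L2 distance, k and lam below are illustrative choices, not the paper's FAISS-based setup.

```python
# Toy numpy sketch of kNN-LM interpolation; datastore and hyperparameters
# are illustrative, not the paper's FAISS-based setup.
import numpy as np

def knn_lm_probs(p_lm, query, keys, values, vocab_size, k=8, lam=0.25):
    """Interpolate the LM distribution with a kNN distribution.

    p_lm:   (vocab_size,) next-token distribution from the base LM
    query:  (dim,) embedding of the current context
    keys:   (n, dim) stored context embeddings (the datastore)
    values: (n,) int array of the next token observed after each stored context
    """
    d = np.linalg.norm(keys - query, axis=1)      # distance to every datastore entry
    nn = np.argsort(d)[:k]                        # k nearest neighbors
    w = np.exp(-d[nn]); w /= w.sum()              # softmax over negative distances
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nn], w)               # aggregate neighbor weight per token
    return lam * p_knn + (1 - lam) * p_lm         # final interpolated distribution
```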

code available soon
Paper: https://arxiv.org/abs/1911.00172

#nlp #generalization #kNN
Data Science Munich dinner on Friday, Nov 8 at 20:00. Table booked under the name Eugen for 12 people.
Wirtshaus Valley´s
Aberlestraße 52, 81371 München
089 76775151
https://maps.app.goo.gl/XyrWcx15LBmMzGZV9
Separate voice from music

Spleeter is Deezer's source separation library with pretrained models, written in Python on top of TensorFlow. It makes it easy to train a source separation model (assuming you have a dataset of isolated sources) and provides already-trained state-of-the-art models for several flavors of separation:
* vocals (singing voice) / accompaniment separation (2 stems)
* vocals / drums / bass / other separation (4 stems)
* vocals / drums / bass / piano / other separation (5 stems)

Spleeter is also very fast: it can separate audio files into 4 stems 100x faster than real time when run on a GPU.
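
A quick usage sketch based on the Spleeter README (assumes `pip install spleeter`; the input and output paths are placeholders):

```python
# Usage sketch based on the Spleeter README; paths are placeholders.
from spleeter.separator import Separator

separator = Separator('spleeter:2stems')                   # vocals / accompaniment model
separator.separate_to_file('audio_example.mp3', 'output/')
# -> output/audio_example/vocals.wav and output/audio_example/accompaniment.wav
```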

blog: https://deezer.io/releasing-spleeter-deezer-r-d-source-separation-engine-2b88985e797e
paper: http://archives.ismir.net/ismir2019/latebreaking/000036.pdf
github: https://github.com/deezer/spleeter

#voice #music #tf
Revealing the Dark Secrets of BERT

This work offers an interpretation of BERT's self-attention.
Using a subset of GLUE tasks and a set of handcrafted features of interest, the authors propose a methodology and carry out a qualitative and quantitative analysis of the information encoded by individual BERT heads.
The findings suggest that a limited set of attention patterns is repeated across different heads, indicating that the model is overparametrized.
They also show that manually disabling attention in certain heads leads to a performance improvement over the regular fine-tuned BERT models.
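
The paper ablates heads in its own setup; as a rough illustration of the idea, the huggingface transformers API lets you zero out chosen heads at inference time via the head_mask argument. The base checkpoint and which heads are disabled below are arbitrary choices for the sketch, not the paper's configuration.

```python
# Rough illustration: zero out chosen attention heads via head_mask.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
head_mask = torch.ones(model.config.num_hidden_layers,
                       model.config.num_attention_heads)
head_mask[11, :4] = 0.0        # disable the first 4 heads of the last layer

outputs = model(**inputs, head_mask=head_mask)   # representations with heads ablated
```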

paper: https://arxiv.org/abs/1908.08593

#nlp #bert
Unsupervised Cross-lingual Representation Learning at Scale

They release XLM-R, a Transformer MLM trained on 100 languages and 2.5 TB of text data, which obtains state-of-the-art performance on cross-lingual classification, sequence labeling and question answering.
They also present a comprehensive analysis of the capacity and limits of unsupervised multilingual masked language modeling at scale.
XLM-R especially outperforms mBERT and XLM-100 on low-resource languages, for which CommonCrawl data enables representation learning: +13.7% and +9.3% accuracy for Urdu, +21.6% and +13.8% accuracy for Swahili on XNLI.
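
A minimal loading sketch following the fairseq examples/xlmr README; the torch.hub entry name and methods are taken from that README and may differ in newer versions, so treat them as assumptions.

```python
# Loading sketch per the fairseq examples/xlmr README (entry name assumed).
import torch

xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large')
xlmr.eval()

tokens = xlmr.encode('Bonjour le monde !')   # sentencepiece-encodes any of the 100 languages
features = xlmr.extract_features(tokens)     # (1, seq_len, hidden) contextual representations
print(features.shape)
```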

Coming soon to the huggingface transformers repo & tf.hub

paper: https://arxiv.org/abs/1911.02116
code: https://github.com/pytorch/fairseq/tree/master/examples/xlmr

#nlp #bert #xlu #transformer
DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation

tl;dr: GPT2 + Dialogue data = DialoGPT
Trained on Reddit comments from 2005 through 2017 (not a very big dataset, about 2 GB).
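
As a quick illustration, a generation sketch with the huggingface transformers API, assuming the published microsoft/DialoGPT-medium checkpoint; dialogue turns are simply concatenated with the EOS token, GPT-2 style.

```python
# Generation sketch assuming the microsoft/DialoGPT-medium checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

# Dialogue turns are concatenated with the EOS token, GPT-2 style.
input_ids = tokenizer.encode("Does money buy happiness?" + tokenizer.eos_token,
                             return_tensors="pt")
reply_ids = model.generate(input_ids, max_length=100,
                           pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(reply_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))
```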


Paper: https://arxiv.org/abs/1911.00536
Code: https://github.com/microsoft/DialoGPT
Blog: https://www.microsoft.com/en-us/research/project/large-scale-pretraining-for-response-generation/

#nlp #gpt2 #dialog
GPU cooling tool

This script lets you set a custom GPU fan curve on a headless Linux server.

If you want to install multiple GPUs in a single machine, you have to use blower-style GPUs, or else the hot exhaust builds up in your case. Blower-style GPUs can get very loud, so to avoid annoying customers NVIDIA artificially limits their fans to ~50% duty. At 50% duty and a heavy workload, blower-style GPUs heat up to 85°C or so and throttle themselves.

On Windows, NVIDIA happily lets you override that limit by setting a custom fan curve. On Linux, though, you need to use nvidia-settings, which - as of Sept 2019 - requires a display attached to each GPU you want to set the fan for. This is a pain to set up, as is checking the GPU temp every few seconds and adjusting the fan speed.

This script does all that for you.
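
For a sense of what such a fan-curve daemon does, here is a minimal sketch of the idea (not the coolgpus code itself): poll nvidia-smi for the temperature and push a target speed through nvidia-settings. It assumes a display is reachable; the real script's extra trick is creating temporary X servers so the same thing works on a headless box.

```python
# Minimal fan-curve loop (illustrative sketch, not the coolgpus implementation).
import subprocess, time

FAN_CURVE = [(50, 30), (65, 50), (75, 80), (85, 99)]   # (temp °C, fan %) breakpoints

def fan_for(temp):
    # First breakpoint whose temperature limit covers the reading, else max speed.
    return next((speed for limit, speed in FAN_CURVE if temp <= limit), 99)

while True:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader"])
    temp = int(out.decode().splitlines()[0])
    subprocess.run(["nvidia-settings",
                    "-a", "[gpu:0]/GPUFanControlState=1",
                    "-a", f"[fan:0]/GPUTargetFanSpeed={fan_for(temp)}"])
    time.sleep(5)
```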


Code: https://github.com/andyljones/coolgpus

#hardware #gpu
BPE-Dropout: Simple and Effective Subword Regularization

The dominant approach to subword segmentation is Byte Pair Encoding (BPE), which keeps the most frequent words intact while splitting the rare ones into multiple tokens.
And while multiple segmentations are possible even with the same vocabulary, BPE splits each word into a single, deterministic sequence of tokens; this may prevent a model from learning the compositionality of words and from being robust to segmentation errors.

This paper introduces BPE-dropout, a simple and effective subword regularization method based on and compatible with conventional BPE.
It stochastically corrupts the segmentation procedure of BPE, producing multiple segmentations within the same fixed BPE framework.
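
A toy sketch of the dropout idea (not the subword-nmt implementation): at every step, each applicable merge is skipped with probability p, so the same word can segment differently across training epochs; with p=0 this reduces to ordinary greedy BPE.

```python
# Toy BPE-dropout segmentation (illustrative, not the subword-nmt code).
import random

def bpe_dropout_segment(word, merge_ranks, p=0.1):
    """merge_ranks: dict mapping a symbol pair (a, b) -> merge priority (lower = earlier)."""
    symbols = list(word) + ["</w>"]
    while True:
        candidates = [
            (merge_ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in merge_ranks and random.random() >= p   # drop each merge with prob p
        ]
        if not candidates:
            break
        _, i = min(candidates)                           # apply the surviving highest-priority merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols
```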

Using BPE-dropout during training and the standard BPE during inference improves translation quality by up to 3 BLEU compared to BPE and by up to 0.9 BLEU compared to the previous subword regularization.

Paper: https://arxiv.org/abs/1910.13267
Code: https://github.com/rsennrich/subword-nmt

#nlp #bpe
How to remember the difference between Type 1 and Type 2 errors.
🏆 Moscow ML Trainings meetup on the 16th of November

ML Trainings are based on Kaggle and other platform competitions and are held regularly with free attendance. Winners and top-performing participants discuss competition tasks and share their solutions and results.

You may find the program and the registration link here - @mltrainings
* Note: this time the first talk will be in English and the rest will be in Russian.
ODS breakfast in Paris! See you this Saturday at 10:30 at Malongo Café, 50 Rue Saint-André des Arts. We are expecting 5 to 10 people.
Self-training with Noisy Student improves ImageNet classification

Using unlabeled data with pseudo-labeling improves accuracy on ImageNet.

The work uses self-training on unlabeled data to achieve 87.4% top-1 accuracy on ImageNet, 1% better than the previous SOTA. Huge gains are seen on harder benchmarks (ImageNet-A, C and P).

The method is super simple (a toy sketch follows the list):

1) Train a classifier on ImageNet
2) Infer labels on a much larger unlabeled dataset
3) Train a larger classifier on the combined set
4) Iterate the process, adding noise
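
The toy sketch mentioned above, using scikit-learn in place of EfficientNets and JFT; the data, classifier and 0.8 confidence threshold are placeholders, and the paper's noise (RandAugment, dropout, stochastic depth) and growing student are only indicated in comments.

```python
# Toy pseudo-labeling loop (illustrative stand-in for the paper's setup).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_lab, y_lab, X_unlab = X[:200], y[:200], X[200:]      # small labeled set, large unlabeled set

teacher = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)    # 1) train teacher on labeled data
for _ in range(3):                                               # 4) iterate
    probs = teacher.predict_proba(X_unlab)                       # 2) pseudo-label unlabeled data
    conf, pseudo = probs.max(axis=1), probs.argmax(axis=1)
    keep = conf > 0.8                                            # keep confident pseudo-labels
    X_comb = np.vstack([X_lab, X_unlab[keep]])
    y_comb = np.concatenate([y_lab, pseudo[keep]])
    # 3) train the student on labeled + pseudo-labeled data (the paper also
    # grows the student and adds noise, which a linear model cannot reproduce)
    teacher = LogisticRegression(max_iter=1000).fit(X_comb, y_comb)
```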

To build the unlabeled set, they take EfficientNet-B0 pretrained on ImageNet and predict labels on the JFT dataset. They keep predictions with confidence > 0.3 and up to 130k images per class, giving 130M images; after removing duplicates, 81M remain.

Architecture:

EfficientNet; the student model is much bigger than the teacher.

Learning process:

Batch size of 2048.
The SOTA model, EfficientNet-L2, is trained for 350 epochs.
The learning rate starts at 0.128 for a labeled batch size of 2048 and decays by 0.97 every 2.4 epochs if trained for 350 epochs, or every 4.8 epochs if trained for 700 epochs :alchemy:

Training the biggest model, L2, took 3.5 days on a Cloud TPU v3 Pod with 2048 cores.

To start, they train B7 as both student and teacher. Then, using B7 as the teacher, they train an L0 student; then L1, and so on up to L2; finally, with L2 as the teacher, they train a new L2 student.

Result:
SOTA with 2x fewer parameters than the previous SOTA (FixRes ResNeXt-101 WSL, 829M parameters)


paper: https://arxiv.org/abs/1911.04252
tweet: https://twitter.com/quocleix/status/1194334947156193280?s=20

#cv #selfTraining
Updating Pre-trained Word Vectors and Text Classifiers using Monolingual Alignment

The authors drew inspiration from the way #multilingual word vectors are learned. They treated general-purpose and domain-specific corpora as separate languages and used a word-embedding model to learn independent vectors from each. Then they aligned the vectors from one corpus with those from another.

To align the word vectors from two corpora, the words common to both are used to find a consistent way to represent all words. For example, if one corpus contains [human, cat] and the other [cat, dog], the model applies a transformation that brings the two 'cat' vectors together while retaining the relative positions of the vectors for cat, dog, and human.
A word-embedding model learns independent word vectors from both corpora.

The authors use a loss function called #RCSLS for training. RCSLS balances two objectives: general-purpose vectors that are close together remain close together, while general-purpose vectors that are far apart remain far apart. Common words in the two corpora now have duplicate vectors; averaging them produces a single vector representation.
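
The alignment step can be illustrated with a plain orthogonal Procrustes mapping over the shared vocabulary (a sketch of the alignment idea only; the paper trains with the RCSLS criterion).

```python
# Orthogonal Procrustes alignment of two embedding spaces (illustrative sketch).
import numpy as np

def orthogonal_align(X_general, X_domain):
    """Return orthogonal W minimizing ||X_general @ W - X_domain||_F.

    Both inputs are (n_common_words, dim) and row-aligned, i.e. row i is the
    same shared word in the general-purpose and domain-specific spaces.
    """
    U, _, Vt = np.linalg.svd(X_general.T @ X_domain)
    return U @ Vt

# Usage (indices are placeholders): map every general-purpose vector into the
# domain-specific space, then average the duplicate vectors of shared words.
# W = orthogonal_align(general_vecs[common_idx], domain_vecs[common_idx])
# aligned_general = general_vecs @ W
```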

They consider applications to word embedding and text classification models, and show that the proposed approach yields good performance in all setups and outperforms a baseline of fine-tuning the model on new data.

paper: https://arxiv.org/abs/1910.06241

#nlp
Emerging Cross-lingual Structure in Pretrained Language Models

tl;dr – dissect mBERT & XLM and show monolingual BERTs are similar

They offer an ablation study on bilingual #MLM considering all relevant factors. Sharing only the top 2 layers of the #transformer finally breaks cross-lingual transfer.
Factor importance: parameter sharing >> domain similarity, anchor points, language-universal softmax, joint BPE

Monolingual BERT representations can be aligned at the word and sentence level with an orthogonal mapping. CKA visualizes the similarity of monolingual & bilingual BERT.
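
For reference, a minimal numpy sketch of linear CKA, the similarity index used for such comparisons; X and Y stand in for (n_samples, dim) representation matrices extracted from two models on the same inputs.

```python
# Minimal linear CKA (centered kernel alignment) between two representations.
import numpy as np

def linear_cka(X, Y):
    X = X - X.mean(axis=0)                       # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2   # cross-covariance term
    return hsic / (np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro'))
```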

Paper: https://arxiv.org/abs/1911.01464

#nlp #multilingual