🔥OpenAI released the 1.5-billion-parameter GPT-2 model
Post: https://openai.com/blog/gpt-2-1-5b-release/
GPT-2 output detection model: https://github.com/openai/gpt-2-output-dataset/tree/master/detector
Research from partners on potential malicious uses: https://d4mucfpksywv.cloudfront.net/papers/GPT_2_Report.pdf
#NLU #GPT2 #OpenAI #NLP
Lectures on computer architecture
Videos and slides about computer architecture by Professor Onur Mutlu
Channel: https://www.youtube.com/channel/UCIwQ8uOeRFgOEvBLYc3kc3g/featured
Professor: https://people.inf.ethz.ch/omutlu/
#hardware #lectures
ODS breakfast in Paris! See you this Saturday (9th of November) at 10:30 at Malongo Café, 50 Rue Saint-André des Arts.
Generalization through Memorization: Nearest Neighbor Language Models
The paper introduces kNN-LMs, which extend LMs with nearest-neighbor search in embedding space, achieving a new SOTA perplexity on Wikitext-103 without additional training!
The authors also show that kNN-LM can efficiently scale LMs to larger training sets and enables effective domain adaptation by simply swapping the nearest-neighbor datastore, without further training. It seems especially helpful for predicting long-tail patterns, such as factual knowledge!
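For intuition, here is a minimal NumPy sketch of the kNN-LM interpolation (not the authors' code; the datastore layout, k, and the λ value are illustrative assumptions):
```python
# Minimal sketch of the kNN-LM interpolation, not the authors' implementation.
# Assumes a datastore of (context embedding, next-token id) pairs built by
# running the base LM over the training set and storing its hidden states.
import numpy as np

def knn_lm_probs(query_emb, p_lm, datastore_keys, datastore_values,
                 vocab_size, k=1024, lam=0.25):
    """Interpolate the LM distribution with a kNN distribution over the datastore."""
    # Squared L2 distance from the query context embedding to every stored key.
    dists = np.sum((datastore_keys - query_emb) ** 2, axis=1)
    nn_idx = np.argsort(dists)[:k]          # indices of the k nearest neighbors

    # Softmax over negative distances -> one weight per retrieved neighbor.
    neg = -dists[nn_idx]
    weights = np.exp(neg - neg.max())
    weights /= weights.sum()

    # Scatter neighbor weights onto the tokens they stored.
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, datastore_values[nn_idx], weights)

    # lambda * kNN distribution + (1 - lambda) * LM distribution.
    return lam * p_knn + (1.0 - lam) * p_lm
```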
code available soon
Paper: https://arxiv.org/abs/1911.00172
#nlp #generalization #kNN
Data Science Munich dinner on Friday, Nov 8 at 20:00; the table is booked under the name Eugen for 12 people.
Wirtshaus Valley´s
Aberlestraße 52, 81371 München
089 76775151
https://maps.app.goo.gl/XyrWcx15LBmMzGZV9
Separate voice from music
Spleeter is Deezer's source separation library with pretrained models, written in Python and built on TensorFlow. It makes it easy to train source separation models (assuming you have a dataset of isolated sources) and provides already-trained state-of-the-art models for several flavors of separation:
* vocals (singing voice) / accompaniment separation (2 stems)
* vocals / drums / bass / other separation (4 stems)
* vocals / drums / bass / piano / other separation (5 stems)
Spleeter is also very fast: it can separate audio files into 4 stems more than 100x faster than real time when run on a GPU.
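A quick usage sketch via the Python API (a hedged example based on the README; the model string and output layout are worth double-checking against the repo):
```python
# Hedged quick-start sketch for Spleeter; check the repo for the exact current API.
from spleeter.separator import Separator

# 'spleeter:2stems' = vocals + accompaniment; '4stems' and '5stems' configs also exist.
separator = Separator('spleeter:2stems')

# Writes vocals.wav and accompaniment.wav under output/song/ (layout may vary).
separator.separate_to_file('song.mp3', 'output/')
```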
blog: https://deezer.io/releasing-spleeter-deezer-r-d-source-separation-engine-2b88985e797e
paper: http://archives.ismir.net/ismir2019/latebreaking/000036.pdf
github: https://github.com/deezer/spleeter
#voice #music #tf
Revealing the Dark Secrets of BERT
This work offers an interpretation of self-attention in BERT.
Using a subset of GLUE tasks and a set of handcrafted features of interest, the authors propose a methodology and carry out a qualitative and quantitative analysis of the information encoded by individual BERT heads.
The findings suggest that there is a limited set of attention patterns repeated across different heads, indicating overall model overparametrization.
They also show that manually disabling attention in certain heads leads to a performance improvement over regular fine-tuned BERT models.
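As a rough illustration of head disabling (not the paper's code), the Hugging Face transformers BERT accepts a head_mask at inference time; the version-dependent details here are assumptions:
```python
# Rough sketch of switching off individual attention heads, not the paper's code.
# Assumes a recent Hugging Face transformers release where the tokenizer is callable
# and BertModel.forward accepts head_mask.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Some attention heads appear to be redundant.", return_tensors="pt")

# head_mask has shape (num_layers, num_heads): 1.0 keeps a head, 0.0 disables it.
head_mask = torch.ones(model.config.num_hidden_layers, model.config.num_attention_heads)
head_mask[0, 3] = 0.0  # e.g. disable head 3 in layer 0

outputs = model(**inputs, head_mask=head_mask)
print(outputs.last_hidden_state.shape)
```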
paper: https://arxiv.org/abs/1908.08593
#nlp #bert
Unsupervised Cross-lingual Representation Learning at Scale
They release XLM-R, a Transformer MLM trained on 100 languages and 2.5 TB of text data, which obtains state-of-the-art performance on cross-lingual classification, sequence labeling, and question answering.
The paper also presents a comprehensive analysis of the capacity and limits of unsupervised multilingual masked language modeling at scale.
XLM-R especially outperforms mBERT and XLM-100 on low-resource languages, for which CommonCrawl data enables representation learning: +13.7% and +9.3% for Urdu, and +21.6% and +13.8% accuracy for Swahili on XNLI.
Coming soon to the Hugging Face transformers repo and TF Hub.
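Until then, the fairseq release can be loaded roughly like this (a hedged sketch following the examples/xlmr README; the hub name and method calls are assumptions to verify there):
```python
# Hedged sketch of loading XLM-R via fairseq's torch.hub entry point; verify the
# exact names against examples/xlmr in the fairseq repo.
import torch

xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large')
xlmr.eval()

tokens = xlmr.encode('Bonjour le monde !')   # SentencePiece ids, shared across 100 languages
features = xlmr.extract_features(tokens)     # shape: (1, seq_len, hidden_dim)
print(features.shape)
```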
paper: https://arxiv.org/abs/1911.02116
code: https://github.com/pytorch/fairseq/tree/master/examples/xlmr
#nlp #bert #xlu #transformer
DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation
tl;dr: GPT2 + Dialogue data = DialoGPT
Trained on Reddit comments from 2005 through 2017 (not a very big dataset, about 2 GB).
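A hedged sketch of trying the model through Hugging Face transformers (the 'microsoft/DialoGPT-medium' checkpoint name and the use of AutoModelForCausalLM are assumptions to verify against the repo and model hub):
```python
# Hedged sketch, not the official demo: DialoGPT via a recent transformers release.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")  # assumed checkpoint name
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

# DialoGPT expects dialogue turns joined by the EOS token.
prompt = "Does money buy happiness?" + tokenizer.eos_token
input_ids = tokenizer.encode(prompt, return_tensors="pt")

reply_ids = model.generate(input_ids, max_length=100, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(reply_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))
```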
Paper: https://arxiv.org/abs/1911.00536
Code: https://github.com/microsoft/DialoGPT
Blog: https://www.microsoft.com/en-us/research/project/large-scale-pretraining-for-response-generation/
#nlp #gpt2 #dialog
GPU cooling tool
This script lets you set a custom GPU fan curve on a headless Linux server.
If you want to install multiple GPUs in a single machine, you have to use blower-style GPUs, or else the hot exhaust builds up in your case. Blower-style GPUs can get very loud, so to avoid annoying customers NVIDIA artificially limits their fans to ~50% duty. At 50% duty and under a heavy workload, blower-style GPUs heat up to 85°C or so and throttle themselves.
Now, if you're on Windows, NVIDIA happily lets you override that limit by setting a custom fan curve. If you're on Linux, though, you need to use nvidia-settings, which, as of September 2019, requires a display attached to each GPU you want to set the fan for. This is a pain to set up, as is checking the GPU temperature every few seconds and adjusting the fan speed.
This script does all that for you.
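The core idea is just a temperature-polling loop with a piecewise-linear fan curve. The sketch below is not the script itself, only an illustration of that loop; the actual fan write goes through nvidia-settings attributes, which the real tool drives via temporary X servers:
```python
# Conceptual sketch of what a fan-curve daemon does (not the actual coolgpus code).
# Reads temperatures with nvidia-smi and interpolates a target fan speed; applying
# the speed is left as a comment because it requires nvidia-settings + an X display.
import subprocess, time

FAN_CURVE = [(30, 15), (60, 40), (75, 70), (85, 99)]   # (temp °C, fan %) - illustrative values

def gpu_temps():
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=temperature.gpu', '--format=csv,noheader'])
    return [int(line) for line in out.decode().split()]

def target_speed(temp):
    # Piecewise-linear interpolation over the fan curve.
    for (t0, s0), (t1, s1) in zip(FAN_CURVE, FAN_CURVE[1:]):
        if temp <= t1:
            return s0 + (s1 - s0) * (max(temp, t0) - t0) / (t1 - t0)
    return FAN_CURVE[-1][1]

while True:
    for i, temp in enumerate(gpu_temps()):
        print(f'GPU {i}: {temp}C -> fan {target_speed(temp):.0f}%')
        # apply via nvidia-settings here (the real tool spins up an X server per GPU)
    time.sleep(5)
```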
Code: https://github.com/andyljones/coolgpus
#hardware #gpu
BPE-Dropout: Simple and Effective Subword Regularization
The dominant approach to subword segmentation is Byte Pair Encoding (BPE), which keeps the most frequent words intact while splitting rare ones into multiple tokens.
While multiple segmentations are possible even with the same vocabulary, BPE splits each word into a unique sequence; this may prevent a model from learning the compositionality of words and from being robust to segmentation errors.
This paper introduces BPE-dropout, a simple and effective subword regularization method based on, and compatible with, conventional BPE.
It stochastically corrupts the segmentation procedure of BPE, producing multiple segmentations within the same fixed BPE framework.
Using BPE-dropout during training and standard BPE during inference improves translation quality by up to 3 BLEU compared to BPE and by up to 0.9 BLEU compared to the previous subword regularization.
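A toy sketch of the idea (not the reference implementation in subword-nmt): apply BPE merges greedily as usual, but randomly skip each candidate merge with probability p during training, so the same word gets segmented differently across epochs:
```python
# Toy sketch of BPE-dropout, not the reference implementation in subword-nmt.
# During training each eligible merge is skipped with probability p, so the same
# word can yield different segmentations; at inference, set p_drop = 0.
import random

def bpe_segment(word, merges, p_drop=0.1):
    """word: string; merges: dict mapping (left, right) symbol pairs to merge priority."""
    symbols = list(word)
    while True:
        # Collect merges that could fire, dropping each candidate with probability p_drop.
        candidates = []
        for i, pair in enumerate(zip(symbols, symbols[1:])):
            if pair in merges and random.random() >= p_drop:
                candidates.append((merges[pair], i))
        if not candidates:
            break
        _, i = min(candidates)                      # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

merges = {('l', 'o'): 0, ('lo', 'w'): 1, ('w', 'e'): 2}
print(bpe_segment('lower', merges, p_drop=0.5))     # segmentation varies run to run
```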
Paper: https://arxiv.org/abs/1910.13267
Code: https://github.com/rsennrich/subword-nmt
#nlp #bpe
Neural network reconstructs human thoughts from brain waves in real time
MIPT (a top Russian university) researchers published results on mind-reading technology.
Link: https://techxplore.com/news/2019-10-neural-network-reconstructs-human-thoughts.html
Video: https://www.youtube.com/watch?v=nf-P3b2AnZw
#Neuroscience #thoughts2pic #BCI #neuralink #MIPT
Paper: https://www.biorxiv.org/content/10.1101/787101v2
Within the NeuroNet NTI "Assistive Neurotechnologies" project, researchers from Neurobotics and MIPT trained neural networks to reconstruct images from the brain's electrical activity.
Using AI to Understand What Causes Diseases
An overview of applying data science in healthcare
Poster: https://info.gnshealthcare.com/hubfs/Publications_2019/ESMO_GI_Final_Poster_Printed_PD_20.pdf
Link: https://hbr.org/2019/11/using-ai-to-understand-what-causes-diseases
#meta #biolearning #dl #medical #healthcare
🏆 Moscow ML Trainings meetup on the 16th of November
ML Trainings are based on Kaggle and other platform competitions and are held regularly with free attendance. Winners and top-performing participants discuss competition tasks and share their solutions and results.
You may find the program and the registration link here - @mltrainings
* Note: this time the first talk will be in English and the rest will be in Russian.
The female problem: how male bias in medical trials ruined women's health
Interesting article on #bias in #medical trials and on why proper #statistics training is still important.
Link: https://www.theguardian.com/lifeandstyle/2019/nov/13/the-female-problem-male-bias-in-medical-trials
ODS breakfast in Paris! See you this Saturday at 10:30 at Malongo Café, 50 Rue Saint-André des Arts. We are expecting at least 5 to 10 people.
Self-training with Noisy Student improves ImageNet classification
Using unlabeled data with pseudo-labeling improves accuracy on ImageNet.
The work uses self-training on unlabeled data to achieve 87.4% top-1 accuracy on ImageNet, 1% better than the previous SOTA. Huge gains are seen on harder benchmarks (ImageNet-A, -C, and -P).
The method is super simple (a toy sketch follows the list):
1) Train a classifier on ImageNet
2) Infer labels on a much larger unlabeled dataset
3) Train a larger classifier on the combined set
4) Iterate the process, adding noise
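Here is the loop in miniature as a runnable toy, with scikit-learn models standing in for the EfficientNets and simple confidence filtering in place of the paper's noising (everything in the snippet is an illustrative assumption, not the paper's setup):
```python
# Toy, runnable illustration of the Noisy Student self-training loop; scikit-learn
# stands in for EfficientNets, and no RandAugment/dropout/stochastic-depth noise is used.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
labeled_X, labeled_y = X[:200], y[:200]   # small labeled set (stands in for ImageNet)
unlabeled_X = X[200:]                     # treat the rest as unlabeled (stands in for JFT)

# Step 1: train the initial teacher on labeled data.
teacher = LogisticRegression(max_iter=1000).fit(labeled_X, labeled_y)

for _ in range(3):                        # steps 2-4, iterated
    # Step 2: pseudo-label the unlabeled set, keeping confident predictions only.
    probs = teacher.predict_proba(unlabeled_X)
    keep = probs.max(axis=1) > 0.3
    pseudo_X, pseudo_y = unlabeled_X[keep], probs.argmax(axis=1)[keep]
    # Step 3: train a student (in the paper, larger and noised) on the combined data.
    student = LogisticRegression(max_iter=1000).fit(
        np.vstack([labeled_X, pseudo_X]),
        np.concatenate([labeled_y, pseudo_y]))
    # Step 4: the student becomes the next teacher.
    teacher = student
```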
To start, they use EfficientNet-B0 pretrained on ImageNet to predict labels for the JFT dataset. They keep predictions with confidence > 0.3 and take up to 130k images per class, which gives 130M images; after removing duplicates, 81M remain.
Architecture:
EfficientNet; the student model is much bigger than the teacher.
Learning process:
Batch size of 2048.
The SOTA model, L2, is trained for 350 epochs.
The learning rate starts at 0.128 for a labeled batch size of 2048 and decays by 0.97 every 2.4 epochs if trained for 350 epochs, or every 4.8 epochs if trained for 700 epochs :alchemy:
The biggest model, L2, trains for 3.5 days on a Cloud TPU v3 Pod with 2048 cores.
They first train B7 as both student and teacher. Then, with B7 as the teacher, they train an L0 student; then L1, and so on up to L2; finally, with L2 as the teacher, they train another L2 student.
Result:
SOTA with 2x fewer parameters than the previous SOTA (FixRes ResNeXt-101 WSL, 829M parameters).
paper: https://arxiv.org/abs/1911.04252
tweet: https://twitter.com/quocleix/status/1194334947156193280?s=20
#cv #selfTraining
Updating Pre-trained Word Vectors and Text Classifiers using Monolingual Alignment
The authors drew inspiration from the way #multilingual word vectors are learned. They treated general-purpose and domain-specific corpora as separate languages and used a word-embedding model to learn independent vectors from each. Then they aligned the vectors from one corpus with those from another.
To align word vectors from two corpora, the common words are used to find a consistent representation for all words. For example, if one corpus is [human, cat] and the other is [cat, dog], the model applies a transformation that unifies the two cat word vectors while retaining the relative positions of the word vectors among humans, cats, and dogs.
A word-embedding model learns independent word vectors from both corpora.
The authors use a loss function called #RCSLS for training. RCSLS balances two objectives: general-purpose vectors that are close together remain close together, while general-purpose vectors that are far apart remain far apart. Common words in the two corpora now have duplicate vectors, and averaging them produces a single vector representation.
They consider applications to word-embedding and text-classification models, and show that the proposed approach yields good performance in all setups and outperforms a baseline that fine-tunes the model on new data.
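As a rough sketch of the geometric idea, here is an alignment over the shared vocabulary using plain orthogonal Procrustes (the paper optimizes the RCSLS criterion instead; all names below are illustrative):
```python
# Rough sketch: align domain-specific vectors to the general-purpose space with
# orthogonal Procrustes on the shared vocabulary. The paper trains with RCSLS
# instead; this only illustrates the anchoring-on-common-words idea.
import numpy as np

def align_to_general(general_vecs, domain_vecs, general_vocab, domain_vocab):
    """general_vecs/domain_vecs: (V, d) arrays; *_vocab: dicts mapping word -> row index."""
    shared = sorted(set(general_vocab) & set(domain_vocab))
    A = np.stack([domain_vecs[domain_vocab[w]] for w in shared])    # domain side
    B = np.stack([general_vecs[general_vocab[w]] for w in shared])  # general side
    # Orthogonal W minimizing ||A @ W - B||_F: SVD of A^T B, then W = U @ Vt.
    U, _, Vt = np.linalg.svd(A.T @ B)
    W = U @ Vt
    return domain_vecs @ W    # all domain vectors mapped into the general space

# Afterwards, duplicate vectors for shared words can simply be averaged.
```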
paper: https://arxiv.org/abs/1910.06241
#nlp
Emerging Cross-lingual Structure in Pretrained Language Models
tl;dr – dissect mBERT & XLM and show monolingual BERTs are similar
They offer an ablation study on bilingual #MLM considering all relevant factors. Sharing only the top 2 layers of the #transformer finally breaks cross-lingual transfer.
Factor importance: parameter sharing >> domain similarity, anchor points, language-universal softmax, joint BPE.
We can align monolingual BERT representations at the word and sentence level with an orthogonal mapping. CKA visualizes the similarity of monolingual & bilingual BERT.
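For reference, linear CKA between two representation matrices (n examples × features) is only a few lines; this is a generic implementation, not the authors' exact evaluation code:
```python
# Generic linear CKA between two sets of representations, e.g. hidden states of two
# monolingual BERTs on the same (translated) sentences; not the authors' code.
import numpy as np

def linear_cka(X, Y):
    """X: (n, d1), Y: (n, d2) — same n examples, possibly different feature sizes."""
    X = X - X.mean(axis=0)          # center each representation
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
    return hsic / (np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro'))
```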
Paper: https://arxiv.org/abs/1911.01464
#nlp #multilingual