Data Science by ODS.ai 🦜
First Telegram Data Science channel. Covering all technical and popular stuff about anything related to Data Science: AI, Big Data, Machine Learning, Statistics, general Math and the applications of the former. To reach the editors, contact: @haarrp
#NLP #News (by Sebastian Ruder):
* 2020 NLP wish lists
* #HuggingFace + #fastai
* #NeurIPS 2019
* #GPT2 things
* #ML Interviews

blog post: http://newsletter.ruder.io/archive/211277
ODS breakfast in Paris! ☕️ 🇫🇷 See you this Saturday at 10:30 (many people come around 11:00) at Malongo Café, 50 Rue Saint-André des Arts.
BETO: Spanish BERT

#BETO is a #BERT model trained on a large Spanish corpus (~3B tokens).
It is similar to BERT-Base and was trained with the Whole Word Masking technique.
Weights are available for TensorFlow & PyTorch (cased & uncased versions).

blog post: https://medium.com/dair-ai/beto-spanish-bert-420e4860d2c6
github: https://github.com/dccuchile/beto
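
If the weights are published on the HuggingFace model hub, loading BETO could look like the sketch below (hedged: the dccuchile/bert-base-spanish-wwm-cased model id is an assumption, check the repo's README for the exact weight names):

from transformers import AutoModel, AutoTokenizer

# Assumed hub id; see the BETO repo for the actual cased/uncased names.
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
model = AutoModel.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")

inputs = tokenizer("BETO es un modelo BERT entrenado en español.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768) for a BERT-Base-sized model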
Driverless DeLorean drifting

#Stanford researchers taught an autonomous car's driving AI to drift so it can handle hazardous conditions better.

Link: https://news.stanford.edu/2019/12/20/autonomous-delorean-drives-sideways-move-forward/

#Autonomous #selfdriving #RL #CV #DL #DeLorean
Becoming an Independent Researcher and getting published in ICLR with spotlight

* It is possible to get published as an independent researcher, but it is really HARD.
* Nowadays you need a top-tier publication (an ACL/EMNLP/CVPR/ICCV/ICLR/NeurIPS or ICML paper) to get accepted into a PhD program.
* Mind the ~20% acceptance rate and keep on grinding.

Link: https://medium.com/@andreas_madsen/becoming-an-independent-researcher-and-getting-published-in-iclr-with-spotlight-c93ef0b39b8b

#Academia #PhD #conference #learning
Top podcast episodes from 2k19 by @lexfridman:
⚫️ on nature:
Glenn Villeneuve on @joerogan #1395 | link
⚫️ on perception:
Donald Hoffman on @SamHarrisOrg's Making Sense #178 | link
⚫️ on physics:
Garrett Lisi on @EricRWeinstein's The Portal #15 | link
⚫️ on consciousness:
Philip Goff on @seanmcarroll's Mindscape #71 | link
Happy new year 🎆
Thank you for being awesome 👍
SberQuAD – Russian Reading Comprehension Dataset: Description and Analysis

#SberQuAD – a large-scale analog of #Stanford #SQuAD in the Russian language – is a valuable resource that has not been properly presented to the scientific community.

SberQuAD's creators generally followed the procedure described by the SQuAD authors, which resulted in a similarly high lexical overlap between questions and the sentences containing answers.

paper: https://arxiv.org/abs/1912.09723
link to SDSJ Task B dataset: http://files.deeppavlov.ai/datasets/sber_squad-v1.1.tar.gz
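
Since SberQuAD follows the SQuAD v1.1 JSON format, a first look at the data could be as simple as the sketch below (the extracted filename is an assumption):

import json

# Assumed filename inside the downloaded archive.
with open("train-v1.1.json", encoding="utf-8") as f:
    squad = json.load(f)

# SQuAD v1.1 layout: data -> paragraphs -> qas -> answers.
n_questions = sum(
    len(paragraph["qas"])
    for article in squad["data"]
    for paragraph in article["paragraphs"]
)
print(len(squad["data"]), "articles,", n_questions, "questions")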
ODS breakfast in Paris! ☕️ 🇫🇷 See you this Saturday (tomorrow) at 10:30 (some people come around 11:00) at Malongo Café, 50 Rue Saint-André des Arts. We are expecting from 3 to 7 people.
Uber AI Plug and Play Language Model (PPLM)

PPLM allows a user to flexibly plug one or more simple attribute models representing the desired control objective into a large, unconditional language model (LM). The method has the key property that it uses the LM as is – no training or fine-tuning is required – which enables researchers to leverage best-in-class LMs even if they don't have the extensive hardware required to train them.

PPLM lets users combine small attribute models with an LM to steer its generation. Attribute models can be 100k times smaller than the LM and still be effective in steering it.

The PPLM algorithm entails three simple steps to generate a sample (see the sketch after this list):
* given a partially generated sentence, compute log(p(x)) and log(p(a|x)) and the gradients of each with respect to the hidden representation of the underlying language model. These quantities are both available using an efficient forward and backward pass of both models;
* use the gradients to move the hidden representation of the language model a small step in the direction of increasing log(p(a|x)) and increasing log(p(x));
* sample the next word.
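
A minimal sketch of the update step in PyTorch follows (the toy lm_head / attr_head modules are hypothetical stand-ins for a real LM head and attribute model, not Uber's implementation):

import torch

torch.manual_seed(0)
vocab_size, hidden_dim = 50, 16

# Hypothetical stand-in for the frozen LM head: hidden state -> next-token logits.
lm_head = torch.nn.Linear(hidden_dim, vocab_size)
# Hypothetical stand-in for a tiny attribute model: hidden state -> log p(a|x).
attr_head = torch.nn.Linear(hidden_dim, 1)

def pplm_step(h, step_size=0.02, n_iters=3):
    """Nudge the hidden representation toward higher log p(a|x) + log p(x)."""
    delta = torch.zeros_like(h, requires_grad=True)
    for _ in range(n_iters):
        shifted = h + delta
        log_p_x = torch.log_softmax(lm_head(shifted), dim=-1).max()  # crude proxy for log p(x)
        log_p_a_x = attr_head(shifted).squeeze()                     # proxy for log p(a|x)
        (-(log_p_a_x + log_p_x)).backward()  # gradients w.r.t. delta
        with torch.no_grad():
            delta -= step_size * delta.grad / (delta.grad.norm() + 1e-10)
            delta.grad.zero_()
    return (h + delta).detach()

h = torch.randn(hidden_dim)                # hidden state of the partially generated sentence
probs = torch.softmax(lm_head(pplm_step(h)), dim=-1)
print(torch.multinomial(probs, 1).item())  # sample the next word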

paper: https://arxiv.org/abs/1912.02164

blogpost: https://eng.uber.com/pplm/
code: https://github.com/uber-research/PPLM
online demo: https://transformer.huggingface.co/model/pplm

#nlp #lm #languagemodeling #uber #pplm
10 ML & NLP Research Highlights of 2019
by Sebastian Ruder @ huggingface

The full list of highlights:
* Universal unsupervised pretraining
* Lottery tickets
* The Neural Tangent Kernel
* Unsupervised multilingual learning
* More robust benchmarks
* ML and NLP for science
* Fixing decoding errors in NLG
* Augmenting pretrained models
* Efficient and long-range Transformers
* More reliable analysis methods

blogpost: https://ruder.io/research-highlights-2019/
📚 Guest post on a great example of book-abandonment analysis at GoodReads

An excellent new article from Gwern on analyzing abandoned (hard to finish, hard to read) books on Goodreads. This write-up includes step-by-step instructions with source code, even the way he parsed the data from the website without an API.

It’s a shame that analysis like this does not come from an online book subscription service like Bookmate or MyBook. They have vastly superior datasets and many able data scientists. I am quite sure the Amazon Kindle team prepares internal reports like that for some evil business purposes, but that’s a whole different story.

During my time at the video game database company RAWG.io, we compiled ‘most abandoned’ and ‘most addictive’ reports for video games.

Do you run a popular service with valuable user-behavior data? Fun data-analysis reports are a good way to attract attention to your product. Take a lead from Pornhub; they are great at publicizing their data.

Link: https://www.gwern.net/GoodReads
Pornhub Insights: https://www.pornhub.com/insights/

β€”
This is a guest post by Samat Galimov, who writes about technology, programming and management in Russian on @ctodaily.


#DataAnalysis #GoodReads #statistics #greatstats #talkingnumbers
Robust breast cancer detection in mammography and digital breast tomosynthesis using annotation-efficient deep learning approach

arXiv: https://arxiv.org/abs/1912.11027

#Cancer #BreastCancer #DL #CV #biolearning
ODS breakfast in Paris! ☕️ 🇫🇷 See you this Saturday (tomorrow) at 10:30 (some people come around 11:00) at Malongo Café, 50 Rue Saint-André des Arts. We are expecting from 7 to 11 people.
Cross-Lingual Ability of Multilingual BERT: An Empirical Study, accepted to #ICLR2020

In this work, the authors provide a comprehensive study of the contribution of different components in multilingual #BERT (M-BERT) to its cross-lingual ability.
They study the impact of linguistic properties of the languages, the architecture of the model, and the learning objectives. The experimental study is done in the context of three typologically different languages – #Spanish, #Hindi, & #Russian – & using two conceptually different #NLP tasks, textual entailment & #NER.

Also, they construct a new corpus – Fake-English (#enfake) – by shifting the Unicode code point of each character in English Wikipedia text by a large constant, so that there is strictly no character overlap with any other Wikipedia text.
In this work, they treat Fake-English as a separate language.
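
The construction is simple; a minimal sketch (the shift value below is an assumed placeholder, the paper's exact constant may differ):

SHIFT = 0x10000  # assumed placeholder constant, large enough to leave the Basic Multilingual Plane

def to_fake_english(text: str, shift: int = SHIFT) -> str:
    # Shift every character's code point so the result shares no characters
    # with any natural-language Wikipedia text.
    return "".join(chr(ord(c) + shift) for c in text)

print(to_fake_english("multilingual BERT"))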

Among their key conclusions is the fact that the lexical overlap between languages plays a negligible role in cross-lingual success, while the depth of the network is an integral part of it.

paper: https://arxiv.org/abs/1912.07840
Data Science by ODS.ai 🦜
YouTokenToMe, a new tool for text tokenization from the VK team. Meet a new, enhanced tokenization tool on steroids. Works 7-10 times faster on alphabetic languages and 40 to 50 times faster on logographic languages than alternatives. Under the hood (watch source)…
New Rust tokenization library from #HuggingFace

Tokenization is the process of converting strings into model input tensors. The library provides BPE/Byte-Level-BPE/WordPiece/SentencePiece tokenization and computes an exhaustive set of outputs (offset mappings, attention masks, special token masks).

The library has Python and Node.js bindings.

The quoted post above contains information on another fast #tokenization implementation. Looking forward to a speed comparison.

Install: pip install tokenizers
Github: https://github.com/huggingface/tokenizers/tree/master/tokenizers
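
A hedged usage sketch via the Python bindings (API as of the library's early quicktour, which may have changed since; the local vocab file path is an assumption):

from tokenizers import BertWordPieceTokenizer

# Assumed local vocab file for a BERT WordPiece model.
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

output = tokenizer.encode("Tokenization converts strings into model input tensors.")
print(output.tokens)   # WordPiece tokens
print(output.ids)      # token ids
print(output.offsets)  # character offset mapping for each token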

#NLU #NLP #Transformers #Rust #NotOnlyPython
Online speech recognition with wav2letter@anywhere

Facebook has open-sourced wav2letter@anywhere, an inference framework for online speech recognition that delivers state-of-the-art performance.

Link: https://ai.facebook.com/blog/online-speech-recognition-with-wav2letteranywhere/

#wav2letter #audiolearning #soundlearning #sound #acoustic #audio #facebook
GAN Lab
Understanding Complex Deep Generative Models using Interactive Visual Experimentation

#GAN Lab is a novel interactive visualization tool for anyone to learn & experiment with Generative Adversarial Networks (GANs), a popular class of complex #DL models. With GAN Lab, you can interactively train GAN models on #2D data #distributions and visualize their inner workings, similar to #TensorFlow Playground.

web-page: https://poloclub.github.io/ganlab/
github: https://github.com/poloclub/ganlab
paper: https://minsuk.com/research/papers/kahng-ganlab-vast2018.pdf
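
For reference, what GAN Lab animates is the standard adversarial training loop; a minimal sketch on a toy 2D distribution (PyTorch here, not GAN Lab's own TensorFlow.js code):

import torch
from torch import nn

G = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))  # noise -> 2D point
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))  # 2D point -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def sample_real(n):
    # Toy target distribution: a noisy ring of radius 2.
    angle = torch.rand(n, 1) * 6.2832
    return torch.cat([2 * angle.cos(), 2 * angle.sin()], dim=1) + 0.05 * torch.randn(n, 2)

for step in range(2000):
    real, noise = sample_real(64), torch.randn(64, 2)
    fake = G(noise)
    # Discriminator step: push real -> 1, fake -> 0.
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator step: fool the discriminator.
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()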