Data Science by ODS.ai 🦜
First Telegram Data Science channel. Covering all technical and popular stuff about anything related to Data Science: AI, Big Data, Machine Learning, Statistics, general Math and the applications of the former. To reach the editors, contact: @haarrp
​​Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space

The authors propose the first large-scale language VAE model – Optimus.

This new model uses BERT weights in the encoder and GPT-2 weights in the decoder. Thanks to this, Optimus supports both NLU and text generation tasks. The learned language representation is more universal, which means it is easier to fine-tune the model for a new domain or task. Also, Optimus can control high-level semantics in text generation (tense, topic, sentiment).

This work makes several novel contributions:
– latent vector injection: two schemes are suggested to inject conditioning vectors into GPT-2 without retraining it;
– the idea of combining BERT and GPT-2 could inspire people to integrate existing language models into larger and ever more complex models;
– pre-training on a big corpus is an effective approach to reducing KL vanishing;
– VAE is a good approach to balancing the compactness and usability of learned representations;
– pre-training the latent space improves performance on several language tasks.

Experimental results on a wide range of tasks and datasets have demonstrated the strong performance of OPTIMUS, including new state-of-the-art for language VAEs.
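
A minimal sketch of the architecture (not the authors' code; model names, latent size and the wiring below are assumptions for illustration): a BERT encoder parameterizes the posterior over a latent vector z, which is then injected into a GPT-2 decoder. The sketch shows the simpler "add to every token embedding" injection scheme; the paper also describes a memory-based scheme.

```python
# Illustrative BERT-encoder / GPT-2-decoder language VAE (not the official Optimus code)
import torch
import torch.nn as nn
from transformers import BertModel, GPT2LMHeadModel

class LanguageVAE(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-cased")
        self.decoder = GPT2LMHeadModel.from_pretrained("gpt2")
        self.to_mu_logvar = nn.Linear(self.encoder.config.hidden_size, 2 * latent_dim)
        self.latent_to_embed = nn.Linear(latent_dim, self.decoder.config.n_embd)

    def forward(self, enc_ids, enc_mask, dec_ids):
        # Encode: use the [CLS] representation to parameterize q(z|x)
        h = self.encoder(input_ids=enc_ids, attention_mask=enc_mask).last_hidden_state[:, 0]
        mu, logvar = self.to_mu_logvar(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick

        # Decode: add the projected latent to every GPT-2 token embedding, then compute the LM loss
        tok_emb = self.decoder.transformer.wte(dec_ids) + self.latent_to_embed(z).unsqueeze(1)
        out = self.decoder(inputs_embeds=tok_emb, labels=dec_ids)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return out.loss + kl  # ELBO: reconstruction + KL
```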


Paper: https://arxiv.org/abs/2004.04092v1
Github: https://github.com/ChunyuanLI/Optimus

#deeplearning #nlp #nlu #transformer #vae #bert #gpt2
​​(Re)Discovering Protein Structure and Function Through Language Modeling

Trained solely with unsupervised language modeling, a Transformer recovers high-level structural (folding) and functional properties of proteins through its attention mechanism!

Why this is important: traditional protein modelling requires lots of computational power. This might be a key to more efficient structure modelling. Protein structure => function. Function => faster drug research and a better understanding of disease mechanisms.
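
As a rough illustration of the idea (the checkpoint name below is an assumption, a public protein language model on Hugging Face, not necessarily the one from the paper), one can pull attention maps from a pre-trained protein language model and inspect them as crude residue-residue contact scores:

```python
# Sketch: extract attention maps from a protein language model
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert", output_attentions=True)

sequence = "M K T A Y I A K Q R"  # amino acids separated by spaces, as ProtBERT expects
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    attentions = model(**inputs).attentions  # tuple: one (batch, heads, seq, seq) tensor per layer

# Average over layers and heads to get a crude residue-residue "contact" score matrix
contact_scores = torch.stack(attentions).mean(dim=(0, 2))[0]
print(contact_scores.shape)  # (seq_len, seq_len), including special tokens
```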

Blog: https://blog.einstein.ai/provis/
Paper: https://arxiv.org/abs/2006.15222
Code: https://github.com/salesforce/provis

#DL #NLU #proteinmodelling #bio #biolearning #insilico
​​GPT-3 application for website form generation

Turns out the #GPT3 model is capable of generating #JSX code (the HTML-like layout syntax for #React) given a description of the required blocks.

The author reports that there are exceptions, given the model's current output limit of 512 tokens.

Why this is important: one might suppose that in the future programmers will just write specifications and tests for the AI to generate the code. Given the speed of progress that won’t be surprising at all.

More sophisticated models will probably be able to work within that hard output limit and still produce the required code, but that is obviously still an area of active research.

A more realistic assessment is that upcoming code-generation tools will simply allow more people to build products, following the #nocode movement.
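
A rough sketch of how such a demo could be wired up with the OpenAI completion API (the prompt, model name and parameters are illustrative assumptions, not the author's actual setup):

```python
# Sketch: prompting GPT-3 to emit JSX from a plain-text description
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

prompt = (
    "Description: a page with a heading that says 'Subscribe', "
    "an email input field, and a green submit button.\n"
    "JSX:"
)

response = openai.Completion.create(
    engine="davinci",        # base GPT-3 model available at the time
    prompt=prompt,
    max_tokens=512,          # the hard output limit mentioned above
    temperature=0.2,
    stop=["Description:"],   # stop before the model starts a new example
)

print(response.choices[0].text)
```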

Twitter thread: https://twitter.com/sharifshameem/status/1282676454690451457

#codegeneration #NLU
Applying GPT-3 to generate neural network code

Matt Shumer used GPT-3 to generate code for a machine learning model, just by describing the dataset and required output.

#GPT3 #inception #codegeneration #NLU #NLP
Deep learning to translate between programming languages

#FacebookAI released TransCoder, an entirely self-supervised neural transcompiler system that is claimed to make code migration easier and more efficient.

ArXiV: https://arxiv.org/pdf/2006.03511.pdf
Github: https://github.com/facebookresearch/TransCoder/

#NLU #codegeneration #NLP
​​Stanford updated tool Stanza with #NER for biomedical and clinical terms

Stanza has been extended with its first domain-specific models for biomedical and clinical English. They range from approaching to significantly improving state-of-the-art results on syntactic and NER tasks.

That means neural networks can now handle difficult texts full of domain-specific terms, which enables better search, improved knowledge extraction, meta-analysis, or even research over medical ArXiV publications.
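
A minimal usage sketch (the biomedical package and NER model names are taken from the Stanza documentation and should be treated as assumptions; swap in whichever domain model fits your data):

```python
# Sketch: biomedical NER with Stanza
import stanza

# Download an English biomedical pipeline with a domain-specific NER model
stanza.download("en", package="craft", processors={"ner": "bionlp13cg"})
nlp = stanza.Pipeline("en", package="craft", processors={"ner": "bionlp13cg"})

doc = nlp("The patient was treated with cisplatin for non-small cell lung carcinoma.")
for ent in doc.entities:
    print(ent.text, ent.type)
```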

Demo: http://stanza.run/bio
ArXiV: https://arxiv.org/abs/2007.14640

#NLProc #NLU #Stanford #biolearning #medicallearning
​​Philosopher AI – a website to generate text with #GPT3

A tool to generate text on different topics. Sensitive topics such as sex, religion or even nationality are blocked.

A great way to spread awareness of #ai and to show non-technical friends that #Skynet is not a problem to be concerned about yet.

Website: https://philosopherai.com/philosopher/humanity-on-mars-73ac00

#nlu #nlp
Most Scots NLP models that used Wikipedia for training are flawed

One person who made 200,000 edits and wrote 20,000 articles on Scots Wikipedia was not writing in the Scots language but rather faking it. Since Wikipedia texts are often used as a training dataset for #NLU / #NLP / #NMT neural nets, models that used it as input inherited the flaw.

Reddit thread: https://www.reddit.com/r/Scotland/comments/ig9jia/ive_discovered_that_almost_every_single_article/

#datasets #translation #scots #wikipedia
​​DeepMind significantly (+100%) improved protein folding modelling

Why is this important: protein folding = protein structure = protein function = how a protein works in a living specimen and what it does.
What this means: better vaccines, better medications, more curable diseases and more conditions eased by medication or better understanding.

Dataset: ~170000 available protein structures from PDB
Hardware: 128 TPUv3 cores (roughly equivalent to ~100-200 GPUs)

Link: https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology

#DL #NLU #proteinmodelling #bio #biolearning #insilico #deepmind #AlphaFold
​​Supporting content decision makers with machine learning

#Netflix shared a post providing information about how they research and prepare data for new title production.

Link: https://netflixtechblog.com/supporting-content-decision-makers-with-machine-learning-995b7b76006f

#NLU #NLP #recommendation #embeddings
​​Blender Bot 2.0: An open source chatbot that builds long-term memory and searches the internet

The bot is capable of holding a dialogue and remembering the context across sequential questions.

Blogpost: https://ai.facebook.com/blog/blender-bot-2-an-open-source-chatbot-that-builds-long-term-memory-and-searches-the-internet
Github: https://github.com/facebookresearch/ParlAI
Paper 1: https://parl.ai/projects/sea
Paper 2: https://parl.ai/projects/msc

#chatbot #NLU #facebookai
​​Program Synthesis with Large Language Models

The paper compares models used for program synthesis in general-purpose programming languages on two new benchmarks, MBPP (Mostly Basic Programming Problems) and MathQA-Python, in both the few-shot and fine-tuning regimes.

MBPP contains 974 programming tasks designed to be solvable by entry-level programmers. The MathQA-Python benchmark contains 23,914 problems that evaluate the ability of the models to synthesize code from more complex text.

The largest fine-tuned model achieves 83.8 percent accuracy on the latter benchmark.
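
For intuition, a few-shot evaluation in this style roughly amounts to building a prompt out of a handful of solved tasks and letting the model complete the next one (the template below is an illustrative format, not the exact one from the paper):

```python
# Illustrative few-shot prompt for an MBPP-style task
few_shot_prompt = '''\
### Task
Write a function to return the sum of squares of the first n natural numbers.
### Tests
assert sum_of_squares(3) == 14
### Solution
def sum_of_squares(n):
    return sum(i * i for i in range(1, n + 1))

### Task
Write a function to check whether a string is a palindrome.
### Tests
assert is_palindrome("level") == True
### Solution
'''

# The prompt is sent to a large language model; the completion is then executed
# against held-out asserts to score the problem as solved or not.
```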

Why this is interesting: better models for code / problem understanding mean improved search for coding tasks and better coding-assistant projects like #TabNine or #Copilot

ArXiV: https://arxiv.org/abs/2108.07732

#DL #NLU #codewritingcode #benchmark
​​Summarizing Books with Human Feedback

#OpenAI fine-tuned #GPT3 to summarize books well enough to be human-readable. Main approach: recursively split text into parts and then meta-summarize summaries.

This is really important because once there is a great summarization #SOTA, we won't need editors to write posts for you. And researchers will ultimately have some assistance interpreting models' results.
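
A minimal sketch of the recursive approach (summarize_chunk is a hypothetical stand-in for a call to the fine-tuned model; chunking is simplified):

```python
# Sketch of recursive book summarization: summarize chunks, then summarize the summaries

def summarize_chunk(text: str) -> str:
    ...  # hypothetical call to a fine-tuned summarization model

def split_into_chunks(text: str, max_len: int = 2000) -> list:
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

def summarize_book(text: str, max_len: int = 2000) -> str:
    # Base case: short enough to summarize directly
    if len(text) <= max_len:
        return summarize_chunk(text)
    # Recursive case: summarize each chunk, then meta-summarize the joined summaries
    summaries = [summarize_chunk(chunk) for chunk in split_into_chunks(text, max_len)]
    return summarize_book(" ".join(summaries), max_len)
```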

BlogPost: https://openai.com/blog/summarizing-books/
ArXiV: https://arxiv.org/abs/2109.10862

#summarization #NLU #NLP
​​AI Generated Pokemon Sprites with GPT-2

The author trained a #GPT2 model to generate #pokemon sprites, encoding them as lines of characters (including color). Surprisingly, the results were decent, which leaves us wondering whether #GPT3 results would be better.
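
Roughly, the trick is to serialize each sprite as text so a language model can learn to produce it; the character-to-color mapping below is a made-up illustration, not the author's exact scheme:

```python
# Sketch: serialize a tiny sprite (grid of palette indices) as lines of characters
PALETTE = {0: ".", 1: "r", 2: "g", 3: "b"}  # palette index -> character (illustrative)

sprite = [
    [0, 1, 1, 0],
    [1, 2, 2, 1],
    [1, 2, 2, 1],
    [0, 1, 1, 0],
]

def sprite_to_text(grid):
    return "\n".join("".join(PALETTE[c] for c in row) for row in grid)

def text_to_sprite(text):
    inverse = {ch: idx for idx, ch in PALETTE.items()}
    return [[inverse[ch] for ch in line] for line in text.splitlines()]

encoded = sprite_to_text(sprite)   # text lines a GPT-2 model could be trained on / sampled from
assert text_to_sprite(encoded) == sprite
```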

YouTube: https://www.youtube.com/watch?v=Z9K3cwSL6uM
GitHub: https://github.com/MatthewRayfield/pokemon-gpt-2
Article: https://matthewrayfield.com/articles/ai-generated-pokemon-sprites-with-gpt-2/
Example: https://matthewrayfield.com/projects/ai-pokemon/

#NLU #NLP #generation #neuralart
It's All in the Heads: Using Attention Heads as a Baseline for Cross-Lingual Transfer in Commonsense Reasoning

Researchers from #Yandex have discovered that the reasoning capabilities of cross-lingual Transformers are concentrated in a small set of attention heads. A new multilingual dataset could encourage research on commonsense reasoning in Russian, French, Chinese and other languages.

Link: https://research.yandex.com/news/a-few-attention-heads-for-reasoning-in-multiple-languages

ArXiV: https://arxiv.org/abs/2106.12066

#transformer #nlu #nlp
YaTalks – Yandex's conference for the IT community.

Yandex will host its traditional conference on 3-4 December (starting tomorrow). Registration is open.

One of the tracks is devoted to Machine/Deep Learning with the focus on content generation.

Featured reports:

📚 How to train a text model on a minimal corpus
🎙️ How Yandex.Browser Machine Translation works
🤖 Facial Expressions Animation

Conference website: https://yatalks.yandex.ru/?from=tg_opendatascience

#conference #mt #nlu
🦜 Hi!

We are the first Telegram Data Science channel.


The channel started as a collection of notable papers, news and releases shared with members of the Open Data Science (ODS) community. Over the years of keeping the thing going, we grew into an independent online media outlet supporting the principles of free and open access to information related to Data Science.


Ultimate Posts

* Where to start learning more about Data Science. https://github.com/open-data-science/ultimate_posts/tree/master/where_to_start
* @opendatascience channel audience research. https://github.com/open-data-science/ods_channel_stats_eda


Open Data Science

ODS.ai is an international community of people connected to Data Science in any way.

Website: https://ods.ai



Hashtags

Through the years we accumulated a big collection of materials, most of them accompanied by hashtags.

#deeplearning #DL – posts about deep neural networks (> 1 layer)
#cv – posts related to Computer Vision. Pictures and videos
#nlp #nlu – Natural Language Processing and Natural Language Understanding. Texts and sequences
#audiolearning #speechrecognition – related to audio information processing
#ar – augmented reality related content
#rl – Reinforcement Learning (agents, bots and neural networks capable of playing games)
#gan #generation #generativeart #neuralart – about neural art and image generation
#transformer #vqgan #vae #bert #clip #StyleGAN2 #Unet #resnet #keras #Pytorch #GPT3 #GPT2 – related to specific architectures or frameworks
#coding #CS – content related to the software engineering sphere
#OpenAI #microsoft #Github #DeepMind #Yandex #Google #Facebook #huggingface – hashtags related to specific companies
#productionml #sota #recommendation #embeddings #selfdriving #dataset #opensource #analytics #statistics #attention #machine #translation #visualization


Chats

- Data Science Chat https://t.me/datascience_chat
- ODS Slack through invite form at website

ODS resources

* Main website: https://ods.ai
* ODS Community Telegram Channel (in Russian): @ods_ru
* ML trainings Telegram Channel: @mltrainings
* ODS Community Twitter: https://twitter.com/ods_ai

Feedback and Contacts

You are welcome to reach administration through telegram bot: @opendatasciencebot
🔥Out of One, Many: Using Language Models to Simulate Human Samples

TLDR: GPT-3 has an unexpected application – modelling sociological studies. The average responses of certain groups can be predicted in silico with measurable accuracy.

What this means: sociologists won't have to conduct costly live studies and will be able to run experiments in simulation. Marketers and politicians are getting their hands on a cheap way to model their slogans and value propositions. This enables people to test more hypotheses faster and to manipulate society more efficiently.
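
For intuition, the setup boils down to conditioning GPT-3 on a demographic "backstory" and sampling the completion as a simulated survey response (the prompt format, fields and model name below are illustrative assumptions, not the paper's exact setup):

```python
# Sketch: simulating a survey answer by conditioning GPT-3 on a backstory
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

backstory = (
    "I am a 45-year-old nurse from Ohio. "
    "I describe myself as politically moderate and I vote in most elections."
)
question = "If the election were held today, I would vote for"

response = openai.Completion.create(
    engine="davinci",
    prompt=backstory + "\n" + question,
    max_tokens=5,
    temperature=1.0,
)
print(response.choices[0].text.strip())

# Repeating this over many sampled backstories approximates the response distribution of a group.
```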

ArXiV: https://arxiv.org/abs/2209.06899

#gpt3 #psychohistory #nlu #sociology
Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale

Amos is a new optimizer proposed for pre-training large language models. It is more efficient and converges faster than AdamW: ≤ 51% memory for slot variables, and better validation loss within ≤ 70% of the training time!

ArXiV: https://arxiv.org/abs/2210.11693

#NLU #NLP #optimizer