Data Science by ODS.ai 🦜
Natural Questions

Google published Natural Questions, a new, large-scale corpus and challenge for training and evaluating open-domain question answering systems, and the first to replicate the end-to-end process in which people find answers to questions.

Link: https://ai.googleblog.com/2019/01/natural-questions-new-corpus-and.html

#Google #NLP #data
🤖 Handl: New dataset labeling tool released

Handl is a tool for labeling and managing data for machine learning. It employs 25k qualified crowdworkers who help tech companies with data preparation and get paid for it. A consensus algorithm ensures the quality of labeling for any type of data: images, texts, and sounds.
#Handl was released today on Product Hunt, so the developers might benefit from community upvotes; please consider supporting such a useful tool there.
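A minimal sketch of the consensus idea, assuming simple majority voting (Handl's actual algorithm is not public; the function and threshold below are purely illustrative):

```python
from collections import Counter

def majority_vote(labels, min_agreement=0.5):
    """Aggregate one item's labels from several crowdworkers.

    Returns the consensus label if a large enough share of annotators
    agree, otherwise None (send the item back for more labeling)."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label if votes / len(labels) > min_agreement else None

print(majority_vote(["cat", "cat", "dog"]))   # cat  (2/3 agree)
print(majority_vote(["cat", "dog", "bird"]))  # None (no consensus)
```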


Link: https://handl.ai
Product Hunt url: https://www.producthunt.com/posts/handl-3

#handl #machinelearning #ai #data #datalabeling
Forwarded from Spark in me (Alexander)
Streamlit vs. Voilà vs. Panel vs. Dash vs. Bokeh server

TLDR - build scientific web apps in pure Python, without any web programming (e.g. Django or Tornado).

Dash
- Mostly for BI
- Also a paid product
- Looks like the new Tableau
- Serving and out-of-the-box scaling options

Bokeh server
- Mostly plotting (very flexible, unlimited capabilities)
- High entry cost, though Bokeh itself is kind of easy to use
- Also should scale well

Panel
- A bokeh server wrapper with a lot of capabilities for geo + templates

Streamlit
- The nicest-looking option for interactive ML apps (maybe even annotation)
- Has pre-built styles and grid
- Limited to its pre-built widgets
- Built on tornado with a very specific data model incompatible with the majority of available widgets
- Supposed to scale well - built on top of tornado
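To make the entry cost concrete, here is a minimal Streamlit sketch (assuming streamlit is installed; run it with `streamlit run app.py`):

```python
# app.py
import numpy as np
import pandas as pd
import streamlit as st

st.title("Random walk demo")

# Pre-built widgets, no HTML/CSS required
steps = st.slider("Number of steps", 100, 5000, 1000)
seed = st.number_input("Random seed", value=42, step=1)

rng = np.random.default_rng(int(seed))
walk = pd.DataFrame({"walk": rng.standard_normal(int(steps)).cumsum()})

# Each widget change reruns the whole script top to bottom
st.line_chart(walk)
```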

Voilà
- If it runs in a notebook, it will run in Voilà
- Just turns a notebook into a server
- The app with the most promise for DS / ML
- Scales kind of meh: you need to run a jupyter kernel for each user, and spinning up a kernel takes some time
- Fully benefits from the rich ecosystem of jupyter / python / widgets
- In theory has a customizable grid and CSS, but does not come with them pre-built => higher barrier to entry
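And the corresponding Voilà sketch: any ipywidgets code that works in a notebook works unchanged, you just serve the notebook (assuming voila and ipywidgets are installed):

```python
# Cell in my_notebook.ipynb; serve with: voila my_notebook.ipynb
# Voilà hides the code cells and renders only the output/widgets.
import ipywidgets as widgets

def greet(name):
    print(f"Hello, {name}!")

# interact() builds a Text widget for the string argument
widgets.interact(greet, name="world")
```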

Also, most of these apps have no built-in authentication.

More details:

- A nice summary here;
- A very detailed pros-and-cons summary of Streamlit + Voilà, plus a very in-depth discussion;
- The awesome-streamlit boilerplate is also worth a look;

#data_science
Not Enough Data? Deep Learning to the Rescue!

The authors use a powerful pre-trained #NLP NN model to artificially synthesize new #labeled #data for #supervised learning. They mainly focus on cases with scarce labeled data. Their method, referred to as language-model-based data augmentation (#LAMBADA), involves fine-tuning a SOTA language generator for a specific task through an initial training phase on the existing (usually small) labeled data.
Given the fine-tuned model and a class label, new sentences for that class are generated and then filtered with a classifier trained on the original data.

In a series of experiments, they show that LAMBADA improves classifiers' performance on a variety of datasets. Moreover, LAMBADA significantly improves upon SOTA data augmentation techniques, specifically those applicable to text classification tasks with little data.
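A hedged sketch of the generate-then-filter loop (not the authors' code; the checkpoint paths and the "LABEL SEP text" prompt format are assumptions for illustration):

```python
from transformers import pipeline

# Assumed: a GPT-2 fine-tuned on lines formatted as "<label> SEP <text>",
# and a baseline classifier trained on the original small labeled set.
generator = pipeline("text-generation", model="./gpt2-finetuned-on-task")
clf = pipeline("text-classification", model="./baseline-classifier")

def lambada_augment(label, n=10, threshold=0.9):
    """Generate n candidates for `label`; keep only those the baseline
    classifier confidently assigns to the same label."""
    kept = []
    outputs = generator(f"{label} SEP", num_return_sequences=n,
                        max_length=40, do_sample=True)
    for out in outputs:
        text = out["generated_text"].split("SEP", 1)[-1].strip()
        pred = clf(text)[0]
        if pred["label"] == label and pred["score"] >= threshold:
            kept.append(text)
    return kept

augmented = lambada_augment("positive", n=20)
```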

paper: https://arxiv.org/abs/1911.03118
Using ‘radioactive data’ to detect if a data set was used for training

The authors have developed a new technique to mark the images in a data set so that researchers can determine whether a particular machine learning model has been trained using those images. This can help researchers and engineers to keep track of which data set was used to train a model so they can better understand how various data sets affect the performance of different neural networks.

The key points:
- the marks are harmless and have no impact on the classification accuracy of models, but are detectable with high confidence in a neural network;
- the image features are moved in a particular direction (the carrier) that has been sampled randomly and independently of the data;
- after a model is trained on such data, its classifier aligns with the direction of the carrier;
- the method works in such a way that it is difficult both to detect whether a data set is radioactive and to remove the marks from the trained model.
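A toy numpy sketch of the detection step (illustrative, not Facebook's implementation): if a linear classifier was trained on marked data, its weight vector should have an unusually high cosine similarity with the secret carrier direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512  # feature dimension

# The data owner's secret: a random unit-norm carrier direction
carrier = rng.standard_normal(d)
carrier /= np.linalg.norm(carrier)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Classifier weights for the marked class, here faked as carrier + noise
# to imitate a model that was trained on radioactive data
w = carrier + 0.5 * rng.standard_normal(d)

# Under the null hypothesis (clean data), cos(w, carrier) behaves like the
# cosine of two random directions, which is tiny in high dimension
null = np.array([cosine(rng.standard_normal(d), carrier) for _ in range(10_000)])
score = cosine(w, carrier)
print(f"cosine={score:.3f}, empirical p-value={(null >= score).mean():.4f}")
```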

blogpost: https://ai.facebook.com/blog/using-radioactive-data-to-detect-if-a-data-set-was-used-for-training/
paper: https://arxiv.org/abs/2002.00937

#cv #cnn #datavalidation #image #data
CCMatrix: A billion-scale bitext data set for training translation models

The authors show that margin-based bitext mining in LASER's multilingual sentence space can be applied to monolingual corpora of billions of sentences.

They use 10 snapshots of CCNet, a curated Common Crawl corpus, totaling 32.7 billion unique sentences. Using one unified approach for 38 languages, they mined 3.5 billion parallel sentences, of which 661 million are aligned with English. 17 language pairs have more than 30 million parallel sentences, 82 have more than 10 million, and most have more than one million, including direct alignments between many European and Asian languages.

They train NMT systems for most of the language pairs and evaluate them on the TED, WMT, and WAT test sets. They also achieve a new SOTA for a single system on the WMT'19 test set for translation between English and German, Russian, and Chinese, as well as between German and French.

They will soon provide a script to extract the parallel data from this corpus.
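A small numpy sketch of the margin criterion behind such mining (following the ratio margin of Artetxe & Schwenk; random vectors stand in for LASER embeddings, and the threshold is illustrative):

```python
import numpy as np

def margin_scores(X, Y, k=4):
    """Ratio between cosine similarity and the average similarity to each
    side's k nearest neighbours. X: (n, d), Y: (m, d), rows L2-normalized."""
    sim = X @ Y.T                                       # cosine similarities
    knn_x = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # x's avg sim to its k-NN in Y
    knn_y = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # y's avg sim to its k-NN in X
    return sim / ((knn_x[:, None] + knn_y[None, :]) / 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16)); X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = rng.standard_normal((7, 16)); Y /= np.linalg.norm(Y, axis=1, keepdims=True)

# Keep pairs whose margin clears a threshold tuned on held-out data
src, tgt = np.where(margin_scores(X, Y) > 1.06)
print(list(zip(src.tolist(), tgt.tolist())))
```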

blog post: https://ai.facebook.com/blog/ccmatrix-a-billion-scale-bitext-data-set-for-training-translation-models/
paper: https://arxiv.org/abs/1911.04944
github: https://github.com/facebookresearch/LASER/tree/master/tasks/CCMatrix

#nlp #multilingual #laser #data #monolingual
TyDi QA: A Multilingual Question Answering Benchmark

It's a Q&A corpus covering 11 typologically diverse languages: Russian, English, Arabic, Bengali, Finnish, Indonesian, Japanese, Kiswahili, Korean, Telugu, and Thai.

The authors collected questions from people who wanted an answer but did not know it yet.
They showed people an interesting passage from Wikipedia written in their native language and then had them ask a question, any question, as long as it was not answered by the passage and they actually wanted to know the answer.
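If you just want to poke at the data, a short sketch using the Hugging Face datasets library (assuming the "tydiqa" dataset id and its SQuAD-style "secondary_task" config are available on the hub):

```python
from datasets import load_dataset

# Gold-passage ("secondary_task") config mirrors the SQuAD format
tydiqa = load_dataset("tydiqa", "secondary_task")

example = tydiqa["train"][0]
print(example["question"], "->", example["answers"]["text"])
```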

blog post: https://ai.googleblog.com/2020/02/tydi-qa-multilingual-question-answering.html?m=1
paper: only pdf

#nlp #qa #multilingual #data
The latest news from Hugging Face 🤗

[0] Helsinki-NLP

With v2.9.1, Helsinki-NLP released 1,008 machine translation models covering 140 different languages, all trained with marian-nmt.

link to models: https://huggingface.co/models?search=Helsinki-NLP%2Fopus-mt
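A short usage sketch with a recent transformers version (assuming the en->de checkpoint name below; any other opus-mt pair works the same way):

```python
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

batch = tokenizer(["Machine translation is fun."],
                  return_tensors="pt", padding=True)
out = model.generate(**batch)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```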


[1] updated colab notebook with the new Trainer

colab: https://t.co/nGQxwqwwZu?amp=1
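For reference, the Trainer loop boils down to something like this (a sketch; the model and dataset names are placeholders for whatever you fine-tune):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Tokenize once; Trainer handles batching, optimization, and logging
ds = load_dataset("imdb").map(
    lambda x: tokenizer(x["text"], truncation=True, padding="max_length"),
    batched=True)

args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=ds["train"]).train()
```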


[2] nlp – a library to easily share & load datasets/metrics, already providing access to 99+ datasets!

features:
– get them all: built-in interoperability with PyTorch, TensorFlow, pandas, and NumPy
– simple, transparent, pythonic API
– thrive on large datasets: nlp frees you from RAM limits
– smart cache: process once, reuse forever
– add your own dataset
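A quick sketch of the API as shown in the library's README (the project was later renamed to `datasets`, but these calls match the original `nlp` release):

```python
import nlp

# Datasets are memory-mapped from disk, so RAM is not the limit
dataset = nlp.load_dataset("squad", split="train")
metric = nlp.load_metric("squad")

print(dataset[0]["question"])
print(dataset.features)
```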

colab: https://t.co/37pfogRWIZ?amp=1
github: https://github.com/huggingface/nlp


#nlp #huggingface #helsinki #marian #trainer #data #metrics
Kolesa Conf 2022 — Kolesa Group's conference for the IT community

Kolesa Group will host its traditional conference on 8 October; registration is open. Kolesa Conf is an annual Kazakhstani IT conference where leading experts from the best IT companies of the CIS take part.

One of the tracks is devoted to Data. There will be 12 speakers. Some of the featured talks:

• «Methodology for building a cloud CDW: the Air Astana case»
• How to correctly estimate a test's sample size and duration, and what non-obvious difficulties can come up along the way
• «Building a synthetic control group with Propensity Score Matching»
• How the dynamic minimum order value (surge pricing), much hated by users, was designed and implemented
• «Analytics at the launch of a new product: the example of a spare-parts marketplace»

Conference website: https://bit.ly/3UmCoWC

#conference #it #data #analysis