Microsoft Research announced Open Data project! This single, cloud-hosted location offers datasets representing many years of data curation and research efforts by Microsoft.
https://www.microsoft.com/en-us/research/blog/announcing-microsoft-research-open-data-datasets-by-microsoft-research-now-available-in-the-cloud/
#data #microsoft #opensource
https://www.microsoft.com/en-us/research/blog/announcing-microsoft-research-open-data-datasets-by-microsoft-research-now-available-in-the-cloud/
#data #microsoft #opensource
Microsoft Research
Announcing Microsoft Research Open Data - Datasets by Microsoft Research now available in the cloud - Microsoft Research
The Microsoft Research Outreach team has worked extensively with the external research community to enable adoption of cloud-based research infrastructure over the past few years. Through this process, we experienced the ubiquity of Jim Gray’s fourth paradigm…
Natural Questions
Google published Natural Questions, a new, large-scale corpus and challenge for training and evaluating open-domain question answering systems, and the first to replicate the end-to-end process in which people find answers to questions.
Link: https://ai.googleblog.com/2019/01/natural-questions-new-corpus-and.html
#Google #NLP #data
Google published Natural Questions, a new, large-scale corpus and challenge for training and evaluating open-domain question answering systems, and the first to replicate the end-to-end process in which people find answers to questions.
Link: https://ai.googleblog.com/2019/01/natural-questions-new-corpus-and.html
#Google #NLP #data
blog.research.google
Natural Questions: a New Corpus and Challenge for Question Answering Research
🤖Handl: New dataset labeling tool release
Handl is a tool to label and manage data for machine learning. It employs 25k qualified crowdworkers who help tech companies to deal with data preparation and get paid for it. Consensus algorithm ensures the quality of labeling for any type of data — images, texts, and sounds.
#Handl was released today at Product Hunt, so developers might benefit from community upvotes, please consider supporting such useful tool on Product Hunt.
Link: https://handl.ai
Product Hunt url: https://www.producthunt.com/posts/handl-3
#handl #machinelearning #ai #data #datalabeling
Handl is a tool to label and manage data for machine learning. It employs 25k qualified crowdworkers who help tech companies to deal with data preparation and get paid for it. Consensus algorithm ensures the quality of labeling for any type of data — images, texts, and sounds.
#Handl was released today at Product Hunt, so developers might benefit from community upvotes, please consider supporting such useful tool on Product Hunt.
Link: https://handl.ai
Product Hunt url: https://www.producthunt.com/posts/handl-3
#handl #machinelearning #ai #data #datalabeling
Forwarded from Spark in me (Alexander)
Streamlit vs. viola vs. panel vs. dash vs. bokeh server
TLDR - make scientific web-apps via python only w/o any web-programming (i.e. django, tornado).
Dash
- Mostly for BI
- Also a paid product
- Looks like the new Tableau
- Serving and out-of-the-box scaling options
Bokeh server
- Mostly plotting (very flexible, unlimited capabilities)
- High entry cost, though bokeh is kind of easy to use
- Also should scale well
Panel
- A bokeh server wrapper with a lot of capabilities for geo + templates
Streamlit
- The nicest looking app for interactive ML apps (maybe even annotation)
- Has pre-built styles and grid
- Limited only to its pre-built widgets
- Built on tornado with a very specific data model incompatible with the majority of available widgets
- Supposed to scale well - built on top of tornado
Viola
- If it runs in a notebook - it will run in viola
- Just turns a notebook into a server
- The app with the most promise for DS / ML
- Scales kind of meh - you need to run a jupyter kernel for each user - also takes some time to spin up a kernel
- Fully benefits from a rich ecosystem of jupyter / python / widgets
- In theory has customizable grid and CSS, but does not come pre-built with this => higher barrier to entry
Also most of these apps have no authentication buil-in.
More details:
- A nice summary here;
- A very detailed pros and cons summary of Streamlit + Viola. Also a very in-depth detailed discussion;
- Also awesome streamlit boilerplate is awesome;
#data_science
TLDR - make scientific web-apps via python only w/o any web-programming (i.e. django, tornado).
Dash
- Mostly for BI
- Also a paid product
- Looks like the new Tableau
- Serving and out-of-the-box scaling options
Bokeh server
- Mostly plotting (very flexible, unlimited capabilities)
- High entry cost, though bokeh is kind of easy to use
- Also should scale well
Panel
- A bokeh server wrapper with a lot of capabilities for geo + templates
Streamlit
- The nicest looking app for interactive ML apps (maybe even annotation)
- Has pre-built styles and grid
- Limited only to its pre-built widgets
- Built on tornado with a very specific data model incompatible with the majority of available widgets
- Supposed to scale well - built on top of tornado
Viola
- If it runs in a notebook - it will run in viola
- Just turns a notebook into a server
- The app with the most promise for DS / ML
- Scales kind of meh - you need to run a jupyter kernel for each user - also takes some time to spin up a kernel
- Fully benefits from a rich ecosystem of jupyter / python / widgets
- In theory has customizable grid and CSS, but does not come pre-built with this => higher barrier to entry
Also most of these apps have no authentication buil-in.
More details:
- A nice summary here;
- A very detailed pros and cons summary of Streamlit + Viola. Also a very in-depth detailed discussion;
- Also awesome streamlit boilerplate is awesome;
#data_science
Medium
Jupyter Dashboarding — some thoughts on Voila, Panel and Dash
There are three main players in the Python dashboarding space, let’s discuss.
Not Enough Data? Deep Learning to the Rescue!
The authors use a powerful pre-trained #NLP NN model to artificially synthesize new #labeled #data for #supervised learning. They mainly focus on cases with scarce labeled data. Their method referred to as language-model-based data augmentation (#LAMBADA), involves fine-tuning a SOTA language generator to a specific task through an initial training phase on the existing (usually small) labeled data.
Using the fine-tuned model and given a class label, new sentences for the class are generated. Then they filter these new sentences by using a classifier trained on the original data.
In a series of experiments, they show that LAMBADA improves classifiers performance on a variety of datasets. Moreover, LAMBADA significantly improves upon the SOTA techniques for data augmentation, specifically those applicable to text classification tasks with little data.
paper: https://arxiv.org/abs/1911.03118
The authors use a powerful pre-trained #NLP NN model to artificially synthesize new #labeled #data for #supervised learning. They mainly focus on cases with scarce labeled data. Their method referred to as language-model-based data augmentation (#LAMBADA), involves fine-tuning a SOTA language generator to a specific task through an initial training phase on the existing (usually small) labeled data.
Using the fine-tuned model and given a class label, new sentences for the class are generated. Then they filter these new sentences by using a classifier trained on the original data.
In a series of experiments, they show that LAMBADA improves classifiers performance on a variety of datasets. Moreover, LAMBADA significantly improves upon the SOTA techniques for data augmentation, specifically those applicable to text classification tasks with little data.
paper: https://arxiv.org/abs/1911.03118
AI & Art
some artist use the large collections of #data & #ML #algorithms to create mesmerizing & dynamic #installations
watch the video —> https://youtu.be/I-EIVlHvHRM
some artist use the large collections of #data & #ML #algorithms to create mesmerizing & dynamic #installations
watch the video —> https://youtu.be/I-EIVlHvHRM
YouTube
How This Guy Uses A.I. to Create Art | Obsessed | WIRED
Artist Refik Anadol doesn't work with paintbrushes or clay. Instead, he uses large collections of data and machine learning algorithms to create mesmerizing and dynamic installations.
Machine Hallucination at Artechouse NYC: https://www.artechouse.com/nyc…
Machine Hallucination at Artechouse NYC: https://www.artechouse.com/nyc…
Using ‘radioactive data’ to detect if a data set was used for training
The authors have developed a new technique to mark the images in a data set so that researchers can determine whether a particular machine learning model has been trained using those images. This can help researchers and engineers to keep track of which data set was used to train a model so they can better understand how various data sets affect the performance of different neural networks.
The key points:
- the marks are harmless and have no impact on the classification accuracy of models, but are detectable with high confidence in a neural network;
- the image features are moved in a particular direction (the carrier) that has been sampled randomly and independently of the data
- after a model is trained on such data, its classifier will align with the direction of the carrier
- the method works in such a way that it is difficult to detect whether a data set is radioactive and to remove the marks from the trained model.
blogpost: https://ai.facebook.com/blog/using-radioactive-data-to-detect-if-a-data-set-was-used-for-training/
paper: https://arxiv.org/abs/2002.00937
#cv #cnn #datavalidation #image #data
The authors have developed a new technique to mark the images in a data set so that researchers can determine whether a particular machine learning model has been trained using those images. This can help researchers and engineers to keep track of which data set was used to train a model so they can better understand how various data sets affect the performance of different neural networks.
The key points:
- the marks are harmless and have no impact on the classification accuracy of models, but are detectable with high confidence in a neural network;
- the image features are moved in a particular direction (the carrier) that has been sampled randomly and independently of the data
- after a model is trained on such data, its classifier will align with the direction of the carrier
- the method works in such a way that it is difficult to detect whether a data set is radioactive and to remove the marks from the trained model.
blogpost: https://ai.facebook.com/blog/using-radioactive-data-to-detect-if-a-data-set-was-used-for-training/
paper: https://arxiv.org/abs/2002.00937
#cv #cnn #datavalidation #image #data
CCMatrix: A billion-scale bitext data set for training translation models
The authors show that margin-based bitext mining in LASER's multilingual sentence space can be applied to monolingual corpora of billions of sentences.
They are using 10 snapshots of a curated common crawl corpus CCNet totaling 32.7 billion unique sentences. Using one unified approach for 38 languages, they were able to mine 3.5 billion parallel sentences, out of which 661 million are aligned with English. 17 language pairs have more than 30 million parallel sentences, 82 more than 10 million, and most more than one million, including direct alignments between many European or Asian languages.
They train NMT systems for most of the language pairs and evaluate them on TED, WMT and WAT test sets. Also, they achieve a new SOTA for a single system on the WMT'19 test set for translation between English and German, Russian and Chinese, as well as German/French.
But, they will soon provide a script to extract the parallel data from this corpus
blog post: https://ai.facebook.com/blog/ccmatrix-a-billion-scale-bitext-data-set-for-training-translation-models/
paper: https://arxiv.org/abs/1911.04944.pdf
github: https://github.com/facebookresearch/LASER/tree/master/tasks/CCMatrix
#nlp #multilingual #laser #data #monolingual
The authors show that margin-based bitext mining in LASER's multilingual sentence space can be applied to monolingual corpora of billions of sentences.
They are using 10 snapshots of a curated common crawl corpus CCNet totaling 32.7 billion unique sentences. Using one unified approach for 38 languages, they were able to mine 3.5 billion parallel sentences, out of which 661 million are aligned with English. 17 language pairs have more than 30 million parallel sentences, 82 more than 10 million, and most more than one million, including direct alignments between many European or Asian languages.
They train NMT systems for most of the language pairs and evaluate them on TED, WMT and WAT test sets. Also, they achieve a new SOTA for a single system on the WMT'19 test set for translation between English and German, Russian and Chinese, as well as German/French.
But, they will soon provide a script to extract the parallel data from this corpus
blog post: https://ai.facebook.com/blog/ccmatrix-a-billion-scale-bitext-data-set-for-training-translation-models/
paper: https://arxiv.org/abs/1911.04944.pdf
github: https://github.com/facebookresearch/LASER/tree/master/tasks/CCMatrix
#nlp #multilingual #laser #data #monolingual
TyDi QA: A Multilingual Question Answering Benchmark
it's a q&a corpus covering 11 Typologically Diverse languages: russian, english, arabic, bengali, finnish, indonesian, japanese, kiswahili, korean, telugu, thai.
the authors collected questions from people who wanted an answer but did not know the answer yet.
they showed people an interesting passage from Wikipedia written in their native language and then had them ask a question, any question, as long as it was not answered by the passage and they actually wanted to know the answer.
blog post: https://ai.googleblog.com/2020/02/tydi-qa-multilingual-question-answering.html?m=1
paper: only pdf
#nlp #qa #multilingual #data
it's a q&a corpus covering 11 Typologically Diverse languages: russian, english, arabic, bengali, finnish, indonesian, japanese, kiswahili, korean, telugu, thai.
the authors collected questions from people who wanted an answer but did not know the answer yet.
they showed people an interesting passage from Wikipedia written in their native language and then had them ask a question, any question, as long as it was not answered by the passage and they actually wanted to know the answer.
blog post: https://ai.googleblog.com/2020/02/tydi-qa-multilingual-question-answering.html?m=1
paper: only pdf
#nlp #qa #multilingual #data
the latest news from :hugging_face_mask:
[0] Helsinki-NLP
With v2.9.1 released 1,008 machine translation models, covering of 140 different languages trained with marian-nmt
link to models: https://huggingface.co/models?search=Helsinki-NLP%2Fopus-mt
[1] updated colab notebook with the new Trainer
colab: https://t.co/nGQxwqwwZu?amp=1
[2] NLP – library to easily share & load data/metrics already providing access to 99+ datasets!
features
– get them all: built-in interoperability with pytorch, tensorflow, pandas, numpy
– simple transparent pythonic API
– strive on large datasets: nlp frees you from RAM memory limits
– smart cache: process once reuse forever
– add your dataset
colab: https://t.co/37pfogRWIZ?amp=1
github: https://github.com/huggingface/nlp
#nlp #huggingface #helsinki #marian #trainer # #data #metrics
[0] Helsinki-NLP
With v2.9.1 released 1,008 machine translation models, covering of 140 different languages trained with marian-nmt
link to models: https://huggingface.co/models?search=Helsinki-NLP%2Fopus-mt
[1] updated colab notebook with the new Trainer
colab: https://t.co/nGQxwqwwZu?amp=1
[2] NLP – library to easily share & load data/metrics already providing access to 99+ datasets!
features
– get them all: built-in interoperability with pytorch, tensorflow, pandas, numpy
– simple transparent pythonic API
– strive on large datasets: nlp frees you from RAM memory limits
– smart cache: process once reuse forever
– add your dataset
colab: https://t.co/37pfogRWIZ?amp=1
github: https://github.com/huggingface/nlp
#nlp #huggingface #helsinki #marian #trainer # #data #metrics
Kolesa Conf 2022 — Kolesa Group's conference for IT community
Kolesa Group will host its traditional conference on 8 October. Registration is open. Kolesa Conf is an annual Kazakhstan IT conference where leading experts from the best IT companies of the CIS take part.
One of the tracks is devoted to Data. There will be 12 speakers. Some of featured reports:
• «Methodology of construction of cloud CDW: case Air Astana».
• How to correctly estimate the sample size and time of the test. And what non-obvious difficulties can be encountered along the way.
• «Creation of a synthetic control group with the help of Propensity Score Matching».
• How the dynamic minimum cheque (surge pricing) hated by users was developed and implemented.
• «Analytics at the launch of a new product. An example of market place spare parts»
Conference website: https://bit.ly/3UmCoWC
#conference #it #data #analysis
Kolesa Group will host its traditional conference on 8 October. Registration is open. Kolesa Conf is an annual Kazakhstan IT conference where leading experts from the best IT companies of the CIS take part.
One of the tracks is devoted to Data. There will be 12 speakers. Some of featured reports:
• «Methodology of construction of cloud CDW: case Air Astana».
• How to correctly estimate the sample size and time of the test. And what non-obvious difficulties can be encountered along the way.
• «Creation of a synthetic control group with the help of Propensity Score Matching».
• How the dynamic minimum cheque (surge pricing) hated by users was developed and implemented.
• «Analytics at the launch of a new product. An example of market place spare parts»
Conference website: https://bit.ly/3UmCoWC
#conference #it #data #analysis
kolesa-conf.kz
Kolesa Conf`24
Самая масштабная IT-конференция в Казахстане