Data Science by ODS.ai 🦜
🔝Great OpenDataScience Channel Audience Research The first audience research was done on 25.06.18 and it is time to update our knowledge on what are we. Please fill in this form: https://forms.gle/GGNgukYNQbAZPtmk8 all the collected data will be used to…
👏After the weekend total number of collected responses is 222.
Thank you for the answers, but we still need more data to make any decisions about channel policy. If you still haven’t filled in the form, please do so.
We need your opinion, we need to know more, we need you!
Google form: https://forms.gle/GGNgukYNQbAZPtmk8
Thank you for the answers, but we still need more data to make any decisions about channel policy. If you still haven’t filled in the form, please do so.
We need your opinion, we need to know more, we need you!
Google form: https://forms.gle/GGNgukYNQbAZPtmk8
Google Docs
@opendatascience audience research 2020
Hey, this is a form to study the audience of our channel to post more relevant and interesting content for you. Please fill in.
However, first question with residence country is not obligatory, channel administration will highly appreciate any answers, for…
However, first question with residence country is not obligatory, channel administration will highly appreciate any answers, for…
+22 responses 💪 !
Thank you!
30% of audience are in GMT+3 for now.
To make analysis of audience more reliable, diverse and robust, we need more respondents from other timezones.
Vote for diversity!
Google form: https://forms.gle/GGNgukYNQbAZPtmk8
Thank you!
30% of audience are in GMT+3 for now.
To make analysis of audience more reliable, diverse and robust, we need more respondents from other timezones.
Vote for diversity!
Google form: https://forms.gle/GGNgukYNQbAZPtmk8
Microsoft Research 2019 reflection—a year of progress on technology’s toughest challenges
Highlights:
* MT-DNN — a model for learning universal language embeddings that combines the multi-task learning and the language model pre-training of BERT.
* Guidelines for human-AI interaction design
* AirSim, coming from strong MS background with flight simulations, for AI realisting testing environment.
* Sand Dance, a data visualization tool included in Visual Studio Code
* Icecaps — a toolkit for conversation modeling
Link: https://www.microsoft.com/en-us/research/blog/microsoft-research-2019-reflection-a-year-of-progress-on-technologys-toughest-challenges/
#microsoft #yearinreview
Highlights:
* MT-DNN — a model for learning universal language embeddings that combines the multi-task learning and the language model pre-training of BERT.
* Guidelines for human-AI interaction design
* AirSim, coming from strong MS background with flight simulations, for AI realisting testing environment.
* Sand Dance, a data visualization tool included in Visual Studio Code
* Icecaps — a toolkit for conversation modeling
Link: https://www.microsoft.com/en-us/research/blog/microsoft-research-2019-reflection-a-year-of-progress-on-technologys-toughest-challenges/
#microsoft #yearinreview
Microsoft Research
Microsoft Research 2019 reflection—a year of progress on technology’s toughest challenges
In 2019, Microsoft researchers assembled guidelines for human-AI interaction design, explored gender bias in machine learning, and created numerous technologies that improved accessibility. Learn how these advances emphasize inclusivity.
Top Trends of Graph Machine Learning in 2020
In this blogpost the author shares an overview of ICLR 2020 papers on Graph Machine Learning and highlights several trends:
1. More solid theoretical understanding of GNN:
* the dimension of the node embeddings should be proportional to the size of the graph if we want GNN being able to compute a solution to popular graph problems
* under certain conditions on the weights, GCNs cannot learn anything except node degrees and connected components when the number of layers grows
* a certain readout operation after neighborhood aggregation could help capture different types of node classification
2. New cool applications of GNN:
* a way to detect and fix bugs simultaneously in Javascript code
* inferring the types of variables for languages like Python or TypeScript
* reasoning in IQ-like tests (Raven Progressive Matrices (RPM) and Diagram Syllogism (DS)) with GNNs
* an RL algorithm to optimize the cost of TensorFlow computation graphs
3. Knowledge graphs become more popular:
* an idea to embed a query into a latent space not as a single point, but as a rectangular box
* a way to work with numerical entities and rules
* the re-evaluation of the existing models and how do they perform in a fair environment
4. New frameworks for graph embeddings:
* a way to improve running time and accuracy in node classification problem for any unsupervised embedding method
* a simple baseline that does not utilize a topology of the graph (i.e. it works on the aggregated node features) performs on par with the SOTA GNNs
blog post:
https://towardsdatascience.com/top-trends-of-graph-machine-learning-in-2020-1194175351a3
#ICLR #gnn #graphs
In this blogpost the author shares an overview of ICLR 2020 papers on Graph Machine Learning and highlights several trends:
1. More solid theoretical understanding of GNN:
* the dimension of the node embeddings should be proportional to the size of the graph if we want GNN being able to compute a solution to popular graph problems
* under certain conditions on the weights, GCNs cannot learn anything except node degrees and connected components when the number of layers grows
* a certain readout operation after neighborhood aggregation could help capture different types of node classification
2. New cool applications of GNN:
* a way to detect and fix bugs simultaneously in Javascript code
* inferring the types of variables for languages like Python or TypeScript
* reasoning in IQ-like tests (Raven Progressive Matrices (RPM) and Diagram Syllogism (DS)) with GNNs
* an RL algorithm to optimize the cost of TensorFlow computation graphs
3. Knowledge graphs become more popular:
* an idea to embed a query into a latent space not as a single point, but as a rectangular box
* a way to work with numerical entities and rules
* the re-evaluation of the existing models and how do they perform in a fair environment
4. New frameworks for graph embeddings:
* a way to improve running time and accuracy in node classification problem for any unsupervised embedding method
* a simple baseline that does not utilize a topology of the graph (i.e. it works on the aggregated node features) performs on par with the SOTA GNNs
blog post:
https://towardsdatascience.com/top-trends-of-graph-machine-learning-in-2020-1194175351a3
#ICLR #gnn #graphs
First movie ever upscaled and enhanced by couple of neural networks
Arrival of a Train at La Ciotat upscaled and upscaled to 4K 60 FPS
Algorithms that were used:
* Gigapixel AI by Topaz Labs for upscale
* FPS enhancement — Dain
Author on YouTube promises to experiment on the colorization and to release the update later.
YouTube: https://m.youtube.com/watch?v=3RYNThid23g
Author’s channel (in Russian): @denissexy
#upscale #dl #videoprocessing
Arrival of a Train at La Ciotat upscaled and upscaled to 4K 60 FPS
Algorithms that were used:
* Gigapixel AI by Topaz Labs for upscale
* FPS enhancement — Dain
Author on YouTube promises to experiment on the colorization and to release the update later.
YouTube: https://m.youtube.com/watch?v=3RYNThid23g
Author’s channel (in Russian): @denissexy
#upscale #dl #videoprocessing
Using ‘radioactive data’ to detect if a data set was used for training
The authors have developed a new technique to mark the images in a data set so that researchers can determine whether a particular machine learning model has been trained using those images. This can help researchers and engineers to keep track of which data set was used to train a model so they can better understand how various data sets affect the performance of different neural networks.
The key points:
- the marks are harmless and have no impact on the classification accuracy of models, but are detectable with high confidence in a neural network;
- the image features are moved in a particular direction (the carrier) that has been sampled randomly and independently of the data
- after a model is trained on such data, its classifier will align with the direction of the carrier
- the method works in such a way that it is difficult to detect whether a data set is radioactive and to remove the marks from the trained model.
blogpost: https://ai.facebook.com/blog/using-radioactive-data-to-detect-if-a-data-set-was-used-for-training/
paper: https://arxiv.org/abs/2002.00937
#cv #cnn #datavalidation #image #data
The authors have developed a new technique to mark the images in a data set so that researchers can determine whether a particular machine learning model has been trained using those images. This can help researchers and engineers to keep track of which data set was used to train a model so they can better understand how various data sets affect the performance of different neural networks.
The key points:
- the marks are harmless and have no impact on the classification accuracy of models, but are detectable with high confidence in a neural network;
- the image features are moved in a particular direction (the carrier) that has been sampled randomly and independently of the data
- after a model is trained on such data, its classifier will align with the direction of the carrier
- the method works in such a way that it is difficult to detect whether a data set is radioactive and to remove the marks from the trained model.
blogpost: https://ai.facebook.com/blog/using-radioactive-data-to-detect-if-a-data-set-was-used-for-training/
paper: https://arxiv.org/abs/2002.00937
#cv #cnn #datavalidation #image #data
ODS breakfast in Paris! ☕️ 🇫🇷 See you this Saturday at 10:30 (some people come around 11:00) at Malongo Café, 50 Rue Saint-André des Arts. We are expecting from 9 to 19 people. Tableoverflow 💥 is possible.
REST: Robust and Efficient Neural Networks for Sleep Monitoring in the Wild
New approach for sleep monitoring.
Nowadays a lot of people suffer from sleep disorders thataffects their daily functioning, long-term health and longevity. Thelong-term effects of sleep deprivation and sleep disorders includean increased risk of hypertension, diabetes, obesity, depression, heart attack, and stroke. As a result sleep monitoring is a very important topic.
Currently automatical documentation of sleep stages isn't robust against noises (which can be introduced by electrical interferences (e.g., power-line) and user motions (e.g., muscle contraction, respiration)) and isn't computationaly efficient enough for fast calculations on user devices.
The authors offer the following improvenents:
- adversarial training and spectral regularization to improve robustness to noise
- sparsity regularization to improve energy and computational efficiency
Rest models achieves a macro-F1 score of 0.67 vs. 0.39 for the state-of-the-art model in the presence of Gaussian noise, with 19×parameter and 15×MFLOPS reduction.
The model is also deployed onto a Pixel 2 smartphone. It achieves 17x energy reduction and 9x faster inference compared to uncompressed models.
Paper: https://arxiv.org/abs/2001.11363
Code: https://github.com/duggalrahul/REST
#deeplearning #compression #adversarial #sleepstaging
New approach for sleep monitoring.
Nowadays a lot of people suffer from sleep disorders thataffects their daily functioning, long-term health and longevity. Thelong-term effects of sleep deprivation and sleep disorders includean increased risk of hypertension, diabetes, obesity, depression, heart attack, and stroke. As a result sleep monitoring is a very important topic.
Currently automatical documentation of sleep stages isn't robust against noises (which can be introduced by electrical interferences (e.g., power-line) and user motions (e.g., muscle contraction, respiration)) and isn't computationaly efficient enough for fast calculations on user devices.
The authors offer the following improvenents:
- adversarial training and spectral regularization to improve robustness to noise
- sparsity regularization to improve energy and computational efficiency
Rest models achieves a macro-F1 score of 0.67 vs. 0.39 for the state-of-the-art model in the presence of Gaussian noise, with 19×parameter and 15×MFLOPS reduction.
The model is also deployed onto a Pixel 2 smartphone. It achieves 17x energy reduction and 9x faster inference compared to uncompressed models.
Paper: https://arxiv.org/abs/2001.11363
Code: https://github.com/duggalrahul/REST
#deeplearning #compression #adversarial #sleepstaging
P-value, explained, one more time with demos
Article includes not only great explanation of what is #pvalue, but how it works and how it can be used to make a correct conclusions.
Link:https://www.freecodecamp.org/news/what-is-statistical-significance-p-value-defined-and-how-to-calculate-it/
#entrylevel #dsformanagers #tutorial #explained #interactive #statistics
Article includes not only great explanation of what is #pvalue, but how it works and how it can be used to make a correct conclusions.
Link:https://www.freecodecamp.org/news/what-is-statistical-significance-p-value-defined-and-how-to-calculate-it/
#entrylevel #dsformanagers #tutorial #explained #interactive #statistics
freeCodeCamp.org
What is Statistical Significance? P Value Defined and How to Calculate It
P values are one of the most widely used concepts in statistical analysis. They are used by researchers, analysts and statisticians to draw insights from data and make informed decisions. Along with statistical significance, they are also one of the most…
Please vote in our Mega Imprtant Audience Research!
So far we have collected 384 responses, which is really cool!
But we need more filled questinnaires to know YOU and YOUR PREFERENCES better.
Some to-date data about residency:
* 🇮🇹 There are 4.5% of people who reside in Italy
* 🇧🇷 Brazil — 3.2%
* 🇫🇷 France — 1.9%
* 🇳🇬 Nigeria — 1.1%
* 🇪🇸 Spain — 5.6%
Please, fill in the form https://forms.gle/GGNgukYNQbAZPtmk8 to help us provide better and more relevant content for you!
So far we have collected 384 responses, which is really cool!
But we need more filled questinnaires to know YOU and YOUR PREFERENCES better.
Some to-date data about residency:
* 🇮🇹 There are 4.5% of people who reside in Italy
* 🇧🇷 Brazil — 3.2%
* 🇫🇷 France — 1.9%
* 🇳🇬 Nigeria — 1.1%
* 🇪🇸 Spain — 5.6%
Please, fill in the form https://forms.gle/GGNgukYNQbAZPtmk8 to help us provide better and more relevant content for you!
Google Docs
@opendatascience audience research 2020
Hey, this is a form to study the audience of our channel to post more relevant and interesting content for you. Please fill in.
However, first question with residence country is not obligatory, channel administration will highly appreciate any answers, for…
However, first question with residence country is not obligatory, channel administration will highly appreciate any answers, for…
Please, vote: https://forms.gle/GGNgukYNQbAZPtmk8 (this is a scheduled message, we hopefuly have more than 400 responses by now)
CCMatrix: A billion-scale bitext data set for training translation models
The authors show that margin-based bitext mining in LASER's multilingual sentence space can be applied to monolingual corpora of billions of sentences.
They are using 10 snapshots of a curated common crawl corpus CCNet totaling 32.7 billion unique sentences. Using one unified approach for 38 languages, they were able to mine 3.5 billion parallel sentences, out of which 661 million are aligned with English. 17 language pairs have more than 30 million parallel sentences, 82 more than 10 million, and most more than one million, including direct alignments between many European or Asian languages.
They train NMT systems for most of the language pairs and evaluate them on TED, WMT and WAT test sets. Also, they achieve a new SOTA for a single system on the WMT'19 test set for translation between English and German, Russian and Chinese, as well as German/French.
But, they will soon provide a script to extract the parallel data from this corpus
blog post: https://ai.facebook.com/blog/ccmatrix-a-billion-scale-bitext-data-set-for-training-translation-models/
paper: https://arxiv.org/abs/1911.04944.pdf
github: https://github.com/facebookresearch/LASER/tree/master/tasks/CCMatrix
#nlp #multilingual #laser #data #monolingual
The authors show that margin-based bitext mining in LASER's multilingual sentence space can be applied to monolingual corpora of billions of sentences.
They are using 10 snapshots of a curated common crawl corpus CCNet totaling 32.7 billion unique sentences. Using one unified approach for 38 languages, they were able to mine 3.5 billion parallel sentences, out of which 661 million are aligned with English. 17 language pairs have more than 30 million parallel sentences, 82 more than 10 million, and most more than one million, including direct alignments between many European or Asian languages.
They train NMT systems for most of the language pairs and evaluate them on TED, WMT and WAT test sets. Also, they achieve a new SOTA for a single system on the WMT'19 test set for translation between English and German, Russian and Chinese, as well as German/French.
But, they will soon provide a script to extract the parallel data from this corpus
blog post: https://ai.facebook.com/blog/ccmatrix-a-billion-scale-bitext-data-set-for-training-translation-models/
paper: https://arxiv.org/abs/1911.04944.pdf
github: https://github.com/facebookresearch/LASER/tree/master/tasks/CCMatrix
#nlp #multilingual #laser #data #monolingual
TyDi QA: A Multilingual Question Answering Benchmark
it's a q&a corpus covering 11 Typologically Diverse languages: russian, english, arabic, bengali, finnish, indonesian, japanese, kiswahili, korean, telugu, thai.
the authors collected questions from people who wanted an answer but did not know the answer yet.
they showed people an interesting passage from Wikipedia written in their native language and then had them ask a question, any question, as long as it was not answered by the passage and they actually wanted to know the answer.
blog post: https://ai.googleblog.com/2020/02/tydi-qa-multilingual-question-answering.html?m=1
paper: only pdf
#nlp #qa #multilingual #data
it's a q&a corpus covering 11 Typologically Diverse languages: russian, english, arabic, bengali, finnish, indonesian, japanese, kiswahili, korean, telugu, thai.
the authors collected questions from people who wanted an answer but did not know the answer yet.
they showed people an interesting passage from Wikipedia written in their native language and then had them ask a question, any question, as long as it was not answered by the passage and they actually wanted to know the answer.
blog post: https://ai.googleblog.com/2020/02/tydi-qa-multilingual-question-answering.html?m=1
paper: only pdf
#nlp #qa #multilingual #data
DEEP DOUBLE DESCENT
where bigger models and more data hurt
it's really cool & interesting research about where we watch that the performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time. but this effect is often avoided through careful regularization.
some conclusions from research:
– there is a regime where bigger models are worse
– there is a regime where more samples hurt
– there is a regime where training longer reverses overfitting
blog post: https://openai.com/blog/deep-double-descent/
paper: https://arxiv.org/abs/1912.02292
#deep #train #size #openai
where bigger models and more data hurt
it's really cool & interesting research about where we watch that the performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time. but this effect is often avoided through careful regularization.
some conclusions from research:
– there is a regime where bigger models are worse
– there is a regime where more samples hurt
– there is a regime where training longer reverses overfitting
blog post: https://openai.com/blog/deep-double-descent/
paper: https://arxiv.org/abs/1912.02292
#deep #train #size #openai
Data Science by ODS.ai 🦜
🔝Great OpenDataScience Channel Audience Research The first audience research was done on 25.06.18 and it is time to update our knowledge on what are we. Please fill in this form: https://forms.gle/GGNgukYNQbAZPtmk8 all the collected data will be used to…
☺️526 responses collected thanks to you!
Now we are looking for a volunteer to perform an #exploratory analysis of responses an publish it as a an example on github in a form of #jupyter notebook. If you are familiar with git, jupyter, basics of #exploratory analysis and want to help, write to @opendatasciencebot bot (make sure you include your username, so we can reach you back).
In the mean time, please spend some free weekend time to fill in the questionnaire form if you haven’t filled it yet: https://forms.gle/GGNgukYNQbAZPtmk8 This will help us to make channel better for you.
2020 questionnaire link: https://forms.gle/GGNgukYNQbAZPtmk8
Now we are looking for a volunteer to perform an #exploratory analysis of responses an publish it as a an example on github in a form of #jupyter notebook. If you are familiar with git, jupyter, basics of #exploratory analysis and want to help, write to @opendatasciencebot bot (make sure you include your username, so we can reach you back).
In the mean time, please spend some free weekend time to fill in the questionnaire form if you haven’t filled it yet: https://forms.gle/GGNgukYNQbAZPtmk8 This will help us to make channel better for you.
2020 questionnaire link: https://forms.gle/GGNgukYNQbAZPtmk8
Google Docs
@opendatascience audience research 2020
Hey, this is a form to study the audience of our channel to post more relevant and interesting content for you. Please fill in.
However, first question with residence country is not obligatory, channel administration will highly appreciate any answers, for…
However, first question with residence country is not obligatory, channel administration will highly appreciate any answers, for…
Few-shot Video-to-Video Synthesis
it's the pytorch implementation for few-shot photorealistic video-to-video (vid2vid) translation.
it can be used for generating human motions from poses, synthesizing people talking from edge maps, or turning semantic label maps into photo-realistic videos.
the core of vid2vid translation is image-to-image translation.
blog post: https://nvlabs.github.io/few-shot-vid2vid/
paper: https://arxiv.org/abs/1910.12713
youtube: https://youtu.be/8AZBuyEuDqc
github: https://github.com/NVlabs/few-shot-vid2vid
#cv #nips #neurIPS #pattern #recognition #vid2vid #synthesis
it's the pytorch implementation for few-shot photorealistic video-to-video (vid2vid) translation.
it can be used for generating human motions from poses, synthesizing people talking from edge maps, or turning semantic label maps into photo-realistic videos.
the core of vid2vid translation is image-to-image translation.
blog post: https://nvlabs.github.io/few-shot-vid2vid/
paper: https://arxiv.org/abs/1910.12713
youtube: https://youtu.be/8AZBuyEuDqc
github: https://github.com/NVlabs/few-shot-vid2vid
#cv #nips #neurIPS #pattern #recognition #vid2vid #synthesis
Data Science by ODS.ai 🦜
Three challenges of Deep Learning according to Yann LeCun
Yann LeCun's talk slides and video
Slides: https://drive.google.com/file/d/1r-mDL4IX_hzZLDBKp8_e8VZqD7fOzBkF/view
Video of the talks: https://vimeo.com/390347111
- 1:10 in for Geoff Hinton's keynote,
- 1:44 for Yann LeCunn's,
- 2:18 for Yoshua Bengio's,
- 2:51 for the panel discussion moderated by Leslie Pack Kaelbling
#talk #meta #master
Slides: https://drive.google.com/file/d/1r-mDL4IX_hzZLDBKp8_e8VZqD7fOzBkF/view
Video of the talks: https://vimeo.com/390347111
- 1:10 in for Geoff Hinton's keynote,
- 1:44 for Yann LeCunn's,
- 2:18 for Yoshua Bengio's,
- 2:51 for the panel discussion moderated by Leslie Pack Kaelbling
#talk #meta #master