Data Science | Machine Learning | Artificial Intelligence
Your daily dose of data science

Discussion chat: @data_science_forum
Supporting content decision makers with machine learning

#Netflix shared a post providing information about how they research and prepare data for new title production.

Link: https://netflixtechblog.com/supporting-content-decision-makers-with-machine-learning-995b7b76006f

#NLU #NLP #recommendation #embeddings

Team
@OpenArchiveBooks
@data_enthusiasts
MIT Introduction to Deep Learning

And specifically, the lecture on RNNs and their modifications:
https://youtu.be/qjrad0V0uJE

The #course as a whole is excellent too, though it focuses more on image processing. For NLP beginners, this clear and elegant survey of RNNs will be especially useful, and in fact many architectures in #NLP models originally came from image-processing tasks. If you want to recap the theory or build an understanding of DL basics, it comes strongly recommended!
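As a quick refresher on what the lecture covers: a vanilla RNN applies the same weights at every time step, updating a hidden state from each input. A minimal NumPy sketch with toy dimensions and randomly initialized weights (an illustration, not code from the course):

```python
import numpy as np

def rnn_forward(x_seq, h0, Wxh, Whh, b):
    """Vanilla RNN: h_t = tanh(Wxh @ x_t + Whh @ h_{t-1} + b)."""
    h = h0
    states = []
    for x in x_seq:
        h = np.tanh(Wxh @ x + Whh @ h + b)
        states.append(h)
    return np.stack(states)

# Toy setup: 3-dim inputs, 4-dim hidden state, sequence length 5
rng = np.random.default_rng(0)
x_seq = rng.normal(size=(5, 3))
states = rnn_forward(x_seq, np.zeros(4),
                     rng.normal(size=(4, 3)),
                     rng.normal(size=(4, 4)),
                     np.zeros(4))
print(states.shape)  # (5, 4): one hidden state per time step
```

Modifications like LSTM and GRU replace the single `tanh` update with gated updates to fight vanishing gradients, but the recurrence over time steps is the same idea.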

#DL

Team
@OpenArchiveBooks
@data_enthusiasts
Deep learning to translate between programming languages

#FacebookAI released TransCoder, an entirely self-supervised neural transcompiler system that is claimed to make code migration easier and more efficient.

ArXiV: https://arxiv.org/pdf/2006.03511.pdf
Github: https://github.com/facebookresearch/TransCoder/

#NLU #codegeneration #NLP

Team
@OpenArchiveBooks
@data_enthusiasts
The Cost of Training NLP Models: A Concise Overview

The authors review the cost of training large-scale language models, and the drivers of these costs.
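One basic driver is simply hardware-hours. A back-of-envelope sketch of the kind of estimate involved (the numbers below are hypothetical, not taken from the paper):

```python
def training_cost_usd(num_gpus, hours, price_per_gpu_hour):
    """Back-of-envelope cloud cost: GPUs x wall-clock hours x hourly rate."""
    return num_gpus * hours * price_per_gpu_hour

# Hypothetical run: 64 GPUs for two weeks at $2.50 per GPU-hour
print(training_cost_usd(64, 14 * 24, 2.50))  # 53760.0
```

Real cost estimates also factor in hyperparameter search and restarts, which multiply the single-run figure.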

More in the paper: https://arxiv.org/pdf/2004.08900

#nlp #language

Team
@OpenArchiveBooks
@data_enthusiasts
Rethinking Generalization of Neural Models: A Named Entity Recognition Case Study

The authors use the NER task to analyze the generalization behavior of existing models from several perspectives. In-depth experiments diagnose the bottlenecks of existing neural NER models in terms of breakdown performance analysis, annotation errors, dataset bias, and category relationships, and suggest directions for improvement.

The authors also release two datasets for future research: ReCoNLL and PLONER.

The main findings of the paper:
– the performance of existing models (including the state-of-the-art model) is heavily influenced by the degree to which test entities have been seen in the training set with the same label
– the proposed measure makes it possible to detect human annotation errors; once these errors are fixed, previous models can achieve new state-of-the-art results
– the authors introduce two measures to characterize dataset bias, and a cross-dataset generalization experiment shows that NER performance depends not only on whether a test entity was seen in the training set but also on whether its context was observed
– providing more training samples is no guarantee of better results; a targeted increase in training samples is more profitable
– relationships between entity categories influence the difficulty of learning, producing hard test samples that common learning methods struggle to solve
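The first finding rests on measuring how much of the test set was "seen" in training. A toy sketch of that kind of measure, the fraction of test (entity, label) pairs that also occur in the training set (a simplification for illustration, not the paper's exact metric):

```python
def seen_entity_ratio(train, test):
    """Fraction of test (entity, label) pairs also present in training.

    train, test: iterables of (entity_string, label) pairs.
    """
    seen = set(train)
    test = list(test)
    hits = sum(1 for pair in test if pair in seen)
    return hits / len(test)

train = [("Paris", "LOC"), ("Google", "ORG"), ("Paris", "PER")]
test = [("Paris", "LOC"), ("Berlin", "LOC"), ("Google", "ORG")]
print(seen_entity_ratio(train, test))  # 2 of 3 pairs seen -> 0.666...
```

Note that matching on the (entity, label) pair matters: "Paris" tagged PER in training does not count as having seen "Paris" as LOC.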


Paper: https://arxiv.org/abs/2001.03844
Github: https://github.com/pfliu-nlp/Named-Entity-Recognition-NER-Papers
Website: http://pfliu.com/InterpretNER/

#nlp #generalization #NER #annotations #dataset

Team
@OpenArchiveBooks
@data_enthusiasts
What is Trending on Wikipedia? Capturing Trends and Language Biases Across Wikipedia Editions

The authors propose an automatic evaluation and comparison of the browsing behavior of Wikipedia readers that can be applied to any language edition of Wikipedia. The study focuses on the English, French, and Russian editions during the last four months of 2018.

Their approach consists of the following steps:
– extraction of a sub-network of trending Wikipedia articles and identification of trends
– extraction of keywords from the summaries of every Wikipedia article in the sub-network and weighting according to their importance
– labeling of the trends with high-level topics using the extracted keywords
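The keyword-extraction-and-weighting step could look roughly like plain TF-IDF over article summaries; a toy sketch (the paper's actual weighting scheme may differ):

```python
import math
from collections import Counter

def tfidf_keywords(summaries, top_k=3):
    """Toy TF-IDF: score each word in each summary by tf * idf,
    return the top_k highest-scoring words per summary."""
    docs = [s.lower().split() for s in summaries]
    df = Counter(w for d in docs for w in set(d))  # document frequency
    n = len(docs)
    out = []
    for d in docs:
        tf = Counter(d)
        scores = {w: tf[w] * math.log(n / df[w]) for w in tf}
        out.append(sorted(scores, key=scores.get, reverse=True)[:top_k])
    return out

keywords = tfidf_keywords([
    "wikipedia trends in france",
    "wikipedia trends in russia",
    "cat videos",
])
print(keywords[0])  # the most distinctive words of the first summary
```

Words shared across all summaries get an idf of zero, so the distinctive terms ("france", "russia") surface to the top, which is what the trend-labeling step needs.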

Paper: https://arxiv.org/pdf/2002.06885
Code: https://github.com/epfl-lts2/sparkwiki


#nlp #trend #wikipedia

Team
@OpenArchiveBooks
@data_enthusiasts
Summarizing Books with Human Feedback

#OpenAI fine-tuned #GPT3 to summarize books well enough for the summaries to be human-readable. The main approach: recursively split the text into parts, summarize each part, and then summarize the concatenated summaries.
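The recursive scheme can be sketched in a few lines; here `model` is a stand-in callable for the fine-tuned summarizer (a hypothetical interface for illustration, not OpenAI's actual API):

```python
def summarize(text, model, chunk_size=2000):
    """Recursively summarize long text.

    Split the text into chunks, summarize each chunk with `model`
    (any callable str -> str), then summarize the concatenation of
    those summaries, repeating until the text fits in one chunk.
    """
    if len(text) <= chunk_size:
        return model(text)
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    merged = " ".join(model(c) for c in chunks)
    return summarize(merged, model, chunk_size)
```

Each level of recursion shortens the text, so even a book-length input eventually collapses to a single top-level summary.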

This is really important: once there is a strong summarization #SOTA, editors will no longer be needed to write posts like this one, and researchers will ultimately have some assistance in interpreting models' results.

BlogPost: https://openai.com/blog/summarizing-books/
Paper: https://arxiv.org/pdf/2109.10862

#summarization #NLU #NLP

Team
@OpenArchiveBooks
@data_enthusiasts
NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

This paper presents a new participatory Python-based natural language augmentation framework that supports the creation of transformations (modifications to the data) and filters (data splits according to specific features).

The current version of the framework contains 117 transformations and 23 filters for a variety of natural language tasks.
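To illustrate the two concepts, here is a toy transformation and a toy filter in the spirit of the framework (hypothetical functions for illustration, not NL-Augmenter's actual API):

```python
import random

def swap_word_case(text, p=0.2, seed=0):
    """Toy transformation: randomly upper-case words to perturb the input
    while keeping its meaning intact."""
    rng = random.Random(seed)
    return " ".join(
        w.upper() if rng.random() < p else w for w in text.split()
    )

def length_filter(examples, max_words=10):
    """Toy filter: split off the examples with at most `max_words` words."""
    return [e for e in examples if len(e.split()) <= max_words]

print(swap_word_case("the quick brown fox", p=1.0))
print(length_filter(["short one", "a much much longer example sentence"], max_words=3))
```

A transformation rewrites the data itself; a filter merely selects a subset with a specific property, which is how the framework probes where a model's robustness breaks down.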

The authors demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models.

Paper: https://arxiv.org/abs/2112.02721
Code: https://github.com/GEM-benchmark/NL-Augmenter

A detailed unofficial overview of the paper: https://andlukyane.com/blog/paper-review-nlaugmenter

#deeplearning #nlp #augmentation #robustness

Team
@data_enthusiasts
@OpenArchiveBooks