An experiment log of ULMFiT/BERT/ELMo-style pre-training in spaCy.
https://github.com/explosion/spaCy/pull/2931
GitHub
💫 Add experimental ULMFit/BERT/Elmo-like pretraining by honnibal · Pull Request #2931 · explosion/spaCy
Add support for a new command, spacy pretrain:
usage: spacy pretrain [-h] [-cw 128] [-cd 4] [-er 1000] [-d 0.2] [-i 1] [-s 0]
texts_loc vectors_model output_dir
Pre-train...
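A recap of the tricks and pitfalls in training and evaluating GCNs. From a quick skim, it describes many training details and covers many key parts of building a GCN.
I previously tried a vanilla GCN for text classification: https://arxiv.org/abs/1809.05679
I also wrote my own implementation. GCN for text classification does work, and it is much faster than methods such as TextCNN.
link: https://arxiv.org/abs/1811.05868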
The huggingface folks have published the package to PyPI, so lazy mode is now available...
link: https://github.com/huggingface/pytorch-pretrained-BERT
GitHub
GitHub - huggingface/transformers: 🤗 Transformers: the model-definition framework for state-of-the-art machine learning models…
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training. - GitHub - huggingface/t...
Slides from PyData Warsaw 2018 on NLP summarization.
https://ghostweather.slides.com/lynncherny/tl-dr-summarization#/6
The author has other good slides on the same page, including the Google one, "the stories we tell". They are all well written and good for a quick recap.
Slides
Tl;dr: Summarization.
A talk overviewing NLP summarization goals and metrics, given as keynote at PyData Warsaw, with some non-news experiments and commentary on artistic applications.
A nice short paper submitted to ICLR 2019 that compares pre-trained sentence-level language models. The authors' responses below it are also worth reading.
https://openreview.net/forum?id=Bkl87h09FX
OpenReview
Looking for ELMo's friends: Sentence-Level Pretraining Beyond...
We compare many tasks and task combinations for pretraining sentence-level BiLSTMs for NLP tasks. Language modeling is the best single pretraining task, but simple baselines also do well.
An overview of MF (matrix factorization) in recommender systems. Worth a look if you are new to RecSys. https://towardsdatascience.com/paper-summary-matrix-factorization-techniques-for-recommender-systems-82d1a7ace74
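For anyone new to MF, here is a minimal sketch of the basic idea: learn a small vector per user and per item so that their dot product approximates the observed ratings, trained with regularized SGD. The toy ratings, the function names (`factorize`, `rmse`), and the hyperparameters are all made up for illustration and are not taken from the linked article.

```python
import random


def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02, epochs=500, seed=0):
    """Learn user factors P and item factors Q from (user, item, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def rmse(ratings, P, Q):
    """Root mean squared error of the dot-product predictions."""
    total = 0.0
    for u, i, r in ratings:
        pred = sum(pf * qf for pf, qf in zip(P[u], Q[i]))
        total += (r - pred) ** 2
    return (total / len(ratings)) ** 0.5


# Hypothetical toy data: (user index, item index, rating).
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

P0, Q0 = factorize(toy_ratings, n_users=3, n_items=3, epochs=0)  # untrained baseline
P, Q = factorize(toy_ratings, n_users=3, n_items=3)
rmse_before = rmse(toy_ratings, P0, Q0)
rmse_after = rmse(toy_ratings, P, Q)
```

A missing rating for user `u` and item `i` would then be estimated as the dot product of `P[u]` and `Q[i]`.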
A high-level lib built on PyTorch. I looked at it a long time ago and had not noticed that it is now a repo under the official PyTorch team. Worth following.
https://github.com/pytorch/ignite
GitHub
GitHub - pytorch/ignite: High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.
High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently. - pytorch/ignite
The second version of "Do Better ImageNet Models Transfer Better?".
In v1, we used public checkpoints where the ResNet models were trained without regularizers, which is why they performed best in the fixed feature setting. In v2, we retrained everything. Surprisingly, for ImageNet training, the same hyperparameters work well for all models.
In v2, we show that regularization settings for ImageNet training matter a lot for transfer learning on fixed features. ImageNet accuracy now correlates with transfer acc in all settings.
https://arxiv.org/abs/1805.08974
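MedicalTorch has been upgraded to v0.2. It is a PyTorch framework dedicated to medical imaging. I have not studied it closely; presumably medical images differ from image processing in other domains. A quick look at the Model code shows a reference to segmentation using deep dilated convolutions.
link: https://www.nature.com/articles/s41598-018-24304-3
Many of the functions in transforms are quite specialized. It looks like a high-quality project and deserves a closer look.
link: https://medicaltorch.readthedocs.io/en/stable/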
Nature
Spinal cord gray matter segmentation using deep dilated convolutions
Scientific Reports - Spinal cord gray matter segmentation using deep dilated...
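pandas bokeh: a tool I had planned to build half a year ago, and someone else has built it first. Then again, there are already plenty of tools like this...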
link: https://github.com/PatrikHlobil/Pandas-Bokeh
GitHub
GitHub - PatrikHlobil/Pandas-Bokeh: Bokeh Plotting Backend for Pandas and GeoPandas
Bokeh Plotting Backend for Pandas and GeoPandas. Contribute to PatrikHlobil/Pandas-Bokeh development by creating an account on GitHub.
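A fairly good applied introduction to FM (Factorization Machines), covering typical applications such as recommendation and search. Good for getting to know FFM and FM. https://www.m3tech.blog/entry/2019/01/02/090000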
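M3 Tech Blog (エムスリーテックブログ)
Implementation and Numerical Verification of Factorization Machines - M3 Tech Blog
Introduction. Happy New Year. I am Nishiba (@m_nishiba) from Engineering G. I work on natural language processing and recommender systems in the AI and machine learning team. After reading "Latest Trends in Deep Factorization Machines (2018)" on Gunosy's data analysis blog, Factorization Machin…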
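For reference, a second-order FM scores an input as a bias, plus a linear term, plus pairwise interactions weighted by dot products of per-feature latent vectors. The sketch below is my own illustration, not code from the linked post: it computes the score with the usual linear-time rewrite of the pairwise term and with the naive double loop. The function names and toy parameters are hypothetical.

```python
def fm_predict(x, w0, w, V):
    """Second-order FM score using the O(k * n) form of the pairwise term."""
    n, k = len(x), len(V[0])
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))            # sum_i v_if * x_i
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))  # sum_i (v_if * x_i)^2
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise


def fm_predict_naive(x, w0, w, V):
    """Same score, summing <v_i, v_j> * x_i * x_j over all pairs i < j."""
    n = len(x)
    score = w0 + sum(wi * xi for wi, xi in zip(w, x))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            score += dot * x[i] * x[j]
    return score


# Hypothetical toy parameters: 3 features, latent dimension 2.
x = [1.0, 0.0, 2.0]
w0 = 0.5
w = [0.1, -0.2, 0.3]
V = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]

fast = fm_predict(x, w0, w, V)
naive = fm_predict_naive(x, w0, w, V)
```

Both functions should agree up to floating-point error. The first one avoids the quadratic loop over feature pairs, which matters when features are sparse and high-dimensional.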
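A highly parallel Rust implementation of Parabel. https://github.com/tomtung/parabel-rs
About Parabel: https://dl.acm.org/citation.cfm?doid=3178876.3185998
It looks well suited to large-scale classification problems and the performance is outstanding. Leaving it for later study.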
GitHub
GitHub - tomtung/omikuji: An efficient implementation of Partitioned Label Trees & its variations for extreme multi-label classification
An efficient implementation of Partitioned Label Trees & its variations for extreme multi-label classification - GitHub - tomtung/omikuji: An efficient implementation of Partitioned Label T...
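Several of the more important datasets from 2018. I have used SQuAD2.0/CoQA/HotpotQA/TencentAI ML myself, and they are all fairly high quality.
https://medium.com/syncedreview/2018-in-review-10-open-sourced-ai-datasets-696b3b49801f
I also recommend the Chinese embeddings that Tencent AI released a while ago: https://ai.tencent.com/ailab/nlp/embedding.html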
Medium
2018 In Review: 10 Open-Sourced AI Datasets
In a boon to AI researchers, the last year witnessed an unprecedented open-sourcing of large datasets by popular AI research projects.
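A nice tool from Uber AI. After playing with it for a day, I find it well suited to running demos and quick validation; many state-of-the-art approaches can be validated with it first. https://uber.github.io/ludwig/
Blog introduction: https://eng.uber.com/introducing-ludwig/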
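DVC: a tool for managing data science models. Roughly, it works by combining git with storage such as S3. It is quite useful for multi-person teams and for teams spanning several business lines. I used it once while working on Kaggle with teammates, and the experience was good. https://github.com/iterative/dvc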
GitHub
GitHub - treeverse/dvc: 🦉 Data Versioning and ML Experiments
🦉 Data Versioning and ML Experiments. Contribute to treeverse/dvc development by creating an account on GitHub.
FAIR's ELF has released a new version of ELF Go, and more Go bots should follow. https://facebook.ai/developers/tools/elf
ELF OpenGo: https://research.fb.com/facebook-open-sources-elf-opengo/
LeCun's FB post: https://www.facebook.com/yann.lecun/posts/10155789997817143
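Tried out JAX this morning. I had been following it for a while, and yesterday I saw Francois mention it again. In short, it is NumPy + gradients, with GPU acceleration powered by XLA https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/g3doc/overview.md . It may be a good choice if you want to implement some low-level framework. https://github.com/google/jax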
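To make "NumPy + gradients" concrete, here is a minimal sketch (my own toy example, assuming JAX is installed): the loss is written in NumPy style with `jax.numpy`, `jax.grad` returns its gradient function, and `jax.jit` compiles it through XLA. The linear model and data are made up for illustration.

```python
import jax
import jax.numpy as jnp


def loss(w, x, y):
    # Mean squared error of a linear model, written like ordinary NumPy code.
    pred = jnp.dot(x, w)
    return jnp.mean((pred - y) ** 2)


# jax.grad differentiates with respect to the first argument (w) by default;
# jax.jit compiles the resulting gradient function with XLA.
grad_loss = jax.jit(jax.grad(loss))

x = jnp.array([[1.0, 2.0], [3.0, 4.0]])
y = jnp.array([1.0, 2.0])
w = jnp.zeros(2)

g = grad_loss(w, x, y)  # gradient of the loss at w = 0
```

The same code runs on CPU or GPU depending on which backend JAX finds.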
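First there was StanfordNLP, and now I have come across https://github.com/zalandoresearch/flair , though by now I am somewhat immune to this kind of library. After reading some of the source, I think the project's code is quite well written. If you build your own tools, it is worth a look; you only build well after reading a lot.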
GitHub
GitHub - flairNLP/flair: A very simple framework for state-of-the-art Natural Language Processing (NLP)
A very simple framework for state-of-the-art Natural Language Processing (NLP) - flairNLP/flair
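ignite, a high-level API for PyTorch from FAIR. I played with it last night and it is very pleasant to use; the relationship feels a bit like that between keras and tf. https://github.com/pytorch/ignite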
GitHub
GitHub - pytorch/ignite: High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.
High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently. - pytorch/ignite