Data Science Archive

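An LSTM language model used as an input method editor (IME). I had seen a similar project by a Korean developer before; this one comes from authors at the University of Tokyo and CMU, and the dataset is the Japanese BCCWJ corpus. The work actually dates from 2016, but putting a language model into an IME is a natural fit, and it looks quite interesting.
paper: https://arxiv.org/pdf/1810.09309.pdf
code: https://github.com/yohokuno/neural_ime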

582 views小熊猫, edited 05:17

A clear introductory write-up on LSTM autoencoders. It is just another tutorial, but it includes Keras code for the key parts to aid understanding. https://machinelearningmastery.com/lstm-autoencoders

579 views小熊猫, 05:22

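A survey of progress in transfer learning with language models. It covers the current state-of-the-art LMs, including AllenNLP's ELMo, ULMFiT, OpenAI's Transformer, and Google's BERT, which has been all over everyone's feeds lately. https://drive.google.com/file/d/1kmNAwrSlFYo0cN_DcURMOArBwe9FxWxR/view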

Google Docs

transfer_learning_with_language_models.pdf

572 views小熊猫, 05:37

A PyTorch implementation of BERT from huggingface, including a script to convert the TensorFlow pre-trained models. https://github.com/huggingface/pytorch-pretrained-BERT

580 views小熊猫, 05:39

HotpotQA: a Wikipedia-based dataset of question-answer pairs.
paper: https://arxiv.org/abs/1809.09600
code: https://github.com/hotpotqa/hotpot
link: https://hotpotqa.github.io/

584 views小熊猫, 05:43

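Materials from the deep learning course in the Mathematics department at Imperial College London (ICL), including PyTorch and TensorFlow tutorials plus assignment material. I looked through the tutorials and found them very interesting: unlike the usual basic exercises, they are built around popular applied projects such as question answering and generative models with VAEs/GANs. Well worth a look. https://github.com/pukkapies/dl-imperial-maths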

702 views小熊猫, 05:50

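A tool that transpiles trained scikit-learn estimators into other languages, which gives more flexibility when serving predictions in production. I have not needed to dig into it yet, but it looks like a very worthwhile project and is actively maintained. https://github.com/nok/sklearn-porter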

616 views小熊猫, edited 05:55

A differentiable MPC solver from NIPS 2018, used as a control component inside reinforcement learning models. Quoting the paper: "Specifically, we differentiate through MPC by using the KKT conditions of the convex approximation at a fixed point of the controller." The authors built it as a PyTorch library; earlier control methods do have their limitations.
paper: https://arxiv.org/abs/1810.13400
code: https://github.com/locuslab/mpc.pytorch
link: https://locuslab.github.io/mpc.pytorch/

arXiv.org

Differentiable MPC for End-to-end Planning and Control

We present foundations for using Model Predictive Control (MPC) as a differentiable policy class for reinforcement learning in continuous state and action spaces. This provides one way of...

623 views小熊猫, 06:10

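NLP course materials from Yandex. This Russian company is very strong technically and is also the home of CatBoost and ClickHouse.
link: https://github.com/yandexdataschool/nlp_course
Their GitHub organization is also worth browsing: https://github.com/yandexdataschool
It appears to host their open data science courses and is worth following.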

653 views小熊猫, edited 06:13

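An experimental GBM implementation that benchmarks pure Python plus numba JIT against LightGBM, an efficient gradient-boosted trees implementation optimized with histogram binning. I tried it, and the code on the master branch already seems to perform about on par; the project is actively updated.
code: https://github.com/ogrisel/pygbm
About numba JIT: http://numba.pydata.org/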
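To make the histogram binning idea concrete, here is a minimal pure-Python sketch of split finding on a single feature. It is not taken from pygbm or LightGBM: the function names `make_bin_edges` and `best_split`, the rank-based bin edges, and the squared-error gain with unit hessians are all assumptions chosen for illustration. The point is that after binning, candidate splits are evaluated once per bin boundary rather than once per sample.

```python
from bisect import bisect_right


def make_bin_edges(values, n_bins):
    """Pick up to n_bins - 1 distinct cut points at evenly spaced ranks."""
    ordered = sorted(values)
    edges = []
    for k in range(1, n_bins):
        edge = ordered[k * len(ordered) // n_bins]
        if not edges or edge > edges[-1]:
            edges.append(edge)
    return edges


def best_split(values, gradients, n_bins=4):
    """Return (threshold, gain) for the best split "value < threshold".

    Hypothetical helper: assumes squared-error loss with unit hessians,
    so a node's score is (sum of gradients)^2 / (number of samples).
    """
    if not values:
        return (None, 0.0)
    edges = make_bin_edges(values, n_bins)
    n = len(edges) + 1

    # One pass over the samples builds the per-bin histograms.
    grad_hist = [0.0] * n
    count_hist = [0] * n
    for v, g in zip(values, gradients):
        b = bisect_right(edges, v)
        grad_hist[b] += g
        count_hist[b] += 1

    total_grad, total_count = sum(grad_hist), sum(count_hist)
    parent_score = total_grad ** 2 / total_count

    # Scan bin boundaries only, accumulating the left-side statistics.
    best = (None, 0.0)
    left_grad, left_count = 0.0, 0
    for b in range(n - 1):
        left_grad += grad_hist[b]
        left_count += count_hist[b]
        right_count = total_count - left_count
        if left_count == 0 or right_count == 0:
            continue
        right_grad = total_grad - left_grad
        gain = (left_grad ** 2 / left_count
                + right_grad ** 2 / right_count
                - parent_score)
        if gain > best[1]:
            best = (edges[b], gain)
    return best


values = [1, 2, 3, 4, 10, 11, 12, 13]
gradients = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
print(best_split(values, gradients, n_bins=4))  # (10, 8.0)
```

With eight samples and four bins, only three boundaries are scanned. Real implementations use many more bins and also track hessians, but the scan has the same shape.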

639 views小熊猫, edited 06:20

An introductory article on the Wasserstein distance. It is very well written, explaining deep ideas in accessible terms. link: http://www.mindcodec.com/an-intuitive-guide-to-optimal-transport-for-machine-learning/
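For intuition, in one dimension with two samples of equal size and uniform weights, the Wasserstein-1 distance reduces to matching the sorted values pairwise. The sketch below is an illustration, not code from the article; `wasserstein_1d` and `brute_force_cost` are hypothetical names. The brute-force version tries every one-to-one assignment, which is the optimal transport problem in its most literal form, and it should agree with the sorted shortcut.

```python
from itertools import permutations


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-sized 1-D samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("this sketch assumes non-empty samples of equal size")
    # In 1-D the cheapest plan pairs the i-th smallest x with the i-th smallest y.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def brute_force_cost(xs, ys):
    """Minimum average |x - y| over all one-to-one assignments (tiny inputs only)."""
    n = len(xs)
    return min(
        sum(abs(xs[i] - ys[p[i]]) for i in range(n))
        for p in permutations(range(n))
    ) / n


xs = [0.0, 2.0, 5.0]
ys = [1.0, 7.0, 3.0]
print(wasserstein_1d(xs, ys), brute_force_cost(xs, ys))
```

The brute-force search grows factorially with the sample size, so it is only useful as a sanity check on a handful of points.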

662 views小熊猫, edited 07:36

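An introductory reinforcement learning course. From a quick look the quality is good and the coverage is systematic; the code goes into the details of the basic RL algorithms, and there are accompanying videos (the accent is tolerable).
slides: http://pages.isir.upmc.fr/~sigaud/teach/english.html
code: https://github.com/osigaud/rl_labs_notebooks
The video is short, a brief introduction of ten-odd minutes.
video: https://www.youtube.com/watch?v=9gzL3QQzvQ4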

708 views小熊猫, 08:00

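An introduction to QTE (quantile treatment effect), ATE (average treatment effect), and local ATE from Uber Engineering, with a lot of data science thinking from a product perspective.
link: https://eng.uber.com/analyzing-experiment-outcomes/
I also found an explanation of local ATE on Zhihu: https://www.zhihu.com/question/32199571/answer/55792738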
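The three estimands can be sketched in a few lines of plain Python. This is a textbook-style illustration on made-up numbers, not Uber's implementation: the function names, the linear-interpolation quantile, and the toy data are all assumptions. The local ATE uses the Wald estimator, which divides the effect of assignment on the outcome by the effect of assignment on actual treatment uptake, and it relies on the usual instrumental-variable assumptions.

```python
from statistics import mean


def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def ate(treated, control):
    """Average treatment effect: difference in mean outcomes."""
    return mean(treated) - mean(control)


def qte(treated, control, q):
    """Quantile treatment effect: difference in the q-th outcome quantile."""
    return quantile(treated, q) - quantile(control, q)


def late(records):
    """Wald estimator; each record is (assigned, took_treatment, outcome)."""
    y1 = [y for z, d, y in records if z == 1]
    y0 = [y for z, d, y in records if z == 0]
    d1 = [d for z, d, y in records if z == 1]
    d0 = [d for z, d, y in records if z == 0]
    uptake_gap = mean(d1) - mean(d0)
    if uptake_gap == 0:
        raise ValueError("assignment did not change treatment uptake")
    return (mean(y1) - mean(y0)) / uptake_gap


# Hypothetical outcomes: one heavy user in the treated group skews the mean.
control = [10.0, 12.0, 14.0, 16.0, 18.0]
treated = [11.0, 13.0, 15.0, 17.0, 29.0]
print(ate(treated, control))       # 3.0
print(qte(treated, control, 0.5))  # 1.0

# Half of the assigned users actually took the treatment.
records = [(1, 1, 5.0), (1, 1, 5.0), (1, 0, 1.0), (1, 0, 1.0),
           (0, 0, 1.0), (0, 0, 1.0), (0, 0, 1.0), (0, 0, 1.0)]
print(late(records))               # 4.0
```

In the toy data the ATE of 3.0 is driven by a single large outcome while the median effect is only 1.0, which is the kind of distributional difference a QTE is meant to surface.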

739 views小熊猫, edited 08:27

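An ML extension library that pairs well with scikit-learn. I have used it before; its main strengths are its ensemble methods and its wrappers for common application-level tasks, given that scikit-learn carries quite a few rarely used methods.
link: http://rasbt.github.io/mlxtend/
The author teaches in the statistics department at the University of Wisconsin-Madison and also wrote the book Python Machine Learning.
book: https://www.amazon.com/Python-Machine-Learning-Sebastian-Raschka/dp/1783555130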

rasbt.github.io

mlxtend

A library consisting of useful tools and extensions for the day-to-day data science tasks.

794 views小熊猫, 17:00

An example of exploratory data analysis (EDA) in R, by an author from UChicago. https://angela-li.github.io/slides/2018-11-08/dc-r-presentation#1

angela-li.github.io

Data Science? Make it Spatial

706 views小熊猫, 17:09

flexdashboard, a package for building interactive visualizations in RStudio. Worth a try if you use RStudio; if you use Jupyter, you probably do not need it. https://blog.rstudio.com/2016/05/17/flexdashboard-easy-interactive-dashboards-for-r/

Rstudio

flexdashboard: Easy interactive dashboards for R

Today we’re excited to announce flexdashboard, a new package that enables you to easily create flexible, attractive, interactive dashboards with R. Authoring and customization of dashboards is done using R Markdown and you can optionally include Shiny components…

710 views小熊猫, edited 17:17

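A curated tool stack for deploying and operating ML systems in production: model storage, data pipelines, ETL, feature engineering, and various performance optimizations. It collects many tools that are practical from an engineering perspective.
link: https://github.com/EthicalML/awesome-machine-learning-operations
The author also gave a fairly short talk at EuroSciPy 2018: https://axsauze.github.io/scalable-data-science/#/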

710 views小熊猫, 21:41

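cuDF: a GPU DataFrame library with a pandas-like API, from rapids.ai. I thought NVIDIA had a similar project, but I searched for a while just now and could not find it.
link: https://github.com/rapidsai/cudf
The team has other good projects too, such as cuML, cuGraph, and visualization tools; they seem to be building a GPU data science ecosystem. Worth following.
Team homepage: https://rapids.ai/
Team GitHub organization: https://github.com/RAPIDSai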

710 views小熊猫, 22:04

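The XNLI dataset: the same kind of task as the earlier MultiNLI (MNLI), but covering more languages, although the additional languages are translations of it. It also appears among the examples in the official implementation of Google's pre-trained BERT project. https://code.fb.com/ai-research/xlni/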

Facebook Engineering

Facebook, NYU expand available languages for natural language understanding systems

The XNLI dataset, a collaboration between Facebook and NYU, builds on the MultiNLI corpus, adding 14 languages including low-resource languages.

703 views小熊猫, 22:08

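A Markdown repository that tracks progress across the subfields of NLP. Its definition of progress is sensible: every entry is tied to a specific public dataset and the corresponding metrics, which makes it a great fit for someone just getting into a subfield. I skimmed the text classification and summarization sections, and they are fairly systematic. The one pity is that it does not mention the (de facto default) tricks specific to each subfield.
link: https://github.com/sebastianruder/NLP-progress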

709 views小熊猫, 22:14

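An unsupervised statistical machine translation (SMT) system from EMNLP 2018, with a BLEU score of 26.2 on WMT14, which is quite good. I do not know the translation field well: I have some hands-on experience with NMT but understand very little of traditional statistical MT.
Looking at the project's requirements, I spotted Moses, which seems to be an important tool from the earlier, traditional SMT era? (I last saw it in a project that translated Classical Chinese into modern Chinese.)
code: https://github.com/artetxem/monoses
link: https://arxiv.org/abs/1809.01272
Moses: http://www.statmt.org/moses/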
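For readers less familiar with MT evaluation, here is a minimal sketch of the BLEU metric behind the score above. It is a simplified single-reference, sentence-level version with uniform n-gram weights and no smoothing, written for illustration; it is not the scoring script behind the 26.2 figure, and the function names are hypothetical. Reported BLEU scores are usually computed over a whole test corpus and scaled to 0-100.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU in [0, 1] against one reference, without smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each n-gram count by how often it occurs in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: punish candidates shorter than the reference.
    if len(candidate) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat".split()
print(bleu("the cat sat on the mat".split(), reference))  # 1.0
print(bleu("the cat sat on a mat".split(), reference))
```

Without smoothing, any candidate that shares no 4-gram with the reference scores zero, which is one reason the metric is normally reported at corpus level.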

722 views小熊猫, 22:23
