Data Science Archive

A PyTorch implementation of BERT, including a script for converting TensorFlow pre-trained models. The authors are from Hugging Face. https://github.com/huggingface/pytorch-pretrained-BERT

580 views小熊猫, 05:39

HotpotQA: a Wikipedia-based dataset of QA pairs.
paper: https://arxiv.org/abs/1809.09600
code: https://github.com/hotpotqa/hotpot
link: https://hotpotqa.github.io/

584 views小熊猫, 05:43

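Materials from the deep learning course in the Imperial College London mathematics department, including PyTorch and TensorFlow tutorials and the assignments. I looked through the tutorials and found them very interesting: unlike traditional introductory coursework, they are built around popular applied projects such as Question Answering and Generative Models with VAEs/GANs. Well worth a look. https://github.com/pukkapies/dl-imperial-maths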

702 views小熊猫, 05:50

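A tool that converts trained scikit-learn estimators into other languages, which makes serving predictions online more flexible. I haven't needed to dig into it yet, but it looks like a very worthwhile project, and it is actively updated. https://github.com/nok/sklearn-porter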

616 views小熊猫, edited 05:55

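An MPC (Model Predictive Control) solver from NIPS 2018, from the paper "Differentiable MPC for End-to-end Planning and Control", used to support control inside reinforcement learning models. Specifically, we differentiate through MPC by using the KKT conditions of the convex approximation at a fixed point of the controller. The authors built it on PyTorch and released it as a PyTorch library. Earlier control methods did indeed all have their limitations.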
paper: https://arxiv.org/abs/1810.13400
code: https://github.com/locuslab/mpc.pytorch
link: https://locuslab.github.io/mpc.pytorch/

623 views小熊猫, 06:10

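Course materials for Yandex's NLP course. This Russian company is technically very strong and is also the company behind CatBoost and ClickHouse.
link: https://github.com/yandexdataschool/nlp_course
While you are there, browse the organization's other repositories: https://github.com/yandexdataschool
These appear to be their public data science courses; worth following.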

653 views小熊猫, edited 06:13

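A GBM experiment: a benchmark comparing a pure Python + numba JIT implementation against an efficient gradient-boosted trees implementation optimized with histogram binning (LightGBM). I tried it, and the code on the master branch already seems to be nearly on par. The project is actively updated.
code: https://github.com/ogrisel/pygbm
About numba JIT: http://numba.pydata.org/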
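
For readers unfamiliar with histogram binning, here is a minimal pure-Python sketch of the idea: once a feature has been discretized into a small number of integer bins, the split search scans bin boundaries instead of sorted raw values. This is not code from pygbm or LightGBM. The function names are made up for illustration, and the split score assumes a squared-error loss with unit hessians.

```python
def build_histogram(binned_feature, gradients, n_bins):
    """Accumulate the gradient sum and sample count for each bin of one feature."""
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for b, g in zip(binned_feature, gradients):
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count


def best_split(grad_sum, count):
    """Scan bin boundaries and return (bin index, score) of the best split.

    A split at bin b sends bins 0..b to the left child and the rest to the right.
    Score (assumption: unit hessians): G_left^2 / n_left + G_right^2 / n_right.
    """
    total_g, total_n = sum(grad_sum), sum(count)
    best_bin, best_score = None, float("-inf")
    g_left, n_left = 0.0, 0
    for b in range(len(grad_sum) - 1):
        g_left += grad_sum[b]
        n_left += count[b]
        n_right = total_n - n_left
        if n_left == 0 or n_right == 0:
            continue  # a split must leave samples on both sides
        g_right = total_g - g_left
        score = g_left ** 2 / n_left + g_right ** 2 / n_right
        if score > best_score:
            best_bin, best_score = b, score
    return best_bin, best_score


# Toy data: six samples already mapped to three bins.
bins = [0, 0, 1, 1, 2, 2]
grads = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0]
hist = build_histogram(bins, grads, n_bins=3)
print(best_split(*hist))  # (1, 6.0): split between bin 1 and bin 2
```

The cost of the scan depends on the number of bins rather than the number of samples, which is where most of the speedup comes from.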

639 views小熊猫, edited 06:20

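A popular-science article introducing the Wasserstein distance. It explains a deep topic in accessible terms and is very well written. link: http://www.mindcodec.com/an-intuitive-guide-to-optimal-transport-for-machine-learning/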
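
As a small illustration (not taken from the linked article): for two one-dimensional samples of equal size, the Wasserstein-1 distance reduces to the mean absolute difference between the sorted values, because the cheapest transport plan pairs the i-th smallest point of one sample with the i-th smallest point of the other. The function name `wasserstein_1d` is made up for this sketch.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    # In 1-D the optimal plan matches points in sorted order.
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)


a = [0.0, 1.0, 2.0]
b = [3.0, 4.0, 5.0]
print(wasserstein_1d(a, b))  # 3.0: every point has to move by 3
```

For samples of different sizes or with weights, `scipy.stats.wasserstein_distance` handles the general 1-D case.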

662 views小熊猫, edited 07:36

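An introductory reinforcement learning course. From a quick look, the quality is decent and the material is fairly systematic; the code covers the details of the basic RL algorithms, and there are accompanying videos in which the accent is manageable.
slides: http://pages.isir.upmc.fr/~sigaud/teach/english.html
code: https://github.com/osigaud/rl_labs_notebooks
The video part is not long: a short introduction of a little over ten minutes.
video: https://www.youtube.com/watch?v=9gzL3QQzvQ4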

708 views小熊猫, 08:00

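An introduction to QTE (quantile treatment effect), ATE (average treatment effect), and Local ATE from Uber Engineering, with a good deal of data science thinking from a product perspective.
link: https://eng.uber.com/analyzing-experiment-outcomes/
I also found an introduction to Local ATE on Zhihu: https://www.zhihu.com/question/32199571/answer/55792738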
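
To make the difference between ATE and QTE concrete, here is a minimal sketch using only the standard library. The data are made up and the function names are hypothetical; nothing here comes from the Uber article. The point is that a treatment that only moves the upper tail shows up as a single average under ATE, while the per-quantile view shows where the change happens.

```python
import statistics


def average_treatment_effect(control, treatment):
    """Difference in means between the treatment and control groups."""
    return statistics.mean(treatment) - statistics.mean(control)


def quantile_treatment_effect(control, treatment, n=4):
    """Difference between matching quantile cut points of the two groups.

    statistics.quantiles returns n - 1 cut points, so n=4 gives the quartiles.
    """
    c_cuts = statistics.quantiles(control, n=n)
    t_cuts = statistics.quantiles(treatment, n=n)
    return [t - c for c, t in zip(c_cuts, t_cuts)]


# Made-up metric values: the treatment only changes the two largest observations.
control = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
treatment = [10, 11, 12, 13, 14, 15, 16, 17, 28, 39]

print(average_treatment_effect(control, treatment))   # one number for the whole group
print(quantile_treatment_effect(control, treatment))  # lower quartiles unchanged, upper one shifted
```

A real analysis would add confidence intervals for each quantile; this sketch only computes point estimates.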

739 views小熊猫, edited 08:27

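An ML extension library that pairs well with scikit-learn. I have used it before; its main strengths are ensembling and convenient wrappers for common application-level tasks, since scikit-learn itself carries quite a few rarely used methods.
link: http://rasbt.github.io/mlxtend/
The author teaches in the statistics department at the University of Wisconsin-Madison and also wrote the book "Python Machine Learning".
book: https://www.amazon.com/Python-Machine-Learning-Sebastian-Raschka/dp/1783555130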

794 views小熊猫, 17:00

An example of doing EDA in R (slides titled "Data Science? Make it Spatial"); the author is from UChicago. https://angela-li.github.io/slides/2018-11-08/dc-r-presentation#1

706 views小熊猫, 17:09

flexdashboard: an R package for building interactive visualization dashboards in RStudio. Worth a try if you use RStudio; if you use Jupyter, it is probably less necessary. https://blog.rstudio.com/2016/05/17/flexdashboard-easy-interactive-dashboards-for-r/

710 views小熊猫, edited 17:17

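A tool stack for deploying and operating ML systems in production: model storage, data pipelines, ETL, feature engineering, and various performance optimizations. A collection of many tools that are practical from an engineering perspective.
link: https://github.com/EthicalML/awesome-machine-learning-operations
The author also gave a fairly short talk at EuroSciPy 2018: https://axsauze.github.io/scalable-data-science/#/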

710 views小熊猫, 21:41

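cuDF: a GPU DataFrame library with a pandas-like API. I thought NVIDIA also had a similar project, but I searched for a while just now and could not find it. This one comes from rapids.ai.
link: https://github.com/rapidsai/cudf
The team has other good projects as well, such as cuML, cuGraph, and visualization tools. They may be aiming to build a GPU data science ecosystem; worth keeping an eye on.
Team homepage: https://rapids.ai/
Team projects: https://github.com/RAPIDSai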

710 views小熊猫, 22:04

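XNLI dataset: the same kind of task as the earlier MultiNLI, but covering more languages, although the data in the additional languages was produced by translation. It also appears among the official examples in Google's BERT pre-trained release. https://code.fb.com/ai-research/xlni/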

703 views小熊猫, 22:08

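A markdown project that tracks progress across the subfields of NLP. Its definition of progress is a good one: everything is tied to a specific public dataset and the corresponding metrics, which makes it well suited to someone just entering a field. I skimmed text classification and summarization, and the coverage is fairly systematic. The one pity is that the tricks specific to each area (the ones practitioners take for granted) are not mentioned.
link: https://github.com/sebastianruder/NLP-progress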

709 views小熊猫, 22:14

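An unsupervised Statistical Machine Translation system from EMNLP 2018, with a BLEU score of 26.2 on WMT14, which is quite good. I don't know the translation field well: I have some hands-on experience with NMT, but I understand very little about traditional statistical MT.
Looking at the project's requirements, I spotted Moses, which seems to be an important tool from the earlier, traditional SMT era? (I last came across it in a project translating Classical Chinese into modern Chinese.)
code: https://github.com/artetxem/monoses
link: https://arxiv.org/abs/1809.01272
Moses: http://www.statmt.org/moses/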

722 views小熊猫, 22:23

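An example of feature engineering with featuretools. It is a decent tool that I also used in a recent Kaggle competition. For an unfamiliar domain with categorical data, using featuretools to generate a batch of higher-order combination features and then running a baseline works reasonably well.
That said, the tool has quite a few tricky parameters, and its runtime cost is fairly high.
link: https://medium.com/@rrfd/simple-automatic-feature-engineering-using-featuretools-in-python-for-classification-b1308040e183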

737 views小熊猫, edited 04:52

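A short article for quickly reviewing statistical concepts. The examples are good and it is well written. It covers the Bayesian and frequentist schools, the null hypothesis, Type I and Type II errors, and the p-value.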
link: https://towardsdatascience.com/statistics-for-people-in-a-hurry-a9613c0ed0b
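
As a concrete companion to the null hypothesis and p-value, here is a minimal permutation test written with the standard library only. It is a sketch, not something from the linked article: the function name, the toy data, and the choice of the absolute difference in means as the test statistic are all assumptions made for illustration. Under the null hypothesis the group labels are interchangeable, so the p-value is estimated as the share of random relabelings that produce a difference at least as large as the observed one.

```python
import random


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in means of two samples."""
    rng = random.Random(seed)

    def mean_diff(xs, ys):
        return abs(sum(xs) / len(xs) - sum(ys) / len(ys))

    observed = mean_diff(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel the samples at random
        if mean_diff(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # The add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


group_a = [1, 2, 3, 4, 5]
group_b = [11, 12, 13, 14, 15]
print(permutation_p_value(group_a, group_b))  # small: the groups barely overlap
print(permutation_p_value(group_a, group_a))  # 1.0: no observed difference at all
```

Rejecting the null hypothesis when it is actually true is a Type I error; the significance threshold you compare the p-value against is the rate of that error you are willing to accept.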

770 views小熊猫, edited 06:34

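Sebastian Raschka has finally finished part four of his blog series "Model evaluation, model selection, and algorithm selection in machine learning". It covers in great detail the steps that need to be considered when evaluating models, and it requires some background in statistics.
The first three installments were written two years ago, and I learned a lot from them at the time. Teachers with a strong statistics background tend to look at models and algorithms from a different angle. Highly recommended.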
link:
1. https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
2. https://sebastianraschka.com/blog/2016/model-evaluation-selection-part2.html
3. https://sebastianraschka.com/blog/2016/model-evaluation-selection-part3.html
4. https://sebastianraschka.com/blog/2018/model-evaluation-selection-part4.html
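
For a quick feel of the kind of resampling these posts discuss, here is a minimal k-fold cross-validation index generator in plain Python. It is a sketch written for this note, not code from the series, and the function name `k_fold_indices` is made up. It does not shuffle or stratify; in practice you would usually shuffle the indices first or use `sklearn.model_selection.KFold`.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs so that every sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # The first n_samples % k folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


for train_idx, test_idx in k_fold_indices(n_samples=10, k=3):
    print(len(train_idx), test_idx)
```

The model is refit on each `train_idx` and scored on the matching `test_idx`; the k scores are then averaged to estimate generalization performance.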