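Building an LSTM language model and using it as an input method editor (IME). I had seen a similar project by a Korean developer before; this time the authors are from the University of Tokyo and CMU, and the dataset is the Japanese BCCWJ corpus. The work actually dates from 2016, but putting a language model into an IME is a natural fit, and it still looks interesting.
paper: https://arxiv.org/pdf/1810.09309.pdf
code: https://github.com/yohokuno/neural_ime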
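An accessible introduction to LSTM autoencoders, and quite a clear one. It is just another tutorial, but it includes Keras code for the key parts to aid understanding. https://machinelearningmastery.com/lstm-autoencoders
Progress on and a summary of transfer learning with language models. It covers the current state-of-the-art LMs, including AllenNLP's ELMo, ULMFiT, OpenAI's Transformer, and BERT from Google, which has been all over everyone's feeds lately. https://drive.google.com/file/d/1kmNAwrSlFYo0cN_DcURMOArBwe9FxWxR/view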
A PyTorch implementation of BERT, including a script for converting the TensorFlow pre-trained models. The authors are from Hugging Face. https://github.com/huggingface/pytorch-pretrained-BERT
HotpotQA: a Wikipedia-based dataset of QA pairs.
paper: https://arxiv.org/abs/1809.09600
code: https://github.com/hotpotqa/hotpot
link: https://hotpotqa.github.io/
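Materials from the deep learning course in the ICL (Imperial College London) mathematics department, including PyTorch and TensorFlow tutorials and the assignments. I looked through the tutorials and found them very interesting: unlike the usual basic exercises, these are all popular applied projects, such as Question Answering and generative models with VAEs/GANs. Well worth a look. https://github.com/pukkapies/dl-imperial-maths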
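A tool that transpiles trained scikit-learn estimators to other languages, which gives more flexibility when serving predictions in production. I have had no need to dig into it yet, but it looks like a very worthwhile project, and it is actively updated at the moment. https://github.com/nok/sklearn-porter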
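To make the idea of transpiling a trained model concrete, here is a minimal sketch. It does not use sklearn-porter and is not its API: `linear_model_to_c`, the hard-coded weights, and the binary threshold rule are all hypothetical, chosen only to show how learned parameters can be baked into source code in another language.

```python
def linear_model_to_c(weights, bias, func_name="predict"):
    """Emit C source for a binary linear classifier (hypothetical helper)."""
    if not weights:
        raise ValueError("need at least one weight")
    # Bake each learned weight into the generated expression as a literal.
    terms = " + ".join(f"{w!r} * x[{i}]" for i, w in enumerate(weights))
    return (
        f"int {func_name}(const double *x) {{\n"
        f"    double score = {terms} + {bias!r};\n"
        f"    return score > 0.0 ? 1 : 0;\n"
        f"}}\n"
    )


# Made-up parameters standing in for a trained model's coefficients.
c_source = linear_model_to_c([0.5, -1.25], 0.1)
print(c_source)
```

The generated function has no dependency on Python or scikit-learn at prediction time, which is the flexibility the note above refers to.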
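An MPC solver from NIPS 2018, used as a control component inside reinforcement learning models. In the authors' words: "Specifically, we differentiate through MPC by using the KKT conditions of the convex approximation at a fixed point of the controller." The authors built it on PyTorch and released it as a PyTorch library. Earlier control methods did indeed all have limitations.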
paper: https://arxiv.org/abs/1810.13400
code: https://github.com/locuslab/mpc.pytorch
link: https://locuslab.github.io/mpc.pytorch/
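Yandex's NLP course materials. This Russian company is technically very strong and is also the home of CatBoost and ClickHouse.
link: https://github.com/yandexdataschool/nlp_course
While you are there, browse the organization's other repositories: https://github.com/yandexdataschool
These appear to be their open data science courses and are worth following.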
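A GBM experiment that benchmarks a pure Python + numba JIT implementation against LightGBM, the efficient gradient-boosted trees implementation optimized with histogram binning. I tried it out, and the code on the master branch already seems to perform about the same; the project is actively updated.
code: https://github.com/ogrisel/pygbm
About numba JIT: http://numba.pydata.org/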
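As a rough illustration of what histogram binning buys, the sketch below finds a split on one feature by scanning per-bin gradient sums instead of every sorted sample. It is not pygbm or LightGBM code. The assumptions are mine: equal-width bins, squared loss (so each sample's hessian is 1), no regularization, and the function names are hypothetical.

```python
def bin_feature(values, n_bins):
    """Map each raw value to an equal-width bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def build_histogram(bin_ids, gradients, n_bins):
    """Accumulate the gradient sum and sample count of every bin."""
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for b, g in zip(bin_ids, gradients):
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count


def best_split(grad_sum, count):
    """Scan bin boundaries left to right and return (bin index, gain)."""
    total_g, total_n = sum(grad_sum), sum(count)
    best_bin, best_gain = None, 0.0
    left_g, left_n = 0.0, 0
    for b in range(len(count) - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        # Gain under squared loss: hessian sums reduce to sample counts.
        gain = left_g ** 2 / left_n + right_g ** 2 / right_n - total_g ** 2 / total_n
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


# Toy data: two clusters whose gradients have opposite signs.
values = [0.1, 0.2, 0.3, 0.4, 2.1, 2.2, 2.3, 2.4]
gradients = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
bin_ids = bin_feature(values, 4)
grad_sum, count = build_histogram(bin_ids, gradients, 4)
split_bin, split_gain = best_split(grad_sum, count)
print(split_bin, split_gain)
```

The scan in `best_split` costs one pass over the bins rather than one pass over the samples, which is why the per-bin loops are the kind of tight numeric code a JIT compiler such as numba can speed up.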
An introductory article on the Wasserstein distance, very well written and accessible without being shallow. link: http://www.mindcodec.com/an-intuitive-guide-to-optimal-transport-for-machine-learning/
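For a feel of what the distance measures, here is a minimal sketch of the simplest case, not taken from the article: two empirical samples of equal size on the real line. In that setting the optimal transport plan pairs the i-th smallest point of one sample with the i-th smallest point of the other, so the Wasserstein-1 distance is the mean absolute difference of the sorted values. The function name and the sample data are made up for illustration.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size samples on the real line."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("this sketch assumes two non-empty samples of equal size")
    # In 1-D the optimal plan matches points in sorted order.
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)


a = [0.0, 1.0, 3.0]
b = [8.0, 5.0, 6.0]  # the same shape as `a`, shifted by 5 and shuffled
distance = wasserstein_1d(a, b)
print(distance)  # 5.0: every unit of mass travels a distance of 5
```

Unlike KL divergence, the result stays finite and meaningful even though the two samples do not overlap at all, which is a large part of its appeal in machine learning.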
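An introductory reinforcement learning course. From a quick look the quality is good and the material is fairly systematic; the code covers the details of the basic RL algorithms, and there are accompanying videos. The lecturer's accent is acceptable.
slides: http://pages.isir.upmc.fr/~sigaud/teach/english.html
code: https://github.com/osigaud/rl_labs_notebooks
The video is not long, a brief introduction of ten-odd minutes.
video: https://www.youtube.com/watch?v=9gzL3QQzvQ4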
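An introduction to QTE (quantile treatment effect), ATE (average treatment effect), and Local ATE from Uber Engineering, with a good deal of product-oriented data science thinking.
link: https://eng.uber.com/analyzing-experiment-outcomes/
I also came across an introduction to Local ATE on Zhihu: https://www.zhihu.com/question/32199571/answer/55792738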
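A small sketch of how the first two quantities differ, assuming a randomized experiment so that the treated and control groups are directly comparable. ATE is estimated as the difference in mean outcomes, QTE as the difference between the same quantile of the two outcome distributions. The data, the function names, and the linear-interpolation quantile rule are illustrative choices of mine, not taken from the Uber post; Local ATE is not covered here.

```python
from statistics import mean


def quantile(values, q):
    """q-th quantile (0 <= q <= 1) with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    frac = pos - lo
    return s[lo] * (1 - frac) + s[hi] * frac


def average_treatment_effect(treated, control):
    return mean(treated) - mean(control)


def quantile_treatment_effect(treated, control, q):
    return quantile(treated, q) - quantile(control, q)


# Made-up outcomes: one treated user has an extreme value.
control = [10.0, 12.0, 14.0, 16.0, 18.0]
treated = [11.0, 13.0, 15.0, 17.0, 44.0]

ate = average_treatment_effect(treated, control)
qte_median = quantile_treatment_effect(treated, control, 0.5)
print(ate, qte_median)
```

With this data the ATE is 6.0 while the median QTE is only 1.0: the average is pulled up by a single outlier, which is the kind of situation where looking at quantiles tells a different story from the mean.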
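An ML extension library that pairs well with scikit-learn. I have used it before; its main strengths are ensembling and convenient wrappers for common application-level tasks, since scikit-learn itself contains quite a few rarely used methods.
link: http://rasbt.github.io/mlxtend/
The author teaches in the statistics department at the University of Wisconsin-Madison and also wrote the book Python Machine Learning.
book: https://www.amazon.com/Python-Machine-Learning-Sebastian-Raschka/dp/1783555130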
An example of doing EDA in R; the author is from UChicago. https://angela-li.github.io/slides/2018-11-08/dc-r-presentation#1
flexdashboard, an add-on package for building interactive visualizations in RStudio. Worth a try if you use RStudio; if you use Jupyter it is probably less necessary. https://blog.rstudio.com/2016/05/17/flexdashboard-easy-interactive-dashboards-for-r/
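A tool stack for deploying and operating ML systems in production, covering model storage, data pipelines, ETL, feature engineering, and various performance optimizations. It collects many tools that are useful from an engineering perspective.
link: https://github.com/EthicalML/awesome-machine-learning-operations
The author also gave a fairly short talk at EuroSciPy 2018: https://axsauze.github.io/scalable-data-science/#/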
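cuDF: a GPU DataFrame library with a pandas-like API, from rapids.ai. NVIDIA seems to have a similar project, but I searched for a while just now and could not find it.
link: https://github.com/rapidsai/cudf
The team has other good projects as well, such as cuML, cuGraph, and visualization tools. They may be trying to build a GPU data science ecosystem, so it is worth following.
Team homepage: https://rapids.ai/
Team projects: https://github.com/RAPIDSai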
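The XNLI dataset, the same kind of task as the earlier MNLI (MultiNLI) but covering more languages, though the additional languages were obtained by translation. It also appears among the examples in the official implementation of Google's BERT pre-trained models. https://code.fb.com/ai-research/xlni/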
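A Markdown project that tracks progress across the subfields of NLP. Its definition of progress is a good one: everything is tied to a specific public dataset and the corresponding metrics, which makes it well suited to someone just entering a subfield. I skimmed text classification and summarization, and the coverage is fairly systematic. The one shortcoming is that the tricks specific to (and taken for granted in) each subfield are not mentioned.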
link: https://github.com/sebastianruder/NLP-progress
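An unsupervised statistical machine translation system from EMNLP 2018, with a BLEU score of 26.2 on WMT14, which is quite good. I do not know the translation field well: I have some hands-on experience with NMT but understand almost nothing about traditional statistical MT.
Looking at the project's requirements, I spotted Moses, which seems to be an important tool from the earlier, traditional SMT era. (I last saw it in a project translating classical Chinese into modern Chinese.)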
code: https://github.com/artetxem/monoses
link: https://arxiv.org/abs/1809.01272
Moses: http://www.statmt.org/moses/