Data Science Archive

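flexdashboard is an R package for building interactive visualization dashboards from within RStudio. Worth a try if you work in RStudio; probably less necessary if you use Jupyter. https://blog.rstudio.com/2016/05/17/flexdashboard-easy-interactive-dashboards-for-r/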

RStudio

flexdashboard: Easy interactive dashboards for R

Today we’re excited to announce flexdashboard, a new package that enables you to easily create flexible, attractive, interactive dashboards with R. Authoring and customization of dashboards is done using R Markdown and you can optionally include Shiny components…

710 views, 小熊猫, edited 17:17

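A tool stack for deploying and operating ML systems in production: model storage, data pipelines, ETL, feature engineering, and assorted performance optimization. A large collection of tools that are practical from an engineering standpoint.
link: https://github.com/EthicalML/awesome-machine-learning-operations
The author also gave a fairly short talk at EuroSciPy 2018: https://axsauze.github.io/scalable-data-science/#/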

GitHub

GitHub - EthicalML/awesome-production-machine-learning: A curated list of awesome open source libraries to deploy, monitor, version…

A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning - EthicalML/awesome-production-machine-learning

710 views, 小熊猫, 21:41

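cuDF: a GPU DataFrame library with a pandas-like API, from rapids.ai. NVIDIA seems to have a similar project too? I searched for a while just now but couldn't find it.
link: https://github.com/rapidsai/cudf
The team has other good projects as well, such as cuML, cuGraph, and visualization tools. They are probably aiming to build a GPU data science ecosystem; worth keeping an eye on.
Team homepage: https://rapids.ai/
Team projects on GitHub: https://github.com/RAPIDSai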

GitHub

GitHub - rapidsai/cudf: cuDF - GPU DataFrame Library

cuDF - GPU DataFrame Library. Contribute to rapidsai/cudf development by creating an account on GitHub.

710 views, 小熊猫, 22:04

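XNLI dataset: the same type of task as the earlier MNLI (MultiNLI), but covering more languages, although the data in those languages was produced by translation. The official example implementation in Google's pre-trained BERT project also includes it. https://code.fb.com/ai-research/xlni/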

Facebook Engineering

Facebook, NYU expand available languages for natural language understanding systems

The XNLI dataset, a collaboration between Facebook and NYU, builds on the MultiNLI corpus, adding 14 languages including low-resource languages.

703 views, 小熊猫, 22:08

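A markdown project that tracks progress across NLP subfields. Its definition of progress is sound: each result is tied to a specific public dataset and the corresponding metrics, which makes it a great fit for someone just getting started in a subfield. I skimmed text classification and summarization, and the coverage is fairly systematic. Unfortunately it doesn't mention the tricks that are specific to (or taken for granted in) each subfield.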
link: https://github.com/sebastianruder/NLP-progress

GitHub

GitHub - sebastianruder/NLP-progress: Repository to track the progress in Natural Language Processing (NLP), including the datasets…

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks. - sebastianruder/NLP-progress

709 views, 小熊猫, 22:14

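An unsupervised statistical machine translation system from EMNLP 2018, with a BLEU score of 26.2 on WMT14, which is quite good. I don't know the translation field well: I have some hands-on experience with NMT but understand almost nothing about traditional statistical MT.
I looked at the project's requirements and spotted Moses, which seems to be an important tool from the early days of traditional SMT? (I last came across it in a project translating Classical Chinese into modern Chinese.)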
code: https://github.com/artetxem/monoses
link: https://arxiv.org/abs/1809.01272
Moses: http://www.statmt.org/moses/

GitHub

GitHub - artetxem/monoses: Unsupervised Statistical Machine Translation

Unsupervised Statistical Machine Translation. Contribute to artetxem/monoses development by creating an account on GitHub.

722 views, 小熊猫, 22:23

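An example of feature engineering with featuretools. ft is a decent tool; I also used it in a recent Kaggle competition. If the domain is unfamiliar and the data is categorical, using ft to generate a batch of high-order combination features and then running a baseline works well.
That said, the tool has quite a few tricky parameters, and the time cost is fairly high.
link: https://medium.com/@rrfd/simple-automatic-feature-engineering-using-featuretools-in-python-for-classification-b1308040e183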
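
To make "high-order combination features" a bit more concrete, here is a minimal sketch in plain Python. It does not use featuretools at all; it only shows the idea of crossing categorical columns and attaching a frequency count to each crossed value. The function name, the column names, and the toy rows are all made up for illustration.

```python
from collections import Counter
from itertools import combinations

def cross_categoricals(rows, columns, order=2):
    """Add crossed categorical features and their frequencies to each row.

    rows: list of dicts; columns: names of categorical columns (hypothetical).
    order: how many columns are combined into one crossed feature.
    """
    combos = list(combinations(columns, order))
    # First pass: count how often each crossed value occurs in the data.
    counts = {
        combo: Counter(tuple(r[c] for c in combo) for r in rows)
        for combo in combos
    }
    out = []
    for r in rows:
        new = dict(r)
        for combo in combos:
            key = tuple(r[c] for c in combo)
            name = "_x_".join(combo)
            new[name] = "|".join(map(str, key))      # the crossed category
            new[name + "_count"] = counts[combo][key]  # its frequency
        out.append(new)
    return out

rows = [
    {"city": "A", "device": "ios", "plan": "free"},
    {"city": "A", "device": "ios", "plan": "paid"},
    {"city": "B", "device": "web", "plan": "free"},
]
features = cross_categoricals(rows, ["city", "device", "plan"])
print(features[0])
```

Even this toy version shows why the time cost grows quickly: the number of crossed features grows combinatorially with the number of columns and the order.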

Medium

Simple Automatic Feature Engineering — Using featuretools in Python for Classification

Preface

737 views, 小熊猫, edited 04:52

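A short article that quickly reviews statistical concepts, with good examples and good writing. It covers the Bayesian and frequentist schools, the null hypothesis, Type I and Type II errors, and the p-value.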
link: https://towardsdatascience.com/statistics-for-people-in-a-hurry-a9613c0ed0b
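
As a small worked example of the null hypothesis and the p-value (my own sketch, not taken from the article): suppose a coin lands heads 16 times in 20 flips, and the null hypothesis is that the coin is fair. The exact two-sided binomial p-value can be computed with the standard library alone. The function name and the numbers are illustrative assumptions.

```python
from math import comb

def binomial_two_sided_p(k, n, p0=0.5):
    """Exact two-sided p-value for k successes in n trials under H0: rate = p0.

    Sums the probability of every outcome that is at most as likely
    as the observed one.
    """
    pmf = [comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Small relative tolerance so ties are not lost to floating-point noise.
    return sum(p for p in pmf if p <= observed * (1 + 1e-9))

alpha = 0.05                       # accepted Type I error rate
p_value = binomial_two_sided_p(16, 20)
reject = p_value < alpha           # True means we reject the fair-coin hypothesis
print(round(p_value, 4), reject)
```

Here alpha is the Type I error rate we are willing to accept: the chance of rejecting a fair coin purely because of an unlucky run.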

Medium

Statistics for people in a hurry

Ever wished someone would just tell you what the point of statistics is and what the jargon means in plain English? Let me try to grant…

770 views, 小熊猫, edited 06:34

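Sebastian Raschka has finally finished the fourth part of his blog series "Model evaluation, model selection, and algorithm selection in machine learning". It covers in great detail the steps to consider when evaluating models, and assumes some background in statistics.
The first three installments were written two years ago, and I learned a lot from them back then. Instructors with a strong statistics background look at models and algorithms from a different angle. Highly recommended.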
link:
1. https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
2. https://sebastianraschka.com/blog/2016/model-evaluation-selection-part2.html
3. https://sebastianraschka.com/blog/2016/model-evaluation-selection-part3.html
4. https://sebastianraschka.com/blog/2018/model-evaluation-selection-part4.html

Sebastian Raschka, PhD

Model evaluation, model selection, and algorithm selection in machine learning

Machine learning has become a central part of our life -- as consumers, customers, and hopefully as researchers and practitioners! Whether we are applying...

796 views, 小熊猫, edited 06:48

A Chrome extension that opens notebooks in Colab with one click… https://chrome.google.com/webstore/detail/open-in-colab/iogfkhleblhcpcekbiedikdehleodpjo/related

Google

Open in Colab

Open a Github-hosted notebook in Google Colab

764 views, 小熊猫, 14:04

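PCam: a histopathology image dataset. It is not large, so a single GPU is enough to run some benchmarks on it. Texture-like images of this kind probably behave somewhat differently from other classification tasks; the recent salt identification competition on Kaggle is also a useful reference.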
link: http://basveeling.nl/posts/pcam/
github: https://github.com/basveeling/pcam

Bas's Blog

PCam: histopathology dataset for fundamental machine learning.

During my work[1] on deep learning models for histopathology, I’ve started to appreciate the tremendous barrier-to-entry that exists for machine learning researchers to evaluate their methods on large medical datasets. This is especially the case for histopathology…

782 views, 小熊猫, 14:09

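A Python script that draws network architecture diagrams automatically. Besides the usual formats, it can even output pptx. It supports common layers such as convolution, deconvolution, max/average/global pooling, and dense.
link: https://github.com/yu4u/convnet-drawer
It is also a sister project of draw_convnet.
link: https://github.com/gwding/draw_convnet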

GitHub

GitHub - yu4u/convnet-drawer: Python script for illustrating Convolutional Neural Networks (CNN) using Keras-like model definitions

Python script for illustrating Convolutional Neural Networks (CNN) using Keras-like model definitions - yu4u/convnet-drawer

788 views, 小熊猫, edited 04:10

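PyCM: a tool for multi-class confusion matrix analysis. It may be useful for evaluating results on particular classification problems, although the one built into scikit-learn, https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html, has basically been enough for me so far. From a quick look, PyCM supports more storage formats and more statistics.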
link: http://www.shaghighi.ir/pycm/
github: https://github.com/sepandhaghighi/pycm
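
For reference, the core of a multi-class confusion matrix fits in a few lines. The sketch below is plain Python and uses neither PyCM nor scikit-learn; the helper names and the toy labels are my own. It follows the scikit-learn convention that rows are true labels and columns are predicted labels.

```python
from collections import Counter

def build_confusion_matrix(y_true, y_pred, labels=None):
    """Return (labels, matrix) where matrix[i][j] counts samples whose
    true label is labels[i] and whose predicted label is labels[j]."""
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))
    return labels, [[pairs[(t, p)] for p in labels] for t in labels]

def per_class_stats(labels, matrix):
    """Compute precision and recall for each class from the matrix."""
    stats = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                  # rest of the true-label row
        fp = sum(row[i] for row in matrix) - tp   # rest of the predicted column
        stats[label] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return stats

y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat"]
labels, matrix = build_confusion_matrix(y_true, y_pred)
stats = per_class_stats(labels, matrix)
print(labels)
print(matrix)
```

Everything beyond this, such as the larger set of statistics and export formats, is where a dedicated library starts to pay off.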

http://www.pycm.ir

PyCM(Python confusion matrix) is a multi-class confusion matrix library in Python.

802 views, 小熊猫, 04:18

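The second edition of the RL bible by Andrew Barto and Richard Sutton. I don't have much time to study RL for now, so I'll just consult it when I need to. There was a draft version last year (or the year before?), but I didn't read that either…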
link: https://drive.google.com/file/d/1opPSz5AZ_kVa1uWOdOiveNiBFiEOHjkG/view

811 views, 小熊猫, edited 04:29

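A short book on explaining black-box models, and the quality is quite good. This morning I read the Feature Interaction and Feature Importance sections: very systematic, with some explanations from a statistical angle that I hadn't thought of, and right on point. Worth a close read.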
link: https://christophm.github.io/interpretable-ml-book/
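
One model-agnostic way to measure feature importance is permutation importance: shuffle one feature column and see how much the error grows. Below is a minimal sketch in plain Python, assuming a squared-error metric and the "difference" variant (permuted error minus baseline error). It is not code from the book; the toy model and data are invented.

```python
import random

def mean_squared_error(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Average increase in MSE when each feature column is shuffled.

    predict: function mapping one row (list of floats) to a prediction.
    """
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            error = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

def toy_model(row):
    # Hypothetical model: depends strongly on x0, weakly on x1, not on x2.
    return 3.0 * row[0] + 0.5 * row[1]

data_rng = random.Random(42)
X = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
y = [toy_model(row) for row in X]
importances = permutation_importance(toy_model, X, y)
print(importances)
```

The unused third feature gets an importance of exactly zero here, because shuffling it cannot change any prediction.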

christophm.github.io

Interpretable Machine Learning

879 views, 小熊猫, 04:35

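The Art Institute of Chicago has released a large set of very high-quality artwork images without restriction, under a Creative Commons Zero license.
The quality really is excellent. I couldn't find a bulk download; open an individual artwork and click the download button at the bottom right. They should work well for neural style transfer, GANs, or other fun experiments. The collection is large too: according to kottke there should be about 50k images.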
link: https://kottke.org/18/11/the-art-institute-of-chicago-has-put-50000-high-res-images-from-their-collection-online
link: https://www.artic.edu/collection?is_public_domain=1

Kottke.org

The Art Institute of Chicago Has Put 50,000 High-Res Images from Their Collection Online

The Art Institute of Chicago recently unveiled a new website design. As part of their first design upgrade in 6 years, they have placed more than 52

947 views, 小熊猫, edited 10:10

The HuggingFace PyTorch BERT implementation has added FP16 support, along with more features such as multi-GPU and distributed training.
link: https://github.com/huggingface/pytorch-pretrained-BERT

907 views, 小熊猫, edited 17:05

A BigGAN demo on TF Hub. BigGAN was the thing I found most fun last month, but lately it feels like BERT has stolen its thunder…
https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/biggan_generation_with_tf_hub.ipynb

Google

Google Colab Notebook

Run, share, and edit Python notebooks

928 views, 小熊猫, edited 02:16

A slide deck on fine-tuning with ULMFiT. I don't know the author's background yet; they tagged Jeremy Howard when posting it…
https://docs.google.com/presentation/d/1eqFVk0OaYTcXOfcBtcBRyuPDPmX9_GsMxRzo-HxsvC0/edit#slide=id.p1

Google Docs

ULMFiT

Universal Language Model Fine-tuning for Text Classification Presented by Asutosh Sahoo B115017 CSE, 7th Semester

931 views, 小熊猫, edited 06:17

A loss monitor: https://www.wandb.com/blog/monitor-your-pytorch-models-with-five-extra-lines-of-code
Probably a bit simpler than setting up Visdom/TensorBoard yourself.

wandb.ai

Monitor Your PyTorch Models With Five Extra Lines of Code on Weights & Biases

by Lukas Biewald — I love PyTorch and I love experiment tracking, here's how to do both!

996 views, 小熊猫, edited 06:21

Training techniques for massive GPU clusters. It appears to rely on good control of the mini-batch size, plus a 2D-Torus all-reduce to synchronize gradient updates across GPUs. Just submitted to arXiv, from a team at SONY. The paper title is catchy too: ImageNet/ResNet-50 Training in 224 Seconds.

This work: Tesla V100 x1088, Infiniband EDR x2, 91.62% GPU scaling efficiency

https://arxiv.org/abs/1811.05233
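
As far as I understand it, a 2D-Torus all-reduce arranges the GPUs in a grid and runs three steps: reduce-scatter along each row, all-reduce along each column, then all-gather along each row. The sketch below simulates only the arithmetic of those steps in plain Python on a tiny grid. It models no real communication and is not the paper's implementation; the function name and grid layout are my own assumptions.

```python
def torus_2d_all_reduce(grid):
    """Simulate the result of a 2D-torus all-reduce on a rows x cols GPU grid.

    grid[r][c] is the local gradient (a list of numbers) on GPU (r, c).
    Only the arithmetic is modeled; ring communication is not.
    """
    rows, cols = len(grid), len(grid[0])
    size = len(grid[0][0])
    if size % cols != 0:
        raise ValueError("gradient length must be divisible by the column count")
    chunk = size // cols

    # Step 1: reduce-scatter within each row.
    # GPU (r, c) ends up holding chunk c, summed over the GPUs of row r.
    owned = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            lo, hi = c * chunk, (c + 1) * chunk
            owned[r][c] = [sum(grid[r][k][i] for k in range(cols))
                           for i in range(lo, hi)]

    # Step 2: all-reduce within each column.
    # Every GPU in column c now holds chunk c summed over the whole grid.
    for c in range(cols):
        total = [sum(owned[r][c][i] for r in range(rows)) for i in range(chunk)]
        for r in range(rows):
            owned[r][c] = list(total)

    # Step 3: all-gather within each row.
    # Each GPU collects the fully reduced chunks from its row neighbours.
    result = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        full = [v for c in range(cols) for v in owned[r][c]]
        for c in range(cols):
            result[r][c] = list(full)
    return result

rows, cols, size = 2, 2, 4
grid = [[[float(r * cols + c + i) for i in range(size)]
         for c in range(cols)] for r in range(rows)]
result = torus_2d_all_reduce(grid)
expected = [sum(grid[r][c][i] for r in range(rows) for c in range(cols))
            for i in range(size)]
print(result[0][0] == expected)
```

The point of the layout is that the column step only moves 1/cols of the gradient per GPU, which is presumably where much of the scaling efficiency comes from.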

1K views, 小熊猫, 07:34
