Data Science Archive

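PCam: a histopathology image dataset. It is small enough to run benchmarks on a single GPU. Texture-like images of this kind seem to behave somewhat differently from other classification tasks; the recent salt identification competition on Kaggle is also worth a look for reference.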
link: http://basveeling.nl/posts/pcam/
github: https://github.com/basveeling/pcam

Bas's Blog

PCam: histopathology dataset for fundamental machine learning.

During my work[1] on deep learning models for histopathology, I’ve started to appreciate the tremendous barrier-to-entry that exists for machine learning researchers to evaluate their methods on large medical datasets. This is especially the case for histopathology…

782 views小熊猫, 14:09

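A Python script that draws network architecture diagrams automatically. Besides the usual output formats, it even supports pptx. Common layers are all supported: convolution, deconvolution, max/average/global pooling, and dense.
link: https://github.com/yu4u/convnet-drawer
It is also a sister project of draw_convnet.
link: https://github.com/gwding/draw_convnet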

GitHub

GitHub - yu4u/convnet-drawer: Python script for illustrating Convolutional Neural Networks (CNN) using Keras-like model definitions

Python script for illustrating Convolutional Neural Networks (CNN) using Keras-like model definitions - yu4u/convnet-drawer

788 views小熊猫, edited 04:10

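PyCM: a tool for multi-class confusion matrix analysis. It may be useful for evaluating results on specific classification problems, although the one built into scikit-learn, https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html, has mostly been enough for me so far. From a quick look, PyCM supports more storage formats and a larger set of statistics.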
link: http://www.shaghighi.ir/pycm/
github: https://github.com/sepandhaghighi/pycm
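
For reference, a minimal sketch of what a multi-class confusion matrix and its per-class statistics look like. This is plain Python, not PyCM's or scikit-learn's API; the function names and the toy labels are made up for illustration. Rows are true labels and columns are predicted labels, the same orientation scikit-learn uses.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] = number of samples with true label labels[i] predicted as labels[j]."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_class_precision_recall(matrix, labels):
    """Return {label: (precision, recall)} computed from the matrix."""
    stats = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        predicted = sum(row[k] for row in matrix)  # column sum: everything predicted as this class
        actual = sum(matrix[k])                    # row sum: everything that truly is this class
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        stats[label] = (precision, recall)
    return stats


# Toy data (hypothetical), three classes.
labels = ["cat", "dog", "bird"]
y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat"]

cm = confusion_matrix(y_true, y_pred, labels)
stats = per_class_precision_recall(cm, labels)
```

Libraries such as PyCM compute many more statistics from the same matrix; precision and recall are shown here only because they are the easiest to check by hand.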

http://www.pycm.ir

PyCM(Python confusion matrix) is a multi-class confusion matrix library in Python.

802 views小熊猫, 04:18

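The second edition of the RL bible by Richard Sutton and Andrew Barto. I do not have much time to study RL right now, so I will just flip through it when needed. There seems to have been a draft version last year (or the year before?), but I never read that either…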
link: https://drive.google.com/file/d/1opPSz5AZ_kVa1uWOdOiveNiBFiEOHjkG/view

811 views小熊猫, edited 04:29

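A short book on explaining black-box models, and the quality is quite good. I read the Feature Interaction and Feature Importance parts this morning: they are very systematic and include some explanations from a statistical angle that I had not thought of before. Worth a careful read.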
link: https://christophm.github.io/interpretable-ml-book/

christophm.github.io

Interpretable Machine Learning

879 views小熊猫, 04:35

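The Art Institute of Chicago has released a large set of very high-quality images of artworks, without restriction, under a Creative Commons Zero license.
The quality really is excellent. I did not find a bulk download; open each artwork and click the download button at the bottom right. These should work well for neural style transfer, GANs, or other fun experiments. The collection is also large: according to kottke, about 50k images.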
link: https://kottke.org/18/11/the-art-institute-of-chicago-has-put-50000-high-res-images-from-their-collection-online
link: https://www.artic.edu/collection?is_public_domain=1

Kottke.org

The Art Institute of Chicago Has Put 50,000 High-Res Images from Their Collection Online

The Art Institute of Chicago recently unveiled a new website design. As part of their first design upgrade in 6 years, they have placed more than 52

947 views小熊猫, edited 10:10

HuggingFace's PyTorch BERT implementation has added FP16 support, along with more features such as multi-GPU and distributed training.
link: https://github.com/huggingface/pytorch-pretrained-BERT

907 views小熊猫, edited 17:05

A BigGAN demo on TF Hub. BigGAN was the thing I found most fun last month, though lately it feels like BERT has stolen the spotlight…
https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/biggan_generation_with_tf_hub.ipynb

Google

Google Colab Notebook

Run, share, and edit Python notebooks

928 views小熊猫, edited 02:16

A slide deck on fine-tuning with ULMFiT. The author's background is not clear to me yet; the post tagged Jeremy Howard when it went out…
https://docs.google.com/presentation/d/1eqFVk0OaYTcXOfcBtcBRyuPDPmX9_GsMxRzo-HxsvC0/edit#slide=id.p1

Google Docs

ULMFiT

Universal Language Model Fine-tuning for Text Classification Presented by Asutosh Sahoo B115017 CSE, 7th Semester

931 views小熊猫, edited 06:17

A loss monitor: https://www.wandb.com/blog/monitor-your-pytorch-models-with-five-extra-lines-of-code
Probably a bit simpler than setting up Visdom/TensorBoard yourself.

wandb.ai

Monitor Your PyTorch Models With Five Extra Lines of Code on Weights & Biases

by Lukas Biewald — I love PyTorch and I love experiment tracking, here's how to do both!

996 views小熊猫, edited 06:21

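Training tricks for a massive GPU cluster. It appears to rely on good control of the mini-batch size, plus a 2D-Torus all-reduce to synchronize gradient updates across GPUs. Just submitted to arXiv, from a team at SONY. The paper title is fun too: ImageNet/ResNet-50 Training in 224 Seconds.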

This work: Tesla V100 x1088, InfiniBand EDR x2, 91.62% GPU scaling efficiency

https://arxiv.org/abs/1811.05233
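
To make the 2D-Torus all-reduce idea concrete, here is a small single-process simulation, written under my own assumptions: workers sit on a rows x cols grid, and the reduction runs as reduce-scatter along each row, all-reduce along each column, then all-gather along each row. The function name and the toy gradients are hypothetical, and the code only reproduces the data flow, not real ring communication or the paper's implementation.

```python
def torus_2d_all_reduce(grads, rows, cols):
    """Simulate a 2D-torus-style all-reduce on a rows x cols grid of workers.

    grads[r][c] is the gradient vector (list of floats) held by worker (r, c).
    The vector length must be divisible by cols.
    Returns the per-worker vectors after the reduction.
    """
    n = len(grads[0][0])
    assert n % cols == 0, "vector length must be divisible by cols"
    chunk = n // cols

    # Step 1: reduce-scatter along each row.
    # Worker (r, c) ends up owning the row-wise sum of chunk c only.
    owned = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            lo, hi = c * chunk, (c + 1) * chunk
            owned[r][c] = [
                sum(grads[r][k][i] for k in range(cols)) for i in range(lo, hi)
            ]

    # Step 2: all-reduce along each column.
    # Every worker in column c now holds the global sum of chunk c.
    for c in range(cols):
        total = [sum(owned[r][c][i] for r in range(rows)) for i in range(chunk)]
        for r in range(rows):
            owned[r][c] = list(total)

    # Step 3: all-gather along each row.
    # Each worker collects all chunks from its row and holds the full summed vector.
    result = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        full = [x for c in range(cols) for x in owned[r][c]]
        for c in range(cols):
            result[r][c] = list(full)
    return result


# Toy setup (hypothetical): 2 x 3 grid, each worker holds a length-6 gradient.
rows, cols = 2, 3
grads = [
    [[float(10 * (r * cols + c + 1) + i) for i in range(6)] for c in range(cols)]
    for r in range(rows)
]
reduced = torus_2d_all_reduce(grads, rows, cols)
```

In this scheme the column step only moves one chunk (1/cols of the vector) per worker, which is presumably where the saving over a single flat ring comes from.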

1K views小熊猫, 07:34

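A paper on lyric generation from the NIPS 2018 creativity workshop. For generative models, especially on problems that need some creativity, traditional NLP metrics (such as BLEU, commonly used in translation) do not work very well: outputs with high scores do not necessarily feel good to a human.
I had the same impression when I built a couplet generator earlier: some models scored well on both BLEU and perplexity, yet the output did not look good at all.
For lyric generation, the paper trains two language models, one on a lyrics corpus and one on a book corpus, so that the output has the traits of lyrics (apparently rhyme, parallelism, and repetition for emotional emphasis) as well as those of books (rich vocabulary and varied expression).
This should be a useful reference, especially for generation problems that need creativity; the idea of ensembling several different language models is particularly worth borrowing.
This workshop is one I have always followed at NIPS; it regularly has a lot of interesting papers.
The authors are from Google Brain.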
workshop homepage: https://nips2018creativity.github.io/
paper: https://arxiv.org/abs/1811.04651
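
A rough sketch of the two ideas mentioned above: perplexity as a metric, and combining two language models. The combination shown here is plain linear interpolation of next-token distributions, which is a standard technique I am swapping in for illustration and not necessarily what the paper does. All names, probabilities, and the weight are made up.

```python
import math


def interpolate(p_lyrics, p_books, weight):
    """Mix two next-token distributions: weight * lyrics + (1 - weight) * books."""
    vocab = set(p_lyrics) | set(p_books)
    return {
        tok: weight * p_lyrics.get(tok, 0.0) + (1.0 - weight) * p_books.get(tok, 0.0)
        for tok in vocab
    }


def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Hypothetical next-token distributions from a lyrics LM and a books LM.
p_lyrics = {"love": 0.6, "baby": 0.3, "night": 0.1}
p_books = {"love": 0.2, "night": 0.3, "melancholy": 0.5}

mixed = interpolate(p_lyrics, p_books, weight=0.5)
```

Lower perplexity only says the model assigns high probability to held-out text; it says nothing about whether sampled lyrics feel creative, which is the gap the post is pointing at.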

Machine Learning for Creativity and Design

Introduction

NeurIPS 2018 Workshop, Montreal, Canada

1.06K views小熊猫, edited 17:42

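Gael Varoquaux's tutorial on model interpretation, given at EuroSciPy. His blog is always full of solid material, and I plan to study this one properly over the weekend. French words sometimes get mixed into his posts, but it does not really get in the way of understanding; you get used to it……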
link: http://gael-varoquaux.info/interpreting_ml_tuto/#

1.06K views小熊猫, 10:05

A recap of EMNLP 2018. It looks good and is best read alongside the papers. The blog itself is nice too.
link: https://supernlp.github.io/2018/11/10/emnlp-2018/

1.16K views小熊猫, 10:07

An experiment log of ULMFiT/BERT/ELMo-style pre-training in spaCy.
https://github.com/explosion/spaCy/pull/2931

GitHub

💫 Add experimental ULMFit/BERT/Elmo-like pretraining by honnibal · Pull Request #2931 · explosion/spaCy

Add support for a new command, spacy pretrain:
usage: spacy pretrain [-h] [-cw 128] [-cd 4] [-er 1000] [-d 0.2] [-i 1] [-s 0]
texts_loc vectors_model output_dir

Pre-train...

1.25K views小熊猫, 10:12

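A recap of tricks and pitfalls in training and evaluating GCNs. From a quick look, it describes many training details and covers a lot of the key parts of building a GCN.
I previously tried a plain GCN for text classification: https://arxiv.org/abs/1809.05679
I also built my own implementation. GCN for text classification does work, and it is much faster than methods like TextCNN.
link: https://arxiv.org/abs/1811.05868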
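
For readers who have not seen one, a minimal sketch of a single GCN layer using the standard propagation rule ReLU(D^-1/2 (A + I) D^-1/2 H W). It is plain Python on a toy three-node graph, not my implementation or the code from either paper; the graph, features, and weights are made up.

```python
import math


def matmul(x, y):
    """Multiply two dense matrices given as lists of rows."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def gcn_layer(adj, features, weight):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Degree of each node in the self-looped graph.
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization: divide entry (i, j) by sqrt(deg_i * deg_j).
    norm = [
        [a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)
    ]
    hidden = matmul(matmul(norm, features), weight)
    # ReLU non-linearity.
    return [[max(0.0, v) for v in row] for row in hidden]


# Toy path graph 0 - 1 - 2 with 2-dimensional node features (hypothetical values).
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, -1.0], [0.5, 1.0]]

out = gcn_layer(adj, features, weight)
```

A real implementation would use sparse matrices, and for text classification the nodes would be documents and words rather than this toy graph.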

1.22K views小熊猫, edited 10:25

The huggingface folks have published the package to PyPI, so lazy mode is now available……
link: https://github.com/huggingface/pytorch-pretrained-BERT

GitHub

GitHub - huggingface/transformers: 🤗 Transformers: the model-definition framework for state-of-the-art machine learning models…

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training. - GitHub - huggingface/t...

1.26K views小熊猫, 18:08

A slide deck from PyData Warsaw 2018 on NLP summarization.
https://ghostweather.slides.com/lynncherny/tl-dr-summarization#/6
The author's page has other good slide decks as well, including the Google one, the stories we tell. They are all well written and good for a quick recap.

Slides

Tl;dr: Summarization.

A talk overviewing NLP summarization goals and metrics, given as keynote at PyData Warsaw, with some non-news experiments and commentary on artistic applications.

1.3K views小熊猫, 03:04
