Data Science Archive – Telegram

Data Science Archive

@datasciencearchive

1.72K subscribers

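The personal toolbox of 小熊猫 (Red Panda), plus some scattered notes. Expect roughly the following:

* Interesting, valuable, or SOTA conference papers and code
* Progress in natural language processing, computer vision, and speech signal processing
* Experience from Kaggle and other algorithm competitions
* Engineering of anti-cheating, search, and personalized recommendation algorithm products
* Tools for statistical learning, matrix computation, and Bayesian methods
* Visualization, plus storage, parallel, and distributed computing tools for algorithm services

I hope the information I collect helps you too. If you have suggestions or are looking for job opportunities, feel free to email me: jinyzho@microsoft.com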


Data Science Archive

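A worked example of feature engineering with featuretools. ft is a decent tool, and I used it in my last Kaggle competition. In an unfamiliar domain with categorical data, it's worth letting ft extract a batch of high-order combination features first and then running a baseline on them.
That said, the tool has quite a few tricky parameters, and it can take a long time to run.
link: https://medium.com/@rrfd/simple-automatic-feature-engineering-using-featuretools-in-python-for-classification-b1308040e183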
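
To make the idea of automatically generated aggregate features concrete, here is a minimal standard-library sketch. It does not use the featuretools API; the function `aggregate_child_rows`, the table layout, and the column names are all hypothetical. It only illustrates the basic building block such tools automate and stack: applying a set of aggregation primitives across a parent-child relationship.

```python
from collections import defaultdict
from statistics import mean


def aggregate_child_rows(rows, key, value, primitives):
    """Group child rows by a parent key and apply each aggregation primitive.

    Hypothetical helper for illustration only, not part of featuretools.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {
        parent: {f"{name}({value})": fn(values) for name, fn in primitives.items()}
        for parent, values in groups.items()
    }


# Toy child table: one row per transaction, keyed by the parent customer.
transactions = [
    {"customer_id": 1, "amount": 10},
    {"customer_id": 1, "amount": 30},
    {"customer_id": 2, "amount": 5},
]
primitives = {"SUM": sum, "MEAN": mean, "COUNT": len, "MAX": max}

features = aggregate_child_rows(transactions, "customer_id", "amount", primitives)
for customer, feats in sorted(features.items()):
    print(customer, feats)
```

Stacking more primitives or more levels of relationships multiplies the number of generated columns, which is one plausible reason the run time grows quickly.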

Simple Automatic Feature Engineering — Using featuretools in Python for Classification

737 views · 小熊猫, edited 04:52

Data Science Archive

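A short article for quickly reviewing statistical concepts. The examples are well chosen and the writing is good. It covers Bayesian vs. frequentist statistics, the null hypothesis, Type I and Type II errors, and the p-value.
link: https://towardsdatascience.com/statistics-for-people-in-a-hurry-a9613c0ed0b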
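
As a small illustration of the null hypothesis, the p-value, and the Type I error rate, the sketch below runs an exact two-sided binomial test on a coin-flip experiment. The scenario (9 heads in 10 flips) and the function name are made up for illustration and are not taken from the linked article.

```python
from math import comb


def binomial_two_sided_p_value(heads, flips, p_null=0.5):
    """Exact two-sided p-value under the null hypothesis P(heads) = p_null.

    Sums the probability of every outcome that is at most as likely
    as the observed one.
    """
    pmf = [
        comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
        for k in range(flips + 1)
    ]
    observed = pmf[heads]
    # Small tolerance so that ties are counted despite float rounding.
    return sum(p for p in pmf if p <= observed * (1 + 1e-9))


alpha = 0.05  # accepted Type I error rate: chance of rejecting a true null
p_value = binomial_two_sided_p_value(heads=9, flips=10)
print(f"p-value = {p_value:.4f}, reject the null at alpha = {alpha}: {p_value < alpha}")
```

A Type II error would be the opposite mistake: failing to reject the null when the coin really is biased.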

Statistics for people in a hurry

Ever wished someone would just tell you what the point of statistics is and what the jargon means in plain English? Let me try to grant…

770 views · 小熊猫, edited 06:34

Data Science Archive

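Sebastian Raschka has finally finished part 4 of his blog series "Model evaluation, model selection, and algorithm selection in machine learning". It covers in great detail the aspects to consider when evaluating models, and it requires some background in statistics.
The first three parts were written two years ago, and I learned a lot from them back then. People with a strong statistics background look at models and algorithms from a somewhat different angle. Highly recommended.
links:
1. https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
2. https://sebastianraschka.com/blog/2016/model-evaluation-selection-part2.html
3. https://sebastianraschka.com/blog/2016/model-evaluation-selection-part3.html
4. https://sebastianraschka.com/blog/2018/model-evaluation-selection-part4.html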

Sebastian Raschka, PhD

Model evaluation, model selection, and algorithm selection in machine learning

Machine learning has become a central part of our life -- as consumers, customers, and hopefully as researchers and practitioners! Whether we are applying...

796 views · 小熊猫, edited 06:48

Data Science Archive

A Chrome extension that opens a notebook in Colab with one click: https://chrome.google.com/webstore/detail/open-in-colab/iogfkhleblhcpcekbiedikdehleodpjo/related

Open a Github-hosted notebook in Google Colab

764 views · 小熊猫, 14:04

Data Science Archive

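PCam is a histopathology image dataset. It is not large, so it can be used to run benchmarks on a single GPU. Texture-like images of this kind seem to behave somewhat differently from other classification tasks; the recent salt-identification competition on Kaggle is also worth a look for reference.
link: http://basveeling.nl/posts/pcam/
github: https://github.com/basveeling/pcam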

PCam: histopathology dataset for fundamental machine learning.

During my work[1] on deep learning models for histopathology, I’ve started to appreciate the tremendous barrier-to-entry that exists for machine learning researchers to evaluate their methods on large medical datasets. This is especially the case for histopathology…

782 views · 小熊猫, 14:09

Data Science Archive

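A Python script that draws network architecture diagrams automatically. Besides the usual formats, it can even export pptx. It supports the common layers: convolution, deconvolution, max/average/global pooling, and dense.
link: https://github.com/yu4u/convnet-drawer
It is also a sister project of draw_convnet.
link: https://github.com/gwding/draw_convnet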

GitHub - yu4u/convnet-drawer: Python script for illustrating Convolutional Neural Networks (CNN) using Keras-like model definitions

Python script for illustrating Convolutional Neural Networks (CNN) using Keras-like model definitions - yu4u/convnet-drawer

788 views · 小熊猫, edited 04:10

Data Science Archive

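PyCM: a tool for multi-class confusion matrix analysis. It may come in handy for evaluating results on particular classification problems, although scikit-learn's built-in https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html has mostly been enough for me so far. From a quick look, PyCM supports more storage formats and more statistical measures.
link: http://www.shaghighi.ir/pycm/
github: https://github.com/sepandhaghighi/pycm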
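
For readers unfamiliar with multi-class confusion matrices, the following standard-library sketch shows what such tools compute at their core. It uses neither the PyCM nor the scikit-learn API; the function names and the toy labels are hypothetical.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Return (labels, matrix) where matrix[i][j] counts true label i predicted as j."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = Counter(zip(y_true, y_pred))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix


def per_class_metrics(labels, matrix):
    """Compute one-vs-rest precision and recall for every class."""
    metrics = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                  # rest of the true-label row
        fp = sum(row[i] for row in matrix) - tp   # rest of the predicted column
        metrics[label] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return metrics


y_true = ["cat", "cat", "dog", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "cat"]

labels, matrix = confusion_matrix(y_true, y_pred)
for label, row in zip(labels, matrix):
    print(f"{label:>5}", row)
print(per_class_metrics(labels, matrix))
```

Dedicated libraries add many more statistics on top of this table, which matches the observation that PyCM offers more measures than the scikit-learn function.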

http://www.pycm.ir

PyCM(Python confusion matrix) is a multi-class confusion matrix library in Python.

802 views · 小熊猫, 04:18

Data Science Archive

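The second edition of the RL bible by Richard Sutton and Andrew Barto. I don't have much time to study RL for now, so I'll flip through it when I need to. I think a draft version was available last year (or the year before?), but I never read that either…
link: https://drive.google.com/file/d/1opPSz5AZ_kVa1uWOdOiveNiBFiEOHjkG/view

811 views · 小熊猫, edited 04:29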

Data Science Archive

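A short book on interpreting black-box models, and the quality is quite good. I read the Feature Interaction and Feature Importance sections this morning: they are very systematic and give some explanations from a statistical angle that I hadn't thought of, all on point. Worth a careful read.
link: https://christophm.github.io/interpretable-ml-book/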

christophm.github.io

Interpretable Machine Learning

879 views · 小熊猫, 04:35

Data Science Archive

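The Art Institute of Chicago has released a large set of very high-quality artwork images without restriction, under a Creative Commons Zero license.
The quality really is superb. I couldn't find a bulk download; open the page of each artwork and click the download button in the lower right corner. The images should work well for neural style transfer, GANs, or other fun experiments. The collection is also large: according to kottke, there are around 50k images.
link: https://kottke.org/18/11/the-art-institute-of-chicago-has-put-50000-high-res-images-from-their-collection-online
link: https://www.artic.edu/collection?is_public_domain=1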

The Art Institute of Chicago Has Put 50,000 High-Res Images from Their Collection Online

The Art Institute of Chicago recently unveiled a new website design. As part of their first design upgrade in 6 years, they have placed more than 52

947 views · 小熊猫, edited 10:10

Data Science Archive

The PyTorch BERT implementation from HuggingFace has added FP16 support, along with more features such as multi-GPU and distributed training.
link: https://github.com/huggingface/pytorch-pretrained-BERT

907 views · 小熊猫, edited 17:05

Data Science Archive

A BigGAN demo on TF Hub. BigGAN was the thing I found most fun last month, though it feels like BERT has stolen its thunder lately…
https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/biggan_generation_with_tf_hub.ipynb

Google Colab Notebook

Run, share, and edit Python notebooks

928 views · 小熊猫, edited 02:16

Data Science Archive

A slide deck on fine-tuning with ULMFiT. I don't know the author's background yet; they tagged Jeremy Howard when posting it…
https://docs.google.com/presentation/d/1eqFVk0OaYTcXOfcBtcBRyuPDPmX9_GsMxRzo-HxsvC0/edit#slide=id.p1

Universal Language Model Fine-tuning for Text Classification Presented by Asutosh Sahoo B115017 CSE, 7th Semester

931 views · 小熊猫, edited 06:17

Data Science Archive

A loss monitor: https://www.wandb.com/blog/monitor-your-pytorch-models-with-five-extra-lines-of-code
Probably a bit simpler than setting up Visdom or TensorBoard yourself.

Monitor Your PyTorch Models With Five Extra Lines of Code on Weights & Biases

by Lukas Biewald — I love PyTorch and I love experiment tracking, here's how to do both!

996 views · 小熊猫, edited 06:21

Data Science Archive

Training tricks for a massive GPU cluster. It looks like the key ingredients are good control of the mini-batch size, plus a 2D-Torus all-reduce to synchronize gradient updates across GPUs. Just submitted to arXiv by a team at Sony. The paper title is also fun: ImageNet/ResNet-50 Training in 224 Seconds.

This work: Tesla V100 x1088, Infiniband EDR x2, 91.62% GPU scaling efficiency

https://arxiv.org/abs/1811.05233
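
To give a rough feel for how a 2D-Torus all-reduce could synchronize gradients, here is a sequential, single-process simulation. It assumes the commonly described three-phase scheme (reduce-scatter along each row, all-reduce along each column, all-gather along each row); the function name, grid layout, and toy data are hypothetical, and real implementations run these phases in parallel over the network rather than in Python loops.

```python
def torus_all_reduce(grads, rows, cols):
    """Simulate a 2D-torus all-reduce over a rows x cols grid of workers.

    grads[r][c] is the gradient vector held by worker (r, c).
    Returns the summed gradient as seen by every worker.
    Assumption: the vector length is divisible by cols.
    """
    n = len(grads[0][0])
    assert n % cols == 0, "vector length must be divisible by cols"
    chunk = n // cols

    # Phase 1: reduce-scatter within each row.
    # Worker (r, c) ends up owning the row-wise sum of chunk c.
    owned = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            owned[r][c] = [
                sum(grads[r][k][i] for k in range(cols))
                for i in range(c * chunk, (c + 1) * chunk)
            ]

    # Phase 2: all-reduce within each column.
    # Chunk c is summed over all rows, giving the global sum of that chunk.
    column_sum = [
        [sum(owned[r][c][i] for r in range(rows)) for i in range(chunk)]
        for c in range(cols)
    ]

    # Phase 3: all-gather within each row.
    # Every worker collects all chunks and now holds the full summed vector.
    full = [x for c in range(cols) for x in column_sum[c]]
    return [[list(full) for _ in range(cols)] for _ in range(rows)]


rows, cols = 2, 2
# Worker (r, c) holds a 4-element toy gradient scaled by its rank + 1.
grads = [
    [[(r * cols + c + 1) * (i + 1) for i in range(4)] for c in range(cols)]
    for r in range(rows)
]
result = torus_all_reduce(grads, rows, cols)
print(result[0][0])
```

In this scheme each phase only involves the workers of one row or one column, so each ring is much shorter than a single ring over all GPUs.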

1K views · 小熊猫, 07:34

Data Science Archive

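A paper on lyrics generation from the NIPS 2018 creativity workshop. For generative models, especially on problems that need some creativity, traditional NLP metrics (such as BLEU, commonly used for translation) don't work very well: a high score does not necessarily mean output that people find good.
I had the same impression when I built a couplet generator earlier. Some models scored well (high BLEU, low perplexity), yet their output looked unimpressive.
For lyrics generation, the paper trains two language models, one on a lyrics corpus and one on a book corpus, so that the generated lyrics have the characteristics of lyrics (apparently capturing rhyme, parallelism, and repetition for emotional emphasis) as well as those of books (rich vocabulary, varied expression).
This should be a useful reference, especially for generation problems that need creativity; the idea of ensembling several different language models is particularly worth borrowing.
This is also a NIPS workshop I have been following all along; it regularly has many interesting papers.
The authors are from Google Brain.
workshop homepage: https://nips2018creativity.github.io/
paper: https://arxiv.org/abs/1811.04651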
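
One simple way to ensemble two language models is linear interpolation of their next-word distributions. The sketch below shows that standard technique on toy distributions; it is not claimed to be the combination method used in the paper, and the function name, vocabulary, and probabilities are all made up.

```python
def interpolate(p_lyrics, p_books, weight):
    """Mix two next-word distributions: weight * lyrics + (1 - weight) * books.

    Each argument maps a word to its probability under one model.
    Words missing from a model are treated as probability 0 (an assumption;
    a real system would smooth instead).
    """
    vocab = set(p_lyrics) | set(p_books)
    mixed = {
        word: weight * p_lyrics.get(word, 0.0) + (1 - weight) * p_books.get(word, 0.0)
        for word in vocab
    }
    total = sum(mixed.values())
    return {word: prob / total for word, prob in mixed.items()}


# Toy next-word distributions from two hypothetical models.
p_lyrics = {"love": 0.5, "baby": 0.5}
p_books = {"love": 0.25, "melancholy": 0.75}

mixed = interpolate(p_lyrics, p_books, weight=0.5)
for word, prob in sorted(mixed.items()):
    print(f"{word}: {prob:.3f}")
```

Raising `weight` pushes the output toward lyric-like wording, while lowering it lets the richer book vocabulary through.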

Machine Learning for Creativity and Design

NeurIPS 2018 Workshop, Montreal, Canada

1.06K views · 小熊猫, edited 17:42

Data Science Archive

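Gael Varoquaux's tutorial on interpreting models, given at EuroSciPy. His blog is always full of solid content, and I'll study this properly over the weekend. Occasionally French words get mixed into his posts, but that doesn't really hurt understanding; you get used to it…
link: http://gael-varoquaux.info/interpreting_ml_tuto/#

1.06K views · 小熊猫, 10:05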

Data Science Archive

A recap of EMNLP 2018. It looks good, and it's even better read alongside the papers. The blog itself is nice too.
link: https://supernlp.github.io/2018/11/10/emnlp-2018/

1.16K views · 小熊猫, 10:07

Data Science Archive

An experiment log of ULMFiT/BERT/ELMo-style pre-training in spaCy.
https://github.com/explosion/spaCy/pull/2931

💫 Add experimental ULMFit/BERT/Elmo-like pretraining by honnibal · Pull Request #2931 · explosion/spaCy

Add support for a new command, spacy pretrain:
usage: spacy pretrain [-h] [-cw 128] [-cd 4] [-er 1000] [-d 0.2] [-i 1] [-s 0]
texts_loc vectors_model output_dir

Pre-train...

1.25K views · 小熊猫, 10:12

Data Science Archive

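A recap of the tricks and pitfalls in training and evaluating GCNs. From a quick read, it describes many training details and many key parts of building a GCN.
I previously tried a plain GCN for text classification: https://arxiv.org/abs/1809.05679
I also wrote my own implementation. GCN for text classification does work, and it's much faster than methods such as TextCNN.
link: https://arxiv.org/abs/1811.05868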
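
For reference, a single GCN layer can be sketched in a few lines of plain Python. This assumes "plain GCN" means the widely used propagation rule H' = ReLU(Â H W) with Â = D^(-1/2) (A + I) D^(-1/2); the function names and the tiny two-node graph are hypothetical, and a real implementation would use sparse matrices and a deep learning framework.

```python
import math


def normalized_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2) for a dense adjacency matrix.

    Assumption: adj is symmetric with a zero diagonal (no self-loops yet).
    """
    n = len(adj)
    a_tilde = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_tilde]
    return [
        [a_tilde[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, features, weights):
    """One propagation step: aggregate neighbor features, project, apply ReLU."""
    a_hat = normalized_adjacency(adj)
    z = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, value) for value in row] for row in z]


# Two connected nodes, one input feature each, identity projection.
adj = [[0, 1], [1, 0]]
features = [[1.0], [3.0]]
weights = [[1.0]]

hidden = gcn_layer(adj, features, weights)
print(hidden)
```

Each layer is essentially two matrix products over the whole graph, which is one plausible reason training can be fast compared with convolving over every document separately.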

1.22K views · 小熊猫, edited 10:25

Data Science Archive

The huggingface folks have published the package to PyPI, so lazy mode is now available…
link: https://github.com/huggingface/pytorch-pretrained-BERT

GitHub - huggingface/transformers: 🤗 Transformers: the model-definition framework for state-of-the-art machine learning models…

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training. - GitHub - huggingface/t...

1.26K views · 小熊猫, 18:08