Data Science Archive

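A friend saw a post tonight and asked me why categorical features need target encoding. It depends on the model, but for the tree-based models commonly used on tabular data, one-hot encoding (OHE) works relatively poorly. With XGBoost you have to do target encoding yourself; CatBoost and LightGBM do not need it, because categorical handling is built in. https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931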

Medium

Visiting: Categorical Features and Encoding in Decision Trees

When you have categorical features and you are using decision trees, you often have a major issue: how to deal with categorical features?
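For readers who have not done target encoding by hand, here is a minimal sketch in plain Python. It uses smoothed mean encoding, which is one common variant; the function names, the `smoothing` parameter, and the toy data are my own assumptions for illustration, not anything from the linked article or from XGBoost.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Learn a category -> smoothed target mean mapping.

    Each category mean is shrunk toward the global mean; `smoothing` acts
    like a count of pseudo-observations at the global mean (hypothetical
    parameter name, chosen for this sketch).
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean


def transform(categories, mapping, global_mean):
    """Encode categories; unseen categories fall back to the global mean."""
    return [mapping.get(cat, global_mean) for cat in categories]


# Toy data: a categorical column and a binary target.
cities = ["a", "a", "a", "b", "b", "c"]
clicked = [1, 1, 0, 0, 0, 1]

mapping, prior = fit_target_encoding(cities, clicked, smoothing=2.0)
encoded = transform(cities + ["unseen"], mapping, prior)
```

In practice the mapping should be fitted on training folds only and then applied to the held-out fold, otherwise the target leaks into the feature.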

4.43K views · 小熊猫, 17:29

Data Science Archive

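On feature reduction and selection: most EDA routines judge feature importance from the model's training loss. A simple and very effective alternative is feature permutation inside CV: shuffle each original feature to get a shadow feature (you can also add some noise), compare the two with a z-score to judge importance, and keep iterating and filtering. ESL II mentions this approach on page 593. R has a package, Boruta, that does this, and Python has one too: https://github.com/scikit-learn-contrib/boruta_py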

GitHub

GitHub - scikit-learn-contrib/boruta_py: Python implementations of the Boruta all-relevant feature selection method.

Python implementations of the Boruta all-relevant feature selection method. - scikit-learn-contrib/boruta_py
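A rough sketch of the shadow-feature idea, in plain Python so it runs without any packages. Two loud assumptions: absolute Pearson correlation stands in for a model-based importance, and each feature is compared only with its own shuffled copy, whereas Boruta compares against the best shadow across all features. The function names and the threshold of 3.0 are made up for illustration.

```python
import random
import statistics


def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return abs(cov / (sx * sy)) if sx and sy else 0.0


def shadow_z_scores(columns, target, n_rounds=50, seed=0):
    """Z-score of each real feature's importance against its shuffled shadows."""
    rng = random.Random(seed)
    scores = {}
    for name, values in columns.items():
        real = abs_corr(values, target)
        shadow = []
        for _ in range(n_rounds):
            shuffled = values[:]      # copy, so the real column is untouched
            rng.shuffle(shuffled)     # shuffling breaks any link to the target
            shadow.append(abs_corr(shuffled, target))
        scores[name] = (real - statistics.fmean(shadow)) / statistics.stdev(shadow)
    return scores


# Synthetic data: one informative column, one pure-noise column.
rng = random.Random(42)
signal = [rng.gauss(0, 1) for _ in range(300)]
noise = [rng.gauss(0, 1) for _ in range(300)]
target = [2.0 * s + rng.gauss(0, 0.5) for s in signal]

z = shadow_z_scores({"signal": signal, "noise": noise}, target)
selected = [name for name, score in z.items() if score > 3.0]
```

With a real model you would swap `abs_corr` for the model's importance measured inside each CV fold.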

6.54K views · 小熊猫, 18:33

Data Science Archive

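For the past while I have been interviewing and changing jobs. Now that things have mostly settled, I will resume posting and collecting work-related material. Thanks to everyone who subscribed.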

3.47K views · 小熊猫, 09:31

Data Science Archive

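PTP (PyTorchPipe) is an IBM framework built on PyTorch for rapid prototyping and training of pipelines. It covers a fairly broad range of domains, both CV and NLP, ships a good selection of pre-trained models, and even includes many evaluation benchmarks and some ready-made higher-level model structures. Well suited to quick experiments. https://github.com/ibm/pytorchpipe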

GitHub

GitHub - IBM/pytorchpipe: PyTorchPipe (PTP) is a component-oriented framework for rapid prototyping and training of computational…

PyTorchPipe (PTP) is a component-oriented framework for rapid prototyping and training of computational pipelines combining vision and language - GitHub - IBM/pytorchpipe: PyTorchPipe (PTP) is a co...

3.85K views · 小熊猫, 09:34

Data Science Archive

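A time series library that fills a gap in Python's tooling. It bundles nearly all the statistical methods you need, and the transforms are all there too, Box-Cox and so on, so you barely need the lower-level DS packages. The API is scikit-learn compatible, and usage and functionality match R's auto.arima, with more rather than less. https://github.com/alkaline-ml/pmdarima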

GitHub

GitHub - alkaline-ml/pmdarima: A statistical library designed to fill the void in Python's time series analysis capabilities, including…

A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function. - alkaline-ml/pmdarima
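Since Box-Cox comes up, here is the transform itself as a small worked sketch in plain Python. This is not pmdarima's API; the function names are hypothetical and `lam` is fixed by hand here, while a library would normally estimate it from the data.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform: (x**lam - 1) / lam, or log(x) when lam == 0."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def inverse_box_cox(values, lam):
    """Undo box_cox so forecasts can be mapped back to the original scale."""
    if lam == 0:
        return [math.exp(v) for v in values]
    return [(lam * v + 1.0) ** (1.0 / lam) for v in values]


# A series whose spread grows with its level.
series = [1.0, 4.0, 9.0, 16.0]
transformed = box_cox(series, 0.5)           # square-root-like compression
restored = inverse_box_cox(transformed, 0.5)
```

The point of the transform is to stabilise variance before fitting an ARIMA-type model; the inverse is applied to the forecasts afterwards.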

4.82K views · 小熊猫, edited 03:30

Data Science Archive

Code update for the ACL 2019 commonsense reading comprehension paper from Salesforce Research. It depends on HuggingFace transformers; I looked at the demo and it is quite good. https://github.com/salesforce/cos-e

GitHub

GitHub - salesforce/cos-e: Commonsense Explanations Dataset and Code

Commonsense Explanations Dataset and Code. Contribute to salesforce/cos-e development by creating an account on GitHub.

5.56K views · 小熊猫, 07:36

Data Science Archive

The HuggingFace Transformers package has added several Chinese pre-trained models, including BERT-wwm, RoBERTa-wwm, and XLNet, from Harbin Institute of Technology (HIT) and iFLYTEK. https://github.com/ymcui/Chinese-BERT-wwm/blob/master/README_EN.md

GitHub

Chinese-BERT-wwm/README_EN.md at master · ymcui/Chinese-BERT-wwm

Pre-Training with Whole Word Masking for Chinese BERT（中文BERT-wwm系列模型） - ymcui/Chinese-BERT-wwm

6.06K views · 小熊猫, 08:32

Data Science Archive

A tokenizer from HuggingFace, implemented in Rust. The speed really is impressive. https://github.com/huggingface/tokenizers

GitHub

GitHub - huggingface/tokenizers: 💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production - huggingface/tokenizers

6.01K views · 小熊猫, 07:52

Data Science Archive

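Several RNNs reimplemented at the CUDA level, with Zoneout and DropConnect built in. I tried the Python and C++ APIs and they really are much faster, although the APIs do not expose many configurable parameters yet. https://github.com/lmnt-com/haste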

GitHub

GitHub - lmnt-com/haste: Haste: a fast, simple, and open RNN library

Haste: a fast, simple, and open RNN library. Contribute to lmnt-com/haste development by creating an account on GitHub.
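For anyone unfamiliar with Zoneout: instead of zeroing units like dropout, each hidden unit randomly keeps its previous value. A plain-Python sketch of one step follows; it only illustrates the idea and is not Haste's API (the function name and arguments are hypothetical).

```python
import random


def zoneout(prev_hidden, new_hidden, rate, training=True, rng=random):
    """Apply Zoneout to one RNN step.

    Training: each unit keeps its previous value with probability `rate`,
    otherwise it takes the freshly computed value.
    Inference: use the expected value of that random choice.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    if training:
        return [
            prev if rng.random() < rate else new
            for prev, new in zip(prev_hidden, new_hidden)
        ]
    return [
        rate * prev + (1.0 - rate) * new
        for prev, new in zip(prev_hidden, new_hidden)
    ]


h_prev = [1.0, 2.0]
h_new = [3.0, 6.0]
h_train = zoneout(h_prev, h_new, rate=0.25, rng=random.Random(0))
h_eval = zoneout(h_prev, h_new, rate=0.25, training=False)
```

Each entry of `h_train` is either the old or the new value for that unit, while `h_eval` is a deterministic blend of the two.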

6.08K views · 小熊猫, 02:54

Data Science Archive

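Some thoughts on GBMs for tabular datasets: they are SOTA so far and should remain so for some time, but there will increasingly be solutions that blend in shallow NNs to push performance further. An important reference is this one from two years ago: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/44629
Source: a tweet by CPMP and the discussion under it: https://twitter.com/JFPuget/status/1233379034425384960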

Kaggle

Porto Seguro’s Safe Driver Prediction

Predict if a driver will file an insurance claim next year.

7.2K views · 小熊猫, edited 05:51

Data Science Archive

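I expected this to be fluff, but as soon as I clicked through I found Phrasebank, which is great. Strongly recommended for current PhD students who need to collaborate on writing. https://www.annaclemens.com/blog/16-free-tools-scientists-write-better-more-productively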

Researchers' Writing Academy - Academic Writing Program by Anna Clemens, PhD

19 Academic Writing Tools (that are completely free!)

Whether you're looking for an academic phrase finder, a collaborative academic writing software, a tool to stay focused on your writing or a writing project management app - I've got you covered!

7.91K views · 小熊猫, 17:16

Data Science Archive

I recently started working with time series again and found a rather good introductory text, so I am about to start cramming. http://www.math.pku.edu.cn/teachers/lidf/course/atsa/atsanotes/html/_atsanotes/index.html

www.math.pku.edu.cn

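Lecture Preparation Notes for Financial Time Series Analysis

Lecture preparation material for the undergraduate course "Financial Time Series Analysis". Built with R's bookdown; the output format is bookdown::gitbook.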

8.27K views · 小熊猫, 03:48

Data Science Archive

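While revisiting time series decomposition recently, I found that my understanding of additive models was still somewhat off. The usual approach with B-spline basis functions is a bit convoluted; I finally understood it fully after reading the source code and docs of pyGAM, although choosing the strength of the smoothing constraint is still hard. https://github.com/dswah/pyGAM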

GitHub

GitHub - dswah/pyGAM: [CONTRIBUTORS WELCOME] Generalized Additive Models in Python

[CONTRIBUTORS WELCOME] Generalized Additive Models in Python - dswah/pyGAM
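What made it click for me can be sketched in a few lines: each smooth term is a weighted sum of B-spline basis functions, and smoothness comes from penalising differences between neighbouring coefficients. The code below is the textbook Cox-de Boor recursion plus a second-difference penalty in plain Python. It is an illustration under my own naming, not pyGAM's implementation.

```python
def bspline_basis(i, degree, knots, x):
    """Value of the i-th B-spline basis function at x (Cox-de Boor recursion)."""
    if degree == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left_den = knots[i + degree] - knots[i]
    right_den = knots[i + degree + 1] - knots[i + 1]
    left = 0.0
    if left_den != 0:
        left = (x - knots[i]) / left_den * bspline_basis(i, degree - 1, knots, x)
    right = 0.0
    if right_den != 0:
        right = (
            (knots[i + degree + 1] - x) / right_den
            * bspline_basis(i + 1, degree - 1, knots, x)
        )
    return left + right


def design_row(x, knots, degree):
    """One row of the design matrix: every basis function evaluated at x."""
    n_basis = len(knots) - degree - 1
    return [bspline_basis(i, degree, knots, x) for i in range(n_basis)]


def second_difference_penalty(coefs):
    """Sum of squared second differences of neighbouring coefficients."""
    return sum(
        (coefs[j] - 2 * coefs[j + 1] + coefs[j + 2]) ** 2
        for j in range(len(coefs) - 2)
    )


# Clamped cubic knots on [0, 4); the half-open intervals exclude x == 4.
knots = [0, 0, 0, 0, 1, 2, 3, 4, 4, 4, 4]
row = design_row(1.5, knots, degree=3)
```

A fit then minimises squared error plus a multiple of `second_difference_penalty(coefs)`. That multiple is the smoothing strength, and it is the part that is hard to choose: too small and the curve chases noise, too large and it flattens toward a straight line.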

11.5K views · 小熊猫, 13:50

Data Science Archive

Stumbled on a particularly good blog post on frequentism versus Bayesianism: http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-a-practical-intro/

jakevdp.github.io

Frequentism and Bayesianism: A Practical Introduction | Pythonic Perambulations


11.3K views · 小熊猫, edited 01:02

Data Science Archive

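While thoroughly reviewing our API before launch recently, I got a lot out of this guideline, which is used internally and also published openly. https://github.com/microsoft/api-guidelines/blob/vNext/Guidelines.md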

GitHub

api-guidelines/Guidelines.md at vNext · microsoft/api-guidelines

Microsoft REST API Guidelines. Contribute to microsoft/api-guidelines development by creating an account on GitHub.

8.11K views · 小熊猫, 01:59

Data Science Archive

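Recently I started using https://streamlit.io for simple offline demos. Before that I had used Dash for a year or two, and so far Streamlit feels like it needs less than half the preparation time. In 2022, barring surprises, I expect to settle on streamlit.io for interactive work and http://datapane.com for static reports.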


2.42K views · 小熊猫, edited 03:20

Data Science Archive

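Alibi Explain is probably the most systematic tool for machine learning model interpretability I have seen in recent years; its documentation is practically a knowledge base. After all, knowing a bit of SHAP is not enough. https://docs.seldon.io/projects/alibi/en/latest/index.html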


5.62K views · 小熊猫, edited 03:37

Data Science Archive

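Deepchecks is the best tool I have found so far for offline model checks and production monitoring, especially the Suite and Condition concepts the project introduces. For now it can only be used in notebooks; HTML and PDF output are not supported yet. The project is very new and worth watching. https://github.com/deepchecks/deepchecks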

GitHub

GitHub - deepchecks/deepchecks: Deepchecks: Tests for Continuous Validation of ML Models & Data. Deepchecks is a holistic open…

Deepchecks: Tests for Continuous Validation of ML Models & Data. Deepchecks is a holistic open-source solution for all of your AI & ML validation needs, enabling to thoroughly test ...


2.6K views · 小熊猫, 04:06

Data Science Archive

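FB's recent ConvNeXt looks quite strong: its experimental results compete with ViT and Swin Transformer using a pure ConvNet structure. After MLP-Mixer, I increasingly feel that, following a period of arms race, the field is finally back on the right path of exploring architecture. From MLPs in the compute-limited era to fast, intuitive ConvNet filtering: once you hit a bottleneck and have more compute, looking back at earlier structures always brings new insight. I wonder when we will next see a structure that leaves as universal a lesson as ResNet did. https://github.com/facebookresearch/ConvNeXt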

GitHub

GitHub - facebookresearch/ConvNeXt: Code release for ConvNeXt model

Code release for ConvNeXt model. Contribute to facebookresearch/ConvNeXt development by creating an account on GitHub.


2.02K views · 小熊猫, edited 07:26

Data Science Archive

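Recommending a blog post in which the author describes writing tests in DS projects. Testing an ML project is not quite the same as testing a traditional program: besides the basics such as assert and pytest, you also need tests on data distributions and on some summary statistics of the data. I have used both of the tools mentioned, Hypothesis and Pandera. Pandera is very handy and integrates natively with Pandas/Koalas (Koalas is also the DataFrame tool I use most with PySpark). https://www.peterbaumgartner.com/blog/testing-for-data-science/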

Peterbaumgartner

Ways I Use Testing as a Data Scientist

In my work, writing tests serves three purposes: making sure things work, documenting my understanding, preventing future errors. When I was starting out with testing, I had a hard time understanding what I should be writing tests for. As a beginner, I just…
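To make "testing the data, not just the code" concrete, here is a tiny hand-rolled sketch using only assert and the standard library. It is a stand-in for what a schema library does, not Pandera's or Hypothesis's API; `check_column` and its parameters are hypothetical names.

```python
import statistics


def check_column(values, low, high, max_null_rate=0.0, mean_range=None):
    """Return a list of violations; an empty list means the column passed."""
    problems = []
    nulls = sum(v is None for v in values)
    present = [v for v in values if v is not None]

    # Completeness: share of missing values.
    if nulls / len(values) > max_null_rate:
        problems.append(f"null rate {nulls / len(values):.2f} exceeds {max_null_rate}")

    # Validity: every present value inside the allowed range.
    out_of_range = [v for v in present if not low <= v <= high]
    if out_of_range:
        problems.append(f"{len(out_of_range)} value(s) outside [{low}, {high}]")

    # Distribution: the mean should stay inside an expected band.
    if mean_range is not None and present:
        mean = statistics.fmean(present)
        if not mean_range[0] <= mean <= mean_range[1]:
            problems.append(f"mean {mean:.2f} outside {mean_range}")

    return problems


def test_age_column():
    """pytest-style test: fails if the data drifts away from expectations."""
    ages = [23, 35, 41, 29, 52]
    assert check_column(ages, low=0, high=120, mean_range=(20, 60)) == []


bad = check_column([23, None, 200], low=0, high=120, max_null_rate=0.1)
```

Returning a list of violations instead of failing on the first one makes the test output more useful when several things break at once.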


1.93K views · 小熊猫, 05:46

Data Science Archive

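Plain Boruta decides on each feature by accepting or rejecting it under a binomial distribution, based on how it compares with the shadow features generated earlier. So if you take Boruta's second stage (ranking features by a metric and filtering) on its own, it can be swapped for something else, for example SHAP (the reference here is the e-book "interpretable-ML" that I recommended before; highly recommended). There is indeed a package that does this. I tried it on a Kaggle tabular dataset, and used on its own the effect was not obvious, but as one more feature selection method to ensemble with it should (probably) help. The package is here: BorutaShap https://github.com/Ekeany/Boruta-Shap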
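A minimal sketch of that binomial accept/reject step, in plain Python. A "hit" means the feature's importance beat the best shadow feature in a given round, whichever importance measure (model-based or SHAP) produced it. The function names and the plain `alpha` threshold are simplifying assumptions; real implementations also correct for multiple testing.

```python
from math import comb


def binom_tail_ge(hits, n, p=0.5):
    """P(X >= hits) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(hits, n + 1))


def binom_tail_le(hits, n, p=0.5):
    """P(X <= hits) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(0, hits + 1))


def boruta_decision(hits, n_rounds, alpha=0.05):
    """Classify a feature from its hit count over n_rounds.

    Under the null hypothesis a feature beats the best shadow half the time,
    so far more hits than n/2 means accept and far fewer means reject.
    """
    if binom_tail_ge(hits, n_rounds) < alpha:
        return "accept"
    if binom_tail_le(hits, n_rounds) < alpha:
        return "reject"
    return "tentative"


decisions = {hits: boruta_decision(hits, 20) for hits in (2, 10, 18)}
```

Because only the hit counts enter this step, swapping the importance measure for SHAP leaves the decision rule unchanged.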


1.83K views · 小熊猫, edited 05:46
