My experimental results with RAdam + LookAhead are still a bit odd and not very conclusive. Here is an implementation built on fastai. https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d
Medium
New Deep Learning Optimizer, Ranger: Synergistic combination of RAdam + LookAhead for the best of both.
A new paper in part by the famed deep learning researcher Geoffrey Hinton introduces the LookAhead optimizer(“LookAhead optimizer: k steps…
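For readers unfamiliar with the LookAhead half of the combination, here is a minimal sketch of the update rule. It wraps plain SGD rather than RAdam, so it is not the Ranger implementation from the linked article; the names `lookahead_sgd`, `grad_fn`, `k` and `alpha` are illustrative assumptions.

```python
def lookahead_sgd(grad_fn, w0, lr=0.1, k=5, alpha=0.5, outer_steps=50):
    """Lookahead wrapped around plain SGD (a stand-in for RAdam)."""
    slow = list(w0)  # slow weights
    for _ in range(outer_steps):
        fast = list(slow)  # fast weights restart from the slow weights
        for _ in range(k):  # k steps of the inner optimizer
            g = grad_fn(fast)
            fast = [w - lr * gi for w, gi in zip(fast, g)]
        # Move the slow weights a fraction alpha toward the fast weights
        slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
    return slow


def quad_grad(w):
    # Gradient of the toy objective f(w) = (w[0] - 3)^2 + (w[1] + 1)^2
    return [2 * (w[0] - 3.0), 2 * (w[1] + 1.0)]


w_final = lookahead_sgd(quad_grad, [0.0, 0.0])
```

On this toy quadratic the slow weights settle at the minimum (3, -1). On a real network, swapping the inner loop for RAdam steps gives the combination discussed in the post.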
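Last week, while building a CTR project from scratch, I systematically reviewed hyperparameter optimization methods and tools for models other than complex DNNs, and came across a new tool: Optuna https://github.com/pfnet/optuna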
GitHub
GitHub - optuna/optuna: A hyperparameter optimization framework
A hyperparameter optimization framework. Contribute to optuna/optuna development by creating an account on GitHub.
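Recently, while using unsupervised methods for dimensionality reduction, I found that on categorical features MCA sometimes works better than conventional PCA (though target encoding followed by ordinary PCA is sometimes fine too). I have been using Prince for a while: simple to use, with good performance. https://github.com/MaxHalford/Prince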
GitHub
GitHub - MaxHalford/prince: :crown: Multivariate exploratory data analysis in Python — PCA, CA, MCA, MFA, FAMD, GPA
:crown: Multivariate exploratory data analysis in Python — PCA, CA, MCA, MFA, FAMD, GPA - MaxHalford/prince
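Tonight a friend who saw the post asked me why categorical features should be target encoded. It depends on the model, but for the tree-based models commonly used on tabular data, OHE (one-hot encoding) works rather poorly. With XGBoost you need to do target encoding yourself; CatBoost and LightGBM do not need it because categorical handling is built in. https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931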
Medium
Visiting: Categorical Features and Encoding in Decision Trees
When you have categorical features and you are using decision trees, you often have a major issue: how to deal with categorical features?
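To make "target encoding" concrete, here is a minimal sketch of smoothed mean target encoding in plain Python. The function names, the `smoothing` parameter and the toy data are assumptions for illustration, not any library's API.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Return (mapping, global_mean) for smoothed mean target encoding."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    # Shrink each category mean toward the global mean; rare categories shrink more
    mapping = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return mapping, global_mean


def transform(categories, mapping, global_mean):
    # Categories never seen during fitting fall back to the global mean
    return [mapping.get(c, global_mean) for c in categories]


cities = ["a", "a", "a", "b", "b", "c"]   # toy categorical feature
clicked = [1, 1, 0, 0, 0, 1]              # toy binary target
mapping, prior = fit_target_encoding(cities, clicked, smoothing=2.0)
encoded = transform(cities + ["d"], mapping, prior)
```

In practice the mapping should be fitted on training folds only and then applied to held-out rows, otherwise the target leaks into the feature.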
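On feature reduction/selection: most EDA routines judge feature importance from the model's training loss. A simple and very effective alternative is feature permutation inside CV: shuffle the original features to get shadow features (you can also add some noise), compare the two with a z-score to judge importance, and keep iterating to filter features. ESL II mentions this approach on page 593. R has a package, Boruta, that does this, and there is a Python port too: https://github.com/scikit-learn-contrib/boruta_py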
GitHub
GitHub - scikit-learn-contrib/boruta_py: Python implementations of the Boruta all-relevant feature selection method.
Python implementations of the Boruta all-relevant feature selection method. - scikit-learn-contrib/boruta_py
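The shadow-feature idea can be sketched roughly as follows. This is a simplification and not Boruta's actual procedure: importance here is just the absolute Pearson correlation with the target (Boruta uses tree-based importances), there is no CV loop, and the names `abs_corr` and `shadow_zscores` are made up for the example.

```python
import numpy as np


def abs_corr(X, y):
    """Absolute Pearson correlation of each column of X with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())


def shadow_zscores(X, y, n_rounds=50, seed=0):
    """z-score of each real feature's importance against its shuffled copies."""
    rng = np.random.default_rng(seed)
    real = abs_corr(X, y)
    shadow = np.empty((n_rounds, X.shape[1]))
    for r in range(n_rounds):
        # Shuffle every column independently to break its link with y
        shuffled = np.column_stack([rng.permutation(col) for col in X.T])
        shadow[r] = abs_corr(shuffled, y)
    return (real - shadow.mean(axis=0)) / shadow.std(axis=0)


# Synthetic data: only the first two of four features drive the target
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=500)
z = shadow_zscores(X, y)
```

Features whose z-score stays far above the shadow distribution are kept; the rest are candidates for removal, and the screening can be repeated on the remaining features.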
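PTP is a component-oriented framework from IBM for building and training PyTorch pipelines. It covers a fairly broad range of domains, including both CV and NLP, ships a good selection of pre-trained models, and even includes many benchmarks and some ready-made higher-level model architectures. Very well suited to quick experiments. https://github.com/ibm/pytorchpipe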
GitHub
GitHub - IBM/pytorchpipe: PyTorchPipe (PTP) is a component-oriented framework for rapid prototyping and training of computational…
PyTorchPipe (PTP) is a component-oriented framework for rapid prototyping and training of computational pipelines combining vision and language - GitHub - IBM/pytorchpipe: PyTorchPipe (PTP) is a co...
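A library that fills a gap in Python's time series tooling. It bundles almost all the statistical methods you need, and has all the transforms you would expect, Box-Cox and so on, so you hardly need the lower-level DS packages anymore. The API is compatible with scikit-learn, and its usage and features match R's auto.arima, with more on top rather than less. https://github.com/alkaline-ml/pmdarima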
GitHub
GitHub - alkaline-ml/pmdarima: A statistical library designed to fill the void in Python's time series analysis capabilities, including…
A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function. - alkaline-ml/pmdarima
Code update for the Salesforce Research commonsense reading comprehension paper from ACL 2019. It depends on huggingface's transformers; I looked at the demo and it is very good. https://github.com/salesforce/cos-e
GitHub
GitHub - salesforce/cos-e: Commonsense Explanations Dataset and Code
Commonsense Explanations Dataset and Code. Contribute to salesforce/cos-e development by creating an account on GitHub.
The HuggingFace Transformers package has added several Chinese pre-trained models, including BERT-wwm, RoBERTa-wwm, and XLNet, from HIT (Harbin Institute of Technology) and iFLYTEK. https://github.com/ymcui/Chinese-BERT-wwm/blob/master/README_EN.md
GitHub
Chinese-BERT-wwm/README_EN.md at master · ymcui/Chinese-BERT-wwm
Pre-Training with Whole Word Masking for Chinese BERT(中文BERT-wwm系列模型) - ymcui/Chinese-BERT-wwm
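Several RNNs reimplemented at the CUDA level, with Zoneout and DropConnect built in. I tried the Python and C++ APIs and it really is much faster, though the APIs do not expose many configurable parameters yet. https://github.com/lmnt-com/haste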
GitHub
GitHub - lmnt-com/haste: Haste: a fast, simple, and open RNN library
Haste: a fast, simple, and open RNN library. Contribute to lmnt-com/haste development by creating an account on GitHub.
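Some thoughts on GBMs for tabular datasets: they are SOTA so far and should remain so for some time, but to some extent there will be schemes that blend in shallow NNs to push performance further. An important reference is this one from two years ago: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/44629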
Source: a tweet by CPMP and the discussion under it: https://twitter.com/JFPuget/status/1233379034425384960
Kaggle
Porto Seguro’s Safe Driver Prediction
Predict if a driver will file an insurance claim next year.
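I thought this would be fluff, but right after clicking through I found Phrasebank, which is great. Strongly recommended for current PhD students who need to write. https://www.annaclemens.com/blog/16-free-tools-scientists-write-better-more-productively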
Researchers' Writing Academy - Academic Writing Program by Anna Clemens, PhD
19 Academic Writing Tools (that are completely free!)
Whether you're looking for an academic phrase finder, a collaborative academic writing software, a tool to stay focused on your writing or a writing project management app - I've got you covered!
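Recently, while revisiting time series decomposition, I found my understanding of the additive model was still somewhat off. The general solution using B-spline basis functions was a bit convoluted, and it finally clicked after reading the source code and docs of pyGAM. Choosing the strength of the smoothing penalty is still pretty hard, though. https://github.com/dswah/pyGAM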
GitHub
GitHub - dswah/pyGAM: [CONTRIBUTORS WELCOME] Generalized Additive Models in Python
[CONTRIBUTORS WELCOME] Generalized Additive Models in Python - dswah/pyGAM
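As a rough illustration of the B-spline basis plus smoothing penalty idea, here is a minimal sketch of a single penalized smooth term. It is not pyGAM's implementation: it assumes degree-1 B-splines (hat functions) on equally spaced knots and a second-difference penalty on the coefficients, and the names `hat_basis`, `fit_smooth`, `roughness` and `lam` are illustrative.

```python
import numpy as np


def hat_basis(x, knots):
    """Degree-1 B-spline (hat function) basis; knots must be equally spaced."""
    h = knots[1] - knots[0]
    return np.clip(1.0 - np.abs(x[:, None] - knots[None, :]) / h, 0.0, None)


def fit_smooth(x, y, knots, lam):
    """Penalized least squares: minimize ||y - B c||^2 + lam * ||D c||^2."""
    B = hat_basis(x, knots)
    D = np.diff(np.eye(len(knots)), n=2, axis=0)  # second-difference operator
    return np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ y)


def roughness(coef):
    """Sum of squared second differences of the coefficients."""
    return float(np.sum(np.diff(coef, n=2) ** 2))


x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x)
knots = np.linspace(0.0, 1.0, 21)

coef_rough = fit_smooth(x, y, knots, lam=1e-6)   # almost unpenalized fit
coef_smooth = fit_smooth(x, y, knots, lam=1e3)   # heavily smoothed fit
fit_rough = hat_basis(x, knots) @ coef_rough
```

An additive model with several features stacks one such basis block per feature, each with its own `lam`. The difficulty mentioned in the post is exactly the choice of `lam`: too small and the fit chases noise, too large and it flattens toward a straight line.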
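Recently I started using https://streamlit.io for simple offline demos. Before this I had used Dash for a year or two, and so far streamlit feels like it takes less than half the setup time of Dash. In 2022, barring surprises, I expect to settle on streamlit.io for anything interactive and http://datapane.com for anything static.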
👍3❤1
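Alibi Explain is probably the most systematic tool for machine learning model interpretability I have seen in recent years, with documentation that doubles as a knowledge base. After all, you cannot get by knowing only a bit of SHAP. https://docs.seldon.io/projects/alibi/en/latest/index.html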
👍6❤1
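Deepchecks is the best tool I have found so far for offline model checks and production monitoring, especially the Suite and Condition concepts the project introduces. For now it only works in notebooks and does not support HTML or PDF output yet. The project is very new and worth watching. https://github.com/deepchecks/deepchecks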
GitHub
GitHub - deepchecks/deepchecks: Deepchecks: Tests for Continuous Validation of ML Models & Data. Deepchecks is a holistic open…
Deepchecks: Tests for Continuous Validation of ML Models & Data. Deepchecks is a holistic open-source solution for all of your AI & ML validation needs, enabling to thoroughly test ...
👍8