Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
Measuring feature importance properly

http://explained.ai/rf-importance/index.html

Once again stumbled upon an amazing article about measuring feature importance for any ML algorithm:
(0) Permutation importance - if re-training your model is costly, you can just shuffle one column of the validation set and check how much the metric drops
(1) Drop column importance - drop a column, re-train the model, check how much the metric drops
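
Both ideas fit in a few lines. Below is a minimal sketch, assuming scikit-learn, a random forest and a toy dataset where only the first two columns matter (the data, the helper names and the R^2 metric are my assumptions for illustration, not taken from the article):

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def permutation_importance_scores(model, X_val, y_val, rng):
    """Drop in validation R^2 after shuffling each column (no re-training)."""
    baseline = r2_score(y_val, model.predict(X_val))
    scores = []
    for col in range(X_val.shape[1]):
        X_shuffled = X_val.copy()
        X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
        scores.append(baseline - r2_score(y_val, model.predict(X_shuffled)))
    return np.array(scores)


def drop_column_importance_scores(model, X_tr, y_tr, X_val, y_val):
    """Drop in validation R^2 after removing each column and re-training."""
    baseline = r2_score(y_val, clone(model).fit(X_tr, y_tr).predict(X_val))
    scores = []
    for col in range(X_tr.shape[1]):
        reduced = clone(model).fit(np.delete(X_tr, col, axis=1), y_tr)
        pred = reduced.predict(np.delete(X_val, col, axis=1))
        scores.append(baseline - r2_score(y_val, pred))
    return np.array(scores)


# Toy data (hypothetical): only columns 0 and 1 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=600)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
perm_scores = permutation_importance_scores(forest, X_val, y_val, rng)
drop_scores = drop_column_importance_scores(forest, X_tr, y_tr, X_val, y_val)
print("permutation:", perm_scores.round(3))
print("drop-column:", drop_scores.round(3))
```

Note the cost difference: permutation importance needs one extra predict per column, drop column importance needs one extra fit per column. scikit-learn also ships a ready-made `sklearn.inspection.permutation_importance` if you do not want to roll your own.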

Why it is useful / caveats
(0) If you really care about understanding your domain - feature importances are a must have
(1) All of this works only if the model itself is accurate enough - importances from a weak model tell you little
(2) Landmines include - correlated or duplicate variables, data normalization

Correlated variables
(0) For RF - correlated variables share permutation importance roughly proportionally to their correlation
(1) Drop column importance can behave unpredictably
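
The extreme case of correlation is an exact duplicate of a column. A rough sketch of how the importance gets shared in that case, assuming scikit-learn and a toy dataset (both are my assumptions, the article uses its own data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): column 0 is the strong feature, column 2 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = 3.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=600)

# Same data plus an exact copy of column 0 appended as column 3.
X_dup = np.column_stack([X, X[:, 0]])


def val_permutation_importance(features, target):
    """Fit a forest and return mean permutation importance on a hold-out set."""
    X_tr, X_val, y_tr, y_val = train_test_split(features, target, random_state=0)
    forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    result = permutation_importance(forest, X_val, y_val, n_repeats=10, random_state=0)
    return result.importances_mean


imp_single = val_permutation_importance(X, y)
imp_dup = val_permutation_importance(X_dup, y)
print("without duplicate:", imp_single.round(3))
print("with duplicate:   ", imp_dup.round(3))
```

With the duplicate present the forest splits on either copy, so shuffling just one of them hurts much less than shuffling the single original column did.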

I personally like engineering different kinds of features and doing ablation tests:
(0) Among feature sets that share a similar purpose
(1) Within feature sets

#data_science
2018 DS/ML digest 14

Amazing article - why you do not need ML
- https://cyberomin.github.io/startup/2018/07/01/sql-ml-ai.html
- I personally love plain-vanilla SQL and in 90% of cases people under-use it
- I even wrote 90% of my JSON API on our blog in pure PostgreSQL xD

Practice / papers
(0) Interesting papers from CVPR https://towardsdatascience.com/the-10-coolest-papers-from-cvpr-2018-11cb48585a49
(1) Some down-to-earth obstacles to ML deploy https://habr.com/company/hh/blog/415437/
(2) Using synthetic data for CNNs (by Nvidia) - https://arxiv.org/pdf/1804.06516.pdf
(3) This puzzles me - so much effort and engineering spent on something ... strange and useless - http://taskonomy.stanford.edu/index.html
On paper they do a cool thing - investigate transfer learning between different domains, but in practice it is done on TF and there is no clear conclusion of any kind
(4) VAE + real datasets http://siavashk.github.io/2016/02/22/autoencoder-imagenet/ - only small Imagenet (64x64)
(5) Understanding the speed of models deployed on mobile - http://machinethink.net/blog/how-fast-is-my-model/
(6) A brief overview of multi-modal methods https://medium.com/mlreview/multi-modal-methods-image-captioning-from-translation-to-attention-895b6444256e

Visualizations / explanations
(0) Amazing website with ML explanations http://explained.ai/
(1) PCA and linear VAEs are close https://pvirie.wordpress.com/2016/03/29/linear-autoencoders-do-pca/
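
A quick way to see the PCA link is to train a plain linear autoencoder (not a VAE - a simplification on my side) and compare its reconstruction error with PCA. A minimal numpy sketch on toy data; the data, learning rate and step count are my assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 500 points in 10-D, most variance in a 3-D subspace.
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 10))
X += 0.05 * rng.normal(size=X.shape)
X -= X.mean(axis=0)
X /= X.std()
k = 3

# PCA via SVD: project onto the top-k right singular vectors.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T
pca_error = np.mean((X - X @ V_k @ V_k.T) ** 2)

# Linear autoencoder (no bias, no non-linearity) trained with gradient descent
# on the squared reconstruction error.
W_enc = 0.1 * rng.normal(size=(10, k))
W_dec = 0.1 * rng.normal(size=(k, 10))
lr = 0.02
for _ in range(5000):
    Z = X @ W_enc
    residual = Z @ W_dec - X
    grad_dec = Z.T @ residual / len(X)
    grad_enc = X.T @ (residual @ W_dec.T) / len(X)
    W_enc -= lr * grad_enc
    W_dec -= lr * grad_dec
ae_error = np.mean((X - X @ W_enc @ W_dec) ** 2)

print("PCA reconstruction MSE:", pca_error)
print("linear AE reconstruction MSE:", ae_error)
```

A rank-k linear map cannot beat PCA on squared error, so the autoencoder can only approach the PCA number from above. The learned weights themselves usually differ from the principal components by an invertible k x k transform.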

#deep_learning
#digest
#data_science
Open Images Object detection on Kaggle

- https://www.kaggle.com/c/google-ai-open-images-object-detection-track#Description

- Key ideas
-- 1.2 images, high-res, 500 classes
-- decent prizes, but short time-span (2 months)
-- object detection

#deep_learning
2018 DS/ML digest 15

What I filtered through this time

Market / news
(0) Letters by big company employees against using ML for weapons
- Microsoft
- Amazon
(1) Facebook open sources Dense Pose (essentially this is Mask-RCNN)
- https://research.fb.com/facebook-open-sources-densepose/

Papers / posts / NLP
(0) One more blog post about text / sentence embeddings https://goo.gl/Zm8C2c
- key idea - a different weighting scheme

(1) One more sentence embedding calculation method
- https://openreview.net/pdf?id=SyK00v5xx ?

(2) Posts explaining NLP embeddings
- http://www.offconvex.org/2015/12/12/word-embeddings-1/ - some basics - SVD / Word2Vec / GloVe
-- SVD improves embedding quality (as compared to one-hot encoding)?
-- use log-weighting, use TF-IDF weighting (the above weighting)
- http://www.offconvex.org/2016/02/14/word-embeddings-2/ - word embedding properties
-- dimensions vs. embedding quality http://www.cs.princeton.edu/~arora/pubs/LSAgraph.jpg
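
The SVD + log-weighting recipe from the first post can be sketched on a toy corpus. The corpus, the sentence-level co-occurrence window and the embedding size below are my assumptions, and TF-IDF weighting is left out for brevity:

```python
import numpy as np

# Toy corpus (hypothetical): two unrelated topics.
corpus = [
    "cats chase mice",
    "dogs chase cats",
    "cats eat fish",
    "dogs eat meat",
    "stocks rise fast",
    "stocks fall fast",
    "markets rise fast",
    "markets fall fast",
]
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({word for sentence in tokens for word in sentence})
index = {word: i for i, word in enumerate(vocab)}

# Symmetric word-word co-occurrence counts, one sentence = one context window.
counts = np.zeros((len(vocab), len(vocab)))
for sentence in tokens:
    for a in sentence:
        for b in sentence:
            if a != b:
                counts[index[a], index[b]] += 1

# Log-weighting dampens very frequent pairs before factorization.
weighted = np.log1p(counts)

# Truncated SVD: keep the top `dim` components as dense word vectors.
U, S, _ = np.linalg.svd(weighted)
dim = 4
embeddings = U[:, :dim] * np.sqrt(S[:dim])


def cosine(a, b):
    """Cosine similarity with a small guard against zero-length vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


sim_related = cosine(embeddings[index["stocks"]], embeddings[index["markets"]])
sim_unrelated = cosine(embeddings[index["stocks"]], embeddings[index["cats"]])
print("stocks vs markets:", sim_related)
print("stocks vs cats:   ", sim_unrelated)
```

Words that appear in the same contexts end up with near-identical vectors even though they never co-occur with each other, which is the main gain over one-hot vectors.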

(3) spaCy + Cython = 100x speed boost - https://goo.gl/9TwVqu - good to know about this as a last resort
- described use-cases
-- you are pre-processing a large training set for a DeepLearning framework like PyTorch/TensorFlow
-- or you have a heavy processing logic in your DeepLearning batch loader that slows down your training

(4) Once again stumbled upon this - https://blog.openai.com/language-unsupervised/

(5) Papers
- Simple NLP embedding baseline https://goo.gl/nGujzS
- NLP decathlon for question answering https://goo.gl/6HHi7q
- Debiasing embeddings https://arxiv.org/abs/1806.06301
- Once again transfer learning in NLP by OpenAI - https://goo.gl/82VR4U

#deep_learning
#digest
#data_science
Forwarded from SK
Playing with VAEs and their practical use

So, I played a bit with Variational Auto Encoders (VAE) and wrote a small blog post on this topic

https://spark-in.me/post/playing-with-vae-umap-pca

Please like, share and repost!

#deep_learning
#data_science

Like this post or have something to say => tell us more in the comments or donate!
A new multi-threaded addition to pandas stack?

Read about this some time ago (when this was just in development https://t.me/snakers4/1850) - found essentially 3 alternatives
- just being clever about optimizing your operations + using what is essentially a multi-threaded map/reduce in pandas https://t.me/snakers4/1981
- pandas on ray
- dask (overkill)
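
The first alternative (a multi-threaded map/reduce over chunks) can look roughly like this. A minimal sketch, assuming a row-wise independent transformation; `parallel_apply` and `add_features` are hypothetical names, not something from the linked posts:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def parallel_apply(df, func, n_chunks=4, max_workers=4):
    """Map: run `func` on row chunks in a thread pool. Reduce: concat results."""
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    chunks = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(func, chunks))  # pool.map keeps chunk order
    return pd.concat(results)


def add_features(chunk):
    """Example per-chunk work; must not depend on rows outside the chunk."""
    return chunk.assign(y=np.sqrt(chunk["x"]) * 2)


df = pd.DataFrame({"x": np.arange(1_000, dtype=float)})
parallel = parallel_apply(df, add_features)
serial = add_features(df)
print(parallel.equals(serial))
```

Threads only help when the per-chunk work releases the GIL (most vectorized numpy / pandas operations do). For pure-Python work a process pool is the usual swap. Modin aims to be a drop-in replacement instead (`import modin.pandas as pd`), so no manual chunking is needed there.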

Links:
(0) https://rise.cs.berkeley.edu/blog/pandas-on-ray-early-lessons/
(1) https://www.reddit.com/comments/8wuz7e
(2) https://github.com/modin-project/modin

So...I ran a test in the notebook I had on hand. It works. More tests will be done in future.
https://pics.spark-in.me/upload/2c7a2f8c8ce1dd7a86a54ec3a3dcf965.png

#data_science
#pandas
Disclaimer - it does not support pivot tables or complicated group_by ...
Yet another proxy - shadowsocks

If someone needs another proxy guide, someone with an Arabic username shared some alternative advice for proxy configuration
- http://disq.us/p/1tsy4nk (wait a bit till link resolves)

#internet
#linux
2018 DS/ML digest 16

Papers / posts
(0) RL now solves Quake
https://venturebeat.com/2018/07/03/googles-deepmind-taught-ai-teamwork-by-playing-quake-iii-arena/
(1) A fast.ai post about AdamW
http://www.fast.ai/2018/07/02/adam-weight-decay/
-- Adam generally requires more regularization than SGD, so be sure to adjust your regularization hyper-parameters when switching from SGD to Adam
-- Amsgrad turns out to be very disappointing
-- Refresher article http://ruder.io/optimizing-gradient-descent/index.html#nadam
(2) How to tackle new classes in CV
https://petewarden.com/2018/07/06/what-image-classifiers-can-do-about-unknown-objects/
(3) A new word in GANs?
-- https://ajolicoeur.wordpress.com/RelativisticGAN/
-- https://arxiv.org/pdf/1807.00734.pdf
(4) Using deep learning representations for search
-- https://goo.gl/R1vhTh
-- a Python library for fast approximate nearest-neighbour search https://github.com/spotify/annoy
(5) One more paper on GAN convergence
https://avg.is.tuebingen.mpg.de/publications/meschedericml2018
(6) Switchable normalization - gives a small accuracy gain on ResNet50, pre-trained models are available
https://github.com/switchablenorms/Switchable-Normalization
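
On the AdamW post from item (1) - the whole point is where the weight decay term enters the update. A minimal numpy sketch of a single step, written from the standard Adam / AdamW update rules; the function name, the hyper-parameter values and the zero loss gradient are my assumptions for illustration:

```python
import numpy as np


def adam_step(w, grad, state, lr=0.1, betas=(0.9, 0.999), eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    """One Adam step with either classic L2 (in the gradient) or decoupled decay."""
    if l2:
        # Classic L2: the penalty gradient goes through Adam's normalization.
        grad = grad + l2 * w
    state["t"] += 1
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * grad
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * grad ** 2
    m_hat = state["m"] / (1 - betas[0] ** state["t"])
    v_hat = state["v"] / (1 - betas[1] ** state["t"])
    new_w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled_wd:
        # AdamW: shrink the weights directly, outside the adaptive scaling.
        new_w = new_w - lr * decoupled_wd * w
    return new_w


def fresh_state(w):
    return {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}


lr, wd = 0.1, 0.01
w0 = np.array([10.0, 0.1])   # one large and one small weight
zero_grad = np.zeros_like(w0)  # isolate the effect of the decay term

w_l2 = adam_step(w0, zero_grad, fresh_state(w0), lr=lr, l2=wd)
w_adamw = adam_step(w0, zero_grad, fresh_state(w0), lr=lr, decoupled_wd=wd)
print("Adam + L2 shrinkage:", w0 - w_l2)
print("AdamW shrinkage:    ", w0 - w_adamw)
```

With L2 folded into the gradient, Adam's per-parameter normalization makes both weights shrink by about the same absolute amount on this step. With decoupled decay the shrinkage is proportional to the weight itself, which is what weight decay is supposed to do.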

Datasets
(0) Disney starts to release datasets
https://www.disneyanimation.com/technology/datasets


Market / interesting links
(0) A motion to open-source GitHub
https://github.com/dear-github/dear-github/issues/304
(1) Allegedly GTX 1180 listings are starting to appear on sale in Asia (?)
(2) Some controversy regarding Andrew Ng and self-driving cars https://goo.gl/WNW4E3
(3) National AI strategies overviewed - https://goo.gl/BXDCD7
-- Canada C$135m
-- China has the largest strategy
-- Notably - countries like Finland also have one
(4) Amazon allegedly sells face recognition to the USA https://goo.gl/eDzekn

#data_science
#deep_learning
Ofc such experiments are done on toy datasets - but it's nice to know