Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
Just found a book on practical Python programming patterns
- http://python-3-patterns-idioms-test.readthedocs.io/en/latest/PythonForProgrammers.html

Looks good

#python
Most common libraries for Natural Language Processing:

CoreNLP from Stanford group:
http://stanfordnlp.github.io/CoreNLP/index.html

NLTK, the most widely-mentioned NLP library for Python:
http://www.nltk.org/

TextBlob, a user-friendly and intuitive NLTK interface:
https://textblob.readthedocs.io/en/dev/index.html

Gensim, a library for document similarity analysis:
https://radimrehurek.com/gensim/

spaCy, an industrial-strength NLP library built for performance:
https://spacy.io/docs/

Source: https://itsvit.com/blog/5-heroic-tools-natural-language-processing/
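
To give a feel for the APIs, here is a minimal tokenization sketch with NLTK and spaCy (assumes both are pip-installed plus spaCy's small English model en_core_web_sm; the sample sentence is mine):

import nltk
from nltk.tokenize import word_tokenize
import spacy

nltk.download('punkt', quiet=True)  # tokenizer models needed by word_tokenize

text = 'Natural language processing turns raw text into features.'

# NLTK - plain function-based API, returns a list of strings
print(word_tokenize(text))

# spaCy - loads a trained pipeline and returns a rich Doc object
nlp = spacy.load('en_core_web_sm')
print([(token.text, token.pos_) for token in nlp(text)])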

#nlp #digest #libs
It is tricky to launch XGB fully on GPU. People report that on the same data CatBoost has inferior quality w/o tweaking (but is faster). LightGBM is reported to be faster and to have the same accuracy.

So I tried adding LightGBM with GPU support to my Dockerfile -
https://github.com/Microsoft/LightGBM/blob/master/docs/GPU-Tutorial.rst - but I encountered some driver issues in Docker.

One caveat I figured out - it supports only older Nvidia drivers, up to version 384.

Luckily, there is a Dockerfile by MS that seems to work (+ Jupyter, though I could not install extensions):
https://github.com/Microsoft/LightGBM/blob/master/docker/gpu/README.md
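
Once the GPU build is installed, switching LightGBM to the GPU is just a parameter change. A minimal sketch (parameter names are from the LightGBM docs; the random data is only a placeholder):

import lightgbm as lgb
import numpy as np

X, y = np.random.rand(10000, 50), np.random.rand(10000)
train_set = lgb.Dataset(X, label=y)

params = {
    'objective': 'regression',
    'device': 'gpu',        # requires the GPU build, otherwise raises an error
    'gpu_platform_id': 0,   # OpenCL platform to use
    'gpu_device_id': 0,     # which GPU on that platform
}
model = lgb.train(params, train_set, num_boost_round=100)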

#data_science
Found some starter boilerplate on how to use hyperopt instead of grid search for a faster parameter search (see the sketch after the links):
- here - https://goo.gl/ccXkuM
- and here - https://goo.gl/ktblo5
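
For reference, the fmin/TPE skeleton from the hyperopt docs boils down to this (the objective here is a toy stand-in - in practice it would wrap a cross-validated model score):

from hyperopt import fmin, tpe, hp, Trials

# search space: quantized tree depth + log-uniform learning rate
space = {
    'max_depth': hp.quniform('max_depth', 3, 10, 1),
    'learning_rate': hp.loguniform('learning_rate', -5, 0),
}

def objective(params):
    # toy loss - replace with a CV score of your model
    return (params['max_depth'] - 6) ** 2 + params['learning_rate']

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)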

#data_science
So, I have benched XGB vs LightGBM vs CatBoost. I also endured xgb and lgb GPU installation. This is just a general usage impression, not a hard benchmark.

My thoughts are below.

(1) Installation - CPU
(all) - are installed via pip or conda in one line

(2) Installation - GPU
(xgb) - easily done via following their instructions, only nvidia drivers required;
(lgb) - easily done on the Azure cloud. On Linux it requires some drivers that may be lagging behind. I could not integrate their instructions into my Dockerfile, but their own Dockerfile worked perfectly;
(cb) - instructions were too convoluted for me to follow;

(3) Docs / examples
(xgb) - the worst of the three: fine-tuning guidelines are murky and unpolished;
(lgb) - their Python API is not entirely well documented (e.g. some options can be found only on forums), but overall the docs are very decent + some fine-tuning hints;
(cb) - overall docs are nice, a lot of simple examples + some boilerplate in .ipynb format;

(4) Regression
(xgb) - works poorly. Maybe my params are bad, but the out-of-the-box params of the sklearn API take 5-10x more time than the rest and lag in accuracy;
(lgb) - best performing one out of the box, fast + accurate;
(cb) - fast but less accurate;

(5) Classification
(xgb) - best accuracy
(lgb) - fast, high accuracy
(cb) - fast, worse accuracy

(6) GPU usage
(xgb) - for some reason, accuracy with the full set of GPU options turned on is really bad. Forum advice does not help.
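
For context, the kind of out-of-the-box comparison I mean is roughly this sketch (synthetic data and default sklearn-style wrappers - my actual runs were on real datasets):

import time
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

X, y = make_regression(n_samples=50000, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# fit each library with default params on the same split, time it
for model in (XGBRegressor(), LGBMRegressor(), CatBoostRegressor(verbose=0)):
    start = time.time()
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    print(type(model).__name__, '%.1fs' % (time.time() - start), 'MSE %.1f' % mse)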

#data_science
I have seen questions on forums - how to add a Keras-like progress bar to PyTorch for simple models?

The answer is to use tqdm and this property
- https://goo.gl/cG6Ug8

This example is also great:

from tqdm import trange
from random import random, randint
from time import sleep

t = trange(100)
for i in t:
    # Description will be displayed on the left
    t.set_description('GEN %i' % i)
    # Postfix will be displayed on the right, and will format automatically
    # based on argument's datatype
    t.set_postfix(loss=random(), gen=randint(1, 999), str='h', lst=[1, 2])
    sleep(0.1)
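
And the same idea in an actual PyTorch training loop - a minimal sketch where the model, data, and optimizer are toy placeholders:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from tqdm import tqdm

model = nn.Linear(10, 1)
loader = DataLoader(TensorDataset(torch.randn(256, 10), torch.randn(256, 1)),
                    batch_size=32)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

pbar = tqdm(loader, desc='epoch 1')
for xb, yb in pbar:
    optimizer.zero_grad()
    loss = criterion(model(xb), yb)
    loss.backward()
    optimizer.step()
    # running loss is shown on the right, Keras-style
    pbar.set_postfix(loss='%.4f' % loss.item())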

#deep_learning
#pytorch
2018 DS/ML digest 6

Visualization
(1) A new amazing post by Google on Distill - https://distill.pub/2018/building-blocks/
This is really amazing work, but their notebooks tell me it is a far cry from being usable by the community - https://goo.gl/3c1Fza
This is how the CNN sees the image - https://goo.gl/S4KT5d
Expect this to be packaged as part of TensorBoard in a year or so.

Datasets
(1) New landmark dataset by Google - https://goo.gl/veSEhg - looks cool, but...
The prizes in the accompanying Kaggle competitions are laughable - https://goo.gl/EEGDEH https://goo.gl/JF93Xx
Given that the datasets are really huge (~300 GB)...
Also, if you win, you will have to buy a ticket to the USA with your own money...
(2) Useful script to download the images - https://goo.gl/JF93Xx
(3) ImageNet for satellite imagery - http://xviewdataset.org/#register - pre-register
Paper - https://arxiv.org/pdf/1802.07856.pdf
(4) CVPR 2018 for satellite imagery - http://deepglobe.org/challenge.html

Papers / new techniques
(1) Improving RNN performance via auxiliary loss - https://arxiv.org/pdf/1803.00144.pdf
(2) Satellite imaging for emergencies - https://arxiv.org/pdf/1803.00397.pdf
(3) Baidu - neural voice cloning - https://goo.gl/uJe852

Market
(1) Google TPU benchmarks - https://goo.gl/YKL9yx
As usual, such charts do not show consumer hardware.
My guess is that a single 1080Ti may deliver comparable performance (i.e. 30-40% of it) for ~US$700-1000, which at the announced ~US$6.50/hour TPU rate equals ~150 hours of rent (this is ~1 week!)
Miners say that a 1080Ti can work 1-2 years non-stop.
(2) MIT and SenseTime announce effort to advance artificial intelligence research https://goo.gl/MXB3V9
(3) Google released its ML course - https://goo.gl/jnVyNF - but generally it is a big TF ad... Andrew Ng's course is better for grasping the concepts.

Internet
(1) Interesting thing - all ISPs have preferential agreements with each other - https://goo.gl/sEvZMN


#digest
#data_science
#deep_learning
New articles about picking GPUs for DL
- https://blog.slavv.com/picking-a-gpu-for-deep-learning-3d4795c273b9
- https://goo.gl/h6PJqc

Also, in 4-GPU set-ups crammed into a single case, some of the cards will most likely need water cooling to avoid thermal throttling (a quick monitoring sketch is below).
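
A hedged sketch for spotting throttling, using the pynvml bindings (pip install pynvml; the SM clock dropping while the temperature climbs is the tell-tale sign):

import time
from pynvml import (nvmlInit, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
                    nvmlDeviceGetTemperature, nvmlDeviceGetClockInfo,
                    NVML_TEMPERATURE_GPU, NVML_CLOCK_SM)

nvmlInit()
handles = [nvmlDeviceGetHandleByIndex(i) for i in range(nvmlDeviceGetCount())]

while True:
    for i, h in enumerate(handles):
        temp = nvmlDeviceGetTemperature(h, NVML_TEMPERATURE_GPU)  # degrees C
        clock = nvmlDeviceGetClockInfo(h, NVML_CLOCK_SM)          # MHz, drops under throttling
        print('GPU %d: %d C, SM clock %d MHz' % (i, temp, clock))
    time.sleep(5)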

#deep_learning