Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
Just found a book on practical Python programming patterns
- http://python-3-patterns-idioms-test.readthedocs.io/en/latest/PythonForProgrammers.html

Looks good

#python
Most common libraries for Natural Language Processing:

CoreNLP from Stanford group:
http://stanfordnlp.github.io/CoreNLP/index.html

NLTK, the most widely-mentioned NLP library for Python:
http://www.nltk.org/

TextBlob, a user-friendly and intuitive NLTK interface:
https://textblob.readthedocs.io/en/dev/index.html

Gensim, a library for document similarity analysis:
https://radimrehurek.com/gensim/

spaCy, an industrial-strength NLP library built for performance:
https://spacy.io/docs/

Source: https://itsvit.com/blog/5-heroic-tools-natural-language-processing/
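
To give a feel for the APIs, here is a minimal tokenization sketch with NLTK and spaCy (assumes both are pip-installed plus spaCy's small English model en_core_web_sm; the sample sentence is mine):

import nltk
from nltk.tokenize import word_tokenize
import spacy

nltk.download('punkt', quiet=True)  # tokenizer models needed by word_tokenize

text = 'Natural language processing turns raw text into features.'

# NLTK - plain function-based API, returns a list of strings
print(word_tokenize(text))

# spaCy - loads a trained pipeline and returns a rich Doc object
nlp = spacy.load('en_core_web_sm')
print([(token.text, token.pos_) for token in nlp(text)])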

#nlp #digest #libs
It is tricky to launch XGB fully on GPU. People report that on the same data CatBoost has inferior quality w/o tweaking (but is faster). LightGBM is reported to be faster and to have the same accuracy.

So I tried adding LightGBM with GPU support to my Dockerfile -
https://github.com/Microsoft/LightGBM/blob/master/docs/GPU-Tutorial.rst - but I encountered some driver issues in Docker.

One caveat I figured out - it supports only older Nvidia drivers, up to version 384.

Luckily, there is a Dockerfile by MS that seems to work (+ Jupyter, though I could not install extensions):
https://github.com/Microsoft/LightGBM/blob/master/docker/gpu/README.md
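
Once the GPU build is installed, switching LightGBM to the GPU is just a parameter change. A minimal sketch (parameter names are from the LightGBM docs; the random data is only a placeholder):

import lightgbm as lgb
import numpy as np

X, y = np.random.rand(10000, 50), np.random.rand(10000)
train_set = lgb.Dataset(X, label=y)

params = {
    'objective': 'regression',
    'device': 'gpu',        # requires the GPU build, otherwise raises an error
    'gpu_platform_id': 0,   # OpenCL platform to use
    'gpu_device_id': 0,     # which GPU on that platform
}
model = lgb.train(params, train_set, num_boost_round=100)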

#data_science
Found some starter boilerplate on how to use hyperopt instead of grid search for a faster parameter search (see the sketch after the links):
- here - https://goo.gl/ccXkuM
- and here - https://goo.gl/ktblo5
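
For reference, the fmin/TPE skeleton from the hyperopt docs boils down to this (the objective here is a toy stand-in - in practice it would wrap a cross-validated model score):

from hyperopt import fmin, tpe, hp, Trials

# search space: quantized tree depth + log-uniform learning rate
space = {
    'max_depth': hp.quniform('max_depth', 3, 10, 1),
    'learning_rate': hp.loguniform('learning_rate', -5, 0),
}

def objective(params):
    # toy loss - replace with a CV score of your model
    return (params['max_depth'] - 6) ** 2 + params['learning_rate']

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)
print(best)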

#data_science
So, I have benched XGB vs LightGBM vs CatBoost. I also endured xgb and lgb GPU installation. This is just a general usage impression, not a hard benchmark.

My thoughts are below.

(1) Installation - CPU
(all) - are installed via pip or conda in one line

(2) Installation - GPU
(xgb) - easily done via following their instructions, only nvidia drivers required;
(lgb) - easily done on the Azure cloud. On Linux it requires some drivers that may be lagging behind. I could not integrate their instructions into my Dockerfile, but their own Dockerfile worked perfectly;
(cb) - instructions were too convoluted for me to follow;

(3) Docs / examples
(xgb) - the worst of the three: fine-tuning guidelines are murky and unpolished;
(lgb) - their Python API is not entirely well documented (e.g. some options can be found only on forums), but overall the docs are very decent + some fine-tuning hints;
(cb) - overall docs are nice, a lot of simple examples + some boilerplate in .ipynb format;

(4) Regression
(xgb) - works poorly. Maybe my params are bad, but the out-of-the-box params of the sklearn API take 5-10x more time than the rest and lag in accuracy;
(lgb) - best performing one out of the box, fast + accurate;
(cb) - fast but less accurate;

(5) Classification
(xgb) - best accuracy
(lgb) - fast, high accuracy
(cb) - fast, worse accuracy

(6) GPU usage
(xgb) - for some reason, accuracy with the full set of GPU options turned on is really bad. Forum advice does not help.
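
For context, the kind of out-of-the-box comparison I mean is roughly this sketch (synthetic data and default sklearn-style wrappers - my actual runs were on real datasets):

import time
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

X, y = make_regression(n_samples=50000, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# fit each library with default params on the same split, time it
for model in (XGBRegressor(), LGBMRegressor(), CatBoostRegressor(verbose=0)):
    start = time.time()
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    print(type(model).__name__, '%.1fs' % (time.time() - start), 'MSE %.1f' % mse)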

#data_science
I have seen questions on forums - how to add a Keras-like progress bar to PyTorch for simple models?

The answer is to use tqdm and this property
- https://goo.gl/cG6Ug8

This example is also great:

from tqdm import trange
from random import random, randint
from time import sleep

t = trange(100)
for i in t:
    # Description will be displayed on the left
    t.set_description('GEN %i' % i)
    # Postfix will be displayed on the right, and will format automatically
    # based on argument's datatype
    t.set_postfix(loss=random(), gen=randint(1, 999), str='h', lst=[1, 2])
    sleep(0.1)
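
And the same idea in an actual PyTorch training loop - a minimal sketch where the model, data, and optimizer are toy placeholders:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from tqdm import tqdm

model = nn.Linear(10, 1)
loader = DataLoader(TensorDataset(torch.randn(256, 10), torch.randn(256, 1)),
                    batch_size=32)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

pbar = tqdm(loader, desc='epoch 1')
for xb, yb in pbar:
    optimizer.zero_grad()
    loss = criterion(model(xb), yb)
    loss.backward()
    optimizer.step()
    # running loss is shown on the right, Keras-style
    pbar.set_postfix(loss='%.4f' % loss.item())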

#deep_learning
#pytorch
2018 DS/ML digest 6

Visualization
(1) A new amazing post by Google on Distill - https://distill.pub/2018/building-blocks/
This is really amazing work, but their notebooks tell me it is a far cry from being usable by the community - https://goo.gl/3c1Fza
This is how the CNN sees the image - https://goo.gl/S4KT5d
Expect this to be packaged as part of TensorBoard in a year or so.

Datasets
(1) New landmark dataset by Google - https://goo.gl/veSEhg - looks cool, but...
The prizes in the accompanying Kaggle competitions are laughable - https://goo.gl/EEGDEH https://goo.gl/JF93Xx
Given that the datasets are really huge (~300 GB)...
Also, if you win, you will have to buy a ticket to the USA with your own money...
(2) Useful script to download the images - https://goo.gl/JF93Xx
(3) ImageNet for satellite imagery - http://xviewdataset.org/#register - pre-register
Paper - https://arxiv.org/pdf/1802.07856.pdf
(4) CVPR 2018 for satellite imagery - http://deepglobe.org/challenge.html

Papers / new techniques
(1) Improving RNN performance via auxiliary loss - https://arxiv.org/pdf/1803.00144.pdf
(2) Satellite imaging for emergencies - https://arxiv.org/pdf/1803.00397.pdf
(3) Baidu - neural voice cloning - https://goo.gl/uJe852

Market
(1) Google TPU benchmarks - https://goo.gl/YKL9yx
As usual, such charts do not show consumer hardware.
My guess is that a single 1080Ti may deliver comparable performance (i.e. 30-40% of it) for ~US$700-1000, which at the announced ~US$6.50/hour TPU rate equals ~150 hours of rent (this is ~1 week!)
Miners say that a 1080Ti can work 1-2 years non-stop.
(2) MIT and SenseTime announce effort to advance artificial intelligence research https://goo.gl/MXB3V9
(3) Google released its ML course - https://goo.gl/jnVyNF - but generally it is a big TF ad... Andrew Ng's course is better for grasping the concepts.

Internet
(1) Interesting thing - all ISPs have preferential agreements with each other - https://goo.gl/sEvZMN


#digest
#data_science
#deep_learning
New articles about picking GPUs for DL
- https://blog.slavv.com/picking-a-gpu-for-deep-learning-3d4795c273b9
- https://goo.gl/h6PJqc

Also, in 4-GPU set-ups crammed into a single case, some of the cards will most likely need water cooling to avoid thermal throttling (a quick monitoring sketch is below).
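
A hedged sketch for spotting throttling, using the pynvml bindings (pip install pynvml; the SM clock dropping while the temperature climbs is the tell-tale sign):

import time
from pynvml import (nvmlInit, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
                    nvmlDeviceGetTemperature, nvmlDeviceGetClockInfo,
                    NVML_TEMPERATURE_GPU, NVML_CLOCK_SM)

nvmlInit()
handles = [nvmlDeviceGetHandleByIndex(i) for i in range(nvmlDeviceGetCount())]

while True:
    for i, h in enumerate(handles):
        temp = nvmlDeviceGetTemperature(h, NVML_TEMPERATURE_GPU)  # degrees C
        clock = nvmlDeviceGetClockInfo(h, NVML_CLOCK_SM)          # MHz, drops under throttling
        print('GPU %d: %d C, SM clock %d MHz' % (i, temp, clock))
    time.sleep(5)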

#deep_learning