Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
Found some starter boilerplate on how to use hyperopt instead of grid search for faster hyper-parameter search (a minimal sketch is below the links):
- here - https://goo.gl/ccXkuM
- and here - https://goo.gl/ktblo5
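
For reference, a minimal sketch of hyperopt's fmin + TPE in place of an exhaustive grid search (the objective and search space here are illustrative, not taken from the linked boilerplate):

from hyperopt import fmin, tpe, hp, STATUS_OK

def objective(params):
    # here you would train a model with `params` and return a validation loss
    loss = (params['x'] - 3) ** 2 + params['y']
    return {'loss': loss, 'status': STATUS_OK}

space = {
    'x': hp.uniform('x', -10, 10),   # continuous parameter
    'y': hp.choice('y', [0, 1, 2]),  # categorical parameter
}

# TPE samples promising points instead of exhaustively enumerating a grid
best = fmin(objective, space, algo=tpe.suggest, max_evals=100)
print(best)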

#data_science
So, I have benchmarked XGB vs LightGBM vs CatBoost. I also went through the xgb and lgb GPU installation. This is just a general usage impression, not a hard benchmark.

My thoughts are below.

(1) Installation - CPU
(all) - are installed via pip or conda in one line

(2) Installation - GPU
(xgb) - easily done by following their instructions; only nvidia drivers are required;
(lgb) - easily done on the Azure cloud; on Linux it requires some drivers that may be lagging behind. I could not integrate their instructions into my Dockerfile, but their Dockerfile worked perfectly;
(cb) - instructions were too convoluted for me to follow;

(3) Docs / examples
(xgb) - the worst one, fine-tuning guidelines are murky and unpolished;
(lgb) - their python API is not entirely well documented (e.g. some options can be found only on forums), but overall the docs are very decent + some fine-tuning hints;
(cb) - overall docs are nice, a lot of simple examples + some boilerplate in .ipynb format;

(4) Regression
(xgb) - works poorly out of the box. Maybe my params are bad, but with the default sklearn API params it takes 5-10x more time than the rest and lags in accuracy (see the out-of-the-box comparison sketch at the end of this post);
(lgb) - best performing one out of the box, fast + accurate;
(cb) - fast but less accurate;

(5) Classification
(xgb) - best accuracy
(lgb) - fast, high accuracy
(cb) - fast, worse accuracy

(6) GPU usage
(xgb) - for some reason, accuracy with the full set of GPU options enabled is really bad; forum advice does not help.
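
To make "out of the box" concrete, this is roughly the kind of comparison I mean (default sklearn-style APIs, no tuning; the dataset and metric are illustrative, not my actual benchmark):

from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

X, y = make_regression(n_samples=50_000, n_features=50, noise=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

models = {
    'xgb': XGBRegressor(),
    'lgb': LGBMRegressor(),
    'cb': CatBoostRegressor(verbose=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_val, model.predict(X_val))
    print(f'{name}: MSE = {mse:.4f}')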

#data_science
I have seen questions on forums: how do you add a Keras-like progress bar to PyTorch for simple models?

The answer is to use tqdm and this property
- https://goo.gl/cG6Ug8

This example is also great:

from tqdm import trange
from random import random, randint
from time import sleep

t = trange(100)
for i in t:
    # Description will be displayed on the left
    t.set_description('GEN %i' % i)
    # Postfix will be displayed on the right, and will format automatically
    # based on argument's datatype
    t.set_postfix(loss=random(), gen=randint(1, 999), str='h', lst=[1, 2])
    sleep(0.1)
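
And a minimal sketch of the same idea inside a PyTorch training loop (the model, data and optimizer here are illustrative):

import torch
from torch import nn
from tqdm import tqdm

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loader = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(100)]

pbar = tqdm(loader)
for x, y in pbar:
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    # show the running loss on the right of the bar, Keras-style
    pbar.set_postfix(loss=loss.item())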

#deep_learning
#pytorch
2018 DS/ML digest 6

Visualization
(1) An amazing new post by Google on Distill - https://distill.pub/2018/building-blocks/.
This is really amazing work, but their notebooks tell me that it is a far cry from being readily usable by the community - https://goo.gl/3c1Fza
This is how the CNN sees the image - https://goo.gl/S4KT5d
Expect this to be packaged as part of TensorBoard in a year or so.

Datasets
(1) New landmark dataset by Google - https://goo.gl/veSEhg - looks cool, but ...
Prizes in the accompanying Kaggle competitions are laughable - https://goo.gl/EEGDEH, https://goo.gl/JF93Xx
Given that the datasets are really huge (~300 GB)...
Also, if you win, you will have to buy a ticket to the USA with your own money ...
(2) Useful script to download the images https://goo.gl/JF93Xx
(3) Imagenet for satellite imagery - http://xviewdataset.org/#register - pre-register
https://arxiv.org/pdf/1802.07856.pdf paper
(4) CVPR 2018 for satellite imagery - http://deepglobe.org/challenge.html

Papers / new techniques
(1) Improving RNN performance via auxiliary loss - https://arxiv.org/pdf/1803.00144.pdf
(2) Satellite imaging for emergencies - https://arxiv.org/pdf/1803.00397.pdf
(3) Baidu - neural voice cloning - https://goo.gl/uJe852

Market
(1) Google TPU benchmarks - https://goo.gl/YKL9yx
As usual, such charts do not show consumer hardware.
My guess is that a single 1080Ti may deliver comparable performance (i.e. 30-40% of it) for ~US$700-1000, i.e. the price of ~150 hours of rent (this is ~1 week!)
Miners say that a 1080Ti can work for 1-2 years non-stop
(2) MIT and SenseTime announce effort to advance artificial intelligence research https://goo.gl/MXB3V9
(3) Google released its ML course - https://goo.gl/jnVyNF - but it is mostly a big TF ad ... Andrew Ng's course is better for grasping the concepts

Internet
(1) Interesting - all ISPs have some preferential agreements with one another - https://goo.gl/sEvZMN


#digest
#data_science
#deep_learning
New articles about picking GPUs for DL
- https://blog.slavv.com/picking-a-gpu-for-deep-learning-3d4795c273b9
- https://goo.gl/h6PJqc

Also, for 4-GPU set-ups in a single case, some of the cards will most likely need water cooling to avoid thermal throttling.

#deep_learning
Interesting / noteworthy semseg papers

In practice, UNet and LinkNet are the best and simplest solutions.
People rarely report that something like Tiramisu works properly.
Though I once saw, in the recent Konika competition, a good solution based on DenseNet + a standard decoder.
So I decided to read through some of the newer and older semseg papers.

Classic papers

UNet, LinkNet - nuff said
(0) Links
- UNet - http://arxiv.org/abs/1505.04597
- LinkNet - http://arxiv.org/abs/1707.03718

Older, overlooked, but interesting papers

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
One of the original papers that predates UNet
(0) https://arxiv.org/abs/1511.00561
(1) Basically UNet w/o skip connections, but it stores pooling indices
(2) SegNet uses the max pooling indices to upsample (without learning) the feature map(s) and convolves them with a trainable decoder filter bank (see the sketch below)
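
A tiny PyTorch sketch of the pooling-indices idea (layer sizes are illustrative): the encoder pools with return_indices=True, and the decoder upsamples with MaxUnpool2d using those indices, so only the decoder convolution is learned.

import torch
from torch import nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)
decoder_conv = nn.Conv2d(16, 16, 3, padding=1)  # trainable decoder filter bank

x = torch.randn(1, 16, 64, 64)
pooled, indices = pool(x)            # the encoder stores the argmax locations
upsampled = unpool(pooled, indices)  # sparse map, non-zero only at those locations
out = decoder_conv(upsampled)        # the convolution densifies the sparse map
print(out.shape)                     # torch.Size([1, 16, 64, 64])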


ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation
Paszke, Adam / Chaurasia, Abhishek / Kim, Sangpil / Culurciello, Eugenio
(0) Link
- http://arxiv.org/abs/1606.02147
(1) Key facts
- Up to 18× faster, 75× fewer FLOPs, 79× fewer parameters vs SegNet or FCN
- Supposedly runs on NVIDIA Jetson TX1 embedded systems
- Essentially a mixture of ResNet and Inception architectures
- Overview of the architecture
-- https://goo.gl/M6CPEv
-- https://goo.gl/b5Kb2S
(2) Interesting ideas
- Visual information is highly spatially redundant, and thus can be compressed into a more efficient representation
- Highly asymmetric - the decoder is much smaller
- Dilated convolutions in the middle => significant accuracy boost
- Dropout > L2
- A pooling operation in parallel with a stride-2 convolution, with the resulting feature maps concatenated (see the sketch below)
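
A minimal PyTorch sketch of that initial block (the 13 + 3 = 16 channel split follows the paper, but treat the exact sizes as illustrative):

import torch
from torch import nn

class InitialBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 13, kernel_size=3, stride=2, padding=1)
        self.pool = nn.MaxPool2d(2, stride=2)

    def forward(self, x):
        # both branches downsample by 2; concatenation keeps the cheap pooled features
        return torch.cat([self.conv(x), self.pool(x)], dim=1)

x = torch.randn(1, 3, 512, 512)
print(InitialBlock()(x).shape)  # torch.Size([1, 16, 256, 256])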


Newer papers

xView: Objects in Context in Overhead Imagery - new "Imagenet" for satellite images
(0) Link
- Will be available here http://xviewdataset.org/#register
(1) Examples
- https://goo.gl/JKr9wW
- https://goo.gl/TWRmn2
(2) Stats
- 0.3m ground sample distance
- 60 classes in 7 different parent classes
- 1 million labeled objects covering over 1,400 km2 of the earth’s surface
- classes https://goo.gl/v9CM5b
(3) Baseline
- Their SSD-based baseline has very poor performance (~20% mAP)

Rethinking Atrous Convolution for Semantic Image Segmentation
(0) Link
- http://arxiv.org/abs/1706.05587
- Liang-Chieh Chen / George Papandreou / Florian Schroff / Hartwig Adam
- Google Inc.
(1) Problems to be solved
- Reduced feature resolution
- Objects at multiple scales
(2) Key approaches
- Image pyramid (reportedly works poorly and requires a lot of memory)
- Encoder-decoder
- Spatial pyramid pooling (reportedly works poorly and requires a lot of memory)
(3) Key ideas
- Atrous (dilated) convolution - https://goo.gl/uSFCv5
- ResNet + Atrous convolutions - https://goo.gl/pUjUBS
- Atrous Spatial Pyramid Pooling block https://goo.gl/AiQZC1 - https://goo.gl/p63qNR (a rough sketch is at the end of this entry)
(4) Performance
- As with most of the latest semseg methods, the true performance boost is unclear
- I would argue that such methods may be useful for large objects
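
A rough PyTorch sketch of an ASPP-style block as described above (channel sizes and dilation rates are illustrative, not the paper's exact config):

import torch
from torch import nn

class ASPP(nn.Module):
    def __init__(self, in_ch=512, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        # a 1x1 branch + several 3x3 atrous (dilated) branches at different rates
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)]
            + [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1)

    def forward(self, x):
        # each branch sees a different effective receptive field on the same feature map
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

features = torch.randn(1, 512, 32, 32)
print(ASPP()(features).shape)  # torch.Size([1, 256, 32, 32])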



#digest
#deep_learning
An article about how to use CLI params in Python with argparse
- https://goo.gl/9wxDxh

If this is too slow, just use this as starter boilerplate (a minimal sketch is also below)
- https://goo.gl/Bm39Bc (this is how I learned it)
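
A minimal argparse starter along these lines (the script name and parameters are illustrative):

import argparse

parser = argparse.ArgumentParser(description='Train a model overnight')
parser.add_argument('--epochs', type=int, default=50, help='number of training epochs')
parser.add_argument('--lr', type=float, default=1e-3, help='learning rate')
parser.add_argument('--resume', action='store_true', help='resume from the last checkpoint')
args = parser.parse_args()

print(args.epochs, args.lr, args.resume)

# usage: python train.py --epochs 100 --lr 3e-4 --resume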

Why do you need this?
- Run long overnight (or even day long) jobs in python
- Run multiple experiments
- Make your code more tractable for other people
- Expose a simple API for others to use

The same can be done via newer frameworks, but why learn an abstraction that may die soon instead of using tools that have worked for decades?

#data_science
Internet digest
(1) Ben Evans - https://goo.gl/8f4RkE

Market
(1) Waymo launching pilot for the self-driving trucks - https://goo.gl/Bw2R9Q
(2) Netflix to spend US$8bn on ~700 shows in 2018 - https://goo.gl/6myKj6 (sic!)
(3) Intel vs Qualcomm and Broadcom - https://goo.gl/pa3iYB + Intel considering buying Broadcom - https://goo.gl/XP8fqd
(4) Amazon buys Ring - https://goo.gl/cnMw6o
(5) Latest dark market bust - Hansa - https://goo.gl/YcUxYD - it was not taken down at once, but first put under surveillance
- As with Silk Road, it all started with officials finding a server and making a copy of its hard drive
- This time - it was a dev server
- It contained ... owners' IRC accounts and some personal info

Internet + ML
(1) Netflix uses ML to generate thumbnails for its shows automatically - https://goo.gl/6poibk
- Features collected: manual annotation, meta-data, object detection, brightness, colour, face detection, blur, motion detection, actors, mature content

#internet
#digest
New Stack Overflow survey 2018
- https://insights.stackoverflow.com/survey/2018/

Key fact - global and USA Data Scientist salary
- Global https://goo.gl/AyYoVv
- USA https://goo.gl/CKdthV

Interesting facts
- Countries - https://goo.gl/2neadX
- How people learn - https://goo.gl/HxKuRH
- Git dominates version control - https://goo.gl/HDXVMj
- PyTorch is among the most loved frameworks https://goo.gl/66xJXs
- Connected stacks of technologies - https://goo.gl/pcXiNj
- Most popular languages and tools - https://goo.gl/GK32vn
- Most popular frameworks - https://goo.gl/Khjw87 (PyTorch =) )
- Most popular databases - https://goo.gl/TjTp65
- Attitude to rivalry - https://goo.gl/7mwWd2

#internet