Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
2018 DS/ML digest 7

(!) Top SpaceNet solution report from Albu - a top contender in satellite competitions (!)
(0) Draft - https://goo.gl/4voioj
(1) Key ideas:
-- Upsampling works as well as transposed conv
-- Optimal BCE/DICE weights for SpaceNet three - 0.8/0.2
-- RGB => mean / min-max scaling
-- This algorithm was used for curvy lines - https://goo.gl/B3fYMi
-- Basic upsampling block - conv3x3, upsampling, conv3x3
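
Not from the report itself - just a minimal PyTorch sketch of such a decoder block; bilinear upsampling and ReLU activations are my assumptions, not Albu's exact code:

import torch.nn as nn

class UpsampleBlock(nn.Module):
    # conv3x3 -> upsampling -> conv3x3, as described above
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)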

NLP in Russian in usable form (!):
(1) A review on NER libraries for Russian
-- Review - https://habrahabr.ru/post/349864/
-- Rule based repository https://github.com/natasha/natasha
-- Also this can be done with CNNs - https://github.com/deepmipt/ner
-- Also, I guess, as with all CNNs - they are really hackable and simple, but require annotation

Papers / edu content / code
(1) RU - amazing article about logloss - https://alexanderdyakonov.wordpress.com/2018/03/12/логистическая-функция-ошибки/
(2) LSTMs are now applied in theory to memory access - https://arxiv.org/pdf/1803.02329.pdf
(3) Preprocessing pipeline for hand-written notes - https://mzucker.github.io/2016/09/20/noteshrink.html
(4) Nice article to build your intuition about latent spaces - https://goo.gl/ggWjG6

Market
(1) Deep Learning starts popping up in science - https://deepchem.io/about.html
(2) Fchollet about practice with CNNs - https://goo.gl/X9C2zx
(3) Google open-sources its DeepLab v3 for semantic segmentation with dilated convolutions - https://goo.gl/Kkztj2
(4) Google's AutoML is still nascent - https://goo.gl/YUPSP3

Just for lulz - if you have nothing to do - here is a list of key modern architectures worth just knowing about
# Architectures
* AlexNet: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks
* ZFNet: https://arxiv.org/abs/1311.2901
* VGG16: https://arxiv.org/abs/1505.06798
* ResNet: https://arxiv.org/abs/1704.06904
* GoogLeNet: https://arxiv.org/abs/1409.4842
* Inception: https://arxiv.org/abs/1512.00567
* Xception: https://arxiv.org/abs/1610.02357
* MobileNet: https://arxiv.org/abs/1704.04861
# Semantic Segmentation
* FCN: https://arxiv.org/abs/1411.4038
* SegNet: https://arxiv.org/abs/1511.00561
* UNet: https://arxiv.org/abs/1505.04597
* PSPNet: https://arxiv.org/abs/1612.01105
* DeepLab: https://arxiv.org/abs/1606.00915
* ICNet: https://arxiv.org/abs/1704.08545
* ENet: https://arxiv.org/abs/1606.02147
# Generative adversarial networks
* GAN: https://arxiv.org/abs/1406.2661
* DCGAN: https://arxiv.org/abs/1511.06434
* WGAN: https://arxiv.org/abs/1701.07875
* Pix2Pix: https://arxiv.org/abs/1611.07004
* CycleGAN: https://arxiv.org/abs/1703.10593
# Object detection
* RCNN: https://arxiv.org/abs/1311.2524
* Fast-RCNN: https://arxiv.org/abs/1504.08083
* Faster-RCNN: https://arxiv.org/abs/1506.01497
* SSD: https://arxiv.org/abs/1512.02325
* YOLO: https://arxiv.org/abs/1506.02640
* YOLO9000: https://arxiv.org/abs/1612.08242

#digest
#data_science
#deep_learning
There used to be only an unofficial Kaggle CLI tool; now there is an official Kaggle API tool
https://github.com/Kaggle/kaggle-api

Cool.
Lol..and ofc data download did not work...unlike the unofficial tool.
Maybe submits will work.
A practical note on using
df.to_feather()

Works really well if you have an NVMe drive and you want to save a large dataframe to disk in binary format.

If your NVMe drive is properly installed, it will give you 1.5-2+ GB/s read/write speeds, so even a 20+ GB dataframe reads back literally in seconds.

The ETL process to produce such a df may take minutes.
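
A minimal sketch of how this looks (the file name is just an example; requires pyarrow installed):

import pandas as pd

df = pd.DataFrame({'a': range(10_000_000), 'b': 1.0})
df.to_feather('df.feather')         # note - a DataFrame method, not a top-level pd function
df = pd.read_feather('df.feather')  # reading back IS a top-level pandas function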

#data_science
So, I have briefly watched Andrew Ng's series on RNNs.
It's super cool if you do not know much about RNNs and / or want to refresh your memory and / or want to jump-start your knowledge of NLP.
Also he explains stuff with really simple and clear illustrations.
The tasks in the course are also cool (notebook + submit button), but they are far removed from real practice, as they imply coding gradients and forward passes from scratch in python.
(which I did enough of during his classic course)
Also no GPU tricks / research / production boilerplate, ofc.

Below are ideas and references about NLP / RNNs that may be useful for everyone.

Also for NLP:
(0) Key NLP sota achievements in 2017
-- https://medium.com/@madrugado/advances-in-nlp-in-2017-b00e927fcc57
-- https://medium.com/@madrugado/advances-in-nlp-in-2017-part-ii-d8da391a3f01
(1) Consider fast.ai courses and notebooks https://github.com/fastai/courses/tree/master/deeplearning2
(2) Consider NLP newsletter http://newsletter.ruder.io
(3) Consider excellent PyTorch tutorials http://pytorch.org/tutorials/
(4) There is a lot of quality code in the PyTorch community (e.g. 1-2 page GloVe implementations!)
(5) Brief 1-hour intro to practical NLP https://www.youtube.com/watch?v=Ozm0bEi5KaI

Also related posts on the channel / libraries:
(1) Pre-trained vectors in Russian - https://t.me/snakers4/1623
(2) How to learn about CTC loss https://t.me/snakers4/1690 (when our seq2seq )
(3) Most popular NLP libraries for English - https://t.me/snakers4/1832
(4) NER in Russian - https://habrahabr.ru/post/349864/
(5) Lemmatization library in Russian - https://pymorphy2.readthedocs.io/en/latest/user/guide.html - recommended by a friend

Basic tasks considered more or less solved by RNNs
(1) Speech recognition / trigger word detection
(2) Music generation
(3) Sentiment analysis
(4) Machine translation
(5) Video activity recognition / tagging
(6) Named entity recognition (NER)

Problems with standard CNN when modelling sequences:
(1) Different length of input and output
(2) Features for different positions in the sequence are not shared
(3) Enormous number of params

Typical word representations
(1) One-hot encoded vectors (10-50k for typical solutions, 100k-1M for commercial / SOTA solutions)
(2) Learned embeddings - reduce the computational burden to ~300-500 dimensions instead of 10k+
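
A toy PyTorch sketch of the difference (vocabulary size and dimensions are made up):

import torch
import torch.nn as nn

vocab_size, emb_dim = 50_000, 300
# one-hot - every word is a sparse 50k-dimensional vector
one_hot = torch.zeros(vocab_size)
one_hot[42] = 1.0
# learned embedding - the same word is a dense 300-dimensional vector looked up by index
emb = nn.Embedding(vocab_size, emb_dim)
dense = emb(torch.tensor([42]))  # shape (1, 300)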

Typical rules of thumb / hacks / practical approaches for RNNs
(0) Typical architectures - deep GRU (lighter) and LSTM cells
(1) Tanh or ReLU for hidden-layer activations
(2) Sigmoid for output when classifying
(3) Usage of EndOfSentence, UNKnown word, StartOfSentence, etc tokens
(4) Usually word level models are used (not character level)
(5) Passing hidden state in encoder-decoder architectures
(6) Vanishing gradients - typically GRUs / LSTMs are used
(7) Very long sequences for time series - AR features are used instead of long windows; GRUs and LSTMs are typically good for sequences of length 200-300 (easier and more straightforward than attention in this case)
(8) Exploding gradients - the standard solution is clipping (see the sketch after this list), though it may lead to inferior results (from practice)
(9) Teacher forcing - substitute the predicted y_t+1 with the real value during the forward pass when training a seq2seq model
(10) Peephole connections - let the GRU or LSTM see c_t-1, the cell state from the previous step
(11) Finetune imported embeddings for smaller tasks with smaller datasets
(12) On big datasets - may make sense to learn embeddings from scratch
(13) Usage of bidirectional LSTMs / GRUs / Attention where applicable
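
A minimal PyTorch sketch for points (6), (8) and (13) - a bidirectional GRU classifier plus gradient clipping (all sizes and the clipping threshold are illustrative):

import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, emb_dim=300, hidden=256, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, tokens):             # tokens: (batch, seq_len)
        _, h = self.rnn(self.emb(tokens))  # h: (2, batch, hidden) - one state per direction
        return self.fc(torch.cat([h[0], h[1]], dim=1))

model = GRUClassifier()
opt = torch.optim.Adam(model.parameters())
logits = model(torch.randint(0, 10_000, (8, 50)))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (8,)))
loss.backward()
# exploding gradients - clip the global norm before the optimizer step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
opt.step()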

Typical similarity functions for high-dim vectors
(0) Cosine (angle)
(1) Euclidean
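
Both fit in a couple of numpy lines:

import numpy as np

def cosine_sim(a, b):
    # similarity by angle, invariant to vector magnitude
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_dist(a, b):
    return np.linalg.norm(a - b)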

Seminal papers / constructs / ideas:
(1) Training embeddings - the later the method came out, the simpler it is
- Matrix factorization techniques

- Naive approach using a language model + softmax (not tractable for large corpora)

- Negative sampling + skip gram + logistic regression = Word2Vec (2013)
-- http://arxiv.org/abs/1310.4546
-- useful ideas:
-- if the information is there, a simple model (i.e. logistic regression) will work
-- negative sampling - sample negative words with a probability between uniform and the 3/4 power of their corpus frequency, to downweight overly frequent words
-- train only a limited number of classifiers (i.e. 5-15: 1 positive sample + k negative) on each update - see the gensim sketch below
-- skip-gram model in a nutshell - http://prntscr.com/iwfwb2
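
In practice you rarely code this by hand - a minimal gensim sketch (hyperparameters are illustrative; ns_exponent=0.75 is the 3/4 power mentioned above, and older gensim versions call vector_size simply size):

from gensim.models import Word2Vec

sentences = [['the', 'cat', 'sat'], ['the', 'dog', 'barked']]  # toy corpus
model = Word2Vec(sentences,
                 vector_size=300,   # embedding dimension
                 sg=1,              # skip-gram
                 negative=5,        # k negative samples per positive pair
                 ns_exponent=0.75,  # 3/4 power smoothing of the unigram distribution
                 min_count=1)       # keep even rare words in this toy example
vector = model.wv['cat']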

- GloVe - Global Vectors (2014)
-- http://aclweb.org/anthology/D14-1162
-- supposedly GloVe is better than Word2Vec given the same resources - http://prntscr.com/iwf9bx
-- in practice word vectors with 200 dimensions are enough for applied tasks
-- considered to be one of the SOTA solutions now (afaik)

(2) BLEU score for translation
- essentially a brevity penalty times the exponent of the averaged logs of the modified n-gram precisions (n = 1..4)
- http://prntscr.com/iwe3v2
- http://dl.acm.org/citation.cfm?id=1073135
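
nltk has a ready implementation, so you do not have to count n-grams yourself (the example sentences are made up):

from nltk.translate.bleu_score import sentence_bleu

reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sits', 'on', 'the', 'mat']
# geometric mean of the modified 1..4-gram precisions times a brevity penalty
print(sentence_bleu(reference, candidate))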

(3) Attention is all you need
- http://arxiv.org/abs/1706.03762

To be continued.

#data_science
#nlp
#rnns
NLP project peculiarities
(0) Always handle new words somehow
(1) Easy evaluation of test results - you can just look at them
(2) The key difference is always in the domain - short or long sequences / sentences / whole documents require different features / models / transfer learning

Basic Approaches to modern NLP projects
https://www.youtube.com/watch?v=Ozm0bEi5KaI
(0) Basic pipeline
http://prntscr.com/iwhlsx

(1) Basic preprocessing
- Stemming / lemmatization
- Regular expressions
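
Both are one-liners with existing libraries - e.g. pymorphy2 for Russian lemmatization and nltk for English stemming (the words are just examples):

import pymorphy2
from nltk.stem.snowball import SnowballStemmer

morph = pymorphy2.MorphAnalyzer()
print(morph.parse('кошками')[0].normal_form)       # lemma of an inflected Russian word
print(SnowballStemmer('english').stem('running'))  # -> 'run'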

(2) Naive / old school approaches that can just work
- Bag of Words => simple model
- Bag of Words => tf-idf => SVD / PCA / NMF => simple model
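
A minimal sklearn sketch of the second pipeline (all hyperparameters are illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

texts = ['a good movie', 'a terrible movie']  # toy corpus
labels = [1, 0]

clf = make_pipeline(
    TfidfVectorizer(),             # bag of words + tf-idf weighting
    TruncatedSVD(n_components=1),  # the SVD step; use 100-300 components on real data
    LogisticRegression(),          # the simple model
)
clf.fit(texts, labels)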

(3) Embeddings
- Average / sum of Word2Vec embeddings
- Word2Vec * tf-idf >> Doc2Vec
- Small documents => embeddings work better
- Big documents => bag of features / high level features
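
The simplest document embedding from point (3) - just average the word vectors (assuming a gensim KeyedVectors model wv is already loaded; a tf-idf-weighted average is a common upgrade):

import numpy as np

def doc_vector(tokens, wv):
    # mean of word2vec vectors over tokens present in the vocabulary
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)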

(4) Sentiment analysis features
- http://prntscr.com/iwhzqk
- character n-grams => won several Kaggle competitions
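
Character n-grams are also a one-liner in sklearn (the n-gram range is just an example):

from sklearn.feature_extraction.text import TfidfVectorizer

char_vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 5))
X = char_vec.fit_transform(['not bad at all', 'pretty bad'])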

(5) Also a couple of articles for developing intuition for sentence2vec
- https://medium.com/@premrajnarkhede/sentence2vec-evaluation-of-popular-theories-part-i-simple-average-of-word-vectors-3399f1183afe
- https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e

(6) Transfer learning in NLP - looks like it may become more popular / prominent
- Jeremy Howard's preprint on NLP transfer learning - http://arxiv.org/abs/1801.06146

#data_science
#nlp
Poll: Posts on the website

Recent competitions - DS Bowl, konika, time series forecasting - 21 votes (60%)
GANs - 14 votes (40%)
Assembling devbox - 0 votes (0%)

35 people voted so far.
Poll: NLP related content on the channel?

Yes - 67 votes (92%)
Do not care - 4 votes (5%)
No - 2 votes (3%)

73 people voted so far.
Wow, PyTorch is so cool that it even has a ConcatDataset class

http://pytorch.org/docs/master/data.html#torch.utils.data.ConcatDataset

Does not work for datasets with different resolution though
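
A minimal usage sketch (the tensors are made up):

import torch
from torch.utils.data import TensorDataset, ConcatDataset, DataLoader

ds1 = TensorDataset(torch.randn(100, 3, 224, 224))
ds2 = TensorDataset(torch.randn(50, 3, 224, 224))
combined = ConcatDataset([ds1, ds2])  # len(combined) == 150
loader = DataLoader(combined, batch_size=16, shuffle=True)

If the two datasets returned images of different sizes, the default collate function would fail to stack them into one batch - hence the caveat above.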

#pytorch