Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
New Stack Overflow survey 2018
- https://insights.stackoverflow.com/survey/2018/

Key facts - global and USA Data Scientist salaries
- Global https://goo.gl/AyYoVv
- USA https://goo.gl/CKdthV

Interesting facts
- Countries - https://goo.gl/2neadX
- How people learn - https://goo.gl/HxKuRH
- Git dominates version control - https://goo.gl/HDXVMj
- PyTorch is among the most loved frameworks - https://goo.gl/66xJXs
- Connected stacks of technologies - https://goo.gl/pcXiNj
- Most popular languages and tools - https://goo.gl/GK32vn
- Most popular frameworks - https://goo.gl/Khjw87 (PyTorch =) )
- Most popular databases - https://goo.gl/TjTp65
- Attitude to rivalry - https://goo.gl/7mwWd2

#internet
New cool trick - use
df.to_feather()
instead of pickle or csv

It is supposed to work much faster, as it dumps the data the same way it is laid out in RAM.
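
A minimal sketch of this (assuming pandas with pyarrow installed; the dataframe contents and file name are just examples):

import pandas as pd
import numpy as np

# toy dataframe; in practice this would be the output of your ETL
df = pd.DataFrame({
    "a": np.random.rand(1_000_000),
    "b": np.random.randint(0, 10, size=1_000_000),
})

# feather (Apache Arrow) keeps the columnar in-memory layout on disk,
# so serialization is closer to a memcpy than to text formatting
df.to_feather("df.feather")
df2 = pd.read_feather("df.feather")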

#data_science
2018 DS/ML digest 7

(!) Top SpaceNet solution report from Albu - a top contender in satellite competitions (!)
(0) Draft - https://goo.gl/4voioj
(1) Key ideas:
-- Upsampling works as well as transposed conv
-- Optimal BCE/DICE weights for SpaceNet three - 0.8/0.2
-- RGB => mean / min-max scaling
-- This algorithm was used for curvy lines - https://goo.gl/B3fYMi
-- Basic upsampling block - conv3x3, upsampling, conv3x3
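
A minimal PyTorch sketch of such a block (channel sizes, activations and the upsampling mode are my assumptions - the report only fixes the conv3x3 / upsampling / conv3x3 order):

import torch.nn as nn

class UpsamplingBlock(nn.Module):
    # decoder block: conv3x3 -> 2x upsampling -> conv3x3 (instead of a transposed conv)
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)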

NLP in Russian in usable form (!):
(1) A review on NER libraries for Russian
-- Review - https://habrahabr.ru/post/349864/
-- Rule based repository https://github.com/natasha/natasha
-- Also this can be done with CNNs - https://github.com/deepmipt/ner
-- Also, I guess that as with all CNNs, they are really hackable and simple, but require annotation

Papers / edu content / code
(1) RU - amazing article about logloss - https://alexanderdyakonov.wordpress.com/2018/03/12/логистическая-функция-ошибки/
(2) LSTMs are now applied to modelling memory access patterns - https://arxiv.org/pdf/1803.02329.pdf
(3) Preprocessing pipeline for hand-written notes - https://mzucker.github.io/2016/09/20/noteshrink.html
(4) Nice article to build your intuition about latent spaces - https://goo.gl/ggWjG6

Market
(1) Deep learning is starting to pop up in science - https://deepchem.io/about.html
(2) Fchollet about practice with CNNs - https://goo.gl/X9C2zx
(3) Google open-sources its DeepLab v3 for semantic segmentation with dilated (atrous) convolutions - https://goo.gl/Kkztj2
(4) Google's AutoML is still nascent - https://goo.gl/YUPSP3

Just for lulz - if you have nothing to do - here is a list of key modern architectures that are worth knowing about
# Architectures
* AlexNet: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks
* ZFNet: https://arxiv.org/abs/1311.2901
* VGG16: https://arxiv.org/abs/1409.1556
* ResNet: https://arxiv.org/abs/1512.03385
* GoogLeNet: https://arxiv.org/abs/1409.4842
* Inception: https://arxiv.org/abs/1512.00567
* Xception: https://arxiv.org/abs/1610.02357
* MobileNet: https://arxiv.org/abs/1704.04861
# Semantic Segmentation
* FCN: https://arxiv.org/abs/1411.4038
* SegNet: https://arxiv.org/abs/1511.00561
* UNet: https://arxiv.org/abs/1505.04597
* PSPNet: https://arxiv.org/abs/1612.01105
* DeepLab: https://arxiv.org/abs/1606.00915
* ICNet: https://arxiv.org/abs/1704.08545
* ENet: https://arxiv.org/abs/1606.02147
# Generative adversarial networks
* GAN: https://arxiv.org/abs/1406.2661
* DCGAN: https://arxiv.org/abs/1511.06434
* WGAN: https://arxiv.org/abs/1701.07875
* Pix2Pix: https://arxiv.org/abs/1611.07004
* CycleGAN: https://arxiv.org/abs/1703.10593
# Object detection
* RCNN: https://arxiv.org/abs/1311.2524
* Fast-RCNN: https://arxiv.org/abs/1504.08083
* Faster-RCNN: https://arxiv.org/abs/1506.01497
* SSD: https://arxiv.org/abs/1512.02325
* YOLO: https://arxiv.org/abs/1506.02640
* YOLO9000: https://arxiv.org/abs/1612.08242

#digest
#data_science
#deep_learning
Before, there was only an unofficial Kaggle CLI tool; now there is an official Kaggle API tool
https://github.com/Kaggle/kaggle-api

Cool.
Lol... and of course data download did not work... unlike the unofficial tool.
Maybe submissions will work.
A practical note on using
df.to_feather()

It works really well if you have an NVMe drive and want to save a large dataframe to disk in binary format.

If your NVMe drive is properly installed, it will give you 1.5-2+ GB/s read/write speeds, so even if your df is 20+ GB in size, it will read back literally in seconds.

The ETL process to produce such a df may take minutes.

#data_science
So, I have briefly watched Andrew Ng's series on RNNs.
It's super cool if you do not know much about RNNs and / or want to refresh your memory and / or want to jump-start your knowledge of NLP.
Also he explains stuff with really simple and clear illustrations.
Tasks in the course are also cool (notebook + submit button), but they are quite removed from real practice, as they imply coding gradients and forward passes from scratch in Python.
(which I did enough during his classic course)
Also no GPU tricks / no research / production boilerplate ofc.

Below are ideas and references about NLP / RNNs that may be useful for everyone.

Also for NLP:
(0) Key NLP sota achievements in 2017
-- https://medium.com/@madrugado/advances-in-nlp-in-2017-b00e927fcc57
-- https://medium.com/@madrugado/advances-in-nlp-in-2017-part-ii-d8da391a3f01
(1) Consider fast.ai courses and notebooks https://github.com/fastai/courses/tree/master/deeplearning2
(2) Consider NLP newsletter http://newsletter.ruder.io
(3) Consider excellent PyTorch tutorials http://pytorch.org/tutorials/
(4) There is a lot of quality code in the PyTorch community (e.g. 1-2 page GloVe implementations!)
(5) Brief 1-hour intro to practical NLP https://www.youtube.com/watch?v=Ozm0bEi5KaI

Also related posts on the channel / libraries:
(1) Pre-trained vectors in Russian - https://t.me/snakers4/1623
(2) How to learn about CTC loss - https://t.me/snakers4/1690
(3) Most popular NLP libraries for English - https://t.me/snakers4/1832
(4) NER in Russian - https://habrahabr.ru/post/349864/
(5) Lemmatization library in Russian - https://pymorphy2.readthedocs.io/en/latest/user/guide.html - recommended by a friend

Basic tasks considered more or less solved by RNNs
(1) Speech recognition / trigger word detection
(2) Music generation
(3) Sentiment analysis
(4) Machine translation
(5) Video activity recognition / tagging
(6) Named entity recognition (NER)

Problems with standard CNN when modelling sequences:
(1) Different length of input and output
(2) Features for different positions in the sequence are not shared
(3) Enormous number of params

Typical word representations
(1) One-hot encoded vectors (vocabularies of 10-50k in typical solutions, 100k-1m in commercial / SOTA solutions)
(2) Learned embeddings - reduce computation burden to ~300-500 dimensions instead of 10k+

Typical rules of thumb / hacks / practical approaches for RNNs
(0) Typical architectures - deep GRU (lighter) and LSTM cells
(1) Tanh or ReLU for the hidden layer activation
(2) Sigmoid for output when classifying
(3) Usage of EndOfSentence, UNKnown word, StartOfSentence, etc tokens
(4) Usually word level models are used (not character level)
(5) Passing hidden state in encoder-decoder architectures
(6) Vanishing gradients - typically GRUs / LSTMs are used
(7) Very long sequences in time series - AR features are used instead of long windows; typically GRUs and LSTMs are good for sequences of 200-300 steps (easier and more straightforward than attention in this case)
(8) Exploding gradients - standard solution - clipping, though it may lead to inferior results (from practice)
(9) Teacher forcing - substitute the predicted y_{t+1} with the ground-truth value during the forward pass when training a seq2seq model
(10) Peephole connections - let the LSTM / GRU gates see the cell state c_{t-1} from the previous step
(11) Finetune imported embeddings for smaller tasks with smaller datasets
(12) On big datasets - may make sense to learn embeddings from scratch
(13) Usage of bidirectional LSTMs / GRUs / Attention where applicable
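
To make some of the above concrete, here is a minimal PyTorch sketch of a toy text classifier (all sizes, names and hyperparameters are illustrative assumptions) showing a learned embedding layer, a deep bidirectional GRU and gradient clipping:

import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=300, hidden=128, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)  # can be initialized from GloVe / Word2Vec
        self.rnn = nn.GRU(emb_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)    # 2x because of the two directions

    def forward(self, tokens):                        # tokens: (batch, seq_len) of word indices
        x = self.emb(tokens)
        _, h = self.rnn(x)                            # h: (num_layers * 2, batch, hidden)
        h = torch.cat([h[-2], h[-1]], dim=1)          # last layer, forward + backward states
        return self.fc(h)

model = TextClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, 10000, (32, 200))           # fake batch, sequence length 200
labels = torch.randint(0, 2, (32,))

opt.zero_grad()
loss = loss_fn(model(tokens), labels)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clipping against exploding gradients
opt.step()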

Typical similarity functions for high-dim vectors
(0) Cosine (angle)
(1) Euclidean
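
A tiny numpy illustration (the vectors below are random stand-ins for real word embeddings):

import numpy as np

def cosine_similarity(u, v):
    # angle-based: ignores vector magnitude, which is usually what you want for embeddings
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean_distance(u, v):
    # magnitude-sensitive alternative
    return np.linalg.norm(u - v)

king, queen = np.random.rand(300), np.random.rand(300)
print(cosine_similarity(king, queen), euclidean_distance(king, queen))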

Seminal papers / constructs / ideas:
(1) Training embeddings - the later a method came out, the simpler it is
- Matrix factorization techniques

- Naive approach using a language model + softmax (not tractable for large corpora)

- Negative sampling + skip gram + logistic regression = Word2Vec (2013)
-- http://arxiv.org/abs/1310.4546
-- useful ideas
-- if the information is there, a simple model (e.g. logistic regression) will work
-- negative sampling - sample negative words with probability proportional to the 3/4 power of their corpus frequency (in between a uniform and the raw unigram distribution), to downweight very frequent words
-- train only a limited number of classifiers (e.g. 5-15: 1 positive sample + k negatives) on each update
-- skip-gram model in a nutshell - http://prntscr.com/iwfwb2
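
A toy sketch of the negative-sampling noise distribution (the corpus and k are made up for illustration; this is only the sampling part, not the full skip-gram training):

import numpy as np
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the log".split()
counts = Counter(corpus)
words = list(counts.keys())

# word2vec-style noise distribution: unigram counts raised to the 3/4 power,
# which sits between a uniform distribution and the raw corpus frequency
freqs = np.array([counts[w] for w in words], dtype=np.float64)
probs = freqs ** 0.75
probs /= probs.sum()

k = 5  # negatives per positive (center word, context word) pair
negatives = np.random.choice(words, size=k, p=probs)
print(negatives)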

- GloVe - Global Vectors (2014)
-- http://aclweb.org/anthology/D14-1162
-- supposedly GloVe is better than Word2Vec given the same resources - http://prntscr.com/iwf9bx
-- in practice word vectors with 200 dimensions are enough for applied tasks
-- considered to be one of the SOTA solutions now (afaik)

(2) BLEU score for translation
- essentially a brevity penalty times the exponential of the average of log modified n-gram precisions for n = 1..4
- http://prntscr.com/iwe3v2
- http://dl.acm.org/citation.cfm?id=1073135
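
A toy single-reference BLEU sketch to make the formula concrete (real implementations add proper smoothing; the 1e-9 floor here is just a hack to avoid log(0)):

from collections import Counter
import math

def bleu(candidate, reference, max_n=4):
    # brevity penalty * exp(mean of log modified n-gram precisions)
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        # "modified" precision: clip candidate n-gram counts by their counts in the reference
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    brevity_penalty = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat is on the mat".split(), "there is a cat on the mat".split()))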

(3) Attention is all you need
- http://arxiv.org/abs/1706.03762

To be continued.

#data_science
#nlp
#rnns
NLP project peculiarities
(0) You always have to handle new (out-of-vocabulary) words somehow
(1) Easy evaluation of test results - you can just look at them
(2) The key difference is always in the domain - short vs. long sequences / sentences / whole documents require different features / models / transfer learning

Basic Approaches to modern NLP projects
https://www.youtube.com/watch?v=Ozm0bEi5KaI
(0) Basic pipeline
http://prntscr.com/iwhlsx

(1) Basic preprocessing
- Stemming / lemmatization
- Regular expressions
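
A small illustration of this step (assuming nltk is available; for Russian, a lemmatizer like pymorphy2, linked earlier, would be used instead of a stemmer):

import re
from nltk.stem.snowball import SnowballStemmer

text = "The cats were running across 2 different rooms!!!"

# regex cleanup: lowercase and keep letter-only tokens
tokens = re.findall(r"[a-z]+", text.lower())

# stemming = crude suffix stripping ("running" -> "run");
# lemmatization would map words to their dictionary forms instead
stemmer = SnowballStemmer("english")
print([stemmer.stem(t) for t in tokens])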

(2) Naive / old school approaches that can just work
- Bag of Words => simple model
- Bag of Words => tf-idf => SVD / PCA / NMF => simple model
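
A minimal sklearn sketch of the second pipeline (the texts, labels and n_components are toy values for illustration):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

texts = ["great movie, loved it", "terrible plot and awful acting",
         "loved the acting", "awful, terrible movie"]
labels = [1, 0, 1, 0]

# bag of words -> tf-idf -> SVD (a.k.a. LSA) -> simple linear model
clf = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2),
    LogisticRegression(),
)
clf.fit(texts, labels)
print(clf.predict(["loved this movie"]))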

(3) Embeddings
- Average / sum of Word2Vec embeddings
- Word2Vec * tf-idf >> Doc2Vec
- Small documents => embeddings work better
- Big documents => bag of features / high level features

(4) Sentiment analysis features
- http://prntscr.com/iwhzqk
- char n-grams => won several Kaggle competitions
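
A hedged sklearn sketch of char n-gram features (the n-gram range and feature cap are typical values, not the actual winning settings; train_texts / train_labels are hypothetical):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# character n-grams are robust to typos and word obfuscation,
# which helps a lot in noisy user-generated text
char_model = make_pipeline(
    TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 5), max_features=50000),
    LogisticRegression(),
)
# char_model.fit(train_texts, train_labels)  # hypothetical training data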

(5) Also a couple of articles for developing intuition for sentence2vec
- https://medium.com/@premrajnarkhede/sentence2vec-evaluation-of-popular-theories-part-i-simple-average-of-word-vectors-3399f1183afe
- https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e

(6) Transfer learning in NLP - looks like it may become more popular / prominent
- Jeremy Howard's preprint on NLP transfer learning - http://arxiv.org/abs/1801.06146

#data_science
#nlp