Just found a book on practical Python programming patterns
- http://python-3-patterns-idioms-test.readthedocs.io/en/latest/PythonForProgrammers.html
Looks good
#python
Savva's company made a brief post about the competition (Savva was my teammate in the jungle contest)
https://www.objectstyle.com/news/savva-kolbachev-computer-vision-contest-win
A great survey on how to work with imbalanced data
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
#data_science
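A quick, hedged sketch of two common tactics from such surveys (class weighting and minority oversampling); the data and model here are illustrative, not from the article:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Illustrative 95/5 imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# Tactic 1: penalize mistakes on the rare class via class weights
clf = LogisticRegression(class_weight='balanced').fit(X, y)

# Tactic 2: oversample the minority class up to the majority size
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=int((y == 0).sum()), random_state=0)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])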
Forwarded from Data Science by ODS.ai 🦜
Most common libraries for Natural Language Processing (a tiny usage sketch follows the list):
CoreNLP, from the Stanford NLP group:
http://stanfordnlp.github.io/CoreNLP/index.html
NLTK, the most widely-mentioned NLP library for Python:
http://www.nltk.org/
TextBlob, a user-friendly and intuitive NLTK interface:
https://textblob.readthedocs.io/en/dev/index.html
Gensim, a library for document similarity analysis:
https://radimrehurek.com/gensim/
SpaCy, an industrial-strength NLP library built for performance:
https://spacy.io/docs/
Source: https://itsvit.com/blog/5-heroic-tools-natural-language-processing/
#nlp #digest #libs
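As promised above, a tiny hedged sketch of two of the listed libraries (assumes NLTK's 'punkt' tokenizer data can be downloaded; TextBlob may additionally need 'python -m textblob.download_corpora'):
import nltk
from textblob import TextBlob

text = "SpaCy is an industrial-strength NLP library. NLTK is the classic one."

# NLTK: word tokenization (downloads the 'punkt' tokenizer data if missing)
nltk.download('punkt', quiet=True)
print(nltk.word_tokenize(text))

# TextBlob: a friendlier interface on top of NLTK, with extras like sentiment
print(TextBlob(text).sentiment)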
A framework to deploy and maintain models, by Instacart - https://tech.instacart.com/how-to-build-a-deep-learning-model-in-15-minutes-a3684c6f71e - please tell me if anybody has tried it
It is tricky to get XGB fully running on GPU. People report that on the same data CatBoost has inferior quality without tweaking (but is faster), while LightGBM is reported to be faster with the same accuracy.
So I tried adding LightGBM with GPU support to my Dockerfile -
https://github.com/Microsoft/LightGBM/blob/master/docs/GPU-Tutorial.rst - but I ran into some driver issues in Docker.
One caveat I discovered: it supports only older Nvidia drivers, up to 384.
Luckily, there is a Dockerfile by Microsoft that seems to work (+ Jupyter, though I could not install extensions)
https://github.com/Microsoft/LightGBM/blob/master/docker/gpu/README.md
#data_science
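Once the GPU build is in place, switching LightGBM to the GPU is just a parameter change. A minimal sketch with illustrative data (the device params are documented in the tutorial above):
import lightgbm as lgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000, random_state=0)
train_set = lgb.Dataset(X, label=y)

params = {
    'objective': 'binary',
    'device': 'gpu',        # requires the GPU build from the tutorial above
    'gpu_platform_id': 0,   # OpenCL platform / device selection
    'gpu_device_id': 0,
}
booster = lgb.train(params, train_set, num_boost_round=100)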
Modern Pandas series about classic time-series algorithms
- https://tomaugspurger.github.io/modern-7-timeseries
Some basic boilerplate and baselines
#data_science
#time_series
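For flavor, a tiny time-series baseline sketch in pandas (illustrative data, not code from the series):
import numpy as np
import pandas as pd

idx = pd.date_range('2018-01-01', periods=365, freq='D')
ts = pd.Series(np.random.randn(365).cumsum(), index=idx)

monthly = ts.resample('M').mean()       # downsample to monthly means
smoothed = ts.rolling(window=7).mean()  # 7-day rolling-mean baseline
naive = ts.shift(1)                     # naive "yesterday" forecast baseline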
An amazing article about the most common warning in pandas
- https://www.dataquest.io/blog/settingwithcopywarning/
#data_science
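The classic trigger and its fix, in a minimal sketch (my own illustration, not code from the article):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Chained indexing may modify a temporary copy and raises SettingWithCopyWarning:
# df[df['a'] > 1]['b'] = 0
# The unambiguous fix is a single .loc assignment:
df.loc[df['a'] > 1, 'b'] = 0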
Found some starter boilerplate on how to use hyperopt instead of grid search for faster hyperparameter search (a minimal sketch below):
- here - https://goo.gl/ccXkuM
- and here - https://goo.gl/ktblo5
#data_science
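A minimal hedged hyperopt sketch of the same idea (the search space, model and data are illustrative, not taken from the linked boilerplate):
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, random_state=42)

def objective(params):
    model = XGBClassifier(max_depth=int(params['max_depth']),
                          learning_rate=params['learning_rate'],
                          n_estimators=100)
    score = cross_val_score(model, X, y, cv=3, scoring='accuracy').mean()
    # hyperopt minimizes, so return the negated accuracy as the loss
    return {'loss': -score, 'status': STATUS_OK}

space = {
    'max_depth': hp.quniform('max_depth', 3, 10, 1),
    'learning_rate': hp.loguniform('learning_rate', -5, 0),
}

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=20, trials=trials)
print(best)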
So, I have benchmarked XGB vs LightGBM vs CatBoost, and also endured the xgb and lgb GPU installation. This is just a general usage impression, not a hard benchmark.
My thoughts are below (a rough comparison sketch follows the list).
(1) Installation - CPU
(all) - installed via pip or conda in one line
(2) Installation - GPU
(xgb) - easily done by following their instructions; only Nvidia drivers required;
(lgb) - easily done on the Azure cloud; on Linux it requires some drivers that may be lagging. I could not integrate their instructions into my Dockerfile, but their own Dockerfile worked perfectly;
(cb) - instructions were too convoluted for me to follow;
(3) Docs / examples
(xgb) - the worst of the three; fine-tuning guidelines are murky and unpolished;
(lgb) - their Python API is not entirely well documented (e.g. some options can be found only on forums), but overall the docs are very decent + some fine-tuning hints;
(cb) - overall the docs are nice, with a lot of simple examples + some boilerplate in .ipynb format;
(4) Regression
(xgb) - works poorly. Maybe my params are bad, but with the out-of-the-box params of the sklearn API it takes 5-10x more time than the rest and lags in accuracy;
(lgb) - best performer out of the box, fast + accurate;
(cb) - fast, but less accurate;
(5) Classification
(xgb) - best accuracy;
(lgb) - fast, high accuracy;
(cb) - fast, worse accuracy;
(6) GPU usage
(xgb) - for some reason accuracy with the full set of GPU options is really bad; forum advice does not help.
#data_science
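As referenced above, a rough sketch of the kind of comparison I mean, under assumed default params (illustrative, not my actual benchmark code):
from time import time
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

X, y = make_classification(n_samples=20000, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit all three with out-of-the-box params on the same split
for name, model in [('xgb', XGBClassifier()),
                    ('lgb', LGBMClassifier()),
                    ('cb', CatBoostClassifier(verbose=0))]:
    start = time()
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    print('%s: accuracy %.4f in %.1fs' % (name, acc, time() - start))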
I have seen questions on forums: how do you add a Keras-like progress bar to PyTorch for simple models?
The answer is to use tqdm and its set_description / set_postfix methods
- https://goo.gl/cG6Ug8
This example is also great:
from tqdm import trange
from random import random, randint
from time import sleep

t = trange(100)
for i in t:
    # Description will be displayed on the left
    t.set_description('GEN %i' % i)
    # Postfix will be displayed on the right, and will format automatically
    # based on argument's datatype
    t.set_postfix(loss=random(), gen=randint(1, 999), str='h', lst=[1, 2])
    sleep(0.1)
#deep_learning
#pytorch
2018 DS/ML digest 6
Visualization
(1) A new amazing post by Google on Distill - https://distill.pub/2018/building-blocks/.
This is really amazing work, but their notebooks tell me it is a far cry from being usable by the community - https://goo.gl/3c1Fza
This is how the CNN sees the image - https://goo.gl/S4KT5d
Expect this to be packaged as part of TensorBoard in a year or so.
Datasets
(1) New landmark dataset by Google - https://goo.gl/veSEhg - looks cool, but ...
Prizes in the accompanying Kaggle competitions are laughable - https://goo.gl/EEGDEH https://goo.gl/JF93Xx
Given that the datasets are really huge (~300 GB) ...
Also, if you win, you will have to buy a ticket to the USA with your own money ...
(2) Useful script to download the images - https://goo.gl/JF93Xx
(3) ImageNet for satellite imagery - http://xviewdataset.org/#register - pre-register
Paper: https://arxiv.org/pdf/1802.07856.pdf
(4) CVPR 2018 for satellite imagery - http://deepglobe.org/challenge.html
Papers / new techniques
(1) Improving RNN performance via auxiliary loss - https://arxiv.org/pdf/1803.00144.pdf
(2) Satellite imaging for emergencies - https://arxiv.org/pdf/1803.00397.pdf
(3) Baidu - neural voice cloning - https://goo.gl/uJe852
Market
(1) Google TPU benchmarks - https://goo.gl/YKL9yx
As usual, such charts do not show consumer hardware.
My guess is that a single 1080Ti may deliver comparable performance (i.e. 30-40% of a TPU) for ~US$700-1000, i.e. the price of roughly 150 hours of TPU rent (about one week of non-stop use!)
Miners say a 1080Ti can run non-stop for 1-2 years
(2) MIT and SenseTime announce an effort to advance artificial intelligence research - https://goo.gl/MXB3V9
(3) Google released its ML course - https://goo.gl/jnVyNF - but it is largely a big TF ad ... Andrew Ng's course is better for grasping concepts
Internet
(1) An interesting thing - all ISPs have preferential agreements with each other - https://goo.gl/sEvZMN
#digest
#data_science
#deep_learning
New articles about picking GPUs for DL
- https://blog.slavv.com/picking-a-gpu-for-deep-learning-3d4795c273b9
- https://goo.gl/h6PJqc
Also, in a 4-GPU setup in one case, some of the cards will most likely need water cooling to avoid thermal throttling.
#deep_learning
The Gini coefficient: from economics to machine learning / Habrahabr
https://m.habrahabr.ru/company/ods/blog/350440/
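In ML practice the normalized Gini coefficient is commonly computed from ROC AUC as Gini = 2 * AUC - 1; a minimal check with illustrative data (not code from the article):
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7])

auc = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1  # normalized Gini from AUC
print('AUC=%.3f, Gini=%.3f' % (auc, gini))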