Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
Easiest solutions to manage configs for ML models

When you run a lot of experiments, you need to minimize code bulk and manage model configs concisely.
(Much of this can also be done via CLI parameters, but usually the two approaches complement each other)

I know 3 ways:

(0) dicts + kwargs + dotdicts
(1) [attrs](https://github.com/python-attrs/attrs)
(2) Python 3.7's new [dataclasses](https://docs.python.org/3/library/dataclasses.html) (which are very similar to attrs)
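A minimal sketch of option (2); the config fields below are made up for illustration, and `asdict()` bridges back to option (0):

```python
from dataclasses import dataclass, asdict, replace

# A hypothetical training config; field names are illustrative
@dataclass
class TrainConfig:
    lr: float = 1e-3
    batch_size: int = 32
    epochs: int = 10
    model_name: str = "resnet18"

base = TrainConfig()

# Spawn a per-experiment variant without mutating the base config
exp = replace(base, lr=3e-4, epochs=50)

# asdict() gives a plain dict, e.g. to pass around as **kwargs (option 0)
exp_kwargs = asdict(exp)
```

You get defaults, type hints, and repr for free; attrs works almost identically.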

Which one do you use?

#data_science
2019 DS / ML digest 18
Link

Highlights of the week(s):

- Speech Vocoder w/o GPU on inference
- Publish ML web-apps w/o web frameworks via streamlit in pure python
- Unsupervised pre-training on non-curated image datasets
- PyTorch's popularity in research
- Why ML in medicine does not work and how to solve it via ontologies

#digest
#deep_learning
Streamlit vs. Voila vs. Panel vs. Dash vs. Bokeh server

TLDR - make scientific web-apps via python only, w/o any web programming (e.g. django, tornado).

Dash
- Mostly for BI
- Also a paid product
- Looks like the new Tableau
- Serving and out-of-the-box scaling options

Bokeh server
- Mostly plotting (very flexible, unlimited capabilities)
- High entry cost, though bokeh is kind of easy to use
- Also should scale well

Panel
- A bokeh server wrapper with a lot of capabilities for geo + templates

Streamlit
- The nicest looking app for interactive ML apps (maybe even annotation)
- Has pre-built styles and grid
- Limited only to its pre-built widgets
- Built on tornado with a very specific data model incompatible with the majority of available widgets
- Supposed to scale well - built on top of tornado

Voila
- If it runs in a notebook, it will run in Voila
- Just turns a notebook into a server
- The app with the most promise for DS / ML
- Scales kind of meh - you need to run a jupyter kernel for each user - also takes some time to spin up a kernel
- Fully benefits from a rich ecosystem of jupyter / python / widgets
- In theory has customizable grid and CSS, but does not come pre-built with this => higher barrier to entry

Also, most of these apps have no built-in authentication.

More details:

- A nice summary here;
- A very detailed pros-and-cons summary of Streamlit + Voila, plus a very in-depth discussion;
- Also, the awesome-streamlit boilerplate is awesome;

#data_science
Amazing hardware YouTube channel (RU)

Link.

Smart, in-depth, highly analytical, no bs / ads / cringe / sensationalism. Not your typical Russian channel, not Linus Tech Tips or similar.

Example videos:

- What hardware companies could do
- Choosing a PSU

#hardware
Viewing jupyter notebook diffs properly

Tried this tool - https://www.reviewnb.com/
It promises proper notebook diff rendering, especially when images are added.

It does render such diffs properly ... but in THEIR third-party interface, not within GitHub (maybe I was naive to expect otherwise), which makes it kind of pointless.
Now you do not have to try it
A Minute of Real Talk

I have not written posts in Russian for a long time, and in general I try to steer clear of controversial things, but enough has piled up to warrant one.

Backstory

- We built a very diverse 20,000-hour dataset for STT (for comparison, the largest public English datasets add up to around 5,000 hours, and their domains are not very diverse);
- It was not exactly received coldly - more like the typical armchair attitude: "the country will not forget you" / "you can't do that" / "you violated someone's rights" (those few who had something substantive to say said it, but there were very few of them) / "give it to us for beads" / insert your own;
- The dataset is publicly available, but under a CC BY-NC license, and some substantial parts of it are hidden so that comrades with that attitude do not get in the way. Still, for anyone who wants to play with speech, there is more than enough data;
- Individual cases of commercial companies simply using it while ignoring the license are worth mentioning, but the heroes know who they are;
- All in all, without a million-dollar PR machine and without shouting "ML is so cool!", we expected it to spread at least 3-5x more widely than it did. After all, we did a huge amount of work (more publications coming!), comparable to dragging a hippo out of a swamp, and thought it worth sharing where possible. In part this shows that mass adoption of speech tech is still at a fairly low level, and, let's be honest, everyone wants everything ready-made and free, so there is little to be surprised about;

TLDR

Someone forwarded me a post by a Yandex evangelist claiming that speech tech is advancing by leaps and bounds (in reality this is not quite so).
Reading it, you get the typical impression that everything is great, ML "is so cool", and smart robots are just around the corner (we will also address this in an upcoming publication). And this is not some startup in the .ai domain zone shouting on every corner while essentially selling middleware, but a supposedly top Russian-language tech channel on Telegram.

I do not remember whether the post was edited afterwards, but after it was published I messaged the author directly with the following criticism - namely that he was strongly distorting the picture (do not get me wrong, Mozilla's project is VERY cool; they are among the few who publish genuinely sane things in speech):

- A really decent number of hours exists in 3-4 languages at most;
- The Common Voice dataset itself is better suited to speaker identification tasks;
- They train their models on their own dataset plus several semi-public and private datasets, but the data there is quite narrow in domain and there is not much of it. And they train on a single language;
- There is a separate argument about the models themselves (they are not the most optimal), but that is less important;

We talked, and the answer was essentially that yes, the framing really is biased, but he is championing data availability, especially for Russian. Meanwhile, Russian in Common Voice amounts to a measly 67 hours, which, given the specifics of speech and of how the dataset was collected, are of no use for STT at all. Alas.

So a rhetorical question arises: why do "opinion leaders", supposedly tech-savvy, present information in a knowingly distorted form, plainly ignoring obvious facts (or simply not doing their homework)? Draw your own conclusions =)

So what?

I am simply convinced, once again, of the following:

- Evangelists / tech channel hosts / popularizers not only have a weak grasp of their subject - they do not even follow the links (let alone analyze or fact-check). Reading such content simply raises the noise level. It is especially rich when the contradiction sits one click away;

- If someone writes that everything is good, cool, and great, ask two questions: (i) what exactly was left unsaid, and why is the picture so one-sided; (ii) to what end is all this being said, and what is written between the lines;

- If you are young and your mind has not yet fully hardened - take note, draw your own conclusions about how the world works, and think about whether you want this to become the norm;

I reject this reality and substitute my own!
Create, believe in yourself, do not be afraid, filter, and do not multiply garbage!
A Small Note On Keeping Access Tokens For Your Apps In Check

Use case - you want a trusted third-party service to store API tokens / public keys for your services, with the following requirements:

- Manage keys / revoke keys / manage users having access to said keys
- The service should be cheap
- Tag your secrets for easier filtering
- "Turn off" / revoke some secrets only for some keys

I viewed and tried several services, and it seems that (surprisingly for me) AWS Secrets Manager is the best one. It just works. Google's alternatives are a bit clunky / weird / still in beta (also, Google has no proper clients for its secret manager service). Lesser-known providers also have options (but do not expect easy-to-use clients), and there are self-hosted options, which are also a bit difficult to use. I also found services that essentially provide something similar to open cryptography packages (openssh), but in the cloud (why?).

Also ... a killer feature - you can use the amazing boto3 library in python with the majority of Amazon services. Surprise-surprise.

PS
I am NOT endorsing AWS, in general I try avoiding AWS. But s3-compatible storage, boto3 and this service are just too good.
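For reference, a hedged sketch of the retrieval side. The real call is `boto3.client("secretsmanager").get_secret_value(SecretId=...)`; here the client is injected so the snippet runs offline with a stand-in, and the secret name is hypothetical:

```python
import json

class FakeSecretsClient:
    """Offline stand-in mimicking boto3.client("secretsmanager")."""
    def __init__(self, store):
        self._store = store

    def get_secret_value(self, SecretId):
        # Real boto3 returns more metadata; only SecretString matters here
        return {"SecretString": self._store[SecretId]}

def get_secret(client, secret_id):
    """Fetch a secret and parse its JSON payload."""
    resp = client.get_secret_value(SecretId=secret_id)
    return json.loads(resp["SecretString"])

# Demo with the stand-in; "my-app/telegram" is a hypothetical secret name
client = FakeSecretsClient({"my-app/telegram": '{"token": "abc123"}'})
secret = get_secret(client, "my-app/telegram")
```

With real AWS credentials configured, you would pass `boto3.client("secretsmanager")` instead of the stand-in.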

#deployment
Logging With Notifications

Out-of-the-box, with 10+ providers. Integrated with loguru.

I frequently stumble upon people using some provider-specific client libraries / wrappers and / or writing some code to send some messages to themselves about their servers' hardware / neural network training / exceptions / pipelines. I even wrote such code myself. Lacking focus - the code was shitty and abandoned.

Now loguru is integrated with the notifiers library.
Just very helpful - simple and elegant. Send yourself a Telegram message and fall back to Gmail if it does not work.

... grain of salt

requests can use an env variable to route traffic through an HTTP or SOCKS5 proxy. This works well enough with some of notifiers' downstream libraries, provided they use a sufficiently recent version of requests (i.e. the code is fresh). But boto3, for example, does not support SOCKS5 proxies out of the box.
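The actual integration goes through `notifiers.logging.NotificationHandler` added to loguru via `logger.add(...)`; below is a dependency-free stdlib sketch of just the pattern (primary channel with a fallback), with made-up send functions standing in for Telegram and email:

```python
import logging

def notify_with_fallback(send_primary, send_fallback, message):
    """Try the primary channel; on any failure, use the fallback."""
    try:
        send_primary(message)
    except Exception:
        send_fallback(message)

class NotifyHandler(logging.Handler):
    """A logging handler that pushes records through the fallback chain."""
    def __init__(self, send_primary, send_fallback, level=logging.ERROR):
        super().__init__(level)
        self.send_primary = send_primary
        self.send_fallback = send_fallback

    def emit(self, record):
        notify_with_fallback(self.send_primary, self.send_fallback,
                             self.format(record))

# Demo with in-memory "channels": telegram always fails, mail catches the message
sent_mail = []

def telegram(msg):
    raise ConnectionError("proxy down")

def mail(msg):
    sent_mail.append(msg)

log = logging.getLogger("notify-demo")
log.propagate = False  # keep the demo quiet on stderr
log.addHandler(NotifyHandler(telegram, mail))
log.error("GPU box is on fire")
```

The real notifiers handler works the same way conceptually: it is just a `logging.Handler` whose `emit` sends a message through a provider.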


#deployment
A Great Start For Your Custom Python Dockerfiles

I like to popularize really great open-source stuff. I have shared my ML Dockerfiles several times. Now I base my PyTorch workflows on ... surprise-surprise, PyTorch's official images with Apex. But (when I looked) for some reason it was difficult to find the original Dockerfiles themselves, only the images (maybe I did not look hard enough).

But what if you need a simpler / different / lighter python workflow without PyTorch / GPUs? Miniconda is an obvious choice. And now there is a miniconda docker image (pre-built), together with its Dockerfile! What is also remarkable: my own Dockerfile, which I inherited from Fchollet in 2017, starts very similarly to this miniconda Dockerfile.

https://hub.docker.com/r/continuumio/miniconda3/dockerfile

Enjoy. A great way to start your python project and / or journey.
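If you just want to build on the pre-built image, a minimal Dockerfile can look like this (a sketch; the conda packages and the jupyter entrypoint are illustrative, swap in your own):

```dockerfile
# Start from the pre-built miniconda image instead of installing conda yourself
FROM continuumio/miniconda3

# Illustrative dependencies - replace with your own environment
RUN conda install -y numpy pandas jupyter && conda clean -afy

WORKDIR /workspace
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--allow-root", "--no-browser"]
```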

#python
ML Model Quality = Capacity * Throughput * Annotation Quality / Focus


I firmly believe that proper ML models follow this formula and should be thought of as something akin to compression algorithms (more will follow in an article we are releasing shortly). If your model / pipeline compresses 10 TB worth of data into a 50-200 MB model and runs fast on a modest CPU and GPU - that is success.


This also kind of explains why knowledge distillation works. If you look closely - this pattern is everywhere:


- MobileNets trade capacity for speed. My own experiments show that separable convolutions train 3-4x slower, but achieve the same results

- FAIR trained networks with superior performance on huge, weakly-supervised image datasets scraped from the Internet. The models are larger and obviously train longer - but they capture more of the signal in a broader domain

- Distillation papers focus on distilling via dense signals, which helps speed up the process tremendously

- This also kind of applies in NLP. I do not know how to test this with transformers, but there have already been successful attempts by FAIR and Google to distill transformers into smaller / compressed versions of themselves. I have not seen proper reports on how long such smaller versions take to train compared to the large ones, though

- In speech recognition, attention-based alignment w/o CTC loss trains 3-5x slower on the same data and setup, which makes sense, as alignment is the most difficult part of the process. If we had labels for WHEN each letter is pronounced, would the network train another 2-3x faster?

- In speech generation, you can focus your Tacotron model by providing more granular / short audio files with texts, but in the end it just means being able to learn the alignment faster. Ofc you can also force-feed your pre-calculated alignment as teacher forcing of sorts, if you have a pre-trained STT model =)

- Also, almost no annotation is 100% perfect. ML model performance has a strange relationship with annotation quality. Make it too poor and the model will not train / will diverge. Make it too easy / perfect and you will end up with a poor, non-generalizable model and deceive yourself. The best results come when your annotation is both focused on the task you are solving and broad enough to capture the whole domain. A bit controversial.
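As a toy illustration of the "dense signals" point above: a distillation loss hands the student a full (temperature-softened) probability vector per example instead of a single hard label. A self-contained, deliberately simplified sketch (real setups use frameworks, batches, and usually a hard-label term as well):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # dense teacher signal
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [5.0, 1.0, -2.0]

# A student matching the teacher incurs ~zero loss ...
perfect = distillation_loss(teacher, teacher)
# ... while a clueless (uniform) student is penalized
off = distillation_loss([0.0, 0.0, 0.0], teacher)
```

Every output dimension contributes gradient on every example, which is one reading of why distillation converges faster than training on hard labels alone.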

#deep_learning
Last ML digest this year - 19

https://spark-in.me/post/2019_ds_ml_digest_19

A bit half-assed, I was a bit busy with non-ML stuff lately, so idk)

Highlights:

- Fast Sparse ConvNets - if this somehow trickles down, it will be a new standard on par with MobileNets
- What’s Hidden in a Randomly Weighted Neural Network? The lottery ticket hypothesis shown to work on ImageNet
- Reproducibility crisis in ML and science

Also read the above posts about python libraries, they are also cool!

#digest
The State of Speech Encoding Codecs in Python

A VERY brief TLDR w/o looking into how it works inside.

- mp3: works kind of ok, but allegedly its compression makes STT perform worse. I have not checked this myself, but with a sufficiently high bitrate it probably does not matter. Also "murky" legally. A lot of libraries support it via downstream tools, but I have not searched for tools without unnecessary bulk / implemented without sox / ffmpeg - they probably exist

- speex. Obsolete according to its authors, to be replaced by opus

- vorbis and its associated container ogg: very popular; there is a cool python pysoundfile library that works on top of the libsndfile library. An obvious choice, together with wave or scipy.io.wavfile for wav files

- opus codec (also with the ogg container): supposed to be the one-size-fits-all super codec for all needs, encoding both speech and music. Allegedly it even reduces STT recognition errors. At the moment there is no proper native support in python except for the pyogg library, which is a bit unpolished and has no write support (technically there is, but it is not user-friendly)

- Also worth noting: there was a heated debate about including opus support in libsndfile, and it was included (repos even for newer OSs do not ship this support yet), but I have not found a clear instruction on how to turn it into a .so file after it compiles (maybe you know? that would be very cool!). It compiles, but the instructions beyond that stage are not very clear. Probably another 6-12 months will pass before this becomes mainstream!

- There are more convoluted ways, like CFFI wrappers around sox and torchaudio's sox effect chain wrapper, but these introduce overhead that may be undesirable for training NNs or for production

- lpcnet (a vocoder under the hood) as a codec: it boasts speech compression at 1.6 kbit/s. Literally. Very impressive, but there is no mainstream way to just use it in python (yet). And naturally it works only for speech - so no music / noise / etc
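For completeness, the uncontroversial baseline mentioned above - wav via the stdlib `wave` module - in a minimal write/read round-trip sketch (the tone and file name are illustrative):

```python
import math
import os
import struct
import tempfile
import wave

SR = 16000  # 16 kHz - a typical STT sample rate

# One second of a 440 Hz sine wave as 16-bit mono PCM
samples = [int(32767 * 0.5 * math.sin(2 * math.pi * 440 * t / SR))
           for t in range(SR)]
pcm = struct.pack("<%dh" % len(samples), *samples)

path = os.path.join(tempfile.gettempdir(), "tone_demo.wav")

with wave.open(path, "wb") as w:
    w.setnchannels(1)   # mono
    w.setsampwidth(2)   # 16-bit samples
    w.setframerate(SR)
    w.writeframes(pcm)

with wave.open(path, "rb") as r:
    frames = r.readframes(r.getnframes())
```

Anything beyond plain PCM wav (vorbis, opus, mp3) needs one of the third-party routes discussed above.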

#deep_learning
One More Last ML digest this year - 20

Actually
this should have been number 19.
But I wrote it and forgot about it (too much stuff happening).
So whatever, lol, I am an idiot =)

Highlights

- 87.4% top-1 accuracy on ImageNet
- AI circus (cool article, read it!) - a status update for the end of 2019; is the winter coming? =)
- New Open Images, solutions to Open Images competition 2019
- Objects in the wild dataset - alternative to Open Images

https://spark-in.me/post/2019_ds_ml_digest_20

#digest
Using opus Codec with Libsndfile

Well, almost. It writes valid .opus files but refuses to read them, though other programs (including PyOGG) read them without problems.

Building

RUN apt-get update && \
apt-get install cmake autoconf autogen automake build-essential libasound2-dev \
libflac-dev libogg-dev libtool libvorbis-dev libopus-dev pkg-config -y && \
cd /usr/lib && git clone https://github.com/erikd/libsndfile.git && \
cd libsndfile && mkdir -p build && cd build && \
cmake .. -DBUILD_SHARED_LIBS=ON && make && make install && \
cmake --build . && pip install soundfile==0.10.3.post1 && \
apt-get clean && \
ln /usr/lib/libsndfile/build/libsndfile.so.1 /usr/bin/libsndfile.so.1 && \
ldconfig && \
rm -rf /var/lib/apt/lists/*


Writing a file:

import soundfile as sf

# Hack: register the OPUS subtype via soundfile's private _subtypes mapping
# (0x0064 is SF_FORMAT_OPUS in libsndfile)
sf._subtypes['OPUS'] = 0x0064

from scipy.io.wavfile import read

sr, data = read('taco-encoder-id/data/ruslan_16000/0/00/4accb05419b9.wav')
sf.write('test_test.opus', data, sr, format='OGG', subtype='OPUS')


#audio
Someone asked about a list of blogs I read

subscriptions.xml (10.6 KB)