Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
Viewing jupyter notebook diffs properly

Tried this tool - https://www.reviewnb.com/
It promises proper notebook diff rendering, especially when images are added

It renders such diffs properly ... but in THEIR third-party interface, not within GitHub (maybe I was naive to expect otherwise)
Which makes it kind of pointless
Now you do not have to try it
A Minute of Real Life

I have not written posts in Russian for a long time, and in general I try to steer clear of controversial topics, but enough of a reason to write has accumulated.

Backstory

- We built a very diverse 20,000-hour dataset for STT (for comparison, the largest public English datasets add up to around 5,000 hours, and their domains are not very diverse);
- It was not received coldly, exactly - rather with the typical armchair attitude of "the country will not forget you" / "you cannot do this" / "you violated someone's rights" (those who had something substantive to say said it, though there were very few of them) / "give it to us for beads" / insert your own;
- The dataset is publicly available, but under a CC-NC-BY license, and some substantial parts of it are hidden so that comrades with that attitude do not get in the way. Still, for those who want to play with speech there is more than enough data there;
- Individual cases where commercial companies simply use it while ignoring the license are worth mentioning, but the heroes know who they are;
- All in all, without a million-dollar PR machine and without shouting "ML is so cool!", we expected it to spread at least 3-5x better than it did. After all, we did a huge amount of work (more publications to come!), comparable to dragging a hippo out of a swamp, and felt it was worth sharing as much as possible. Partly this shows that mass adoption of speech is still at a rather low level, and, let's be honest, everyone wants everything ready-made and for free, so there is not much to be surprised about either;

TLDR

Someone forwarded me a post by a Yandex evangelist claiming that speech is advancing in huge strides (in reality this is not quite the case).
Reading it, you get the typical impression that everything is great, ML "is so cool", and smart robots are just around the corner (we will also address this in an upcoming publication). And this is not some startup in the .ai domain zone shouting something on every corner while essentially selling middleware, but supposedly a top Russian-language tech channel on Telegram.

I do not remember whether the post was edited afterwards, but after it was published I wrote to the author privately with the following criticism - namely, that he was strongly distorting the information (do not get me wrong, Mozilla's project is VERY cool; they are among the few who publish genuinely sensible things in speech):

- There is a genuinely decent number of hours in 3-4 languages at most;
- The Common Voice dataset itself is better suited to speaker identification tasks;
- They train their models on their own dataset plus several semi-public and private datasets, but the data there is quite narrow in domain and there is not much of it. And they train on a single language;
- There is a separate argument about the models (they are not the most optimal), but that is less important;

We talked, and the answer was, essentially, that the framing is indeed biased, but that he is advocating for data availability, especially for Russian. Meanwhile, Common Voice has a measly 67 hours of Russian, which, given the specifics of speech and of how the dataset was collected, are not fit for STT at all. Alas.

So a rhetorical question arises: why do "opinion leaders", supposedly well versed in technology, present information in a deliberately distorted way, plainly ignoring obvious facts (or simply not doing their homework)? Draw your own conclusions.)

So what?

Simply put, I am once again convinced of the following things:

- Evangelists / tech channel hosts / popularizers not only do not really know their subject, they do not even follow the links (let alone analyze or fact-check anything) - reading such content simply raises the noise level. It is especially amusing when the contradiction is literally one click away;

- If someone writes that everything is good, and cool, and great, ask two questions: (i) what exactly was left unsaid and why is it all so one-sided, (ii) with what purpose was all this said and what is written between the lines;

- If you are young and your mind has not yet fully hardened - take note, draw conclusions about how the world works, and think about whether you want this to become the norm;

I reject this reality and substitute my own!
Create, believe in yourself, do not be scared, filter, and do not multiply the garbage!
A Small Note On Keeping Access Tokens For Your Apps In Check

Use case - you want to use a trusted third-party service to store API tokens / public keys for your services, with the following requirements:

- Manage keys / revoke keys / manage users having access to said keys
- The service should be cheap
- Tag your secrets for easier filtering
- "Turn off" / revoke some secrets only for some keys

I viewed and tried several services, and it seems that (surprisingly for me) AWS Secrets Manager is the best one. It just works. Google's alternatives are a bit clunky / weird / still in beta (also Google has no proper clients for its secret manager service). Lesser-known providers also have options (but do not expect easy-to-use clients), and there are self-hosted options, which are also a bit difficult to use. Also I found out that there are services that essentially provide you something similar to open cryptography packages (OpenSSH) but in the cloud (why?).

Also ... a killer feature - you can use the amazing boto3 library in python with the majority of Amazon services. Surprise-surprise.
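
A minimal sketch of how storing and reading a token looks with boto3 (the secret name, region and token value below are made up for illustration, not from the original post):

import boto3

# assumes AWS credentials are already configured (env vars / IAM role)
client = boto3.client("secretsmanager", region_name="eu-west-1")

# create a tagged secret - tags make filtering and selective revocation easier
client.create_secret(
    Name="my-service/telegram-bot-token",
    SecretString="123456:ABC-DEF",
    Tags=[{"Key": "project", "Value": "stt"}],
)

# any service with the right IAM permissions can then fetch it back
token = client.get_secret_value(SecretId="my-service/telegram-bot-token")["SecretString"]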

PS
I am NOT endorsing AWS; in general I try to avoid AWS. But s3-compatible storage, boto3 and this service are just too good.

#deployment
Logging With Notifications

Out-of-the-box with 10+ providers. Integrated with loguru.

I frequently stumble upon people using some provider-specific client libraries / wrappers and / or writing some code to send some messages to themselves about their servers' hardware / neural network training / exceptions / pipelines. I even wrote such code myself. Lacking focus - the code was shitty and abandoned.

Now loguru is integrated with the notifiers library.
Just very helpful. Very simple and elegant. Just send yourself a Telegram message and fall back to Gmail if it does not work.
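
A minimal sketch of that integration (the telegram token and chat_id are placeholders):

from loguru import logger
from notifiers.logging import NotificationHandler

# notifiers ships a logging handler that loguru can use directly
tg_handler = NotificationHandler(
    "telegram",
    defaults={"token": "123456:ABC-DEF", "chat_id": 123456},
)

# only errors and above get pushed to telegram
logger.add(tg_handler, level="ERROR")
logger.error("Training crashed, loss is NaN")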

... grain of salt

requests can use env variables to use an HTTP or SOCKS5 proxy. It works well enough with some of the notifiers downstream libraries, if they use a sufficiently new version of requests (i.e. the code is fresh). But boto3, for example, does not support SOCKS5 proxies out of the box.
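
For example (assuming a local SOCKS5 proxy on port 1080 - not from the original post), requests picks the proxy up from the standard environment variables; SOCKS support needs `pip install requests[socks]`:

import os
import requests

os.environ["HTTPS_PROXY"] = "socks5://127.0.0.1:1080"  # hypothetical proxy
print(requests.get("https://api.telegram.org").status_code)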


#deployment
A Great Start For Your Custom Python Dockerfiles

I like to popularize really great open-source stuff. I have shared my ML Dockerfiles several times. Now I base my PyTorch workflows on ... surprise-surprise, PyTorch's official images with Apex. But (when I looked) for some reason it was difficult to find the original Dockerfiles themselves; there were only images (maybe I did not look well enough).

But what if you need a simpler / different / lighter python workflow without PyTorch / GPUs? Miniconda is an obvious choice. Yeah, and now there is miniconda as a docker image (pre-built) and with a Dockerfile! What is also remarkable - my dockerfile, which I inherited from Fchollet in 2017, starts very similarly to this miniconda Dockerfile.

https://hub.docker.com/r/continuumio/miniconda3/dockerfile

Enjoy. A great way to start your python project and / or journey.

#python
ML Model Quality = Capacity * Throughput * Annotation Quality / Focus


I firmly believe that proper ML models follow this formula and should be considered similar to compression algorithms (more will follow in an article we are releasing shortly). If your model / pipeline compresses 10 TB worth of data into a 50-200 MB model and works fast on a not very large CPU and GPU - then it is a success.


This also kind of explains why knowledge distillation works. If you look closely - this pattern is everywhere:


- MobileNets trade capacity for speed. My own experiments show that separable convolutions train 3-4x slower, but achieve the same results

- FAIR trained networks with superior performance on weakly-supervised huge image datasets scraped from the Internet. Models are larger, obviously they train longer - but they capture more of the signal in a broader domain

- Distillation papers focus on distilling via dense signals, which helps speed up the process tremendously

- This also kind of applies in NLP. I do not know how to test this with transformers, but there have already been successful attempts by FAIR and Google to distill transformers into smaller / compressed versions of themselves. I have not seen proper reports of how much time it takes to train such smaller versions compared to large ones, though

- In speech recognition, alignment with attention trains 3-5x slower on the same data and setup w/o CTC loss, which makes sense, as this is the most difficult part of the process. If we have labels WHEN each letter is pronounced, would the network train even 2-3x faster?

- In speech generation, you can focus your Tacotron model by providing more granular / short audio files with texts, but in the end it just means being able to learn the alignment faster. Ofc you can also force-feed your pre-calculated alignment as teacher forcing of sorts, if you have a pre-trained STT model =)

- Also, almost no annotation is 100% perfect. ML model performance has a strange relation with annotation quality. Make it too poor - the model will not train / will diverge. Make it too easy / perfect - you will end up with a poor, non-generalizable model and you will deceive yourself. The best result is achieved when your annotation is both focused on the task you are solving and broad enough to capture the whole domain. A bit controversial.

#deep_learning
Last ML digest this year - 19

https://spark-in.me/post/2019_ds_ml_digest_19

A bit half-assed - I was a bit busy with non-ML stuff lately, so idk)

Highlights:

- Fast Sparse ConvNets - if this somehow trickles down, it will be a new standard on par with MobileNets
- What’s Hidden in a Randomly Weighted Neural Network? Lottery ticket hypothesis shown to work in Imagenet
- Reproducibility crisis in ML and science

Also read the above posts about python libraries, they are also cool!

#digest
The State of Speech Encoding Codecs in Python

A VERY brief TLDR w/o looking into how it works inside.

- mp3 works kind of ok, but allegedly its compression makes STT perform worse. I have not checked it myself, but with a sufficiently high bitrate it probably will not matter. Also "murky" legally. A lot of libraries support it via downstream tools, but I have not searched for tools without unnecessary bulk / implemented without sox / ffmpeg - they probably exist

- speex. Obsolete according to its authors, to be replaced by opus

- vorbis and its associated container ogg. Very popular; there is a cool python pysoundfile library that works on top of the libsndfile library. An obvious choice, together with wave or scipy.io.wavfile for wav files (see the sketch after this list)

- opus codec (also with the ogg container). This is supposed to be the one-size-fits-all super codec for all needs that encodes both speech and music. Also, allegedly, it even reduces STT recognition errors. At the moment there is no proper native support in python, except for the pyogg library. It is a bit unpolished and there is no write support (technically there is, but it is not user-friendly).

- Also worth noting that there was a heated debate about including opus support into libsndfile and it was included (repos even for newer OSs do not include this support yet), but I have not found a clear instruction on how to turn it into a .so file after it compiles (maybe you know? would be very cool!). It compiles, but the instructions are not very clear after this stage. Probably another 6-12 months will pass before this becomes mainstream!

- More convoluted ways like CFFI wrappers around sox and torchaudio sox effect chain wrapper. But this introduces overhead that may not be desirable for training NNs or running in production

- lpcnet (it is a vocoder under the hood) as a codec - it boasts speech compression at 1.6 kb/s. Literally. Very impressive, but there is no mainstream way of just using it in python (yet). And naturally it works only for speech! So no music / noise / etc
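
A minimal sketch of the vorbis option mentioned above (the file name and the generated signal are made up) - writing and reading an ogg / vorbis file with pysoundfile:

import numpy as np
import soundfile as sf

sr = 16000
data = np.random.uniform(-0.5, 0.5, sr).astype('float32')  # 1 second of noise

# write an ogg container with the vorbis codec ...
sf.write('example.ogg', data, sr, format='OGG', subtype='VORBIS')

# ... and read it back (returns a float array and the sample rate)
data_back, sr_back = sf.read('example.ogg')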

#deep_learning
One More Last ML digest this year - 20

Actually, this should have been number 19.
But I wrote it and forgot about it (too much stuff happening).
So whatever, lol, I am an idiot =)

Highlights

- 87.4% top-1 accuracy on Imagenet
- AI circus (cool article, read it!) - a status update for the end of 2019; is the winter coming? =)
- New Open Images, solutions to Open Images competition 2019
- Objects in the wild dataset - alternative to Open Images

https://spark-in.me/post/2019_ds_ml_digest_20

#digest
Using opus Codec with Libsndfile

Well, almost. It writes valid .opus files, but refuses to read them, though other programs read them w/o problems (including PyOGG).

Building

RUN apt-get update && \
    apt-get install cmake autoconf autogen automake build-essential libasound2-dev \
    libflac-dev libogg-dev libtool libvorbis-dev libopus-dev pkg-config -y && \
    cd /usr/lib && git clone https://github.com/erikd/libsndfile.git && \
    cd libsndfile && mkdir -p build && cd build && \
    cmake .. -DBUILD_SHARED_LIBS=ON && make && make install && \
    cmake --build . && pip install soundfile==0.10.3.post1 && \
    apt-get clean && \
    ln /usr/lib/libsndfile/build/libsndfile.so.1 /usr/bin/libsndfile.so.1 && \
    ldconfig && \
    rm -rf /var/lib/apt/lists/*


Writing a file:

import soundfile as sf
from scipy.io.wavfile import read

# register the OPUS subtype by hand - this soundfile version does not expose it yet
sf._subtypes['OPUS'] = 0x0064

# read a wav file and re-encode it as ogg / opus
sr, data = read('taco-encoder-id/data/ruslan_16000/0/00/4accb05419b9.wav')
sf.write('test_test.opus', data, sr, format='OGG', subtype='OPUS')


#audio
subscriptions.xml
10.6 KB
Someone asked about a list of blogs I read
Forwarded from Alexander
Some useful devops things

Happy holidays everyone!
I was a bit busy with devops stuff and found some under-the-radar things that you might find useful when deploying ML applications.

- Dockerize - if you have some older application in your stack that writes logs to some random folder, you can use this to easily dockerize the app (explanation)

- Wait for it - if you have a distributed architecture and you cannot design your app to be resilient to restarts / one of services being not accessible from start - you can use this

- Reverse proxy + TLS. Actually I tried using traefik (it is advertised as an easy one-size-fits-all solution) ... but it does not work and glitches. But nginx of course works. I found this gem of a wrapper recently - it allows you to use an nginx reverse proxy and have TLS encryption with Let's Encrypt with just 2 microservices

- Docker Compose file format 2 vs 3. I did not find this in the docs - 3 is not necessarily newer or better, it is just geared towards Docker swarm mode

#devops
New embedded computing platform?

Looks like Intel has a new computing platform for ultra-compact PCs. This may fit some of the semi-embedded ML applications!

Intel NUC 9 Extreme Compute Element
- https://www.theverge.com/2020/1/7/21051879/intel-pc-nuc-9-extreme-ghost-canyon-element-hands-on-teardown-ces-2020
- https://pc-01.tech/razer-tomahawk/

#hardware
Deploying High Load ML Models ... in Real-Time with Batches

I have heard opinions that properly using GPUs in production is difficult because it is hard to handle real-time queues / batching properly. In reality it is a chore, but you can even save money compared with a CPU-only deploy (especially if you deploy a whole workload where some parts are CPU-intensive)!

Using GPUs for your ML models has several advantages:

- 10x faster (sometimes even without batching)
- Usually you can have viable batch sizes of 10 - 100 on one GPU depending on your task

But how can you use GPUs in production?
Usually if you do not have real-time requirements or if your model / workload is large, you can get away without explicit batching.

But what if you need high load and real-time responses at the same time?

The pattern that I arrived at is:

- Use some message broker (Redis, RabbitMQ). I chose RabbitMQ because it has nice tutorials, libraries and a community

- Upon accepting a workload, check it, hash it and store it locally

- Send a message to the broker with a hash / pointer to this stored workload via the Remote Procedure Call pattern (also, if you really have high load, you may need to send these messages asynchronously as well! in this case the aio-pika RPC pattern will help you)

- On the consumer side, accumulate messages to process them in batches and / or process them on timeouts, if batch accumulation takes too much time (see the sketch after this list)

- This has an added benefit of resilience if you write your code properly and acknowledge messages when necessary
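
A minimal sketch (not the production code from the post) of the consumer-side batching idea with pika's BlockingConnection: accumulate messages until either the batch is full or a timeout expires, run one batched call, then acknowledge. The queue name and run_model_on_batch are made up, and the RPC reply via reply_to / correlation_id is omitted for brevity.

import time
import pika

MAX_BATCH = 32
BATCH_TIMEOUT = 0.1  # seconds to wait before processing a partial batch


def run_model_on_batch(payloads):
    # placeholder for the actual batched GPU inference call
    return [len(p) for p in payloads]


connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="inference_queue", durable=True)

batch = []        # list of (delivery_tag, body) tuples
deadline = None   # time when the current partial batch must be flushed

while True:
    method, properties, body = channel.basic_get(queue="inference_queue")
    now = time.monotonic()

    if method is not None:
        if deadline is None:
            deadline = now + BATCH_TIMEOUT
        batch.append((method.delivery_tag, body))

    if batch and (len(batch) >= MAX_BATCH or now >= deadline):
        run_model_on_batch([b for _, b in batch])
        for tag, _ in batch:
            channel.basic_ack(delivery_tag=tag)  # ack only after success
        batch, deadline = [], None
    elif method is None:
        time.sleep(0.005)  # queue is empty - avoid busy-waiting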


Some useful reading:

- RPC pattern in pika (Python RabbitMQ client)
- Real asynchronous proper pika examples 1 / 2
- RPC in aio-pika (asyncio pika)
- What if you want to have RPC / batches / async client in pika at the same time?

Also, docker compose does not yet accept the `gpus` option

So there are workarounds:
- https://github.com/docker/compose/issues/6691
- https://github.com/NVIDIA/nvidia-container-runtime#docker-engine-setup

Please tell me if you would like a more detailed post on this topic.

#devops
#deep_learning
A really cool, down-to-earth developer's blog / email digest


Key differences from your typical coder's blog:

- Emails arrive in the order they were written. It tells a story
- Real examples from real life. Real fails
- No BS and sugar coating
- No 10x coder / code ninja / code guru stuff

https://codewithoutrules.com/softwareclown/

#code
PyTorch 1.4 release

TLDR - production / deploy oriented blocks. Blocks to train huge networks. New cool features - pruning and quantization get traction. AMD support starts getting mentioned.

https://github.com/pytorch/pytorch/releases/tag/v1.4.0

- PyTorch Mobile - Build level customization
- Distributed Model Parallel Training (RPC)
- Java bindings
- End of python 2 support =)
- Pruning out-of-the-box (see the sketch below)
- Learning rate schedulers (torch.optim.lr_scheduler) now support “chaining.”
- Named Tensors (out of beta?)
- AMD Support (!?)
- Quantization (!) - more modules support
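
A minimal sketch of the new pruning API (the layer and the 30% ratio are arbitrary examples, not from the release notes):

import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 64)

# zero out 30% of the weights with the smallest L1 magnitude
prune.l1_unstructured(layer, name="weight", amount=0.3)
print(layer.weight_mask.float().mean())  # roughly 0.7 of the weights remain

# bake the mask into the weight tensor and drop the reparametrization
prune.remove(layer, "weight")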

Still no builds for python 3.8? =)

#deep_learning
Has anyone tried ROCm + PyTorch?
anonymous poll

- What is ROCm? – 46 votes (62%)
- No, I have not tried it – 26 votes (35%)
- Yes, it technically works, but too early stage – 2 votes (3%)
- Yes, it works properly, even for real-life cases – 0 votes (0%)

74 people voted so far.