Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
Some useful devops things

Happy holidays everyone!
Was a bit busy with devops stuff and found some under-the-radar things that you might find useful when deploying ML applications.

- Dockerize - if you have an older application in your stack that writes logs to some random folder, you can use this to dockerize the app easily (explanation)

- Wait-for-it - if you have a distributed architecture and cannot design your app to be resilient to restarts or to one of the services being unreachable at start, you can use this

- Reverse proxy + TLS. I actually tried traefik (it is advertised as an easy one-size-fits-all solution) ... but it glitches and did not work for me. nginx, of course, works. I recently found this gem of a wrapper - it lets you run an nginx reverse proxy with TLS encryption via Let's Encrypt using just two micro-services

- Docker compose 2 vs 3. I did not find this in the docs - version 3 is not necessarily newer or better; it is just geared towards Docker's swarm mode

#devops
New embedded computing platform?

Looks like Intel has a new computing platform for ultra-compact PCs. It may fit some semi-embedded ML applications!

Intel NUC 9 Extreme Compute Element
- https://www.theverge.com/2020/1/7/21051879/intel-pc-nuc-9-extreme-ghost-canyon-element-hands-on-teardown-ces-2020
- https://pc-01.tech/razer-tomahawk/

#hardware
Deploying High Load ML Models ... in Real-Time with Batches

I have heard opinions that properly using GPUs in production is difficult because handling real-time queues / batching correctly is hard. In reality it is a chore, but you can even save money compared with a CPU-only deploy (especially if you deploy a whole workload where some parts are CPU-intensive)!

Using GPUs for your ML models has several advantages:

- 10x faster (sometimes even without batching)
- Usually you can have viable batch sizes of 10 - 100 on one GPU depending on your task

But how can you use GPUs in production?
Usually if you do not have real-time requirements or if your model / workload is large, you can get away without explicit batching.

But what if you need high load and real-time responses at the same time?

The pattern that I arrived at is:

- Use some message broker (Redis, RabbitMQ). I chose RabbitMQ because it has nice tutorials, libraries and community

- Upon accepting a workload, check it, hash it and store it locally

- Send a message to the broker with the hash / a pointer to the stored workload via the Remote Procedure Call (RPC) pattern (if you really have high load, you may need to send these messages asynchronously as well! In this case the aio-pika RPC pattern will help you)

- On the consumer side, accumulate messages into batches, and / or process them on a timeout if batch accumulation takes too long

- This has an added benefit of resilience if you write your code properly and acknowledge messages when necessary


Some useful reading:

- The RPC pattern in pika (the Python RabbitMQ client)
- Proper, truly asynchronous pika examples 1 / 2
- RPC in aio-pika (asyncio pika)
- What if you want to have RPC / batches / async client in pika at the same time?

Also, docker compose does not yet accept the `gpus` option

So there are workarounds:
- https://github.com/docker/compose/issues/6691
- https://github.com/NVIDIA/nvidia-container-runtime#docker-engine-setup

Please tell me if you would like a more detailed post on this topic.

#devops
#deep_learning
A really cool, down-to-earth developer's blog / email digest


Key differences from your typical coder's blog:

- Emails arrive in the order they were written - it tells a story
- Real examples from real life. Real fails
- No BS and sugar coating
- No 10x coder / code ninja / code guru stuff

https://codewithoutrules.com/softwareclown/

#code
PyTorch 1.4 release

TLDR - production / deploy oriented blocks and blocks to train huge networks. New cool features - pruning and quantization gain traction, and AMD support starts getting mentioned.

https://github.com/pytorch/pytorch/releases/tag/v1.4.0

- PyTorch Mobile - Build level customization
- Distributed Model Parallel Training (RPC)
- Java bindings
- End of python 2 support =)
- Pruning out-of-the box
- Learning rate schedulers (torch.optim.lr_scheduler) now support “chaining.”
- Named Tensors (out of beta?)
- AMD Support (!?)
- Quantization (!) - more modules support

Still no builds for python 3.8? =)

#deep_learning
Has anyone tried ROCm + PyTorch?
anonymous poll

What is ROCm? – 46
👍👍👍👍👍👍👍 62%

No, I have not tried it – 26
👍👍👍👍 35%

Yes, it technically works, but too early stage – 2
▫️ 3%

Yes, it works properly, even for real-life cases – 0
▫️ 0%

👥 74 people voted so far.
First 2020 ML / DS / Coding Digest


Highlights

- PyTorch 1.4 - focus on production / deploy / optimization - cool!
- Order of magnitude more efficient transformer from Google?
- Proper English ASR system comparison
- Pandas 1.0

Please like / share / repost!

https://spark-in.me/post/2020_ds_ml_digest_01

#digest
Decided to update my OpenVPN installation, since I already gave my VPN to several people

Tried pritunl - it really works out of the box; a full installation would probably take 10-15 minutes

Another valid alternative is a dockerized OpenVPN

Some time ago I wrote a plain, down-to-earth guide for Windows users on how to rent a server, create a key, etc. If you would like the same for this VPN - ping me
About AMD support ...
Forwarded from Egor
There are no prebuilt binaries, and the latest Navi series of cards is not supported. Who the hell needs that?
Soo cool!

UMAP has a dedicated built-in plotting tool!)))
Forwarded from Sava Kalbachou
The State of Native Quantization in PyTorch

Yeah, right. PyTorch 1.3 and / or 1.4 boasted native qint8 quantization support.
So cool, right?

They have 2 main tutorials:

- Memory intensive networks with linear layers (BERT) => dynamic quantization
- Convolutional networks => static quantization
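
For the first path, the API itself is a one-liner. A minimal sketch (the toy linear stack here is a made-up stand-in for BERT, purely for illustration):

```python
import torch
import torch.nn as nn

# Toy stand-in for a memory-bound, linear-heavy model (think BERT).
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))
model.eval()

# Dynamic quantization: weights are converted to qint8 ahead of time,
# activations are quantized on the fly at inference time.
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = qmodel(torch.randn(4, 32))

print(out.shape)  # torch.Size([4, 8])
```

The static path for conv nets is more involved (observers, a calibration pass, QuantStub / DeQuantStub), hence the separate tutorial.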

I have not tried vanilla BERT and / or vanilla MobileNet (please tell me if you have!), but it looks like:

- 1D convolutions are not supported yet
- Native nn.transformers dynamic quantization ... does not work

It is kind of meh, because I used their native layers (instead of Hugging Face, for example) ... to avoid this exact kind of issue! =)

Anyway, tell me if quantization worked for you in PyTorch, and meanwhile you can upvote these feature requests / bug reports if you feel like these features are useful!

- Fix nn.transformer quantization
- 1D conv support

Ofc I could also spend a couple of months fiddling with quantization myself, but it looks like this year will be more production-driven, so why do the same job they are now clearly focused on doing?)

#deep_learning
Using Voila With the Power of Vue.js

Remember the last post about python dashboard / demo solutions?

Since then we have tried Voila and Streamlit.
Streamlit is very cool, but you cannot build really custom things with it: every widget you need must be supported out of the box, and there are always issues with caching.
It is also painful to change the default appearance.

Remember that Voila "in theory has a customizable grid and CSS"?
Of course someone took care of that!

Enter ipyvuetify + voila-vuetify.
TLDR - this allows you to build Voila demos with the Vue UI library, all in 100% Python.

Problems:

- No native table widget / method to show and UPDATE pandas tables. There are solutions that load Vue UI tables, but no updates out-of-the-box
- All plotting libraries will work mostly fine
- All Jupyter widgets will work fine. But when you need a custom widget, you will have to either code it yourself or find some hack with js-links or manual HTML manipulation
- Takes some time to load, will NOT scale to hundreds / thousands of concurrent users
- ipyvuetify is poorly documented, and its examples are not very intuitive

Links:

- https://github.com/voila-dashboards/voila-vuetify
- https://github.com/mariobuikhuizen/ipyvuetify

#data_science
2020 DS / ML Digest 2

Highlights

- New STT benchmarks from FAIR
- Analysis of GPT-2 by thegradient
- Google’s Meena, a 2.6 billion parameter end-to-end trained neural conversational model (not AGI ofc)
- OpenAI now uses PyTorch
- LaserTag - cool idea on how to handle simpler s2s tasks, i.e. error correction

Please like / share / repost!

https://spark-in.me/post/2020_ds_ml_digest_02

#digest
Setting up Wi-Fi on a Headless Server

Yeah, that's a pain in the ass!
Chicken-and-egg problem - you need to install packages to set up Wi-Fi, but you need Wi-Fi to install packages.
So first you have to install the packages ... by copying them over on a USB stick.
Remember CD-ROM sneakernet? =)
Also making Wi-Fi robust to reboots is a pain.

These guides worked for me on Ubuntu 18.04.3 server:

- https://www.linuxbabe.com/command-line/ubuntu-server-16-04-wifi-wpa-supplicant
- rc.local worked instead of systemd or crontab for me https://gist.github.com/mohamadaliakbari/1cb9400984094541581fff07143e1c9d
- it is better to use your router to pin down a static IP

#linux
Some Proxy Related Tips and Hacks ... Quick Ez and in Style =)

DO is not cool anymore

First of all, let's get the elephant out of the room. Remember I recommended Digital Ocean?
Looks like they behave like a f**g corporation now - they require a selfie with your passport.
F**k this. Even AWS does not need this.


Why would you need proxies?

Scraping mostly. Circumventing anal restrictions.
Sometimes there are other legit use cases like proxying your tg api requests.


Which framework to use?

None.
One of our team tried scrapy, but there is too much hassle (imho) without any real benefits.
(apart from their corporate platform crawlera, but I still do not understand why it exists - enterprise, maybe)
Just use aiohttp, asyncio, bs4, requests, threading and multiprocessing.
And just write mostly good enough code.
If you do not need to scrape 100m pages per day or use selenium to scrape JS-heavy pages, this is more than enough.
Really. Do not buy-in into this cargo cult stuff.
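
To make "good enough code" concrete - a sketch of the whole approach using only the standard library. `fetch` and `parse` are your own functions (the names are mine, for illustration); in practice fetch with requests and parse with bs4:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_all(urls, fetch, parse, max_workers=16):
    """Fetch and parse a list of URLs with a plain thread pool.

    `fetch` and `parse` are plugged in by the caller - e.g. fetch with
    requests, parse with bs4 - no framework required.
    """
    def worker(url):
        try:
            return url, parse(fetch(url))
        except Exception as exc:  # log and move on instead of killing the pool
            return url, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(worker, urls))
```

With requests it would be called as something like `scrape_all(urls, fetch=lambda u: requests.get(u).text, parse=my_bs4_parser)`; swap the thread pool for asyncio + aiohttp only when throughput actually demands it.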


Video

For video-content there are libraries:

- youtube-dl - lots of features, a horrible Python API, a nice CLI; it really works
- pytube - was really cool and Pythonic, but its author abandoned it. Most likely he just wrote a ton of regexes that he decided not to support. Some methods still work, though

Also remember that many HTTP libraries have HTTP / SOCKS5 proxy support.
In older libraries this may be supported via environment variables.
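
For the environment-variable route specifically, the convention is `http_proxy` / `https_proxy` (or `ALL_PROXY`); Python's urllib - and libraries built on the same convention - read them at request time. A quick sketch (the address is a placeholder from the documentation IP range):

```python
import os
import urllib.request

# De-facto convention that many older libraries honour:
os.environ["http_proxy"] = "socks5://user:password@203.0.113.1:1080"
os.environ["https_proxy"] = "socks5://user:password@203.0.113.1:1080"

# urllib reads these at call time:
proxies = urllib.request.getproxies()
print(proxies["http"])  # socks5://user:password@203.0.113.1:1080
```

Note that for requests, SOCKS proxy URLs additionally need the PySocks extra (`pip install requests[socks]`), if I remember correctly.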


Where to get proxies?

The most interesting part.
There are "dedicated" services (e.g. smartproxy / luminati.io / proxymesh / bestproxy.ru / mobile proxy services).
And they probably are the only option when you scrape Amazon.
But if you need high bandwidth and many requests, such proxies usually have garbage speed
(and, morally speaking, probably 50% of them are hacked routers)
Ofc there is scrapoxy.io/ - but this is just too much!


Enter Vultr

They have always been a DO look-alike service.

I found a simple hacky way to get 10-20-40 proxies quickly.
You can just use Vultr + their Ubuntu 18.04 Docker image + a plain startup script.
That is it. Literally.

With Docker already installed, your script may look something like this:

docker run -d --name socks5_1 -p 1080:1080 -e PROXY_USER=user -e PROXY_PASSWORD=password serjs/go-socks5-proxy && \
docker run -d --name socks5_2 -p 1081:1080 -e PROXY_USER=user -e PROXY_PASSWORD=password serjs/go-socks5-proxy

There are cheaper hosting alternatives - Vultr is quite expensive.
But the startup-script feature + Docker images really save time.
They also now have the nice features - backups / snapshots / startup scripts / easy scaling / etc - without the corporate bs.
Also use my Give $100, Get $25 link!

Beware - new accounts may be limited to 5 servers per account.
You may want to create several accounts at first.

#data_science
#scraping