Spark in me – Telegram

Spark in me

2.18K subscribers

973 photos

56 videos

116 files

2.74K links

Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.

Assembling a NAS for less than US$50

So ... you want a NAS for emergency backups that only you know about.

You have spent money on GPUs, drives and devboxes, and you would like to get your NAS for free.
Ofc, if you are a clever boi, you already have RAID arrays on your devbox, offsite backups, etc.

If you feel particularly S&M, you might even use AWS Glacier or something similar.
Or you may buy a NAS (decent devices start from US$500-1000 w/o drives, a rip-off!).

But all of the above options cost money.
And you cannot easily throw such a backup out of the window, while encryption creates overhead.

So you can create a NAS on the cheap in style:
- Buy any Raspberry Pi (US$5 - US$20, you can find a used one even cheaper);
- Buy a USB HDD enclosure (US$5 - US$40);
- Find some garbage drives for free;
- Copy your files, put the HDD under your pillow;
- Profit;

Added bonuses:
- If you live in a police state, you can use RAID 0 and just hide the second drive => in essence this is like having one-time pad encryption (not real encryption though: each drive still holds readable stripes);
- Easily use RAID 1, or RAID 10 with 4 drives;
- Very high portability if you use 2.5'' drives;
- Mdadm arrays are easily transferable;
- Cyber punk vibe;
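
To make the RAID trade-offs above concrete, here is a minimal Python sketch of usable capacity for the levels mentioned. It assumes equal-size drives and the classic layouts (mdadm itself is more flexible); the function name `usable_capacity_tb` and the 1 TB drive size are made up for illustration.

```python
def usable_capacity_tb(level: int, n_drives: int, drive_tb: float) -> float:
    """Usable space for classic RAID layouts with equal-size drives (simplified)."""
    if level == 0:
        # Striping: all space is usable, but there is no redundancy at all.
        if n_drives < 2:
            raise ValueError("RAID 0 needs at least 2 drives")
        return n_drives * drive_tb
    if level == 1:
        # Mirroring: every drive holds a full copy of the data.
        if n_drives < 2:
            raise ValueError("RAID 1 needs at least 2 drives")
        return drive_tb
    if level == 10:
        # Striped mirrors: half of the drives hold copies.
        if n_drives < 4 or n_drives % 2:
            raise ValueError("this sketch assumes an even number of drives, 4 or more")
        return n_drives / 2 * drive_tb
    raise ValueError(f"unsupported RAID level: {level}")


for level, n in [(0, 2), (1, 2), (10, 4)]:
    usable = usable_capacity_tb(level, n, 1.0)
    print(f"RAID {level} with {n} x 1 TB drives -> {usable} TB usable")
```

With garbage drives the redundancy of RAID 1 / RAID 10 is probably worth the lost capacity.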

#hardware

1.2K views, Alexander, edited 07:10

Also, people recommend Backblaze for ultra cheap "fast" backup storage.
It also has rsync.

1.1K views, Alexander, edited 08:37

https://github.com/pytorch/pytorch/releases/tag/v1.3.0

More experimental features)

Release Mobile Support, Named Tensors, Quantization, Type Promotion and many more · pytorch/pytorch

Table of Contents

Breaking Changes
Highlights

[Experimental]: Mobile Support
[Experimental]: Named Tensor Support
[Experimental]: Quantization support
Type Promotion
Deprecations

New Features

...

1.2K views, Alexander, 17:45

Current state of TF vs PyTorch

This review is nothing really new, but if you are new to the field, here is my TLDR:

- In research PyTorch >> TF, except for obscure cases;
- For small teams PyTorch >> TF;
- For fast product delivery and iteration PyTorch >> TF;
- For corporations TF > PyTorch;
- For edge computing / mobile now TF > PyTorch;
- For production in general, soon PyTorch ~ TF;

- The research community is unlikely to switch from PyTorch to TF 2.0;
- The remaining question now is whether the large corporations / captive audiences will switch from TF 1.0 to TF 2.0 or to PyTorch;

#deep_learning

The State of Machine Learning Frameworks in 2019

Since deep learning regained prominence in 2012, many machine learning frameworks have clamored to become the new favorite among researchers and industry practitioners. From the early academic outputs Caffe and Theano to the massive industry-backed PyTorch…

1.6K views, Alexander, 09:02

Playing with name NER

Premise

So, I needed to pick out street names that are an actual name + surname. Do not ask me why.
Yeah, I know that maybe 70% of streets are named after people, more or less.
So you need 99% precision and at least 30-40% recall.
Or you can imagine a creepy Soviet name like Трактор (Tractor).

So, today making a NER parser is easy: take your favourite framework (plain PyTorch ofc).
Or even use FastText or something even simpler. Add data and boom, you have it.

The pain

But not so fast. Turns out there is a reason why cutting out proper names is a pain.
For Russian there is the natasha library, but since it is built on Yargy, it makes some assumptions about data structure.
I.e. names should be capitalized, come in pairs (name - surname), etc. I did not look at their rules under the hood, but this is how I would write them.

So probably this would be a name -

Иван Иванов

But this probably would not

ванечка иванофф

Is it bad?
Ofc no, it just assumes some stuff that may not hold for your dataset.
And yeah it works for streets just fine.
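
A minimal sketch of such a "capitalized pair" heuristic, plus a precision / recall check on toy data. This is NOT natasha's or Yargy's actual grammar: the regex, the function names and the tiny labelled sample are all made up for illustration.

```python
import re

# Hypothetical rule: exactly two capitalized Cyrillic words separated by a space.
FULL_NAME_RE = re.compile(r"^[А-ЯЁ][а-яё]+ [А-ЯЁ][а-яё]+$")


def looks_like_full_name(text: str) -> bool:
    """True if the text matches the naive 'Name Surname' pattern."""
    return FULL_NAME_RE.match(text.strip()) is not None


def precision_recall(samples):
    """samples is a list of (text, is_name) pairs; returns (precision, recall)."""
    tp = sum(1 for text, is_name in samples if is_name and looks_like_full_name(text))
    fp = sum(1 for text, is_name in samples if not is_name and looks_like_full_name(text))
    fn = sum(1 for text, is_name in samples if is_name and not looks_like_full_name(text))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up toy data, only to show the precision / recall trade-off.
TOY = [
    ("Иван Иванов", True),
    ("ванечка иванофф", True),
    ("иванов иван", True),
    ("проспект мира", False),
    ("улица ленина", False),
]

precision, recall = precision_recall(TOY)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On this toy sample the rule never fires on a non-name, but it misses every name that is not neatly capitalized, which is the fragile, low-recall behaviour described here.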

Also recognizing a proper name without context does not really work. And good luck finding (or generating) corpora for that.

Why deep learning may not work

So I downloaded some free databases with names and surnames (VK.com respects your security lol - the 100M leaked database is available, but useless, too much noise).
Got 700k surnames of different origin, around 100-200k male and female names. Used just random words from CC + wiki + taiga for hard negative mining.
Got 92% accuracy on 4 classes (just word, female name, male name, surname) with some naive models.

... and it works .... kind of. If you give it 10M unique word forms, it can distinguish name-like stuff in 90% of cases.
But for addresses it is useless more or less and heuristics from natasha work much better.

The moral

- A tool that works on one case may be 90% useless on another;
- Heuristics have very high precision, low recall and are fragile;
- Neural networks are superior, but you should match your artificially created dataset to the real data (it may take a month to pull off properly);
- Properly cracking either approach may take time, even though both heuristics and NNs are very fast to prototype. Sometimes 3 plain rules give you 100% precision with 10% recall, and sometimes generating a fake dataset that matches your domain is a no-brainer. It depends.

#data_science
#nlp
#deep_learning

GitHub - natasha/yargy: Rule-based facts extraction for Russian language

Rule-based facts extraction for Russian language. Contribute to natasha/yargy development by creating an account on GitHub.

1.6K views, Alexander, edited 07:23

https://youtu.be/obY4c9aqUqs

LINDEMANN - Ich weiß es nicht (Official AI-Video)

This video has been generated through GANs. Generative Adversarial Networks (GANs) are AI architectures capable of generating imagery. By analysing thousands of pictures GANs learn image features in a similar way to humans, generalizing visual patterns into…

1.4K views, Alexander, 17:25

GANs ⬆️

1.4K views, Alexander, 17:25

Tensorboard logging in PyTorch

Looked at this module some time ago. Looks like it has matured now.
The coolest current feature - param logging.

Just compare these two docs:
- TensorboardX
- torch.utils.tensorboard

Looks like PyTorch just imported the most popular library, copying its docs and APIs.
Nice!
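
A minimal sketch of param logging with `torch.utils.tensorboard`, assuming PyTorch 1.3+ and the tensorboard package are installed. `SummaryWriter.add_hparams` is the real PyTorch call; the helper names `flatten_hparams` and `log_run`, the config keys and the `runs/demo` directory are made up for illustration.

```python
def flatten_hparams(config, prefix=""):
    """Flatten a nested config into {'a/b': value} with TensorBoard-friendly values."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_hparams(value, prefix=f"{name}/"))
        elif isinstance(value, (bool, int, float, str)):
            flat[name] = value
        else:
            # Lists, tuples, etc. are stored as strings in this sketch.
            flat[name] = str(value)
    return flat


def log_run(config, metrics, log_dir="runs/demo"):
    """Write one hyperparameter set plus its final metrics to TensorBoard."""
    # Imported lazily so flatten_hparams can be used without PyTorch installed.
    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter(log_dir=log_dir)
    writer.add_hparams(flatten_hparams(config), metrics)
    writer.close()


# Hypothetical usage:
# log_run({"lr": 1e-3, "model": {"dropout": 0.1}}, {"hparam/val_loss": 0.42})
```

After a few runs, the HPARAMS tab in TensorBoard should show a table of configs against metrics.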

#deep_learning

1.6K views, Alexander, 09:56

2019 DS / ML digest 17

Link

Highlights of the week(s):

- BERT miniaturization?
- PyTorch domination?
- MobileNet from Facebook - FBNet

#digest
#deep_learning

2019 DS/ML digest 17

Author's articles - http://spark-in.me/author/snakers41
Blog - http://spark-in.me

1.4K views, Alexander, 07:57

The current state of "DIY" ML hardware

(i.e. that you can actually assemble and maintain and use in a small team)

Wanted to write a large post, but decided to just write a TLDR.
In case you need a super-computer / cluster / devbox with 4 - 16 GPUs.

The bad
- Nvidia DGX and similar - 3-5x overpriced (sic!)
- Cloud providers (Amazon) - 2-3x overpriced

The ugly
- Supermicro GPU server solutions. This server hardware is a bit overpriced, but its biggest problem is old processor sockets
- Custom shop-built machines (with water cooling) - very nice, but (except for the water cooling) you just pay US$5-15k for work you can do yourself in one day
- Dual-CPU professional-level motherboards - very cool, but powerful Intel Xeons are also very overpriced

The good
- A powerful AMD processor with 12-32 cores + a top-tier motherboard. This will support 4 GPUs at x8 PCIe speed and have a 10 Gb/s Ethernet port
- Just add more servers with a 10 Gb/s connection and probably later connect them into a ring ... cheap / powerful / easy to maintain

More democratization soon?

Probably the following technologies will untie our hands:

- Single-slot GPUs - Zotac clearly thought about it, maybe it will become mainstream in the professional market
- PCIe 4.0 => enough speed for ML even on cheaper motherboards
- New motherboards for AMD processors => maybe more PCIe slots will become normal
- Intel Optane persistent memory => slow and expensive now, but maybe RAM / SSD will merge (imagine having 2 TB of cheap RAM on your box)
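
The PCIe point can be checked with back-of-the-envelope arithmetic. A minimal sketch, assuming the published per-lane rates (8 GT/s for PCIe 3.0, 16 GT/s for PCIe 4.0, both with 128b/130b encoding) and ignoring protocol overhead; the function name is made up for illustration.

```python
# Raw per-lane transfer rate in GT/s; both generations use 128b/130b encoding.
PCIE_GT_PER_S = {3: 8.0, 4: 16.0}


def pcie_bandwidth_gb_s(generation: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s (protocol overhead ignored)."""
    payload_bits_per_s = PCIE_GT_PER_S[generation] * 1e9 * 128 / 130
    return payload_bits_per_s / 8 / 1e9 * lanes


for gen, lanes in [(3, 8), (3, 16), (4, 8)]:
    print(f"PCIe {gen}.0 x{lanes}: {pcie_bandwidth_gb_s(gen, lanes):.2f} GB/s")
```

On paper a PCIe 4.0 x8 slot matches a PCIe 3.0 x16 slot, so a 4-GPU box at x8 would lose nothing compared to x16 on the older standard.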

There is a good chat in ODS on the same topic.

#hardware

ZOTAC’s GeForce RTX 2080 Ti ArcticStorm: A Single-Slot Water Cooled GeForce RTX 2080 Ti

Ultra-high-end graphics cards these days all seem to either come with a very large triple fan cooler, or more exotically, a hybrid cooling system based around a large heatsink with fans and a liquid cooling block. Naturally, these cards use two or more slots…

9.5K views, Alexander, 10:35

Open STT v1.0 release

Finally we released open STT v1.0 =)

Highlights

- 20 000 hours of annotated data
- 2 new large and diverse domains
- 12k speakers (to be released soon)
- Overall quality improvement
- See the posts and releases below for more details

+---------------+------+--------+------+
| Domain        | Utts | Hours  | GB   |
+---------------+------+--------+------+
| Radio         | 8.3M | 11,996 | 1367 |
+---------------+------+--------+------+
| Public Speech | 1.7M | 2,709  | 301  |
+---------------+------+--------+------+
| Youtube       | 2.6M | 2,117  | 346  |
+---------------+------+--------+------+
| Books         | 1.3M | 1,632  | 180  |
+---------------+------+--------+------+
| Calls         | 695K | 819    | 91   |
+---------------+------+--------+------+
| Other         | 1.9M | 835    | 95   |
+---------------+------+--------+------+
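
As a quick sanity check of the "20 000 hours" headline, the table can be summed in a few lines of Python. The numbers are copied from the table; the variable names are made up for illustration.

```python
# (domain, hours, size in GB), copied from the release table
DOMAINS = [
    ("Radio", 11_996, 1367),
    ("Public Speech", 2_709, 301),
    ("Youtube", 2_117, 346),
    ("Books", 1_632, 180),
    ("Calls", 819, 91),
    ("Other", 835, 95),
]

total_hours = sum(hours for _, hours, _ in DOMAINS)
total_gb = sum(gb for _, _, gb in DOMAINS)
print(f"total: {total_hours} hours, {total_gb} GB")

# Share of each domain in the total number of hours, largest first.
for name, hours, _ in sorted(DOMAINS, key=lambda d: d[1], reverse=True):
    print(f"{name:<14}{hours / total_hours:6.1%}")
```

The rows add up to slightly more than 20 000 hours, with Radio alone accounting for roughly 60% of them.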

How can I help?
- Share our dataset
- Share / publish your dataset - the more domains the better
- Upvote on habr
- Upvote on TDS (when released)
- We have an Open Collective page for donations

Links
- Open STT https://github.com/snakers4/open_stt
- Release https://github.com/snakers4/open_stt/releases
- Open TTS https://github.com/snakers4/open_tts
- Habr https://habr.com/ru/post/474462/
- Towards Data Science (coming soon)
- Blog https://spark-in.me/post/open-stt-release-v10
- Open collective https://opencollective.com/open_stt

GitHub - snakers4/open_stt: Open STT

Open STT. Contribute to snakers4/open_stt development by creating an account on GitHub.

9.1K views, Alexander, 12:41

Forwarded from Just links

I reimplemented this code in pure PyTorch, and it reproduces their results. It also gave decent results on ImageNet in only 5 epochs.
https://github.com/Randl/Ranger_Mish_reimplementation

Randl/Ranger_Mish_reimplementation

Contribute to Randl/Ranger_Mish_reimplementation development by creating an account on GitHub.

69 views, Alexander, 12:42

Also a Medium post
Please give us 50 claps if you have an account)

Open STT 1.0 release

Finally we made it!

1.6K views, Alexander, 05:27

https://www.youtube.com/watch?v=QtUwhA5oVNE

Yuri Baburov: "A talk about our open Russian speech corpus" 2019-10-31

A talk about our open Russian speech corpus for recognition and synthesis. A 10-month path to success. Meetup at CFT.

1.8K views, Alexander, 08:15

Internship in speech processing

We are looking for enthusiastic people, most likely 2nd or 3rd year students, who would like to grow in speech processing and ML in general.

We can start working as early as yesterday, there are no restrictions at all.
We plan to meet in person 1-2 times a week.

You are not really expected to already have specific skills; rather, we hope to find people:

- Who know English (reading papers, writing papers, handling correspondence and logs; speaking is not required)
- Who are smart, driven and motivated by ideas
- With a minimal math background
- We will teach you everything you need. Or you will teach us something

A plus:

- Python + PyTorch
- Any other DL frameworks are good to know, but we will not use them
- You have skimmed the seminal papers in some field (CV, NLP, ASR) and have your own opinion (other than "stack transformers")
- You have built any project in any field where it is clear that you were the one carrying it
- You want to learn to solve, or already can solve, real problems
- You have done or want to do something meaningful in ML
- You know your way around the Linux ecosystem and are not afraid of the console

What we do not need

- Doing things just for the sake of doing them
- "Working at our company is a great honor" (tm)
- Whiteboard coding, inverting trees, multiplying large numbers in your head, insert anything similar

Why you would want this

- If you have ideas in this field, we can give you a platform to implement them properly
- When we get +1 full-time position, guess who will be on the short list
- We actually push ML forward / solve applied problems, rather than just pushing paper / carving up budgets / farming hype
- Publications, real problems, very fast experience gain
- We are ready to throw some cash at the brightest candidates

Contacts

- Send your achievements in any format; the only request is to be concise
- Write to me directly on Telegram - @snakers41

Links to our work and publications

- https://github.com/snakers4/open_stt
- https://medium.com/@aveysov
- https://spark-in.me/

3.6K views, Alexander, edited 14:53

* correction: most likely 3rd or 4th year students

1.6K views, Alexander, 14:44

Easiest solutions to manage configs for ML models

When you have a lot of experiments, you need to minimize your code bulk and manage model configs concisely.
(This can also kind of be done via CLI parameters, but usually the two complement each other)

I know 3 ways:

(0) dicts + kwargs + dotdicts
(1) [attrs](https://github.com/python-attrs/attrs)
(2) the new Python 3.7 [dataclasses](https://docs.python.org/3/library/dataclasses.html) (very similar to attrs)

Which one do you use?
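
For illustration, a minimal sketch of option (2) with the standard `dataclasses` module. The `ModelConfig` fields, their defaults and the `build_model` stub are hypothetical placeholders, not a real training setup.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ModelConfig:
    # Hypothetical hyperparameters; frozen=True prevents accidental mutation.
    hidden_size: int = 256
    num_layers: int = 2
    dropout: float = 0.1
    lr: float = 1e-3


def build_model(hidden_size, num_layers, dropout, **_unused):
    """Stand-in for a real model constructor; it just describes the model."""
    return f"model(h={hidden_size}, layers={num_layers}, p={dropout})"


base = ModelConfig()
# Each experiment is a small, explicit diff against the base config.
big = replace(base, hidden_size=512, num_layers=4)

# asdict() bridges back to the plain dicts + kwargs style of option (0).
description = build_model(**asdict(big))
print(description)
```

The same pattern works with attrs almost line for line (`attr.s`, `attr.evolve`, `attr.asdict`).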

#data_science

GitHub - python-attrs/attrs: Python Classes Without Boilerplate

Python Classes Without Boilerplate. Contribute to python-attrs/attrs development by creating an account on GitHub.

1.6K views, Alexander, edited 19:06

2019 DS / ML digest 18
Link

Highlights of the week(s):

- Speech Vocoder w/o GPU on inference
- Publish ML web-apps w/o web frameworks via streamlit in pure python
- Unsupervised pre-training on non-curated image datasets
- PyTorch's popularity in research
- Why ML in medicine does not work and how to solve it via ontologies

#digest
#deep_learning

2019 DS/ML digest 18

Author's articles - http://spark-in.me/author/snakers41
Blog - http://spark-in.me

1.1K views, Alexander, 15:32

Streamlit vs. Voila vs. Panel vs. Dash vs. Bokeh server

TLDR - make scientific web-apps in Python only, w/o any web programming (e.g. Django, Tornado).

Dash
- Mostly for BI
- Also a paid product
- Looks like the new Tableau
- Serving and out-of-the-box scaling options

Bokeh server
- Mostly plotting (very flexible, unlimited capabilities)
- High entry cost, though bokeh is kind of easy to use
- Also should scale well

Panel
- A bokeh server wrapper with a lot of capabilities for geo + templates

Streamlit
- The nicest looking app for interactive ML apps (maybe even annotation)
- Has pre-built styles and grid
- Limited only to its pre-built widgets
- Built on Tornado with a very specific data model incompatible with the majority of available widgets
- Supposed to scale well, since it is built on top of Tornado

Voila
- If it runs in a notebook, it will run in Voila
- Just turns a notebook into a server
- The app with the most promise for DS / ML
- Scales kind of meh - you need to run a Jupyter kernel for each user, and it also takes some time to spin up a kernel
- Fully benefits from the rich ecosystem of Jupyter / Python / widgets
- In theory has a customizable grid and CSS, but does not come pre-built with this => higher barrier to entry

Also, most of these apps have no built-in authentication.

More details:

- A nice summary here;
- A very detailed summary of the pros and cons of Streamlit + Voila. Also a very in-depth discussion;
- Also, the awesome streamlit boilerplate is awesome;

#data_science

Jupyter Dashboarding — some thoughts on Voila, Panel and Dash

There are three main players in the Python dashboarding space, let’s discuss.

10.3K views, Alexander, 15:58

Amazing hardware YouTube channel (RU)

Link.

Smart, in-depth, highly analytical, no bs / ads / cringe / sensationalism. Not your typical Russian channel, not Linus Tech Tips or similar.

Example videos:

- What hardware companies could do
- Choosing a PSU

#hardware

1.4K views, Alexander, 03:01

How are 2080 Ti GPUs different? (RU)

A bit too late =)

https://www.youtube.com/watch?v=YcvM2DhIYdc

#hardware

Nvidia Turing | Architecture features, practical use of RT cores

We look at what is new in Turing, and at how RT cores and tensor-core denoising for ray tracing work in practice.

https://vk.com/pc_0_1 - the "Этот компьютер" ("This Computer") group - fresh and relevant news from the IT world

1.2K views, Alexander, 08:29