Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
In a new competition on Kaggle I found a great "manual" on how to work with bson (a Mongo database dump).

Highly recommended reading
- https://www.kaggle.com/humananalog/keras-generator-for-reading-directly-from-bson/notebook
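
The core trick there, if you do not feel like reading the whole notebook, is pymongo's streaming bson decoder; a minimal sketch, assuming the pymongo package and a train.bson dump:

import bson  # the bson module shipped with pymongo

# stream documents one by one instead of loading the whole dump into RAM
with open('train.bson', 'rb') as f:
    for doc in bson.decode_file_iter(f):
        pass  # each doc is a plain python dict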

#data_science
#python
A couple of excellent threads on how to make your Python generator thread-safe, i.e. how to use the workers > 1 parameter of fit_generator in Keras with minimal effort. Useful if your model is heavily CPU-bound. The gist is sketched below, after the links.

- https://github.com/fchollet/keras/issues/1638
- https://stackoverflow.com/questions/41194726/python-generator-thread-safety-using-keras
- http://anandology.com/blog/using-iterators-and-generators/
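
The gist of those threads boils down to roughly this wrapper (a sketch with my own naming, not verbatim from the links):

import threading

class ThreadSafeIter:
    # serializes next() calls on a wrapped iterator with a lock
    def __init__(self, it):
        self.it = it
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self.lock:
            return next(self.it)

def thread_safe_generator(gen_func):
    # decorator: makes any generator function safe for workers > 1
    def wrapper(*args, **kwargs):
        return ThreadSafeIter(gen_func(*args, **kwargs))
    return wrapper

@thread_safe_generator
def batch_generator():
    while True:
        yield 'batch'  # your real batches go here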

#data_science
#python
A great copy-pasta for checking file hashes.

# make sure you downloaded the files correctly
import hashlib
import os.path as path

def sha256(fname):
    hash_sha256 = hashlib.sha256()
    with open(fname, 'rb') as f:
        # read in 4 KB chunks so large files do not blow up memory
        for chunk in iter(lambda: f.read(4096), b''):
            hash_sha256.update(chunk)
    return hash_sha256.hexdigest()

filenames = ['', '', '', '', '']

hashes = ['', '', '', '', '']

data_root = path.join('data/')  # make sure you set up this path correctly

# this may take a few minutes
for filename, hash_ in zip(filenames, hashes):
    computed_hash = sha256(path.join(data_root, filename))
    if computed_hash == hash_:
        print('{}: OK'.format(filename))
    else:
        print('{}: fail'.format(filename))
        print('expected: {}'.format(hash_))
        print('computed: {}'.format(computed_hash))

#python
#data_science
I faced the question of extending a PyTorch class I liked. If everything were trivial, I would just write a function and call it, passing it the class instance, but there is one problem: some utilities in the class call local utilities, and it is not quite clear how to modify those when importing.

Inspired by the bson iterator example (posted above - https://goo.gl/xvZErF), it turns out extending classes is done quite simply (a toy sketch follows the links):
- One: https://goo.gl/JZpfiV
- Two: https://goo.gl/D3KkLm
- And some old-school craziness for those interested in Python internals:
-- https://www.artima.com/weblogs/viewpost.jsp?thread=237121
-- https://www.artima.com/weblogs/viewpost.jsp?thread=236278
-- http://www.artima.com/weblogs/viewpost.jsp?thread=236275
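
In short, the pattern is plain subclassing: override the local utility in a child class, and the inherited methods pick the new version up automatically. A toy sketch (names are mine, not from the links):

class Original:
    def run(self, x):
        return self._helper(x)

    def _helper(self, x):
        return x

class Patched(Original):
    # only the internal utility is overridden;
    # the inherited run() now uses this version
    def _helper(self, x):
        return x * 2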

#python
#data_science
From the perversions department: how do you load a k-means object from Python 2 into Python 3, with an sklearn version bump on top?

The obvious solution does not work because of the sklearn version change
- https://goo.gl/s8V5zf

But this does work:
# saving - python2
import numpy as np
np.savetxt('centroids.txt', centroids, delimiter=',')

# loading - python3
from sklearn.cluster import KMeans
import numpy as np

centroids = np.loadtxt('centroids.txt', delimiter=',')
# n_clusters must match the number of saved centroids, n_init=1
# since the init is deterministic; sklearn still requires one fit()
# call (starting from the saved centroids) before predict() works
kmeans = KMeans(n_clusters=centroids.shape[0], init=centroids, n_init=1)
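
If you only need predictions from the migrated centroids and do not want to refit anything, nearest-centroid assignment in pure numpy sidesteps the sklearn API entirely (a sketch; X as your data matrix is an assumption):

# squared distance from every row of X to every centroid, then argmin
labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1), axis=1)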

#python
A magnificent Python lib for working with video
- https://github.com/Zulko/moviepy

It is built on top of imageio and essentially lets you work with video in one line (instead of plain frame iteration or manual ffmpeg usage). How nice that Python has tools like this!
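
A taste of the API (a sketch; file names are placeholders):

from moviepy.editor import VideoFileClip

clip = VideoFileClip('input.mp4')
# cut the first 5 seconds and write them out - one line of actual work
clip.subclip(0, 5).write_videofile('output.mp4')

# or just iterate over frames as numpy arrays
for frame in clip.iter_frames():
    pass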

#python
#video
At my new job I saw people training their models on Python 2 (WHAT?), on TensorFlow (WTF???) and loading data in a single thread (it is 2017 out there!).

So I made this slightly trollish presentation for my colleagues. Maybe you will like it too
- https://goo.gl/ne9RH4

Everything simple is very simple, you just have to know where to look :)

#data_science
#deep_learning
#python
Just found a book on practical Python programming patterns
- http://python-3-patterns-idioms-test.readthedocs.io/en/latest/PythonForProgrammers.html

Looks good

#python
Useful Python abstractions / sugar / patterns

I already shared a book about patterns, which contains mostly high-level / more complicated patterns. But for writing ML code a simple imperative / functional programming style is sometimes OK.

So - I will be posting about simple and really powerful python tips I am learning now.

This time I found out about map and filter, which are super useful for data preprocessing:

Map:

items = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x**2, items))

Filter:

number_list = range(-5, 5)
less_than_zero = list(filter(lambda x: x < 0, number_list))
print(less_than_zero)  # [-5, -4, -3, -2, -1]
Also found this book - http://book.pythontips.com/en/latest/map_filter.html

#python
#data_science
A decent explanation about decorators in Python

http://book.pythontips.com/en/latest/decorators.html

#python
Useful Python / PyTorch bits

dot.notation access to dictionary attributes

class dotdict(dict):
    # attribute-style access; note that dict.get returns None for
    # missing keys instead of raising AttributeError
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__
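
Usage (values are made up):

config = dotdict({'lr': 1e-3, 'batch_size': 32})
print(config.lr)        # 0.001, instead of config['lr']
config.num_epochs = 10  # attribute-style assignment also works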

PyTorch embedding layer - ignore padding

nn.Embedding has a padding_idx argument, so that the padding token's embedding is not updated during training.
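
For example (sizes are arbitrary):

import torch
import torch.nn as nn

# index 0 is the padding token: its row stays zero and gets no gradient
emb = nn.Embedding(num_embeddings=1000, embedding_dim=300, padding_idx=0)
out = emb(torch.tensor([[5, 2, 0, 0]]))  # trailing zeros are padding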

#python
#pytorch
Monkey patching a PyTorch model

Well, ideally you should not do this.
But sometimes you just need to quickly test something and amend your model on the fly.

This helps:


import torch
import functools

def rsetattr(obj, attr, val):
    # setattr that understands dotted paths like 'encoder.block.0'
    pre, _, post = attr.rpartition('.')
    return setattr(rgetattr(obj, pre) if pre else obj, post, val)

def rgetattr(obj, attr, *args):
    # getattr that understands dotted paths
    def _getattr(obj, attr):
        return getattr(obj, attr, *args)
    return functools.reduce(_getattr, [obj] + attr.split('.'))

# list() because we mutate the model while iterating over its modules
for old_module_path, old_module_object in list(model.named_modules()):
    # replace an old object with the new one,
    # copying some settings and its state
    if isinstance(old_module_object, torch.nn.SomeClass):
        new_module = SomeOtherClass(old_module_object.some_settings,
                                    old_module_object.some_other_settings)
        new_module.load_state_dict(old_module_object.state_dict())
        rsetattr(model, old_module_path, new_module)


The above code essentially does the same as:

model.path.to.some.block = some_other_block

#python
#pytorch
#deep_learning
#oop
A Great Start For Your Custom Python Dockerfiles

I like to popularize really great open-source stuff. I have shared my ML Dockerfiles several times. Now I base my PyTorch workflows on ... surprise-surprise, PyTorch's official images with Apex. But (when I looked) for some reason it was difficult to find the original dockerfiles themselves; there were only images (maybe I did not look well enough).

But what if you need a simpler / different / lighter Python workflow without PyTorch / GPUs? Miniconda is an obvious choice. Yeah, and now there is miniconda as a pre-built docker image, with the dockerfile available! What is also remarkable: my own dockerfile, which I inherited from fchollet in 2017, starts very similarly to this miniconda dockerfile.

https://hub.docker.com/r/continuumio/miniconda3/dockerfile
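
A minimal starting point on top of that image might look like this (environment.yml as your own env spec is a hypothetical):

FROM continuumio/miniconda3
# pin your own environment on top of the base image
COPY environment.yml /tmp/environment.yml
RUN conda env update -n base -f /tmp/environment.yml && \
    conda clean -afy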

Enjoy. A great way to start your python project and / or journey.

#python
Pandas ... Parallel Wrappers Strike Back

Someone made this into a library with 800+ stars - https://github.com/nalepae/pandarallel - which is cool AF, provided it keeps being maintained!
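
Usage is a near drop-in replacement for apply (a sketch; df and process_row are hypothetical):

from pandarallel import pandarallel

pandarallel.initialize()  # spawns the worker processes

# parallel_apply instead of apply - same semantics, all cores
df['result'] = df.parallel_apply(process_row, axis=1)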

I wrote similar wrappers 2 years ago and thought no one cared, because when I shared them, no one paid any attention. But in our day-to-day work they are still a workhorse. Simple, naïve, concise, yet efficient.

I believe this library only has 2 major drawbacks:

- Spawning processes takes a second or so, but this is just python
- It does not support shared memory (?), the only thing that I arguably lack in such a tool

12-, 24-, or even 64-core AMD processors are cheap nowadays, you know.

#python
The Uncharted Waters of Having an Encrypted gRPC API Endpoint in Python without Hassle

Typically, you do not have to think twice about SSL in your standard HTTP web apps. Usually it is handled by your reverse proxy out of the box, e.g. caddy / traefik / nginx-proxy, you name it. In some cases you just use certbot and that is it.

The end user (typically a person) does not care about storing and obtaining the certificates (API users also do not really care about the encryption of their data, for some reason preferring to rely on API admins). Moreover, a plethora of tools packaged within reverse proxies makes this just a matter of plumbing and trial and error.

But what if you use a gRPC endpoint? The typical answer is that it will most likely run behind some corporate firewall and will not be exposed to the Internet. But what if it will?

The official docs are not very clear on this topic for this very reason, and there are a few comprehensive (yet kind of old) guides - python, go.

But wait, 95% of these guides just follow the happy path from the docs, i.e. manually creating and managing all of the certificates (imagine the nightmare), or describe in detail how to automate LE certificates in Go.

But they fail on several fronts:

- They fail to mention that you should add some form of token auth even before starting your gRPC session (and the official python example is way too complicated);

- They usually imply that you have to manually (or automatically) create client certificates and distribute them, but usually do not explain what happens if the client leaves this field blank. Turns out each release of gRPC ships with a list of most prominent CAs, which are loaded by default;

- They also often assume a fully automated scenario, where you have enough infrastructure to actually invest in integrating dedicated clients like acme-tiny in your app (a full list can be found here);

The above Go guide also says this (minica):

It is important to emphasize this example is not meant to be replicated for internal/private services. In talking to Jacob Hoffman-Andrews from Let’s Encrypt, he mentioned: In general, I recommend that people don’t use Let’s Encrypt certificates for gRPC or other internal RPC services. In my opinion, it’s both easier and safer to generate a single-purpose internal CA using something like minica and generate both server and client certificates with it. That way you don’t have to open up your RPC servers to the outside internet, plus you limit the scope of trust to just what’s needed for your internal RPCs, plus you can have a much longer certificate lifetime, plus you can get revocation that works.


But maybe there is a simpler approach, more akin to how people visit websites? Moreover, public corporate endpoints (e.g. Sber) seemingly do not really bother with certificates (but the client is secure, judging by the logs). Maybe you can combine ease of use and security?

Turns out yes, but this is not immediately evident from docs and guides:

- You should leave the client certificate blank. In this case the client will fall back to the roots of widely accepted CAs packaged with the current version of the gRPC package;

- You should obtain your certificates from a trusted CA like LetsEncrypt (use certbot or any other client in case you need automation);

- If you are not using the latest gRPC client, chances are you will encounter a similar gotcha when testing your production certificate with cryptic errors. Explanation - CAs also rotate their keys. Solution - just update your package;

Minimalistic example:

Client side:

import grpc

ssl_creds = grpc.ssl_channel_credentials()
channel = grpc.secure_channel(bind_address, ssl_creds)

Server side:

with open('privkey.pem', 'rb') as f:
    private_key = f.read()
with open('fullchain.pem', 'rb') as f:
    certificate_chain = f.read()

server_creds = grpc.ssl_server_credentials(
    ((private_key, certificate_chain),))

server.add_secure_port(bind_address, server_creds)

When this works, do not forget some sort of additional token-based auth, e.g. via call metadata, as sketched below.
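
A minimal way to do that with gRPC metadata (a sketch; stub, method and token names are hypothetical):

# client side - attach the token to every call
response = stub.SomeMethod(request,
                           metadata=(('authorization', 'Bearer MY_TOKEN'),))

# server side - check the token inside the handler
def SomeMethod(self, request, context):
    meta = dict(context.invocation_metadata())
    if meta.get('authorization') != 'Bearer MY_TOKEN':
        context.abort(grpc.StatusCode.UNAUTHENTICATED, 'invalid token')
    ...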