For the new competition I found an excellent "manual" on Kaggle on how to work with bson (a MongoDB database dump).
Highly recommended reading
- https://www.kaggle.com/humananalog/keras-generator-for-reading-directly-from-bson/notebook
#data_science
#python
A couple of great threads on how to make your Python generator thread-safe, i.e. how to use the workers > 1 parameter of fit_generator in Keras with minimal effort. Useful if your model is heavily CPU-bound (see the sketch after the links).
- https://github.com/fchollet/keras/issues/1638
- https://stackoverflow.com/questions/41194726/python-generator-thread-safety-using-keras
- http://anandology.com/blog/using-iterators-and-generators/
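A minimal sketch of the pattern from these threads (the names are mine; Keras expects the generator to yield batches indefinitely):
import threading

class ThreadSafeIterator:
    # wraps an iterator / generator so that next() is guarded by a lock
    def __init__(self, iterator):
        self.iterator = iterator
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self.lock:
            return next(self.iterator)

def thread_safe_generator(gen_func):
    # decorator that wraps a generator function on each call
    def wrapper(*args, **kwargs):
        return ThreadSafeIterator(gen_func(*args, **kwargs))
    return wrapper

@thread_safe_generator
def batch_generator(batch_size=32):
    while True:
        # in real code: yield (X, y) batches here
        yield list(range(batch_size))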
#data_science
#python
A great copy-paste snippet for checking file hashes.
# make sure you downloaded the files correctly
import hashlib
import os.path as path

def sha256(fname):
    hash_sha256 = hashlib.sha256()
    with open(fname, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):
            hash_sha256.update(chunk)
    return hash_sha256.hexdigest()

filenames = ['', '', '', '', '']
hashes = ['', '', '', '', '']
data_root = path.join('data/')  # make sure you set up this path correctly

# this may take a few minutes
for filename, hash_ in zip(filenames, hashes):
    computed_hash = sha256(path.join(data_root, filename))
    if computed_hash == hash_:
        print('{}: OK'.format(filename))
    else:
        print('{}: fail'.format(filename))
        print('expected: {}'.format(hash_))
        print('computed: {}'.format(computed_hash))
#python
#data_science
Turns out there is already a ready-made SqueezeNet for Keras, with weights =)
Not bad
- https://github.com/wohlert/keras-squeezenet
#python
#neural_nets
I faced the question of extending a PyTorch class that I liked. If everything were trivial, I would just write a function, call it and pass it the class instance, but there is one problem - some utilities in the class call local utilities, and it is not quite clear how to modify those on import.
Inspired by the bson iterator example (posted above - https://goo.gl/xvZErF), it turns out that extending classes is done quite simply (see the sketch after the links):
- One https://goo.gl/JZpfiV
- Two https://goo.gl/D3KkLm
- And some old-school mind-bending stuff for those interested in Python internals
-- https://www.artima.com/weblogs/viewpost.jsp?thread=237121
-- https://www.artima.com/weblogs/viewpost.jsp?thread=236278
-- http://www.artima.com/weblogs/viewpost.jsp?thread=236275
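A toy sketch of the core idea (the names are made up): subclass and override only the local utility, and the methods that call it pick up your version automatically:
class Base:
    def process(self, x):
        # public method that internally calls a 'local utility'
        return self._helper(x)

    def _helper(self, x):
        return x + 1

class Extended(Base):
    # override only the internal helper; process() uses it automatically
    def _helper(self, x):
        return super()._helper(x) * 10

print(Extended().process(1))  # 20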
#python
#data_science
From the perversions department - how do you load a k-means object from Python 2 into Python 3, with an sklearn version bump on top?
The obvious solution does not work because of the sklearn version change
- https://goo.gl/s8V5zf
But this works
# saving - python2
import numpy as np
# centroids = cluster_centers_ of the fitted python2 model
np.savetxt('centroids.txt', centroids, delimiter=',')

# loading - python3
from sklearn.cluster import KMeans
import numpy as np
centroids = np.loadtxt('centroids.txt', delimiter=',')
# pass the saved centroids as an explicit init;
# n_init=1 since a fixed init needs no random restarts
kmeans = KMeans(n_clusters=centroids.shape[0], init=centroids, n_init=1)
#python
A magnificent Python lib for working with video
- https://github.com/Zulko/moviepy
It is built on top of imageio and essentially lets you work with video in one line (instead of plain frame iteration or manual ffmpeg usage). How great that Python has tools like this!
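For example, a minimal sketch (the file names are placeholders):
from moviepy.editor import VideoFileClip

clip = VideoFileClip('input.mp4')              # load the video
clip.subclip(0, 5).write_videofile('cut.mp4')  # save the first 5 seconds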
#python
#video
At my new job I saw that people train their models on Python 2 (WHAT?), on tensorflow (WTF???) and load data in a single thread (it is 2017 out there!).
For this reason I made a slightly trollish presentation for my colleagues. Maybe you will like it too
- https://goo.gl/ne9RH4
Everything simple is very simple - you just need to know where to look)
#data_science
#deep_learning
#python
Just found a book on practical Python programming patterns
- http://python-3-patterns-idioms-test.readthedocs.io/en/latest/PythonForProgrammers.html
Looks good
#python
Amazing article about the most popular warning in Pandas
- https://www.dataquest.io/blog/settingwithcopywarning/
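The classic trigger and the usual fix, on a toy frame (column names are made up):
import pandas as pd

df = pd.DataFrame({'a': [1, -1, 2], 'b': [0, 0, 0]})

# chained indexing: pandas cannot tell if you are writing to a copy
df[df['a'] > 0]['b'] = 1   # raises SettingWithCopyWarning, df may stay unchanged

# the fix: a single .loc call on the original frame
df.loc[df['a'] > 0, 'b'] = 1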
#data_science
Useful Python abstractions / sugar / patterns
I already shared a book about patterns, which contains mostly high-level / more complicated patterns. But for writing ML code, sometimes a simple imperative / functional programming style is fine.
So - I will be posting about simple and really powerful python tips I am learning now.
This time I found out about map and filter, which are super useful for data preprocessing:
Map
items = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x**2, items))
Filter
number_list = range(-5, 5)
less_than_zero = list(filter(lambda x: x < 0, number_list))
print(less_than_zero)  # [-5, -4, -3, -2, -1]
Also found this book - http://book.pythontips.com/en/latest/map_filter.html
#python
#data_science
Readable list comprehensions in Python
My list and dictionary comprehensions usually look like s**t
https://gist.github.com/IaroslavR/7dcb54830242a22de1869f6fd05a8d7e
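The usual fix is to give each clause its own line, e.g. (a quick sketch of the formatting style, not necessarily the exact gist examples):
squares_of_evens = [
    x ** 2              # expression
    for x in range(10)  # source
    if x % 2 == 0       # filter
]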
#python
A decent explanation about decorators in Python
http://book.pythontips.com/en/latest/decorators.html
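The gist of it in a few lines (a standard toy example, not taken from the book verbatim):
import functools

def logged(func):
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        print('calling {}'.format(func.__name__))
        return func(*args, **kwargs)
    return wrapper

@logged
def add(a, b):
    return a + b

print(add(2, 3))  # prints 'calling add', then 5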
#python
Yet another python tricks book
https://dbader.org/
https://www.getdrip.com/deliveries/xugaymstfzmizbyposdk?__s=ejdgfo9tsdhpgcrcscs3
https://vk.com/doc7608079_466151365
#python
Useful Python / PyTorch bits
dot.notation access to dictionary attributes
class dotdict(dict):
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__
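A quick usage sketch (the config keys are made up):
cfg = dotdict({'lr': 1e-3})
print(cfg.lr)        # 0.001 - attribute-style access
cfg.batch_size = 32  # attribute-style assignment
del cfg.batch_size   # attribute-style deletion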
PyTorch embedding layer - ignore padding
nn.Embedding has a padding_idx argument, which tells the layer not to update the padding token's embedding.
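E.g. (a minimal sketch, the sizes are made up):
import torch.nn as nn

# the row at padding_idx is zero-initialized and receives no gradient updates
emb = nn.Embedding(num_embeddings=1000, embedding_dim=300, padding_idx=0)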
#python
#pytorch
Monkey patching a PyTorch model
Well, ideally you should not do this.
But sometimes you just need to quickly test something and amend your model on the fly.
This helps:
import torch
import functools

def rsetattr(obj, attr, val):
    pre, _, post = attr.rpartition('.')
    return setattr(rgetattr(obj, pre) if pre else obj, post, val)

def rgetattr(obj, attr, *args):
    def _getattr(obj, attr):
        return getattr(obj, attr, *args)
    return functools.reduce(_getattr, [obj] + attr.split('.'))

for old_module_path, old_module_object in model.named_modules():
    # replace an old object with the new one,
    # copying some settings and its state
    if isinstance(old_module_object, torch.nn.SomeClass):
        new_module = SomeOtherClass(old_module_object.some_settings,
                                    old_module_object.some_other_settings)
        new_module.load_state_dict(old_module_object.state_dict())
        rsetattr(model, old_module_path, new_module)
The above code essentially does the same as:
model.path.to.some.block = some_other_block
#python
#pytorch
#deep_learning
#oop
A Great Start For Your Custom Python Dockerfiles
I like to popularize really great open-source stuff. I have shared my ML Dockerfiles several times. Now I base my PyTorch workflows on ... surprise-surprise, PyTorch's official images with Apex. But (when I looked) for some reason it was difficult to find the original dockerfiles themselves; there were only images (maybe I did not look well enough).
But what if you need a simpler / different / lighter python workflow without PyTorch / GPUs? Miniconda is an obvious choice. Yeah, and now there is miniconda as a docker image (pre-built) and with a dockerfile! What is also remarkable - my dockerfile, which I inherited from Fchollet in 2017, starts very similarly to this miniconda dockerfile.
https://hub.docker.com/r/continuumio/miniconda3/dockerfile
Enjoy. A great way to start your python project and / or journey.
#python
Pandas ... Parallel Wrappers Strike Back
Someone made this into a library with 800+ stars - https://github.com/nalepae/pandarallel - which is cool AF if it keeps being maintained!
I wrote similar wrappers 2 years ago and I thought no one cared about this, because when I shared it, no one paid any attention. But in our day-to-day work they are still a workhorse. Simple, naïve, concise yet efficient.
I believe this library only has 2 major drawbacks:
- Spawning processes takes a second or so, but this is just python
- It does not support shared memory (?), the only thing that I arguably lack in such a tool
12 - 24, or even 64 core AMD processors are cheap nowadays, you know.
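A minimal usage sketch (based on the library's README; the dataframe and function are made up):
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()  # this is where the worker processes get spawned

df = pd.DataFrame({'x': range(1000)})
# drop-in parallel counterpart of df.apply
df['y'] = df.parallel_apply(lambda row: row['x'] ** 2, axis=1)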
#python
The Uncharted Waters of Having an Encrypted gRPC API Endpoint in Python without Hassle
Typically, you do not have to think twice about SSL in your standard HTTP web apps. Usually it is handled by your reverse proxy out-of-the-box, i.e. caddy / traefik / nginx-proxy, you name it. In some cases you just use certbot and that is it.
The end user (typically a person) does not care about storing and obtaining the certificates (API users also do not really care about the encryption of their data, for some reason preferring to rely on API admins). Moreover, a plethora of tools packaged within reverse proxies makes this just a matter of plumbing and trial and error.
But what if you use a gRPC endpoint? The typical answer is that it will most likely run behind some corporate firewall and will not be exposed to the Internet. But what if it will?
The official docs are not very clear on this topic for this very reason, and there are a few comprehensive (yet kind of old) guides - python, go.
But wait, 95% of these guides just follow the happy path from the docs, i.e. they manually create and manage all of the certificates (imagine the nightmare), or describe in detail how to automate LE certificates in Go.
But they fail on several fronts:
- They fail to mention that you should add some form of token auth even before starting your gRPC session (and the official python example is way too complicated);
- They usually imply that you have to manually (or automatically) create client certificates and distribute them, but usually do not explain what happens if the client leaves this field blank. Turns out each release of gRPC ships with a list of most prominent CAs, which are loaded by default;
- They also often assume a fully automated scenario, when you have enough infrastructure to actually invest into integrating dedicated clients like acme-tiny in your app (a full list can be found here);
The above Go guide also says this (minica):
Is important to emphasize this example is not meant to be replicated for internal/private services. In talking to Jacob Hoffman-Andrews from Let’s Encrypt, he mentioned: In general, I recommend that people don’t use Let’s Encrypt certificates for gRPC or other internal RPC services. In my opinion, it’s both easier and safer to generate a single-purpose internal CA using something like minica and generate both server and client certificates with it. That way you don’t have to open up your RPC servers to the outside internet, plus you limit the scope of trust to just what’s needed for your internal RPCs, plus you can have a much longer certificate lifetime, plus you can get revocation that works.
But maybe there is a simpler approach, more akin to how people visit websites? Moreover, public corporate endpoints (i.e. Sber) seemingly do not really bother with certificates (but the client is secure judging by logs). Maybe you can combine ease of use and security?
Turns out yes, but this is not immediately evident from docs and guides:
- You should leave the client certificate blank. In this case the client will look out for keys from widely accepted CAs packaged with the current version of the gRPC package;
- You should obtain your certificates from a trusted CA like LetsEncrypt (use certbot or any other client in case you need automation);
- If you are not using the latest gRPC client, chances are you will encounter a similar gotcha when testing your production certificate with cryptic errors. Explanation - CAs also rotate their keys. Solution - just update your package;
Minimalistic example:
Client side:
ssl_creds = grpc.ssl_channel_credentials()
channel = grpc.secure_channel(bind_address, ssl_creds)
Server side:
with open('privkey.pem', 'rb') as f:
    private_key = f.read()
with open('fullchain.pem', 'rb') as f:
    certificate_chain = f.read()
server_creds = grpc.ssl_server_credentials(
    ((private_key, certificate_chain,),))
server.add_secure_port(bind_address, server_creds)
When this works, do not forget some sort of additional token-based auth.
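For example, the simplest client-side piece is passing a token via call metadata (a sketch; the stub, the method and the header name are placeholders):
# client side: attach a token to every call
response = stub.SomeMethod(
    request,
    metadata=(('authorization', 'Bearer MY_TOKEN'),),
)
# server side: check context.invocation_metadata() in the handler
# (or in a server interceptor) and abort on a bad token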