Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
Amazingly simple code to mimic fastText's n-gram subword routine

Nuff said. Try it for yourself.

from fastText import load_model

model = load_model('official_fasttext_wiki_200_model')

def find_ngrams(string, n):
    ngrams = zip(*[string[i:] for i in range(n)])
    ngrams = [''.join(_) for _ in ngrams]
    return ngrams

string = 'грёзоблаженствующий'

ngrams = []
for i in range(3, 7):
    ngrams.extend(find_ngrams('<' + string + '>', i))

ft_ngrams, ft_indexes = model.get_subwords(string)

ngrams = set(ngrams)
ft_ngrams = set(ft_ngrams)

print(sorted(ngrams), sorted(ft_ngrams))
print(ft_ngrams.difference(ngrams), ngrams.difference(ft_ngrams))
#nlp
Head of DS at Ostrovok (Moscow)

Please contact @eshneyderman (Евгений Шнейдерман) if you are up to the challenge.

#jobs
Monkey patching a PyTorch model

Well, ideally you should not do this.
But sometimes you just need to quickly test something and amend your model on the fly.

This helps:


import torch
import functools

def rsetattr(obj, attr, val):
    pre, _, post = attr.rpartition('.')
    return setattr(rgetattr(obj, pre) if pre else obj, post, val)

def rgetattr(obj, attr, *args):
    def _getattr(obj, attr):
        return getattr(obj, attr, *args)
    return functools.reduce(_getattr, [obj] + attr.split('.'))

for old_module_path, old_module_object in model.named_modules():
    # replace an old module with a new one,
    # copying over its settings and state
    if isinstance(old_module_object, torch.nn.SomeClass):
        new_module = SomeOtherClass(old_module_object.some_settings,
                                    old_module_object.some_other_settings)

        new_module.load_state_dict(old_module_object.state_dict())
        rsetattr(model, old_module_path, new_module)


The above code essentially does the same as:


model.path.to.some.block = some_other_block
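As a concrete illustration of the same recipe (a minimal sketch; the torchvision model and the ReLU -> LeakyReLU swap are just assumptions for the example):

import torch
import torchvision

model = torchvision.models.resnet18()

# walk all submodules and swap every ReLU for a LeakyReLU,
# reusing rsetattr() from above to handle dotted paths like 'layer1.0.relu'
for name, old_module in model.named_modules():
    if isinstance(old_module, torch.nn.ReLU):
        rsetattr(model, name, torch.nn.LeakyReLU(negative_slope=0.1, inplace=True))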

#python
#pytorch
#deep_learning
#oop
PCIE risers that REALLY WORK for DL

Thermaltake TT Premium PCIE 3.0 extender.
All the others I tried were crap.

#deep_learning
Wiki graph database

Just found out that Wikipedia data is also available as a graph database (DBpedia):
- https://wiki.dbpedia.org/OnlineAccess
- https://wiki.dbpedia.org/downloads-2016-10#p10608-2

May be useful for research in future.
Seems rather academic and probably works only for English, but it is best to keep such things on the radar.

Example queries:
People who were born in Berlin before 1900
German musicians with German and English descriptions
Musicians who were born in Berlin
Games
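
A minimal sketch of running the first example query against the public DBpedia SPARQL endpoint (the exact property names are an assumption based on the standard DBpedia ontology):

import requests

query = """
SELECT ?name ?birth WHERE {
  ?person dbo:birthPlace dbr:Berlin .
  ?person dbo:birthDate ?birth .
  ?person foaf:name ?name .
  FILTER (?birth < "1900-01-01"^^xsd:date)
} LIMIT 20
"""

# the public endpoint can return plain SPARQL JSON results
resp = requests.get('https://dbpedia.org/sparql',
                    params={'query': query,
                            'format': 'application/sparql-results+json'})

for row in resp.json()['results']['bindings']:
    print(row['name']['value'], row['birth']['value'])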

#data_science
Going from millions of data points to billions on a single machine

In my experience pandas works fine with tables up to 50-100m rows.

Ofc plain indexing / caching (i.e. pre-processing all of your data in chunks and indexing it somehow) and / or clever map-reduce-style optimizations also work.
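
A minimal chunking sketch (the file name and column are made up for the example):

import pandas as pd

# aggregate a CSV that does not fit into RAM by streaming it in chunks
totals = {}
for chunk in pd.read_csv('huge_table.csv', chunksize=1_000_000):
    for user_id, count in chunk.groupby('user_id').size().items():
        totals[user_id] = totals.get(user_id, 0) + count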

But sometimes it is just good to know that such things exist:

- https://vaex.io/ for large data-frames + some nice visualizations;
- Datashader.org for large visualizations;
- Also you can use Dask for these purposes I guess https://jakevdp.github.io/blog/2015/08/14/out-of-core-dataframes-in-python/;
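
A rough out-of-core Dask sketch (the file pattern and columns are hypothetical):

import dask.dataframe as dd

# lazily read a directory of CSV parts; nothing is loaded until .compute()
df = dd.read_csv('data/part-*.csv')
print(df.groupby('user_id')['amount'].sum().compute())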

#data_science
Python3 nvidia driver bindings in glances

They used to have only python2 ones.

If you update your drivers and glances, you will get a nice GPU memory / load indicator within glances.
So convenient.

#linux
Another set of links for common crawl for NLP

Looks like we were not the first, ofc.

Below are some projects dedicated to NLP corpus retrieval at scale:
- Java + license detection + boilerplate removal: https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/

- Prepared deduplicated CC text archives http://data.statmt.org/ngrams/deduped/

- Google group
https://groups.google.com/forum/#!topic/common-crawl/6F-yXsC35xM

Wow!

#nlp
Downloading 200GB files in literally hours

(1) Order 500 Mbit/s Internet connection from your ISP
(2) Use aria2 - https://aria2.github.io/ - with -x (multiple connections per server)
(3) Profit
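
For example (the URL is a placeholder; -x caps connections per server at 16, -s sets how many pieces the file is split into):

aria2c -x 16 -s 16 https://example.com/big_archive.tar.gz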

#data_science
An Open source alternative to Mendeley

Looks like Zotero is also cross-platform and open-source.

Also you can import the whole Mendeley library with 1 button push:
https://www.zotero.org/support/kb/mendeley_import

#data_science
Mixed precision distributed training ImageNet example in PyTorch

https://github.com/NVIDIA/apex/blob/master/examples/imagenet/main.py

#deep_learning
I guess PyTorch is in the bottom left corner, but realistically the author of this snippet did a lot of import A as B