Spark in me - Internet, data science, math, deep learning, philosophy
All this - lost like tears in rain.

Data science, ML, a bit of philosophy and math. No bs.

Amazingly simple code to mimic fastText's n-gram subword routine

Nuff said. Try it for yourself.

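from fastText import load_model

model = load_model('official_fasttext_wiki_200_model')

def find_ngrams(string, n):
    # slide a window of width n over the string
    ngrams = zip(*[string[i:] for i in range(n)])
    ngrams = [''.join(_) for _ in ngrams]
    return ngrams

string = 'грёзоблаженствующий'

ngrams = []
for i in range(3,7):
    # assumed loop body (missing in the source): fastText wraps the word in '<' and '>'
    ngrams.extend(find_ngrams('<' + string + '>', i))

ft_ngrams, ft_indexes = model.get_subwords(string)

ngrams = set(ngrams)
ft_ngrams = set(ft_ngrams)

# assumed final step: the sets should match, except for the whole word that fastText may also return
print(ft_ngrams ^ ngrams)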
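
If you do not have a fastText model at hand, here is a minimal pure-Python sketch of the same idea. The function name char_ngrams is made up for this example, and it assumes fastText's default n-gram range (minn=3, maxn=6) and the '<' / '>' word boundary markers. It is not the library's implementation, just the same windowing logic.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of '<word>' for every n in [min_n, max_n]."""
    wrapped = f"<{word}>"  # boundary markers, as fastText adds them
    ngrams = []
    for n in range(min_n, max_n + 1):
        # every window of width n that fits inside the wrapped word
        ngrams.extend(wrapped[i:i + n] for i in range(len(wrapped) - n + 1))
    return ngrams


trigrams = char_ngrams("where", min_n=3, max_n=3)
all_ngrams = char_ngrams("where")
print(trigrams)
```

Python strings are sequences of Unicode code points, so this works the same way for Cyrillic words like the one above.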

Head of DS in Ostrovok (Moscow)

Please contact @eshneyderman (Евгений Шнейдерман) if you are up to the challenge.

Monkey patching a PyTorch model

Well, ideally you should not do this.
But sometimes you just need to quickly test something and amend your model on the fly.

This helps:

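import torch
import functools

def rsetattr(obj, attr, val):
    # set a nested attribute given a dotted path, e.g. 'encoder.block.conv'
    pre, _, post = attr.rpartition('.')
    return setattr(rgetattr(obj, pre) if pre else obj, post, val)

def rgetattr(obj, attr, *args):
    # get a nested attribute given a dotted path
    def _getattr(obj, attr):
        return getattr(obj, attr, *args)
    return functools.reduce(_getattr, [obj] + attr.split('.'))

# list() takes a snapshot, so modules can be swapped while looping
for old_module_path, old_module_object in list(model.named_modules()):
    # replace an old object with the new one
    # copy some settings and its state
    if isinstance(old_module_object, torch.nn.SomeClass):
        # SomeClass / SomeOtherClass / some_settings are placeholders
        new_module = SomeOtherClass(old_module_object.some_settings)
        # assumed final step (the source is cut off here): put the new module at the old path
        rsetattr(model, old_module_path, new_module)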


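The above code essentially does the same as the following, where path.to.some.block is a placeholder for the dotted path of the module being replaced:

model.path.to.some.block = some_other_block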
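
To see what the two helpers do without loading a real network, here is a small sketch on plain Python objects. The nested SimpleNamespace is a hypothetical stand-in for a model, and the attribute names (encoder, block, conv) are made up. The helpers are slightly rewritten, but follow the same logic as above.

```python
import functools
from types import SimpleNamespace


def rgetattr(obj, attr, *default):
    """getattr that follows a dotted path such as 'encoder.block.conv'."""
    def _getattr(current, name):
        return getattr(current, name, *default)
    return functools.reduce(_getattr, attr.split('.'), obj)


def rsetattr(obj, attr, val):
    """setattr that follows a dotted path; only the last component is assigned."""
    parent_path, _, name = attr.rpartition('.')
    parent = rgetattr(obj, parent_path) if parent_path else obj
    setattr(parent, name, val)


# hypothetical stand-in for a nested model
model = SimpleNamespace(
    encoder=SimpleNamespace(block=SimpleNamespace(conv='old_conv'))
)

rsetattr(model, 'encoder.block.conv', 'new_conv')
patched = rgetattr(model, 'encoder.block.conv')
missing = rgetattr(model, 'encoder.block.nope', None)  # default instead of AttributeError
print(patched, missing)
```

With a real torch.nn.Module the dotted paths are exactly the names that named_modules() yields, which is why the loop above can pass them straight to rsetattr.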

PCIE risers that REALLY WORK for DL

Thermaltake TT Premium PCIE 3.0 extender.
All the others I tried were crap.

Wiki graph database

Just found out that Wikipedia also provides this

May be useful for research in the future.
Seems very theoretical and probably works only for English, but it is best to keep such things on the radar.

Example queries:
People who were born in Berlin before 1900
German musicians with German and English descriptions
Musicians who were born in Berlin

Going from millions of points of data to billions on a single machine

In my experience pandas works fine with tables of up to 50-100 million rows.

Of course, plain indexing/caching (i.e. pre-processing all of your data in chunks and indexing it somehow) and/or clever map/reduce-style optimizations also work.
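
A minimal stdlib-only sketch of the chunked map/reduce idea: aggregate each chunk separately, then merge the partial results. All function names and the toy data are made up for this example. With pandas the chunks would typically come from pd.read_csv(..., chunksize=N) instead of iter_chunks.

```python
from collections import Counter
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def map_chunk(chunk):
    """'Map' step: aggregate one chunk that fits in memory."""
    totals = Counter()
    for key, value in chunk:
        totals[key] += value
    return totals


def reduce_partials(partials):
    """'Reduce' step: merge the small per-chunk aggregates."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts
    return total


# hypothetical toy data standing in for a table too large to load at once
rows = ((f"user_{i % 3}", i) for i in range(10))
totals = reduce_partials(map_chunk(c) for c in iter_chunks(rows, chunk_size=4))
print(dict(totals))
```

Only one chunk plus the running totals is held in memory at any time, so the same pattern scales to files much larger than RAM as long as the aggregate itself stays small.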

But sometimes it is just good to know that such things exist:

- for large data-frames + some nice visualizations;
- for large visualizations;
- Also you can use Dask for these purposes I guess;

Python3 nvidia driver bindings in glances

Previously only Python 2 bindings were available.

If you update your drivers and glances, you will get a nice GPU memory / load indicator within glances.
So convenient.

Another set of links on Common Crawl for NLP

Looks like we were not the first, ofc.

Below are some projects dedicated to NLP corpus retrieval at scale:
- Java + license detection + boilerplate removal: $dkpro-c4corpus-doc/doclinks/1/

- Prepared deduplicated CC text archives

- Google group!topic/common-crawl/6F-yXsC35xM


Downloading 200GB files in literally hours

(1) Order 500 Mbit/s Internet connection from your ISP
(2) Use aria2 with the -x flag (several connections per server)
(3) Profit
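
A quick back-of-the-envelope check of the numbers above. The 70% link utilization is an assumption for illustration, not a measurement, and GB is taken as decimal (1 GB = 1000 MB).

```python
def download_time_hours(size_gb, link_mbit_per_s, efficiency=1.0):
    """Hours needed to download size_gb over a link of link_mbit_per_s."""
    size_megabits = size_gb * 1000 * 8  # decimal GB -> megabytes -> megabits
    seconds = size_megabits / (link_mbit_per_s * efficiency)
    return seconds / 3600


ideal = download_time_hours(200, 500)                      # fully saturated link
realistic = download_time_hours(200, 500, efficiency=0.7)  # assumed 70% utilization
print(f"ideal: {ideal:.2f} h, at 70%: {realistic:.2f} h")
```

So a saturated 500 Mbit/s link moves 200 GB in roughly an hour. In practice the bottleneck is often the remote server's per-connection limit, which is what multiple connections via -x work around.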