Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
Araneum russicum maximum

TLDR - the largest corpus of the Russian Internet. FastText embeddings pre-trained on this corpus work best for broad Internet-related domains.

A pre-processed version can be downloaded from RusVectores.
AFAIK this link is not yet on their website (?)

wget http://rusvectores.org/static/rus_araneum_maxicum.txt.gz
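
If you grab the archive, here is a minimal sketch for streaming it without unpacking (file name taken from the wget line above):

import gzip

# stream the gzipped corpus line by line, no need to unpack the whole file
with gzip.open('rus_araneum_maxicum.txt.gz', 'rt', encoding='utf-8') as f:
    for i, line in enumerate(f):
        print(line.strip())
        if i >= 5:
            break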

#nlp
New fast.ai course

Mainly decision tree practice.

A lot about decision tree visualization
- http://www.fast.ai/2018/09/26/ml-launch/

I personally would check out the visualization bits.
At least it looks like they are not pushing their crappy library =)
The problem with any such visualizations is that they work only for toy datasets.
The drop-column / shuffle (permutation importance) approach seems more robust.
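
To make the shuffle idea concrete, here is a minimal permutation-importance sketch on a toy dataset, assuming a recent scikit-learn with sklearn.inspection.permutation_importance:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# fit a forest on a toy dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# shuffle each feature on the hold-out set and measure how much the score drops
result = permutation_importance(clf, X_val, y_val, n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(i, result.importances_mean[i])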

#data_science
If you are mining for a large web-corpus

... for any language other than English, and you do not want to scrape anything, buy proxies, or learn this somewhat shady "stack".

For Russian, the araneum link posted above contains already-processed data, which may be useless for certain domains.

What to do?

(1)
For Russian, you can write to the author here:
https://tatianashavrina.github.io/taiga_site/
She will share her 90+ GB raw corpus with you.

(2)
For any other language there is a second way (a minimal query sketch follows the links below):
- Go to common crawl website;
- Download the index (200 GB);
- Choose domains in your country / language (now they also have language detection);
- Download only plain-text files you need;

Links to start with
- http://commoncrawl.org/connect/blog/
- http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
- https://www.slideshare.net/RobertMeusel/mining-a-large-web-corpus
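
To make the steps a bit more concrete, here is a rough sketch of querying the CDX index server and pulling a single record out of a crawl archive; the crawl id and domain are placeholders, and the exact endpoints are my assumption:

import gzip
import io
import json

import requests

# query the CDX index for one domain in one crawl
INDEX = 'https://index.commoncrawl.org/CC-MAIN-2018-39-index'
resp = requests.get(INDEX, params={'url': 'example.com/*', 'output': 'json'})
records = [json.loads(line) for line in resp.text.splitlines()]

# fetch just the bytes of the first record via an HTTP range request
rec = records[0]
offset, length = int(rec['offset']), int(rec['length'])
warc = requests.get('https://data.commoncrawl.org/' + rec['filename'],
                    headers={'Range': 'bytes=%d-%d' % (offset, offset + length - 1)})

# each record is an individually gzipped WARC member
raw = gzip.GzipFile(fileobj=io.BytesIO(warc.content)).read()
print(raw[:500].decode('utf-8', errors='replace'))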

#nlp
Andrew Ng book

Looks like its draft is finished.

It describes in plain terms how to build ML pipelines:
- https://drive.google.com/open?id=1aHVZ9pcsGtIcgarZxV-Qfkb0JEvtSLDK

#data_science
Forwarded from Админим с Буквой (bykva)
GitHub and SSH keys

I learned about this GitHub feature: you can grab an account's public key via a link like the one below. Handy for passing a key around as a simple link.

https://github.com/bykvaadm.keys
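
A quick sketch of the same thing from Python, if you need the key programmatically:

import requests

# fetch the public SSH key(s) of a GitHub account by username
print(requests.get('https://github.com/bykvaadm.keys').text)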

#github
Amazingly simple code to mimic fastText's n-gram subword routine

Nuff said. Try it for yourself.

from fastText import load_model

model = load_model('official_fasttext_wiki_200_model')

def find_ngrams(string, n):
    # all character n-grams of length n
    ngrams = zip(*[string[i:] for i in range(n)])
    ngrams = [''.join(_) for _ in ngrams]
    return ngrams

string = 'грёзоблаженствующий'

# fastText uses character n-grams of '<word>' with n from 3 to 6 by default
ngrams = []
for i in range(3, 7):
    ngrams.extend(find_ngrams('<' + string + '>', i))

ft_ngrams, ft_indexes = model.get_subwords(string)

ngrams = set(ngrams)
ft_ngrams = set(ft_ngrams)

print(sorted(ngrams), sorted(ft_ngrams))
print(ft_ngrams.difference(ngrams), ngrams.difference(ft_ngrams))
#nlp
Head of DS in Ostrovok (Moscow)

Please contact @eshneyderman (Евгений Шнейдерман) if you are up to the challenge.

#jobs
Monkey patching a PyTorch model

Well, ideally you should not do this.
But sometimes you just need to quickly test something and amend your model on the fly.

This helps:


import torch
import functools

def rsetattr(obj, attr, val):
    # setattr that understands dotted paths like 'block.0.conv'
    pre, _, post = attr.rpartition('.')
    return setattr(rgetattr(obj, pre) if pre else obj, post, val)

def rgetattr(obj, attr, *args):
    # getattr that understands dotted paths
    def _getattr(obj, attr):
        return getattr(obj, attr, *args)
    return functools.reduce(_getattr, [obj] + attr.split('.'))

for old_module_path, old_module_object in model.named_modules():
    # replace an old module with a new one,
    # copying some settings and its state
    if isinstance(old_module_object, torch.nn.SomeClass):
        new_module = SomeOtherClass(old_module_object.some_settings,
                                    old_module_object.some_other_settings)

        new_module.load_state_dict(old_module_object.state_dict())
        rsetattr(model, old_module_path, new_module)


The above code essentially does the same as:

model.path.to.some.block = some_other_block
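
A concrete (made-up) example of the same trick with the rsetattr helper above: swapping every ReLU in a torchvision ResNet for a LeakyReLU. ReLU has no state, so there is nothing to copy here.

import torch
import torchvision

model = torchvision.models.resnet18()

# list() so we do not mutate the module tree while iterating over it
for module_path, module in list(model.named_modules()):
    if isinstance(module, torch.nn.ReLU):
        rsetattr(model, module_path, torch.nn.LeakyReLU(0.1, inplace=True))

print(model.relu)  # LeakyReLU(negative_slope=0.1, inplace=True)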

#python
#pytorch
#deep_learning
#oop
PCIE risers that REALLY WORK for DL

Thermaltake TT Premium PCIE 3.0 extender.
All the others I tried were crap.

#deep_learning
Wiki graph database

Just found out that Wikipedia data is also available as a graph database (DBpedia):
- https://wiki.dbpedia.org/OnlineAccess
- https://wiki.dbpedia.org/downloads-2016-10#p10608-2

May be useful for research in future.
Seems very theoretical and probably works only for English, but it is best to keep such things on the radar.

Example queries:
People who were born in Berlin before 1900
German musicians with German and English descriptions
Musicians who were born in Berlin
Games
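
As a rough sketch, here is the "musicians who were born in Berlin" query run against DBpedia's public SPARQL endpoint from Python; the exact query text and endpoint parameters are my own approximation:

import requests

query = """
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?person ?name WHERE {
    ?person dbo:birthPlace dbr:Berlin ;
            a dbo:MusicalArtist ;
            rdfs:label ?name .
    FILTER (lang(?name) = "en")
} LIMIT 10
"""

resp = requests.get('https://dbpedia.org/sparql',
                    params={'query': query,
                            'format': 'application/sparql-results+json'})
for row in resp.json()['results']['bindings']:
    print(row['name']['value'])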

#data_science
Going from millions of points of data to billions on a single machine

In my experience, pandas works fine with tables of up to 50-100M rows.

Of course plain indexing/caching (i.e. pre-processing all of your data in chunks and indexing it somehow) and/or clever map/reduce-style optimizations work.
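
A minimal sketch of the chunked pre-processing idea, assuming a hypothetical big.csv with a user_id column; only small per-chunk summaries are kept in RAM:

import pandas as pd

chunk_aggregates = []

# do the heavy per-row work chunk by chunk, keep only a small summary of each
for chunk in pd.read_csv('big.csv', chunksize=1_000_000):
    chunk_aggregates.append(chunk.groupby('user_id').size())

# combine the per-chunk summaries into one final result
counts = pd.concat(chunk_aggregates).groupby(level=0).sum()
print(counts.sort_values(ascending=False).head())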

But sometimes it is just good to know that such things exist:

- https://vaex.io/ for large data-frames + some nice visualizations;
- Datashader.org for large visualizations;
- You can probably also use Dask for these purposes: https://jakevdp.github.io/blog/2015/08/14/out-of-core-dataframes-in-python/;

#data_science
Python 3 NVIDIA driver bindings in glances

They used to have only Python 2 bindings.

If you update your drivers and glances, you will get a nice GPU memory / load indicator within glances.
So convenient.

#linux
Another set of links for common crawl for NLP

Looks like we were not the first, ofc.

Below are some projects dedicated to NLP corpus retrieval at scale:
- Java + license detection + boilerplate removal: https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/

- Prepared deduplicated CC text archives: http://data.statmt.org/ngrams/deduped/

- Google group: https://groups.google.com/forum/#!topic/common-crawl/6F-yXsC35xM

Wow!

#nlp