Spark in me
Lost like tears in rain. DS, ML, a bit of philosophy and math. No bs or ads.
Araneum russicum maximum

TLDR - the largest corpus of the Russian Internet. FastText embeddings pre-trained on this corpus work best for broad Internet-related domains.

A pre-processed version can be downloaded from RusVectores.
AFAIK this link is not yet on their website (?)

wget http://rusvectores.org/static/rus_araneum_maxicum.txt.gz
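
If you grab the archive, here is a minimal sketch for streaming it without unpacking (file name taken from the wget line above):

import gzip

# stream the gzipped corpus line by line, no need to unpack the whole file
with gzip.open('rus_araneum_maxicum.txt.gz', 'rt', encoding='utf-8') as f:
    for i, line in enumerate(f):
        print(line.strip())
        if i >= 5:
            break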

#nlp
New fast.ai course

Mainly decision tree practice.

A lot about decision tree visualization
- http://www.fast.ai/2018/09/26/ml-launch/

I personally would check out the visualization bits.
At least it looks like they are not pushing their crappy library =)
The problem with any such visualizations is that they work only for toy datasets.
The drop-column / shuffle (permutation importance) approach seems more robust.
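
To make the shuffle idea concrete, here is a minimal permutation-importance sketch on a toy dataset, assuming a recent scikit-learn with sklearn.inspection.permutation_importance:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# fit a forest on a toy dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# shuffle each feature on the hold-out set and measure how much the score drops
result = permutation_importance(clf, X_val, y_val, n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(i, result.importances_mean[i])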

#data_science
If you are mining for a large web-corpus

... for any language other than English, and you do not want to scrape anything, buy proxies, or learn this somewhat shady "stack".

For Russian, the araneum link posted above contains already-processed data, which may be useless for certain domains.

What to do?

(1)
For Russian, you can write to the author here:
https://tatianashavrina.github.io/taiga_site/
She will share her 90+ GB raw corpus with you.

(2)
For any other language there is a second way (a minimal query sketch follows the links below):
- Go to common crawl website;
- Download the index (200 GB);
- Choose domains in your country / language (now they also have language detection);
- Download only plain-text files you need;

Links to start with
- http://commoncrawl.org/connect/blog/
- http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
- https://www.slideshare.net/RobertMeusel/mining-a-large-web-corpus
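
To make the steps a bit more concrete, here is a rough sketch of querying the CDX index server and pulling a single record out of a crawl archive; the crawl id and domain are placeholders, and the exact endpoints are my assumption:

import gzip
import io
import json

import requests

# query the CDX index for one domain in one crawl
INDEX = 'https://index.commoncrawl.org/CC-MAIN-2018-39-index'
resp = requests.get(INDEX, params={'url': 'example.com/*', 'output': 'json'})
records = [json.loads(line) for line in resp.text.splitlines()]

# fetch just the bytes of the first record via an HTTP range request
rec = records[0]
offset, length = int(rec['offset']), int(rec['length'])
warc = requests.get('https://data.commoncrawl.org/' + rec['filename'],
                    headers={'Range': 'bytes=%d-%d' % (offset, offset + length - 1)})

# each record is an individually gzipped WARC member
raw = gzip.GzipFile(fileobj=io.BytesIO(warc.content)).read()
print(raw[:500].decode('utf-8', errors='replace'))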

#nlp
Andrew Ng book

Looks like its draft is finished.

It describes in plain terms how to build ML pipelines:
- https://drive.google.com/open?id=1aHVZ9pcsGtIcgarZxV-Qfkb0JEvtSLDK

#data_science
Forwarded from Админим с Буквой (bykva)
GitHub and SSH keys

I learned about this GitHub feature: you can grab an account's public key via a link like the one below. Handy for passing a key around as a simple link.

https://github.com/bykvaadm.keys
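
A quick sketch of the same thing from Python, if you need the key programmatically:

import requests

# fetch the public SSH key(s) of a GitHub account by username
print(requests.get('https://github.com/bykvaadm.keys').text)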

#github
Amazingly simple code to mimic fastText's n-gram subword routine

Nuff said. Try it for yourself.

from fastText import load_model

model = load_model('official_fasttext_wiki_200_model')

def find_ngrams(string, n):
    # all character n-grams of length n
    ngrams = zip(*[string[i:] for i in range(n)])
    ngrams = [''.join(_) for _ in ngrams]
    return ngrams

string = 'грёзоблаженствующий'

# fastText uses character n-grams of '<word>' with n from 3 to 6 by default
ngrams = []
for i in range(3, 7):
    ngrams.extend(find_ngrams('<' + string + '>', i))

ft_ngrams, ft_indexes = model.get_subwords(string)

ngrams = set(ngrams)
ft_ngrams = set(ft_ngrams)

print(sorted(ngrams), sorted(ft_ngrams))
print(ft_ngrams.difference(ngrams), ngrams.difference(ft_ngrams))
#nlp
Head of DS in Ostrovok (Moscow)

Please contact @eshneyderman (Евгений Шнейдерман) if you are up to the challenge.

#jobs
Monkey patching a PyTorch model

Well, ideally you should not do this.
But sometimes you just need to quickly test something and amend your model on the fly.

This helps:


import torch
import functools

def rsetattr(obj, attr, val):
    # setattr that understands dotted paths like 'block.0.conv'
    pre, _, post = attr.rpartition('.')
    return setattr(rgetattr(obj, pre) if pre else obj, post, val)

def rgetattr(obj, attr, *args):
    # getattr that understands dotted paths
    def _getattr(obj, attr):
        return getattr(obj, attr, *args)
    return functools.reduce(_getattr, [obj] + attr.split('.'))

for old_module_path, old_module_object in model.named_modules():
    # replace an old module with a new one,
    # copying some settings and its state
    if isinstance(old_module_object, torch.nn.SomeClass):
        new_module = SomeOtherClass(old_module_object.some_settings,
                                    old_module_object.some_other_settings)

        new_module.load_state_dict(old_module_object.state_dict())
        rsetattr(model, old_module_path, new_module)


The above code essentially does the same as:

model.path.to.some.block = some_other_block
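
A concrete (made-up) example of the same trick with the rsetattr helper above: swapping every ReLU in a torchvision ResNet for a LeakyReLU. ReLU has no state, so there is nothing to copy here.

import torch
import torchvision

model = torchvision.models.resnet18()

# list() so we do not mutate the module tree while iterating over it
for module_path, module in list(model.named_modules()):
    if isinstance(module, torch.nn.ReLU):
        rsetattr(model, module_path, torch.nn.LeakyReLU(0.1, inplace=True))

print(model.relu)  # LeakyReLU(negative_slope=0.1, inplace=True)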

#python
#pytorch
#deep_learning
#oop
PCIE risers that REALLY WORK for DL

Thermaltake TT Premium PCIE 3.0 extender.
All the others I tried were crap.

#deep_learning
Wiki graph database

Just found out that Wikipedia data is also available as a graph database (DBpedia):
- https://wiki.dbpedia.org/OnlineAccess
- https://wiki.dbpedia.org/downloads-2016-10#p10608-2

May be useful for research in future.
Seems very theoretical and probably works only for English, but it is best to keep such things on the radar.

Example queries:
People who were born in Berlin before 1900
German musicians with German and English descriptions
Musicians who were born in Berlin
Games
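
As a rough sketch, here is the "musicians who were born in Berlin" query run against DBpedia's public SPARQL endpoint from Python; the exact query text and endpoint parameters are my own approximation:

import requests

query = """
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?person ?name WHERE {
    ?person dbo:birthPlace dbr:Berlin ;
            a dbo:MusicalArtist ;
            rdfs:label ?name .
    FILTER (lang(?name) = "en")
} LIMIT 10
"""

resp = requests.get('https://dbpedia.org/sparql',
                    params={'query': query,
                            'format': 'application/sparql-results+json'})
for row in resp.json()['results']['bindings']:
    print(row['name']['value'])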

#data_science
Going from millions of points of data to billions on a single machine

In my experience, pandas works fine with tables of up to 50-100M rows.

Of course plain indexing/caching (i.e. pre-processing all of your data in chunks and indexing it somehow) and/or clever map/reduce-style optimizations work.
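
A minimal sketch of the chunked pre-processing idea, assuming a hypothetical big.csv with a user_id column; only small per-chunk summaries are kept in RAM:

import pandas as pd

chunk_aggregates = []

# do the heavy per-row work chunk by chunk, keep only a small summary of each
for chunk in pd.read_csv('big.csv', chunksize=1_000_000):
    chunk_aggregates.append(chunk.groupby('user_id').size())

# combine the per-chunk summaries into one final result
counts = pd.concat(chunk_aggregates).groupby(level=0).sum()
print(counts.sort_values(ascending=False).head())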

But sometimes it is just good to know that such things exist:

- https://vaex.io/ for large data-frames + some nice visualizations;
- Datashader.org for large visualizations;
- You can probably also use Dask for these purposes: https://jakevdp.github.io/blog/2015/08/14/out-of-core-dataframes-in-python/;

#data_science
Python 3 NVIDIA driver bindings in glances

They used to have only Python 2 bindings.

If you update your drivers and glances, you will get a nice GPU memory / load indicator within glances.
So convenient.

#linux
Another set of links for common crawl for NLP

Looks like we were not the first, ofc.

Below are some projects dedicated to NLP corpus retrieval at scale:
- Java + license detection + boilerplate removal: https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/

- Prepared deduplicated CC text archives: http://data.statmt.org/ngrams/deduped/

- Google group: https://groups.google.com/forum/#!topic/common-crawl/6F-yXsC35xM

Wow!

#nlp