Araneum russicum maximum
TLDR - the largest corpus for the Russian Internet.
Fast-text embeddings pre-trained on this corpus work best for broad internet-related domains.
A pre-processed version can be downloaded from rusvectores.
AFAIK, this link is not yet on their website (?)
wget http://rusvectores.org/static/rus_araneum_maxicum.txt.gz
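To sanity-check the download, you can load the vectors with gensim. A minimal sketch, assuming the file is in plain word2vec text format; note that rusvectores models often store tokens with POS tags (e.g. 'интернет_NOUN'), so adjust the lookup to the actual vocabulary:
import gensim

# load plain-text word2vec-format vectors (may take a while for a large file)
model = gensim.models.KeyedVectors.load_word2vec_format(
    'rus_araneum_maxicum.txt.gz', binary=False)
print(model.most_similar('интернет_NOUN'))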
#nlp
New fast.ai course
Mainly decision tree practice.
A lot about decision tree visualization:
- http://www.fast.ai/2018/09/26/ml-launch/
I personally would check out the visualization bits.
At least it looks like they are not pushing their crappy library =)
The problem with any such visualizations is that they only work for toy datasets.
The drop / shuffle (i.e. permutation) importance method seems more robust - see the sketch below.
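A minimal sketch of the shuffle (permutation) importance idea, on synthetic data with a placeholder model; ideally you would measure the metric drop on a held-out set:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# synthetic data: only the first two features actually matter
rng = np.random.RandomState(42)
X = rng.rand(1000, 5)
y = 3 * X[:, 0] + X[:, 1] + 0.1 * rng.rand(1000)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)
baseline = r2_score(y, model.predict(X))

for col in range(X.shape[1]):
    X_shuffled = X.copy()
    rng.shuffle(X_shuffled[:, col])  # break the feature-target link
    drop = baseline - r2_score(y, model.predict(X_shuffled))
    print('feature %d: importance ~ %.3f' % (col, drop))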
#data_science
If you are mining for a large web-corpus
... for any language other than English, and you do not want to scrape anything, buy proxies, or learn this slightly shady "stack".
In the case of Russian, the araneum link posted above contains already processed data, which may be useless for certain domains.
What to do?
(1)
In the case of Russian, you can write here:
https://tatianashavrina.github.io/taiga_site/
The author will share her 90+ GB raw corpus with you.
(2)
For any other language there is a second way (see the sketch after the steps):
- Go to the Common Crawl website;
- Download the index (200 GB);
- Choose domains in your country / language (now they also have language detection);
- Download only plain-text files you need;
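Instead of downloading the whole 200 GB index, you can also query it over HTTP via the CDX API. A minimal sketch; the crawl ID and domain below are just examples:
import json
import requests

# query the index of one crawl for a given domain
resp = requests.get(
    'http://index.commoncrawl.org/CC-MAIN-2018-39-index',
    params={'url': 'lenta.ru/*', 'output': 'json', 'limit': 10})

for line in resp.text.splitlines():
    record = json.loads(line)
    # each record points into a WARC archive: filename + byte offset + length
    print(record['url'], record['filename'], record['offset'], record['length'])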
Links to start with
- http://commoncrawl.org/connect/blog/
- http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
- https://www.slideshare.net/RobertMeusel/mining-a-large-web-corpus
#nlp
Taiga Corpus
Taiga is a corpus in which text sources and their meta-information are collected according to popular ML tasks.
An open-source corpus for machine learning.
Andrew Ng book
Looks like its draft is finished.
It describes in plain terms how to build ML pipelines:
- https://drive.google.com/open?id=1aHVZ9pcsGtIcgarZxV-Qfkb0JEvtSLDK
#data_science
PyTorch 1.0 PRE-RELEASE
https://github.com/pytorch/pytorch/releases/tag/v1.0rc0
Looks like it features tools to deploy PyTorch models...
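For reference, a minimal sketch of the TorchScript tracing API shipped with the pre-release; the toy model below is a placeholder:
import torch

# trace a model into a deployable TorchScript module
model = torch.nn.Sequential(torch.nn.Linear(10, 10), torch.nn.ReLU()).eval()
example_input = torch.rand(1, 10)

traced = torch.jit.trace(model, example_input)
traced.save('model.pt')
loaded = torch.jit.load('model.pt')  # can also be loaded from C++ via libtorch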
#data_science
Forwarded from Админим с Буквой (bykva)
GitHub and SSH keys
Learned about a neat GitHub feature: you can grab an account's public key via a link like the one below. Handy for passing a key around as just a link.
https://github.com/bykvaadm.keys
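E.g. fetching them in Python (the username is the one from the link above):
import requests

# any GitHub account's public SSH keys are served as plain text
keys = requests.get('https://github.com/bykvaadm.keys').text
print(keys)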
#github
Parsing Wikipedia in 4 plain commands in Python
Wrote a small take on using Wikipedia as a corpus for NLP.
https://spark-in.me/post/parsing-wikipedia-in-four-commands-for-nlp
https://medium.com/@aveysov/parsing-wikipedia-in-4-simple-commands-for-plain-nlp-corpus-retrieval-eee66b3ba3ee
Please like / share / repost the article =)
#nlp
#data_science
Amazingly simple code to mimic fast-text's n-gram subword routine
Nuff said. Try it for yourself.
from fastText import load_model

model = load_model('official_fasttext_wiki_200_model')

def find_ngrams(string, n):
    # all contiguous character n-grams of a string
    ngrams = zip(*[string[i:] for i in range(n)])
    ngrams = [''.join(_) for _ in ngrams]
    return ngrams

string = 'грёзоблаженствующий'

# fast-text pads each word with '<' and '>' and takes n-grams of lengths 3..6
ngrams = []
for i in range(3, 7):
    ngrams.extend(find_ngrams('<' + string + '>', i))

ft_ngrams, ft_indexes = model.get_subwords(string)

ngrams = set(ngrams)
ft_ngrams = set(ft_ngrams)
print(sorted(ngrams), sorted(ft_ngrams))
print(ft_ngrams.difference(ngrams), ngrams.difference(ft_ngrams))

#nlp
Head of DS in Ostrovok (Moscow)
Please contact @eshneyderman (Евгений Шнейдерман) if you are up to the challenge.
#jobs
Monkey patching a PyTorch model
Well, ideally you should not do this.
But sometimes you just need to quickly test something and amend your model on the fly.
This helps:
import torch
import functools

def rsetattr(obj, attr, val):
    # setattr for dotted paths like 'path.to.some.block'
    pre, _, post = attr.rpartition('.')
    return setattr(rgetattr(obj, pre) if pre else obj, post, val)

def rgetattr(obj, attr, *args):
    # getattr for dotted paths
    def _getattr(obj, attr):
        return getattr(obj, attr, *args)
    return functools.reduce(_getattr, [obj] + attr.split('.'))

for old_module_path, old_module_object in model.named_modules():
    # replace an old module with a new one,
    # copying some settings and its state
    # (SomeClass / SomeOtherClass are placeholders for real modules)
    if isinstance(old_module_object, torch.nn.SomeClass):
        new_module = SomeOtherClass(old_module_object.some_settings,
                                    old_module_object.some_other_settings)
        new_module.load_state_dict(old_module_object.state_dict())
        rsetattr(model, old_module_path, new_module)
The above code essentially does the same as:
model.path.to.some.block = some_other_block
#python
#pytorch
#deep_learning
#oop
PCIe risers that REALLY WORK for DL
Thermaltake TT Premium PCIe 3.0 extender.
All the others I tried were crap.
#deep_learning
Wiki graph database
Just found out that Wikipedia data is also available as a graph database (DBpedia)
- https://wiki.dbpedia.org/OnlineAccess
- https://wiki.dbpedia.org/downloads-2016-10#p10608-2
May be useful for research in the future.
Seems rather academic and probably works only for English, but it is best to keep such things on the radar.
Example queries:
People who were born in Berlin before 1900
German musicians with German and English descriptions
Musicians who were born in Berlin
Games
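A minimal sketch of the first example query against the public DBpedia SPARQL endpoint, assuming its default dbo:/dbr:/xsd: prefixes:
import requests

# people who were born in Berlin before 1900
query = '''
SELECT ?person ?birth WHERE {
    ?person dbo:birthPlace dbr:Berlin .
    ?person dbo:birthDate ?birth .
    FILTER (?birth < "1900-01-01"^^xsd:date)
} LIMIT 10
'''
resp = requests.get(
    'https://dbpedia.org/sparql',
    params={'query': query, 'format': 'application/sparql-results+json'})

for row in resp.json()['results']['bindings']:
    print(row['person']['value'], row['birth']['value'])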
#data_science
Going from millions of data points to billions on a single machine
In my experience pandas works fine with tables of up to 50-100M rows.
Ofc plain indexing/caching (i.e. pre-processing all of your data in chunks and indexing it somehow) and/or clever map/reduce-style optimizations also work.
But sometimes it is just good to know that such things exist:
- https://vaex.io/ for large data-frames + some nice visualizations;
- Datashader.org for large visualizations;
- You can also use Dask for these purposes, I guess (see the sketch below) - https://jakevdp.github.io/blog/2015/08/14/out-of-core-dataframes-in-python/;
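A minimal Dask sketch; the file pattern and column names are placeholders:
import dask.dataframe as dd

# lazily reads the files in chunks instead of loading everything into RAM
df = dd.read_csv('data/chunk_*.csv')
result = df.groupby('user_id')['value'].mean()  # builds a task graph
print(result.compute())                         # executes out-of-core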
#data_science
Python3 NVIDIA driver bindings in glances
They used to have python2 bindings only.
If you update your drivers and glances, you will get a nice GPU memory / load indicator within glances.
So convenient.
#linux
A small continuation of the crawling saga
2 takes on the Common Crawl
https://spark-in.me/post/parsing-common-crawl-in-four-simple-commands
https://spark-in.me/post/parsing-common-crawl-in-two-simple-commands
It turned out to be a bit tougher than expected, but doable.
#nlp
Another set of links for common crawl for NLP
Looks like we were not the first, ofc.
Below are some projects dedicated to NLP corpus retrieval at scale:
- Java + license detection + boilerplate removal: https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/
- Prepared deduplicated CC text archives http://data.statmt.org/ngrams/deduped/
- Google group
https://groups.google.com/forum/#!topic/common-crawl/6F-yXsC35xM
Wow!
#nlp