Amazingly simple code to mimic fastText's n-gram subword routine
Nuff said. Try it for yourself.
from fastText import load_model

model = load_model('official_fasttext_wiki_200_model')

def find_ngrams(string, n):
    # collect all character n-grams of length n
    ngrams = zip(*[string[i:] for i in range(n)])
    ngrams = [''.join(_) for _ in ngrams]
    return ngrams

string = 'грёзоблаженствующий'

# fastText uses character n-grams of length 3 to 6 by default,
# with '<' and '>' marking the word boundaries
ngrams = []
for i in range(3, 7):
    ngrams.extend(find_ngrams('<' + string + '>', i))

ft_ngrams, ft_indexes = model.get_subwords(string)

ngrams = set(ngrams)
ft_ngrams = set(ft_ngrams)

print(sorted(ngrams), sorted(ft_ngrams))
print(ft_ngrams.difference(ngrams), ngrams.difference(ft_ngrams))
#nlp
Head of DS at Ostrovok (Moscow)
Please contact @eshneyderman (Евгений Шнейдерман) if you are up to the challenge.
#jobs
Parsing Wikipedia in 4 plain commands in Python
Wrote a small take on using Wikipedia as a corpus for NLP.
https://spark-in.me/post/parsing-wikipedia-in-four-commands-for-nlp
https://medium.com/@aveysov/parsing-wikipedia-in-4-simple-commands-for-plain-nlp…
Monkey patching a PyTorch model
Well, ideally you should not do this.
But sometimes you just need to quickly test something and amend your model on the fly.
This helps:
import torch
import functools

def rsetattr(obj, attr, val):
    # setattr that understands dotted paths like 'layer1.0.conv1'
    pre, _, post = attr.rpartition('.')
    return setattr(rgetattr(obj, pre) if pre else obj, post, val)

def rgetattr(obj, attr, *args):
    # getattr that understands dotted paths
    def _getattr(obj, attr):
        return getattr(obj, attr, *args)
    return functools.reduce(_getattr, [obj] + attr.split('.'))

for old_module_path, old_module_object in model.named_modules():
    # replace an old object with the new one,
    # copying some settings and its state
    if isinstance(old_module_object, torch.nn.SomeClass):
        new_module = SomeOtherClass(old_module_object.some_settings,
                                    old_module_object.some_other_settings)
        new_module.load_state_dict(old_module_object.state_dict())
        rsetattr(model, old_module_path, new_module)
The above code essentially does the same as:
model.path.to.some.block = some_other_block
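For a more concrete (and entirely hypothetical) illustration, the same rsetattr helper could be used to swap every ReLU in a torchvision resnet18 for a LeakyReLU; the model choice and the negative_slope value here are just assumptions for the sake of the example:
import torch
import torchvision

model = torchvision.models.resnet18()

# collect the replacements first, then apply them,
# so that the module tree is not mutated while we iterate over it
replacements = [(name, torch.nn.LeakyReLU(negative_slope=0.1, inplace=True))
                for name, module in model.named_modules()
                if isinstance(module, torch.nn.ReLU)]

for name, new_module in replacements:
    rsetattr(model, name, new_module)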
#python
#pytorch
#deep_learning
#oop
PCIE risers that REALLY WORK for DL
Thermaltake TT Premium PCIE 3.0 extender.
All the others I tried were crap.
#deep_learning
Wiki graph database
Just found out that Wikipedia also provides this
- https://wiki.dbpedia.org/OnlineAccess
- https://wiki.dbpedia.org/downloads-2016-10#p10608-2
May be useful for research in the future.
Seems rather theoretical and probably works only for English, but it is best to keep such things on the radar.
Example queries:
People who were born in Berlin before 1900
German musicians with German and English descriptions
Musicians who were born in Berlin
Games
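As a rough illustration, one of these queries (musicians who were born in Berlin) can be run against the public SPARQL endpoint from Python; a minimal sketch, assuming the SPARQLWrapper package and the usual dbo / dbr / foaf prefixes:
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper('https://dbpedia.org/sparql')
sparql.setQuery('''
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX dbr:  <http://dbpedia.org/resource/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT DISTINCT ?name WHERE {
        ?person a dbo:MusicalArtist ;
                dbo:birthPlace dbr:Berlin ;
                foaf:name ?name .
    } LIMIT 10
''')
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results['results']['bindings']:
    print(row['name']['value'])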
#data_science
Going from millions of points of data to billions on a single machine
In my experience pandas works fine with tables up to 50-100m rows.
Of course, plain indexing/caching (i.e. pre-processing all of your data in chunks and indexing it somehow) and/or clever map/reduce-style optimizations also work.
But sometimes it is just good to know that such things exist:
- https://vaex.io/ for large data-frames + some nice visualizations;
- Datashader.org for large visualizations;
- You can probably also use Dask for these purposes (see the sketch below): https://jakevdp.github.io/blog/2015/08/14/out-of-core-dataframes-in-python/
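A minimal out-of-core sketch with Dask, assuming a hypothetical directory of CSV chunks with day and price columns:
import dask.dataframe as dd

# lazily point at a (hypothetical) directory of CSV chunks;
# nothing is loaded into RAM at this point
df = dd.read_csv('data/chunks/*.csv')

# build a lazy computation graph ...
daily_mean = df.groupby('day')['price'].mean()

# ... and only now stream the chunks through it
print(daily_mean.compute())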
#data_science
Python3 nvidia driver bindings in glances
They used to have only python2 ones.
If you update your drivers and glances, you will get a nice GPU memory / load indicator within glances.
So convenient.
#linux
A small continuation of the crawling saga
2 takes on the Common Crawl
https://spark-in.me/post/parsing-common-crawl-in-four-simple-commands
https://spark-in.me/post/parsing-common-crawl-in-two-simple-commands
It turned out to be a bit tougher than expected, but doable.
#nlp
Another set of links for common crawl for NLP
Looks like we were not the first, of course.
Below are some projects dedicated to NLP corpus retrieval at scale:
- Java + license detection + boilerplate removal: https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/
- Prepared deduplicated CC text archives: http://data.statmt.org/ngrams/deduped/
- Google group: https://groups.google.com/forum/#!topic/common-crawl/6F-yXsC35xM
Wow!
#nlp
Downloading 200GB files in literally hours
(1) Order a 500 Mbit/s Internet connection from your ISP
(2) Use aria2 (https://aria2.github.io/) with the -x flag to open multiple connections per server
(3) Profit
#data_science
DS/ML digest 26
More interesting NLP papers / material ...
https://spark-in.me/post/2018_ds_ml_digest_26
#digest
#deep_learning
#data_science
An Open source alternative to Mendeley
Looks like Zotero is also cross-platform and open-source.
You can also import your whole Mendeley library with one button push:
https://www.zotero.org/support/kb/mendeley_import
#data_science
Looks like mixed precision training ... is solved in PyTorch
Lol - and I could not find it
https://github.com/NVIDIA/apex/tree/master/apex/amp
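For reference, a minimal sketch of what an amp training loop tends to look like; this follows the amp.initialize interface of later apex releases, and the model, data loader and opt_level here are assumptions, so check the repo above for the exact API:
import torch
from apex import amp

model = MyModel().cuda()                                  # hypothetical model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# patch the model and optimizer for mixed precision ('O1' = mixed precision mode)
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

for inputs, targets in loader:                            # hypothetical data loader
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    # scale the loss so that fp16 gradients do not underflow
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()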
#deep_learning
Mixed precision distributed training ImageNet example in PyTorch
https://github.com/NVIDIA/apex/blob/master/examples/imagenet/main.py
#deep_learning
Google's super resolution zoom
Finally Google made something interesting
https://www.youtube.com/watch?v=z-ZJqd4eQrc
https://ai.googleblog.com/2018/10/see-better-and-further-with-super-res.html
I guess PyTorch is in the bottom left corner, but realistically the author of this snippet did a lot of
import A as B