Data Science by ODS.ai 🦜
First Telegram Data Science channel. Covers all technical and popular stuff related to Data Science: AI, Big Data, Machine Learning, Statistics, general Math, and applications of the former. To reach the editors, contact: @haarrp
YouTokenToMe, a new text tokenisation tool from the VK team. Meet a new, enhanced tokenisation tool on steroids. It works 7-10 times faster on alphabetic languages and 40 to 50 times faster on logographic languages than the alternatives. Under the hood (watch source)…
New Rust tokenization library from #HuggingFace

Tokenization is the process of converting strings into model input tensors. The library provides BPE, byte-level BPE, WordPiece, and SentencePiece tokenization and computes an exhaustive set of outputs (offset mappings, attention masks, special token masks).
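
To make those outputs concrete, here is a toy sketch in plain Python. It is not the library's implementation: the whitespace splitting, the `encode` helper, the tiny vocabulary, and the `[CLS]`/`[SEP]`/`[PAD]`/`[UNK]` token names are all assumptions chosen for illustration.

```python
import re

def encode(text, vocab, max_len):
    """Toy whitespace tokenizer (hypothetical helper, illustration only)."""
    # Split on whitespace and remember each piece's character span in `text`.
    pieces = [(m.group(), m.span()) for m in re.finditer(r"\S+", text)]

    tokens = ["[CLS]"] + [p for p, _ in pieces] + ["[SEP]"]
    offsets = [(0, 0)] + [span for _, span in pieces] + [(0, 0)]
    special_tokens_mask = [1] + [0] * len(pieces) + [1]
    attention_mask = [1] * len(tokens)

    # Pad up to max_len; padded positions are masked out of attention.
    n_pad = max_len - len(tokens)
    if n_pad < 0:
        raise ValueError("text is longer than max_len")
    tokens += ["[PAD]"] * n_pad
    offsets += [(0, 0)] * n_pad
    special_tokens_mask += [1] * n_pad
    attention_mask += [0] * n_pad

    # Map tokens to integer ids, falling back to the unknown-token id.
    ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    return {
        "tokens": tokens,
        "ids": ids,
        "offsets": offsets,
        "attention_mask": attention_mask,
        "special_tokens_mask": special_tokens_mask,
    }

vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "fast": 4, "tokenizers": 5}
out = encode("fast rust tokenizers", vocab, max_len=7)
for name, values in out.items():
    print(name, values)
```

Here "rust" is missing from the toy vocabulary, so it maps to the `[UNK]` id, while its offset `(5, 9)` still points back to the original substring.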

The library has Python and Node.js bindings.

The quoted post contains information on another fast #tokenization implementation. Looking forward to a speed comparison.

Install: pip install tokenizers
GitHub: https://github.com/huggingface/tokenizers/tree/master/tokenizers
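
A minimal usage sketch of the Python bindings, assuming a recent release of `tokenizers` (the API may differ from the version available when this post was written). The three-sentence corpus, the vocabulary size, and the `[UNK]` token name are assumptions picked to keep the example small; it trains a tiny BPE model in memory and needs no downloads.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Tiny in-memory corpus (illustrative only).
corpus = [
    "tokenization is fast",
    "fast tokenization in rust",
    "rust is fast",
]

# BPE model with whitespace pre-tokenization.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=100, special_tokens=["[UNK]"], show_progress=False)
tokenizer.train_from_iterator(corpus, trainer=trainer)

text = "fast tokenization"
encoding = tokenizer.encode(text)

# Each token comes with its id and the character span it covers in `text`.
for token, token_id, (start, end) in zip(encoding.tokens, encoding.ids, encoding.offsets):
    print(token, token_id, (start, end), repr(text[start:end]))

print(encoding.attention_mask)
print(encoding.special_tokens_mask)
```

The exact token split depends on the merges learned from the corpus, so no specific output is shown here.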

#NLU #NLP #Transformers #Rust #NotOnlyPython
Forwarded from Archived GitHub
A.W.E.S.O.M. O is an extensive list of interesting open source projects written in various languages.

#python #rust #js #php #golang #go #ts #kotlin #js #clojure #erlang #elixir #c #cpp #dart #ocaml #etc