Data Science by ODS.ai 🦜
YouTokenToMe, a new tool for text tokenisation from the VK team. Meet a new enhanced tokenisation tool on steroids. It works 7-10 times faster on alphabetic languages and 40 to 50 times faster on logographic languages than the alternatives. Under the hood (watch source)…
New Rust tokenization library from #HuggingFace
Tokenization is the process of converting strings into model input tensors. The library provides BPE, Byte-Level BPE, WordPiece and SentencePiece tokenization, and computes an exhaustive set of outputs (offset mappings, attention masks, special token masks).
The library has Python and Node.js bindings.
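To make the listed outputs concrete, here is a minimal pure-Python sketch of what offset mappings, attention masks and special token masks look like for one padded sequence. It is a toy word-and-punctuation splitter, not the library's API or its BPE/WordPiece algorithms; the function name `toy_encode`, the `[CLS]`/`[SEP]`/`[PAD]` token strings and the `max_len` parameter are assumptions made for illustration only.

```python
import re

# Hypothetical special tokens, chosen only for this illustration.
CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"


def toy_encode(text, max_len):
    """Toy encoder: split into words/punctuation and build the side outputs.

    Assumes max_len >= 2 so that [CLS] and [SEP] always fit.
    """
    # Each regex match yields one token and its (start, end) character span.
    pieces = [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]
    pieces = pieces[: max_len - 2]  # truncate, leaving room for [CLS] and [SEP]

    tokens = [CLS] + [tok for tok, _ in pieces] + [SEP]
    # Special tokens do not come from the input text, so they get an empty span.
    offsets = [(0, 0)] + [span for _, span in pieces] + [(0, 0)]
    special_tokens_mask = [1] + [0] * len(pieces) + [1]
    attention_mask = [1] * len(tokens)  # 1 = real token the model should attend to

    # Pad to a fixed length; padding is masked out of attention.
    pad = max_len - len(tokens)
    tokens += [PAD] * pad
    offsets += [(0, 0)] * pad
    special_tokens_mask += [1] * pad
    attention_mask += [0] * pad

    return {
        "tokens": tokens,
        "offsets": offsets,
        "attention_mask": attention_mask,
        "special_tokens_mask": special_tokens_mask,
    }


enc = toy_encode("Hello, world!", max_len=8)
for name, values in enc.items():
    print(name, values)
```

The offsets let you map every token back to the exact characters it came from, which is what makes tasks like span extraction or NER possible after tokenization.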
The quoted post covers another fast #tokenization implementation. Looking forward to a speed comparison.
Install:
pip install tokenizers
GitHub: https://github.com/huggingface/tokenizers/tree/master/tokenizers
#NLU #NLP #Transformers #Rust #NotOnlyPython