Data Science by ODS.ai 🦜

YouTokenToMe, a new tool for text tokenisation from the VK team. Meet a new enhanced tokenisation tool on steroids. It works 7-10 times faster on alphabetic languages and 40 to 50 times faster on logographic languages than alternatives. Under the hood (watch source)…

New Rust tokenization library from #HuggingFace

Tokenization is the process of converting strings into model input tensors. The library provides BPE, Byte-Level BPE, WordPiece and SentencePiece tokenization, and computes an exhaustive set of outputs (offset mappings, attention masks, special token masks).
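To make those outputs concrete, here is a minimal pure-Python sketch of what a tokenizer returns for one string. It is not the library's API: the toy vocabulary, the `##` continuation prefix, the special token names and the `split_word` / `encode` helpers are all assumptions made up for illustration, and the splitting rule is a simple WordPiece-style greedy longest match.

```python
import re

# Hypothetical toy vocabulary; a real one is learned from a corpus.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "token": 4, "##ization": 5, "is": 6, "fast": 7}


def split_word(word):
    """Greedy longest-match-first split of one word into vocabulary pieces.

    Returns (piece, start, end) triples with offsets relative to the word.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Pieces that continue a word carry the "##" prefix.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                pieces.append((piece, start, end))
                break
            end -= 1
        else:
            # Nothing matched at this position: the whole word is unknown.
            return [("[UNK]", 0, len(word))]
        start = end
    return pieces


def encode(text, max_len=8):
    """Turn a string into ids plus the side outputs a model usually needs.

    Truncation is not handled: inputs longer than max_len are returned unpadded.
    """
    tokens, offsets = ["[CLS]"], [(0, 0)]
    for match in re.finditer(r"\S+", text):
        for piece, s, e in split_word(match.group()):
            tokens.append(piece)
            # Shift word-relative offsets to positions in the original string.
            offsets.append((match.start() + s, match.start() + e))
    tokens.append("[SEP]")
    offsets.append((0, 0))

    special = [1 if t in ("[CLS]", "[SEP]") else 0 for t in tokens]
    attention = [1] * len(tokens)

    # Pad to a fixed length so several inputs can be stacked into one tensor.
    pad = max(0, max_len - len(tokens))
    tokens += ["[PAD]"] * pad
    offsets += [(0, 0)] * pad
    special += [1] * pad      # padding counts as special here (an assumption)
    attention += [0] * pad    # the model should ignore padded positions

    return {
        "tokens": tokens,
        "ids": [VOCAB[t] for t in tokens],
        "offsets": offsets,
        "attention_mask": attention,
        "special_tokens_mask": special,
    }


if __name__ == "__main__":
    for name, value in encode("tokenization is fast").items():
        print(name, value)
```

The offsets map every token back to a character span in the input, which is what makes tasks like span extraction possible after the text has been split into subwords.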

The library has Python and Node.js bindings.

The quoted post contains information on another fast #tokenization implementation. Looking forward to a speed comparison.

Install: pip install tokenizers
GitHub: https://github.com/huggingface/tokenizers/tree/master/tokenizers

#NLU #NLP #Transformers #Rust #NotOnlyPython


10.2K views, edited 20:24

Data Science by ODS.ai 🦜

Forwarded from Archived GitHub

A.W.E.S.O.M. O is an extensive list of interesting open source projects written in various languages.

#python #rust #js #php #golang #go #ts #kotlin #js #clojure #erlang #elixir #c #cpp #dart #ocaml #etc

15.7K views, 10:24
