Data Science by ODS.ai 🦜

Great article on text preprocessing, covering cleaning, #tokenization, #lemmatization and other aspects

Link: https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908

#NLP #NLU #datacleaning

Medium

Text Preprocessing in Python: Steps, Tools, and Examples

by Olga Davydova, Data Monsters


Data Science by ODS.ai 🦜

YouTokenToMe, a new tool for text tokenisation from the VK team. Meet a new enhanced tokenisation tool on steroids. It works 7-10 times faster on alphabetic languages and 40 to 50 times faster on logographic languages than alternatives. Under the hood (watch source)…

New Rust tokenization library from #HuggingFace

Tokenization is the process of converting strings into model input tensors. The library provides BPE/Byte-Level-BPE/WordPiece/SentencePiece tokenization and computes an exhaustive set of outputs (offset mappings, attention masks, special token masks).
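To make those outputs concrete, here is a minimal pure-Python sketch of what a tokenizer returns for one string. It is a toy word-level tokenizer, not the library's implementation or API: the special tokens, their ids, the regex split and the helper names build_vocab and encode are all assumptions chosen for illustration.

```python
import re

# Hypothetical special tokens and ids, chosen only for this illustration.
SPECIALS = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3}
# Toy splitting rule: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def build_vocab(corpus):
    """Assign an integer id to every lowercased token seen in the corpus."""
    vocab = dict(SPECIALS)
    for text in corpus:
        for match in TOKEN_RE.finditer(text.lower()):
            vocab.setdefault(match.group(), len(vocab))
    return vocab


def encode(text, vocab, max_len=8):
    """Return ids plus offsets, attention mask and special tokens mask."""
    # Keep room for [CLS] and [SEP]; extra words are truncated.
    words = list(TOKEN_RE.finditer(text))[: max_len - 2]

    tokens = ["[CLS]"]
    for match in words:
        token = match.group().lower()
        tokens.append(token if token in vocab else "[UNK]")
    tokens.append("[SEP]")

    # Offsets are (start, end) character spans in the original string;
    # tokens that do not come from the text get (0, 0) in this sketch.
    offsets = [(0, 0)] + [match.span() for match in words] + [(0, 0)]
    special_tokens_mask = [1] + [0] * len(words) + [1]
    attention_mask = [1] * len(tokens)

    # Pad to a fixed length so several strings could be stacked into a tensor.
    pad = max_len - len(tokens)
    tokens += ["[PAD]"] * pad
    offsets += [(0, 0)] * pad
    special_tokens_mask += [1] * pad
    attention_mask += [0] * pad

    return {
        "tokens": tokens,
        "ids": [vocab[token] for token in tokens],
        "offsets": offsets,
        "attention_mask": attention_mask,
        "special_tokens_mask": special_tokens_mask,
    }


vocab = build_vocab(["Fast tokenizers for research."])
enc = encode("Fast tokenizers rock.", vocab, max_len=8)
for name, value in enc.items():
    print(name, value)
```

The word "rock" is missing from the toy vocabulary, so it maps to [UNK], while its offset still points at the original characters. The attention mask marks the two padding positions with 0.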

The library has Python and Node.js bindings.

The quoted post covers another fast #tokenization implementation. Looking forward to a speed comparison.

Install: pip install tokenizers
GitHub: https://github.com/huggingface/tokenizers/tree/master/tokenizers

#NLU #NLP #Transformers #Rust #NotOnlyPython

GitHub

tokenizers/tokenizers at main · huggingface/tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production - huggingface/tokenizers
