ML Research Hub
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

📝 Summary:
This paper introduces MinerU-HTML, a language-model-based HTML parser that extracts web content semantically and preserves document structure better than heuristic methods. The authors use it to build the 7.3T-token AICC corpus and report that models trained on AICC significantly outperform those trained on data from other parsers, ...

🔹 Publication Date: Nov 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16397
• PDF: https://arxiv.org/pdf/2511.16397

Datasets citing this paper:
https://huggingface.co/datasets/opendatalab/AICC

#AI #HTMLParsing #Corpus #LanguageModels #WebData