✨AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser
📝 Summary:
This paper introduces MinerU-HTML, a language model-based HTML parser that extracts web content semantically and preserves document structure better than heuristic methods. The authors use it to build the 7.3T AICC corpus and report that models trained on AICC significantly outperform those trained on output from other parsers, ...
🔹 Publication Date: Published on Nov 20
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.16397
• PDF: https://arxiv.org/pdf/2511.16397
✨ Datasets citing this paper:
• https://huggingface.co/datasets/opendatalab/AICC
==================================
For more data science resources:
✓ https://t.me/DataScienceT
#AI #HTMLParsing #Corpus #LanguageModels #WebData