#python #help_wanted #pdf #pdf_documents #pdf_manipulation #pdf_parser #pdf_parsing #pypdf2
https://github.com/py-pdf/pypdf
GitHub - py-pdf/pypdf: A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
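As a rough illustration of the splitting and merging mentioned above, here is a minimal sketch. It assumes pypdf 3.x or later is installed (`pip install pypdf`). The helper names `chunk_ranges`, `split_pdf`, and `merge_pdfs` and the output file naming are made up for this example and are not part of the library.

```python
from pathlib import Path


def chunk_ranges(n_pages: int, chunk_size: int) -> list[range]:
    """Return page-index ranges that cut n_pages into chunks of chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        range(start, min(start + chunk_size, n_pages))
        for start in range(0, n_pages, chunk_size)
    ]


def split_pdf(src: str, out_dir: str, chunk_size: int = 10) -> list[Path]:
    """Split src into files of at most chunk_size pages each."""
    # Imported lazily so chunk_ranges stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i, pages in enumerate(chunk_ranges(len(reader.pages), chunk_size), start=1):
        writer = PdfWriter()
        for index in pages:
            writer.add_page(reader.pages[index])
        # Hypothetical naming scheme: <source stem>_part<N>.pdf
        target = out / f"{Path(src).stem}_part{i}.pdf"
        with target.open("wb") as fh:
            writer.write(fh)
        written.append(target)
    return written


def merge_pdfs(sources: list[str], dest: str) -> None:
    """Concatenate the PDFs in sources, in order, into dest."""
    from pypdf import PdfWriter

    writer = PdfWriter()
    for src in sources:
        writer.append(src)  # appends every page of src
    with open(dest, "wb") as fh:
        writer.write(fh)
```

For example, `split_pdf("report.pdf", "parts", chunk_size=5)` would write `parts/report_part1.pdf`, `parts/report_part2.pdf`, and so on, and `merge_pdfs` could join them back in order.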
#python #ai4science #document_analysis #extract_data #layout_analysis #ocr #parser #pdf #pdf_converter #pdf_extractor_llm #pdf_extractor_pretrain #pdf_extractor_rag #pdf_parser
MinerU converts PDFs into machine-readable formats such as Markdown or JSON. Key features:
- **Clean Text**: Removes headers, footers, and other unnecessary elements so the text is semantically coherent and in human-readable order, even for complex layouts.
- **Content Extraction**: Extracts images, image descriptions, tables, and table titles.
- **Formula and Table Conversion**: Recognizes formulas and tables and converts them to LaTeX or HTML format.
- **OCR and Output Options**: Supports OCR, multiple output formats, and various visualization results.
- **GPU and CPU Compatibility**: Works in both CPU and GPU environments and is compatible with Windows, Linux, and Mac.
You can try MinerU through an online demo, a quick CPU demo, or with a GPU for faster processing. For detailed usage, see the project's command-line options, API integration, and deployment guides.
https://github.com/opendatalab/MinerU
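A hedged sketch of driving the command-line tool from Python. It assumes a MinerU 2.x install that exposes a `mineru` executable taking `-p` (input path) and `-o` (output directory). Those flag names are an assumption to verify with `mineru --help`, since older releases shipped a differently named command. The function names are invented for this example.

```python
import shutil
import subprocess


def build_mineru_command(pdf_path: str, out_dir: str) -> list[str]:
    """Build the CLI invocation as an argument list (no shell involved)."""
    # Assumed flags: -p for the input path, -o for the output directory.
    return ["mineru", "-p", pdf_path, "-o", out_dir]


def convert_pdf(pdf_path: str, out_dir: str) -> bool:
    """Run the converter if it is installed; return True on success."""
    if shutil.which("mineru") is None:
        return False  # executable not on PATH
    result = subprocess.run(build_mineru_command(pdf_path, out_dir), check=False)
    return result.returncode == 0
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.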
GitHub - opendatalab/MinerU: Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.
#python #document_analysis #layout_analysis #ocr #parser #pdf #pdf_converter #pdf_parser #vlm_ocr
Dolphin is a document image parsing model for complex pages that mix text, tables, formulas, and figures. It works in two stages: first it analyzes the page layout and determines the reading order of the elements; then it parses each element using a prompt specific to that element's type. The aim is fast, accurate conversion of document images into structured output such as JSON or Markdown. The repository provides pre-trained models and example code for processing single page images, PDFs, or individual elements.
https://github.com/bytedance/Dolphin
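The two-stage flow described above can be sketched in plain Python. This is not Dolphin's actual API: `Element`, `ELEMENT_PROMPTS`, `parse_page`, and the two callables passed in are hypothetical stand-ins for the model calls, and the prompt strings are placeholders.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Placeholder prompts keyed by element type (illustrative, not the model's own).
ELEMENT_PROMPTS = {
    "text": "Read the text in this region.",
    "table": "Parse the table in this region.",
    "formula": "Transcribe the formula in this region.",
}

BBox = tuple[int, int, int, int]


@dataclass
class Element:
    kind: str   # "text", "table", "formula", ...
    bbox: BBox  # (left, top, right, bottom) in pixels
    order: int  # position in the reading order


def parse_page(
    image: Any,
    analyze_layout: Callable[[Any], list[Element]],
    parse_element: Callable[[Any, BBox, str], str],
) -> list[dict]:
    """Stage 1: find elements and reading order. Stage 2: parse each one."""
    elements = sorted(analyze_layout(image), key=lambda e: e.order)
    results = []
    for element in elements:
        # Unknown element types fall back to the plain-text prompt.
        prompt = ELEMENT_PROMPTS.get(element.kind, ELEMENT_PROMPTS["text"])
        content = parse_element(image, element.bbox, prompt)
        results.append({"type": element.kind, "bbox": element.bbox, "content": content})
    return results
```

Because each element is parsed independently once the layout is known, the second stage could in principle be batched or run in parallel. The loop here is sequential only to keep the sketch short.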
GitHub - bytedance/Dolphin: The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.