Argonne National Laboratory today announced a PDF parser that the lab said could speed up the creation of AI systems trained on scientific literature, leading to better AI research assistants, improved scientific discovery tools and more accessible scientific knowledge. Called AdaParse, the system is described in detail in a research paper published on Cornell University’s arXiv site in April.
In its statement today, Argonne said:
Most academic writing is organized in PDF files. Researchers need access to PDF parsers that can quickly and accurately analyze PDFs. This makes efficient parsing essential for building multi-modal science models and AI tools.
However, PDF parsing is challenging because PDF files are sometimes designed for visual appearance rather than readability. Parsers can fail by introducing extra spaces, substituting words, scrambling characters, corrupting chemical formulas, or even losing information. Even small errors can have serious consequences.
AdaParse uses machine learning to select the best parsing method for each PDF and applies that method across large collections of documents. AdaParse can process millions of scientific papers 17 times faster than previous high-quality parsing approaches, the lab said today. This allows researchers to parse many more documents while staying within their computational budget.
Before, only organizations with an excess of resources could afford high-quality PDF parsing at scale. AdaParse has the potential to expand access to large-scale scientific datasets needed for training advanced AI models.
PDF parsing faces fundamental challenges because these files prioritize visuals over machine readability. Documents contain elements like figures, tables, equations, and multimedia content, making parsing prone to failures. Scientific accuracy is particularly sensitive to parsing errors. Even a small mistake like changing “hyperthyroidism” to “hypothyroidism” completely alters the meaning, while changing “pH” to “Ph” transforms an acidity measure into a chemical group.
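To make the two examples above concrete, here is a small illustrative sketch in Python. It is not part of AdaParse; it only uses the standard library's difflib to show that a simple character-level comparison treats these meaning-changing substitutions as near matches. The helper name `char_similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Return a character-level similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    # Opposite medical conditions, yet the strings share most characters.
    print(char_similarity("hyperthyroidism", "hypothyroidism"))

    # "pH" and "Ph" differ only in letter case, so a case-insensitive
    # comparison cannot tell the acidity measure from the phenyl group.
    print("pH".casefold() == "Ph".casefold())
```

A check that only looks at how many characters survived extraction would rate both outputs as nearly perfect, which is one reason parser quality for scientific text is hard to judge with surface metrics alone.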
Traditional approaches face a trade-off between speed and accuracy. Fast parsers often introduce errors that corrupt scientific meaning. High-quality parsers are slow and expensive for large-scale processing. The research team developed an adaptive system called AdaParse that intelligently selects the optimal parser for each document.
AdaParse works in three stages: quickly extracting text to assess document characteristics, using machine learning to predict which parsing approach will yield the highest quality output, and applying the selected parser while managing computational resources across multiple processing nodes. The researchers incorporated human-in-the-loop principles and human expert feedback through direct preference optimization, aligning the system’s choices with what scientists prefer.
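The three stages can be sketched roughly as follows in Python. This is a toy illustration under stated assumptions, not AdaParse's actual code or API: every name here (`Parser`, `fast_extract`, `slow_extract`, `predict_needs_heavy`, `parse_collection`) is hypothetical, the learned predictor is replaced by a simple character heuristic, and resource management across nodes is reduced to a single numeric compute budget.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Parser:
    name: str
    cost: float  # hypothetical relative compute cost per document
    run: Callable[[str], str]


def fast_extract(raw: str) -> str:
    """Stage 1 stand-in: cheap extraction that only normalizes whitespace."""
    return " ".join(raw.split())


def slow_extract(raw: str) -> str:
    """Stand-in for an expensive, high-quality parser (assumption)."""
    return fast_extract(raw)


def predict_needs_heavy(text: str) -> bool:
    """Stage 2 stand-in: a heuristic in place of a learned quality predictor.

    Flags text as needing the expensive parser when more than 10% of its
    characters are unusual symbols, or when nothing was extracted at all.
    """
    if not text:
        return True
    allowed = ".,;:()-"
    odd = sum(
        1 for ch in text
        if not (ch.isalnum() or ch.isspace() or ch in allowed)
    )
    return odd / len(text) > 0.1


def parse_collection(
    docs: Dict[str, str], fast: Parser, heavy: Parser, budget: float
) -> Dict[str, Tuple[str, str]]:
    """Stage 3 stand-in: apply the chosen parser while tracking a budget.

    Returns a mapping of document id to (parser name, parsed text).
    """
    results: Dict[str, Tuple[str, str]] = {}
    spent = 0.0
    for doc_id, raw in docs.items():
        preview = fast_extract(raw)
        want_heavy = predict_needs_heavy(preview)
        # Fall back to the cheap parser once the budget would be exceeded.
        use_heavy = want_heavy and spent + heavy.cost <= budget
        parser = heavy if use_heavy else fast
        spent += parser.cost
        results[doc_id] = (parser.name, parser.run(raw))
    return results


if __name__ == "__main__":
    fast = Parser("fast", 1.0, fast_extract)
    heavy = Parser("heavy", 10.0, slow_extract)
    docs = {
        "clean": "Plain   text about thyroid function.",
        "garbled": "#@%^&* ~~ || x",
    }
    for doc_id, (name, _) in parse_collection(docs, fast, heavy, 20.0).items():
        print(doc_id, "->", name)
```

The design point the sketch tries to capture is that the expensive parser runs only on documents where the cheap pass looks unreliable, so most of the collection is handled at low cost.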
Testing on 25,000 scientific documents across eight research domains and six major publishers demonstrated that the system achieved better accuracy than any individual parser and processed documents 17 times faster than state-of-the-art approaches. Large-scale experiments were conducted using the Argonne Leadership Computing Facility, a U.S. Department of Energy Office of Science user facility.
This research was supported by the U.S. Department of Energy Office of Science–Advanced Scientific Computing Research Program and by Laboratory Directed Research and Development (LDRD) funding from Argonne National Laboratory under Contract No. DE-AC02-06CH11357. Research used resources of the Argonne Leadership Computing Facility, a DOE Office of Science user facility.
PM Contacts:
Rick Stevens
Ian Foster
Robert Underwood
Publications:
Siebenschuh, C., et al., “AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine.” Proceedings of the 8th MLSys Conference (2025). https://arxiv.org/pdf/2505.01435
Related Links:
AdaParse GitHub Repository, University of Chicago/Argonne National Laboratory
MLSys 2025 Conference Presentation, MLSys Conference




