cadmus: a robust pipeline for scalable retrieval of full-text biomedical literature
cadmus is an open-source Python toolkit for automated retrieval and processing of full-text biomedical literature. It utilises programmatic access to PubMed, Crossref, Europe PMC, PMC, and publisher APIs, allowing users to construct large, domain-specific corpora with minimal manual intervention. cadmus parses PDF, HTML, XML, and plain-text files, standardising them for downstream biomedical text mining. During the retrieval of a Developmental Disorders Corpus (204,043 publications), it a...
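To make the idea of programmatic access to PubMed concrete, here is a minimal sketch of the first step such a pipeline might take: turning a search term into a list of PubMed IDs through the public NCBI E-utilities `esearch` endpoint. This is not cadmus's own API. The function names (`build_esearch_url`, `parse_pmids`, `search_pubmed`) are hypothetical and chosen for illustration only, and a real pipeline would also need an API key, rate limiting, and paging.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public NCBI E-utilities search endpoint (not part of cadmus itself).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=20):
    """Build an esearch URL asking PubMed for up to `retmax` IDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_pmids(payload):
    """Extract the list of PubMed IDs from an esearch JSON response body."""
    return json.loads(payload)["esearchresult"]["idlist"]


def search_pubmed(term, retmax=20):
    """Hypothetical helper: run the search over the network and return PMIDs."""
    with urlopen(build_esearch_url(term, retmax), timeout=30) as response:
        return parse_pmids(response.read().decode("utf-8"))
```

Calling `search_pubmed("developmental disorders", retmax=5)` would return up to five PubMed IDs as strings. Those IDs could then be passed to later steps, such as looking up full-text links or downloading documents.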