Quantifying data reuse in proteomics using PRIDE downloads statistics and a semi-supervised LLM-based framework

Understanding how scientific datasets are accessed and reused is essential for resource planning and impact assessment. Here we present the PRIDE Archive download tracking infrastructure and a comprehensive analysis of 159.3 million download records from the PRIDE proteomics database (2021-2025), spanning 35,528 datasets accessed from 235 locations. The infrastructure includes nf-downloadstats, a scalable Nextflow pipeline for processing download logs, and DeepLogBot, a machine-learning frame...
