
Downloading PLoS papers in XML

This page describes how to download large numbers of XML papers from the PLoS Biology website. The script works for PLoS Biology, but could be adapted to the other PLoS journals by modifying the URLs in the script. It may be useful to people who want quantities of XML papers to test text mining software or to datamine the contents of PLoS. A compressed archive of the XML files for all papers from 2003 up to the Nov 2005 issue, obtained by running the script, is included at the end.

Fred Howell, A.nnotate.com

The problem...

As an open access publisher, PLoS makes its papers available in HTML, PDF and XML format. The links to individual XML files are included on the website, e.g. the "Download XML" link on the page for paper 0030401 is http://biology.plosjournals.org/archive/1545-7885/3/12/pmc/pbio.0030401.xml.

But how can we automatically download all XML papers without having to click on each link individually?

Fetching tables of contents in XML

PLoS provides the table of contents for each issue in XML (in RSS format), e.g. for volume 1 issue 1:- http://biology.plosjournals.org/perlserv/?request=get-toc-rss&issn=1545-7885&volume=1&issue=1
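The table-of-contents URL above follows a regular pattern, so it can be generated for any volume and issue. The following is a minimal Python sketch of that step, not the PHP script offered for download; the function name toc_url and the constants are illustrative assumptions, and only the URL pattern itself comes from the example above.

```python
from urllib.parse import urlencode

# Base URL and ISSN taken from the example table-of-contents link above.
TOC_BASE = "http://biology.plosjournals.org/perlserv/"
PLOS_BIOLOGY_ISSN = "1545-7885"


def toc_url(volume, issue, issn=PLOS_BIOLOGY_ISSN):
    """Build the RSS table-of-contents URL for one issue (hypothetical helper)."""
    query = urlencode({
        "request": "get-toc-rss",
        "issn": issn,
        "volume": volume,
        "issue": issue,
    })
    return TOC_BASE + "?" + query


if __name__ == "__main__":
    # Prints the table-of-contents URL for volume 1, issue 1.
    print(toc_url(1, 1))
```

Looping over volumes and issues with a helper like this would give the list of RSS files to fetch; which volume and issue numbers actually exist has to be checked against the site.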

This RSS file includes a DOI (digital object identifier) link to the HTML version of each paper, e.g.

<guid isPermaLink="true">http://dx.doi.org/10.1371/journal.pbio.0030401</guid>

From this, we can extract the paper number (0030401) and use this to build the link to the XML version (volume 3 issue 12, paper 0030401):-

http://biology.plosjournals.org/archive/1545-7885/3/12/pmc/pbio.0030401.xml
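The extraction and URL-building steps described above can be sketched roughly as follows. This is an illustrative Python version rather than the author's PHP script; the names GUID_RE, paper_numbers and xml_url are assumptions made for the example, and the regular expression assumes every guid has the same form as the one shown above.

```python
import re

# Matches guid elements of the form shown above and captures the paper number.
# Assumption: all PLoS Biology DOIs in the feed look like 10.1371/journal.pbio.NNNNNNN.
GUID_RE = re.compile(
    r"<guid[^>]*>\s*http://dx\.doi\.org/10\.1371/journal\.pbio\.(\d+)\s*</guid>"
)


def paper_numbers(rss_text):
    """Return the paper numbers found in an RSS table of contents, as strings."""
    return GUID_RE.findall(rss_text)


def xml_url(volume, issue, paper):
    """Build the link to the XML version of one paper (hypothetical helper)."""
    return (
        "http://biology.plosjournals.org/archive/1545-7885/"
        f"{volume}/{issue}/pmc/pbio.{paper}.xml"
    )


if __name__ == "__main__":
    sample = '<guid isPermaLink="true">http://dx.doi.org/10.1371/journal.pbio.0030401</guid>'
    for number in paper_numbers(sample):
        print(xml_url(3, 12, number))
```

The paper number is kept as a string so that its leading zeros survive; converting it to an integer would turn 0030401 into 30401 and break the file name.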

Download

(To run it from the command line, save it as a file "fetchplos.txt" and type: php fetchplos.txt)