Downloading PLoS papers in XML

This page describes how to download large numbers of XML papers from the PLoS Biology website. The script was written for PLoS Biology, but could be adapted to the other PLoS journals by modifying the URLs in it. It may be useful to people who want quantities of XML papers to test text mining software or to datamine the contents of PLoS. A compressed archive produced by running the script, containing XML files for all papers from 2003 up to the Nov 2005 issue, is linked in the Download section at the end of this page.

Fred Howell, A.nnotate.com

The problem...

As an open access publisher, PLoS makes its papers available in HTML, PDF and XML formats. The links to individual XML files are included on the website, e.g. the "Download XML" link on the page for paper 0030401 is http://biology.plosjournals.org/archive/1545-7885/3/12/pmc/pbio.0030401.xml

But how can we automatically download all XML papers without having to click on each link individually?

Fetching tables of contents in XML

PLoS provides the table of contents for each issue in XML (in RSS format), e.g. for volume 1 issue 1:- http://biology.plosjournals.org/perlserv/?request=get-toc-rss&issn=1545-7885&volume=1&issue=1
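As an illustration, the table-of-contents URL for any volume and issue can be assembled from the pattern above. This is a minimal Python sketch rather than the PHP script offered in the Download section; the names toc_url, TOC_BASE and PBIO_ISSN are made up for this example, and it only builds the URL string without contacting the server.

```python
from urllib.parse import urlencode

# Base address and ISSN copied from the example URL above.
# The constant and function names are hypothetical, chosen for this sketch.
TOC_BASE = "http://biology.plosjournals.org/perlserv/"
PBIO_ISSN = "1545-7885"


def toc_url(volume, issue, issn=PBIO_ISSN):
    """Build the RSS table-of-contents URL for one issue."""
    query = urlencode({
        "request": "get-toc-rss",
        "issn": issn,
        "volume": volume,
        "issue": issue,
    })
    return f"{TOC_BASE}?{query}"


if __name__ == "__main__":
    # Volume 1, issue 1, as in the example above.
    print(toc_url(1, 1))
```

Passing a different issn value is one way the same function might be pointed at another journal, assuming that journal is served through the same URL scheme.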

This RSS file includes, for each paper, a link to the DOI (digital object identifier) of the HTML version, e.g.

<guid isPermaLink="true">http://dx.doi.org/10.1371/journal.pbio.0030401</guid>

From this, we can extract the paper number (0030401) and use this to build the link to the XML version (volume 3 issue 12, paper 0030401):-

http://biology.plosjournals.org/archive/1545-7885/3/12/pmc/pbio.0030401.xml
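The extraction step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the PHP script from the Download section: it assumes the feed is RSS 2.0 with un-namespaced guid elements, that paper numbers are seven digits as in the example, and that the volume and issue are already known from the table-of-contents request. The function names paper_numbers and xml_url and the SAMPLE_RSS snippet are invented for this example.

```python
import re
import xml.etree.ElementTree as ET

# Matches the trailing paper number in a PLoS Biology DOI link.
# Assumption: paper numbers are always seven digits, as in the example.
GUID_PATTERN = re.compile(r"10\.1371/journal\.pbio\.(\d{7})$")

# Hypothetical, cut-down feed containing only the guid line quoted above.
SAMPLE_RSS = """<rss version="2.0"><channel><item>
<guid isPermaLink="true">http://dx.doi.org/10.1371/journal.pbio.0030401</guid>
</item></channel></rss>"""


def paper_numbers(rss_text):
    """Return the paper numbers found in the guid elements of an RSS feed."""
    root = ET.fromstring(rss_text)
    numbers = []
    for guid in root.iter("guid"):
        match = GUID_PATTERN.search((guid.text or "").strip())
        if match:
            numbers.append(match.group(1))
    return numbers


def xml_url(volume, issue, paper_number):
    """Build the link to the XML version of one paper."""
    return (
        "http://biology.plosjournals.org/archive/1545-7885/"
        f"{volume}/{issue}/pmc/pbio.{paper_number}.xml"
    )


if __name__ == "__main__":
    # The sample feed stands in for the volume 3, issue 12 table of contents.
    for number in paper_numbers(SAMPLE_RSS):
        print(xml_url(3, 12, number))
```

Guid entries that do not match the pattern are skipped rather than treated as errors, so any non-paper items in a feed would simply be ignored.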

Download

To run the script from the command line, save it as a file "fetchplos.txt" and type: php fetchplos.txt

The original PHP script (by Fred Howell, way back in 2005): fetchplos.txt
News, 4th Dec 2008: an updated version written by Mitzi Morris (mitzi at panix dot com) fetches any of the PLoS area journals (computational biology, genetics, pathogens, neglected tropical diseases). From Mitzi's email: "it appears the PLoS has one system it's using to serve PLoS Biology and Medicine, and another it uses to serve the area journals - this script handles the URLs for the latter". The updated PHP script: fetch_plos_area.txt
All PLoS Biology papers in XML up to the Nov 2005 issue (11 MB): pbio-xml.tgz
The PHP scripting language can be downloaded from http://www.php.net.
