I love preprint services. I love the idea of archiving preprints, short-cutting the painful peer-review process and sharing the results in an easy and quick manner. This is why for our latest paper, I insisted on experimenting with the idea and actually ended up posting the paper on bioRxiv.
There are various preprint servers out there (nicely listed here), and although the concept is the same for all of them, the audience each one reaches differs a lot.
As you know, I am studying Computational Biology, hence I am a fan of the new bioRxiv service. I have been subscribed to e-mail alerts for the subject areas that I am interested in since the service started. I usually use my e-mail box as a task manager and label e-mails to add them to my have-a-look-at list. This is especially useful for queuing papers of interest to read later. Once I have the time, I go back to these marked e-mails, read the abstracts and download the full article when necessary.
I have also been doing the same thing for bioRxiv e-alerts and lately noticed something weird: some of my saved papers were not accessible on the site any more. The first time it happened, I thought it was a bug on the site and did not worry much about it. The second time, however, was enough to make me suspicious. So I went back to these missing papers and double-checked the links on bioRxiv, but all I was getting was an error on the site saying:
You are not authorized to access this page.
Interestingly enough, I found that these papers were not even listed on the site! Maybe, I thought, these papers were removed from the site on purpose, but is this even possible? According to the documentation on the site, it is not supposed to happen:
So I tweeted about it:
and when you go to the website for one of these, it says "access denied" -- a bug or a feature? More transparency on this would be great.— B. Arman Aksoy (@armish) July 3, 2014
but got no response back. Then I decided to see if I could scrape all such removed papers from the site to find out whether there are too many of them, and came up with a really small script that records the HTTP response for successive bioRxiv DOI numbers up to a point. Based on this HTTP response, you can categorize the papers as follows:
- 404: the DOI has not been registered yet.
- 200: the DOI has been registered and the paper is available.
- 403: the DOI has been registered but access to the paper is restricted -- i.e. the paper has been removed.
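The categorization above can be sketched roughly as follows. This is not the original script, just a minimal illustration: the resolver URL (`https://doi.org/`), the category labels, and the function names `format_doi`, `fetch_status`, `classify` and `survey` are all assumptions made for the example, and a real server may answer HEAD requests differently or throttle automated clients.

```python
import urllib.error
import urllib.request

# Assumptions for illustration only: the original script's URL scheme is not
# given, so DOIs are resolved through the public doi.org resolver here.
DOI_PREFIX = "10.1101"
RESOLVER = "https://doi.org/"

# Status codes and their meaning, as described in the list above.
CATEGORIES = {404: "unregistered", 200: "available", 403: "removed"}


def format_doi(number):
    """Build a DOI such as 10.1101/002055 from its numeric suffix."""
    return f"{DOI_PREFIX}/{number:06d}"


def fetch_status(doi):
    """Return the final HTTP status code for a DOI (redirects are followed)."""
    request = urllib.request.Request(RESOLVER + doi, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses; the code is what we want.
        return error.code


def classify(status):
    """Map a status code to a category; anything unexpected becomes 'other'."""
    return CATEGORIES.get(status, "other")


def survey(first, last, fetch=fetch_status):
    """Classify every DOI whose numeric suffix lies in [first, last]."""
    return {
        doi: classify(fetch(doi))
        for doi in (format_doi(n) for n in range(first, last + 1))
    }
```

Calling `survey(1, 100)` would walk the first hundred DOI suffixes over the network; passing a different `fetch` function makes it possible to try the logic without sending any requests.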
Running the script and collecting this information for all DOIs from
10.1101/000001 up to
I found 498 published and 5 removed bioRxiv papers.
This means that almost 1% of the bioRxiv papers have been removed from the archive with no explanation at all.
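For the record, the "almost 1%" figure is simply the removed count over all registered DOIs found. A tiny sketch of that arithmetic, using only the two counts reported here (the variable names are mine):

```python
published = 498  # DOIs that returned 200
removed = 5      # DOIs that returned 403

registered = published + removed
removed_share = removed / registered

print(f"{removed} of {registered} registered papers removed: {removed_share:.2%}")
# prints "5 of 503 registered papers removed: 0.99%"
```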
It also turns out that although bioRxiv has removed these articles, Google has not forgotten about them yet. So for those who are curious, here is the list of removed papers and their titles:
- 10.1101/002055: Somatic mitochondrial DNA mutations are associated with progression, metastasis and death in oral squamous cell carcinoma
- 10.1101/002089: Protectome analysis: a new selective bioinformatics tool for bacterial vaccine candidate discovery
- 10.1101/002451: Genome-Wide Introgression Revealed Pervasive Hybrid Incompatibilities (HI) between Caenorhabditis species
- 10.1101/003251: SeqGL identifies context-dependent binding signals in genome-wide regulatory element maps
- 10.1101/005421: CRISPR/Cas9 nuclease-mediated gene knock-in in bovine pluripotent stem cells and embryos
The number of removed papers might not be that big, but I think the situation is worrisome. I am pretty sure there are valid reasons why these papers were removed after they were posted on the site, but I think if bioRxiv is planning to be the de facto preprint server for Biological Sciences, then it should either be stricter about its terms of distribution or be more transparent about its decisions.
What do you think?