GigaBlog

Adventures in Data Citation: sorghum as a standard for data release
A correspondence we have contributed to has just been published in the BMC Research Notes "Data standardization, sharing and publication series" on the data-citation and data-release practices surrounding the Sorghum genome that is available in our GigaDB database and that was published last year in Genome Biology. We use Sorghum as an example to highlight the issues surrounding data release and use strong words, subtitling the paper "sorghum genome data exemplifies the new gold standard", justified in this case by the considerable efforts the authors made to go beyond the standards of the field and follow the latest best-practices.
Despite genomics having a reputation as being the field of biology with the best established data-release practices and policies, compliance is still mixed. The authors of the Sorghum study went far beyond the usual minimal raw data deposition and spent six months working with the curators of four public repositories (on top of GigaDB) to make sure that all six data types featured in the paper were in their most usable forms. Making all of the supporting data freely available to allow transparency and reproducibility of work is a key goal of GigaScience, and we felt that this demonstration of leadership in the sharing, standardization and publication of biomedical research data should be applauded and highlighted. We feel that the correspondence article fits the open-data related series scope and criteria well, and hope that it can be used to make the wider research community, on top of the usual digital curation experts, more aware of best practices and what is currently possible with data publication.
Sorghum Illustrating Data Citation
Data citation arises from a recognition that data generated in the course of research are just as valuable to the ongoing academic discourse as papers, and DataCite (formed in 2009) provides a technical infrastructure using data-DOIs to aid this. To truly put data on a par with research publications and to credit and track their impact the same way, data DOIs need to be treated the same way as scientific articles and cited in the references section of papers. While this is not new in the environmental sciences (see this paper from 2005 for example), the biology community has not been citing data in this way despite published guidelines and recommendations by databases, other than in very sporadic cases (such as this article citing a PDB DOI). Learning from our early hiccups getting our dataset DOIs into other journals, the authors worked very closely with the editors of Genome Biology (carefully following the guidelines of the DCC) to integrate data DOIs into the references of the research article, which as far as we are aware is the first time this has been accomplished in the field of genomics. Since this was originally highlighted in the BMC blog, there have been several more successes in this area: subsequent data DOIs have been referenced in Springer journals, one of our data DOIs made it into the references of a Nature series journal for the first time, PLoS journals are now referencing Figshare handles, and our publisher BioMed Central is using the Sorghum dataset as the example of how to cite data in their instructions for authors.
Sorghum Illustrating Data Deposition
The Sorghum study is also an excellent example for future data-submitters of what can be done to not only comply with but also go beyond minimal journal data policies. On top of all of the data in the Genome Biology paper being available from GigaDB, the raw data (SRA), genome assemblies (in GenBank here), and processed data such as SNPs, Structural Variations, Copy Number Variations and Indels were also deposited in their respective NCBI databases. Furthermore, the authors not only adhered to the standard journal editorial policies for genomics studies insisting on raw data deposition (and if possible genome assemblies) in one of the three INSDC databases, but also deposited additionally processed data in the dbSNP and dbVar databases. This additional effort is at best encouraged by journals but is not currently mandated. Detailed curation is a time-consuming process (particularly when having to get to grips with data produced by BGI's new SV-tools), and the staggered build releases mean that full integration can take several months; once the annotated data is fully integrated into these databases, it will be the first plant data in the relatively new dbVar database.
The advantages highlighted in the correspondence are that the GigaDB entry tied together all of these related datasets in one place and allowed them to be released rapidly in a stable and citable form before the associated analysis paper's publication. In addition to complementing the data deposited in the NCBI databases, being available in GigaDB makes the data more discoverable through other channels, such as the DataCite metadata search engine and eventually through citation indexes. In future papers, if additional data types that do not have established public repositories are included in the paper, the data could be made available in GigaDB, as GigaDB can provide a home for potentially any useful data type, supporting information, scripts or source-code. In Sorghum's case, depositing the data in GigaDB also allowed us to give it a clear CC0 public domain waiver under our data policies, maximizing its potential downstream use and liberating it from any potential legal wrangling.
The Rise of Data Citation
We are hoping this correspondence will further highlight and encourage these promising developments, and motivate the citation indexes to more quickly adopt and track these important research outputs. The correspondence is well timed, with a growing number of developments and announcements regarding data-publishing being made in recent months. On top of new data publishing platforms such as F1000 Research, Figshare and Datasets International already highlighted in this blog, there have been further announcements of new data journals in the pipeline, including Geosciences Data Journal from the Royal Meteorological Society and Wiley-Blackwell and a new series of meta-journals launched from Ubiquity Press (the publishers of which are co-authors of our commentary). These will obviously all benefit from the increased awareness of data citation and the more standardized data practices that studies such as this can help encourage.
Further Reading
1. Edmunds, S., Pollard, T., Hole, B., & Basford, A. (2012). Adventures in data citation: sorghum genome data exemplifies the new gold standard BMC Research Notes, 5 (1) DOI: 10.1186/1756-0500-5-223
2. Zheng, L., Guo, X., He, B., Sun, L., Peng, Y., Dong, S., Liu, T., Jiang, S., Ramachandran, S., Liu, C., & Jing, H. (2011). Genome-wide patterns of genetic variation in sweet and grain sorghum (Sorghum bicolor) Genome Biology, 12 (11) DOI: 10.1186/gb-2011-12-11-r114
3. Zheng, L-Y; Guo, X-S; He, B; Sun, L-J; Peng, Y; Dong, S-S; Liu, T-F; Jiang, S; Ramachandran, S; Liu, C-M; Jing, H-C (2011): Genome data from sweet and grain sorghum (Sorghum bicolor). GigaScience. http://dx.doi.org/10.5524/1000
4. Hrynaszkiewicz, I. (2010). A call for BMC Research Notes contributions promoting best practice in data standardization, sharing and publication BMC Research Notes, 3 (1) DOI: 10.1186/1756-0500-3-235
5. Ball A, Duke M: ‘How to Cite Datasets and Link to Publications’. DCC How-to Guides. Edinburgh: Digital Curation Centre; 2011: http://www.dcc.ac.uk/resources/how-guides
Posted at 10:28AM May 11, 2012 by ScottEdmunds in General | Comments[0]
What links RNA-Editing, Data Citation and Ancient Chinese Emperors?
Another incremental step has been achieved for the adoption of the practice of data citation; this week, Nature Biotechnology has included one of our dataset DOIs in their references for the first time. In "Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome", Zhiyu Peng et al. produced a new pipeline to filter and compare RNA-seq transcriptome and whole genome sequencing data to detect RNA-editing events. Much of the supporting data has been released pre-publication and hosted by our GigaDB database and, as RNA-editing is still quite a controversial phenomenon, the greater transparency enabled by making all of this data publicly available is obviously very welcome.
This "RNA-editome" is the latest "ome" (apologies to Jonathan Eisen) to come from the Yanhuang (YH) Genome project - named after the two emperors thought to be the ancestors of China's largest ethnic group (hence the blog title and picture). After the publication of the YH reference Asian diploid genome in 2008, a peripheral blood mononuclear cell methylome and now RNA-editome have been released from the same anonymous Chinese donor. All raw data and assemblies have been made available through NCBI, and this has been complemented by these and additional datasets from the whole genome, epigenome and transcriptome being made publicly available in a citable form from our GigaDB database.
With the assistance of the British Library and DataCite consortium we have been releasing datasets (many pre-publication) with DOIs since the launch of our database last year, and we have already written much about the issues surrounding this relatively new form of data release in GigaBlog. Things have been hotting up in the data publishing field in the last few months, and while editorial policies regarding pre-publication data release in this manner are still unclear for many publishers, the wonderful people at the newly launched F1000 Research have been compiling a very useful list of journals that have now drafted policies.
On top of journals allowing data to be disseminated in this way, one of the key steps to allow data-citation to work and be trackable is to actually cite the data in the references. While GigaScience data DOIs have previously been included in publications in Nature Biotechnology (two Macaque genomes) and Science (the genome of an Aboriginal Australian individual), these were not listed in the references. Following on from the recent inclusion of data from the sorghum genome in the references of a Genome Biology paper, this is the first time we have managed to get DOIs listed in the references of a Nature journal. We'd like to thank the authors of the manuscript for making their data available in this way, and the editorial and production teams at Nature Biotechnology for working with us to include the DOIs.
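To give a rough feel for why listing DOIs in the references makes data "trackable", here is a minimal Python sketch that scans reference strings for dataset DOIs. It is only an illustration: the prefix table contains just the two prefixes that appear in citations on this blog (10.5524 for GigaDB and 10.5061 for Dryad), the function name `find_data_dois` is made up for this example, and the DOI pattern is deliberately loose rather than a full implementation of the DOI syntax.

```python
import re

# DOI prefixes taken from the dataset citations on this blog; a real tracker
# would need a far more complete table (this mapping is an assumption).
DATA_DOI_PREFIXES = {
    "10.5524": "GigaDB",
    "10.5061": "Dryad",
}

# Loose DOI pattern: "10.", a registrant code, a slash, then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def find_data_dois(references):
    """Return (doi, repository) pairs for references that cite a dataset DOI."""
    found = []
    for ref in references:
        for match in DOI_PATTERN.findall(ref):
            doi = match.rstrip(".,;)")  # drop trailing punctuation
            prefix = doi.split("/", 1)[0]
            if prefix in DATA_DOI_PREFIXES:
                found.append((doi, DATA_DOI_PREFIXES[prefix]))
    return found


references = [
    "Tian Z et al., (2011): Transcriptome from a lymphoblastoid cell line "
    "taken from the YH Han Chinese individual. GigaScience. "
    "http://dx.doi.org/10.5524/100013",
    "Wang J et al., The diploid genome sequence of an Asian individual. "
    "Nature. 2008 Nov 6;456(7218):60-5.",
]
print(find_data_dois(references))
```

A dataset that is only mentioned in the running text of an article would never show up in a scan like this, which is the practical reason for insisting that the DOI sits in the reference list itself.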
References
1. Peng Z et al., Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome. Nat Biotech 2012, advance online publication.
2. Tian Z et al., (2011): Transcriptome from a lymphoblastoid cell line taken from the YH Han Chinese individual. GigaScience. http://dx.doi.org/10.5524/100013
3. Hayden EC. Evidence of altered RNA stirs debate. Nature. 2011 26;473(7348):432.
4. Wang J et al., The diploid genome sequence of an Asian individual. Nature. 2008 Nov 6;456(7218):60-5.
5. Li Y et al., The DNA methylome of human peripheral blood mononuclear cells. PLoS Biol. 2010 Nov 9;8(11):e1000533.
6. Wang, J et al., (2011): Genome sequence of YH: the first diploid genome sequence of a Han Chinese individual. GigaScience. http://dx.doi.org/10.5524/100015
7. Li Y et al., (2011): DNA methylome of human peripheral blood mononuclear cells from the YH Han Chinese individual. GigaScience. http://dx.doi.org/10.5524/100014
8. Yan G et al., Genome sequencing and comparison of two nonhuman primate animal models, the cynomolgus and Chinese rhesus macaques. Nat Biotech 2011 advance online publication.
9. Rasmussen M et al., An Aboriginal Australian Genome Reveals Separate Human Dispersals into Asia. Science 2011 Oct 7;334(6052):94-8.
10. Zheng LY et al., Genome-wide patterns of genetic variation in sweet and grain sorghum (Sorghum bicolor). Genome Biol. 2011 Nov 21;12(11):R114.
Posted at 10:05AM Feb 17, 2012 by ScottEdmunds in General | Comments[0]
GigaScience Journal Part of Global Data-Sharing Effort: New Standards Allow Disparate Data Sets to Integrate
Led by researchers at the University of Oxford, a group of more than 30 scientific organizations around the globe has worked to produce a common standard that will make possible the consistent description of enormous and radically different databases compiled in fields ranging from genetics to stem cell science to environmental studies.
One of the contributors playing a role in the project is GigaScience, as we feel it is potentially very useful in handling the wide variety of data-types covered by our scope.
The new standard provides a way for scientists in widely disparate fields to co-ordinate each other’s findings by allowing behind-the-scenes combination of the mountains of data produced by modern, technology driven science.
This standard-compliant data sharing effort and the establishment of its on-line presence, the ISA Commons (www.isacommons.org), are described in a Commentary (and highlighted in the editorial) published today in the journal Nature Genetics.
“We are now working together to provide the means to manage enormous quantities of otherwise incompatible data, ranging from the biomedical to the environmental,” says Susanna-Assunta Sansone, Team Leader of the project at the Oxford e-Research Centre, and founder of the BioSharing Network (of which BMC and GigaScience are both members).
“An example of how this works at the Harvard Stem Cell Institute is that we can now find a relationship between experiments involving normal blood stem cells in fish and cancers in children”, says Winston Hide, Professor of Bioinformatics at the Harvard School of Public Health (for more see this related publication).
It was necessary to establish common data standards, say the commentary’s authors, because of the tsunami of data and technologies washing over the sciences. “There are hundreds of new technologies coming along but also many ways to describe the information produced” said Sansone, noting that "we can take a jigsaw puzzle of different sciences and now fit the many pieces together to form a complete picture".
"One of the things that I find most empowering about this effort is that now small research groups can begin to store laboratory data using this framework, complying to community standards, without their own dedicated bioinformatic support. It is a bit like facebook allowing everyone to create their own website pages - suddenly you don't need to be an expert in computing to get your data out to the rest of the world", says Dr. Jules Griffin, of the University of Cambridge.
"What we like about it is its unifying nature across different bioscience fields and institutions”, says Dr. Christoph Steinbeck, The European Bioinformatics Institute.
And “it also has the potential to work for large centers too”, says Scott Edmunds, of the BGI and GigaScience. As GigaScience aims to take as many types of “large-data” as possible, the ability to handle as many formats as possible is essential, and the large number of data-types supported by ISA-commons, together with the ability to create new configurations, potentially addresses this very important issue. This has led to GigaScience being the first journal to offer authors the option to submit data in ISA-commons format, and these resources have also been made available to the BGI (the world's largest genomics institute) to release their enormous quantities of data more quickly to the wider research community through the associated GigaDB database.
For more on the aims and goals of GigaScience please see this previous BMC Blog posting, and for news and updates follow GigaBlog and the @GigaScience twitter feed. The journal is now taking submissions for “big-data” associated research, tools and software for handling large-scale data, and reviews and commentary on issues dealing with data-handling and standards.
References:
1. ISA Commons: isacommons.org
2. It's not about the data. Nature Genetics 44, 2 (2012).
3. Sansone, S-A. et al. Toward interoperable bioscience data. Nature Genetics 44, 2 (2012).
4. Ho Sui SJ et al. The Stem Cell Discovery Engine: an integrated repository and analysis system for cancer stem cell comparisons. Nucleic Acids Res. 1;40(D1):D984-D991. (2012).
Posted at 02:59AM Jan 28, 2012 by ScottEdmunds in General | Comments[0]
Data Citation Enters the (year of the) Dragon
Today marks the first day of the Chinese Lunar New Year, and as we enter the supposedly auspicious year of the Dragon, now is a good opportunity to look towards developments in the nascent field of data publication over the upcoming year. This week marked important announcements of new and improved data publication platforms. Those lucky enough to attend Science Online (or filter through the nearly 30,000 tweets produced by the meeting's end!) will have seen the new-look Figshare website promoted in the "Dealing with Data" session, and there has also been good coverage online of the platform's launch, including in the Wellcome Trust blog. Since the launch of the original website roughly a year ago, the recent support from Digital Science (a sister company of Nature Publishing Group) has allowed them to release a much improved front-end, increased storage (currently 250MB, but potentially unlimited), and, importantly where data citation is concerned, the use of citable DataCite DOIs.
Following on from the many developments in the last year (see our posting from last month's IDCC meeting), another publisher has just thrown their hat into the data publishing ring, with Hindawi announcing the launch of "Datasets International", a new platform for "archiving, documenting, and distributing scholarly research datasets". Like Figshare, Dryad and the other platforms already announced (including our associated GigaDB), they follow best practice by asking authors to provide data under a Creative Commons CC0 license, although it is currently unclear how much (if any) data hosting is included in their $300 article processing charge.
As we've written previously in this blog, how you cite data is important in tracking and maximizing its use. Ultimately the adoption of data-publication will be greatly aided by publishers, authors and the indexing services correctly carrying out best practice for data-citation, and citing the dataset DOIs in the references. CrossRef have just aided this with a supportive call for publishers to cite DataCite DOIs in the reference sections of articles, although the article they use as an example illustrates the problem by not following this. Our recent success in getting one of our GigaDB dataset DOIs integrated into the references of a Genome Biology article is a great example that this can be done. The utility of this example is further highlighted this month, as BioMed Central (our publisher alongside BGI) has now used this paper as an example in BioMed Central’s reference style guide, found in any journal's instructions for authors. It now explicitly mentions datasets and provides it as an example of a dataset citation:
“Only articles, datasets and abstracts that have been published or are in press, or are available through public e-print/preprint servers, may be cited
...
“Dataset with persistent identifier
Zheng, L-Y; Guo, X-S; He, B; Sun, L-J; Peng, Y; Dong, S-S; Liu, T-F; Jiang, S; Ramachandran, S; Liu, C-M; Jing, H-C (2011): Genome data from sweet and grain sorghum (Sorghum bicolor). GigaScience. http://dx.doi.org/10.5524/100012."
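For anyone generating these citations from database records rather than typing them by hand, the pattern in the quoted example (authors, year, title, publisher, DOI link) can be sketched in a few lines of Python. This is just an illustration of the layout shown above, not an official BioMed Central or DataCite tool; the `Dataset` class and `format_dataset_citation` function are names invented for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Dataset:
    """Minimal record holding the fields used in the quoted citation."""
    authors: List[str]  # each as "Surname, Initials", e.g. "Zheng, L-Y"
    year: int
    title: str
    publisher: str
    doi: str            # bare DOI, e.g. "10.5524/100012"


def format_dataset_citation(ds: Dataset) -> str:
    """Render 'Authors (year): Title. Publisher. DOI link'."""
    authors = "; ".join(ds.authors)
    return (
        f"{authors} ({ds.year}): {ds.title}. "
        f"{ds.publisher}. http://dx.doi.org/{ds.doi}"
    )


sorghum = Dataset(
    authors=[
        "Zheng, L-Y", "Guo, X-S", "He, B", "Sun, L-J", "Peng, Y",
        "Dong, S-S", "Liu, T-F", "Jiang, S", "Ramachandran, S",
        "Liu, C-M", "Jing, H-C",
    ],
    year=2011,
    title="Genome data from sweet and grain sorghum (Sorghum bicolor)",
    publisher="GigaScience",
    doi="10.5524/100012",
)
print(format_dataset_citation(sorghum))
```

Keeping the DOI as a separate field, rather than baking it into free text, is what lets the same record feed both the human-readable reference and any later citation tracking.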
Springer, BioMed Central's parent publisher, is also providing examples of correctly cited data, with this recent example correctly citing a Dryad dataset in its references.
Those interested in GigaScience journal and its integrated GigaDB database should contact us at editorial@gigasciencejournal.com. Currently the database is being populated with datasets from BGI collaborations and projects, but upon launch of the journal we will be hosting and issuing DOIs to datasets associated with GigaScience articles, giving an extra form of credit, and increasing the discoverability and impact of an author's work. Please contact us via the above email or submit your big-data associated research articles through our submission page.
Gong Xi Fa Chai! Happy New Year of the Dragon!
References
1. Zheng LY, Guo XS, He B, Sun LJ, Peng Y, Dong SS, Liu TF, Jiang S, Ramachandran S, Liu CM, Jing HC. Genome-wide patterns of genetic variation in sweet and grain sorghum (Sorghum bicolor). Genome Biol. 2011 Nov 21;12(11):R114.
2. Zheng, L-Y; Guo, X-S; He, B; Sun, L-J; Peng, Y; Dong, S-S; Liu, T-F; Jiang, S; Ramachandran, S; Liu, C-M; Jing, H-C (2011): Genome data from sweet and grain sorghum (Sorghum bicolor). GigaScience. http://dx.doi.org/10.5524/100012.
3. Hodkinson BP, Uehling JK, Smith ME (2012) Lepidostroma vilgalysii, a new basidiolichen from the New World. Mycological Progress, online in advance of print. doi:10.1007/s11557-011-0800-z
4. Hodkinson BP, Uehling JK, Smith ME (2012) Data from: Lepidostroma vilgalysii, a new basidiolichen from the New World. Dryad Digital Repository. doi:10.5061/dryad.j1g5dh23
*UPDATE* 24/1/12: Hindawi have confirmed they will include data hosting (see comments).
Posted at 11:15AM Jan 23, 2012 by ScottEdmunds in General | Comments[3]
GigaData news: Macaque DOIs published in Nature Biotechnology
This week marks another success for the fledgling practice of data citation, with two datasets from our GigaScience database published in Nature Biotechnology. The genomes sequenced by our colleagues at the BGI for the Cynomolgus and Chinese rhesus macaques were initially released in our first batch of datasets with DOIs at our launch in July, and were amongst the first (at the time) unpublished genomes released in this way. Data citation is an important concept, allowing data producers to obtain an early form of credit for releasing their work, speeding up research by encouraging early data release, and allowing the impact and reuse of data to be tracked.
After the recent success of our first dataset being published in the New England Journal of Medicine (the genome of the recent outbreak strain of E. coli O104:H4), this is the first time one of our data DOIs has been accepted in a Nature journal. For data citation to work the assistance of journals is key, and Nature Biotechnology has been particularly helpful in promoting the scheme, arguing in an editorial as far back as 2009 that novel forms of credit for data producers were needed, and suggesting DOIs as an ideal solution for this. The DataCite consortium was set up in late 2009 to do exactly that, and we would like to thank them and the British Library for their help in issuing these DOIs.
Macaque species are the most commonly used non-human primate models in medical research, and their genomes will hopefully aid human disease research and drug discovery. Looking at orthologues of human druggable protein domains in these species is aiding the potential therapeutic exploitation of their ‘druggable genome’, and has already led to BGI producing an exome sequencing platform for the species. On top of their genome assemblies, the DOI landing pages include links to functionally annotated and coding sequence sets, as well as a link to a browser and database. After the release of other datasets such as the CHO cell line genome, we are currently collecting another large batch of datasets to be released, so watch this space for further news and announcements.
References
1. Yan, G. et al. Genome sequencing and comparison of two nonhuman primate animal models, the cynomolgus and Chinese rhesus macaques. Nat Biotech advance online publication, (2011).
2. Credit where credit is overdue. Nat Biotech 27, 579 (2009).
To cite the two datasets please use the following citations:
3. Yan, G; Zhang, G; Fang, X; Zhang, Y; Li, C; Ling, F; Cooper, DN; Li, O; Li, Y; van Gool, AJ; Du, H; Chen, J; Chen, R; Zhang, P; Huang, Z; Thompson, JR; Meng, Y; Bai, Y; Wang, J; Zhuo, M; Wang, T; Huang, Y; Wei, L; Li, J; Wang, Z; Hu, H; Le, L; Stenson, PD; Li, B; Liu, X; Ball, EV; An, N; Huang, Q; Zhang, Y; Fan, W; Zhang, X; Li, Y; Wang, W; Katze, MG; Su, B; Nielsen, R; Yang, H; Wang, J; Wang, X; Wang, J (2011): Genomic data from the Chinese Rhesus Macaque (Macaca mulatta lasiota). GigaScience. doi:10.5524/100002 http://dx.doi.org/10.5524/100002
4. Yan, G; Zhang, G; Fang, X; Zhang, Y; Li, C; Ling, F; Cooper, DN; Li, O; Li, Y; van Gool, AJ; Du, H; Chen, J; Chen, R; Zhang, P; Huang, Z; Thompson, JR; Meng, Y; Bai, Y; Wang, J; Zhuo, M; Wang, T; Huang, Y; Wei, L; Li, J; Wang, Z; Hu, H; Le, L; Stenson, PD; Li, B; Liu, X; Ball, EV; An, N; Huang, Q; Zhang, Y; Fan, W; Zhang, X; Li, Y; Wang, W; Katze, MG; Su, B; Nielsen, R; Yang, H; Wang, J; Wang, X; Wang, J (2011): Genomic data from the Crab Eating Macaque/Cynomolgus Monkey (Macaca fascicularis). GigaScience. doi:10.5524/100003 http://dx.doi.org/10.5524/100003
Posted at 02:26AM Oct 21, 2011 by ScottEdmunds in General | Comments[1]
GigaScience in the press (and still on the road)...
We at GigaScience always appreciate good press, and on top of the welcome coverage on blogs (e.g. this in Annals of Botany) and twitter, we are pleased with the profile of the journal in this month's Bio-IT World (especially coming on top of the coverage of our database in the previous issue). The article is a nice introduction to the editors, editorial policies, and hopes for the journal, and is useful reading for those who'd like to know a bit more on top of what is already featured in GigaBlog and our instructions for authors. We are very pleased with the detailed coverage, although we are not sure this can top our recent mention in the prestigious Annals of Improbable Research.
After our busy summer attending conferences and chasing down papers, GigaScience is not slowing down and is still on the road. Today we are at the NGS Asia congress in Singapore (follow #ngsasia on twitter for updates) and next week we will be gathering with the human genetics community at their annual ASHG/ICHG congress in Montreal (we'll be tweeting using #ICHG2011). Those attending can get hold of us via the BMC and BGI booths (1122 and 1322), so please come and say hello.
Posted at 07:19AM Oct 04, 2011 by ScottEdmunds in General | Comments[0]
Papers, papers, papers
After last month's website launch, and with work progressing on our associated GigaScience database (including the addition of further datasets, such as the recently published Chinese Hamster Ovary cell-line genome), our next priority is the more traditional job of an editor: papers, papers, papers. The journal has had some great feedback on twitter, on many blogs, and also a nice feature in this month's Bio-IT World, so we are now following this up by talking to various relevant communities at conferences, and encouraging authors to submit to the journal.
As part of our promotional efforts and thanks to generous support from the BGI, we are waiving the open access article-processing charge for all manuscripts published by GigaScience during our first year. We hope many of you will take advantage of this opportunity, and (pending peer-review) in our first issues we are particularly looking to highlight articles dealing with data-types that may not be as well served with repositories or standard formats as the established genomics community. Please contact us (editorial@gigasciencejournal.com) with any ideas for reviews or commentaries discussing these and other issues to do with data.
Posted at 04:42AM Aug 19, 2011 by ScottEdmunds in General | Comments[0]
Notes from an E. coli “tweenome” – lessons learned from our first data DOI.
Last week marked two important milestones in the deadly 2011 European E. coli O104:H4 outbreak: the Robert Koch Institute announcing the end of the outbreak, and the publication of several papers from the many groups sequencing the pathogen. This included a publication in the New England Journal of Medicine by groups from the BGI, UMC Hamburg-Eppendorf, and Birmingham University acknowledging members of the crowdsourcing community and the work achieved using the genome sequence our colleagues at the BGI made available via our GigaScience database. This was our first dataset released with a DOI and under the freest CC0 public domain license, so now is a great opportunity to look back to see the consequences of this novel form of data release.
Due to the unusual severity of the outbreak (thousands severely ill and 50 deaths to date), it was clear that the usual scientific procedure of producing data, analysing it slowly and then releasing it to the public after a potentially long peer-review procedure would have been unhelpful in this case. Because the first genomic data was released via twitter before it had even finished uploading to NCBI, and its use was promoted and subsequent improved assemblies were released in the same way, a huge community of microbial genomicists around the world took up the challenge to study the organism collaboratively (a process that was dubbed by some to have made E. coli the first “Tweenome”). Once a github repository had been created (thanks to the efforts of the Era7 team in Spain) to provide a home to these analyses and data, groups around the world started producing their own annotations and assemblies within 24 hours, and within a couple of days a potential ancestral strain had been identified (further clearing Spanish farmers of the blame), and the many antibiotic resistance genes and pathogenic features were much more clearly understood. Releasing the data under a CC0 license allowed truly open-source analysis, and the UK HPA and github members followed suit in releasing their work in this way.
Huge progress was achieved in record time, and from this incredibly speedy work a free diagnostic protocol and free primers were distributed by the BGI to immediately help track the source of the outbreak. On top of the good feeling and positive coverage obtained by this (despite some inevitable disagreement over credit and what exactly was achieved), these novel forms of pre-publication data release did not prevent the acquisition of more traditional forms of scientific credit: publication in prestigious scientific and medical journals.
On top of all of the scientific and public health lessons to be learned, from a journal perspective this is a very important example and test case of how new and faster methods of scientific communication and data dissemination can still complement and work alongside the traditional systems. This is particularly clear as the open-source analysis was published in the New England Journal of Medicine, a prestigious organ with a nearly 200-year history, and founder of the Ingelfinger rule that causes issues in some (mainly medical) journals regarding certain pre-publication forms of data release. Maximizing the use of the data by putting it into the public domain still did not trump the scientific etiquette and convention that allowed those producing the data to be attributed and take credit. This is a great argument in favour of open-data, and an important lesson to all scientists worrying about setting their data free.
As (we think) the first ever citable data DOI released to an unpublished genome, this new form of intermediate credit (similar to microattribution)
did not hinder the eventual publication of the genome analysis paper. We’d like
to thank our collaborators in the Datacite and the British Library for
their help issuing the DOIs, and hope it provides a good example for similar
data producers and projects to follow. We have followed this example with the release of additional
unpublished genomes, and large supplementary datasets associated with articles
in GigaScience will be given DOIs to make them more trackable and findable,
further showing their interoperability with traditional scientific articles and
forms of data release. This particular disease outbreak was unusually pathogenic, and the sterling efforts of the medical community and suffering of those
affected should not be forgotten. Whilst there are still many unanswered
questions and huge amounts of work still to be done, many lessons have hopefully been learned, and (as highlighted here) this project provides an excellent example for the future of how a more collaborative and open form of science can be carried out. As
GigaScience would like to be a forum for the discussion of these issues,
as well as promote and work with the open-science movement, we strongly hope
that this can continue and grow.
Posted at 08:39AM Aug 03, 2011 by ScottEdmunds in General | Comments[0]
GigaScience, Giga-database and now GigaBlog: new resources for the big-data community
As biological data is
now produced faster than it can easily be handled and stored, the
dissemination of this data has become a major bottleneck. GigaScience, a new type of journal from BioMed Central and BGI
(no stranger to these issues, being the world’s largest genomics center),
starts taking submissions today with the goal of addressing many of
the issues surrounding “big-data”. Much of the rationale for and features of
the GigaScience journal and its associated database are presented on our website.
But with a scope that covers any biological and biomedical
“large-scale” data (and the “(Giga)n” refers to gigantic rather than a
specific number), one important question is how exactly are we defining
“large-scale”? The answer unfortunately is: it depends.
What
makes something big-data varies greatly from field to field, and also
changes rapidly with technological developments; so this is a question
we will be regularly asking our editorial board and scientists in
different research communities. But, to keep our readers and authors
updated, rather than constantly changing this information in our instructions for authors,
we feel a blog makes a better forum for this type of open-ended
discussion. We also hope to hear your thoughts on what
constitutes "big" data, especially for those areas that are not
generally thought of as having large-scale data resources — like
cellular development with a myriad of imaging data types, neuroscience
and electrophysiology, and cohort studies with metadata that has many
permissions issues needing to be discussed and solved.
Launching our first post here, and as a guest on the BMC blog,
we’d like to welcome you and hope our future blog discussions will
supplement and enhance the content of the journal. Upcoming postings
will provide updates on the progress of the journal up to its formal
launch in November, introduce the editors and editorial board, report on
conferences, and provide news on the many current issues surrounding
the handling and use of large-scale data and high-throughput biology.
The blog will also highlight interesting datasets deposited in our
database and new types of large-data from different, potentially
unexpected, biological fields.
As part of our prelaunch activities, GigaScience has just released its first datasets
that are marked with a citable DOI and have no restrictions on use.
These datasets include the sequence and assembly data from the recent
deadly outbreak strain E. coli O104
from BGI and the University Medical Centre Hamburg-Eppendorf, as well
as seven large vertebrates sequenced for the Genome10K project, a worldwide
collaborative effort to sequence 10,000 vertebrate genomes. These data
include the Giant Panda, the Chinese Rhesus and Crab-Eating Cynomolgus Macaques, the Polar Bear, the Emperor and Adelie Penguins, and the Domestic Pigeon. The usefulness of this novel method of rapid data release, prior to manuscript publication, is exemplified by the recent release of the E. coli O104 data as it was being generated; this resulted in immediate “crowd-sourced” analysis of the data by the research community and has already aided the fight against this deadly outbreak.
We
want to give a special thanks to the international group of researchers
who took this important step toward finding the best means to balance
the need of the larger community to gain access to the data with the
need of data producers to obtain credit for their work. Additionally, we would like to thank
BGI and BMC for their support and help in setting up this venture. We’d
like to express our appreciation to DataCite and the British Library for working to provide DOIs for our associated datasets, and to ISA-Tab
for helping with our data-submission system to make
it more adaptable, standardized, and ISA-Tab compliant. We’d also like
to thank our growing editorial board for their (present and future)
support.
We are excited about this new endeavor and are looking
forward to working with the entire community to speed research, push
open access, and aid in making these important resources permanently
available for use and reuse.
Laurie Goodman, Editor-in-Chief
Scott Edmunds, Editor
Alexandra Basford, Assistant Editor
Follow @GigaScience on Twitter
Posted at 05:19PM Jul 06, 2011 by Gabriella Anderson in General | Comments[0]