GigaBlog

Friday Feb 17, 2012

What links RNA-Editing, Data Citation and Ancient Chinese Emperors?

[Image: Emperor Gaozu of Han, source: Wikipedia]

Another incremental step has been achieved in the adoption of data citation: this week, Nature Biotechnology has included one of our dataset DOIs in its references for the first time. In "Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome", Zhiyu Peng et al. produced a new pipeline to filter and compare RNA-seq transcriptome and whole genome sequencing data to detect RNA-editing events. Much of the supporting data has been released pre-publication and is hosted in our GigaDB database and, as RNA-editing is still quite a controversial phenomenon, the greater transparency enabled by making all of this data publicly available is obviously very welcome.

This "RNA-editome" is the latest "ome" (apologies to Jonathan Eisen) to come from the Yanhuang (YH) Genome project, named after the two emperors thought to be the ancestors of China's largest ethnic group (hence the blog title and picture). After the publication of the YH reference Asian diploid genome in 2008, a peripheral blood mononuclear cell methylome and now an RNA-editome have been released from the same anonymous Chinese donor. All raw data and assemblies have been made available through NCBI, and this has been complemented by these and additional datasets from the whole genome, epigenome and transcriptome being made publicly available in a citable form from our GigaDB database.

With the assistance of the British Library and DataCite consortium we have been releasing datasets (many pre-publication) with DOIs since the launch of our database last year, and we have already written much about the issues surrounding this relatively new form of data release in GigaBlog. Things have been hotting up in the data publishing field in the last few months, and while editorial policies regarding pre-publication data release in this manner are still unclear for many publishers, the wonderful people at the newly launched F1000 Research have been compiling a very useful list of journals that have now drafted policies.

On top of journals allowing data to be disseminated in this way, one of the key steps to allow data citation to work and be trackable is to actually cite the data in the references. While GigaScience data DOIs have previously been included in publications in Nature Biotechnology (two Macaque genomes) and Science (the genome of an Aboriginal Australian individual), these were not listed in the references. Following on from the recent inclusion of data from the sorghum genome in the references of a Genome Biology paper, this is the first time we have managed to get DOIs listed in the references of a Nature journal. We'd like to thank the authors of the manuscript for making their data available in this way, and the editorial and production teams at Nature Biotechnology for working with us to include the DOIs.

References

1. Peng Z et al., Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome. Nat Biotech 2012, advance online publication.
2. Tian Z et al., (2011): Transcriptome from a lymphoblastoid cell line taken from the YH Han Chinese individual. GigaScience. http://dx.doi.org/10.5524/100013
3. Hayden EC. Evidence of altered RNA stirs debate. Nature. 2011 26;473(7348):432.
4. Wang J et al., The diploid genome sequence of an Asian individual. Nature. 2008 Nov 6;456(7218):60-5.
5. Li Y et al., The DNA methylome of human peripheral blood mononuclear cells. PLoS Biol. 2010 Nov 9;8(11):e1000533.
6. Wang, J et al., (2011): Genome sequence of YH: the first diploid genome sequence of a Han Chinese individual. GigaScience. http://dx.doi.org/10.5524/100015
7. Li Y et al., (2011): DNA methylome of human peripheral blood mononuclear cells from the YH Han Chinese individual. GigaScience. http://dx.doi.org/10.5524/100014
8. Yan G et al., Genome sequencing and comparison of two nonhuman primate animal models, the cynomolgus and Chinese rhesus macaques. Nat Biotech 2011 advance online publication.
9. Rasmussen M et al., An Aboriginal Australian Genome Reveals Separate Human Dispersals into Asia. Science 2011 Oct 7;334(6052):94-8.
10. Zheng LY et al., Genome-wide patterns of genetic variation in sweet and grain sorghum (Sorghum bicolor). Genome Biol. 2011 Nov 21;12(11):R114.

Saturday Jan 28, 2012

GigaScience Journal Part of Global Data-Sharing Effort: New Standards Allow Disparate Data Sets to Integrate

[Image: ISA-commons logo]

Led by researchers at the University of Oxford, a group of more than 30 scientific organizations around the globe has worked to produce a common standard that will make possible the consistent description of enormous and radically different databases compiled in fields ranging from genetics to stem cell science to environmental studies. One of the contributors to the project is GigaScience, as we feel the standard could be very useful in handling the wide variety of data types covered by our scope.

The new standard provides a way for scientists in widely disparate fields to co-ordinate each other’s findings by allowing behind-the-scenes combination of the mountains of data produced by modern, technology-driven science.

This standard-compliant data sharing effort and the establishment of its online presence, the ISA Commons (www.isacommons.org), are described in a Commentary (and highlighted in the editorial) published today in the journal Nature Genetics.

“We are now working together to provide the means to manage enormous quantities of otherwise incompatible data, ranging from the biomedical to the environmental,” says Susanna-Assunta Sansone, Team Leader of the project at the Oxford e-Research Centre, and founder of the BioSharing Network (of which BMC and GigaScience are both members).

“An example of how this works at the Harvard Stem Cell Institute is that we can now find a relationship between experiments involving normal blood stem cells in fish and cancers in children”, says Winston Hide, Professor of Bioinformatics at the Harvard School of Public Health (for more see this related publication).

It was necessary to establish common data standards, say the commentary’s authors, because of the tsunami of data and technologies washing over the sciences. “There are hundreds of new technologies coming along but also many ways to describe the information produced” said Sansone, noting that "we can take a jigsaw puzzle of different sciences and now fit the many pieces together to form a complete picture".

"One of the things that I find most empowering about this effort is that now small research groups can begin to store laboratory data using this framework, complying to community standards, without their own dedicated bioinformatic support. It is a bit like facebook allowing everyone to create their own website pages - suddenly you don't need to be an expert in computing to get your data out to the rest of the world", says Dr. Jules Griffin, of the University of Cambridge.

“What we like about it is its unifying nature across different bioscience fields and institutions”, says Dr. Christoph Steinbeck, The European Bioinformatics Institute.

And “it also has the potential to work for large centers too”, says Scott Edmunds, of the BGI and GigaScience. As GigaScience aims to take as many types of “large-data” as possible, the ability to handle as many formats as possible was essential, and the large number of data-types supported by ISA-commons, together with the ability to create new configurations, potentially addresses this very important issue. This has led to GigaScience being the first journal to offer authors the option to submit data in ISA-commons format, and these resources have also been made available to the BGI (the world's largest genomics institute) to release their enormous quantities of data more quickly to the wider research community through the associated GigaDB database.

For more on the aims and goals of GigaScience please see this previous BMC Blog posting, and for news and updates follow GigaBlog and the @GigaScience twitter feed. The journal is now taking submissions for “big-data” associated research, tools and software for handling large-scale data, and reviews and commentary on issues dealing with data-handling and standards.

 References:

1. ISA Commons: isacommons.org
2. It's not about the data. Nature Genetics 44, 2 (2012).
3. Sansone, S-A. et al. Toward interoperable bioscience data. Nature Genetics 44, 2 (2012).
4. Ho Sui SJ et al. The Stem Cell Discovery Engine: an integrated repository and analysis system for cancer stem cell comparisons. Nucleic Acids Res. 1;40(D1):D984-D991. (2012).

Monday Jan 23, 2012

Data Citation Enters the (year of the) Dragon

[Image: Dragon]

Today marks the first day of the Chinese Lunar New Year, and as we enter the supposedly auspicious year of the Dragon, now is a good opportunity to look towards developments in the nascent field of data publication over the upcoming year. This week marked important announcements of new and improved data publication platforms. Those lucky enough to attend Science Online (or filter through the nearly 30,000 tweets produced by the meeting's end!) will have seen the new-look Figshare website promoted in the "Dealing with Data" session, and there has also been good coverage online of the platform's launch, including in the Wellcome Trust blog. Since the launch of the original website roughly a year ago, the recent support from Digital Science (a sister company of Nature Publishing Group) has allowed them to release a much improved front-end, increased storage (currently 250MB, but potentially unlimited), and, importantly where data citation is concerned, the use of citable DataCite DOIs.

Following on from the many developments in the last year (see our posting from last month's IDCC meeting), another publisher has just thrown their hat into the data publishing ring, with Hindawi announcing the launch of "Datasets International", a new platform for "archiving, documenting, and distributing scholarly research datasets". Like Figshare, Dryad and the other platforms already announced (including our associated GigaDB), they follow best practice by asking authors to provide data under a Creative Commons CC0 license, although it is currently unclear how much (if any) data hosting is included in their $300 article processing charge.

As we've written previously in this blog, how you cite data is important in tracking and maximizing its use. Ultimately the adoption of data publication will be greatly aided by publishers, authors and the indexing services correctly carrying out best practice for data citation, and citing the dataset DOIs in the references. CrossRef have just aided this with a supportive call for publishers to cite DataCite DOIs in the reference sections of articles, although the article they use as an example illustrates the problem by not following this practice. Our recent success in getting one of our GigaDB dataset DOIs integrated into the references of a Genome Biology article is a great example that this can be done. The utility of this example is further highlighted this month, as BioMed Central (our publisher alongside BGI) has now used this paper as an example in its reference style guide, found in any journal's instructions for authors. The guide now explicitly mentions datasets and provides it as an example of a dataset citation:

“Only articles, datasets and abstracts that have been published or are in press, or are available through public e-print/preprint servers, may be cited

...

Dataset with persistent identifier
Zheng, L-Y; Guo, X-S; He, B; Sun, L-J; Peng, Y; Dong, S-S; Liu, T-F; Jiang, S; Ramachandran, S; Liu, C-M; Jing, H-C (2011): Genome data from sweet and grain sorghum (Sorghum bicolor). GigaScience. http://dx.doi.org/10.5524/100012."
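The style guide entry above follows a simple pattern: authors separated by semicolons, the year in parentheses, the dataset title, the publisher, and the DOI as a resolvable URL. As a rough illustration only, that pattern could be generated as in the sketch below. The `DatasetCitation` class and its field names are hypothetical and are not part of any BioMed Central or GigaDB tool; the sketch also assumes each author name is already supplied in "Surname, Initials" form.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetCitation:
    """Hypothetical container for the fields seen in the style guide example."""
    authors: List[str]  # each entry already formatted as "Surname, Initials"
    year: int
    title: str
    publisher: str
    doi: str            # bare DOI without any resolver prefix, e.g. "10.5524/100012"

    def format(self) -> str:
        # Authors are joined with "; ", then "(year): title. publisher. DOI-URL."
        author_list = "; ".join(self.authors)
        return (
            f"{author_list} ({self.year}): {self.title}. "
            f"{self.publisher}. http://dx.doi.org/{self.doi}."
        )


sorghum = DatasetCitation(
    authors=[
        "Zheng, L-Y", "Guo, X-S", "He, B", "Sun, L-J", "Peng, Y", "Dong, S-S",
        "Liu, T-F", "Jiang, S", "Ramachandran, S", "Liu, C-M", "Jing, H-C",
    ],
    year=2011,
    title="Genome data from sweet and grain sorghum (Sorghum bicolor)",
    publisher="GigaScience",
    doi="10.5524/100012",
)

print(sorghum.format())
```

Keeping the DOI as a bare identifier and adding the resolver prefix only at formatting time means the same record can be rendered in other reference styles without re-parsing the URL.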

Springer, BioMed Central's parent publisher, is also providing examples of correctly cited data, with this recent example correctly citing a Dryad dataset in its references.

Those interested in the GigaScience journal and its integrated GigaDB database should contact us at editorial@gigasciencejournal.com. Currently the database is being populated with datasets from BGI collaborations and projects, but upon launch of the journal we will be hosting and issuing DOIs to datasets associated with GigaScience articles, giving an extra form of credit, and increasing the discoverability and impact of an author's work. Please contact us via the above email or submit your big-data associated research articles through our submission page.

Gong Xi Fa Chai! Happy New Year of the Dragon!

References

1. Zheng LY, Guo XS, He B, Sun LJ, Peng Y, Dong SS, Liu TF, Jiang S, Ramachandran S, Liu CM, Jing HC. Genome-wide patterns of genetic variation in sweet and grain sorghum (Sorghum bicolor).  Genome Biol. 2011 Nov 21;12(11):R114.

2. Zheng, L-Y; Guo, X-S; He, B; Sun, L-J; Peng, Y; Dong, S-S; Liu, T-F; Jiang, S; Ramachandran, S; Liu, C-M; Jing, H-C (2011): Genome data from sweet and grain sorghum (Sorghum bicolor). GigaScience. http://dx.doi.org/10.5524/100012.

3. Hodkinson BP, Uehling JK, Smith ME (2012) Lepidostroma vilgalysii, a new basidiolichen from the New World. Mycological Progress, online in advance of print. doi:10.1007/s11557-011-0800-z

4. Hodkinson BP, Uehling JK, Smith ME (2012) Data from: Lepidostroma vilgalysii, a new basidiolichen from the New World. Dryad Digital Repository. doi:10.5061/dryad.j1g5dh23

*UPDATE* 24/1/12: Hindawi have confirmed they will include data hosting (see comments).

Friday Dec 23, 2011

Data Citation December

[Image: "Citation needed"]

Despite the approaching holidays it's been another busy month in the GigaScience office, with Alexandra attending the InCoB/ISMB-Asia meeting in Kuala Lumpur (see her talk slides here) and the Human Variome Project meeting in Beijing, and Scott attending a number of meetings and workshops in the UK, including the International Digital Curation Conference (IDCC) in Bristol. The "Digital" in the meeting title was a bit of a giveaway of the level of technological savvy of the attendees, as it was heavily tweeted (see #idcc and this storify), blogged (see here for a good example), and videos are also available for many of the talks, so we will not repeat what is already well covered.

With additional workshops on data impact and reuse, Bristol was the center of the Data Citation universe in December, with representatives and talks from many data publishing projects, databases and issuing bodies such as our DataCite collaborators, so it was an excellent opportunity to assess where things currently stand. Interesting new infrastructure was presented by Mark Hahnel, giving a preview of the new design of the FigShare platform launching in the new year, which for the first time will use citable DOIs for its datasets. Brian Hole from Ubiquity Press presented on "Publication and Citation", and mentioned data publishing platforms coming from them, and the representatives of other publishers present showed that there are obviously other commercial projects in the pipeline (for example this from F1000).

Being a curation conference, researcher-driven approaches were also on display, and the Environmental Sciences community in particular has been publishing datasets with DOIs for many years, both from the well established Pangaea database, and from individual data centers (Sarah Callaghan's talk representing NERC’s environmental data centers being a great example). Phillip Bourne's excellent talk imagined the possibilities that mixing open data stores with well integrated widgets and tools to mash up and produce new analyses could bring, and he mentioned that the very well established PDB (Protein Data Bank) uses DOIs as accessions, but these are not integrated and cited in associated publications. This is a bit of a missed opportunity, and Mark Hahnel (video here) and Heather Piwowar (slides and video) both highlighted the need for proper attribution and impact tracking for datasets to incentivise sharing of data. Our recent examples of DOIs linked to datasets from our GigaDB database getting integrated into articles in Nature Biotechnology (see more here) and Genome Biology (see here) demonstrate that it is feasible to link datasets with global, resolvable identifiers into articles.

Whilst Pangaea and the Environmental Science community have managed to do this for a number of years (including examples from as far back as 2005), the integration of data DOIs into the references of the Genome Biology article was the first time we are aware of that this has been accomplished in the field of genomics. This is a great example of the practicalities of how data can be cited (following the best practice guidelines of the DCC), but until the bibliometric indices properly track such citations this is only a first step. With this important next step likely to finally happen in the new year, this meeting was a good opportunity for the data DOI producers and publishers to compare notes and ready themselves for the important year ahead. As December comes to a close, we at GigaScience would like to wish you all season's greetings, and we look forward to an exciting 2012 for the field of data publishing!

Sunday Dec 11, 2011

HVP Beijing: dealing with variation

The Human Variome Project (HVP) Beijing Meeting has officially ended (though a number of delegates will be busy tomorrow at the Advisory Council meeting). The energy and commitment towards better understanding and treatment of heritable diseases displayed by both the speakers and participants was great to see.

Peter Taschner’s talk on the Leiden Open (source) Variation Database (LOVD) system was very well received, and a number of other speakers were using LOVD for their locus-specific databases. I enjoyed Peter Robinson’s presentation on phenotype ontology and representation in gene and disease specific databases. He discussed his free differential diagnosis tool Phenomizer, which integrates OMIM and the Orphanet rare disease nosology (Orphanet was also the subject of an earlier talk by Mariana Jovanovic).

Ethics and curation were recurring themes at this HVP meeting as well.  Many of the ethical issues concerning the sharing of human genetic data were raised by Sue Povey and Carol Isaacson Barash: consent; the tradeoff regarding the release of a carrier’s or affected individual’s geographical and ethnic information, which is both potentially identifying and of great scientific use; the impact of culture on the idea of genetic privacy; and the use of databases for diagnosis and treatment decisions. They are all thorny issues that will be discussed again and again as the field continues to change.

I was very impressed by the universal recognition of the value of curation. Most presentations discussed curation procedures and the various challenges of curating potentially sensitive and potentially diagnostic human genetic, genomic and medical data. Arleen Auerbach’s talk on the Fanconi Anemia Mutation Database (which now uses LOVD) dealt with curation issues exclusively. Anthony Brookes spoke on the technical standards and data models necessary for system interoperability, which fit well with the mood of the audience, who wanted to share data without enforcing a one-system rule. The open access software he described, Cafe Variome, which shares the existence of data but not necessarily the data itself, generated a lot of interest. Mauno Vihinen introduced VariO, an ontology for variation at the DNA and RNA level, and took the curation discussion in a different direction; Vihinen proposed an independent evaluation and rating system for human gene variation databases akin to hotel stars or Michelin stars.

The meeting ended with a brainstorming session for new recommendations for action by the Human Variome Project. It sounds like the HVP group has ambitious plans to start on before they meet again in Paris this spring!

Monday Dec 05, 2011

InCoB/ISMB-Asia: keynotes and curation

I recently returned from the InCoB/ISMB-Asia meeting. The meeting officially ended a couple of days ago but I am still digesting the good food, the good conversations and the good science, all of which I know will be with me a good while.  In the interest of avoiding a copious monograph, I’ll try to stick to a few personal high points. However, I encourage you to check out the supplemental issues in Immunome Research and our fellow BioMed Central journals BMC Bioinformatics and BMC Genomics for a more complete view of the meeting.

I would like to compliment the conference organizers for generating an excellent lineup of keynote speakers. Minoru Kanehisa gave an update on the new developments in the KEGG databases, including their ambitious new resource KEGG MEDICUS that aims to integrate medical, pharmaceutical and genomic information for use by researchers, clinicians, pharmacists and the public. Pascale Gaudet spoke on the ever-increasing need for biocuration and the importance of biocurators, the ongoing efforts of the International Society for Biocuration, and community standards and BioDBCore. Several of her themes were echoed in the later sessions “Standards in Bioinformatics” and “BioCloud/Grid Computing for Sharing Bioinformatics Resources.” Jun Wang talked about three “Million Genomes” projects underway at BGI, leading some members of the audience (at least those of a certain age who were raised in the States) to conclude that BGI may want to invest in a signboard similar to the red one that used to appear in conjunction with golden arches. Alex Bateman discussed the ways in which Pfam and Rfam have been working with the Wikipedia community to the mutual benefit of all parties. He also gave a brief how-to for scientists looking to get involved in Wikipedia and a prod to those among us (including myself) who lack social responsibility, using but not editing Wikipedia.

My favorite keynote was Arthur Olson’s. While I generally find myself to be a highly visual learner who derives little additional benefit from other types of teaching aids, I freely admit that a set of tinker toys got me through O. Chem. Had the models and Tangible User Interface in development at the Molecular Graphics Laboratory been available when I was still in school, my scientific trajectory might have been quite different. They are doing some seriously cool stuff. And my informal survey suggested that Olson’s shake-and-play self-assembling viral model would be a welcome present for the scientist on your holiday gift list.

I also enjoyed Janet Kelso’s presentation on ancient genomics and evolution, Susanna-Assunta Sansone’s talk on the continuing progress of the BioSharing and ISAcommons communities (GigaScience is involved in both efforts), and the series of talks by Tin Wee Tan, Shoba Ranganathan and their collaborators on database standards development and their push for archive-able and easily reinstate-able databases. I am extremely grateful to have been invited to speak amongst so many prominent scientists at InCoB/ISMB-Asia (slides available on slideshare). My only real complaint about the meeting was the lack of network connections that kept me from Tweeting.

Thanks to all of you (including the many that remain unmentioned here as, despite my promises and best efforts, I’ve already produced a tome-like blog post) who made the meeting both fun and productive. I had a great time in KL!

Saturday Nov 19, 2011

GigaScience at #ICG6: announcing the release of GigaDB and new datasets

[Image: GigaScience release poster]

Another busy week for the GigaScience team, with the release of a new-look database, more datasets, and a number of talks and announcements at BGI's annual International Conference on Genomics in Shenzhen. It was a great (if exhausting) meeting this year, with the state of the art in genomics science on display, announcements of three exciting "Million Genomes" projects to come from the BGI and their many collaborators, and a chance to catch up with many members of our editorial board and friends.

GigaDB - a new-look website
The biggest news at the meeting was the launch of our new-look GigaDB.org website and additional datasets at the pre-conference data release workshop and press conference. This is still very much in beta form (comments and feedback greatly appreciated at editorial@gigasciencejournal.com), but builds upon our original release of datasets in July and presents them together in a single portal. Following the success of the outbreak E. coli O104:H4 and Macaque genome datasets in demonstrating the practicalities of data citation, we have released another 20 datasets with citable DOIs. These span most of the tree of life, and include previously unsupported data-types.

New Data from across the Tree of Life
Following on from the release of seven vertebrate genomes from the Genome10K project in July, we have now added genomic data from the Sheep, Tibetan Antelope and Naked Mole Rat. Genome, transcriptome and methylome data is provided from an Asian Individual, and we are currently uploading data from Ancient DNA studies on an Eskimo and Aboriginal Australian. We now have plant genomes from the Potato, Foxtail Millet, Sorghum, Cucumber, Chinese Cabbage and Pigeon Pea, and invertebrate genomes from three species of ants, many strains of silkworm and a pathogenic pig roundworm. Many of these datasets (including the Sheep, Tibetan Antelope, Millet, Sorghum and transcriptome data) are previously unpublished; this novel and more rapid release of data should potentially speed up research in these important model and commercial species, and in human health.

For more coverage of the meeting check out the #icg6 hashtag on Twitter, and reporting on the software and data release in Bio-IT World. Laurie's slides are available here, and slides from Scott's talk on data issues in the Bioinformatics session are also available here. A video of Laurie's talk is also available in the following clip on YouTube.

Friday Nov 11, 2011

Genomicists go Shenzhen: GigaScience at the International Conference on Genomics VI

[Image: ICG logo]

After many months on the road visiting conferences it's nice when one comes to you. This weekend marks BGI's annual big bash: the 6th International Conference on Genomics, this year held in the mock-Swiss splendor of the Shenzhen OCT East resort. With a great line-up featuring many of our editorial board members (including Stephan Beck, Wang Jun, Ming Qi, Sumio Sugano and Richard Durbin), there are sessions spanning many key areas of our scope, including cloud computing, metagenomics, epigenomics, and personalized medicine. Follow this blog, our twitter page, and the hashtag #icg6 for live updates from the meeting, and stay tuned for some important announcements regarding GigaScience.

After GigaScience was announced at last year's meeting (and a nice plug in Nature Genetics came from it), this year we will be making some important announcements at the welcome reception and press conference on Saturday, as well as presenting on Tuesday 15th November in the Bioinformatics session. Our editor-in-chief Laurie Goodman will also be chairing workshops on the final afternoon, so watch this space for news on how they go, as well as news and updates from the conference.

Friday Oct 21, 2011

GigaData news: Macaque DOIs published in Nature Biotechnology

This week marks another success for the fledgling practice of data citation, with two datasets from our GigaScience database published in Nature Biotechnology. The genomes sequenced by our colleagues at the BGI for the Cynomolgus and Chinese rhesus macaques were initially released in our first batch of datasets with DOIs at our launch in July, and were amongst the first (at the time) unpublished genomes released in this way. Data citation is an important concept, allowing data producers to obtain an early form of credit for releasing their work, speeding up research by encouraging early data release, and allowing the impact and reuse of data to be tracked.

After the recent success of our first dataset being published in the New England Journal of Medicine (the genome of the recent outbreak strain of E. coli O104:H4), this is the first time one of our data DOIs has been accepted in a Nature journal. For data citation to work the assistance of journals is key, and Nature Biotechnology has been particularly helpful in promoting the scheme, arguing in an editorial as far back as 2009 that novel forms of credit for data producers were needed, and suggesting DOIs as an ideal solution for this. The DataCite consortium was set up in late 2010 to do exactly that, and we would like to thank them and the British Library for their help in issuing these DOIs.

Macaque species are the most commonly used non-human primate models in medical research, and their genomes will hopefully aid human disease research and drug discovery. Looking at orthologues of human druggable protein domains in these species is aiding the potential therapeutic exploitation of their ‘druggable genome’, and has already led to BGI producing an exome sequencing platform for the species. On top of their genome assemblies, the DOI landing pages include links to functionally annotated and coding sequence sets, as well as a link to a browser and database. After the release of other datasets such as the CHO cell line genome, we are currently collecting another large batch of datasets to be released, so watch this space for further news and announcements.

References

1. Yan, G. et al. Genome sequencing and comparison of two nonhuman primate animal models, the cynomolgus and Chinese rhesus macaques. Nat Biotech advance online publication, (2011).


2. Credit where credit is overdue. Nat Biotech 27, 579 (2009).

To cite the two datasets please use the following citations:

3. Yan, G; Zhang, G; Fang, X; Zhang, Y; Li, C; Ling, F; Cooper, DN; Li, O; Li, Y; van Gool, AJ; Du, H; Chen, J; Chen, R; Zhang, P; Huang, Z; Thompson, JR; Meng, Y; Bai, Y; Wang, J; Zhuo, M; Wang, T; Huang, Y; Wei, L; Li, J; Wang, Z; Hu, H; Le, L; Stenson, PD; Li, B; Liu, X; Ball, EV; An, N; Huang, Q; Zhang, Y; Fan, W; Zhang, X; Li, Y; Wang, W; Katze, MG; Su, B; Nielsen, R; Yang, H; Wang, J; Wang, X; Wang, J (2011): Genomic data from the Chinese Rhesus Macaque (Macaca mulatta lasiota). GigaScience. doi:10.5524/100002

http://dx.doi.org/10.5524/100002

4. Yan, G; Zhang, G; Fang, X; Zhang, Y; Li, C; Ling, F; Cooper, DN; Li, O; Li, Y; van Gool, AJ; Du, H; Chen, J; Chen, R; Zhang, P; Huang, Z; Thompson, JR; Meng, Y; Bai, Y; Wang, J; Zhuo, M; Wang, T; Huang, Y; Wei, L; Li, J; Wang, Z; Hu, H; Le, L; Stenson, PD; Li, B; Liu, X; Ball, EV; An, N; Huang, Q; Zhang, Y; Fan, W; Zhang, X; Li, Y; Wang, W; Katze, MG; Su, B; Nielsen, R; Yang, H; Wang, J; Wang, X; Wang, J (2011): Genomic data from the Crab Eating Macaque/Cynomolgus Monkey (Macaca fascicularis). GigaScience. doi:10.5524/100003
http://dx.doi.org/10.5524/100003
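The two citations above give each identifier in two forms: a `doi:` prefixed string and a resolver URL. For readers who want to handle both forms consistently, the following is a minimal sketch of one way to reduce either form to the bare DOI and rebuild the resolver link. The function names `normalise_doi` and `resolver_url` are hypothetical, the list of prefixes is an assumption, and the pattern is only a loose syntactic check rather than an official DOI validator.

```python
import re

# Loose syntactic check (an assumption, not an official rule):
# "10.", a numeric registrant code, a slash, then a non-empty suffix.
_DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes commonly seen in front of a DOI; compared case-insensitively.
_PREFIXES = (
    "http://dx.doi.org/",
    "https://dx.doi.org/",
    "http://doi.org/",
    "https://doi.org/",
    "doi:",
)


def normalise_doi(raw: str) -> str:
    """Strip a known prefix and return the bare DOI, e.g. '10.5524/100002'."""
    text = raw.strip()
    lowered = text.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            text = text[len(prefix):]
            break
    if not _DOI_PATTERN.match(text):
        raise ValueError(f"not a recognisable DOI: {raw!r}")
    return text


def resolver_url(raw: str, base: str = "http://dx.doi.org/") -> str:
    """Build a resolver link from any of the accepted input forms."""
    return base + normalise_doi(raw)


for value in ("doi:10.5524/100002", "http://dx.doi.org/10.5524/100003"):
    print(normalise_doi(value), resolver_url(value))
```

Storing the bare DOI and generating the link on demand keeps records stable even if the preferred resolver address changes.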

Wednesday Oct 19, 2011

ICHG2011: Genetics and Genomics Gets Personal

[Image: ASHG poster]

GigaScience was on hand to witness plenty of lively discussion last week at the annual American Society of Human Genetics jamboree: the International Congress of Human Genetics in Montreal. As always, the meeting had a strong medical genetics presence, but the rapid growth and uptake of genomics technologies in the field produced much fascinating work on display this year. However, some amongst the heavy clinical contingent were obviously uncomfortable with the lack of clinical validation of much of this work, and debate was heated in many of the plenary debate sessions. This can be followed if you are patient enough to trawl the >4700 tweets utilizing the hashtag #ichg2011 or, fortunately, a growing number of ICHG 2011 webcasts.

The scene was set in the opening "Whole Genome Sequencing: To Do It or Not to Do It?" panel (involving the always controversial James Watson memorably talking about "Genetic Losers") and the technology v. medicine debate was particularly polarized in the "Current and Emerging Sequencing Technologies" panel the following day (nicely summarized by Luke Jostins in his blog). Whilst there was a consensus that sequencing will become a standard tool in the diagnosis of genetic diseases, the second panel was divided on whether this approach should be a purely targeted one, restricted to finding the pathogenic mutations causing a disease. Some of the more clinically focused members argued that medical genome sequencing was "hype", held back by the lack of genetic counselors, lack of clear policies from healthcare providers and insurance companies, and a very poor level of genetic training of clinicians in general.

The concerns raised have much merit, but the circular arguments and calls for further debate didn't really acknowledge that technological advances and events on the ground are threatening to make them redundant. It was obvious that panelists such as Rade Drmanac from Complete Genomics were going to argue for genomics technology in the clinic, but some of the clinicians on the panel provided evidence that many physicians are already using it on a large scale. Joris Veltman provided examples from his recent work using exome sequencing rather than single gene tests on 500 individuals, and our editorial board member Ming Qi also admitted to doing similar work in the clinic. With one lucky attendee at the conference winning the chance to have their exome sequenced by 23andMe (market value of 999 USD from 23andMe - or BGI), many on Twitter pointed out that the market will likely decide the debate's outcome.

Also on display at the meeting were many examples of larger and larger projects utilizing exomic or even whole genome sequencing. Announcements were made at the meeting from Autism Speaks and our colleagues at the BGI about a new project to sequence 10,000 individuals with autism spectrum disorders. Tim Spector presented initial data from his EpiTwin project, a BGI collaboration to sequence the epigenome of 5,000 twins. Cisca Wijmenga also presented an overview of the "Genome of the Netherlands", another BGI collaboration that has already sequenced 250 Dutch trios. With many similar-scale projects presented at the meeting, such as Nick Schork's work with Complete Genomics to produce 1000 human reference genomes of the "Wellderly", it's clear that the field is having to deal with bigger and bigger datasets. A nice visual representation of this was shown by Cisca Wijmenga when she presented a slide showing the number of discs needed to transfer 770 whole genomes' worth of GoNL project data from BGI back to the Netherlands.

The particularly challenging issues of scale remain. Hints of how these are likely to be tackled in future included an announcement at the meeting from DNAnexus about their tie-in with Google to host a mirror of the (no longer defunct) Short Read Archive in the cloud. The prickly topic of patient data security also needs resolving, and there were promising posters on display trying to improve the utility and security of this type of data with tools such as MedSavant by Marc Fiume and GWAS data encryption protocols from Itsik Pe'er. All of these issues surrounding data handling are very relevant to the scope of GigaScience, and we are currently commissioning papers covering the many issues surrounding the handling of medical data. If you have an interesting point of view you would like to put forward as a commentary or review in this area, or if you have useful research or tools relating to this, please contact us at editorial@gigasciencejournal.com about submitting to the journal. Whilst there is a huge amount still to resolve and do, these areas are of great interest and we are keen to follow and be part of the debate in the future.

Tuesday Oct 04, 2011

GigaScience in the press (and still on the road)...

We at GigaScience always appreciate good press, and on top of the welcome coverage on blogs (e.g. this in Annals of Botany) and Twitter, we are pleased with the profile of the journal in this month's Bio-IT World (especially coming on top of the coverage of our database in the previous issue). The article is a nice introduction to the editors, editorial policies, and hopes for the journal, and is useful reading for those who'd like to know a bit more on top of what is already featured in GigaBlog and our author instructions. We are very pleased with the detailed coverage, although we are not sure this can top our recent mention in the prestigious Annals of Improbable Research.

After our busy summer attending conferences and chasing down papers, GigaScience is not slowing down and is still on the road. Today we are at the NGS Asia congress in Singapore (follow #ngsasia on Twitter for updates) and next week we will be gathering with the human genetics community at their annual ASHG/ICHG congress in Montreal (we'll be tweeting using #ICHG2011). Those attending can get hold of us via the BMC and BGI booths (1122 and 1322), so please come and say hello.

Tuesday Sep 20, 2011

Beyond the Genome: taking GigaScience into the Clouds

With the summer's conference season over, GigaScience are still keeping mobile, and this week Laurie is taking in “Beyond the Genome”, the meeting from our BioMed Central stablemates Genome Biology and Genome Medicine in Washington DC. Now in its second year, with a great line-up and voted by Genome Web as one of the top-3 genomics meetings, it covers key parts of our “big-data” scope and has our editorial board members Mike Schatz and Karen Nelson on the scientific committee, so it was obvious we had to attend.

Monday kicked off proceedings with the Genome Informatics pre-Meeting, excellently chaired by Mike, who put together a great line-up of talks: Cloud Computing (Matt Wood from AWS, and Ben Langmead plugging his Myrna and Crossbow tools), Lincoln Stein giving interesting and extremely "big-data" insights into the handling of the enormous ICGC datasets, reproducible workflows and Galaxy from James Taylor (a subject close to our hearts, his slides here), and a BGI perspective from our very own Yingrui Li (slides here), amongst others.

Technical Notes - call for Cloud computing tools

With Cloud computing becoming such a key tool in data-intensive science, and with GigaScience, coming from the BGI, being in the unique position of being a journal with its own Cloud (BGI-Cloud), today was a good opportunity to announce our call for submissions and volunteers to work with us on a new type of Cloud computing article - Technical Notes. By using BGI-Cloud as a test environment, GigaScience would like to particularly highlight tools, methods or procedures for the analysis or handling of large-scale data that are optimized to run in a cloud environment. Whilst there are already several hubs and platforms for useful cloud-based tools and workflows (CloudBioLinux being an excellent example), our series/hub hopes to combine some of the advantages of these with the visibility and quality assessment of the more traditional journal article.

By offering reviewers and editors access and free time to review and test these articles and tools in a standard environment, we hope to increase reproducibility and ease-of-testing of research, and take a first step towards what many hope will be a future of "executable articles". To trial this we are offering initial volunteers with tools of interest the opportunity of some free time in the BGI cloud (on top of BGI's already generous covering of the open-access article-processing charges for the journal's first year), so please contact us at editorial@gigasciencejournal.com if you would like to talk to us about submitting a Technical Note and associated application.

With days on Cancer, Exomes (nicely tied in with the Genome Biology special issue) and Microbiomes still to come, Beyond the Genome has already been interesting and insightful, and it will be hard to top the first day. For those not fortunate enough to attend you can follow the action on twitter with the hashtag #BtG11 or from Oliver Hofmann's fantastic notes. We'd like to thank Mike, Yingrui and Lincoln for the nice GigaScience mentions and plugs in their talks, and our colleagues at BMC for letting us attend.

Tuesday Sep 13, 2011

HUPO 2011: lessons for Proteomics from the Genomics Tsunami

Our whistlestop summer conference tour circumnavigating the globe has come to a jetlagged end, with the final conference being last week's HUPO (Human Proteome Organization) congress in Geneva. With it being the 10th anniversary meeting, it was a good opportunity to look back on how Proteomics has progressed over the past decade, from its early gel-based origins to its current more mass-spectrometry based incarnation as a key high-throughput "Omics" technology. Whilst there have been huge challenges and some criticism relating to issues with reproducibility (leading even to a "fix-proteomics" campaign), the several sessions relating to standards, data and repositories were good opportunities to observe how these are currently being addressed.

The many talks from members of HUPO-PSI (Proteomics Standards Initiative), including four from our editorial board member Henning Hermjakob, demonstrated how organized the community has been to systematically divide up and produce standards, formats, tools and repositories for a diverse range of data types. The HUPO-PSI Initiative Program session followed the full spectrum, from 2D-gels (Juan Pablo Albar presenting on his recent BMC Research Notes paper on best practice for data sharing in Proteomics) to Molecular Interaction data (Sandra Orchard presenting on the IMEx consortium).

Many of the biggest challenges seemed to be economic and cultural rather than technical, with much discussion on the closing of Peptidome by NCBI, and recent stability issues at the main ProteomeXchange raw data portal - Tranche. Whilst this is unfortunate, there seemed to be much work in progress to rectify issues with raw data hosting, and processed and annotated data seemed to be in safe hands with the PRIDE and PeptideAtlas repositories. Whilst adoption and journal compliance are still building up (for an example see our last GigaBlog posting), PRIDE in particular offers authors and reviewers great visualization and quality assessment tools (PRIDE Inspector), and in light of this our editorial policies strongly recommend deposition of suitable data in this database.

With a 9-year history and over 50 publications and white papers produced to date, HUPO-PSI has tried to follow many of the lessons learned by Proteomics' slightly older "big brother", the Genomics community. With this subject in mind GigaScience presented a talk at the "Proteomics Repositories and Journals - a partnership made in heaven/hell?" session specifically focusing on lessons learned for the Proteomics community from the Genomics "Tsunami" (slides here). Whilst Proteomics data volumes are still smaller than the petabytes that the genomics community is currently struggling with, it's reassuring that the growing Proteomics community is trying to preempt these issues. There were interesting talks on show demonstrating very "genomics-esque" cloud-based workflow systems such as ISB's TPP (Trans-Proteomic Pipeline) and its amztpp command line tool. It was also interesting to see areas where the two fields are coalescing, with Mike Snyder presenting a fantastic personalized-medicine oriented multi-"Omics" talk on what he terms Whole "Omics" Profiling (and BGI calls "Trans-Omics").

Whilst there are obviously huge challenges that lie ahead, it is clear Proteomics has come a long way in the last decade, and as it is a key part of the scope of GigaScience we hope to be there to cover much of the progress as the field matures in the decades to come. Please contact us at editorial@gigasciencejournal.com if you have Proteomics data-related research, reviews or commentary you would like us to consider for the journal. Looking forward to meeting many of you at HUPO 2012 in Boston!

Saturday Sep 03, 2011

Exercises in blogging at #solo11. Show me the data!

The latest stop on the GigaScience magical mystery conference tour is Science Online London, and this year they have tried to make the format more interactive by organizing several workshops and breakout sessions, including one on blogging that this posting is a product of. One of the main themes running through the meeting has of course been open science (especially in the great keynote by Michael Nielsen), and open-data (particularly on the "linking with the literature" and "dealing with data" panels). Data-citation was raised on several occasions, and we did our part promoting it and the recent crowdsourcing of E. coli in the Microattribution session.

The theme of the second day has been the better use of tools and information in the context of the rare genetic disease SMA (Spinal Muscular Atrophy). Maryann Martone from the Neuroscience Information Framework (NIF) initially set the scene, discussing the state of play in Neuroscience data-sharing, and outlining the technical and cultural challenges of sharing this type of data, which in comparison to genomics or proteomics has a much wider and more heterogeneous set of data types. The afternoon consisted of several workshops relevant to SMA research, and the "Beyond Scholarly Publication" workshop aimed to showcase scholarly HTML and the WordPress platform by producing blog postings on SMA research.

Due to a combination of geographic and inertia issues I've not written this up on WordPress, but thought I would contribute on our normal blog with some relevant issues regarding data-sharing. Although data-sharing was one of the themes of the meeting, none of the suggested SMA articles to discuss had their raw data available in public repositories. Whilst this was mostly due to the papers being based on data types without well recognized repositories, such as electrophysiological data, one of the papers (Wu et al., BMC Neuroscience) was based on Proteomics data, which does have the Tranche and PRIDE databases to hold raw and processed data. With us heading off to the HUPO proteomics meeting tomorrow (and due to talk at the repositories session there), it's timely to highlight the challenges this particular field still faces, and it will be very interesting to hear this week how these resources are being accepted by their communities.

Despite all of the positive words regarding data-sharing at Science Online, this particular exercise has been a bit of a reality check, and it's very clear that there is still an enormous amount of work to do. With the meeting about to end, and the attendees all going their separate ways around the world, one positive thing to take is that they and organizations such as the NIF are at least using their combined energies to address this. Let's all hope they succeed, and #solo12 will be a nice opportunity to assess how this has progressed. See you then, and don't forget to follow the huge amounts of coverage on Twitter (follow #solo11) and the soon-to-be-posted videos.

References

Wu CY, Whye D, Glazewski L, Choe L, Kerr D, Lee KH, et al. Proteomic assessment of a cell model of spinal muscular atrophy. BMC Neuroscience. 2011;12:25. Available from: http://dx.doi.org/10.1186/1471-2202-12-25.


Friday Sep 02, 2011

The complexity of life

ICSB 2011 left me with a greater than ever appreciation for, to borrow from the title of the last plenary session, the complexity of life. I was impressed by the increasingly complex and explanatory models that are being built, and the faster and more detailed imaging methods under development, not to mention the exciting new applications of systems biology for disease treatment. I am sure I’m not the only one who feels this way.

I want to especially thank the participants in the Plant Systems Biology and the Systems Neuroscience Sessions.  These parallel sessions were smaller, but both the audience and the speakers were dedicated.

This year’s ICSB was a great meeting.  Facetiously, my only disappointment was that there were only awards for posters and not for the best models of the poster winners as well.

For those of you who are already missing the meeting, or who missed out on the meeting, or who just can’t wait until Toronto next year, you can relive the meeting here.

Finally, a reminder that Scott and I will be at the HUPO World Congress in Geneva starting on Sunday.  Hope to see some of you there!