Sharing Genomic Sequence Data

An unfortunate consequence of the recent explosion in next-generation sequencing (NGS) activity is the large volume of generated data that goes unused or is not used to its full potential. Analysis of sequence data can take orders of magnitude more time than producing it and, as a result, is unable to keep pace. The existence of several NGS platforms, a wide variety of sequence-analysis software and the lack of a standardized system across labs for NGS have resulted in independently developed tools, protocols and analyses that cannot easily be shared. However, pooling information to create large data sets may lead to more research and clinical discoveries. A number of organizations have formed over the last several years to address these issues. The Genome in a Bottle Consortium, for example, was created in 2012 by the US National Institute of Standards and Technology (NIST) to develop reference materials, methods and data for interpreting sequencing variants.

Another group, the Global Alliance for Genomics and Health (GA4GH), is a nonprofit organization founded in January 2013, when an international group of 50 colleagues met to discuss issues in genomics. A letter of intent was written in June 2013, and the group then quickly attracted members. The GA4GH held its first plenary meeting in March 2014 and will hold its second in October. At present, it has over 220 members, including institutions in health care, research, disease and patient advocacy, information technology, and life science. Participating instrument and consumables companies are Illumina, Promega and QIAGEN.

The Alliance is working to facilitate the sharing of clinical and genomic sequencing data in a responsible and secure fashion. The group is addressing the lack of genomic data sharing that results from the varied approaches different groups take to generate, store and analyze their data. The GA4GH is also working to remove policy and regulatory obstacles to data sharing. “After identifying best practices and developing new interoperable tools, methods and harmonized technological platforms, the Alliance will be sharing these approaches, cross-pollinating ideas and learning, and communicating with diverse communities,” Peter Goodhand, acting executive director of the GA4GH, told IBO. “[T]he Global Alliance is already working to accelerate and catalyze specific projects that demonstrate the use of such methods and approaches.”

According to Mr. Goodhand, the most pressing concerns in sequencing informatics for research applications related to genomic medicine and human health are the scale of data, isolation of data in silos, and the technical and regulatory barriers to sharing the data. He stated that the organization is working to resolve the technical difficulties in sharing “largely by applying proven approaches and ideas from modern data science.”

To meet its goals, the GA4GH is divided into four working groups, each handling a different subset of the organization’s aims. The Regulatory and Ethics Working Group is addressing the ethical, legal and social aspects of the Alliance’s goals, as well as policies and standards in areas including privacy and data governance. The scope of the Security Working Group is the technological side of such concerns as data security and privacy protection. The Data Working Group (DWG) is concerned with the representation, storage and analysis of data. This includes developing standards in collaboration with partners in industry and platform development to allow interoperability of data. The fourth working group of the GA4GH is its Clinical Working Group (CWG), which is focused on the relationship between phenotypic and genotypic data.

In the shorter term, the DWG is striving to achieve its goals by overseeing and supporting standards for sequencing data and by improving the formats used to store it. It will support BAM and CRAM for storing alignment information against reference sequences, and VCF for storing sequence variations and annotations. To overcome some limitations of these formats, such as in scale and data sharing, the DWG will develop application programming interfaces (APIs) for genomic sequencing data. An API is a defined set of requests and responses through which one program obtains data or services from another, providing building blocks for software built on top of it.
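To make the variant format concrete, the following minimal Python sketch reads the leading columns of a single tab-separated VCF data line (chromosome, position, ID, reference allele, alternate alleles). The `Variant` class, the `parse_vcf_line` function and the example record are illustrative assumptions, not part of any GA4GH tool, and the sketch ignores header lines, genotype columns and the INFO field.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Variant:
    """Hypothetical container for the core fields of one VCF record."""
    chrom: str
    pos: int                # 1-based position, as written in the VCF line
    ref: str                # reference allele
    alts: Tuple[str, ...]   # one or more alternate alleles


def parse_vcf_line(line: str) -> Variant:
    """Parse the first five tab-separated columns of a VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _record_id, ref, alt = fields[:5]
    # Multiple alternate alleles are comma-separated in the ALT column.
    return Variant(chrom, int(pos), ref, tuple(alt.split(",")))


# Made-up record: columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO.
example = "1\t12345\t.\tA\tG,T\t50\tPASS\t."
variant = parse_vcf_line(example)
print(variant)
```

A flat-file reader like this has to be rerun over every file a lab holds, which is one way to picture the scale and sharing limitations that an API is meant to address.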

Four APIs are being developed. The first is for reference variants of common DNA sequences, polymorphisms and structural variations, and will support mapping from the above data formats and from databases. The second is for read data from sequencers. Among other features, it will allow querying over groups of samples and support relating reads to a reference genome. Another API will be for expression, methylation and other epigenetic data for information on gene expression. This API will interact with the above APIs. The fourth API will be for metadata, which is information about a sample, including the sequencing center and tissue type. Mr. Goodhand indicated that data consumers can use the same analyses and tools for all data using the same API, regardless of their data providers.
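Mr. Goodhand's point about running the same analysis against any provider can be sketched in Python as follows. This is an illustration only: `VariantSource`, `InMemorySource`, `search_variants` and `count_variants` are hypothetical names, not the actual Genomics API, and the in-memory lists stand in for remote data providers.

```python
from typing import Iterable, List, Protocol


class VariantSource(Protocol):
    """Hypothetical interface: anything that can answer a region query."""

    def search_variants(self, chrom: str, start: int, end: int) -> Iterable[dict]:
        ...


class InMemorySource:
    """Stand-in for one data provider; a real one would sit behind a network API."""

    def __init__(self, variants: List[dict]) -> None:
        self._variants = list(variants)

    def search_variants(self, chrom: str, start: int, end: int) -> List[dict]:
        # Half-open interval [start, end) on the given chromosome.
        return [
            v for v in self._variants
            if v["chrom"] == chrom and start <= v["pos"] < end
        ]


def count_variants(sources: Iterable[VariantSource], chrom: str, start: int, end: int) -> int:
    """The same analysis code runs unchanged against every provider."""
    return sum(len(list(s.search_variants(chrom, start, end))) for s in sources)


# Two made-up providers holding different records.
provider_a = InMemorySource([{"chrom": "1", "pos": 100}, {"chrom": "1", "pos": 900}])
provider_b = InMemorySource([{"chrom": "1", "pos": 150}, {"chrom": "2", "pos": 150}])
total = count_variants([provider_a, provider_b], "1", 0, 500)
print(total)
```

Because `count_variants` depends only on the shared interface, adding a third provider would require no change to the analysis itself.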

The GA4GH has already started work on the APIs, according to Mr. Goodhand, to form a suite of Genomics APIs. He stated that a number of data providers, including the European Bioinformatics Institute, Google and the US National Center for Biotechnology Information, are using Version 0.1 of the Genomics API. He indicated that Version 0.5 is expected to be available in October.

To tie the development of APIs to research, the DWG is working with three active projects. The Beacon Project is a web service that any institution can offer, through which users can ask technically simple questions about genomic data and receive answers that contain no information that could be considered a violation of privacy. The DWG will work to increase the number of institutions offering the service and gain understanding of international hurdles to data sharing. The project is determining willingness to share genetic data. The ICGC Pan-Cancer Analysis Project (PCAP) is using whole-genome sequencing to study somatic mutations for more than 20 types of cancers. The DWG will develop data standards and software for the storage, access and exchange of the PCAP’s read and reference-mapping data. Through the third project, Matchmaker Exchange, clinicians and researchers will be able to find samples of interest, such as those with a given genotype, from other sources. The project will allow the data provider and consumer to interact and work together.
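One way to picture a Beacon-style service is the Python sketch below. It assumes that each query asks only whether a given allele has been observed at a given position, and that each institution replies with a bare yes or no. The `Beacon` class, the `ask_all` function, the site names and the data are all hypothetical.

```python
from typing import Dict, Iterable, Tuple

Allele = Tuple[str, int, str]  # (chromosome, position, base)


class Beacon:
    """Hypothetical institution-side service that answers only yes or no."""

    def __init__(self, name: str, observed: Iterable[Allele]) -> None:
        self.name = name
        self._observed = frozenset(observed)

    def has_allele(self, chrom: str, pos: int, base: str) -> bool:
        # No sample identifiers or genotypes leave the institution.
        return (chrom, pos, base.upper()) in self._observed


def ask_all(beacons: Iterable[Beacon], chrom: str, pos: int, base: str) -> Dict[str, bool]:
    """Send the same question to every beacon and collect the answers."""
    return {b.name: b.has_allele(chrom, pos, base) for b in beacons}


# Made-up sites and alleles, for illustration only.
beacons = [
    Beacon("site_a", [("1", 12345, "G")]),
    Beacon("site_b", [("2", 500, "T")]),
]
answers = ask_all(beacons, "1", 12345, "g")
print(answers)
```

The design choice to return a single boolean per site is what keeps the exchange technically simple: a researcher learns where relevant data may exist without receiving the data itself.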

The CWG will establish standards for representing phenotypic and genotypic data for both clinical and research use. The Group will begin by working on cancer and rare genetic diseases and eventually expand to other areas. According to Jennifer Skinner, senior project lead of the GA4GH, the Alliance is in the process of determining what it considers the most pressing issues in sequencing informatics for clinical applications. It recognizes that data standards for linking genotype to phenotype will differ for research and clinical purposes, and this will be a topic of discussion in the Alliance’s next plenary session in October.

The CWG has begun efforts in three areas. The first, phenotype ontology, deals with the vocabulary associated with phenotypic data to be used in databases and publications. The CWG is developing common terms across the ontologies that currently exist. Data harmonization, the second area, will allow sharing of data among different studies. The third area, biomedical informatics and electronic health records and data extraction, will develop methods for extracting phenotypic data from already-existing health records. The CWG also intends to develop standards for disease registries and to study how other countries are linking genomic data to clinical applications.

In the longer term, the CWG intends to establish best-practice standards for using genomics for clinical applications. In addition, it plans to use genotypic information to direct patients to clinical trials for cancer and rare diseases and to establish an international system for the exchange and harmonization of data.

Like the DWG, the CWG will be working with related research projects including Matchmaker Exchange and Phenotype Ontologies. It will also be involved in the BRCA Challenge, which will amass genetic variants of the BRCA genes from around the world to better understand the diseases associated with them. More specifically, the CWG will work on data sets of genetic variants paired with phenotypic and other scientific information, and facilitate data sharing to make progress on actionable variants.

The four working groups will facilitate accomplishment of the GA4GH’s goals of standardization of data, allowing researchers and clinicians to work with large data sets from various sources. “Clinically, it is important for us to be able to accurately discriminate between genetic changes that are benign and of no consequence to health, versus those that lead to disease or predispose an individual to a risk of developing disease,” stated Mr. Goodhand. “Being able to compare genetic changes in an individual with hundreds of thousands or even millions of other individuals—with and without disease—will provide a much more accurate prediction of risk for each individual.” He commented further that these expanded data sets would allow new patterns to be discovered that could not be identified through smaller-scale work of a single party. According to Mr. Goodhand, the Alliance plans to collaborate with other groups working toward standardization in sequencing informatics through its “principles of openness, transparency, and inclusiveness.”

Mr. Goodhand added, “We are on the cusp of a tremendous opportunity in medicine. More and more people across the world are choosing to make their genetic and clinical data available so that these medical advances can be made.” He continued, “This is a unique moment to help shape, on the front end, how best to collect, share and learn from the data in ways that protect privacy and autonomy, accelerate medical progress, and are in the public good.”
