Bioinformatics Special Report, Part I: Where Does Biotechnology Data Come From?

The Human Genome Project, a collaborative effort to sequence the entire human genome, took thirteen years and three billion dollars to complete. This project began in 1990, and, by modern standards, is already considered slow and dated. Today, the human genome can be sequenced in a single day, for only a few thousand dollars.

As sequencing technologies develop, the amount of genomic data that researchers have at their disposal is growing exponentially. To date, 4,327 different species have fully sequenced genomes.

At the intersection of biology and computer science, the field of Bioinformatics has emerged as an important means to process and meaningfully manipulate the wealth of biological data that is available.

Bioinformatics is the field of science that uses computational methods to convert raw biological data into meaningful biological insight. For example, by analyzing sequence and microarray expression data in breast cancer patients, researchers discovered a mutation in the BRCA1 gene associated with a high risk of breast cancer. This bioinformatics study gave researchers a specific target, and studies have since determined exactly how mutated BRCA1 causes breast cancer.

Part I of this article will discuss how bioinformatics data is generated. It will focus on the methods used to collect this data, and the companies that support this collection. Part II will look at how this data is manipulated, discussing the government’s role in archiving and distributing this information. Finally, Part III will explore the real-life applications of bioinformatics, discussing the role of the private sector in applying this data to therapeutic development.

Where does the data come from?

To fully understand the scope of bioinformatics projects, it is important to know how this data originates. Several laboratory methods were recently created to study properties of the cell on a genome-wide scale. In this section, we will focus on three particular methods: Microarrays, Chromatin Immunoprecipitation (ChIP), and DNA Sequencing.

Microarrays

Microarray technology is used to determine the relative expression of a large subset of genes in an organism’s genome. In doing so, a researcher can determine which genes are active in an experimental condition (e.g. disease).

The microarray is a small chip, approximately the size of a standard camera memory card. Thousands of short DNA molecules, called probes, are attached to the surface of the chip.

These probes are designed to adhere to specific, naturally occurring DNA sequences in the cell. A sample of DNA is taken from the cell and added to the chip. When the sample DNA adheres to the probe DNA, a fluorescent signal is emitted. Together, these fluorescent signals make up the gene expression profile for the cell.

Under different conditions, cells will express different genes. For example, a cell with limited nutrient availability will activate genes involved in breaking down its own components for energy. When nutrients are available, the cell will turn off these genes. When a gene is activated, enzymes first convert the DNA sequence into messenger RNA (mRNA). In microarray experiments, the mRNA from a cell is collected, chemically converted into DNA, and added to microarray chips.

With this technology, a researcher can manipulate a cell, collect the mRNA of the genes expressed under the new conditions, and directly visualize how those manipulations affect the cell’s gene expression profile. In turn, this reveals which genes are expressed under particular conditions.
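As a rough illustration of how two gene expression profiles might be compared, the Python sketch below computes a log2 fold change for each gene from fluorescence intensities. The gene names, the intensity values, and the pseudocount are all made up for illustration. Real microarray analyses also include normalization and statistical testing, which are omitted here.

```python
import math

def log2_fold_changes(control, treated, pseudocount=1.0):
    """Return log2(treated / control) for each gene found in both profiles.

    The pseudocount avoids division by zero; its value is an assumption.
    """
    changes = {}
    for gene, control_signal in control.items():
        if gene in treated:
            ratio = (treated[gene] + pseudocount) / (control_signal + pseudocount)
            changes[gene] = math.log2(ratio)
    return changes

# Hypothetical fluorescence intensities (arbitrary units) for two conditions
fed = {"geneA": 99.0, "geneB": 199.0, "geneC": 49.0}
starved = {"geneA": 399.0, "geneB": 199.0, "geneC": 24.0}

for gene, change in sorted(log2_fold_changes(fed, starved).items()):
    print(f"{gene}: {change:+.2f}")
```

A positive value suggests the gene is more active in the second condition, a negative value suggests it is less active, and a value near zero suggests little change.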

The concept of a gene chip was conceived in the 1980s. However, microarray technology, in its modern form, was first introduced in 1997, and its role in the biological sciences continues to grow. Microarrays can now be used to detect a wide array of molecules beyond DNA, including proteins, tissues, carbohydrates, micro-RNAs, and organic chemical compounds (such as drugs).

Several companies, including Affymetrix (AFFX), Illumina (ILMN), and Agilent (A), specialize in the manufacture of DNA microarray chips. These companies have increased the chips’ processing power while reducing their background noise. Affymetrix’s product, GeneChip, can be used to run thousands of experiments simultaneously, allowing direct comparison.

ChIP

Chromatin Immunoprecipitation is a technique used to determine exactly where proteins bind to DNA. DNA-binding proteins are critical to gene regulation, a fundamental target for pharmaceutical development. Thus, it is becoming increasingly important to precisely understand how these systems function.

In ChIP, cells are initially treated with formaldehyde. Formaldehyde is a crosslinking agent, which stabilizes any chemical interactions in the cell, including protein-DNA interactions. Subsequently, the DNA is extracted from the cell and fragmented mechanically. To isolate the regions of DNA containing the protein of interest, antibody selection is used. Antibodies are designed to interact with a specific protein, and these can be used to filter out any unwanted DNA fragments.

At this stage, the researcher has a collection of the DNA sequences where the protein of interest is bound. Heating the sample separates the DNA and protein. Several companies, such as Abcam (ABC.L), Cell Signaling Technology, and Upstate, specialize in the production of the highly specific “ChIP-grade” antibodies used for this selection.

ChIP can be used to determine local DNA-protein interactions, or it can be paired with DNA microarrays to determine the genome-wide binding profile for a protein of interest. The latter technique, called ChIP-on-chip, reveals the set of genes to which the protein of interest binds. Since these microarrays have a different function than those for gene expression profiling, a different set of probes must be used. This is because proteins typically bind to regulatory regions, such as promoters or enhancers, which are often found outside of a gene’s coding region. As such, different companies, including NimbleGen and Invitrogen (the latter now part of Life Technologies, LIFE), manufacture the ChIP-on-chip microarrays.
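To give a sense of how a genome-wide binding profile might be read off a ChIP-on-chip array, the Python sketch below compares the signal from the antibody-selected sample against a reference sample at each probe and reports the probes that are enriched. The comparison against a reference ("input") sample, the probe names, the signal values, and the two-fold threshold are all assumptions made for illustration, not details from any vendor's protocol.

```python
def enriched_probes(ip_signal, input_signal, min_ratio=2.0):
    """Return (probe, ratio) pairs whose IP/input ratio meets the threshold.

    Results are sorted from the most enriched probe to the least.
    Probes with no positive reference signal are skipped.
    """
    hits = []
    for probe, ip in ip_signal.items():
        background = input_signal.get(probe, 0.0)
        if background > 0:
            ratio = ip / background
            if ratio >= min_ratio:
                hits.append((probe, ratio))
    return sorted(hits, key=lambda item: item[1], reverse=True)

# Hypothetical probe signals (arbitrary units)
ip = {"promoter_X": 900.0, "promoter_Y": 120.0, "enhancer_Z": 450.0}
reference = {"promoter_X": 100.0, "promoter_Y": 100.0, "enhancer_Z": 150.0}

print(enriched_probes(ip, reference))
# [('promoter_X', 9.0), ('enhancer_Z', 3.0)]
```

In this toy data set, the protein of interest would appear to bind near promoter_X and enhancer_Z but not near promoter_Y.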

Sequencing

DNA sequencing is the process of determining the linear order of nucleotide bases within a sample molecule of DNA. Sequencing has innumerable applications in the biological sciences, including the identification of mutations, diagnosis of illness, gene therapy, and forensic science. By comparing DNA sequences, researchers are able to elucidate the molecular mechanisms of potential genetic diseases.
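The idea of identifying mutations by comparing sequences can be sketched in a few lines of Python. The example below lists the positions at which a sample differs from a reference. It assumes the two sequences are already aligned and of equal length, and both sequences are invented for illustration. Real comparisons rely on alignment tools that also handle insertions and deletions.

```python
def find_substitutions(reference, sample):
    """List (position, reference_base, sample_base) for each mismatch.

    Positions are 1-based. Assumes pre-aligned sequences of equal length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have equal length")
    return [
        (index + 1, ref_base, sample_base)
        for index, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]

# Hypothetical sequences differing by a single base
reference = "GATTACAGATTACA"
sample = "GATTACAGCTTACA"

print(find_substitutions(reference, sample))
# [(9, 'A', 'C')]
```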

In 1977, Frederick Sanger and colleagues published their chain-termination method for sequencing, the first reliable method to sequence long fragments of DNA. Sanger was awarded the Nobel Prize in Chemistry in 1980, and the method is now generally referred to as Sanger sequencing. While technologies have since improved, Sanger sequencing is widely regarded as the breakthrough that allowed DNA sequencing to flourish. Sanger sequencing is the gold standard for first-generation sequencing technologies, and was the predominant method used in the Human Genome Project.

DNA is a double-helix composed of individual units called nucleotides, of which there are four varieties (adenine, thymine, cytosine, and guanine: the A, T, C, and G that spell out the title of the 1997 science fiction thriller GATTACA). These nucleotides connect to one another vertically, to create single helices, and across, to connect the two helices.

In Sanger sequencing and most second-generation sequencing methods, a single-helix DNA molecule is used as a template, and nucleotides are added across the single-helix to form a double-helix. In Sanger sequencing, a mixture of individual nucleotides and chemically modified nucleotides is added to the template. The chemical modification prevents further vertical stacking, and, by this approach, researchers can slowly build upon the template and deduce the composition of the sequence.
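The logic of chain termination can be illustrated with a simplified Python sketch. It assumes that the reaction yields one fragment for every possible stopping point, and that the identity of the final, modified nucleotide on each fragment can be detected. Sorting the fragments by length and reading off their last bases then spells out the newly built strand. The function names are hypothetical, and the sketch ignores strand direction and the laboratory steps used to separate fragments by size.

```python
import random

# Base-pairing rules: A pairs with T, and C pairs with G
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Return the strand that pairs with the given one, base by base."""
    return "".join(COMPLEMENT[base] for base in strand)

def terminated_fragments(template):
    """Return every prefix of the strand built across the template.

    Each prefix stands for a copy whose extension stopped at a
    chemically modified nucleotide (its last base).
    """
    new_strand = complement(template)
    return [new_strand[:length] for length in range(1, len(new_strand) + 1)]

def read_sequence(fragments):
    """Sort fragments by length and read the last base of each."""
    return "".join(fragment[-1] for fragment in sorted(fragments, key=len))

template = "GATTACA"
fragments = terminated_fragments(template)
random.shuffle(fragments)  # the reaction produces fragments in no particular order

new_strand = read_sequence(fragments)
print(new_strand)              # CTAATGT
print(complement(new_strand))  # GATTACA
```

Because each fragment length occurs exactly once in this idealized model, the shuffled order does not affect the result.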

One decade after the completion of the Human Genome Project, sequencing techniques remain a vital component of laboratory research. While Sanger sequencing laid the foundation, it is, by modern standards, inefficient for larger projects. The method can sequence only up to 800 nucleotides at a time, and the human genome contains roughly four million times that amount. Therefore, it is no surprise that several companies are actively developing new methods to streamline this process.
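The scale of this mismatch can be checked with some quick arithmetic in Python. The sketch assumes a human genome of roughly three billion nucleotides, an approximate figure not stated above, and counts only the bare minimum of non-overlapping reads. Real projects need many more reads, since fragments must overlap to be assembled.

```python
import math

HUMAN_GENOME_SIZE = 3_000_000_000  # approximate length in nucleotides (assumed)
SANGER_READ_LENGTH = 800           # upper limit per read, as cited above

def minimum_reads(genome_size, read_length):
    """Return the fewest non-overlapping reads that could cover the genome."""
    return math.ceil(genome_size / read_length)

print(f"{minimum_reads(HUMAN_GENOME_SIZE, SANGER_READ_LENGTH):,}")
# 3,750,000
```

At 3.75 million reads, the result is in line with the "roughly four million" figure.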

Pyrosequencing, licensed by 454 Life Sciences, detects a small molecule called pyrophosphate, which is released in the chemical reaction of vertical nucleotide stacking. Illumina has developed its own method of sequencing, involving fluorescently labeled nucleotides and high-powered cameras to capture which nucleotide is added. Ion Torrent Systems, a subsidiary of Life Technologies, has developed a semiconductor capable of detecting each added nucleotide.

Other companies that are developing DNA sequencing strategies include Oxford Nanopore Technologies (nanopore sequencing), Affymetrix (microarray sequencing), Complete Genomics (DNA nanoball sequencing), Applied Biosystems (SOLiD sequencing), Helicos Biosciences (Heliscope single molecule sequencing), and Pacific Biosciences (single molecule real time sequencing).

With microarrays, ChIP, and DNA sequencing, researchers can accumulate a wealth of biological data. Microarrays depict the genes actively expressed under various conditions. ChIP determines the DNA-binding profile for a protein of interest. Sequencing determines the physical arrangement of nucleotides along the DNA molecule. These techniques are the foundation of bioinformatics.

Bioinformatics researchers (OneMedPlace has coined the term ‘bioinformaticians’) develop complex algorithms to analyze these data, searching for meaningful patterns within this vast stream of information. Ultimately, these findings lead to experiments, discovery, and the development of pharmaceuticals and other therapeutics for debilitating disease. Studies in bioinformatics have already increased our understanding of cancer biology. Only time will demonstrate the full therapeutic potential of bioinformatics.