Completed in 2003, the Human Genome Project (HGP) was a 13-year project coordinated by the U.S. Department of Energy (DOE) and the National Institutes of Health. During the early years of the HGP, the Wellcome Trust (U.K.) became a major partner; additional contributions came from Japan, France, Germany, China, and others. Project goals were to
- identify all the approximately 20,500 genes in human DNA,
- determine the sequences of the 3 billion chemical base pairs that make up human DNA,
- store this information in databases,
- improve tools for data analysis,
- transfer related technologies to the private sector, and
- address the ethical, legal, and social issues (ELSI) that may arise from the project.
Though the HGP is finished, analyses of the data will continue for many years.
Starting points include
- Human Genome News: This 13-year publication facilitated HGP communication, helped prevent duplication of research effort, and informed persons interested in genome research.
- Timeline: This research tool chronicles major events in the HGP and related follow-on projects.
- Publications archive: Library of HGP program reports, research abstracts, research goals, and other historical documents.
- Research archive: Archive of HGP research by topic including goals, abstracts, and reports.
Since 2001, the DOE Genomic Science Program has been using microbial and plant genomic data, high-throughput analytical technologies, and modeling and simulation to develop a predictive understanding of biological systems behavior relevant to solutions for energy and environmental challenges including bioenergy production, environmental remediation, and climate stabilization.
During the Human Genome Project, this website served as the primary electronic information source for HGP researchers and the public. It is now a unique archive—a repository for historical documents detailing the history of the HGP from the project's beginnings in 1989 until it was completed in 2003.
The Human Genome Project (HGP) was an international 13-year effort, 1990 to 2003. Primary goals were to discover the complete set of human genes and make them accessible for further biological study, and determine the complete sequence of DNA bases in the human genome. See Timeline for more HGP history.
Human Genome News
Published from 1989 until 2002, this newsletter facilitated HGP communication, helped prevent duplication of research effort, and informed persons interested in genome research.
The Sequence of the Human Genome
A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals.
Two assembly strategies (a whole-genome assembly and a regional chromosome assembly) were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage.
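As a rough check on the coverage figures quoted above, the arithmetic can be sketched in a few lines of Python. The variable and function names are illustrative and do not come from the paper. The inputs are the rounded values given in the text, so the computed shotgun coverage lands near the reported 5.11-fold rather than exactly on it.

```python
# Rounded figures quoted in the text; all names here are illustrative.
READ_COUNT = 27_271_853          # high-quality sequence reads
TOTAL_READ_BP = 14.8e9           # total bp of shotgun sequence
CONSENSUS_BP = 2.91e9            # bp in the euchromatic consensus sequence
REPORTED_SHOTGUN_FOLD = 5.11     # shotgun coverage as reported in the text
SHREDDED_PUBLIC_FOLD = 2.9       # coverage added by the shredded public data


def fold_coverage(sequenced_bp: float, genome_bp: float) -> float:
    """Average number of times each genome base is covered by reads."""
    return sequenced_bp / genome_bp


shotgun_fold = fold_coverage(TOTAL_READ_BP, CONSENSUS_BP)
mean_read_len = TOTAL_READ_BP / READ_COUNT
effective_fold = REPORTED_SHOTGUN_FOLD + SHREDDED_PUBLIC_FOLD

print(f"shotgun coverage from rounded inputs: {shotgun_fold:.2f}-fold")
print(f"mean read length: {mean_read_len:.0f} bp")
print(f"effective coverage: {effective_fold:.2f}-fold")
```

The sum of the two coverage figures is about eight, which matches the "eightfold" effective coverage in the text, and the mean read length falls inside the 500 to 600 bp range given later for paired-end reads.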
The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger.
Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional ∼12,000 computationally derived genes with mouse matches or other weak supporting evidence.
Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA.
Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems.
DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.
Decoding of the DNA that constitutes the human genome has been widely anticipated for the contribution it will make toward understanding human evolution, the causation of disease, and the interplay between the environment and heredity in defining the human condition.
A project with the goal of determining the complete nucleotide sequence of the human genome was first formally proposed in 1985 (1). In subsequent years, the idea met with mixed reactions in the scientific community (2). However, in 1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the U.S. Department of Energy with a 15-year, $3 billion plan for completing the genome sequence.
In 1998 we announced our intention to build a unique genome-sequencing facility, to determine the sequence of the human genome over a 3-year period. Here we report the penultimate milestone along the path toward that goal, a nearly complete sequence of the euchromatic portion of the human genome.
The sequencing was performed by a whole-genome random shotgun method with subsequent assembly of the sequenced segments.
The modern history of DNA sequencing began in 1977, when Sanger reported his method for determining the order of nucleotides of DNA using chain-terminating nucleotide analogs (3). In the same year, the first human gene was isolated and sequenced (4).
In 1986, Hood and co-workers (5) described an improvement in the Sanger sequencing method that included attaching fluorescent dyes to the nucleotides, which permitted them to be sequentially read by a computer.
The first automated DNA sequencer, developed by Applied Biosystems in California in 1987, was shown to be successful when the sequences of two genes were obtained with this new technology (6).
From early sequencing of human genomic regions (7), it became clear that cDNA sequences (which are reverse-transcribed from RNA) would be essential to annotate and validate gene predictions in the human genome.
These studies were the basis in part for the development of the expressed sequence tag (EST) method of gene identification (8), which is a random selection, very high throughput sequencing approach to characterize cDNA libraries. The EST method led to the rapid discovery and mapping of human genes (9). The increasing numbers of human EST sequences necessitated the development of new computer algorithms to analyze large amounts of sequence data, and in 1993 at The Institute for Genomic Research (TIGR), an algorithm was developed that permitted assembly and analysis of hundreds of thousands of ESTs. This algorithm permitted characterization and annotation of human genes on the basis of 30,000 EST assemblies (10).
The complete 49-kbp bacteriophage lambda genome sequence was determined by a shotgun restriction digest method in 1982 (11). When considering methods for sequencing the smallpox virus genome in 1991 (12), a whole-genome shotgun sequencing method was discussed and subsequently rejected owing to the lack of appropriate software tools for genome assembly.
However, in 1994, when a microbial genome-sequencing project was contemplated at TIGR, a whole-genome shotgun sequencing approach was considered possible with the TIGR EST assembly algorithm. In 1995, the 1.8-Mbp Haemophilus influenzae genome was completed by a whole-genome shotgun sequencing method (13).
The experience with several subsequent genome-sequencing efforts established the broad applicability of this approach (14, 15).
A key feature of the sequencing approach used for these megabase-size and larger genomes was the use of paired-end sequences (also called mate pairs), derived from subclone libraries with distinct insert sizes and cloning characteristics.
Paired-end sequences are sequences 500 to 600 bp in length from both ends of double-stranded DNA clones of prescribed lengths.
The success of using end sequences from long segments (18 to 20 kbp) of DNA cloned into bacteriophage lambda in assembly of the microbial genomes led to the suggestion (16) of an approach to simultaneously map and sequence the human genome by means of end sequences from 150-kbp bacterial artificial chromosomes (BACs) (17, 18). The end sequences spanned by known distances provide long-range continuity across the genome. A modification of the BAC end-sequencing (BES) method was applied successfully to complete chromosome 2 from the Arabidopsis thaliana genome (19).
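To make the role of paired-end reads concrete, the following Python sketch checks whether two end reads from one clone have been placed on a scaffold in a way that agrees with the clone's expected insert length. This is only an illustration of the idea that "end sequences spanned by known distances provide long-range continuity". The class names, the `tolerance` parameter, and the example coordinates are all assumptions and are not taken from the paper or from any assembler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Library:
    """A clone library with a nominal insert size (hypothetical structure)."""
    name: str
    insert_bp: int      # nominal insert length, e.g. 2, 10, or 50 kbp
    tolerance: float    # allowed fractional deviation (assumed value)


@dataclass(frozen=True)
class PlacedRead:
    """An end read positioned on a scaffold."""
    start: int          # leftmost scaffold coordinate of the read
    length: int         # read length in bp
    forward: bool       # True if the read lies on the forward strand


def mate_pair_consistent(a: PlacedRead, b: PlacedRead, lib: Library) -> bool:
    """True if the two reads face each other at roughly the insert distance."""
    left, right = (a, b) if a.start <= b.start else (b, a)
    # Ends of one clone are read inward: left read forward, right read reverse.
    if not (left.forward and not right.forward):
        return False
    span = (right.start + right.length) - left.start
    low = lib.insert_bp * (1 - lib.tolerance)
    high = lib.insert_bp * (1 + lib.tolerance)
    return low <= span <= high


lib_10k = Library(name="10kbp", insert_bp=10_000, tolerance=0.25)
read_a = PlacedRead(start=1_000, length=550, forward=True)
read_b = PlacedRead(start=10_300, length=550, forward=False)
read_far = PlacedRead(start=60_000, length=550, forward=False)

print(mate_pair_consistent(read_a, read_b, lib_10k))    # span 9,850 bp
print(mate_pair_consistent(read_a, read_far, lib_10k))  # span 59,550 bp
```

A pair that fails this kind of check suggests that one of the reads sits in the wrong place, for example in a different copy of a repeat, which is why libraries with several distinct insert sizes are useful for ordering and orienting contigs.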
In 1997, Weber and Myers (20) proposed whole-genome shotgun sequencing of the human genome. Their proposal was not well received (21). However, by early 1998, as less than 5% of the genome had been sequenced, it was clear that the rate of progress in human genome sequencing worldwide was very slow (22), and the prospects for finishing the genome by the 2005 goal were uncertain.
In early 1998, PE Biosystems (now Applied Biosystems) developed an automated, high-throughput capillary DNA sequencer, subsequently called the ABI PRISM 3700 DNA Analyzer.
Discussions between PE Biosystems and TIGR scientists resulted in a plan to undertake the sequencing of the human genome with the 3700 DNA Analyzer and the whole-genome shotgun sequencing techniques developed at TIGR (23). Many of the principles of operation of a genome-sequencing facility were established in the TIGR facility (24).
However, the facility envisioned for Celera would have a capacity roughly 50 times that of TIGR, and thus new developments were required for sample preparation and tracking and for whole-genome assembly. Some argued that the required 150-fold scale-up from the H. influenzae genome to the human genome with its complex repeat sequences was not feasible (25).
The Drosophila melanogaster genome was thus chosen as a test case for whole-genome assembly on a large and complex eukaryotic genome. In collaboration with Gerald Rubin and the Berkeley Drosophila Genome Project, the nucleotide sequence of the 120-Mbp euchromatic portion of the Drosophila genome was determined over a 1-year period (26–28).
The Drosophila genome-sequencing effort resulted in two key findings: (i) that the assembly algorithms could generate chromosome assemblies with highly accurate order and orientation with substantially less than 10-fold coverage, and (ii) that undertaking multiple interim assemblies in place of one comprehensive final assembly was not of value.
These findings, together with the dramatic changes in the public genome effort subsequent to the formation of Celera (29), led to a modified whole-genome shotgun sequencing approach to the human genome. We initially proposed to do 10-fold sequence coverage of the genome over a 3-year period and to make interim assembled sequence data available quarterly.
The modifications included a plan to perform random shotgun sequencing to ∼5-fold coverage and to use the unordered and unoriented BAC sequence fragments and subassemblies published in GenBank by the publicly funded genome effort (30) to accelerate the project. We also abandoned the quarterly announcements in the absence of interim assemblies to report.
Although this strategy provided a reasonable result very early that was consistent with a whole-genome shotgun assembly with eightfold coverage, the human genome sequence is not as finished as the Drosophila genome was with an effective 13-fold coverage.
However, it became clear that even with this reduced coverage strategy, Celera could generate an accurately ordered and oriented scaffold sequence of the human genome in less than 1 year. Human genome sequencing was initiated 8 September 1999 and completed 17 June 2000.
The first assembly was completed 25 June 2000, and the assembly reported here was completed 1 October 2000. Here we describe the whole-genome random shotgun sequencing effort applied to the human genome. We developed two different assembly approaches for assembling the ∼3 billion bp that make up the 23 pairs of chromosomes of the Homo sapiens genome.
Any GenBank-derived data were shredded to remove potential bias to the final sequence from chimeric clones, foreign DNA contamination, or misassembled contigs.
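The shredding step can be pictured as cutting each public sequence into fixed-length, overlapping pieces so that only the raw bases, and not the public assembly's ordering, are carried into the new assembly. The sketch below is one simple way to tile a sequence to a target coverage. The 550-bp fragment length comes from the text, but the even tiling scheme, the function name `shred`, and the default `fold` value are assumptions, since the excerpt does not describe how fragment offsets were chosen.

```python
import random


def shred(sequence: str, fragment_len: int = 550, fold: float = 2.0) -> list[str]:
    """Cut a sequence into overlapping fixed-length fragments.

    Fragment starts are spaced fragment_len / fold apart, so each base is
    covered about `fold` times away from the sequence ends.
    """
    if fragment_len <= 0 or fold < 1:
        raise ValueError("fragment_len must be positive and fold at least 1")
    if len(sequence) <= fragment_len:
        return [sequence]
    step = max(1, round(fragment_len / fold))
    last_start = len(sequence) - fragment_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the tail of the sequence is covered
    return [sequence[s:s + fragment_len] for s in starts]


# Synthetic stand-in for a public BAC-derived contig (not real data).
rng = random.Random(0)
bac_sequence = "".join(rng.choice("ACGT") for _ in range(5000))

fragments = shred(bac_sequence)
achieved_fold = sum(len(f) for f in fragments) / len(bac_sequence)
print(len(fragments), "fragments,", f"{achieved_fold:.2f}-fold coverage")
```

Because every fragment is re-placed by the assembler on the basis of its own overlaps, a chimeric or misassembled public contig contributes individual pieces rather than its incorrect overall structure.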
Insofar as a correctly and accurately assembled genome sequence with faithful order and orientation of contigs is essential for an accurate analysis of the human genetic code, we have devoted a considerable portion of this manuscript to the documentation of the quality of our reconstruction of the genome.
We also describe our preliminary analysis of the human genetic code on the basis of computational methods. Figure 1 (see fold-out chart associated with this issue; files for each chromosome can be found in Web fig. 1 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1) provides a graphical overview of the genome and the features encoded in it. The detailed manual curation and interpretation of the genome are just beginning.
To aid the reader in locating specific analytical sections, we have divided the paper into seven broad sections. A summary of the major results appears at the beginning of each section.
Sources of DNA and Sequencing Methods
Genome Assembly Strategy and Characterization
Gene Prediction and Annotation
A Genome-Wide Examination of Sequence Variations
An Overview of the Predicted Protein-Coding Genes in the Human Genome
Summary. This section discusses the rationale and ethical rules governing donor selection to ensure ethnic and gender diversity along with the methodologies for DNA extraction and library construction. The plasmid library construction is the first critical step in shotgun sequencing.
If the DNA libraries are not uniform in size, nonchimeric, and do not randomly represent the genome, then the subsequent steps cannot accurately reconstruct the genome sequence.
We used automated high-throughput DNA sequencing and the computational infrastructure to enable efficient tracking of enormous amounts of sequence information (27.3 million sequence reads; 14.9 billion bp of sequence).
Sequencing and tracking from both ends of plasmid clones from 2-, 10-, and 50-kbp libraries were essential to the computational reconstruction of the genome. Our evidence indicates that the accurate pairing rate of end sequences was greater than 98%.
Various policies of the United States and the World Medical Association, specifically the Declaration of Helsinki, offer recommendations for conducting experiments with human subjects.
We convened an Institutional Review Board (IRB) (31) that helped us establish the protocol for obtaining and using human DNA and the informed consent process used to enroll research volunteers for the DNA-sequencing studies reported here.
We adopted several steps and procedures to protect the privacy rights and confidentiality of the research subjects (donors). These included a two-stage consent process, a secure random alphanumeric coding system for specimens and records, circumscribed contact with the subjects by researchers, and options for off-site contact of donors.
In addition, Celera applied for and received a Certificate of Confidentiality from the Department of Health and Human Services. This Certificate authorized Celera to protect the privacy of the individuals who volunteered to be donors as provided in Section 301(d) of the Public Health Service Act 42 U.S.C. 241(d).
Celera and the IRB believed that the initial version of a completed human genome should be a composite derived from multiple donors of diverse ethnic backgrounds. Prospective donors were asked, on a voluntary basis, to self-designate an ethnogeographic category (e.g., African-American, Chinese, Hispanic, Caucasian, etc.). We enrolled 21 donors (32).
What is the Human Genome Project?
The Human Genome Project (HGP) was the international, collaborative research program whose goal was the complete mapping and understanding of all the genes of human beings. All our genes together are known as our “genome.”
The main goals of the Human Genome Project were first articulated in 1988 by a special committee of the U.S. National Academy of Sciences, and later adopted through a detailed series of five-year plans jointly written by the National Institutes of Health and the Department of Energy.
Congress funded both the NIH and the DOE to embark on further exploration of this concept, and the two government agencies formalized an agreement by signing a Memorandum of Understanding to “coordinate research and technical activities related to the human genome.”
James Watson was appointed to lead the NIH component, which was dubbed the Office of Human Genome Research. The following year, the Office of Human Genome Research evolved into the National Center for Human Genome Research.
In 1990, the initial planning stage was completed with the publication of a joint research plan, “Understanding Our Genetic Inheritance: The Human Genome Project, The First Five Years, FY 1991-1995.” This initial research plan set out specific goals for the first five years of what was then projected to be a 15-year research effort.
HGP researchers deciphered the human genome in three major ways: determining the order, or “sequence,” of all the bases in our genome's DNA; making maps that show the locations of genes for major sections of all our chromosomes; and producing what are called linkage maps, through which inherited traits (such as those for genetic disease) can be tracked over generations.
The HGP has revealed that there are probably about 20,500 human genes. This ultimate product of the HGP has given the world a resource of detailed information about the structure, organization and function of the complete set of human genes. This information can be thought of as the basic set of inheritable “instructions” for the development and function of a human being.
The International Human Genome Sequencing Consortium published the first draft of the human genome in the journal Nature in February 2001 with the sequence of the entire genome's three billion base pairs some 90 percent complete.
More than 2,800 researchers who took part in the consortium shared authorship. A startling finding of this first draft was that the number of human genes appeared to be significantly fewer than previous estimates, which ranged from 50,000 genes to as many as 140,000.
The full sequence was completed and published in April 2003.
Upon publication of the majority of the genome in February 2001, Francis Collins, then director of the National Human Genome Research Institute, noted that the genome could be thought of in terms of a book with multiple uses: “It's a history book – a narrative of the journey of our species through time. It's a shop manual, with an incredibly detailed blueprint for building every human cell. And it's a transformative textbook of medicine, with insights that will give health care providers immense new powers to treat, prevent and cure disease.”
The tools created through the HGP also continue to inform efforts to characterize the entire genomes of several other organisms used extensively in biological research, such as mice, fruit flies and flatworms.
These efforts support each other, because most organisms have many similar, or “homologous,” genes with similar functions. Therefore, the identification of the sequence or function of a gene in a model organism, for example, the roundworm C. elegans, has the potential to explain a homologous gene in human beings, or in one of the other model organisms.
Of course, information is only as good as the ability to use it. Therefore, advanced methods for widely disseminating the information generated by the HGP to scientists, physicians and others are necessary in order to ensure the most rapid application of research results for the benefit of humanity. Biomedical technology and research are particular beneficiaries of the HGP.
However, the momentous implications for individuals and society for possessing the detailed genetic information made possible by the HGP were recognized from the outset.
Another major component of the HGP – and an ongoing component of NHGRI – is therefore devoted to the analysis of the ethical, legal and social implications (ELSI) of our newfound knowledge, and the subsequent development of policy options for public consideration.
Lessons from the Human Genome Project
by Rebecca Fine
figures by Elayne Fivenson
The Human Genome Project, one of the most ambitious scientific projects ever undertaken, achieved a monumental goal: sequencing the entire human genome. Since its completion in 2003, this project has laid the groundwork for thousands of scientific studies associating genes with human diseases.
DNA and the genome: a primer
First, let’s talk a little bit about terminology. DNA is a molecule that carries genetic information. It is made up of four types of smaller molecules, referred to as “bases”: adenine (A), thymine (T), cytosine (C), and guanine (G). The order of these bases provides instructions for assembling the essential building blocks of life.
A gene is a segment of DNA that contains instructions for one of these building blocks, such as a single protein. A genome, in contrast, is a complete set of DNA instructions, including all of a person’s genes. In humans, the genome consists of 3 billion bases. All humans share about 99.9% of this genome, and the remainder is variable (and 0.1% of 3 billion is still 3 million bases – nothing to sneeze at!). A spot in the genome that can differ between people (e.g., where some people have an A and others have a G) is called a single nucleotide polymorphism, or SNP (Figure 1).
The version of a SNP a person has is called their genotype, and these small genetic differences are part of what makes people unique.
Figure 1. Single nucleotide polymorphisms. A SNP in three different people, where each person has a different base at the same spot in the genome.
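For readers who like to see the idea in code, here is a small Python sketch of what Figure 1 describes: three people whose DNA is identical except at one position. The sequences and the function name `variable_sites` are made up for illustration, and the sketch assumes the sequences are already lined up base for base.

```python
def variable_sites(aligned: dict[str, str]) -> dict[int, dict[str, str]]:
    """Return positions where aligned sequences differ, with each person's base."""
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    sites = {}
    for pos in range(lengths.pop()):
        bases = {person: seq[pos] for person, seq in aligned.items()}
        if len(set(bases.values())) > 1:
            sites[pos] = bases
    return sites


# Made-up example sequences, identical except at index 4.
people = {
    "person_1": "ATGCCGTA",
    "person_2": "ATGCTGTA",
    "person_3": "ATGCAGTA",
}

snps = variable_sites(people)
print(snps)

# The back-of-envelope figure from the text: 0.1% of 3 billion bases.
variable_bases = 3_000_000_000 // 1000
print(variable_bases)
```

Each person's base at the variable position (C, T, or A in this toy example) is their genotype at that SNP.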
The Human Genome Project: decoding our DNA, one base at a time
The Human Genome Project (HGP), which began in 1990, was a massive international effort carried out by twenty research centers and universities in six countries.
The primary goal of this project was to determine the order of all 3 billion bases in the entire human genome; this process is called sequencing. You can think of sequencing as assembling a puzzle. First, scientists collect a biological sample, such as saliva or blood.
Then, they make lots of copies of the DNA in the sample and break those copies into many smaller, overlapping pieces (Figure 2). The order of bases, or the sequence, in each of those pieces can then be determined by a series of chemical reactions.
The DNA must be broken into pieces prior to sequencing because these reactions can only read short DNA strands, typically less than 1000 bases.
Figure 2. Shotgun whole genome sequencing. This image shows an overview of the shotgun sequencing procedure, where DNA is copied, broken, sequenced, and computationally analyzed to figure out the original genetic sequence.
Next, the challenge is to assemble the pieces of this “puzzle” into the correct order, using the overlapping sequences from each piece as a guide.
This is a difficult computational problem, especially for a genome containing 3 billion bases! Scientists also knew this problem would be even more complicated in humans than in other organisms because the human genome contains many highly repetitive sequences (e.g., patterns such as AGAGAGA or TTTTTTT).
Using overlaps to guide reconstruction of the genome is especially challenging in these types of regions – imagine a puzzle in which many of the pieces are shaped almost identically.
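The puzzle analogy can be made concrete with a toy example. The Python sketch below repeatedly merges the two pieces that share the longest overlap, which is a simple "greedy" strategy chosen here for illustration. It is not the method used by the HGP or by any production assembler, and the reads and function names are invented. The last lines show why repeats are troublesome: two reads from a repetitive region can be overlapped in several equally valid ways.

```python
def all_overlaps(a: str, b: str, min_len: int = 3) -> list[int]:
    """Lengths k for which the last k bases of a equal the first k bases of b."""
    longest = min(len(a), len(b))
    return [k for k in range(min_len, longest + 1) if a.endswith(b[:k])]


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is a prefix of b (0 if none)."""
    return max(all_overlaps(a, b, min_len), default=0)


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)


# Three invented overlapping reads taken from one short sequence.
reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]
assembled = greedy_assemble(reads)
print(assembled)

# Reads from a repetitive region overlap at several different offsets.
ambiguous = all_overlaps("TTAGAGAG", "AGAGAGCC", min_len=2)
print(ambiguous)
```

In the repetitive case the overlap could be 2, 4, or 6 bases long, so the overlap alone does not say how many copies of "AG" the original sequence contained. This is the situation in which many puzzle pieces are shaped almost identically.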
Another major goal of the project was to determine how many genes we actually have in our genome. Previous estimates ranged widely, with some scientists believing there might be up to 100,000 genes.
The HGP found that, in fact, humans have only about 20,000-25,000 genes (current estimates peg this to the lower end of that range).
This number was quite a surprise to many scientists – many other organisms, such as rice and water fleas, actually have many more genes than we do! This was an important lesson for genetics: the complexity of an organism is not necessarily correlated with how many genes it has.
Genome-wide association studies: how does genetics relate to common diseases?
The Human Genome Project made it possible to ask and address new types of scientific questions. One example of such an important question is determining which SNPs increase or decrease risk for a given disease (recall that SNPs are genetic bases which can differ between people).
Before the HGP, if scientists wanted to answer this type of question, they could only realistically focus on a few small regions of the genome at a time.
Now, it would theoretically be possible to sequence many people with and without a disease and systematically test each base in the genome, asking: is one version of a SNP more common in people who have the disease? This type of study design is called a genome-wide association study (GWAS) (Figure 3).
One of the important considerations for GWAS is cost efficiency, as sequencing the entire genome is still far too expensive to perform on large numbers of people. Therefore, scientists often use a cheaper approach: selecting hundreds of thousands of known SNPs ahead of time and testing each individual’s genotype at only those SNPs.
Figure 3. Genome-wide association studies. This image depicts an overview of the genome-wide association study (GWAS) procedure, where scientists collect DNA information from patients and healthy controls and then systematically test for SNPs that are associated with having the disease of interest.
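To give a feel for the question a GWAS asks at each SNP ("is one version more common in people who have the disease?"), here is a minimal Python sketch using a chi-square test on allele counts. The article does not say which statistical test is used, so the chi-square test is an assumption chosen because it is a common and simple option. All counts are invented, and the function name and the Bonferroni-style threshold are illustrative only.

```python
import math


def allele_chi_square(case_alt: int, case_ref: int,
                      ctrl_alt: int, ctrl_ref: int) -> tuple[float, float]:
    """Chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    observed = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][c] + observed[1][c] for c in range(2)]
    total = sum(row_totals)
    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / total
            if expected == 0:
                raise ValueError("a row or column of the table is empty")
            chi2 += (observed[r][c] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented allele counts: 1,000 cases and 1,000 controls, two alleles each.
chi2, p_value = allele_chi_square(case_alt=600, case_ref=1400,
                                  ctrl_alt=500, ctrl_ref=1500)

# Testing many SNPs at once calls for a stricter cutoff (assumed 500,000 SNPs).
n_snps_tested = 500_000
threshold = 0.05 / n_snps_tested
significant = p_value < threshold

print(f"chi-square = {chi2:.2f}, p = {p_value:.2g}, significant: {significant}")
```

In this made-up example the allele is noticeably more common in cases, yet the result would not pass the stricter cutoff once hundreds of thousands of SNPs are tested. That is one reason GWAS needs very large numbers of participants to detect SNPs with small effects.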
Scientists had previously been fairly successful at determining the genes that cause many rare and severe diseases, such as cystic fibrosis and sickle-cell anemia.
For these types of diseases, often a single SNP with an extremely strong effect could be pinpointed (though it’s important to note that gene discovery does not immediately translate into therapeutic drug development – it is only the first step of a long and complex process).
It seemed natural to hope that GWAS would prove similarly effective at determining the genetic basis for more common diseases, such as heart disease, diabetes, inflammatory bowel disease, and schizophrenia.
In the first years of GWAS, however, it became apparent that matters would not be so simple: the findings suggested that a very large number of genes – for some traits, easily into the hundreds or perhaps even thousands – might have effects on a given disease. Moreover, these effects tended to be very small for each SNP (for example, a given SNP that affects risk for obesity is usually associated with gaining only a fraction of a pound).
This conceptual discovery has been an important advance in our understanding of human biology. In the context of drug development, this finding means that targeting a single gene with a drug may not cure all people with a particular disease; scientists are working to use information gained from GWAS to develop and improve therapeutic treatments.
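To see why many tiny effects can still matter, here is a toy calculation. It assumes, purely for illustration, that SNP effects simply add up; the number of SNPs and the effect size are invented and not taken from any study.

```python
def additive_score(risk_allele_counts, effect_sizes):
    """Sum of (number of risk alleles carried) x (effect per allele) over all SNPs."""
    return sum(count * effect for count, effect in zip(risk_allele_counts, effect_sizes))

# Hypothetical: 100 SNPs, each adding 0.1 pounds per copy of the risk allele.
effects = [0.1] * 100

one_snp_only = [1] + [0] * 99   # carries a single risk allele at one SNP
one_at_every_snp = [1] * 100    # carries one risk allele at every SNP

single_effect = additive_score(one_snp_only, effects)        # about 0.1 pounds
combined_effect = additive_score(one_at_every_snp, effects)  # about 10 pounds
print(single_effect, combined_effect)
```

Any one SNP barely moves the total, which is why a drug aimed at a single gene may have limited effect, yet the combined contribution across many SNPs can be substantial.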
GWAS in the present: where are we now?
Fast-forward sixteen years from the completion of the HGP, and genomics has moved at a speed no one could have predicted. In recent years, one of the most significant developments in human genetics has been a resource called the UK Biobank.
This is a massive dataset consisting of genotype information (which can be used for GWAS) from about 500,000 human volunteers. Each participant also provides a veritable treasure trove of health data, ranging from basic information such as height and weight to dietary questionnaires and disease status (a total of over 2,400 traits!).
This resource has revolutionized genomics, not only because of the huge sample size and detailed medical information, but also because the data is freely accessible to any scientist who applies to use it. As a result, the genetic analysis of the UK Biobank data has essentially been crowdsourced to scientists all over the world.
The impact of this is clear from the numbers – since UK Biobank’s initial release in 2015, almost 600 papers have analyzed it, with countless new studies on the way.
Modern genomics is a triumph of collaborative science and shows how much there is to gain with large-scale, collective projects. Eighteen years ago, we didn’t even have the complete human genome sequence.
Now, we have a resource of genetic data from 500,000 people that is open to any scientist who applies, on top of the millions of other people from whom genetic information has been collected for other studies. And GWAS is only one example of the type of research enabled by the HGP; there are countless other scientific fields that have sprung up in its wake.
To name just a few other ongoing efforts, researchers have developed tests for genetic diseases, created a large catalogue of genetic abnormalities observed in many different types of cancers, studied DNA of ancient hominids to better understand human evolution, and developed ever-improving and ever-cheaper sequencing methods. For a great example of human genetics directly contributing to therapeutic development, check out this story about PCSK9 and cholesterol. All of these endeavors have helped us better understand human biology and have improved medical research. The Human Genome Project set in motion genomic research on a scale that would have been hard to imagine in 2001, and the field shows no sign of slowing down anytime soon.
The Human Genome Project—discovering the human blueprint
Although every person on our planet is built from the same blueprint, no two people are exactly the same. While we are similar enough to readily distinguish ourselves from other living creatures, we also celebrate our individual uniqueness. So what is it that makes us all human, yet unique? Our DNA.
Our DNA (deoxyribonucleic acid) is found in the nucleus of every cell in our body (apart from red blood cells, which don’t have a nucleus). DNA is a long molecule, made up of lots of smaller units. To make a DNA molecule you need:
- nitrogenous bases, of which there are four: adenine (A), thymine (T), cytosine (C) and guanine (G)
- five-carbon sugar molecules (deoxyribose)
- phosphate molecules
If you take one of the four nitrogenous bases, and put it together with a sugar molecule and a phosphate molecule, you get a nucleotide. The sugar and phosphate molecules connect the nucleotides together to form a single strand of DNA.
Two of these strands then wind around each other, making the twisted ladder shape of the DNA double helix. The bases pair up to make the rungs of the ladder, and the sugar and phosphate molecules make the sides. The bases pair up together in specific combinations: A always pairs with T, and C always pairs with G to make base pairs.
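Because the pairing rules are fixed, knowing one strand is enough to work out its partner. The short sketch below illustrates this; the function name and the example sequence are made up for the illustration.

```python
# Pairing rules from the text: A pairs with T, and C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Return the base that pairs with each base of the given strand, in order."""
    return "".join(PAIRS[base] for base in strand)

partner = complement("ATCG")
print(partner)
```

Applying the rule twice gives back the original strand, which is part of what makes DNA straightforward to copy.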
Put three billion of these base pairs together in the right order, and you have a complete set of human DNA—the human genome. This amounts to a DNA molecule about a metre long.
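The "about a metre" figure can be checked with quick arithmetic. The sketch assumes the standard textbook spacing of roughly 0.34 nanometres between neighbouring base pairs along the helix, a value that does not come from this article.

```python
BASE_PAIRS = 3_000_000_000  # three billion base pairs, as stated above
RISE_PER_BASE_PAIR_NM = 0.34  # assumption: textbook spacing per base pair, in nanometres

# Convert nanometres to metres (1 nm = 1e-9 m).
length_m = BASE_PAIRS * RISE_PER_BASE_PAIR_NM * 1e-9
print(f"{length_m:.2f} m")
```

That works out to just over one metre of DNA for a single copy of the genome.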
It’s the order in which the base pairs are arranged in our DNA, their sequence, that provides the blueprint for all living things and makes us what we are. The sequence of base pairs in a fish’s DNA is different from that in a monkey’s.
The base pair sequence of all people is nearly identical—that’s what makes us all humans.
However, there are small differences in the order of the three billion base pairs in everyone’s DNA that cause the variations we see in hair colour, eye colour, nose shape etc.
No two people have exactly the same DNA sequence (except for identical twins, because they came from a single egg that split into two, forming two copies of the same DNA).
We get our DNA from our parents. The DNA of the human genome is broken up into 23 pairs of chromosomes (46 in total). We receive 23 from our mother and 23 from our father. Egg and sperm cells have only one copy of each chromosome, so that when they come together to form a baby, the baby has the usual two copies.
Three billion is a lot of cats to herd
Three billion is a lot of base pairs, and together they contain an enormous amount of information. If they were all written out as a list, they would fill around 10,000 epic fantasy-novel-sized (think Game of Thrones thickness) books. They aren’t just random lists of information though.
Rather, within this long string, there are distinct sections of DNA that affect a particular characteristic or condition. These stretches of DNA are known as genes. Their base pair sequence specifies the order of the amino acids that join together to make a protein.
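As a rough illustration of how a base sequence spells out a protein, the sketch below reads a DNA string three bases at a time and looks up each triplet (a "codon") in the standard genetic code. It is heavily simplified: real cells first copy the gene into RNA, and many genes contain stretches that are cut out before translation. The example sequence is invented.

```python
# Standard genetic code, with codons ordered by bases T, C, A, G at each position.
# One-letter amino acid codes; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(dna):
    """Translate a coding DNA sequence into amino acids, stopping at a stop codon."""
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical mini-gene: ATG (M), GCC (A), TGG (W), then TAA (stop).
example_protein = translate("ATGGCCTGGTAA")
print(example_protein)
```

Changing a single base can change which amino acid is added, which is one way a small difference in DNA can alter a protein.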
Some genes are small, only around 300 base pairs, and others contain over one million.
Genes make up only around 1.5 per cent of our DNA. The rest initially didn’t appear to have any specific purpose, and was dubbed ‘junk DNA’.
Turns out, though, that at least some of this ‘junk’ is actually pretty useful—it’s used to define where some genes start and finish, and to regulate how the genes behave.
While most of the junk DNA comes from copies of virus genomes that invaded our distant ancestors, new studies suggest much of this DNA may have also gained functions during our evolution.