Revolutions in DNA sequencing technology have opened insights into the taxonomy and function of the teeming masses of microbes living in the guts of mammals. Identifying and characterizing those microbes is daunting because of the sheer volume of microbial matter present in the gut. The human gut microbiome, the totality of the genetic material of the microorganisms living in the small and large intestines, has been estimated to hold approximately 22 million unique microbial genes. Sequencing that can distinguish individual bacterial species indicates that between 150 and 200 different species of bacteria live in the healthy human gut (1). With the knowledge gained from microbiome studies, health practitioners are better equipped to diagnose conditions such as Small Intestinal Bacterial Overgrowth (SIBO), Inflammatory Bowel Disease (IBD), intestinal permeability (“leaky gut”), and other gastrointestinal disorders.
As the microbiome becomes better understood, health practitioners stand to gain better diagnostic tools and more targeted therapies for people who suffer from a wide range of health issues. Studies are already showing that probiotic supplementation can affect emotions, immunity, and multiple metabolic processes. As analysis of the microbiome becomes more refined, researchers gain the detail they need to develop medicinal, nutritional, and other approaches for resolving microbiome imbalances and improving human health. Analysis of a person’s gut microbiome could allow a customized therapy that is specific to that person and could address multiple health issues in that individual simultaneously. Health practitioners in many different fields stand to gain from research into the microbes living within us.
Studying gut microbiomes has become an interdisciplinary practice, partly because hardware and software are now used everywhere to characterize the microbiome, helping to determine the taxonomy, quantity, and potential functionality of its members. Databases are necessary to store the ever-growing libraries of output from various kinds of DNA sequencing, and they serve as the references against which new samples are matched for identification. This field has grown so large that a whole webpage is dedicated to just listing hundreds of potential reference databases.
The ideal situation is for a reference database to produce a match to your sample. However, that is not always the case. There is the “dark matter” of the microbiome, the as-yet unclassified microbes. Many of the newer technologies, methodologies, databases, and analysis software have arisen from the need to identify something previously unidentified. The saving grace seems to be, “if it’s close enough to something else, we can explore from there.” Many current platforms using “next generation sequencing” (NGS) support “de novo” assembly of a sample, that is, assembly without the need for any reference sequences. The short sequenced fragments (reads) are stitched together wherever they overlap into longer contiguous stretches called contigs, and the quality of the output (identification of a novel genome) depends on how long and continuous those contigs are and how many gaps remain between them.
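One common way to put a number on contig continuity is the N50 statistic. The article does not name this metric, so treat the following as an illustrative sketch: the function name and the toy contig lengths are assumptions made for the example.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    cover at least half of all assembled bases.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Two hypothetical assemblies of the same 1,000 bases.
fragmented = [100] * 10        # many short contigs, many gaps
contiguous = [600, 300, 100]   # a few long contigs

print(n50(fragmented))   # 100
print(n50(contiguous))   # 600
```

Both toy assemblies contain the same number of bases, but the second one has a much higher N50, which is the kind of continuity that makes a novel genome easier to characterize.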
Studying, classifying, and characterizing the entire genomes of the microbial inhabitants of a bulk sample (such as the gut microbiome) is the field of metagenomics. The term metagenomics has also become shorthand for the set of tools and techniques the field uses in its studies. Huge strides have been made in both the field and its tools. However, this burgeoning field continually bumps into the constraints of computing power and manpower.
Overcoming manpower constraints
The advent of DNA sequencing relieved, but did not eliminate, some manpower constraints in microbe investigations. Previously, microbes were identified by culturing a sample in a petri dish under very controlled conditions and identifying the result by visual inspection. This demanded a great deal of time and manpower. As technology moved along, cultured samples could be analyzed by software. However, the biggest time-saver came with the revolution in DNA sequencing. Microbial samples that were unculturable, slow growing, or just difficult to culture could now be investigated by sequencing their DNA.
The basic idea of DNA sequencing is to determine the order of nucleotides (called bases) in a DNA molecule. A simplistic summary of the steps is: fragment the DNA, prepare a library of the fragments, make copies of each fragment, sequence them, and use computer analysis to find the overlapping ends of the fragments and piece the sequence back together. Previously, this technique was focused on a single gene, the 16S ribosomal RNA gene, which is common to bacteria and archaea. However, so-called “shotgun” sequencing, which sequences random fragments from all of the DNA in a sample, can give a picture of the entire genomes of the microbes. The short fragments are assembled into longer stretches called “contigs,” which allow full or partial gene sequences to be predicted by computer programs, extracted, and then matched to a reference database in which the functions of the genes have already been worked out. Note that gene prediction for a metagenome is a level of complexity higher than gene prediction for an individual genome.
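The "find the overlapping ends" step can be sketched with a toy greedy merger. This is a minimal illustration under simplifying assumptions (error-free reads, a single strand, no repeats, made-up sequences), not how production assemblers work; the function names are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap.

    Returns the list of contigs left when no overlaps remain.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return reads
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)


# Three made-up reads that tile a 15-base toy sequence.
reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(reads))   # ['ATGGCGTACGTTAGC']
```

When reads do not overlap at all, the function returns them as separate contigs, which is the toy equivalent of gaps in an assembly.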
Metagenomic whole genome sequencing (mWGS) yields comprehensive data on every gene and all chromosomes in the DNA of all microbes present in the microbiome, all at one time. Considered a type of “Next Generation Sequencing” (NGS) or “Second Generation Sequencing,” it takes advantage of greater computing power by processing many DNA strands in parallel. The ability to capture all this data at once is key to piecing together the encoded functional potential of the metagenome and the many roles it plays in the health of the host. None of this would be possible without massive computing power.
Overcoming computing constraints
Another constraint that metagenomics contends with is computing power. There is no doubt that the huge advances in this field have been made possible by advances in hardware and software. Early NGS sequencers such as the Roche GS-FLX could deliver only about 400-600 million DNA bases of sequence per run, whereas current sequencers such as the Illumina NovaSeq can deliver close to 6 terabases (1), each terabase being 10^12 bases!
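To put the 6-terabase figure in perspective, here is a back-of-the-envelope calculation. The genome size (about 5 million bases, roughly typical for a bacterium) and the 30x sequencing depth are assumptions chosen for illustration, not figures from the cited source.

```python
TERABASE = 10**12  # bases per terabase, as stated in the text

run_output_bases = 6 * TERABASE   # "close to 6 terabases" per run
assumed_genome_size = 5_000_000   # assumption: ~5 Mb bacterial genome
assumed_coverage = 30             # assumption: each base read ~30 times


def genomes_per_run(run_bases, genome_size, coverage):
    """How many genomes of a given size fit in one run at a given depth."""
    return run_bases // (genome_size * coverage)


print(genomes_per_run(run_output_bases, assumed_genome_size, assumed_coverage))
# 40000
```

Under those assumptions a single run holds enough sequence for tens of thousands of bacterial genomes, which is why downstream analysis, rather than data generation, tends to become the bottleneck.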
Given the amount of data these investigations produce, another constraint is the ability to analyze extremely large data sets in a short time. Because most analyses require matching a sample against a reference database, that matching step can become a big bottleneck. Various strategies have been employed to streamline databases and overcome this constraint.
One strategy to overcome database size issues is to build the reference database from only the genetic sequences that differentiate one species from another. In other words, instead of matching to what makes the population similar, match to the changes that make each organism different from the rest. A possibly more powerful strategy is to break sequences into small units that can be grouped, such as the k-mer, a short subsequence of fixed length k. The Kraken taxonomic profiling tool separates the reference genomes into k-mers and associates each k-mer with “the lowest taxonomic rank that represents all the genomes in which it can be found,” which is called the “lowest common ancestor” (LCA) concept (1). The k-mer is then the unit used for matching.
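The k-mer and LCA idea can be sketched in a few lines. This is a simplified illustration, not Kraken's implementation: the tiny taxonomy, the made-up genome strings, and the majority-vote classification rule are all assumptions for the example (Kraken itself uses a more refined scoring over the taxonomy tree).

```python
from collections import Counter

# Hypothetical toy taxonomy, stored as child -> parent.
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Escherichia": "Bacteria",
    "E. coli": "Escherichia",
    "E. fergusonii": "Escherichia",
    "Bacteroides": "Bacteria",
    "B. fragilis": "Bacteroides",
}


def lineage(taxon):
    """Path from a taxon up to the root."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors:
            return taxon
    raise ValueError("taxa do not share a root")


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(genomes, k):
    """Map each k-mer to the LCA of every genome that contains it."""
    index = {}
    for taxon, seq in genomes.items():
        for kmer in kmers(seq, k):
            index[kmer] = lca(index[kmer], taxon) if kmer in index else taxon
    return index


def classify(read, index, k):
    """Assign a read to the taxon hit by most of its k-mers (simplified)."""
    hits = Counter(index[km] for km in kmers(read, k) if km in index)
    return hits.most_common(1)[0][0] if hits else None


# Made-up "genomes", far shorter than real ones.
GENOMES = {
    "E. coli": "ACGTACGGA",
    "E. fergusonii": "ACGTACCTT",
    "B. fragilis": "TTGACGTAA",
}
INDEX = build_index(GENOMES, k=4)

print(INDEX["ACGT"])                    # Bacteria (found in all three)
print(INDEX["GTAC"])                    # Escherichia (both Escherichia species)
print(classify("TACGGA", INDEX, k=4))   # E. coli
```

A k-mer shared by every genome is only informative at a high rank, while a k-mer unique to one genome pins a read to a species, so a single lookup table covers both cases.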
Mapping higher-level functions
Of particular interest to the study of the microbiome is trying to match genetic material in the metagenome to higher level functions in the organism. Functional annotation, that is, assigning functions to genes, has benefited greatly from the ability to process genetic information through multiple platforms.
Protein sequences produced by various software products can be compared against a database that has already merged protein sequences with their known functions. For example, UniProtKB, which combines functional annotations with protein sequences, was merged with GenBank’s genomic sequences in the HUMAnN reference databases (1). The main hindrance in using this kind of reference database is that the underlying algorithms can introduce biases that are difficult to identify without knowing exactly what the algorithm is doing computationally. Open source software helps tremendously here: not only can you see in the code exactly how a data set is derived, but with sufficient knowledge, you can help change it.
A current area of research focus is the identification of pathways, in which groups of genes may be involved in higher-level functions of the organism. One example is finding all the genes in a metagenome that contribute to making secondary bile acids (1). An extreme amount of complexity is introduced when individual genes contribute to more than one pathway or when the genes that make up one particular pathway come from different taxa. HUMAnN 3 again comes to the rescue, estimating a single measure of microbial pathway strength from metagenomic data using a type of grouping called gene set analysis (GSA).
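The two complications named above, genes shared between pathways and pathways whose genes come from different taxa, can be made concrete with a toy scoring function. This sketch is not HUMAnN's algorithm: the "weakest link" rule (a pathway scores as high as its least abundant gene), the pathway definitions, and every abundance value are assumptions invented for illustration.

```python
from collections import defaultdict

# Hypothetical gene sets; gene3 deliberately belongs to both pathways.
PATHWAYS = {
    "pathway_X": ["gene1", "gene2", "gene3"],
    "pathway_Y": ["gene3", "gene4"],
}

# Hypothetical abundances keyed by (taxon, gene).
ABUNDANCE = {
    ("taxon_A", "gene1"): 10.0,
    ("taxon_B", "gene2"): 4.0,
    ("taxon_A", "gene3"): 6.0,
    ("taxon_B", "gene3"): 2.0,
}


def gene_totals(abundance):
    """Sum each gene's abundance across all contributing taxa."""
    totals = defaultdict(float)
    for (_taxon, gene), value in abundance.items():
        totals[gene] += value
    return dict(totals)


def score_pathways(abundance, pathways, taxon=None):
    """Score each pathway by its least abundant member gene.

    With `taxon` set, only that taxon's genes are counted.
    """
    if taxon is not None:
        abundance = {key: v for key, v in abundance.items() if key[0] == taxon}
    totals = gene_totals(abundance)
    return {
        name: min(totals.get(gene, 0.0) for gene in genes)
        for name, genes in pathways.items()
    }


print(score_pathways(ABUNDANCE, PATHWAYS))
# {'pathway_X': 4.0, 'pathway_Y': 0.0}
print(score_pathways(ABUNDANCE, PATHWAYS, taxon="taxon_A"))
# {'pathway_X': 0.0, 'pathway_Y': 0.0}
```

In this toy data, pathway_X is complete only at the community level, because taxon_A lacks gene2 and depends on taxon_B to supply it. That is the kind of cross-taxon contribution that makes pathway analysis harder than gene-by-gene annotation.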
Potential drivers of growth in the field
One of the current problems facing the field of metagenomics is that the methods chosen to extract and sequence metagenomic data do not always work in reverse and map back to their original environmental milieu. The desire to be able to do this is pushing the evolution of software and techniques to preserve more context with the sample. This could not only help characterize new microbial species (the more information, the better) but also help get a sense of the microbiome as a whole.
Another new area of interest that is pushing boundaries is “spatially resolved microbiomes.” In the gut, the bacterial makeup at the surface facing the lumen can be entirely different from that in the deep layers of the mucosal barrier, so getting a grip on the gradations between those layers can help with predictive power and with determining pathways. One method, HiPR-FISH, tags genetic material in a way that creates a type of pattern, analogous to a barcode. Using this method, researchers were able to determine that spatial relationships between bacteria can be disrupted by antibiotics.
Advances in computing power and the advent of second-generation sequencing have spurred advances in metagenomics. Next to come is the impact of third-generation sequencing, whose capabilities have not yet been fully realized. The third generation reduces the manpower bottleneck even more than the second generation did, by freeing the process from the need for PCR amplification and by making long reads possible. Considering the secrets of the microbiome that have already been unlocked, who can guess at what marvels will be discovered in the next few years, allowing practitioners to create ever-improving customized treatment plans for patients?
- Yen S, Johnson JS. Metagenomics: a path to understanding the gut microbiome. Mammalian Genome [Internet]. Springer US; 2021 [cited 2022 Jul 31]. Available from: https://link.springer.com/article/10.1007/s00335-021-09889-x