DNA Metabarcoding: A New Approach for Rapid Biodiversity Assessment

Pavan-Kumar A*, Gireesh-Babu P and Lakra WS

Characterizing biodiversity is important for understanding ecological processes on earth. Recent advances in molecular techniques have made it possible to identify species composition more efficiently than with traditional methods. In DNA metabarcoding, pooled genomic DNA extracted from environmental samples is used to amplify evolutionarily conserved genes with universal primers, and the amplicons are sequenced using next generation sequencing technologies. This brief review discusses the concept of DNA metabarcoding and its applications, limitations and challenges.

Keywords: Biodiversity; DNA metabarcoding; Next Generation Sequencing; Environmental DNA


The extant biodiversity is the result of several million years of evolution of life on earth. Biodiversity is a key component of an ecosystem and plays a major role in its proper functioning. Several factors such as climate change, habitat loss and invasive species are disturbing the biotic components of ecosystems, thereby adversely affecting ecosystem function and services [1]. Effective management measures for restoring a degraded ecosystem can be taken if information about indicator species and about the abundance and pattern of species diversity in that ecosystem is available. Traditional morphological and meristic tools for characterizing and assessing biodiversity demand highly skilled personnel and have limitations in identifying cryptic species. With the advent of molecular biology, DNA based species identification methods have been devised using molecular markers (mitochondrial and nuclear). Over the last decade, taxonomically informative genes have been tested across large groups of organisms for their efficiency in delimiting species and have been designated as barcode genes for the respective groups (animals: mitochondrial cytochrome c oxidase subunit I [2]; fungi: nuclear ribosomal internal transcribed spacer [3]; plants: two chloroplast genes, rbcL and matK [4]; bacteria: 16S rRNA and the protein coding chaperonin-60, cpn60 [5-6]). The success of this approach resulted in the creation of large reference databases that include species taxonomic details along with DNA barcode gene sequences (Fish-BOL, BOLD, MarBOL, QBOL etc.). These reference barcode sequence databases are useful for assigning a taxon to an unknown specimen by comparing the barcode gene sequence of the specimen with the reference database. Until recently, most barcoding studies were aimed at developing reference databases by generating species specific DNA barcodes from individual specimens.
However, it is also important to characterize and assess species diversity and abundance within an ecosystem as a whole in order to understand spatial and temporal changes in species diversity [7]. The conventional DNA barcoding approach based on Sanger sequencing can identify only one specimen at a time and cannot identify multiple species if the sample contains a mixture of different species. With the advancements in sequencing technology, it is now possible to assess the species composition of ecosystems, including environmental samples such as soil, sediment and water, in a single run rather than by screening individual specimens one at a time.

DNA Metabarcoding

Taberlet et al. [8] introduced the term DNA metabarcoding to designate high-throughput multispecies identification using the total or typically degraded DNA extracted from an environmental sample or from bulk samples of entire organisms. The multispecies identification technique was originally applied to microbial communities [9] under the name of metagenomics, and it is now being applied to eukaryotic organisms such as fungi [10], invertebrates [11], plants [12] and vertebrates [13-15]. Metabarcoding differs from metagenomics in several ways. Metagenomics refers to the study of all genomes within a particular ecosystem, whereas metabarcoding aims to study a single gene or a small subset of genes. From a methodological point of view, the metagenomics approach involves preparing shotgun (random) libraries for sequencing, while metabarcoding is based on amplicon sequencing. The metagenomics approach is generally used to gain insight into the interactions between species within an ecosystem (taxonomic and functional information). The metabarcoding approach is mainly used to document and characterize species diversity in the ecosystem, and it can provide better coverage for identifying rare taxa within an ecosystem.

DNA Metabarcoding methodology

Selection of Next Generation Sequencing (NGS) technology

A series of high-throughput sequencing technologies based on different chemistries and detection techniques have been introduced commercially. All these NGS technologies can generate hundreds of thousands to millions of sequencing reads in parallel. This massively parallel sequencing capacity can generate sequence reads from fragmented libraries of a specific genome (i.e. genome sequencing) or from a pool of PCR amplified molecules (i.e. amplicon sequencing, [16]). The metabarcoding approach relies on this technology, in which a large number of amplicons of a taxonomically informative (barcode) gene can be sequenced without the need for cloning [8]. A comparison of currently available NGS platforms is given in Table 1. Until now, most DNA metabarcoding studies have used the Roche 454 FLX platform because of its long read length and relatively short run time. However, this platform cannot read homopolymers accurately and may produce erroneous sequences. This problem has been largely alleviated by applying bioinformatic tools that filter out erroneous sequences [17]. Other sequencing technologies such as the Illumina, SOLiD and Ion Torrent platforms can read homopolymers accurately, but their read length is relatively short. Selection of the appropriate sequencing methodology depends on the question to be addressed and on the length of the fragment (amplicon). If the amplicon is short (100-200 bp), the Illumina and Ion Torrent platforms are appropriate, whereas for long amplicons the Roche GS FLX is more useful.

Table 1: Comparison of available NGS technologies.

PCR amplification of DNA

A specific genomic region can be amplified from DNA extracted from a sample that contains a mixture of species. Although any part of the genome can in principle be used to delimit species, several features should be considered before selecting a DNA fragment for metabarcoding studies: mutation rate (rate of molecular evolution), availability of universal primers, a short sequence with sufficient phylogenetic signal and availability of a comprehensive taxonomic reference database. Some researchers have used short fragments of the nuclear 18S and 28S ribosomal markers for metazoans [18-19], but these regions may underestimate species diversity because they evolve slowly compared to mitochondrial markers [20-22]. A short fragment of the mitochondrial 12S ribosomal gene has been used successfully for delineating metazoans; however, taxonomic reference databases are limited for this marker compared to cytochrome c oxidase subunit I [18]. The partial mitochondrial cytochrome c oxidase I gene (COI) has been adopted as the standard barcode gene for most animal groups [3], and this gene has the best represented taxonomic reference database in the public domain. However, in silico [23] and empirical analyses have shown that the available universal primers for the COI gene are not well conserved in certain groups, viz., nematodes [22,24], echinoderms [25] and gastropods [26]. In summary, the accuracy of metabarcoding is highly dependent on marker choice, but unfortunately no marker has all the features of a perfect metabarcoding marker, and the best marker choice may be study specific [27]. Ficetola et al. [23] have developed the software ecoPCR (electronic PCR) to test the efficacy of barcode primers. Based on two parameters, barcode coverage (Bc) and barcode specificity (Bs), this method measures the conservation of the primers and the capacity of the amplified region to discriminate between taxa [28].
This software facilitates a preliminary comparison of several DNA regions to identify the most appropriate barcodes. A summary of the genes / markers used in various metabarcoding studies is given in Table 2.
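As a rough illustration of the two indices, the sketch below runs a toy in silico PCR over a small, made-up reference set. It is not ecoPCR itself: the function names, the mismatch threshold, the primers and the sequences are all hypothetical. It assumes that Bc is the fraction of reference taxa matched by both primers and that Bs is the fraction of amplified taxa whose amplicon is unique among the amplified taxa.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_site(seq, primer, max_mismatch, start=0):
    """Index of the first window at or after `start` that matches `primer`
    with at most `max_mismatch` mismatches, or -1 if there is none."""
    for i in range(start, len(seq) - len(primer) + 1):
        window = seq[i:i + len(primer)]
        if sum(a != b for a, b in zip(window, primer)) <= max_mismatch:
            return i
    return -1


def in_silico_pcr(seq, forward, reverse, max_mismatch=1):
    """Return the region between the two primer sites (primers excluded),
    or None when either primer fails to match."""
    fwd_pos = find_site(seq, forward, max_mismatch)
    if fwd_pos < 0:
        return None
    inner_start = fwd_pos + len(forward)
    rev_pos = find_site(seq, reverse_complement(reverse), max_mismatch, inner_start)
    if rev_pos < 0:
        return None
    return seq[inner_start:rev_pos]


def coverage_and_specificity(references, forward, reverse, max_mismatch=1):
    """Compute (Bc, Bs) under the assumptions stated in the lead-in."""
    amplicons = {}
    for taxon, seq in references.items():
        amplicon = in_silico_pcr(seq, forward, reverse, max_mismatch)
        if amplicon is not None:
            amplicons[taxon] = amplicon
    counts = Counter(amplicons.values())
    discriminated = [t for t, a in amplicons.items() if counts[a] == 1]
    bc = len(amplicons) / len(references) if references else 0.0
    bs = len(discriminated) / len(amplicons) if amplicons else 0.0
    return bc, bs


# Hypothetical toy reference database: taxon name -> sequence.
toy_references = {
    "taxon_a": "ACGTACAAAACCGTTA",
    "taxon_b": "ACGTACAAAACCGTTA",  # same amplicon as taxon_a
    "taxon_c": "ACGTACAATACCGTTA",  # distinct amplicon
    "taxon_d": "TTTTTTAAAACCGTTA",  # forward primer site is not conserved
}
toy_bc, toy_bs = coverage_and_specificity(toy_references, "ACGTAC", "TAACGG")
print(f"Bc = {toy_bc:.2f}, Bs = {toy_bs:.2f}")
```

In this toy set, three of the four taxa are amplified and only one of those three has a unique amplicon, so a primer pair can score well on coverage while still discriminating poorly.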

Table 2: Genes / markers used in various metabarcoding studies.

Amplicon multiplexing, library preparation and sequencing

Generally, NGS platforms are intended to sequence whole genomes of important model and non-model organisms, and during this process they generate up to millions of reads per investigation. Metabarcoding studies, however, aim to sequence a short fragment of a homologous gene (amplicon sequencing) from different species across many samples. Sequencing each sample (which contains many organisms) separately is not economical. Multiplexing of samples with short identifying sequences (barcodes / multiplex identifiers (MID) / indexes) is a widely used strategy in which different molecular tags (4-5 nucleotides) are attached to all DNA fragments / amplicons to identify samples. This kind of sequence indexing is used to sort the data after sequencing and to assign the sequence reads to specific samples. The most effective way to produce indexed amplicons is to amplify the genomic target region by PCR with specific primers that include a sequencing adaptor and a barcode (Figure 1) [29-30]. Other indexing strategies rely on ligation of barcodes or barcoded sequencing adapters to the DNA amplicons [31]. After indexing, the amplicons can be pooled (at an equimolar concentration of each amplicon) and the pooled barcoded libraries are sequenced. The number of different samples that can be pooled per sequencing run is determined by the number of barcodes available. High quality reagents with barcoded adapters and PCR primers are readily available in kits from many vendors. After tagging with MIDs, the amplicon library fragments are clonally amplified onto bead particles through emulsion PCR. These particles, which carry the sequence clones, are deposited on chips / flow cells for sequencing by NGS platforms. The design of tag / MID sequences is important, as these sequences may lead to PCR bias, and several software tools are available for this purpose (BARCRAWL [32]; oligotag program of OBITools, http://www.gren

Figure 1: DNA metabarcoding methodology.
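The role of the sample tags in this section can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions: reads are assumed to start with the tag followed directly by the forward primer, tags must match exactly, and the tag set, primer and reads are invented for the example. It does not reproduce BARCRAWL or the OBITools programs.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def min_tag_distance(tags):
    """Smallest pairwise Hamming distance in a set of equal-length tags.
    A larger value means a sequencing error is less likely to turn one
    tag into another."""
    tag_list = list(tags)
    return min(
        hamming(a, b)
        for i, a in enumerate(tag_list)
        for b in tag_list[i + 1:]
    )


def demultiplex(reads, tag_to_sample, primer):
    """Assign each read to a sample by its leading tag, then strip the tag
    and the primer. Reads with an unknown tag or a missing primer are
    returned separately instead of being guessed at."""
    tag_len = len(next(iter(tag_to_sample)))
    by_sample = {sample: [] for sample in tag_to_sample.values()}
    unassigned = []
    for read in reads:
        tag, rest = read[:tag_len], read[tag_len:]
        sample = tag_to_sample.get(tag)
        if sample is None or not rest.startswith(primer):
            unassigned.append(read)
            continue
        by_sample[sample].append(rest[len(primer):])
    return by_sample, unassigned


# Hypothetical 5-nucleotide tags, primer and pooled reads.
toy_tags = {"ACGTA": "soil_A", "TGCAT": "soil_B", "GATCG": "water_C"}
toy_primer = "GGTC"
toy_reads = [
    "ACGTA" + toy_primer + "AAATTT",
    "TGCAT" + toy_primer + "CCCGGG",
    "ACGTA" + toy_primer + "AAATTA",
    "ACGTT" + toy_primer + "AAATTT",  # tag carries one sequencing error
]
toy_by_sample, toy_unassigned = demultiplex(toy_reads, toy_tags, toy_primer)
print(toy_by_sample)
print(len(toy_unassigned), "read(s) could not be assigned")
```

Requiring an exact tag match is the most conservative choice. When tags are designed with a large minimum distance, a pipeline could instead tolerate one mismatch, at the cost of a small risk of assigning a read to the wrong sample.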

Data analysis

The data generated by NGS platforms are huge, and traditional desktop operating systems (Windows/DOS) are not well suited to handling them. The UNIX operating system is considered the standard computing environment for NGS data, and most bioinformatics software programs and algorithms are compatible with it. In UNIX, tasks are performed by typing commands, and on personal computers a UNIX environment can be provided by installing Linux. Certain operating systems, such as Mac OS X (Apple Inc.), provide a UNIX environment that offers both graphical user interfaces (GUI) and a command line mode.

NGS output data consist of DNA sequences (reads) and the corresponding quality values (for each nucleotide of each sequence read). Not all of the resulting sequences necessarily represent the species composition of the sample. The sequence data may contain sequencing noise, PCR chimeras [33], contaminant sequences, nuclear mitochondrial pseudogenes (Numts) and PCR errors. Eliminating all these erroneous sequences is a prerequisite for assigning taxa to the sequences. In brief, the data analysis consists of three steps, viz., data pre-processing (removal of primers and sequencing adaptors, and demultiplexing), processing of raw reads (denoising, removal of chimeras and PCR artefacts) and analysis (clustering, BLAST searches) (Figure 2). Different software tools and algorithms have been developed to perform these tasks, and different researchers have designed their own pipelines by combining various tools to analyse NGS metabarcode data. Among the available packages, QIIME (Quantitative Insights Into Microbial Ecology [34]) has been used successfully in different metabarcoding and metagenetics studies [35-37]. QIIME is an open-source bioinformatics pipeline consisting of native Python code that also wraps many external applications, covering data analysis from raw data processing to taxonomic assignment. It was initially developed for microbial communities; however, the QIIME environment is now being used for metazoan and plant metabarcoding studies [38,39].

Figure 2: Bioinformatics pipeline for NGS data analysis. Abbreviations: BOLD: Barcode of Life Database; FISH-BOL: Fish Barcode of Life Initiative; MarBOL: Marine Barcode of Life initiative; QBOL: Quarantine organisms Barcode of Life

Likewise, OBITools is another open source package that has been specifically designed for analyzing metabarcoding data. The main advantage of the OBITools is their ability to take taxonomic annotations into account, ultimately allowing sequence records to be sorted and filtered on the basis of taxonomy. Apart from these, other packages such as Operational Clustering of Taxonomic Units from Parallel UltraSequencing (OCTUPUS) and the Bioconductor packages ShortRead [40] and Biostrings (which run in the R language) are also available for NGS data analysis.
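To make the role of the per-nucleotide quality values concrete, the following sketch filters reads before any clustering or taxonomic assignment. It is a simplified stand-in for what the packages named above do, not their actual implementation. It assumes four-line FASTQ records with Phred+33 encoded qualities, and the thresholds (mean quality of 25, minimum length of 8) are arbitrary values chosen for the toy data.

```python
import io


def parse_fastq(handle):
    """Yield (identifier, sequence, quality string) tuples from a FASTQ
    stream that uses exactly four lines per record."""
    while True:
        header = handle.readline().strip()
        if not header:
            break
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().strip()
        yield header[1:], sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def quality_filter(records, min_mean_q=25, min_length=8):
    """Keep reads that are long enough and whose mean quality is high enough.
    The length test runs first, so empty reads never reach mean_phred."""
    return [
        (name, sequence)
        for name, sequence, quality in records
        if len(sequence) >= min_length and mean_phred(quality) >= min_mean_q
    ]


# Hypothetical reads: 'I' encodes Phred 40 and '!' encodes Phred 0.
toy_fastq = (
    "@read1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@read2\nACGTACGTAC\n+\n!!!!!!!!!!\n"
    "@read3\nACGT\n+\nIIII\n"
)
toy_kept = quality_filter(parse_fastq(io.StringIO(toy_fastq)))
print(toy_kept)
```

Here read2 is dropped for low quality and read3 for being too short. Real pipelines typically add further steps at this stage, such as trimming low-quality read ends and removing chimeras.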

In metabarcoding, species are defined operationally as clusters of similar sequences, and each cluster is known as an Operational Taxonomic Unit (OTU). The most critical step in the analysis is to assign a taxon to each sequence or cluster of sequences (OTU). This can be achieved by comparing each sequence to a reference database that is either a subset of public databases (e.g. EMBL, NCBI GenBank, SILVA and BOLD) or a set of sequences produced specifically for the study. The comparison of sequence similarity between the query sequence and the reference database sequences can be performed through a BLAST search or with ecoTag [41]. Where no reference database is available, sequences cannot be linked to a taxonomic name; however, they can still be clustered into MOTUs (Molecular Operational Taxonomic Units) that can be compared across studies, for example by comparing the diversity of MOTUs in different localities or under different parameters in the same locality [28]. The program MEGAN (MEtaGenome ANalyzer) is useful for representing the species / taxon composition of a sample and for taxonomic binning, even for very large data sets [42].
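The operational definition of an OTU as a cluster of similar sequences can be illustrated with a small greedy clustering routine. This is a sketch of one common strategy (seeding clusters with the most abundant sequences first), not the algorithm of any package named in this review. The identity measure (one minus edit distance divided by the longer length), the default threshold and the toy reads are assumptions made for the example.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance computed with two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def identity(a, b):
    """Similarity in [0, 1]: 1 minus edit distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def cluster_otus(reads, threshold=0.97):
    """Greedy clustering. Unique sequences are visited from most to least
    abundant; each joins the first OTU whose centroid it matches at or
    above `threshold`, otherwise it seeds a new OTU."""
    abundance = Counter(reads)
    ordered = sorted(abundance.items(), key=lambda item: (-item[1], item[0]))
    otus = []
    for sequence, count in ordered:
        for otu in otus:
            if identity(sequence, otu["centroid"]) >= threshold:
                otu["size"] += count
                otu["members"].append(sequence)
                break
        else:
            otus.append({"centroid": sequence, "size": count, "members": [sequence]})
    return otus


# Hypothetical reads: one abundant sequence, a one-base variant of it,
# and an unrelated sequence.
toy_reads = (
    ["ACGTACGTACGTACGTACGT"] * 3
    + ["ACGTACGTACGTACGTACGA"]
    + ["TTTTGGGGCCCCAAAATTTT"] * 2
)
toy_otus = cluster_otus(toy_reads, threshold=0.9)
for otu in toy_otus:
    print(otu["centroid"], otu["size"])
```

With the threshold set to 0.9 the one-base variant is absorbed into the OTU of the abundant sequence, while the unrelated sequence forms its own OTU. The number of OTUs recovered depends directly on this threshold, which is why it should be reported alongside any diversity estimate.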

Biodiversity studies

In this decade of biodiversity (2011-2020), funding for biodiversity characterization has increased, and these efforts have resulted in the creation of new taxonomic reference databases and the strengthening of existing ones, such as the Global Biodiversity Information Facility (GBIF) and the Barcode of Life Database (BOLD). These databases hold curated reference taxonomic information (about 144,357 species in BOLD [43]) and are constantly being updated with new taxon information, with the aim of including all the species on earth. Once a comprehensive database is prepared, the DNA metabarcoding approach can be used to analyse the species composition of different types of samples, from soils to sediments, faeces, air and water. Metabarcoding of soil / sediment samples collected from nuclear power plant areas or any other industrial area can be used to compare temporal and spatial species assemblages in order to assess human impacts on biodiversity [44]. Likewise, water samples can be used to detect the presence of invasive species [45]. Next generation sequencing technology has been used to analyse the species composition of sensitive ecosystems (coral reefs [46]) and extreme habitats (acid mines [47]). The DNA metabarcoding approach has been used successfully to characterize soil microbial diversity [48-50], fungal diversity [51] and plant diversity [52] using 16S rRNA, ITS and the P6 loop of the plastid DNA trnL intron amplicons, respectively. Hajibabaei et al. [53] used short fragments of COI DNA barcodes to identify freshwater macroinvertebrates from benthic samples.

The effect of climate change on biodiversity could be assessed, and future species distributions predicted, if data on past distributions together with past climate conditions are available. DNA metabarcoding of soil samples collected at different depths could provide a new source of information about past species distributions [7]. Murray et al. [54] analysed ancient DNA using a metabarcoding approach and identified a diverse range of taxa, including endemic, extirpated and previously unrecorded taxa. Haouchar et al. [55] assessed ancient DNA of vertebrate fossils and plants and provided valuable information about the past biodiversity of Kangaroo Island, Australia. Haile et al. [56] utilized both 454 pyrosequencing and conventional Sanger sequencing methods in the analysis of ancient DNA recovered from Arctic permafrost cores.

Trophic studies

The interaction between predator and prey plays an important role in maintaining ecosystem health and stability. Gut content analysis to identify the feeding habits of a species, especially an endangered species, helps in formulating conservation measures and strategies [57]. Traditionally, the diet composition of a species is assessed by macro- or micro-histological methods [58] and stable isotopes [59]. However, these methods are time consuming, require highly skilled personnel and cannot identify variably digested food items. To overcome this, DNA isolated from gut contents and faeces can be used for the molecular identification of diet composition by metabarcoding. Several studies have used NGS technology to investigate gut microbe ecology and the species composition of the diet. Some of these studies analysed herbivore diet from gut contents using the plastid trnL sequence [13, 41, 60-61]. Several studies investigated the species composition of the diet by analysing prey DNA collected from the faeces of the Australian fur seal (Arctocephalus pusillus doriferus [27]), little penguin (Eudyptula minor [62-63]), reptile [15] and leopard cat [64]. Recently, De Barba et al. [66] analysed the plant, vertebrate and invertebrate components of the diet of the brown bear by analysing faecal matter through a multiplexing strategy.


Metabarcoding, like many new technological advances in science, offers new opportunities and at the same time new challenges. Since metabarcoding studies generally rely on amplicon sequencing, factors such as PCR efficiency, primer tags and sequencing efficacy need to be considered to avoid errors [67, 68, 69]. To circumvent primer tag bias, a two stage PCR has been suggested in which template DNA is first amplified using untagged primers and then with tagged primers during the last few PCR cycles [53,68]. Another limitation is the lack of comprehensively curated reference databases for certain metazoans, which are needed for assigning taxa to the OTUs. Future studies are needed to improve sampling strategies (selection of season, sampling location within the habitat) and to understand the relationship between sequence reads and species density [70]. Further, the integration of knowledge from ecology, taxonomy and evolution is essential for addressing any biodiversity question using metabarcoding.


