Alignment and Mapping Reads to a Reference Genome With Tophat
Bioinformatics. 2009 May 1; 25(9): 1105–1111.
TopHat: discovering splice junctions with RNA-Seq
Cole Trapnell
1Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742 and 2Department of Mathematics, University of California, Berkeley, CA 94720, USA
Lior Pachter
1Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742 and 2Department of Mathematics, University of California, Berkeley, CA 94720, USA
Steven L. Salzberg
1Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742 and 2Department of Mathematics, University of California, Berkeley, CA 94720, USA
Received 2008 Oct 23; Revised 2009 February 24; Accepted 2009 Feb 26.
Abstract
Motivation: A new protocol for sequencing the messenger RNA in a cell, known as RNA-Seq, generates millions of short sequence fragments in a single run. These fragments, or 'reads', can be used to measure levels of gene expression and to identify novel splice variants of genes. However, current software for aligning RNA-Seq data to a genome relies on known splice junctions and cannot identify novel ones. TopHat is an efficient read-mapping algorithm designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites.
Results: We mapped the RNA-Seq reads from a recent mammalian RNA-Seq experiment and recovered more than 72% of the splice junctions reported by the annotation-based software from that study, along with nearly 20 000 previously unreported junctions. The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer. We describe several challenges unique to ab initio splice site discovery from RNA-Seq reads that will require further algorithm development.
Availability: TopHat is free, open-source software available from http://tophat.cbcb.umd.edu
Contact: cole@cs.umd.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
For many years, the standard method for determining the sequence of transcribed genes has been to capture and sequence messenger RNA using expressed sequence tags (ESTs) (Adams et al., 1993) or full-length complementary DNA (cDNA) sequences using conventional Sanger sequencing technology. Recently a new experimental method, RNA-Seq, has emerged that has a number of advantages over conventional EST sequencing: it uses next-generation sequencing (NGS) technologies that can sample the mRNA with fewer biases, it generates far more data per experiment, and it generates data that can be used as a direct measure of the level of gene expression. Thus RNA-Seq experiments not only capture the transcriptome, they can replace conventional microarray experiments for measuring expression. Compared with microarray technology, RNA-Seq experiments provide much higher resolution measurements of expression at comparable cost (Marioni et al., 2008).
The major drawback of RNA-Seq compared with conventional EST sequencing is that the sequences themselves are much shorter, typically 25–50 nt versus several hundred nucleotides with older technologies. One of the critical steps in an RNA-Seq experiment is that of mapping the NGS 'reads' to the reference transcriptome. However, because the transcriptomes are incomplete even for well-studied species such as human and mouse, RNA-Seq analyses are forced to map to the reference genome as a proxy for the transcriptome. Mapping to the genome achieves two major objectives of RNA-Seq experiments:
- Identification of novel transcripts from the locations of regions covered in the mapping.
- Estimation of the abundance of the transcripts from their depth of coverage in the mapping.
Because RNA-Seq reads are short, the first task is challenging. Current mapping strategies (e.g. Cloonan et al., 2008; Marioni et al., 2008; Mortazavi et al., 2008; Sultan et al., 2008) include alignment procedures designed to localize Illumina or SOLiD reads to known exons in the genome. However, whenever an RNA-Seq read spans an exon boundary, part of the read will not map contiguously to the reference, which causes the mapping procedure to fail for that read. The studies cited above solve this problem by concatenating known adjacent exons and then creating synthetic sequence fragments from these spliced transcripts. Reads that do not align to the genome but that map to these synthetic fragments represent evidence for splice junctions between known exons.
We can find splice sites ab initio by identifying reads that span exon junctions, but this strategy presents a number of computational challenges, particularly with short read lengths. For rarely transcribed genes, many splice junctions may be spanned by very few reads. Therefore, a splice junction mapping algorithm must be able to place reads that may have only a few bases on one side of a junction, or else that junction will be missed. Improvements in read length will not completely resolve this problem. However, failing to look for novel junctions at a genome-wide scale wastes much of the potential of RNA-Seq for capturing and describing the transcriptome of a human cell (or other species).
One recent method for ab initio junction mapping relies on a machine learning strategy to identify junctions. QPALMA (De Bona et al., 2008) trains a support vector machine-like algorithm using known splice junctions from the genome of interest. While the QPALMA pipeline has organizational similarities to TopHat, there are major differences. First, QPALMA uses a training step that requires a set of known junctions from the reference genome. Second, the QPALMA pipeline's initial mapping phase uses Vmatch (Abouelhoda et al., 2004), a general-purpose suffix array-based alignment program. Vmatch is a flexible, fast aligner, but because it is not designed to map short reads on machines with small main memories, it is substantially slower than other specialized short-read mappers. De Bona et al. report that Vmatch maps reads at around 644 400 reads per CPU hour against the 120 Mbp Arabidopsis thaliana genome. QPALMA's runtime appears to be dominated by its splice site scoring algorithm; its authors estimate that mapping 71 million RNA-Seq reads to A.thaliana would take 400 CPU hours, which is ∼180 000 reads per CPU hour.
In this article, we describe TopHat, a software package that identifies splice sites ab initio by large-scale mapping of RNA-Seq reads. TopHat maps reads to splice sites in a mammalian genome at a rate of ∼2.2 million reads per CPU hour. Rather than filtering out possible splice sites with a scoring scheme, TopHat aligns all sites, relying on an efficient 2-bit-per-base encoding and a data layout that effectively uses the cache on modern processors. This strategy works well in practice because TopHat first maps non-junction reads (those contained within exons) using Bowtie (http://bowtie-bio.sourceforge.net), an ultra-fast short-read mapping program (Langmead et al., 2009). Bowtie indexes the reference genome using a technique borrowed from data compression, the Burrows–Wheeler transform (Burrows and Wheeler, 1994; Ferragina and Manzini, 2001). This memory-efficient data structure allows Bowtie to scan reads against a mammalian genome using around 2 GB of memory (within what is commonly available on a standard desktop computer). Figure 1 illustrates the workflow of TopHat.
The TopHat pipeline. RNA-Seq reads are mapped against the whole reference genome, and those reads that do not map are set aside. An initial consensus of mapped regions is computed by Maq. Sequences flanking potential donor/acceptor splice sites within neighboring regions are joined to form potential splice junctions. The IUM reads are indexed and aligned to these splice junction sequences.
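To make the 2-bit-per-base encoding mentioned in the introduction concrete, the following minimal sketch packs a DNA string into an integer. The names `pack` and `unpack` are hypothetical and are not taken from the TopHat source, and the sketch assumes the sequence contains only A, C, G and T (no ambiguous bases such as N).

```python
# Illustrative only: 2 bits per base, so 4 bases fit in one byte.
# CODE, BASES, pack and unpack are hypothetical names, not TopHat's API.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> int:
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def unpack(value: int, length: int) -> str:
    """Recover a DNA string of the given length from its packed form."""
    out = []
    for shift in range(2 * (length - 1), -1, -2):
        out.append(BASES[(value >> shift) & 3])
    return "".join(out)


packed = pack("GTAG")
print(bin(packed), unpack(packed, 4))
```

The length has to be stored alongside the packed value because leading A bases encode as zero bits.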
2 METHODS
TopHat finds junctions by mapping reads to the reference in two phases. In the first phase, the pipeline maps all reads to the reference genome using Bowtie. All reads that do not map to the genome are set aside as 'initially unmapped reads', or IUM reads. Bowtie reports, for each read, one or more alignments containing no more than a few mismatches (two, by default) in the 5′-most s bases of the read. The remaining portion of the read on the 3′ end may have additional mismatches, provided that the Phred-quality-weighted Hamming distance is less than a specified threshold (70 by default). This policy is based on the empirical observation that the 5′ end of a read contains fewer sequencing errors than the 3′ end (Hillier et al., 2008). TopHat allows Bowtie to report more than one alignment for a read (default=10), and suppresses all alignments for reads that have more than this number. This policy allows so-called 'multireads' from genes with multiple copies to be reported, but excludes alignments to low-complexity sequence, to which failed reads frequently align. Low-complexity reads are not included in the set of IUM reads; they are simply discarded.
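The alignment policy above can be sketched as a simple acceptance test for one ungapped alignment. This is an illustration rather than Bowtie's implementation: the function name `passes_policy` is hypothetical, the weighted distance is assumed to be the sum of Phred qualities at all mismatched positions of the read, and the threshold is treated as strict ("less than"), following the wording of the text.

```python
def passes_policy(read, ref, quals, s=28, max_seed_mismatches=2, max_weighted=70):
    """Return True if an ungapped alignment of read to ref satisfies the policy.

    read, ref : equal-length strings (read and the reference it aligns to)
    quals     : Phred quality of each read base
    Assumption: the quality-weighted Hamming distance sums the qualities
    of every mismatched base in the read.
    """
    if not (len(read) == len(ref) == len(quals)):
        raise ValueError("read, ref and quals must have equal length")
    mismatched = [i for i, (a, b) in enumerate(zip(read, ref)) if a != b]
    seed_mismatches = sum(1 for i in mismatched if i < s)
    weighted = sum(quals[i] for i in mismatched)
    return seed_mismatches <= max_seed_mismatches and weighted < max_weighted


ref = "ACGT" * 8                      # 32 bp reference window
read = ref[:30] + "A" + ref[31:]      # one mismatch near the 3' end
print(passes_policy(read, ref, [30] * 32))
```

Separating the mismatch count in the first s bases from the quality-weighted total mirrors the observation that errors concentrate at the 3′ end.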
TopHat then assembles the mapped reads using the assembly module in Maq (Li et al., 2008). TopHat extracts the sequences for the resulting islands of contiguous sequence from the sparse consensus, inferring them to be putative exons. To generate the island sequences, TopHat invokes the Maq assemble subcommand (with the -s flag), which produces a compact consensus file containing called bases and the corresponding reference bases. Because the consensus may include incorrect base calls due to sequencing errors in low-coverage regions, such islands may be a 'pseudoconsensus': for any low-coverage or low-quality positions, TopHat uses the reference genome to call the base. Because most reads covering the ends of exons will also span splice junctions, the ends of exons in the pseudoconsensus will initially be covered by few reads, and as a result, an exon's pseudoconsensus will likely be missing a small amount of sequence on each end. In order to capture this sequence along with donor and acceptor sites from flanking introns, TopHat includes a small amount of flanking sequence from the reference on both sides of each island (default=45 bp).
Because genes transcribed at low levels will be sequenced at low coverage, the exons in these genes may have gaps. TopHat has a parameter that controls when two distinct but nearby exons should be merged into a single exon. This parameter defines the length of the longest allowable coverage gap in a single island. Because introns shorter than 70 bp are rare in mammalian genomes such as mouse (Pozzoli et al., 2007), any value less than 70 bp for this parameter is reasonable. To be conservative, the TopHat default is 6 bp.
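The island construction described in the two paragraphs above (finding covered regions, merging across short coverage gaps and adding flanking sequence) can be sketched as follows. This is a simplified stand-in that works on a per-base depth list rather than on a Maq consensus file; the function names are hypothetical, and intervals are half-open `(start, end)` pairs.

```python
def covered_runs(depth):
    """Return half-open (start, end) intervals where depth is positive."""
    runs, start = [], None
    for pos, d in enumerate(depth):
        if d > 0 and start is None:
            start = pos
        elif d == 0 and start is not None:
            runs.append((start, pos))
            start = None
    if start is not None:
        runs.append((start, len(depth)))
    return runs


def merge_runs(runs, max_gap=6):
    """Merge neighboring runs separated by at most max_gap uncovered bases."""
    merged = []
    for start, end in runs:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def add_flanks(islands, genome_length, flank=45):
    """Extend each island by flank bases on both sides, clipped to the genome."""
    return [(max(0, s - flank), min(genome_length, e + flank)) for s, e in islands]


# Toy depth profile: a 6 bp gap (merged) and a 7 bp gap (kept separate).
depth = [1] * 10 + [0] * 6 + [1] * 10 + [0] * 7 + [1] * 5
islands = merge_runs(covered_runs(depth))
print(islands, add_flanks(islands, len(depth), flank=2))
```

Keeping `max_gap` well below the shortest plausible intron avoids fusing two exons across a real intron.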
To map reads to splice junctions, TopHat first enumerates all canonical donor and acceptor sites within the island sequences (as well as their reverse complements). Next, it considers all pairings of these sites that could form canonical (GT–AG) introns between neighboring (but not necessarily adjacent) islands. Each possible intron is checked against the IUM reads for reads that span the splice junction, as described below. By default, TopHat only examines potential introns longer than 70 bp and shorter than 20 000 bp, but these default minimum and maximum intron lengths can be adjusted by the user. These values describe the vast majority of known eukaryotic introns. For example, more than 93% of mouse introns in the UCSC known gene set fall within this range. However, users willing to make a small sacrifice in sensitivity will see substantially lower running time by reducing the maximum intron length. To improve running times and avoid reporting false positives, the program excludes donor–acceptor pairs that fall entirely within a single island, unless the island is very deeply sequenced. An example of a 'single island' junction is illustrated in Figure 2. The gene shown has two alternate transcripts, one of which has an intron that coincides with the UTR of the other transcript. The figure shows the normalized coverage of the intron and its flanking exons by uniquely mappable reads as reported by Mortazavi et al. Both transcripts are clearly present in the RNA-Seq sample, and TopHat reports the entire region as a single island. In order to detect such junctions without sacrificing performance and specificity, TopHat looks for introns within islands that are deeply sequenced. During the island extraction stage of the pipeline, the algorithm computes the following statistic for each island spanning coordinates i to j in the map:
$$D_{ij} = \frac{\sum_{m=i}^{j} d_m}{j - i} \cdot \frac{1}{\sum_{m=0}^{n} d_m} \qquad (1)$$
where $d_m$ is the depth of coverage at coordinate m in the Bowtie map, and n is the length of the reference genome. When scaled to range [0, 1000], this value represents the normalized depth of coverage for an island. We observed that single-island junctions tend to fall within islands with high D (data not shown). TopHat thus looks for junctions contained in islands with D≥300, though this parameter can be changed by the user. A high D-value will prevent TopHat from looking for junctions within single islands, which will improve running time. A low D-value will force TopHat to look within many islands, slowing the pipeline, but potentially finding more junctions.
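A small sketch of the island statistic follows, assuming it is the mean depth across the island divided by the total depth across the reference. The text does not say how the value is scaled to [0, 1000]; the sketch assumes division by the largest island value, which is only one plausible choice. All function names are hypothetical.

```python
def island_depth(depth, i, j):
    """Mean depth over island coordinates i..j divided by total depth."""
    total = sum(depth)
    if total == 0 or j <= i:
        raise ValueError("need a covered reference and an island with j > i")
    return sum(depth[i:j + 1]) / (j - i) / total


def scale_to_1000(values):
    """Assumed scaling: largest island maps to 1000 (not specified in the text)."""
    top = max(values)
    return [(v / top) * 1000 for v in values]


depth = [0, 20, 40, 20, 0, 0, 1, 1]          # toy per-base depth
islands = [(1, 3), (6, 7)]                   # inclusive (i, j) coordinates
scaled = scale_to_1000([island_depth(depth, i, j) for i, j in islands])
# Only islands at or above the threshold are searched for internal junctions.
searched = [isl for isl, d in zip(islands, scaled) if d >= 300]
print(scaled, searched)
```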
An intron entirely overlapped by the 5′-UTR of another transcript. Both isoforms are present in the brain tissue RNA sample. The top track is the normalized uniquely mappable read coverage reported by ERANGE for this region (Mortazavi et al., 2008). The lack of a large coverage gap causes TopHat to report a single island containing both exons. TopHat looks for introns within single islands in order to detect this junction.
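The enumeration of candidate introns between islands, described earlier in this section, can be sketched as below. The sketch is deliberately simplified: it scans the forward strand only (no reverse complements), it never pairs sites within a single island, and the function names are hypothetical. Islands are half-open `(start, end)` intervals on a genome string.

```python
def motif_sites(genome, island, motif):
    """Positions within the island where the two-base motif starts."""
    start, end = island
    return [p for p in range(start, end - 1) if genome[p:p + 2] == motif]


def candidate_introns(genome, islands, min_len=70, max_len=20000):
    """Pair GT donors in one island with AG acceptors in any later island.

    Returns half-open (intron_start, intron_end) intervals whose length is
    strictly between min_len and max_len.
    """
    introns = []
    for idx, left in enumerate(islands):
        donors = motif_sites(genome, left, "GT")
        for right in islands[idx + 1:]:
            acceptors = motif_sites(genome, right, "AG")
            for donor in donors:
                for acceptor in acceptors:
                    intron_end = acceptor + 2   # intron includes the AG
                    if min_len < intron_end - donor < max_len:
                        introns.append((donor, intron_end))
    return introns


# Toy genome: exon, GT...AG intron of 84 bp, exon.
genome = "AAAAC" + "GT" + "C" * 80 + "AG" + "CCCCC"
print(candidate_introns(genome, [(0, 7), (87, 94)]))
```

Lowering `max_len` shrinks the number of donor–acceptor pairs to test, which is why it reduces running time at some cost in sensitivity.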
For each splice junction, TopHat searches the IUM reads in order to find reads that span junctions using a seed-and-extend strategy. The pipeline indexes the IUM reads using a simple lookup table to amortize the cost of searching for a spliced alignment over many reads. As illustrated in Figure 3, TopHat finds any reads that span splice junctions by at least k bases on each side (where k=5 bp by default), so the table is keyed by 2k-mers, where each 2k-mer is associated with reads that contain that 2k-mer. For each read, the table contains (s−2k+1) entries corresponding to possible positions where a splice may fall within a read, where s is the length of the high-quality region on the 5′ end (default=28 bp). Users with longer reads may wish to increase s to improve sensitivity. Lowering s will improve running time, but may reduce sensitivity. Increasing k will improve running time, but may limit TopHat to finding junctions only in highly expressed (and thus deeply covered) genes. Reducing it will dramatically increase running time, and while sensitivity will improve, the program may report more false positives. Next, TopHat takes each possible splice junction and makes a 2k-mer 'seed' for it by concatenating the k bases downstream of the acceptor to the k bases upstream of the donor. The IUM read index is then queried with this 2k-mer to find all reads that contain the seed. This exact 2k-mer match is extended to find all reads that span the splice junction. To extend the exact match for the seed region, TopHat aligns the portions of the read to the left and right of the seed with the left island and right island, respectively, allowing a user-specified number of mismatches.
TopHat will miss spliced alignments to reads with mismatches in the seed region of the splice junction, but we expect this tradeoff between speed and sensitivity will be favorable for most users.
The seed and extend alignment used to match reads to possible splice sites. For each possible splice site, a seed is formed by combining a small amount of sequence upstream of the donor and downstream of the acceptor. This seed, shown in dark gray, is used to query the index of reads that were not initially mapped by Bowtie. Any read containing the seed is checked for a complete alignment to the exons on either side of the possible splice. In the light gray portion of the alignment, TopHat allows a user-specified number of mismatches. Because reads typically contain low-quality base calls on their 3′ ends, TopHat only examines the first 28 bp on the 5′ end of each read by default.
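The seed-and-extend search can be illustrated with a dictionary keyed by 2k-mers. This is a sketch under stated assumptions, not TopHat's C++ implementation: the function names are hypothetical, `left_island` is taken to end exactly at the donor and `right_island` to start exactly after the acceptor, and only the forward strand is considered.

```python
from collections import defaultdict


def index_reads(reads, k=5, s=28):
    """Map every 2k-mer in the first s bases of each read to (read id, offset)."""
    table = defaultdict(list)
    for read_id, read in enumerate(reads):
        region = read[:s]
        for offset in range(len(region) - 2 * k + 1):
            table[region[offset:offset + 2 * k]].append((read_id, offset))
    return table


def count_mismatches(a, b):
    return sum(x != y for x, y in zip(a, b))


def spanning_reads(table, reads, left_island, right_island,
                   k=5, s=28, max_mismatches=0):
    """Return ids of reads that contain the junction seed and extend on both sides."""
    seed = left_island[-k:] + right_island[:k]
    left_avail, right_avail = left_island[:-k], right_island[k:]
    hits = []
    for read_id, offset in table.get(seed, []):
        region = reads[read_id][:s]
        left_part = region[:offset]               # read bases before the seed
        right_part = region[offset + 2 * k:]      # read bases after the seed
        if len(left_part) > len(left_avail) or len(right_part) > len(right_avail):
            continue                              # read runs off the island
        left_ref = left_avail[len(left_avail) - len(left_part):]
        right_ref = right_avail[:len(right_part)]
        mismatches = (count_mismatches(left_part, left_ref)
                      + count_mismatches(right_part, right_ref))
        if mismatches <= max_mismatches and read_id not in hits:
            hits.append(read_id)
    return hits


left, right = "ACGTACGTTTGACCA", "GGATCCATTGCAACT"
reads = ["TTTGACCAGGATCCATTG", "ATTGACCAGGATCCATTG", "C" * 18]
table = index_reads(reads)
print(spanning_reads(table, reads, left, right))
print(spanning_reads(table, reads, left, right, max_mismatches=1))
```

Because the seed lookup is exact, a read with a mismatch inside the 2k seed bases is never retrieved, which is the sensitivity tradeoff noted in the text.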
The algorithm reports all of the spliced alignments it finds, and then builds a set of non-redundant splice junctions using these alignments. However, some spliced alignments are discarded prior to reporting junctions in order to avoid reporting false junctions. In their large-scale RNA-Seq study, Wang et al. (2008) reported millions of alternative splicing events in humans and observed that 86% of the minor isoforms were expressed at at least 15% of the level of the major isoform. TopHat's heuristic filter for spliced alignments is based on this observation. For each junction, the average depth of read coverage is computed for the left and right flanking regions of the junction separately. The number of alignments crossing the junction is divided by the coverage of the more deeply covered side to obtain an estimate of the minor isoform frequency. If TopHat estimates that the splice junction occurs at less than 15% of the depth of coverage of the exons flanking it, the junction is not reported. The minimum minor isoform frequency parameter is adjustable by the user, and may be entirely disabled. While the default value in TopHat reflects a result from a human RNA-Seq study, we expect that minor isoforms are expressed at similar frequencies in other mammals, and that the value will be suitable when the software is used to process reads from other mammals.
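The minor isoform filter reduces to a short calculation, sketched below. The function names are hypothetical, the caller is assumed to supply per-base depths for the two flanking regions (the text does not specify how wide those regions are), and passing `None` as the threshold stands in for disabling the filter.

```python
def minor_isoform_fraction(n_junction_alignments, left_depths, right_depths):
    """Junction alignments divided by the mean depth of the deeper flank."""
    left_mean = sum(left_depths) / len(left_depths)
    right_mean = sum(right_depths) / len(right_depths)
    deeper = max(left_mean, right_mean)
    if deeper == 0:
        raise ValueError("both flanking regions have zero coverage")
    return n_junction_alignments / deeper


def report_junction(n_junction_alignments, left_depths, right_depths,
                    min_fraction=0.15):
    """Keep the junction unless its estimated frequency is below min_fraction."""
    if min_fraction is None:          # filter disabled
        return True
    fraction = minor_isoform_fraction(n_junction_alignments,
                                      left_depths, right_depths)
    return fraction >= min_fraction


# Toy flanks with mean depths 10 and 30: three alignments give 3/30 = 0.10.
print(report_junction(3, [10] * 5, [30] * 5))
print(report_junction(6, [10] * 5, [30] * 5))
```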
3 RESULTS
We compared TopHat with ERANGE on a set of 47 781 892 reads, each 25 bp long, from a recent RNA-Seq study using Mus musculus brain tissue (Mortazavi et al., 2008). To align reads across splice junctions, ERANGE appends to the reference genome a set of spanning sequences that contain all annotated splice sites. For each splice site, a sequence of length L−4 (for reads of length L) is extracted from the exons flanking that site, and these are concatenated to create a spanning sequence. This constituted a total of 205 151 junctions for M.musculus. Mortazavi et al. trimmed reads to 25 bp, so we chose s=25 and k=5, which caused TopHat to report junctions spanned by the 25 bp on the 5′ end of a read, with at least 5 bp on each side of the junction. We also required reads to match the exon sequence on each side of the junction exactly. In addition, we used only reference base calls for the island 'pseudoconsensus' sequences. This may have prevented TopHat from identifying some junctions with SNPs in the flanking exon sequence. However, incorrect base calls in islands, especially near island endpoints, would cause many more junctions to be missed, a problem that was greatly reduced by the use of the reference bases within our assembled islands.
For each gene, ERANGE reports the number of mapped reads per kilobase of exon per million mapped reads (RPKM), a measure of transcription activity. The authors characterize 15.0 and 25.0 as moderate and high levels of transcription, respectively. ERANGE reported 108 674 splice junctions in genes with positive RPKM, and 37 675 junctions in genes with RPKM ≥15.0. TopHat reported 81.9% of the ERANGE junctions in genes above 15.0 RPKM, and 72.2% of all ERANGE junctions. Figure 4 shows how TopHat's sensitivity in detecting junctions varies with the RPKM of the genes. An example of TopHat's ability to detect junctions even in genes with very low RPKM is illustrated in Figure 6. Of the 30 121 junctions reported by ERANGE and not reported by TopHat, 15 689 (52%) fell within genes expressed below 5 RPKM and were likely missed due to lack of coverage. A further 3209 (10%) of the missed junctions had RPKM ≥5.0 but had endpoints more than 20 000 bp apart. Filtering based on minor isoform fraction excluded 4560 (15%) junctions. TopHat detected several thousand known splice junctions that ERANGE excluded, presumably during its multiread 'rescue' phase, where it randomly assigns each spliced multiread to matched genes according to their relative expression levels. Of the 104 711 junctions reported by TopHat, 84 988 (81.1%) are listed among the UCSC gene models for M. musculus. The remaining 19 722 may represent novel junctions.
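For readers unfamiliar with RPKM, the definition quoted above (mapped reads per kilobase of exon per million mapped reads) translates directly into a one-line calculation. The function name `rpkm` and the numbers in the usage line are illustrative and are not taken from ERANGE or from the study.

```python
def rpkm(reads_in_gene, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    kilobases = exon_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return reads_in_gene / (kilobases * millions)


# Hypothetical gene: 100 reads on 2 kb of exon in a run of 10 million mapped reads.
print(rpkm(100, 2_000, 10_000_000))
```

Dividing by exon length and by sequencing depth makes values comparable across genes of different sizes and across runs of different sizes.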
TopHat sensitivity as RPKM varies. For genes transcribed above 15.0 RPKM, TopHat detects more than 80% of the junctions reported by ERANGE in the M. musculus brain tissue study. TopHat detects more than 72% of all junctions observed by ERANGE, including those in genes expressed at only a single transcript per cell. A de novo assembly of the RNA-Seq reads, followed by spliced alignment of the assembled transcripts, produces markedly poorer sensitivity, detecting around 40% of junctions in genes transcribed above 25.0 RPKM, but comparatively few junctions in more highly transcribed genes.
TopHat detects junctions in genes transcribed at very low levels. The gene Pnlip was transcribed at only 7.88 RPKM in the brain tissue according to ERANGE, and yet TopHat reports the complete known gene model.
To assess TopHat's ability to identify true junctions without reporting false positives, we simulated the results of Illumina short-read sequencing of alternatively spliced genes at several depths. The EMBL-EBI Alternative Splicing Transcript Database (ASTD) (Le Texier et al., 2006) contains 1295 transcripts from mouse chromosome 7. Reads were generated from these transcripts by the short-read simulator from Maq. The simulator computes an empirical distribution of read quality scores and uses these to generate sequencing errors in the reads it produces. We trained the simulator using the reads from the Mortazavi et al. study, so the sequencing error profile on simulated reads should be similar to the real reads. We generated simulated sequence from the ASTD transcripts, which contained 9879 splice junctions, at 1-, 5-, 10-, 25- and 50-fold coverage. TopHat's junction predictions at each coverage level are summarized in Table 1. TopHat captures up to 94% of the 9879 ASTD splice junctions on mouse chromosome 7. Sensitivity suffers when transcripts are sequenced at less than five-fold coverage. TopHat reports few false positives even in deeply sequenced transcripts.
Table 1.
TopHat junction finding under simulated sequencing of transcripts
| Depth of sequence coverage | True positives | Total (%) | False positives | Reported (%) |
|---|---|---|---|---|
| 1 | 1744 | 17 | 114 | 6 |
| 5 | 7666 | 77 | 585 | 7 |
| 10 | 8737 | 88 | 428 | 4 |
| 25 | 9275 | 93 | 267 | 2 |
| 50 | 9351 | 94 | 235 | 2 |
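The percentage columns of Table 1 can be recomputed from the counts. The sketch below assumes that "Total (%)" is the share of the 9879 ASTD junctions recovered and that "Reported (%)" is the share of false positives among all junctions reported at that depth, both rounded down; this interpretation is an assumption, although it does reproduce the tabulated values.

```python
TOTAL_JUNCTIONS = 9879  # ASTD splice junctions on mouse chromosome 7

# (fold coverage, true positives, false positives) from Table 1
ROWS = [(1, 1744, 114), (5, 7666, 585), (10, 8737, 428),
        (25, 9275, 267), (50, 9351, 235)]


def summarize(true_pos, false_pos, total=TOTAL_JUNCTIONS):
    """Return (sensitivity %, false-positive share %) rounded down."""
    sensitivity = 100 * true_pos // total
    false_share = 100 * false_pos // (true_pos + false_pos)
    return sensitivity, false_share


for depth, tp, fp in ROWS:
    print(depth, summarize(tp, fp))
```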
The UCSC gene models are relatively conservative, so we searched the GenBank mouse EST database using BLAT (Kent, 2002) for the previously unreported junctions. We also searched the database for known junctions and randomly generated junctions as positive and negative controls, respectively. The positive control group was drawn from the 205 151 junction sequences constructed by Mortazavi et al. as part of the ERANGE study. The second set consisted of previously unreported junction sequences reported by TopHat. The negative control consisted of random pairings of the left and right halves of junction sequences from the second group. All sequences in each of the three groups were 42 bp long, and each group contained 1000 sequences chosen randomly. Figure 5 shows the distribution of E-values for each sequence's best BLAT hit against the GenBank mouse EST database. As expected, nearly all of the known junctions are confirmed by high-quality hits to ESTs. Also expected is the lack of high-quality hits for sequences in the 'random-pairing' negative control. More than 11% of the 1000 TopHat junctions we searched for actually have high-quality hits to mouse ESTs. In total, 2543 of the 19 722 junctions not in UCSC gene models had hits to mouse ESTs with E-value <1×10⁻⁶.
The BLAT E-value distribution of known, previously unreported, and randomly generated splice junction sequences when searched against GenBank mouse ESTs. As expected, known junctions have high-quality BLAT hits to the EST database. Randomly generated junction sequences do not. High-quality BLAT hits for more than 11% of the junctions identified by TopHat suggest that the UCSC gene models for mouse are incomplete. These junctions are almost certainly genuine, and because the mouse EST database is not complete, 11% is only a lower bound on the specificity of TopHat.
We examined the previously unreported junctions that lacked high-quality hits to mouse ESTs by dividing them into three categories: junctions between two known exons, junctions between a known exon and a novel one and junctions between two novel exons. Of the 17 179 junctions without EST hits, 10 499 joined novel exons, 6077 joined a novel exon with a known one and 603 joined a pair of known exons. One example of a junction from the second category occurred in the ADP-ribosylation gene Arfgef1, which is important in vesicular trafficking (Morinaga et al., 1996). The junction in Figure 7 skips two of the gene's 38 exons. TopHat reported several junctions in Arfgef1 that were previously unknown, indicating that Arfgef1 is alternatively spliced.
A previously unreported splice junction detected by TopHat is shown as the topmost horizontal line. This junction skips two exons in the ADP-ribosylation gene Arfgef1. As explained in Section 2, islands of read coverage in the Bowtie mapping are extended by 45 bp on either side.
We also compared TopHat to a simple strategy based on de novo assembly of RNA-Seq reads. The advantage of such a strategy is that, like TopHat, no known junctions or gene models are needed. We ran the Velvet short-read assembler (Zerbino and Birney, 2008) (version 0.7.11, -k=21) on our RNA-Seq reads to produce 149 628 transcript contigs with N50=131. We then aligned these contigs back to the mouse reference genome using the spliced alignment program GMAP (Wu and Watanabe, 2005), one of the leading methods for alignment of ESTs and full-length cDNAs to genomic DNA. The sensitivity of the Velvet+GMAP method is shown in Figure 4. The method detects around 20% of all junctions reported by ERANGE. While the method detects around 40% of junctions in genes transcribed above an RPKM value of 25.0, its detection rate decreases as RPKM further increases. We speculate that many of these highly transcribed genes have several alternate isoforms, and that junctions in these genes may cause Velvet to break contigs at the transcript junctions shared by multiple isoforms.
The entire TopHat run took 21 h, 50 min on a 3.0 GHz Intel Xeon 5160 processor, using <4 GB of RAM, a throughput of nearly 2.2 million reads per CPU hour.
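The throughput figures quoted in the paper follow from simple division, shown here as a check. The sketch assumes that the single-processor wall-clock time of the TopHat run equals its CPU time; the QPALMA numbers are the authors' estimate cited in the Introduction. The function name is hypothetical.

```python
def reads_per_cpu_hour(reads, hours, minutes=0):
    """Throughput in reads per CPU hour for a run of the given duration."""
    return reads / (hours + minutes / 60)


tophat_rate = reads_per_cpu_hour(47_781_892, 21, 50)   # this study's run
qpalma_rate = reads_per_cpu_hour(71_000_000, 400)      # estimate cited for QPALMA

print(f"TopHat: {tophat_rate / 1e6:.2f} million reads per CPU hour")
print(f"QPALMA estimate: {qpalma_rate:,.0f} reads per CPU hour")
print(f"Ratio: {tophat_rate / qpalma_rate:.1f}x")
```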
4 DISCUSSION
In our comparison, TopHat reported more than 72% of all exon splice junctions captured by the ERANGE annotation-based analysis pipeline, including junctions from genes transcribed at around one transcript per cell. TopHat captured around 80% of splice junctions in more actively transcribed genes. More significant is its ability to detect novel splice junctions. While it is difficult to assess how many of TopHat's 19 722 newly discovered junctions are genuine, TopHat's alignment parameters for this run were quite strict: only exact matches were reported for splice junctions, and reads were required to have relatively long anchors on each side of the splice site. Close inspection of junctions strengthened the case that many are true splices. The TopHat pipeline processed an entire RNA-Seq run in less than a day on a single processor of a standard workstation. ERANGE is appropriate for high-quality measurement of gene expression in mammalian RNA-Seq projects, provided that a reliable annotation of exon–exon junctions is available. QPALMA can accurately align short reads across junctions without an annotation, but makes such substantial sacrifices in speed that it may not be practical for large mammalian projects. TopHat thus represents a significant advance over previous RNA-Seq splice detection methods, both in its performance and its ability to detect junctions de novo.
The TopHat pipeline and its default parameter values are designed for detecting junctions even in genes transcribed at very low levels. Nevertheless, the system may fail to find junctions for a variety of reasons. The most common reason for missing a junction is that the transcript has very low sequencing coverage, in which case there might be no read that straddles the junction with sufficient sequence on each side. Junctions spanning very long introns or introns with non-canonical donor and acceptor sites (such as GC–AG introns) will also be missed. As discussed in Section 2, TopHat can also miss single-island junctions in islands with a low normalized depth of coverage. Single-island junctions can occur when the UTR of one isoform entirely overlaps an intron from another isoform, as illustrated in Figure 2. They may also occur when a transcript is incompletely processed. While several thousand known junctions were captured by TopHat but not reported by ERANGE, this merely reflects differences in the goal of the two programs. ERANGE is primarily meant to quantitate gene expression, while TopHat aims to identify junctions. For reads with multiple spliced alignments, ERANGE assigns each read to a single position, in order to increase the accuracy of its expression estimates. Were TopHat to do this, its sensitivity would suffer slightly.
In the near future, new RNA-Seq protocols that produce paired-end reads will make TopHat's job easier. Splice detection rates will improve, and false positives should become much less common, as mate-pair information can drastically reduce the number of possible splices that must be considered. The current version of TopHat looks for splice junctions between all islands within a certain distance of each other on each strand of the reference. A version of TopHat that made use of mate pairs might consider only pairings of islands where one read from a mate pair maps to each island. The alignment constraints between splices and reads can also be relaxed: longer introns and those with non-canonical donor and acceptor sites will be readily detectable.
In the nearer term, TopHat will aim to provide base-pair resolution exon annotations along with approximate quantitation of expression for those exons. This task is not without difficulty, since coding regions must still be distinguished from UTRs and non-coding RNAs. However, the resolution and economy of RNA-Seq in detecting transcribed regions dramatically reduces the amount of sequence that must be considered by a computational gene prediction approach. We are confident that such methods will meet great success in the near future. The current pipeline has no means of identifying microexons (shorter than a single read) because they will not be captured by the initial Bowtie mapping. An additional mapping phase using IUM reads should be able to capture many of these microexons.
5 SOFTWARE
TopHat is implemented in C++ and Python and runs on Linux and Mac OS X. It makes substantial use of previously described tools, including Bowtie (Langmead et al., 2009), Maq (Li et al., 2008) and the SeqAn library (Döring et al., 2008).
Supplementary Material
ACKNOWLEDGEMENTS
We thank Adam Phillippy, Geo Pertea, Ben Langmead, Kasper Hansen, Angela Brooks and Ali Mortazavi for helpful technical discussions. We thank Diane Trout, Ali Mortazavi, Brian Williams, Kenneth McCue, Lorian Schaeffer and Barbara Wold for making their data available for our example study.
Funding: National Institutes of Health (R01-LM06845, R01-GM083873 to S.L.S.); National Science Foundation (CCF 0347992 to L.P.).
Conflict of Interest: none declared.
REFERENCES
- Abouelhoda M, et al. Replacing suffix trees with enhanced suffix arrays. J. Discrete Alg. 2004;2:53–86.
- Adams MD, et al. Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat. Genet. 1993;4:373–380.
- Burrows M, Wheeler D. Technical Report 124. Palo Alto, California: DEC, Digital Systems Research Center; 1994. A block sorting lossless data compression algorithm.
- Cloonan N, et al. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat. Meth. 2008;5:613–619.
- De Bona F, et al. Optimal spliced alignments of short sequence reads. Bioinformatics. 2008;24:i174–i180.
- Döring A, et al. SeqAn an efficient, generic C++ library for sequence analysis. BMC Bioinformatics. 2008;9:11.
- Ferragina P, Manzini G. Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms. Washington, D.C. USA: 2001. An experimental study of an opportunistic index; pp. 269–278.
- Hillier LW, et al. Whole-genome sequencing and variant discovery in C. elegans. Nat. Meth. 2008;5:183–188.
- Kent WJ. Blat—the blast-like alignment tool. Genome Res. 2002;12:656–664.
- Langmead B, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25.
- Le Texier V, et al. Alttrans: transcript pattern variants annotated for both alternative splicing and alternative polyadenylation. BMC Bioinformatics. 2006;7:169.
- Li H, et al. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008;18:1851–1858.
- Marioni J, et al. RNA-Seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008;18:1509–1517.
- Morinaga N, et al. Isolation of a brefeldin A-inhibited guanine nucleotide-exchange protein for ADP ribosylation factor (ARF) 1 and ARF3 that contains a Sec7-like domain. Proc. Natl Acad. Sci. USA. 1996;93:12856–12860.
- Mortazavi A, et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Meth. 2008;5:621–628.
- Pozzoli U, et al. Intron size in mammals: complexity comes to terms with economy. Trends Genet. 2007;23:20–24.
- Sultan M, et al. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008;321:956.
- Wang ET, et al. Alternative isoform regulation in human tissue transcriptomes. Nature. 2008;456:470–476.
- Wu TD, Watanabe CK. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics. 2005;21:1859–1875.
- Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–829.
Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2672628/