![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
tophat mapping bug? | cedance | Bioinformatics | 2 | 11-11-2011 08:56 AM |
Tophat - Difference in mapping percentages of the read pairs | adarshjose | RNA Sequencing | 0 | 12-17-2010 08:30 AM |
an error with Tophat mapping | zhyd9 | Bioinformatics | 0 | 10-04-2010 06:37 AM |
TopHat mapping qualities | dychiang | Bioinformatics | 0 | 10-20-2009 02:44 PM |
![]() |
|
Thread Tools |
![]() |
#1 |
Junior Member
Location: rome Join Date: Mar 2011
Posts: 2
|
![]()
Hi all.
I have a question about tophat (v > 1.3.1). I am running tophat with 72 single end reads from RNA dataset (Illumina Genome Analyzer was used); I have both a gff3 file and a reference genome. Reading logs/bowtie.left_kept_reads.fixmap.log I have: # reads processed: 7333307 # reads with at least one reported alignment: 4403978 (60.05%) # reads that failed to align: 2928767 (39.94%) # reads with alignments suppressed due to -m: 562 (0.01%) Reported 4826602 alignments to 1 output stream(s) So I expected to find no more than 4826602 alignments in accepted_hits.bam. That is not right: after converting accepted_hits.bam to sam format I have: wc -l accepted_hits.sam 6121159 accepted_hits.sam Each line should correspond to an alignment but bowtie report talks about 4826602 alignments. Can anyone tell me, please, what am I missing? Thanks a lot. |
![]() |
![]() |
![]() |
#2 |
Senior Member
Location: Sydney, Australia Join Date: Jun 2011
Posts: 166
|
![]()
What does samtools view -c accepted_hits.sam tell you ? Counting lines is the wrong way to do it because there are other lines in the SAM file as well.
|
![]() |
![]() |
![]() |
#3 |
Senior Member
Location: Canada Join Date: Nov 2010
Posts: 124
|
![]()
I think how it works is as follows:
The bowtie.left_kept_reads.fixmap.log file indicates how many sequences were able to align when the entire sequence was used, in your case, 72bp. But since the reads are from RNA, a lot of them span introns and won't align. After this initial alignment, Tophat breaks the sequence down into smaller pieces (25bp by default) and tries to align these, which will let it align additional sequences. If you open up one of the bowtie.left_kept_reads_seg#.fixmap.log files, you should see that the first line, "reads processed", is 2928767, ie. the reads that failed to align in the first attempt. Last edited by biznatch; 12-12-2011 at 08:27 PM. |
![]() |
![]() |
![]() |
#4 |
Senior Member
Location: Phoenix, AZ Join Date: Mar 2010
Posts: 279
|
![]()
Run Samtools flagstat and Picard MappingStats. What you will notice is the difference in the number of mapped reads. Samtools will show more than Picard. This is because samtools tells you how many reads are aligned, while picard will tell you how many unique reads are aligned.
What you are seeing is an irritating result of the tophat alignments. If a read can map to multiple locations it will report each possible alignment. Therefore, if you input 10 reads and 7 have "at least one reported alignment" you can end up with 12 alignment events (ie. 120% mapping, but only 70% uniquely mapped reads) |
![]() |
![]() |
![]() |
#5 |
Senior Member
Location: Canada Join Date: Nov 2010
Posts: 124
|
![]()
I got this command: samtools view accepted_hits.bam | cut -f 1 | sort | uniq | wc -l
from here: http://vallandingham.me/RNA_seq_diff...xpression.html It's supposed to tell you how many unique lines and therefore how many unique reads there are in the bam file. |
![]() |
![]() |
![]() |
#6 | |
Member
Location: Denver, CO Join Date: Mar 2011
Posts: 37
|
![]() Quote:
|
|
![]() |
![]() |
![]() |
#7 |
Senior Member
Location: Sydney, Australia Join Date: Jun 2011
Posts: 166
|
![]()
That command is wrong. What you're extracting is the first column from the BAM file which is a read ID, like GAPC:2:86:13315:7719#0
Every read has it's own ID, so you're not getting unique reads but just all of the reads you had before. |
![]() |
![]() |
![]() |
#8 |
Member
Location: Denver, CO Join Date: Mar 2011
Posts: 37
|
![]()
you're right about each read having a unique name, but reads with the same id will sometimes map multiple times. thus, giving you multiple entries in the bam with the same read name.
|
![]() |
![]() |
![]() |
#9 |
Senior Member
Location: Sydney, Australia Join Date: Jun 2011
Posts: 166
|
![]()
Ah, good point. Thanks.
|
![]() |
![]() |
![]() |
#10 | |
Junior Member
Location: MO Join Date: Aug 2010
Posts: 1
|
![]() Quote:
However, the command does not work for paired-end data. This is because the read1/read2 flag is stripped off in the read ID in the BAM. For example, if the read ID is 'GAPC:2:86:13315:7719#0\1' for read 1 and 'GAPC:2:86:13315:7719#0\2' for read 2 in the fastq files, the read ID in the BAM file will be 'GAPC:2:86:13315:7719#0. Thus, a uniq on the read IDs will erroneously disregard the read1/read2 information. |
|
![]() |
![]() |
![]() |
#11 | |
Member
Location: Denver, CO Join Date: Mar 2011
Posts: 37
|
![]() Quote:
|
|
![]() |
![]() |
![]() |
#12 | |
Member
Location: Netherlands Join Date: Jul 2012
Posts: 10
|
![]() Quote:
I have a running paired end command line. The only thing I can come up with is: tophat -o path/to/file -p 6 --library-type fr-firststrand --b2-very-sensitive --no-coverage-search --GTF file.gtf genome_reference single_read.fastq Thanks in advance! |
|
![]() |
![]() |
![]() |
Thread Tools | |
|
|