SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
tophat mapping bug? cedance Bioinformatics 2 11-11-2011 08:56 AM
Tophat - Difference in mapping percentages of the read pairs adarshjose RNA Sequencing 0 12-17-2010 08:30 AM
an error with Tophat mapping zhyd9 Bioinformatics 0 10-04-2010 06:37 AM
TopHat mapping qualities dychiang Bioinformatics 0 10-20-2009 02:44 PM

Reply
 
Thread Tools
Old 12-12-2011, 02:36 AM   #1
linsson
Junior Member
 
Location: rome

Join Date: Mar 2011
Posts: 2
Default tophat: mapping percentages

Hi all.

I have a question about tophat (v > 1.3.1). I am running tophat with 72 single end reads from RNA dataset (Illumina Genome Analyzer was used); I have both a gff3 file and a reference genome. Reading logs/bowtie.left_kept_reads.fixmap.log I have:

# reads processed: 7333307
# reads with at least one reported alignment: 4403978 (60.05%)
# reads that failed to align: 2928767 (39.94%)
# reads with alignments suppressed due to -m: 562 (0.01%)
Reported 4826602 alignments to 1 output stream(s)

So I expected to find no more than 4826602 alignments in accepted_hits.bam.
That is not right: after converting accepted_hits.bam to sam format I have:

wc -l accepted_hits.sam
6121159 accepted_hits.sam

Each line should correspond to an alignment but bowtie report talks about 4826602 alignments. Can anyone tell me, please, what am I missing?

Thanks a lot.
linsson is offline   Reply With Quote
Old 12-12-2011, 08:00 PM   #2
Dario1984
Senior Member
 
Location: Sydney, Australia

Join Date: Jun 2011
Posts: 166
Default

What does samtools view -c accepted_hits.sam tell you ? Counting lines is the wrong way to do it because there are other lines in the SAM file as well.
Dario1984 is offline   Reply With Quote
Old 12-12-2011, 08:14 PM   #3
biznatch
Senior Member
 
Location: Canada

Join Date: Nov 2010
Posts: 124
Default

I think how it works is as follows:

The bowtie.left_kept_reads.fixmap.log file indicates how many sequences were able to align when the entire sequence was used, in your case, 72bp. But since the reads are from RNA, a lot of them span introns and won't align. After this initial alignment, Tophat breaks the sequence down into smaller pieces (25bp by default) and tries to align these, which will let it align additional sequences.

If you open up one of the bowtie.left_kept_reads_seg#.fixmap.log files, you should see that the first line, "reads processed", is 2928767, ie. the reads that failed to align in the first attempt.

Last edited by biznatch; 12-12-2011 at 08:27 PM.
biznatch is offline   Reply With Quote
Old 12-14-2011, 02:56 PM   #4
Jon_Keats
Senior Member
 
Location: Phoenix, AZ

Join Date: Mar 2010
Posts: 279
Default

Run Samtools flagstat and Picard MappingStats. What you will notice is the difference in the number of mapped reads. Samtools will show more than Picard. This is because samtools tells you how many reads are aligned, while picard will tell you how many unique reads are aligned.

What you are seeing is an irritating result of the tophat alignments. If a read can map to multiple locations it will report each possible alignment. Therefore, if you input 10 reads and 7 have "at least one reported alignment" you can end up with 12 alignment events (ie. 120% mapping, but only 70% uniquely mapped reads)
Jon_Keats is offline   Reply With Quote
Old 12-14-2011, 06:49 PM   #5
biznatch
Senior Member
 
Location: Canada

Join Date: Nov 2010
Posts: 124
Default

I got this command: samtools view accepted_hits.bam | cut -f 1 | sort | uniq | wc -l

from here: http://vallandingham.me/RNA_seq_diff...xpression.html

It's supposed to tell you how many unique lines and therefore how many unique reads there are in the bam file.
biznatch is offline   Reply With Quote
Old 12-21-2011, 05:20 PM   #6
jbrwn
Member
 
Location: Denver, CO

Join Date: Mar 2011
Posts: 37
Default

Quote:
Originally Posted by biznatch View Post
I got this command: samtools view accepted_hits.bam | cut -f 1 | sort | uniq | wc -l

from here: http://vallandingham.me/RNA_seq_diff...xpression.html

It's supposed to tell you how many unique lines and therefore how many unique reads there are in the bam file.
right, and for total reads counts the lines of your fastq and divide by 4.
jbrwn is offline   Reply With Quote
Old 12-21-2011, 06:00 PM   #7
Dario1984
Senior Member
 
Location: Sydney, Australia

Join Date: Jun 2011
Posts: 166
Default

That command is wrong. What you're extracting is the first column from the BAM file which is a read ID, like GAPC:2:86:13315:7719#0

Every read has it's own ID, so you're not getting unique reads but just all of the reads you had before.
Dario1984 is offline   Reply With Quote
Old 12-21-2011, 06:03 PM   #8
jbrwn
Member
 
Location: Denver, CO

Join Date: Mar 2011
Posts: 37
Default

Quote:
Originally Posted by Dario1984 View Post
That command is wrong. What you're extracting is the first column from the BAM file which is a read ID, like GAPC:2:86:13315:7719#0

Every read has it's own ID, so you're not getting unique reads but just all of the reads you had before.
you're right about each read having a unique name, but reads with the same id will sometimes map multiple times. thus, giving you multiple entries in the bam with the same read name.
jbrwn is offline   Reply With Quote
Old 12-21-2011, 07:00 PM   #9
Dario1984
Senior Member
 
Location: Sydney, Australia

Join Date: Jun 2011
Posts: 166
Default

Ah, good point. Thanks.
Dario1984 is offline   Reply With Quote
Old 01-13-2012, 12:31 PM   #10
nrockweiler
Junior Member
 
Location: MO

Join Date: Aug 2010
Posts: 1
Default

Quote:
reads with the same id will sometimes map multiple times. thus, giving you multiple entries in the bam with the same read name.
Correct, but a uniq on the list will give you a set of read IDs that mapped at least 1 time. This list will not include reads that did not align. This is because the BAM file from tophat (v1.3.2) only reports aligned reads.

However, the command does not work for paired-end data. This is because the read1/read2 flag is stripped off in the read ID in the BAM. For example, if the read ID is 'GAPC:2:86:13315:7719#0\1' for read 1 and 'GAPC:2:86:13315:7719#0\2' for read 2 in the fastq files, the read ID in the BAM file will be 'GAPC:2:86:13315:7719#0. Thus, a uniq on the read IDs will erroneously disregard the read1/read2 information.
nrockweiler is offline   Reply With Quote
Old 01-13-2012, 12:36 PM   #11
jbrwn
Member
 
Location: Denver, CO

Join Date: Mar 2011
Posts: 37
Default

Quote:
Originally Posted by nrockweiler View Post
However, the command does not work for paired-end data. This is because the read1/read2 flag is stripped off in the read ID in the BAM. For example, if the read ID is 'GAPC:2:86:13315:7719#0\1' for read 1 and 'GAPC:2:86:13315:7719#0\2' for read 2 in the fastq files, the read ID in the BAM file will be 'GAPC:2:86:13315:7719#0. Thus, a uniq on the read IDs will erroneously disregard the read1/read2 information.
it's a good point, but this thread was to answer the original poster's question about single end reads.
jbrwn is offline   Reply With Quote
Old 01-08-2013, 04:39 AM   #12
Tuinhof
Member
 
Location: Netherlands

Join Date: Jul 2012
Posts: 10
Default

Quote:
Originally Posted by linsson View Post
Hi all.

I have a question about tophat (v > 1.3.1). I am running tophat with 72 single end reads from RNA dataset (Illumina Genome Analyzer was used); I have both a gff3 file and a reference genome.
What was your command line for running single end reads?
I have a running paired end command line.

The only thing I can come up with is:

tophat -o path/to/file -p 6 --library-type fr-firststrand --b2-very-sensitive --no-coverage-search --GTF file.gtf genome_reference single_read.fastq

Thanks in advance!
Tuinhof is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:06 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO