SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
htseq-count only not-unique counts moser Bioinformatics 7 08-09-2013 09:28 AM
[HTSeq-Count] How are reads counted? syintel87 Bioinformatics 0 06-17-2013 02:18 PM
Count unique reads in a FASTQ file id0 Bioinformatics 2 10-24-2012 02:01 AM
Reads Count - HTSeq alternative? NGSAwesome RNA Sequencing 1 10-15-2012 10:13 AM
htseq-count gets more reads? deepsea Bioinformatics 3 03-29-2012 12:27 PM

Reply
 
Thread Tools
Old 08-19-2013, 06:51 AM   #1
ThePresident
Member
 
Location: Sherbrooke / Canada

Join Date: Jun 2012
Posts: 72
Default HTseq-count : how to get unique reads?

Hello,

I used HTseq-count to get number of reads / gene from my RNA-seq data.

The problem is that the sum of all counted reads (on every gene) is much bigger then the total sequencing depth I got from Illumina. After reading other posts, this could be due to the "duplicate" alignemnt... Anyway, my question is: how to get the true number of reads for some genes (in my case rRNA) in order to see what percentage of my total sequencing depth is actually "lost" on rRNA?

Thanks in advance for any hints,

TP
ThePresident is offline   Reply With Quote
Old 08-19-2013, 07:25 AM   #2
ThePresident
Member
 
Location: Sherbrooke / Canada

Join Date: Jun 2012
Posts: 72
Default

Perhaps, I addressed the question in the wrong way. Simply, i just wanna know how to get the Ribosomal RNA contamination in each of my libraries? I want to be able to say that 1E07 reads out of 1E09 are mapping to rRNA, so my samples are "contaminated" by 1% rRNA. See?

Thanks
ThePresident is offline   Reply With Quote
Old 08-19-2013, 08:19 AM   #3
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,091
Default

This will require additional work but could get rid of the contamination in the process.
http://seqanswers.com/forums/showthread.php?t=26960 Use the rRNA sequence for the species you are working with.
GenoMax is offline   Reply With Quote
Old 08-19-2013, 11:35 AM   #4
ThePresident
Member
 
Location: Sherbrooke / Canada

Join Date: Jun 2012
Posts: 72
Default

Yeah I saw that thread... I was hoping for faster solution, but well
Thanks
ThePresident is offline   Reply With Quote
Old 08-19-2013, 12:19 PM   #5
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

htseq-count should identify the number of multimapped reads and include this number in its own category so that the total number should not exceed your starting number of reads....unless, whatever you mapped the reads with does not output a NH:i field in the SAM file.

What did you use to map the reads?

Are you specifically interested in only rRNA reads...not everything that maps multiple times will be an rRNA read.
chadn737 is offline   Reply With Quote
Old 08-19-2013, 04:28 PM   #6
ThePresident
Member
 
Location: Sherbrooke / Canada

Join Date: Jun 2012
Posts: 72
Default

Quote:
Originally Posted by chadn737 View Post
htseq-count should identify the number of multimapped reads and include this number in its own category so that the total number should not exceed your starting number of reads....unless, whatever you mapped the reads with does not output a NH:i field in the SAM file.

What did you use to map the reads?

Are you specifically interested in only rRNA reads...not everything that maps multiple times will be an rRNA read.
I used bowtie for mapping. After that, I used Htseq-count to get number of reads/gene. If I sum all the reads for one replicate from Htseq table I get like double of what I'm supposed to have. I guess it's because of all multimapped reads. I will try to align my libraries only to rRNA, that should give me number of reads mapping there.

TP
ThePresident is offline   Reply With Quote
Old 08-19-2013, 04:36 PM   #7
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

Quote:
Originally Posted by ThePresident View Post
I used bowtie for mapping. After that, I used Htseq-count to get number of reads/gene. If I sum all the reads for one replicate from Htseq table I get like double of what I'm supposed to have. I guess it's because of all multimapped reads. I will try to align my libraries only to rRNA, that should give me number of reads mapping there.

TP
Bowtie does not output a NH:i field in the SAM file and so htseq-count will not be able to identify multimapping reads.

Furthermore, Bowtie is not ideally suited for RNA-seq data. You should use Tophat instead, Tophat puts out a NH:i field, and so then your htseq-count results will make sense. I would not use htseq-count on data aligned with bowtie unless multi-mapped reads were first removed or identified. Otherwise, your read counts for each gene will be incorrect. Furthermore, bowtie is not a spliced aligner and will preferentially align to the best match, even if its a pseudogene. With tophat, you can deal with splicing. Unless you have good reason to use bowtie, I would realign everything with tophat instead and problem solved.

Also, as I said previously, not all multimapping reads will be due to rRNA, so if you are trying to identify/eliminate multimapping reads, aligning to rRNA does not solve your problem.

Depending on whether you used bowtie or bowtie 2 (and if bowtie 2, what mode you ran it in) it is possible to remove all multimapping reads and get the unique reads only by filtering on quality scores.
chadn737 is offline   Reply With Quote
Old 08-19-2013, 04:45 PM   #8
ThePresident
Member
 
Location: Sherbrooke / Canada

Join Date: Jun 2012
Posts: 72
Default

Well, I'm dealing with bacterial RNA-seq data, so a spliced aligner such as Tophat was irrelevant in this case. However, I didn't realize that bowtie is messing up with multimapped reads

I think that overall my alignment was not so bad, 'cause many genes identified as transcriptionally altered in my RNA-seq were confirmed by qPCR. However, I might try tophat as well.

About rRNA, it's just that I want to present some data in a paper I'm working on, and would like to give the number of reads that aligned on rRNA in my libraries. That way, it would give the audience some idea about abundance of "real" transcripts in my libraries.
ThePresident is offline   Reply With Quote
Old 08-19-2013, 04:50 PM   #9
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

Quote:
Originally Posted by ThePresident View Post
Well, I'm dealing with bacterial RNA-seq data, so a spliced aligner such as Tophat was irrelevant in this case. However, I didn't realize that bowtie is messing up with multimapped reads

I think that overall my alignment was not so bad, 'cause many genes identified as transcriptionally altered in my RNA-seq were confirmed by qPCR. However, I might try tophat as well.

About rRNA, it's just that I want to present some data in a paper I'm working on, and would like to give the number of reads that aligned on rRNA in my libraries. That way, it would give the audience some idea about abundance of "real" transcripts in my libraries.
There seems no reason to use tophat then.

What version of bowtie did you use and how did you use it. If you want to present data on rRNA, you can align specifically to these, however, to quickly identify uniquely mapped reads, you can filter based on quality score depending on what version of bowtie you used and how you ran it.
chadn737 is offline   Reply With Quote
Old 08-19-2013, 04:55 PM   #10
ThePresident
Member
 
Location: Sherbrooke / Canada

Join Date: Jun 2012
Posts: 72
Default

I used bowtie (not bowtie2) with following command line:

Code:
bowtie -q -a --best m50 p6 -t index 6_Index-1.WT_1_R1.fastq -S AlignWT1
So, I used -a for bowtie to report all alignments and --best to give only best hits.

Now, how you would filter for quality scores with bowtie?
ThePresident is offline   Reply With Quote
Old 08-19-2013, 05:01 PM   #11
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

Quote:
Originally Posted by ThePresident View Post
I used bowtie (not bowtie2) with following command line:

Code:
bowtie -q -a --best m50 p6 -t index 6_Index-1.WT_1_R1.fastq -S AlignWT1
So, I used -a for bowtie to report all alignments and --best to give only best hits.

Now, how you would filter for quality scores with bowtie?
By default, bowtie should assign a mapping quality of 255 to reads that map once. You can filter using samtools view:

Code:
samtools view -hb -q 255 input.bam > output.bam
Your output should have only uniquely aligned reads.
chadn737 is offline   Reply With Quote
Old 08-19-2013, 05:11 PM   #12
ThePresident
Member
 
Location: Sherbrooke / Canada

Join Date: Jun 2012
Posts: 72
Default

Thanks, I'll try that instead!
ThePresident is offline   Reply With Quote
Reply

Tags
htseq count

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:30 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO