**chadn737** · 10-31-2012, 10:56 AM

Sounds like HTSeq is doing what it was designed to do.

**all_your_base** · 10-31-2012, 11:01 AM

That is not accurate. HTSeq is supposed to aggregate read counts based on .SAM file entries. If Bowtie2 decides a particular alignment is the best one for a read, and therefore discards the other alignments, that single reported alignment is supposed to be counted for expression profiling.

If you only aggregated reads that map to exactly one place in the genome, you would lose >60% of all of your counts. When Bowtie2 reports the mapping efficiency, let's say at 80%, that includes reads that have multiple alignments but only one alignment reported. Purely uniquely mapping reads really account for <40% of a typical sequencing run.

**all_your_base** · 10-31-2012, 11:32 AM

[SOLVED]

Found the problem... It's not with HTSeq, but Bowtie2 itself.

Bowtie2 is supposed to report the SEQ and QUAL strings for secondary alignments, which HTSeq needs in order to aggregate counts. There is an option that makes Bowtie2 suppress this information and print asterisks instead, and it is accidentally stuck on by default. See my other thread:

Here.
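For anyone who wants to check whether their own .SAM file is affected, here is a minimal sketch in plain Python. It assumes the standard SAM column layout (FLAG in column 2, SEQ in column 10, QUAL in column 11) and that secondary alignments carry the 0x100 flag bit. The function name `suppressed_secondary_records` is made up for illustration and is not part of Bowtie2 or HTSeq.

```python
def suppressed_secondary_records(sam_lines):
    """Yield (read_name, flag) for secondary alignments whose SEQ and QUAL are both '*'.

    Hypothetical helper: it only inspects the text of each SAM record.
    """
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        is_secondary = bool(flag & 0x100)
        # Columns 10 and 11 (indices 9 and 10) hold SEQ and QUAL.
        if is_secondary and fields[9] == "*" and fields[10] == "*":
            yield fields[0], flag
```

Passing an open file handle works as well as a list of lines, e.g. `with open("aln.sam") as fh: hits = list(suppressed_secondary_records(fh))`, where `aln.sam` is a placeholder file name.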

**chadn737** · 10-31-2012, 11:46 AM

<40% unique? That certainly has not been my experience. Even when working with transcriptomes of known polyploids.

**all_your_base** · 10-31-2012, 11:49 AM

Well, you are probably right. ~40% is probably most accurate for polyploid plants, but it really is near 40% in such cases.

Try running Bowtie2 with -k 4, sorting the .SAM, and separating out only those valid alignments with a single mapping position for the forward and reverse reads. In maize, soybean, Arabidopsis, wheat, etc., I've found the uniquely mapping reads to be the minority.

What do you typically see for mammals?

**chadn737** · 10-31-2012, 12:04 PM

I work in plants. In Arabidopsis I typically see on the order of 87% of all reads (including unmapped ones) map uniquely. I have seen this same trend across multiple samples.

For a while I was mapping B. oleracea transcriptomes to B. rapa before I got the B. oleracea genome. Even then the uniquely mapped reads never dropped below 58%. Now I would not find <40% surprising for wheat, but for Arabidopsis that is far too low.

But why are you using -k 4 rather than the default where Bowtie 2 finds the best alignment? By using this setting, Bowtie 2 will still report an alignment of lower quality than the best match as long as it is a valid alignment. As a result, you will increase the number of multi-mapping reads even if they are not real.

Also... I'm assuming you are working with transcriptomes (otherwise why use htseq-count?), in which case why use Bowtie 2 rather than Tophat 2?

**all_your_base** · 10-31-2012, 12:23 PM

I wonder why you see so many more uniquely mapping reads than I do.

I use -k 4 only when I want to separate uniquely mapping reads, duplicated reads, and multiple mapping reads. With -k 4, Bowtie2 will look for up to 4 alignments. Therefore, I can tell from the .SAM file whether a read has only one mapping location, exactly 2 mapping locations, or more than two. I find this is helpful for detecting duplicated genes by following the duplicated reads (those that map to exactly 2 locations).
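A rough sketch of that separation in plain Python, assuming a -k 4 style .SAM file in which every reported alignment of a read appears as its own record. Reads are keyed by name plus mate number (taken from the 0x40 and 0x80 flag bits), and unmapped records (0x4) are skipped. The function name and the three labels are illustrative choices, not anything Bowtie2 or HTSeq defines.

```python
from collections import Counter


def classify_reads(sam_lines):
    """Label each read 'unique', 'duplicated' or 'multiple' by its number of reported alignments."""
    hits = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:
            continue  # read is unmapped, no alignment to count
        # Mate number: 1 for first in pair, 2 for second, 0 for single-end reads.
        mate = 1 if flag & 0x40 else 2 if flag & 0x80 else 0
        hits[(fields[0], mate)] += 1

    labels = {}
    for key, n in hits.items():
        if n == 1:
            labels[key] = "unique"
        elif n == 2:
            labels[key] = "duplicated"
        else:
            labels[key] = "multiple"  # 3 or more reported alignments
    return labels
```

Because -k 4 caps the number of reported alignments at 4, the "multiple" class only says that a read has at least 3 reported locations, not how many it has in total.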

For situations in which I just want to do expression profiling, I don't specify a -k value.

When analyzing transcriptomes I do use Tophat2, unless I'm specifically looking for gene duplication events, and then I use Bowtie2 and -k 4.

Got a question for you... you ask why I would use HTSeq when mapping to genomes, but why wouldn't I? I often map Illumina reads to a completed genome that is annotated with a .gff file. I then use HTSeq while specifying the gene features and aggregate counts per gene.

Is there a better way to do this?
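To make the per-gene aggregation described above concrete, here is a toy sketch in plain Python. It is not HTSeq's implementation: it assumes alignments and genes are already reduced to simple (chromosome, start, end) intervals with 1-based inclusive coordinates, ignores strand and spliced reads, and counts a read for a gene only when it overlaps exactly one gene. The gene names in the usage example and the `ambiguous` / `no_feature` labels are placeholders chosen for illustration.

```python
from collections import Counter


def count_per_gene(alignments, genes):
    """Count alignments per gene.

    alignments: iterable of (chrom, start, end) tuples
    genes: dict mapping gene_id -> (chrom, start, end)
    Coordinates are assumed to be 1-based and inclusive.
    """
    counts = Counter()
    for chrom, start, end in alignments:
        # Collect every gene on the same chromosome whose interval overlaps the alignment.
        overlapping = [
            gene_id
            for gene_id, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start <= g_end and end >= g_start
        ]
        if len(overlapping) == 1:
            counts[overlapping[0]] += 1
        elif overlapping:
            counts["ambiguous"] += 1   # overlaps more than one gene
        else:
            counts["no_feature"] += 1  # overlaps no annotated gene
    return counts
```

For example, `count_per_gene([("chr1", 110, 150)], {"geneA": ("chr1", 100, 200)})` counts one read for the hypothetical `geneA`.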

**chadn737** · 10-31-2012, 12:39 PM

I did not mean to imply that htseq-count should not be used for aggregating the counts for a gene feature when mapping BACK to a genome. Rather, I was referring to the type of data, whether it be RNA-seq or something like WGS. I was assuming that you were working with RNA-seq data and so was wondering why you would be using Bowtie rather than Tophat to do this.

It does make sense to me now why you are using Bowtie 2 with -k 4.

But when you do this, do you account for the alignment scores? Because using -k 4 means that Bowtie 2 will report the best match and also a lower quality match as long as it still fits with the other settings.

I find it highly unlikely that, in the vast majority of cases, even for reads with say 2 alignments, both alignments will have equal alignment scores. I would expect that in most cases the extra alignment Bowtie 2 reports is of lower quality than the primary alignment it would report without the -k 4 setting. This would lead to an overestimation of the real number of reads mapping to multiple locations.
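One way to check this on a real file is to compare the alignment scores of each read's reported alignments. The sketch below is plain Python and assumes each record carries the `AS:i:` optional field that Bowtie 2 writes for aligned reads, with higher scores being better. It reports, per read, how many alignments tie with the best score. The function name is hypothetical, and reads are keyed by name plus mate number as an assumption about how one would want to group paired-end records.

```python
from collections import defaultdict


def best_score_ties(sam_lines):
    """Return {(read_name, mate): number of alignments whose AS:i score equals the read's best score}."""
    scores = defaultdict(list)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:
            continue  # unmapped reads have no alignment score
        mate = 1 if flag & 0x40 else 2 if flag & 0x80 else 0
        # Optional fields start at column 12 (index 11).
        for tag in fields[11:]:
            if tag.startswith("AS:i:"):
                scores[(fields[0], mate)].append(int(tag[5:]))
                break

    ties = {}
    for key, values in scores.items():
        best = max(values)
        ties[key] = sum(1 for value in values if value == best)
    return ties
```

A read with a tie count of 1 has a single best alignment even if -k 4 reported several locations for it, so only reads with a tie count of 2 or more would be treated as equally good multi-mappers under this criterion.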

Using HTSeq with Bowtie2 .sam files
