Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using HTSeq with Bowtie2 .sam files

    Does anyone here use HTSeq to aggregate read counts per gene for their Bowtie2 generated .SAM files? If not, what other method do you use?

    I have found that PE Illumina reads put through Bowtie2 create a .SAM that breaks HTSeq. It seems that Bowtie2 tries multiple alignments, and only reports the best possible alignment. When HTSeq sees a read's flag variable that indicates it had multiple alignments, it throws a warning and drops the read. Therefore, I believe HTSeq is only reporting counts for uniquely mapping read pairs in Bowtie2 .SAM output.

    Does anyone have any thoughts on this?

  • #2
    Sounds like HTSeq is doing what it was designed to do.

    Comment


    • #3
      That is not accurate, HTSeq is supposed to aggregate read counts based on .SAM file entries. If Bowtie2 decides a particular read is the best alignment, and therefore discards multiple alignments, that single reported alignment is supposed to be counted for expression profiling.

      If you only aggregated reads that map exactly one place to the genome, you would lose >60% of all of your counts. When Bowtie2 reports the mapping efficiency, let's say at 80%, that includes reads that have multiple alignments but only one alignment reported. Purely uniquely mapping reads really account for <40% of a typical sequencing run.

      Comment


      • #4
        [SOLVED]

        Found the problem... It's not with HTSeq, but Bowtie2 itself.

        Bowtie2 is supposed to report the SEQ and QUAL strings for secondary alignments, which HTSeq needs to aggregate counts. There is an option that allows Bowtie2 to suppress this information as asterisks that is accidentally stuck on by default. See my other thread:

        Here.

        Comment


        • #5
          <40% unique? That certainly has not been my experience. Even when working with transcriptomes of known polyploids.

          Comment


          • #6
            Well, you are probably right, ~40% is probably most accurate for polyploid plants, but it really is near 40% in such cases.

            Try running Bowtie2 with -k 4, sorting the .SAM, and separating only those valid alignments with a single mapping positions for the forward and reverse reads. In maize, soybean, Arabidopsis, wheat, etc., I've found the uniquely mapping reads to be the minority.

            What do you typically see for mammals?

            Comment


            • #7
              I work in plants. In Arabidopsis I typically see on the order of 87% of all reads (including unmapped ones) map uniquely. I have seen this same trend across multiple samples.

              For a while I was mapping B. Oleracea transcriptomes to B. Rapa before I got the Oleracea genome. Even then the uniquely mapped reads never dropped below 58%. Now I would not find <40% surprising for wheat, but for Arabidopsis that is far too low.

              But why are you using -k 4 rather than the default where Bowtie 2 finds the best alignment? By using this setting, Bowtie 2 will still report an alignment of lower quality than the best match as long as it is a valid alignment. As a result, you will increase the number of multi-mapping reads even if they are not real.

              Also...I'm assuming you are doing transcriptomes (else why use HTSeq count) in which case, why use Bowtie 2 rather than Tophat 2?

              Comment


              • #8
                I wonder why you see so many more uniquely mapping reads than I do.

                I use -k 4 only when I want to separate uniquely mapping reads, duplicated reads, and multiple mapping reads. With -k 4, Bowtie2 will look for up to 4 alignments. Therefore, I can tell by the .SAM file is something only has a uniquely mapping location, only 2 mapping locations, or more than two. I find this is helpful for detecting duplicated genes by following the duplicated reads (those that map to exactly 2 locations).

                For situations in which I just want to do expression profiling, I don't specify a -k value.

                When analyzing transcriptomes I do use Tophat2, unless I'm specifically looking for gene duplication events, and then I use Bowtie2 and -k 4.

                Got a question for you... you ask why I would use HTSeq when mapping to genomes, but why wouldn't I? I often map illumina reads to a completed genome that is annotated with .gff file. I then use HTSeq while specifying the gene features and aggregate counts per gene.

                Is there a better way to do this?

                Comment


                • #9
                  Originally posted by all_your_base View Post
                  I wonder why you see so many more uniquely mapping reads than I do.

                  I use -k 4 only when I want to separate uniquely mapping reads, duplicated reads, and multiple mapping reads. With -k 4, Bowtie2 will look for up to 4 alignments. Therefore, I can tell by the .SAM file is something only has a uniquely mapping location, only 2 mapping locations, or more than two. I find this is helpful for detecting duplicated genes by following the duplicated reads (those that map to exactly 2 locations).

                  For situations in which I just want to do expression profiling, I don't specify a -k value.

                  When analyzing transcriptomes I do use Tophat2, unless I'm specifically looking for gene duplication events, and then I use Bowtie2 and -k 4.

                  Got a question for you... you ask why I would use HTSeq when mapping to genomes, but why wouldn't I? I often map illumina reads to a completed genome that is annotated with .gff file. I then use HTSeq while specifying the gene features and aggregate counts per gene.

                  Is there a better way to do this?
                  I did not mean to imply that HTseq-count should not be used for counting the counts for a gene feature when mapping BACK to a genome. Rather I was referring to the data, whether it be RNA-seq or something like WGS. I was assuming that you were working with RNA-seq data and so wondering why you would be using Bowtie rather than Tophat to do this.

                  It does make sense to me now why you are using Bowtie 2 with -k 4.

                  But when you do this, do you account for the alignment scores? Because using -k 4 means that Bowtie 2 will report the best match and also a lower quality match as long as it still fits with the other settings.

                  I find it highly unlikely that in the vast majority of cases, even those with say 2 alignments, that both those alignments will have equal alignment scores. I would expect that in the vast majority of cases Bowtie 2 will report an alignment of lower quality than the primary alignment that it would report without the -k 4 setting. This would lead to an overestimation of the real number of reads mapping to multiple locations.

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Essential Discoveries and Tools in Epitranscriptomics
                    by seqadmin




                    The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                    04-22-2024, 07:01 AM
                  • seqadmin
                    Current Approaches to Protein Sequencing
                    by seqadmin


                    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                    04-04-2024, 04:25 PM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 04-11-2024, 12:08 PM
                  0 responses
                  59 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 10:19 PM
                  0 responses
                  57 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 09:21 AM
                  0 responses
                  51 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-04-2024, 09:00 AM
                  0 responses
                  56 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X