Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using htseq-count with paired-end and orphaned reads

    Dear all,

    I am building a pipeline for 100bp paired-end sequencing and I have a question regarding how htseq-count deals with the output from TopHat and how to best map my reads to features prior to using DESeq.

    For each sample I have 2 bam files: one corresponding to alignments of paired-end reads and another containing single reads that didn't have a mate pair. Originally, I merged my 2 bam files, sorted and converted into SAM and ran htseq-count to get gene counts, but I got an "expecting mate pair error". This made sense as single reads coming from merged files lacked mate pairs that htseq-count expected in sorted SAM files.

    My questions is: Is running htseq-count separately on these two files and them combining their outputs for each sample biasing my analysis, or does it render the same results as if htseq allowed runs of files containing both paired-end and single reads?

    I apologize for the long explanation, and thanks for any help you can give me.

  • #2
    What the difference between the library type single and paired, DeSeq

    I am preparing read counts for DeSeq. I have no reference genome for these transcriptomes. I am writing a script to extract counts from sam files produced by bowtie2 https://github.com/i5K-KINBRE-script...anscriptome.pl.

    It seems to me that Deseq assumes each pair counts as one alignment for "paired" data. Bowtie2 and many other packages attempt aligning single reads when the pair fails. I also have reads that have no pairs (one set failed my base quality filters).

    It seems like addition, useful information may be gained using the pairs with a quality issue in only one end or an unexpected insert length. It also seems that feeding the pairing information to Bowtie2 could increase the quality of alignment for some ambiguously mapping reads. It also seems to me DeSeq makes no use of pairs other than to normalize between samples. Does this sound correct to you? If that is the case why label samples "paired" or "single."

    My main questions are:

    1) If I use "paired" in DeSeq should I force Bowtie2 to only report pairs where both align as a pair and divide the counts by 2?

    2) If I include my single end alignments, should I count each single read as +1 and each pair as +1? Should I use "paired" or "single" in DeSeq and why does this setting matter?

    Thanks

    Comment


    • #3
      The "paired" or "single" in the DESeq vignette is just an example of (1) how to deal with multiple factors and (2) that single-end and paired-end libraries can produce slightly different results (likely due to mappability). If you just have one type of library, then there's no reason to include that in your model. As you suspected, both read pairs and single reads count as 1, since they each represent a single original molecule.

      Perhaps one of the DESeq authors has a different opinion, but I would only really worry about including orphaned reads if the rate of them was very different between samples (you'd see if this is causing an issue in a principal components graph or by doing clustering). This would likely indicate a general difference in overall quality between the two

      BTW, your perl script could be replaced by htseq-count, where you make an annotation file wherein each contig is a feature (this has the benefit of not counting multimappers and allowing filtering by MAPQ).

      Comment


      • #4
        DeSeq paired-end and orphaned reads

        That you for your reply. I have no reference genome. I am assuming you mean that I create a gff where each feature is a contig. If this is correct, what software do you use to create the gff from contigs.

        I am adding filtering steps to my script currently:
        1) For Bowtie2 column 5 is map quality.

        2) Column 7 indicates whether the read has a mate and which contig the mate aligns to. Keeping reads aligning with a value other than "=" or "*" in column 7 would remove pairs with reads that align to different contigs.

        3) The optional field "XS:i:<N>" only appears for reads that aligns ambiguously. (Would you recommend filtering a pair if one read aligns ambiguously or only if both align ambiguously?)

        4) I could also filter by by problematic pairing using the optional field "YT:Z:<S>" where a value of UU indicates the read was not part of a pair. Value of CP indicates the read was part of a pair and the pair aligned concordantly. Value of DP indicates the read was part of a pair and the pair aligned discordantly. Value of UP indicates the read was part of a pair but the pair failed to aligned either concordantly or discordantly.

        Any thoughts on the filters listed above.

        Thanks,
        Jennifer
        Last edited by sheltonj; 09-16-2013, 11:12 AM.

        Comment


        • #5
          Yeah, you'd need to make the GFF or GTF file. A GTF file would be easy to make from the header of one of your SAM/BAM files (since you just need the contig names and lengths). It sounds like the additions to your script should take care of most of the issues )

          Regarding (2), that will also filter any read with a secondary alignment (XS score) equivalent to that of the primary alignment and where the primary alignment contains mismatches. Regardless, you'll certainly want to get rid of those!

          Regarding (3), the presence of an XS score actually just means that there's a secondary alignment with a score of at least that specified by --score-min. If AS=0 and XS=-60, I wouldn't call the alignment ambiguous. This is generally why it's better to just filter at a MAPQ (say 10 or 20).

          Regarding (4), I don't recall off-hand what bowtie2's metric is for calling something unique or not, so I can't advise on the utility of this.

          Best of luck,
          Devon

          Comment


          • #6
            Hi thanks,

            I drafted a perl script to count paired end reads and broken pairs that align to de novo transcriptomes. The script takes Bowtie2 "best match" sam files as input.

            The script "count_reads_aligned_to_de_novo_transcriptome.pl" is briefly described at the bottom of this README here https://github.com/i5K-KINBRE-script...ster/README.md.

            The script is available here https://github.com/i5K-KINBRE-script...anscriptome.pl

            Comments, questions, or suggestions would be appreciated

            -Jennifer

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Current Approaches to Protein Sequencing
              by seqadmin


              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
              04-04-2024, 04:25 PM
            • seqadmin
              Strategies for Sequencing Challenging Samples
              by seqadmin


              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
              03-22-2024, 06:39 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, 04-11-2024, 12:08 PM
            0 responses
            25 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 10:19 PM
            0 responses
            28 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 09:21 AM
            0 responses
            24 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-04-2024, 09:00 AM
            0 responses
            52 views
            0 likes
            Last Post seqadmin  
            Working...
            X