  • High discordant alignments

    I've set up a Galaxy workflow for paired-end, first-strand RNA-seq, and I've gotten some odd summary results from the TopHat2 alignment. At least I think they're odd, as I'm new to this.

    Left reads:
    Input : 218685181
    Mapped : 193500858 (88.5% of input)
    of these: 14727362 ( 7.6%) have multiple alignments (40016 have >20)
    Right reads:
    Input : 218685181
    Mapped : 196263585 (89.7% of input)
    of these: 14724480 ( 7.5%) have multiple alignments (40380 have >20)
    Unpaired reads:
    Input : 5950944
    Mapped : 5300035 (89.1% of input)
    of these: 227937 ( 4.3%) have multiple alignments (142 have >20)
    89.1% overall read mapping rate.

    Aligned pairs: 173668750
    of these: 13863688 ( 8.0%) have multiple alignments
    170432898 (98.1%) are discordant alignments
    1.5% concordant pair alignment rate.
    Here's the flagstat output:


    490744296 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    490744296 + 0 mapped (100.00%:-nan%)
    486148534 + 0 paired in sequencing
    241299292 + 0 read1
    244849242 + 0 read2
    523372 + 0 properly paired (0.11%:-nan%)
    443477134 + 0 with itself and mate mapped
    42671400 + 0 singletons (8.78%:-nan%)
    418612688 + 0 with mate mapped to a different chr
    312416516 + 0 with mate mapped to a different chr (mapQ>=5)
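As a quick sanity check on these counts, the percentages can be recomputed directly. This is only a sketch: the variable names and the `pct` helper are made up for illustration, and the last ratio (mates on a different chromosome among pairs with both mates mapped) is derived here rather than printed by flagstat. It comes out at roughly 94%, which is far higher than one would expect from a correctly paired library.

```python
# Counts copied from the flagstat output above.
paired_in_sequencing = 486_148_534
properly_paired = 523_372
both_mates_mapped = 443_477_134
singletons = 42_671_400
mate_on_other_chr = 418_612_688


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


print("properly paired:", pct(properly_paired, paired_in_sequencing), "%")
print("singletons:", pct(singletons, paired_in_sequencing), "%")
print(
    "mate on a different chr (of pairs with both mates mapped):",
    pct(mate_on_other_chr, both_mates_mapped),
    "%",
)
```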
    For the number of reads mapped, the number of concordant pairs seems extremely low. I'm wondering if I missed a parameter in TopHat or Bowtie? Notably, I have not set a read group identifier in Bowtie (is that necessary?), nor could I figure out how to from the Bowtie documentation. I also wonder if something could be awry with my FASTQ files, as they have been concatenated from a split dataset. Here are the first few reads from the forward and reverse data, respectively.

    @HW-ST997:217:C3KKGACXX:4:1101:1432:2038 1:N:0:TGACCA
    TTCATCTTTAGATAATGAATTATATCCAAGATCAGACTGGCCACCTGTACTAGATCTATCATCAGTAGCATATACTTTGATTAAACCCG
    +
    FF00B<<FFFFFFBBFFFBFIFBBF0BBFFFFBFFFFIF<FFF<FBFF7BBBB<<B<''<B<BBB<<BBBBBFFFBBF<<B<7B7<BBB
    @HW-ST997:217:C3KKGACXX:4:1101:1474:2051 1:N:0:TGACCA
    GAGGGAGTATAGGGCTGTGACTAGTATGTTGAGTCCTGTAAGTAGGAGAGTGATATTTGATCAGGAGAACGTGGTTACTAGCACAGAGA
    +
    FIFIIBFBBFFFIIFFFFFFFFFFFBFFIIIFFFIIIFFFFFFFFFBF<BBBBF0BFFFBFFBFFFFFFFBFBFBFB<BBBBBBBBBFB
    @HW-ST997:217:C3KKGACXX:4:1101:1451:2106 1:N:0:TGACCA
    ACTGGGAAACGTTCACGCTGGGTCCAGCATTTGCCATGGACAAGATGCCAGGACCCGTATGCTTCAGGATGAAGTTCTTGTCATCAAAT
    +
    FIIFFBBFFFFFFBB7<7BBFFF77BBFFIFFFIFBFFFIFFIIF<B<0<BB7BBBBB<BBBBBBBB0BBBB0<7<BBBB0'0B<B<BB




    @HW-ST997:217:C3KKGACXX:4:1101:1452:2018 2:N:0:TGACCA
    TTACCCCCATACTCCTTACACTATTCCTCATCANCCNACTAAAAATATTAAACACAAACTACCACCTACCTCCCTCACCAAAGCCCATA
    +
    FFFFFFFF7FFFIIIIIFFFFFFFIIFFFFFFB#0B#07<FFFIFFFFIFBFFIFFFFFFFFBFF<BB<BFFFFB<BBBBBFBFFB<BB
    @HW-ST997:217:C3KKGACXX:4:1101:1474:2051 2:N:0:TGACCA
    AGTCATTCTCATAATCGCCCACGGGCTTACATCNTCNTTACTATTCTGCCTAGCAAACTCAAACTACGAACGCACTCACAGTCGCATCA
    +
    FFFIIFFFIIFIIFFBFBFFFIIIIFFFIFFFF#0<#07<BBFFFBBFBFFBBFFFFFBFFFFFFFFFFFFFBBBBFFBFFBBBFBBFB
    @HW-ST997:217:C3KKGACXX:4:1101:1409:2234 2:N:0:TGACCA
    ATCTCAGAAAAGAAGACATGGAATATGCCCTGNNTANACTGGATGACACCAAATTCCGCTCTCATGAGGGTGAAACTTCCTACATCCGA
    +
    <BFFFIFFIIIBBFFBFBBFFFFF7FFFFFII##07#07BFFBFFBFFFIFFFBF7BBFFBBBBBBB<BB0<B<'7<BBBBBBBBBBB<
    Thanks in advance for any help!

    -Jeremy

  • #2
    What options did you use when running TopHat/Bowtie?
    Since you are using stranded data, you might want to check the '--library-type' option.



    • #3
      Thanks for the response, yueluo. I ran it through a Galaxy wrapper, but I selected the first-strand option, so the wrapper should be passing that option on to TopHat. I just spoke with a colleague who informed me that my paired-end reads appear to be out of order.

      For instance:

      Read1-forward:
      1101:1432:2038 1:N:0:TGACCA
      Read1-reverse:
      1101:1452:2018 2:N:0:TGACCA

      This may have happened when I concatenated the files, or it might just be how I received the sequencing data. Do you have any ideas about how I can re-sort by coordinates?
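One way to confirm whether two paired FASTQ files are in sync is to walk both files record by record and compare read IDs. The following is a minimal Python sketch, not taken from any particular tool. It assumes strict 4-line FASTQ records (no wrapped sequences, no blank lines) and that the mate number appears either after a space (as in the headers above) or as a trailing /1 or /2. The function names `read_ids` and `first_desync` are hypothetical.

```python
from itertools import zip_longest


def read_ids(lines):
    """Yield the read ID from each 4-line FASTQ record, without mate info."""
    for i, line in enumerate(lines):
        if i % 4 == 0:  # header line of the record
            name = line.strip().split()[0][1:]  # drop '@' and the part after the space
            if name.endswith(("/1", "/2")):  # older-style mate suffix
                name = name[:-2]
            yield name


def first_desync(r1_lines, r2_lines):
    """Return (record_index, id1, id2) for the first mismatch, or None if in sync.

    A missing ID (None) means one file ran out of records before the other.
    """
    pairs = zip_longest(read_ids(r1_lines), read_ids(r2_lines))
    for n, (id1, id2) in enumerate(pairs):
        if id1 != id2:
            return n, id1, id2
    return None
```

With real data this could be called as `first_desync(open("reads_1.fastq"), open("reads_2.fastq"))`, where both file names are placeholders. For the headers shown in the first post it would report a mismatch at the very first record (1101:1432:2038 versus 1101:1452:2018).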



      • #4
        I suggest you go back to the raw files, and map them without modifying them in any way. If you want to merge multiple datasets, you can do that after you have the sam/bam files.



        • #5
          I suggest you go back to the raw files, and map them without modifying them in any way. If you want to merge multiple datasets, you can do that after you have the sam/bam files.
          After looking into this some more, I'm not sure there is a way to feed multiple files into the Galaxy TopHat2 wrapper. Fortunately, it looks like they have a tool specifically for combining paired-end read files (which I swear I looked for before). We'll see if this works. As a backup, we'll run another instance of TopHat2 from the command line.

          You suggest not modifying them in any way. Does this include trimming/clipping and other QC measures? I am worried about this as it seems that if a read has enough low scoring bases, then it might be cut from say the forward file but not the reverse, leading again to misalignment.



          • #6
            Originally posted by reventropy
            You suggest not modifying them in any way. Does this include trimming/clipping and other QC measures? I am worried about this as it seems that if a read has enough low scoring bases, then it might be cut from say the forward file but not the reverse, leading again to misalignment.
            That's exactly why I made the suggestion; there are a lot of poorly-written tools that break read pairing, and that's usually the culprit.

            If you need to do quality or adapter trimming, I can suggest BBDuk, which is made to handle single or paired files, keeping reads together. It's extremely fast and uses a better quality-trimming algorithm than most alternatives, as well as being more sensitive in adapter trimming (you can specify the number of mismatches allowed). You can also use it for contaminant removal (phiX, E. coli, various spike-ins or vectors).
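To illustrate what "keeping reads together" means in practice, here is a minimal Python sketch of pair-aware filtering. It is not BBDuk's implementation; it only shows the general idea that a length (or quality) filter is applied to both mates at once and the pair is dropped as a unit, so the two output files stay in the same order. The function name `filter_pairs`, the `(name, seq, qual)` record layout, and the `min_len` threshold are all assumptions made for this example.

```python
def filter_pairs(r1_records, r2_records, min_len=25):
    """Yield (rec1, rec2) only when BOTH mates are at least min_len bases long.

    Each record is a (name, seq, qual) tuple, and the two inputs must already
    be in the same order. Dropping the whole pair when either mate fails keeps
    the forward and reverse outputs synchronized.
    """
    for rec1, rec2 in zip(r1_records, r2_records):
        if len(rec1[1]) >= min_len and len(rec2[1]) >= min_len:
            yield rec1, rec2
```

A filter that instead tested each file independently would drop a short forward read while keeping its reverse mate, which shifts every later record in one file relative to the other and produces exactly the kind of mismatched pairs discussed above.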
