  • Identical fragments, different chromosomes, picard MarkDuplicates

    Hi All,

    I have RNA-seq data from ~20 samples (2x72, Solexa), with about 20-25 million fragments per sample.

    When I tried to run Picard's MarkDuplicates, I got this error back:

    Exception in thread "main" java.lang.RuntimeException: SAM validation error: ERROR: Record 2278214, Read name WICMT-SOLEXA_100409_61E8NAAXX:2:17:3572:14759#0, Mate Alignment start (195002931) must be <= reference sequence length (181748087) on reference chr2

    Looking at the read pair that caused this error:
    grep WICMT-SOLEXA_100409_61E8NAAXX:2:17:3572:14759#0 accepted_hits.sam
    WICMT-SOLEXA_100409_61E8NAAXX:2:17:3572:14759#0 113 chr1 195002931 255 72M = 3420320 0 AGAAAAAAATCCACCACCACCACCACCACCAAAAGGAACTACCCCACTGTGATGTAGGGCTGTAGAGGGGGG ###?BBB??'>=/=>2>A/AA7BB9BBBDBEGFEDEDBEDBEEFFCFDEEEEFFEDGGFGGGGGGGGGGGGG NM:i:1
    WICMT-SOLEXA_100409_61E8NAAXX:2:17:3572:14759#0 177 chr2 3420320 255 72M = 195002931 0 TTTTTTTTTTCTTTGAGACAGGGTTTCTCTGTGTAGCCTTGGCTGTCCTGGAACTCACTCTGTAGACCAAGC GDEEEEDEEDGFEFGGGEGGGGGEGFGGGGGGGGGG?GGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGG NM:i:2

    The problem is that I have fragments whose two ends map to different chromosomes. Here that causes an error because the first end maps to position 195002931 on chromosome 1, while chromosome 2, which the second end maps to, is shorter than that.

    Is there a way to tell Picard to accept these alignments? It would be good if the SAM format included the chromosome of the mate as well. Picard does not disregard other non-proper pairs.

    Or should I just not use fragments whose ends map to different chromosomes? How do you usually treat this?

    Thank you,
    Boel

  • #2
    Hi Boel,

    I stumbled over this as well. I think Picard can handle these correctly, but I think there is a bug in TopHat that causes these to be reported incorrectly.

    What I have noticed is that TopHat always uses the '=' symbol for the mate's reference name. So even if the mate maps to a different chromosome, it is still marked as being on the same chromosome. A lot of these could potentially go unnoticed by Picard as long as the position of the mate is less than the chromosome size. However, Picard complains when it (inevitably) encounters a mate position that exceeds the chromosome length.

    Am I correct in observing this?
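
One way to check this observation on a given file is sketched below in Python. It is only an illustration, not part of Picard or TopHat: it assumes a plain-text SAM file whose header carries `@SQ` lines with `SN` and `LN` tags, and the function names `parse_reference_lengths` and `find_out_of_bounds_mates` are made up for this example. It reports every record whose mate position lies beyond the end of the chromosome the record claims for its mate, which is the condition behind the validation error quoted above.

```python
def parse_reference_lengths(sam_lines):
    """Collect reference lengths from the @SQ header lines (SN and LN tags)."""
    lengths = {}
    for line in sam_lines:
        if not line.startswith("@SQ"):
            continue
        fields = line.rstrip("\n").split("\t")[1:]
        tags = dict(field.split(":", 1) for field in fields)
        lengths[tags["SN"]] = int(tags["LN"])
    return lengths


def find_out_of_bounds_mates(sam_lines):
    """Return (read name, claimed mate reference, mate position) for records
    whose mate position is larger than the claimed mate reference's length."""
    sam_lines = list(sam_lines)
    lengths = parse_reference_lengths(sam_lines)
    flagged = []
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and empty lines
        fields = line.rstrip("\n").split("\t")
        qname, rname, rnext, pnext = fields[0], fields[2], fields[6], int(fields[7])
        # '=' means "same reference as this record"
        mate_ref = rname if rnext == "=" else rnext
        if mate_ref in lengths and pnext > lengths[mate_ref]:
            flagged.append((qname, mate_ref, pnext))
    return flagged
```

Note that this only catches the cases where the wrong chromosome happens to be too short. Mates that are mislabelled with '=' but fall within the chromosome length would pass this check unnoticed, just as they do in Picard.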

    Currently I just throw these reads away. Is there a better way to handle it? I suppose it would be possible to sort by read name and repair the mate chromosome for these alignments.

    Overall, it would be great to see better SAM compatibility in TopHat.
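
The repair idea mentioned above (sort by read name, then fix the mate chromosome) could look roughly like the following Python sketch. It rests on several assumptions that are not confirmed anywhere in this thread: the input is plain-text SAM already sorted by read name, each fragment has exactly two mapped records (one per end), and the helper names `_point_at_mate` and `repair_mate_fields` are hypothetical. Groups that do not match these assumptions are passed through unchanged.

```python
from itertools import groupby


def _point_at_mate(rec, mate):
    """Rewrite RNEXT/PNEXT (columns 7 and 8) of rec from the mate's own record."""
    same_ref = rec[2] == mate[2]
    rec[6] = "=" if same_ref else mate[2]
    rec[7] = mate[3]
    if not same_ref:
        rec[8] = "0"  # no meaningful template length across chromosomes


def repair_mate_fields(sam_lines):
    """Yield SAM lines with the mate fields made consistent within each pair.

    Assumes the alignment records are sorted by read name.
    """
    lines = (line.rstrip("\n") for line in sam_lines)
    lines = (line for line in lines if line)

    def read_name(line):
        # Header lines get the key None so they are passed through as a block
        return None if line.startswith("@") else line.split("\t", 1)[0]

    for name, group in groupby(lines, key=read_name):
        if name is None:
            yield from group
            continue
        recs = [line.split("\t") for line in group]
        both_mapped = all(not (int(r[1]) & 0x4) for r in recs)
        if len(recs) == 2 and both_mapped:
            _point_at_mate(recs[0], recs[1])
            _point_at_mate(recs[1], recs[0])
        for rec in recs:
            yield "\t".join(rec)
```

Applied to a name-sorted copy of the file, this would be used as `for line in repair_mate_fields(open("accepted_hits.namesorted.sam")): print(line)`, where the file name is a placeholder. Reads with multiple reported alignments would need more careful pairing than this sketch attempts.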

    • #3
      Hi choy, and thanks for your reply. Your observation seems to be true ('=' given by TopHat despite the ends mapping to different chromosomes). I'll try to correct these errors in my files. It would definitely be great to have TopHat write correct SAM records.
