  • SAM/BAM sort by read names produces truncated read names

    Hi,

    I tried to sort the alignment file by read name, but the output contains truncated read names. This happens no matter which program I use: SAMtools sort (0.1.8), Picard SortSam (1.77), or Novosort (2.08).

    Here are the first few records of the original SAM file:
    Code:
    HWI-ST621:415:D197AACXX:8:1101:1        113     chr2    236798427       70      100M1S  chr8    3088040 0       ACCTCTGTTTCTAAGCAGTGGAATAGAATTGCTTATGGAATAGCCAGGTCATAGGATGTNATAANTTCCCTGGAAATCAGAGGGGAAAAGAAGCAAAACAN   C@?>?AC@:C@>CECDEE@ACFEBFFDEEHECDACADHFHFEHIJGJIGIHJJIHDB80#HF?1#GDJIHCIGGHHAIIIJJHEHJJIHHHHHFFFDD=1#        PG:Z:novoalign  RG:Z:LS148      AS:i:18 UQ:i:18 NM:i:2  MD:Z:59G4T35
    HWI-ST621:415:D197AACXX:8:1101:1        177     chr8    3088040 70      101M    chr2    236798427       0       AAATACATACATACACACAGACTGATTTTCTCTTCAGCAATATTTTAATGAAACCCCATACTGCAAATTACATAAACTAGTTAAAGTACACCAACCTCAAG   DEEDDDFDCEECEEDDBFFFDHHHFGHECJJIHFJJJIJJJIJHHGIHGDDGGJJJIIHGHIJJJIIJIGJJIIIFIIJJJJJJIIHFFAHHHDFFDFCCB        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST621:415:D197AACXX:8:1101:1223:2124        83      chr8    143208201       70      100M1S  =       143207998       -303    CGCTGAGAGCAAGGTGCCAGCAGGGTGGGCCCTTCTGGAGGCTCCGGCCGGGATCTGTTCCAGGCCACCCCCGCCTTCCGGCCATCCTCAGCTTGGCTCCN   >@CA>A:A>>>3(CA<AACDDDB<<?3?@9?CDCDCBCC?7<BBDBB@<93?DCCAA8<B?A<<DB7DCIGGBHGAHIIHFJJIEJIIHHHHHFFFDD=1#        PG:Z:novoalign  RG:Z:LS148      AS:i:47 UQ:i:47 NM:i:1  MD:Z:6C93       PQ:i:59 SM:i:70 AM:i:70
    HWI-ST621:415:D197AACXX:8:1101:1223:2124        163     chr8    143207998       70      92M     =       143208201       303     TTGTGGAGTCAGGTGTCCCTGGGGTCACGGTGACTGGCCAGGCGNGGGGAGCCAGGAGGCACACGGTCCTGGGCTCTNGCAGGGCTGGAGTG    @BBDFFADD?FHH@@EGGGGIIII@BCGHG8?DGHGB@FHHGAG#-<CC;@E?ACEE?B7?BCA?B;?BDDCB9??A#++28?B?B@B1<>A PG:Z:novoalign  RG:Z:LS148      AS:i:12 UQ:i:12 NM:i:2  MD:Z:44C32G14   PQ:i:59 SM:i:70 AM:i:70
    HWI-ST621:415:D197AACXX:8:1101:14       65      chr6    74783346        70      1S100M  chr1    1867309 0       NGATTAAGCAGCCAAGCTGTATCCTGAGGGAAACATGGGCAATGGAAAGCATCAGATTTCCTGGGTCAAAGCTATCCTGAGCTCAGGCACTGGGCTAACTG   #4=DFFFFGHHHHJJJJJJIJJJJJJJJJJGHIJIHIIJIGIIJJBFHIIIJJJJDIJJIHHIJJIGGHHHHHFFFFFFEDEEEEDDD@DDDDDDCDCDDD        PG:Z:novoalign  RG:Z:LS148      AS:i:6  UQ:i:6  NM:i:0  MD:Z:100
    HWI-ST621:415:D197AACXX:8:1101:14       129     chr1    1867309 70      101M    chr6    74783346        0       ACACACACACACACACACGAACTGCAGGGGGCTCTGGAGCCATGGAGTTAGAAAAGCTCTCTGAGAGGCCAGGTGTAGTGGCTCATGCCTGTAATCCCAGC   CCCFDFFFHHHHGJJJIJJIJJJJJFHIJIJFHIJJJDHEHHHHG@D?BDACCEDCBDDDDDDCDDDDBDBDB@CCCCCCBDDCCC@ACAC@>AB>CCACD        PG:Z:novoalign  RG:Z:LS148      AS:i:30 UQ:i:30 NM:i:1  MD:Z:68T32
    HWI-ST621:415:D197AACXX:8:1101:14       97      chr2    62756955        70      1S100M  chr6    74783591        0       NGTGCTGTTTGGTTTGTGTGTATTATATGGGTTTGGATTACAATAATTCCTCCCTTTTGTATAATGTTTTGCAGTTTTTAAAGCACTTCATGCTCTAAATC   #1=DDFFDHHGGFHIIHHEHGFGIDHHIIIIFGIIICGGEHHHIIIII>GGGIIIIIIIICFGHHGGHIIIIDAAEHHHEBDDFCEEECCDCCCCCC>ACC        PG:Z:novoalign  RG:Z:LS148      AS:i:6  UQ:i:6  NM:i:0  MD:Z:100
    HWI-ST621:415:D197AACXX:8:1101:14       145     chr6    74783591        70      101M    chr2    62756955        0       ATTTTTGTAAGTCACCAATGGTTGGATGTTGGCAGTTTCATAAGGTTCATTCTAATAGTTCCTGGGACACAAATGACTCGAAGTAGGTCAAGACAGGTTCA   <DDDDDDDDDDDDEEECCFDFFGHEHGJJIIJJJJIGHIHCIIIJIGCGIIGDIHEIIHGIJJJJIHIIJJIIHGBHHJIJJJJJJJJHHHFHFFFFD?C@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST621:415:D197AACXX:8:1101:1        81      chr1    155944063       70      101M    chr11   19838477        0       CAGCTGTACCTGGCAGCAGCCCCTTCCCCAAGATGGTGACACCTCTGTCCACACCCTCTGTAATAGTGACCGGAGAGCCTGTGGAGCATTCCACCAGGATT   DDDEDAA:BCAA:DD@BDDDDB?@=BDEDEEDFFFD@;??=HHIIIIGJIHF<JIHFGBIHIJIIIIIHJJJJJJJIJJJJIJJJJJIHHHHHFFFFFCC@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST621:415:D197AACXX:8:1101:1        161     chr11   19838477        70      101M    chr1    155944063       0       AGCCCCTTATGCAGAAAAAGGGACTCCACCTGGAGCCCTCTCTGGATCTACTTCTCCCAGATAAATCAGTCGGCTGTGTAATCTTTCAGGAAACCTGACCC   ??<DDFFFFHHDDDHIGDDAFE9FFGHGCHEGG9FGGHGGGGCFHBF*0BBCBGGE@GHGCHA@ECE@H;ADBFDCDDCCDD@CCC;33:32:595<9>3<        PG:Z:novoalign  RG:Z:LS148      AS:i:1  UQ:i:1  NM:i:0  MD:Z:101
    After sorting:
    Code:
    HWI-ST  81      chr7    83652142        70      82M     chr8    142160880       0       CTTTGTATTTACAGATACCACGGCCATTTTGCAATGTCCTCAGCACATAGTGGAAGCTGAACAAACAATCACATTTTCTAAT      @D<EA?7)==77@=7)('-'FF;FABB*0>EDB9DFDGDEBDEECC<FHHHBE@9HHEAB<;>FFDBBFA<DFA;A,B48;?   PG:Z:novoalign  RG:Z:LS148      AS:i:22 UQ:i:22 NM:i:1  MD:Z:76A5
    HWI-ST  65      chr9    120922414       70      101M    chr6    160312253       0       TCACTGAGTCTGATTGAAGCAACTGGCATTGGTGATCATACTTCAATATTTCTCTCATATTTGAAGTTAGAATTAGTTGATGTGAGATATTATATTAGCCT   @CCFFFFFHFHFAHHIDGHIJGIIJGHCGIJICFHIIIIIJIJJIIJGIIEIJHHGGIICGHIBGHFGHHGGHIDC@DHGIHGIGHHHHECBDFFFFFEDE        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST  81      chr2    46872242        70      101M    chr17   79461315        0       CATGGATTAAAATATTAAGTAATTTGATCTAGATGATTGTTTACAGTTTAACGCAAATACACTTAGTCTGTTCTGATTATTTACTCAAGGATTATATTACT   >C>:EDDFCDDFFDFFHHHHHHJIHGG=GIGJJIIIIJGIIIHIJHDGGHHJFIIJIIGC:JHHAIIFJJJIHGH@IJJJHHCGB>HGGHGHHFFFFF@C@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST  65      chr8    103315908       70      93M     chr17   40205036        0       AGATATCTGAGAAACTGACCTAAATAAGCAATCTGAAAAGATTAAGGTTCCTTCAATTATTATACTACTTGTTCTCCAAATAACACACTAACT   <@@ADD>DDBA<FG?A43?@FFF:3AEB>DFECE91:C<CFCFCFFC::4?D>FCDDD<FC8DFEFDG88@.==C=4@D;7@:7?CCBDD@>@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:93
    HWI-ST  89      chr16   61016706        70      101M    =       61016706        0       TGTTGAGTCAATGTAAGACCTTGGTAAGAATTCTTCAATTTAGACATGGCTAATTTTTAATGTCAACCACAGCTATTGAGGTACTTATATTAATTAACCTT   C?CECACCFFFFDDDE?=CCGGIIIGGEGIIIGGIIEGIIHHDBFGIGFIIIHGIIIIIGGIHG@CHHHGHHHGHDEIFIIGIHBIIIHHDDHEDEDF@?@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST  97      chr12   16510044        70      101M    chr9    75346048        0       TAATAAAAATTCAGTTTTAACTATAGATGCCTTCTTCTCCTCTTGTGTTTGATTTATTGCTCCAAATGGGCCAACCTGGATGTCTATATTTCTTCCACTAA   CCCFFFFFHHHHHJIIGIIJJIJIIJJJJJJIEIIJJJJJHJIJGFGFHJJJIIJJJJJJJJIGJJJJIIJJJIJHJHHFHHBBEDFFCFEFEEEEDDDDD        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST  73      chr5    22843028        70      97M     =       22843028        0       TAACTGTGTTTACTTTTCTCAGTTTCTACCAGAGAAAAGGCAGGTGCATTTTTTTGGTATGTTTGTGTAAAGTGAATTTGGCTTTACTTTTTCAAAT       =?<DD>=;FHDFFHGE@EFH?EA<B4AA@EBGCC1?91*:8CFG0?@?<D@@B;AFB=7=3?CHEEBE77B@6>;(6;.;;@;?>A>5(5:@CC5@>    PG:Z:novoalign  RG:Z:LS148      AS:i:3  UQ:i:3  NM:i:0  MD:Z:97
    HWI-ST  73      chr6    152150636       70      101M    =       152150636       0       CATTTGTCATCATTACACGGTCATGGGAGTGCTAAGAAGACTTAAATGCAGGGCTACCACCCCTTCCCAATTCATCTTTTATCCATTTTATTTCTCTAAGG   @CCDDDDEHHFHHFBHGGHHAFEFFHIGG:?CFGIGIGGHHEGIEHIGHGDE@;B=FA@F@FGGGEEHECCFFEFFCECDECCCDDDEDDCC@BCC>CCCC        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST  113     chr7    63064316        30      101M    chr17   26080536        0       CCTGCTCATCTCAGGCCTGCCGGCTCCTCCACCTGCCTTTTCGAGTACCCTGGGAACCCCCCGAGGACAGGTGTCATCGGTTGCTTCATCTCACCATCCCT   A94+(:ACCC??@BB@@7DDBDB<2????@8;BDB@A@BCDBCCCA<-DCC>3?8DB=7@@IHCIIJIGIJIIIJGHHGGGGHGEIDIFFFFAFFFDF@@@        PG:Z:novoalign  RG:Z:LS148      AS:i:31 UQ:i:31 NM:i:1  MD:Z:42C58
    HWI-ST  89      chr4    96140737        70      101M    =       96140737        0       AACAACGAGCCTCACTAGGTGACGATTAGCTATGGTTTCCCTGGTCTATACTGGATTTGGGTTCATTGGTAAATCATTCTATTCATAGCAATACAAGATAT   <<A?8DDDDDDCCAEEEFFFFHHHHHFIJJJJIIIIIGIGHIFIGHIIGDGGIJIJJIIHIHIEHIIJJJJJJIIJJJIJJIIJJIJJHGHHHFFFFFB@@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    Does anyone have any idea what is wrong with the programs or the data?

    Thanks a lot!

    Allen

  • #2
    Very strange. Was that a typo in the version of samtools (I have 0.1.18 on my machine), or do you really have an out of date copy?



    • #3
      The original SAM file also looks to have truncated names. Your read names should all end in ":8:[\d]+:[\d]+:[\d]+" (or something like that), where [\d]+ is regex for a number. The SAM file that you posted looks to have 3 reads (according to read name), but 5 reads if you look at the sequences. Is there something screwed up in your original fastq files?
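A quick way to confirm this on the whole file is to test every read name against the expected pattern. The sketch below is only illustrative: it assumes the names follow the seven-field layout seen in the intact records above (for example `HWI-ST621:415:D197AACXX:8:1101:1223:2124`), and the function name `truncated_names` is made up for this example.

```python
import re

# Assumed layout: instrument:run:flowcell:lane:tile:x:y (seven colon-separated fields).
READ_NAME = re.compile(r"^[^:\s]+:\d+:[^:\s]+:\d+:\d+:\d+:\d+$")


def truncated_names(sam_lines):
    """Return the distinct QNAMEs (first SAM column) that do not match READ_NAME."""
    bad = []
    seen = set()
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        qname = line.split("\t", 1)[0]
        if not READ_NAME.match(qname) and qname not in seen:
            seen.add(qname)
            bad.append(qname)
    return bad
```

Passing an open SAM file handle (for example `truncated_names(open("aln.sam"))`, where the file name is a placeholder) would list every name that lost its trailing coordinate fields.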



      • #4
        Originally posted by maubp
        Very strange. Was that a typo in the version of samtools (I have 0.1.18 on my machine), or do you really have an out of date copy?
        You are right, that was a typo. Thanks for spotting it.



        • #5
          Originally posted by dpryan
          The original SAM file also looks to have truncated names. Your read names should all end in ":8:[\d]+:[\d]+:[\d]+" (or something like that), where [\d]+ is regex for a number. The SAM file that you posted looks to have 3 reads (according to read name), but 5 reads if you look at the sequences. Is there something screwed up in your original fastq files?
          Yes, you are right. It seems the read names were mangled by Novoalign. The read names in the original FASTQ file were fine:

          Code:
          @HWI-ST621:415:D197AACXX:7:1101:1179:2146 1:N:0:
          NCAGAATGAGCAATTAGAAATCCTCTGTNNTNNTAGNNNNCTGGAAATTAAACCAAGTGTATAATGCACCTAATGAAGTGTATGGTCTGANGTTTAANTAG
          +
          #1=DDFFFHHHHHJJJJJJJJJJJJJJI##2##1:C####00?DHGIJJJEHIHIEHCHFGIIJJJIGEEHHFEHFFFDDDFEEECDEDC#,5<@@C####
          @HWI-ST621:415:D197AACXX:7:1101:1185:2187 1:N:0:
          TTTGAACATCCCCACTAGGTTCTTTTCCATTGNCAANNNGGAGCATCAGCCAGTGAATCTGTTTCAGGTTTCCATTCTGCAGAACTCCTCCAAAGCATGTG
          +
          CCCFDFFFHHHHHEHIJJJCHHIIJJIIGGIG#1:C###00?DHIJHGIIJJJGHIEHIIIGDHGIJI@DHFH>AEHFFFFFFECCCCEDCDCCDDDCDCC



          • #6
            Hi Allenyu

            Try adding " --hdrhd 4" to your novoalign command in case there is more than 1 byte difference between the read names of a set of paired reads.
            Also note that read1 and read2 should be in order throughout your FASTQ input file. If this is not the case then most aligners will probably not do the right thing.
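Both points can be checked before aligning. The following is a rough sketch, not part of Novoalign: it assumes strict four-line FASTQ records, takes the read name as the header text up to the first whitespace, and reports record positions where the read1 and read2 names differ in more than `max_diff` characters. The helper names (`read_names`, `hamming`, `pairing_report`) are hypothetical.

```python
from itertools import islice, zip_longest


def read_names(fastq_lines):
    """Yield the read name of each record (header text after '@', up to the first whitespace).

    Assumes strict four-line FASTQ records with no wrapped sequence lines.
    """
    for header in islice(fastq_lines, 0, None, 4):
        header = header.strip()
        if header:  # ignore a trailing blank line
            yield header[1:].split()[0]


def hamming(a, b):
    """Count differing positions; any length difference counts as mismatches."""
    return sum(x != y for x, y in zip_longest(a, b))


def pairing_report(r1_lines, r2_lines, max_diff=0):
    """Return (record_index, name1, name2, distance) for pairs whose names differ too much."""
    problems = []
    pairs = zip_longest(read_names(r1_lines), read_names(r2_lines))
    for i, (n1, n2) in enumerate(pairs):
        if n1 is None or n2 is None:
            # One file has more records than the other.
            problems.append((i, n1, n2, None))
            continue
        d = hamming(n1, n2)
        if d > max_diff:
            problems.append((i, n1, n2, d))
    return problems
```

An empty result suggests the two files list mates in the same order with identical names. Many non-empty entries with large distances would point to files that are out of order rather than to a small naming difference between mates.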



            • #7
              Hi Allen,

              Yes, you need to sort your FASTQ input before running Novoalign. No luck, man.


              Originally posted by zee
              Hi Allenyu

              Try adding " --hdrhd 4" to your novoalign command in case there is more than 1 byte difference between the read names of a set of paired reads.
              Also note that read1 and read2 should be in order throughout your FASTQ input file. If this is not the case then most aligners will probably not do the right thing.
              Marco



              • #8
                Thanks! Now trying to use sorted reads first.
