SEQanswers

Old 09-23-2012, 10:05 PM   #1
allenyu
Junior Member
 
Location: Hong Kong

Join Date: Jul 2009
Posts: 9
Default SAM/BAM sort by read names produces truncated read names

Hi,

I tried to sort the alignment file by read name, but the output contains truncated read names. This happened no matter which program I used: SAMtools sort (0.1.8), Picard SortSam (1.77) or Novosort (2.08).

Here are the first few records of the original SAM file:
Code:
HWI-ST621:415:D197AACXX:8:1101:1        113     chr2    236798427       70      100M1S  chr8    3088040 0       ACCTCTGTTTCTAAGCAGTGGAATAGAATTGCTTATGGAATAGCCAGGTCATAGGATGTNATAANTTCCCTGGAAATCAGAGGGGAAAAGAAGCAAAACAN   C@?>?AC@:C@>CECDEE@ACFEBFFDEEHECDACADHFHFEHIJGJIGIHJJIHDB80#HF?1#GDJIHCIGGHHAIIIJJHEHJJIHHHHHFFFDD=1#        PG:Z:novoalign  RG:Z:LS148      AS:i:18 UQ:i:18 NM:i:2  MD:Z:59G4T35
HWI-ST621:415:D197AACXX:8:1101:1        177     chr8    3088040 70      101M    chr2    236798427       0       AAATACATACATACACACAGACTGATTTTCTCTTCAGCAATATTTTAATGAAACCCCATACTGCAAATTACATAAACTAGTTAAAGTACACCAACCTCAAG   DEEDDDFDCEECEEDDBFFFDHHHFGHECJJIHFJJJIJJJIJHHGIHGDDGGJJJIIHGHIJJJIIJIGJJIIIFIIJJJJJJIIHFFAHHHDFFDFCCB        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
HWI-ST621:415:D197AACXX:8:1101:1223:2124        83      chr8    143208201       70      100M1S  =       143207998       -303    CGCTGAGAGCAAGGTGCCAGCAGGGTGGGCCCTTCTGGAGGCTCCGGCCGGGATCTGTTCCAGGCCACCCCCGCCTTCCGGCCATCCTCAGCTTGGCTCCN   >@CA>A:A>>>3(CA<AACDDDB<<?3?@9?CDCDCBCC?7<BBDBB@<93?DCCAA8<B?A<<DB7DCIGGBHGAHIIHFJJIEJIIHHHHHFFFDD=1#        PG:Z:novoalign  RG:Z:LS148      AS:i:47 UQ:i:47 NM:i:1  MD:Z:6C93       PQ:i:59 SM:i:70 AM:i:70
HWI-ST621:415:D197AACXX:8:1101:1223:2124        163     chr8    143207998       70      92M     =       143208201       303     TTGTGGAGTCAGGTGTCCCTGGGGTCACGGTGACTGGCCAGGCGNGGGGAGCCAGGAGGCACACGGTCCTGGGCTCTNGCAGGGCTGGAGTG    @BBDFFADD?FHH@@EGGGGIIII@BCGHG8?DGHGB@FHHGAG#-<CC;@E?ACEE?B7?BCA?B;?BDDCB9??A#++28?B?B@B1<>A PG:Z:novoalign  RG:Z:LS148      AS:i:12 UQ:i:12 NM:i:2  MD:Z:44C32G14   PQ:i:59 SM:i:70 AM:i:70
HWI-ST621:415:D197AACXX:8:1101:14       65      chr6    74783346        70      1S100M  chr1    1867309 0       NGATTAAGCAGCCAAGCTGTATCCTGAGGGAAACATGGGCAATGGAAAGCATCAGATTTCCTGGGTCAAAGCTATCCTGAGCTCAGGCACTGGGCTAACTG   #4=DFFFFGHHHHJJJJJJIJJJJJJJJJJGHIJIHIIJIGIIJJBFHIIIJJJJDIJJIHHIJJIGGHHHHHFFFFFFEDEEEEDDD@DDDDDDCDCDDD        PG:Z:novoalign  RG:Z:LS148      AS:i:6  UQ:i:6  NM:i:0  MD:Z:100
HWI-ST621:415:D197AACXX:8:1101:14       129     chr1    1867309 70      101M    chr6    74783346        0       ACACACACACACACACACGAACTGCAGGGGGCTCTGGAGCCATGGAGTTAGAAAAGCTCTCTGAGAGGCCAGGTGTAGTGGCTCATGCCTGTAATCCCAGC   CCCFDFFFHHHHGJJJIJJIJJJJJFHIJIJFHIJJJDHEHHHHG@D?BDACCEDCBDDDDDDCDDDDBDBDB@CCCCCCBDDCCC@ACAC@>AB>CCACD        PG:Z:novoalign  RG:Z:LS148      AS:i:30 UQ:i:30 NM:i:1  MD:Z:68T32
HWI-ST621:415:D197AACXX:8:1101:14       97      chr2    62756955        70      1S100M  chr6    74783591        0       NGTGCTGTTTGGTTTGTGTGTATTATATGGGTTTGGATTACAATAATTCCTCCCTTTTGTATAATGTTTTGCAGTTTTTAAAGCACTTCATGCTCTAAATC   #1=DDFFDHHGGFHIIHHEHGFGIDHHIIIIFGIIICGGEHHHIIIII>GGGIIIIIIIICFGHHGGHIIIIDAAEHHHEBDDFCEEECCDCCCCCC>ACC        PG:Z:novoalign  RG:Z:LS148      AS:i:6  UQ:i:6  NM:i:0  MD:Z:100
HWI-ST621:415:D197AACXX:8:1101:14       145     chr6    74783591        70      101M    chr2    62756955        0       ATTTTTGTAAGTCACCAATGGTTGGATGTTGGCAGTTTCATAAGGTTCATTCTAATAGTTCCTGGGACACAAATGACTCGAAGTAGGTCAAGACAGGTTCA   <DDDDDDDDDDDDEEECCFDFFGHEHGJJIIJJJJIGHIHCIIIJIGCGIIGDIHEIIHGIJJJJIHIIJJIIHGBHHJIJJJJJJJJHHHFHFFFFD?C@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
HWI-ST621:415:D197AACXX:8:1101:1        81      chr1    155944063       70      101M    chr11   19838477        0       CAGCTGTACCTGGCAGCAGCCCCTTCCCCAAGATGGTGACACCTCTGTCCACACCCTCTGTAATAGTGACCGGAGAGCCTGTGGAGCATTCCACCAGGATT   DDDEDAA:BCAA:DD@BDDDDB?@=BDEDEEDFFFD@;??=HHIIIIGJIHF<JIHFGBIHIJIIIIIHJJJJJJJIJJJJIJJJJJIHHHHHFFFFFCC@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
HWI-ST621:415:D197AACXX:8:1101:1        161     chr11   19838477        70      101M    chr1    155944063       0       AGCCCCTTATGCAGAAAAAGGGACTCCACCTGGAGCCCTCTCTGGATCTACTTCTCCCAGATAAATCAGTCGGCTGTGTAATCTTTCAGGAAACCTGACCC   ??<DDFFFFHHDDDHIGDDAFE9FFGHGCHEGG9FGGHGGGGCFHBF*0BBCBGGE@GHGCHA@ECE@H;ADBFDCDDCCDD@CCC;33:32:595<9>3<        PG:Z:novoalign  RG:Z:LS148      AS:i:1  UQ:i:1  NM:i:0  MD:Z:101
After sorting:
Code:
HWI-ST  81      chr7    83652142        70      82M     chr8    142160880       0       CTTTGTATTTACAGATACCACGGCCATTTTGCAATGTCCTCAGCACATAGTGGAAGCTGAACAAACAATCACATTTTCTAAT      @D<EA?7)==77@=7)('-'FF;FABB*0>EDB9DFDGDEBDEECC<FHHHBE@9HHEAB<;>FFDBBFA<DFA;A,B48;?   PG:Z:novoalign  RG:Z:LS148      AS:i:22 UQ:i:22 NM:i:1  MD:Z:76A5
HWI-ST  65      chr9    120922414       70      101M    chr6    160312253       0       TCACTGAGTCTGATTGAAGCAACTGGCATTGGTGATCATACTTCAATATTTCTCTCATATTTGAAGTTAGAATTAGTTGATGTGAGATATTATATTAGCCT   @CCFFFFFHFHFAHHIDGHIJGIIJGHCGIJICFHIIIIIJIJJIIJGIIEIJHHGGIICGHIBGHFGHHGGHIDC@DHGIHGIGHHHHECBDFFFFFEDE        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
HWI-ST  81      chr2    46872242        70      101M    chr17   79461315        0       CATGGATTAAAATATTAAGTAATTTGATCTAGATGATTGTTTACAGTTTAACGCAAATACACTTAGTCTGTTCTGATTATTTACTCAAGGATTATATTACT   >C>:EDDFCDDFFDFFHHHHHHJIHGG=GIGJJIIIIJGIIIHIJHDGGHHJFIIJIIGC:JHHAIIFJJJIHGH@IJJJHHCGB>HGGHGHHFFFFF@C@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
HWI-ST  65      chr8    103315908       70      93M     chr17   40205036        0       AGATATCTGAGAAACTGACCTAAATAAGCAATCTGAAAAGATTAAGGTTCCTTCAATTATTATACTACTTGTTCTCCAAATAACACACTAACT   <@@ADD>DDBA<FG?A43?@FFF:3AEB>DFECE91:C<CFCFCFFC::4?D>FCDDD<FC8DFEFDG88@.==C=4@D;7@:7?CCBDD@>@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:93
HWI-ST  89      chr16   61016706        70      101M    =       61016706        0       TGTTGAGTCAATGTAAGACCTTGGTAAGAATTCTTCAATTTAGACATGGCTAATTTTTAATGTCAACCACAGCTATTGAGGTACTTATATTAATTAACCTT   C?CECACCFFFFDDDE?=CCGGIIIGGEGIIIGGIIEGIIHHDBFGIGFIIIHGIIIIIGGIHG@CHHHGHHHGHDEIFIIGIHBIIIHHDDHEDEDF@?@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
HWI-ST  97      chr12   16510044        70      101M    chr9    75346048        0       TAATAAAAATTCAGTTTTAACTATAGATGCCTTCTTCTCCTCTTGTGTTTGATTTATTGCTCCAAATGGGCCAACCTGGATGTCTATATTTCTTCCACTAA   CCCFFFFFHHHHHJIIGIIJJIJIIJJJJJJIEIIJJJJJHJIJGFGFHJJJIIJJJJJJJJIGJJJJIIJJJIJHJHHFHHBBEDFFCFEFEEEEDDDDD        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
HWI-ST  73      chr5    22843028        70      97M     =       22843028        0       TAACTGTGTTTACTTTTCTCAGTTTCTACCAGAGAAAAGGCAGGTGCATTTTTTTGGTATGTTTGTGTAAAGTGAATTTGGCTTTACTTTTTCAAAT       =?<DD>=;FHDFFHGE@EFH?EA<B4AA@EBGCC1?91*:8CFG0?@?<D@@B;AFB=7=3?CHEEBE77B@6>;(6;.;;@;?>A>5(5:@CC5@>    PG:Z:novoalign  RG:Z:LS148      AS:i:3  UQ:i:3  NM:i:0  MD:Z:97
HWI-ST  73      chr6    152150636       70      101M    =       152150636       0       CATTTGTCATCATTACACGGTCATGGGAGTGCTAAGAAGACTTAAATGCAGGGCTACCACCCCTTCCCAATTCATCTTTTATCCATTTTATTTCTCTAAGG   @CCDDDDEHHFHHFBHGGHHAFEFFHIGG:?CFGIGIGGHHEGIEHIGHGDE@;B=FA@F@FGGGEEHECCFFEFFCECDECCCDDDEDDCC@BCC>CCCC        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
HWI-ST  113     chr7    63064316        30      101M    chr17   26080536        0       CCTGCTCATCTCAGGCCTGCCGGCTCCTCCACCTGCCTTTTCGAGTACCCTGGGAACCCCCCGAGGACAGGTGTCATCGGTTGCTTCATCTCACCATCCCT   A94+(:ACCC??@BB@@7DDBDB<2????@8;BDB@A@BCDBCCCA<-DCC>3?8DB=7@@IHCIIJIGIJIIIJGHHGGGGHGEIDIFFFFAFFFDF@@@        PG:Z:novoalign  RG:Z:LS148      AS:i:31 UQ:i:31 NM:i:1  MD:Z:42C58
HWI-ST  89      chr4    96140737        70      101M    =       96140737        0       AACAACGAGCCTCACTAGGTGACGATTAGCTATGGTTTCCCTGGTCTATACTGGATTTGGGTTCATTGGTAAATCATTCTATTCATAGCAATACAAGATAT   <<A?8DDDDDDCCAEEEFFFFHHHHHFIJJJJIIIIIGIGHIFIGHIIGDGGIJIJJIIHIHIEHIIJJJJJJIIJJJIJJIIJJIJJHGHHHFFFFFB@@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
Does anyone have any idea what's wrong with the programs or the data?

Thanks a lot!

Allen
Old 09-24-2012, 01:02 AM   #2
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541
Default

Very strange. Was that a typo in the version of samtools (I have 0.1.18 on my machine), or do you really have an out of date copy?
Old 09-24-2012, 02:38 AM   #3
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

The original SAM file also looks to have truncated names. Your read names should all end in ":8:[\d]+:[\d]+:[\d]+" (or something like that), where [\d]+ is regex for a number. The SAM file that you posted looks to have 3 reads (according to read name), but 5 reads if you look at the sequences. Is there something screwed up in your original fastq files?
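A quick way to check a SAM file for names like this is sketched below in Python. It is only a sketch: the pattern assumes Illumina-style names with seven colon-separated fields (instrument:run:flowcell:lane:tile:x:y), as in the intact records above, and the names `FULL_NAME` and `truncated_qnames` are made up for illustration.

```python
import re

# Assumption: a complete read name has seven colon-separated fields,
# instrument:run:flowcell:lane:tile:x:y, with numeric run, lane, tile, x and y.
FULL_NAME = re.compile(r"^[^:\s]+:\d+:[^:\s]+:\d+:\d+:\d+:\d+$")


def truncated_qnames(sam_lines):
    """Return the QNAME of every alignment line that does not look complete."""
    bad = []
    for line in sam_lines:
        # Skip blank lines and SAM header lines (which start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        # QNAME is the first tab-separated column of an alignment record.
        qname = line.split("\t", 1)[0]
        if not FULL_NAME.match(qname):
            bad.append(qname)
    return bad
```

Since an open file can be iterated line by line, something like `truncated_qnames(open("aln.sam"))` would work on a real file (the file name here is hypothetical).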
Old 09-24-2012, 03:26 AM   #4
allenyu
Junior Member
 
Location: Hong Kong

Join Date: Jul 2009
Posts: 9
Default

Quote:
Originally Posted by maubp View Post
Very strange. Was that a typo in the version of samtools (I have 0.1.18 on my machine), or do you really have an out of date copy?
You are right, that was a typo. Thanks for spotting that.
Old 09-24-2012, 03:29 AM   #5
allenyu
Junior Member
 
Location: Hong Kong

Join Date: Jul 2009
Posts: 9
Default

Quote:
Originally Posted by dpryan View Post
The original SAM file also looks to have truncated names. Your read names should all end in ":8:[\d]+:[\d]+:[\d]+" (or something like that), where [\d]+ is regex for a number. The SAM file that you posted looks to have 3 reads (according to read name), but 5 reads if you look at the sequences. Is there something screwed up in your original fastq files?
Yes, you are right. It seems the read titles were screwed up by novoalign. The original read titles were fine:

Code:
@HWI-ST621:415:D197AACXX:7:1101:1179:2146 1:N:0:
NCAGAATGAGCAATTAGAAATCCTCTGTNNTNNTAGNNNNCTGGAAATTAAACCAAGTGTATAATGCACCTAATGAAGTGTATGGTCTGANGTTTAANTAG
+
#1=DDFFFHHHHHJJJJJJJJJJJJJJI##2##1:C####00?DHGIJJJEHIHIEHCHFGIIJJJIGEEHHFEHFFFDDDFEEECDEDC#,5<@@C####
@HWI-ST621:415:D197AACXX:7:1101:1185:2187 1:N:0:
TTTGAACATCCCCACTAGGTTCTTTTCCATTGNCAANNNGGAGCATCAGCCAGTGAATCTGTTTCAGGTTTCCATTCTGCAGAACTCCTCCAAAGCATGTG
+
CCCFDFFFHHHHHEHIJJJCHHIIJJIIGGIG#1:C###00?DHIJHGIIJJJGHIEHIIIGDHGIJI@DHFH>AEHFFFFFFECCCCEDCDCCDDDCDCC
Old 09-24-2012, 06:43 AM   #6
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249
Default

Hi Allenyu

Try adding " --hdrhd 4" to your novoalign command in case there is more than 1 byte difference between the read names of a set of paired reads.
Also note that read1 and read2 should be in order throughout your FASTQ input file. If this is not the case then most aligners will probably not do the right thing.
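Whether read1 and read2 are in the same order can be checked before aligning. The Python sketch below compares the header lines of two FASTQ inputs record by record and reports pairs whose headers differ in more than a given number of positions. It assumes plain four-line FASTQ records (no wrapped sequences), and the function names and the `max_distance` parameter are illustrative only; this is not how novoalign itself performs its header check.

```python
from itertools import zip_longest


def hamming(a, b):
    """Count differing positions; any length difference counts as mismatches."""
    return sum(x != y for x, y in zip_longest(a, b))


def fastq_headers(lines):
    """Yield the header line of each record, assuming four lines per record."""
    for i, line in enumerate(lines):
        if i % 4 == 0:
            yield line.rstrip("\n")


def mismatched_pairs(r1_lines, r2_lines, max_distance=1):
    """Return (record index, header1, header2) for pairs that differ too much."""
    bad = []
    pairs = zip_longest(fastq_headers(r1_lines), fastq_headers(r2_lines), fillvalue="")
    for index, (h1, h2) in enumerate(pairs):
        if hamming(h1, h2) > max_distance:
            bad.append((index, h1, h2))
    return bad
```

An empty result suggests the two inputs are in step. A long list starting at some record index usually means one file is shifted or sorted differently from that point on.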
Old 09-24-2012, 07:19 AM   #7
marcowanger
Senior Member
 
Location: Hong Kong

Join Date: Dec 2008
Posts: 350
Default

Hi Allen,

Yes, you need to sort your FASTQ input before running Novoalign. No luck, man.


Quote:
Originally Posted by zee View Post
Hi Allenyu

Try adding " --hdrhd 4" to your novoalign command in case there is more than 1 byte difference between the read names of a set of paired reads.
Also note that read1 and read2 should be in order throughout your FASTQ input file. If this is not the case then most aligners will probably not do the right thing.
__________________
Marco
Old 09-24-2012, 10:46 PM   #8
allenyu
Junior Member
 
Location: Hong Kong

Join Date: Jul 2009
Posts: 9
Default

Thanks! Now trying to use sorted reads first.
Tags
picard, read name, samtools, sort, sortsam
