SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
How to identify incorrectly fused gene models Hobbe Bioinformatics 1 08-01-2013 10:36 AM
samtools picard SamFormatConverter Bio.X2Y Bioinformatics 3 07-08-2013 07:46 AM
How to get a numeric vector of POS? yuchioj Bioinformatics 0 10-10-2012 01:33 PM
PubMed: rNA: a fast and accurate short reads numerical aligner. Newsbot! Literature Watch 0 08-28-2012 02:00 AM
cufflinks probem: start pos of exons cbil Bioinformatics 2 03-25-2011 01:20 PM

Reply
 
Thread Tools
Old 12-18-2012, 12:24 PM   #1
efoss
Member
 
Location: Seattle

Join Date: Jul 2011
Posts: 98
Default Picard's SamFormatConverter incorrectly complains about a non-numerical pos column

In trying to convert a sam file to a bam file using Picard's SamFormatConverter, I get the following error:

Code:
[Tue Dec 18 12:49:04 PST 2012] net.sf.picard.sam.SamFormatConverter INPUT=AML_PAERAH_73407_RNA.sam OUTPUT=AML_PAERAH_73407_RNA.bam VALIDATION_STRINGENCY=LENIENT    VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Tue Dec 18 12:49:04 PST 2012] Executing as efoss@gizmod60 on Linux 3.2.0-24-generic amd64; OpenJDK 64-Bit Server VM 1.6.0_24-b24
Ignoring SAM validation error: ERROR: Read name HWI-ST156_294:2:46:13125:2318:0, CIGAR M operator maps off end of reference
Ignoring SAM validation error: ERROR: Read name HWI-ST156_294:2:5:11648:3249:0, CIGAR M operator maps off end of reference
[Tue Dec 18 12:49:13 PST 2012] net.sf.picard.sam.SamFormatConverter done. Elapsed time: 0.14 minutes.
Runtime.totalMemory()=696385536
Exception in thread "main" net.sf.samtools.SAMFormatException: Error parsing text SAM file. Non-numeric value in POS column; Line 539389
Line: HWI-ST156_294:2:21:6547:4182:0    99      MT      4294967264      39      51M     =       93      -176    CCACACGTTCCCCTTAAATAAGACATCACGATGGATCACAGGTCTATCACC      HHHHHHHHHHHHHHHHHFHHHHEHHGHHGHGEEFGHHDHHBGDFGDFGFFH     MD:Z:51 NH:i:1  HI:i:1  NM:i:0  SM:i:39 XQ:i:40 X2:i:0
The pos column is the fourth column, and you can see that it contains 4294967264. So why is this counted as not numeric?

Thanks.

Eric
efoss is offline   Reply With Quote
Old 12-18-2012, 12:54 PM   #2
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Yeah, it reads 4294967264. That's 4,294,967,264, or 4 billion.

Did you really submit a chromosome that large?

I think your real error is farther up the output:

Quote:
Ignoring SAM validation error: ERROR: Read name HWI-ST156_294:2:46:13125:2318:0, CIGAR M operator maps off end of reference
swbarnes2 is offline   Reply With Quote
Old 12-18-2012, 01:01 PM   #3
efoss
Member
 
Location: Seattle

Join Date: Jul 2011
Posts: 98
Default

Hi swbarnes2,

Thanks very much for the suggestion. No, I didn't submit a chromosome that long. I aligned with gsnap, and the contents of the .chromosome file created by gsnap are pasted below. Do you have any thoughts on how this can happen?

Thanks again.

Eric

Code:
1       1..249250621    249250621
2       249250622..492449994    243199373
3       492449995..690472424    198022430
4       690472425..881626700    191154276
5       881626701..1062541960   180915260
6       1062541961..1233657027  171115067
7       1233657028..1392795690  159138663
8       1392795691..1539159712  146364022
9       1539159713..1680373143  141213431
10      1680373144..1815907890  135534747
11      1815907891..1950914406  135006516
12      1950914407..2084766301  133851895
13      2084766302..2199936179  115169878
14      2199936180..2307285719  107349540
15      2307285720..2409817111  102531392
16      2409817112..2500171864  90354753
17      2500171865..2581367074  81195210
18      2581367075..2659444322  78077248
19      2659444323..2718573305  59128983
20      2718573306..2781598825  63025520
21      2781598826..2829728720  48129895
22      2829728721..2881033286  51304566
X       2881033287..3036303846  155270560
Y       3036303847..3095677412  59373566
MT      3095677413..3095693981  16569
GL000191.1      3095693982..3095800414  106433
GL000192.1      3095800415..3096347910  547496
GL000193.1      3096347911..3096537699  189789
GL000194.1      3096537700..3096729168  191469
GL000195.1      3096729169..3096912064  182896
GL000196.1      3096912065..3096950978  38914
GL000197.1      3096950979..3096988153  37175
GL000198.1      3096988154..3097078238  90085
GL000199.1      3097078239..3097248112  169874
GL000200.1      3097248113..3097435147  187035
GL000201.1      3097435148..3097471295  36148
GL000202.1      3097471296..3097511398  40103
GL000203.1      3097511399..3097548896  37498
GL000204.1      3097548897..3097630206  81310
GL000205.1      3097630207..3097804794  174588
GL000206.1      3097804795..3097845795  41001
GL000207.1      3097845796..3097850057  4262
GL000208.1      3097850058..3097942746  92689
GL000209.1      3097942747..3098101915  159169
GL000210.1      3098101916..3098129597  27682
GL000211.1      3098129598..3098296163  166566
GL000212.1      3098296164..3098483021  186858
GL000213.1      3098483022..3098647260  164239
GL000214.1      3098647261..3098784978  137718
GL000215.1      3098784979..3098957523  172545
GL000216.1      3098957524..3099129817  172294
GL000217.1      3099129818..3099301966  172149
GL000218.1      3099301967..3099463113  161147
GL000219.1      3099463114..3099642311  179198
GL000220.1      3099642312..3099804113  161802
GL000221.1      3099804114..3099959510  155397
GL000222.1      3099959511..3100146371  186861
GL000223.1      3100146372..3100326826  180455
GL000224.1      3100326827..3100506519  179693
GL000225.1      3100506520..3100717692  211173
GL000226.1      3100717693..3100732700  15008
GL000227.1      3100732701..3100861074  128374
GL000228.1      3100861075..3100990194  129120
GL000229.1      3100990195..3101010107  19913
GL000230.1      3101010108..3101053798  43691
GL000231.1      3101053799..3101081184  27386
GL000232.1      3101081185..3101121836  40652
GL000233.1      3101121837..3101167777  45941
GL000234.1      3101167778..3101208308  40531
GL000235.1      3101208309..3101242782  34474
GL000236.1      3101242783..3101284716  41934
GL000237.1      3101284717..3101330583  45867
GL000238.1      3101330584..3101370522  39939
GL000239.1      3101370523..3101404346  33824
GL000240.1      3101404347..3101446279  41933
GL000241.1      3101446280..3101488431  42152
GL000242.1      3101488432..3101531954  43523
GL000243.1      3101531955..3101575295  43341
GL000244.1      3101575296..3101615224  39929
GL000245.1      3101615225..3101651875  36651
GL000246.1      3101651876..3101690029  38154
GL000247.1      3101690030..3101726451  36422
GL000248.1      3101726452..3101766237  39786
GL000249.1      3101766238..3101804739  38502
efoss is offline   Reply With Quote
Old 12-18-2012, 01:12 PM   #4
efoss
Member
 
Location: Seattle

Join Date: Jul 2011
Posts: 98
Default

For what it's worth, here are the two lines that match that read name:

Code:
HWI-ST156_294:2:46:13125:2318:0 99      MT      16336   39      51M     =       16539   254     GCACATTACAGTCAAATCCCTTCTCGTCCCCATGGATGACCCCCCTCAGAT      HHHHHGHHHHHHHHHHHH.HGGGGGHHHHHHHHHGHFHHFHHDHHDEHDFE     MD:Z:51 NH:i:1  HI:i:1  NM:i:0  SM:i:39 XQ:i:40 X2:i:0
HWI-ST156_294:2:46:13125:2318:0 147     MT      16539   39      35M16S  =       16336   -254    ACACGTTCCCCTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCT      CHE8EEFFEGGEDHHHFGGGBHHHEFHHHHHFHHGHHHFHGHHHHEEGGHG     MD:Z:35 NH:i:1  HI:i:1  NM:i:0  SM:i:39 XQ:i:40 X2:i:0
efoss is offline   Reply With Quote
Old 12-18-2012, 02:08 PM   #5
efoss
Member
 
Location: Seattle

Join Date: Jul 2011
Posts: 98
Default

To follow up on this: The 35M16S CIGAR tells me that the read is indeed hanging off the end. (My understanding of hard clipping versus soft clipping is that in both instances, some bases don't match the reference genome, but in hard clipping you remove those bases from the read whereas with soft clipping, you note that the bases don't match but leave them in the read.) So I can write a script to remove these offending lines, but that seems like kind of a clumsy approach.

Eric
efoss is offline   Reply With Quote
Old 12-18-2012, 03:16 PM   #6
aaronh
Member
 
Location: California

Join Date: Sep 2008
Posts: 45
Default

That number is also the max unsigned integer - 1 so this smells like a Picard bug.
aaronh is offline   Reply With Quote
Old 12-18-2012, 03:32 PM   #7
efoss
Member
 
Location: Seattle

Join Date: Jul 2011
Posts: 98
Default

Quote:
Originally Posted by aaronh View Post
That number is also the max unsigned integer - 1 so this smells like a Picard bug.
Hi aaronh,

Yes, I think it is a Picard bug as well. I can create bam files from these sam files using samtools' view function just fine.

Eric
efoss is offline   Reply With Quote
Old 12-19-2012, 07:03 PM   #8
efoss
Member
 
Location: Seattle

Join Date: Jul 2011
Posts: 98
Default

I found out what the problem is with the 4.3 billion POS column: There is a bug in the version of gsnap that I was using that can lead to this type of thing with circular chromosomes. (The chromosome in question was the mitochondrial chromosome.) This bug has been fixed in the current version of gsnap, which was released just a few days ago.

Eric
efoss is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:39 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO