SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   GSNAP and SAM header (http://seqanswers.com/forums/showthread.php?t=47520)

JadeB 10-17-2014 01:43 AM

GSNAP and SAM header
 
Dear all,

I'm currently trying to align reads on a reference sequence with gsnap with this code :

Quote:

gsnap --gunzip -D ./ref/cristata -d cristata ./data/cristata/Pig-GC01_AGTTCC_L005_R1.fastq.gz ./data/cristata/Pig-GC01_AGTTCC_L005_R2.fastq.gz -t 8 -B 4 -A sam --nofails --clip-overlap --split-output GC01
I'm working with very short reads, it's why i'm using the clip-overlap command.

Work is finishing without error message, but my sam header is not complete, so I can't use my results files.

For example :

Quote:

head ./GC01.concordant_uniq

@SQ SN:GC01-(1934) LN:17090
HWI-ST314:258:D1WBYACXX:5:1101:3036:2447 99 GC01-(1934) 11568 40 58M42H = 11626 116 GACCCTTCCTACCCACATCACGTCCATCCAAAATTCAACCACACGAGAGCACCTCCTT CCCFFFFFHHHHHJJJJJJIJJHIIIGJJJJJIJJJJJJJIJJJGJJJJJJJJJJJIJ MD:Z:58 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0
[...]
I only have the @SQ line, and no @HD line in the header.

Do you see something I missed ?

Thanks a lot in advance for your help !!

Jade

dpryan 10-17-2014 02:00 AM

You can just add it when converting to BAM. Just make a file with the @HD line and then
Code:

cat HD.txt file.sam | samtools view -Sbo file.bam -

WhatsOEver 10-17-2014 02:02 AM

The @HD line is just the header and not mandatory for any downstream applications (as far as I know). BWA is also not creating it. If you need it, just add it like dpryan suggested, if not, just leave it.

JadeB 10-17-2014 02:09 AM

Ok. Thanks a lot for your answers.

If I don't need it, it's ok. But I thought my next error message was due to this lack.

In fact, when I do this :

Quote:

for file in ./data/cristata/GC01.*.sam
do
samtools view -Sb $file > $file.bam
done
or this :

Quote:

for file in ./data/cristata/GC01.*.sam
do
samtools view -Sb -T ./ref/cristata/cristata.fasta $file > $file.bam
done
My job fails, with such a message :

Quote:

[samopen] SAM header is present: 1 sequences.
[sam_read1] reference 'SN:GC01-(1934) LN:17090
' is recognized as '*'.
[main_samview] truncated file.
[...]
But it doesn't seem to be truncated :

Quote:

tail -n 3 GC01.concordant_uniq

HWI-ST314:258:D1WBYACXX:5:2316:14327:100908 163 GC01-(1934) 3724 40 52M48H = 3776 105 ACCAACTTATACACCTTCTCTGAAAGAGCTTCTTGCCACTAACACTGGCACT @C@FFFFFHGHHHJJGIIIIJJGHIJJJIJJJIJGIJIFHIICGIIIII)?D MD:Z:49C2 NH:i:1 HI:i:1 NM:i:1 SM:i:40 XQ:i:40 X2:i:0
HWI-ST314:258:D1WBYACXX:5:2316:17260:100910 83 GC01-(1934) 2574 40 44H56M = 2518 -112 GTTCGTTTGTTCAACGATTAATAGTCCTACGTGATCTGAGTTCAGACCGGAGTAAT IJJGJJJJJJJJJIHGDFDIIIJJJJJJIIHHHIGJIJIJIGIFHHGHFDFFFCCC MD:Z:56 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0
HWI-ST314:258:D1WBYACXX:5:2316:17260:100910 163 GC01-(1934) 2518 40 56M44H = 2574 112 GGTTTACGACCTCGATGTTGGATCAGGACATCCTAATGGTGCAGCAGCTGTTAAGG @CBFFFFFHHHHHJGGIHJJJEHHIIJGIIIJJJIGJJIDGGHIGGHII3=EFGHI MD:Z:49A6 NH:i:1 HI:i:1 NM:i:1 SM:i:40 XQ:i:40 X2:i:0
Any ideas ?

WhatsOEver 10-17-2014 02:15 AM

What samtools version are you using?
Was there something like "missing EOF marker" in the "error message"?

This is a quite common bug in version 0.1.19 (I think it has been resolved in the newer ones?!). It's actually just a warning and means nothing if you know that your input file was not truncated

JadeB 10-17-2014 02:23 AM

I'm actually using the 0.1.19-44428cd one.

But, I don't have something like "missing EOF marker", only lines as above (for files which are not empty), and "Your job has been killed."

I have bam files finally, but I'm not sure I can use them...

WhatsOEver 10-17-2014 03:02 AM

Ok, maybe the following helps...

1) Is there a particular reason for using the "-T" parameter. I never use it and I actually don't what it is useful for. Although it don't think it will change anything, you could try the conversion without that.

2) Your "GC01.concordant_uniq" file is a sam file created from one of the bam files using samtools view -h file.sam > GC01.concordant_uniq, isn't it? A more accurate comparison would be to use "diff file1 file2". If the output is empty, your files are equal.

3) If (1) doesn't help and (2) doesn't give you enough confidence, upgrade samtools to version 1.1 and check if the warning remains.

dpryan 10-17-2014 03:12 AM

@WhatsOEver: -T is useful when a file is completely missing a header. Otherwise, it doesn't do much.

@JadeB: The "[sam_read1] reference 'SN:GC01-(1934) LN:17090" means that samtools is seeing part of your header as a read. This is pretty unusual, but you're correct that this will only happen if it's not parsing the header correctly (internally, sam_hdr_read() moves the file pointer to just past the header). We're up to samtools 1.1 now, so go ahead and upgrade in any case, though I suspect there's something wrong with the header (maybe spaces rather than tabs) that will also prevent the newer versions from working. If 1.1 also doesn't work, post the first 100 or so lines (just "head -n 100 file.sam > file.txt") or so as a text file and I can have a look.

JadeB 10-17-2014 03:20 AM

Quote:

Originally Posted by WhatsOEver (Post 152307)
2) Your "GC01.concordant_uniq" file is a sam file created from one of the bam files using samtools view -h file.sam > GC01.concordant_uniq, isn't it? A more accurate comparison would be to use "diff file1 file2". If the output is empty, your files are equal.

No, my sam files are gsnap output, not made with samtools view, then I'm trying to have bam files...

I'm currently trying an other thing : error could maybe be linked with different names between my fasta file and inside my fasta file, in the header. I homogenize everything, and if it doesn't change anything, I'll ask for an update of samtools version to the bioinformatics server I use...

Thanks again !


All times are GMT -8. The time now is 10:59 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.