SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
interpreting the CIGAR in the SAM format efoss Bioinformatics 2 10-29-2011 10:04 AM
bowtie - invalid CIGAR string - wrong sam format genome Bioinformatics 2 02-16-2011 01:36 PM
samtools pileup format foxyg Bioinformatics 3 11-16-2010 07:25 AM
samtools view error: CIGAR and sequence length are inconsistent (tophat/bowtie) glacierbird Bioinformatics 2 06-29-2010 01:58 AM
samtools output format bair Bioinformatics 4 01-28-2010 05:56 AM

Reply
 
Thread Tools
Old 04-28-2010, 06:01 AM   #1
Bruins
Member
 
Location: Groningen

Join Date: Feb 2010
Posts: 78
Default CIGAR format in SAMtools

Hi all,

I am using the SAMtools pileup format (created using 'samtools pileup -c') for SNP calling. I need to extract the frequency of each variation from the read base column of the pileup. However, there is one part of this column that I do not understand:

Quote:
... Also at the read base column, a symbol ^ marks the start of a read segment which is a contiguous subsequence on the read separated by N/S/H CIGAR operations. The ASCII of the character following ^ minus 33 gives the mapping quality. A symbol $ marks the end of a read segment.
Would someone care to explain this in different words? I find in my pileup many examples of '^~.' (at first glance at loci with a coverage of a few reads) and '$', how do I interpret this?

Also, I have been trying to find an explanation of the CIGAR format. So far, my search has led me to Exonerate and the remark that apparently different versions of CIGAR exist. I was unable to find any other operations than M, D and I, let alone N, S or H. Is there some documentation on the CIGAR format, other than the exonerate man pages?

Thank you,
Wil

ps thanks for all the hard work on SAMtools and Picard!
Bruins is offline   Reply With Quote
Old 04-28-2010, 07:33 AM   #2
Justin Brown
Junior Member
 
Location: St. Louis

Join Date: Apr 2009
Posts: 1
Default

Hi Wil,

If you haven't already, I suggest looking at the SAM/BAM format specification, section 1.4.3

http://samtools.sourceforge.net/SAM1.pdf

Also, the SAMtools paper gives an explanation of the cigar and extended cigar (section 2.1.2)

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2723002/

The standard cigar has three operations:
M : match or mismatch
I : insertion
D: deletion

The extended cigar adds
N : skipped bases on reference
S : soft clipping
H : hard clipping
P : padding

Hope that helps,
Justin
Justin Brown is offline   Reply With Quote
Old 04-28-2010, 11:12 PM   #3
Bruins
Member
 
Location: Groningen

Join Date: Feb 2010
Posts: 78
Default

Hi Justin,

This is very helpful, thanks a lot!

Wil

***edit

I understand the CIGAR operations now. But I still do not know how to interpret read base clumns that contain '^~.' or '$' - what is a read segment in this context? (sorry if it is a silly question...)

Some example lines from my unfiltered pileup:
Code:
gnl|Nvit1.0|Chr1        1       T       T       30      0       60      1       ^~.     b
gnl|Nvit1.0|Chr1        51      C       C       147     0       60      40      ......$..................................       bbabaabbbbabbbbbbbbbbbbbbbbbbb_bbbbbbbba
gnl|Nvit1.0|Chr1        104     A       R       82      82      60      12      G$G$G$,,,,,,,^~,^~,     Bbbc`bbb\ab`
How do I interpret these?

Last edited by Bruins; 04-29-2010 at 01:23 AM.
Bruins is offline   Reply With Quote
Old 05-17-2010, 03:30 PM   #4
kjngo
Junior Member
 
Location: Davis, California

Join Date: Dec 2009
Posts: 6
Default

Hi Bruins,
I don't know if anyone answered your question about the ^* and the $ in the pileup file.
^* symbolizes that a read begins to align at this position and the character after the ^ is the mapping quality of the read. $ marks the position where a read ends on the alignment.

You can reference this at : http://samtools.sourceforge.net/pileup.shtml
Hope this helps
kjngo is offline   Reply With Quote
Old 05-18-2010, 12:26 AM   #5
Graham Etherington
Member
 
Location: Norwich, England.

Join Date: Apr 2010
Posts: 22
Default

A quick question about the Pileup format - if $ means a position where a read ends on the alignment, than can the example pileup given on the Samtools website ever be true?
http://samtools.sourceforge.net/pileup.shtml
At the third positions in the pileup at reference positions 272 and 274 there are $ signs, which in my understanding of this statement means that two reads end within two nucleotides of each other at the same alignment position, which is an impossible scenario.
Is this just a bad example of the pile up format, or have I misunderstood the explanation of the $ sign?
Graham Etherington is offline   Reply With Quote
Old 05-18-2010, 12:36 AM   #6
nilshomer
Nils Homer
 
nilshomer's Avatar
 
Location: Boston, MA, USA

Join Date: Nov 2008
Posts: 1,285
Default

Quote:
Originally Posted by Graham Etherington View Post
...two reads end within two nucleotides of each other at the same alignment position...
The above statement does not make sense to me. Could you clarify?

Maybe this example will help:
Code:
REF: AAAACA
R1:  AAAAC
R2:    AACAA
Pileup(ish):
Code:
REF 1 A ^k. <
REF 2 A . <
REF 3 A .^k. <<
REF 4 A .. <<
REF 5 C .$. <<
REF 6 A .$ <
nilshomer is offline   Reply With Quote
Old 05-18-2010, 01:52 AM   #7
Graham Etherington
Member
 
Location: Norwich, England.

Join Date: Apr 2010
Posts: 22
Default

Got it! Thanks (and sorry for being so thick )
Graham Etherington is offline   Reply With Quote
Old 05-18-2010, 04:01 AM   #8
Bruins
Member
 
Location: Groningen

Join Date: Feb 2010
Posts: 78
Default

Hi all,

thanks for your replies (and related question :P)!!

So... '^~.' for a chromosome position where one read aligns means: the read starts with a match (.) and that match has quality represented by ~ (ASCII code of ~ minus 33).
'G$G$G$' means that at that particular position, three reads end with a mismatch (in all three cases a G).

If I got this correct, I think I know how to interpret lines like the ones in my example. Any comments?

cheers,
Bruins is offline   Reply With Quote
Old 03-29-2011, 04:13 AM   #9
Jenzo
Member
 
Location: Bad Nauheim, Germany

Join Date: Feb 2011
Posts: 31
Default

Hey all,

back to the CIGAR string: Would you consider it possible to calculate "Percent Identity" out of the CIGAR string?

My problem is: I use various mappers to align ~807000 454-Sequences against a set of "reference" sequences and need to decide, which of the 454-Sequences are (more or less) contained in my reference sequences. I would like to take only those sequences into account that have minimum of 95% Sequence Identity (number of identical residues (“matches”) in relation to the length of the alignment) in their alignment. For example in SSAHA2 (http://www.sanger.ac.uk/resources/software/ssaha2/) it is possible to set the parameter -identity properly. Other mappers do not provide such an option, so I would filter the SAM-Output via CIGAR string.
Do you get my problem? Perhaps it is not necessary and i totally overlooked other possibilities?

Thanks!
Jenzo is offline   Reply With Quote
Old 03-29-2011, 11:38 AM   #10
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

for nearly all existing SAMs: no, not directly. but you can use samtools calmd to generate extra tags given the reference sequence and SAM.
lh3 is offline   Reply With Quote
Old 03-30-2011, 12:06 AM   #11
Jenzo
Member
 
Location: Bad Nauheim, Germany

Join Date: Feb 2011
Posts: 31
Default

Dear lh3,
thanks for this hint. I'll try to use calmd now!

Update: With an own perlscript it was possible to use information in cigar string and calmd-sam to calculate % Sequence Identity. Thanks a lot for help provided!

Last edited by Jenzo; 03-30-2011 at 04:20 AM.
Jenzo is offline   Reply With Quote
Old 02-03-2012, 03:32 AM   #12
Irina Pulyakhina
Member
 
Location: Oxford

Join Date: Sep 2010
Posts: 24
Default

Hi all,

Does anyone know the difference between N and D in CIGAR? I assume for long deletions (let's say 1000 b.p.) you'll get "1000N" and for short ones (let's say 2 b.p.) you'll get "2D" in the CIGAR. But where is the border? At what point will it change from D to N?

Thanks!
Irina Pulyakhina is offline   Reply With Quote
Old 02-03-2012, 04:50 AM   #13
arvid
Senior Member
 
Location: Berlin

Join Date: Jul 2011
Posts: 156
Default

Quote:
Originally Posted by Irina Pulyakhina View Post
Hi all,

Does anyone know the difference between N and D in CIGAR? I assume for long deletions (let's say 1000 b.p.) you'll get "1000N" and for short ones (let's say 2 b.p.) you'll get "2D" in the CIGAR. But where is the border? At what point will it change from D to N?

Thanks!
N is typically used to define an intron when aligning mRNA-originating reads to a genome; D would be for real deletions. Though the usage of N is not strictly defined in the SAM spec.
arvid is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:49 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO