SEQanswers

Go Back   SEQanswers > Sequencing Technologies/Companies > Complete Genomics

Similar Threads
Thread Thread Starter Forum Replies Last Post
cuffmerge crashes when converting gtf files to sam files swbiggs4 Bioinformatics 12 08-19-2014 01:18 PM
BWA samflag inconsistencies Tally Bioinformatics 5 11-27-2011 04:07 PM
Analyze Evidence Asociated with CG Assembly rskr Complete Genomics 3 11-10-2011 09:09 AM
Evidence of Strand Bias AnamikaDarwin Bioinformatics 1 06-27-2011 04:21 PM
Viewing paired-end inconsistencies in 454 scaffolds phageman Bioinformatics 2 02-17-2009 11:31 AM

Reply
 
Thread Tools
Old 03-06-2012, 09:36 PM   #1
ritzriya
Member
 
Location: INDIA

Join Date: Jun 2010
Posts: 46
Question Inconsistencies between evidence and SAM files

Hi everyone,

The evidence2sam command which produces the SAM file, contains records/lines where a every read is repeated twice mapping to two respective different locations, such as:

Quote:
GS27657-FS3-L02-8:1660459 179 chr14 19089795 16 12M1I1P4I6M6N5M1I4M = 19090178 383 CCTAATTCTTATTTTTATTTTTTTATTTATTTT 9::::656877887;<<<:::<;6-47737783 RG:Z:NA19238-L2-200-37-ASM-chr14 GC:Z:3S2G28S GS:Z:AAAA GQ:Z:::4-
GS27657-FS3-L02-8:1660459 115 chr14 19090178 16 10M5N23M = 19089795 -383 TTCATGAGAGGGTCCACTATTTTTCCCTTGTTA .08877587857;*;1<<9778877871;;::7 RG:Z:NA19238-L2-200-37-ASM-chr14 GC:Z:28S2G3S GS:Z:TGTG GQ:Z:77;;
Here, the read 'GS27657-FS3-L02-8:1660459' maps to the same chromosome (chr14) at two adjacent locations.

Following are my doubts about the Complete Genomics(CG) data:
1) Why is every read ID in the SAM file repeated twice mapping to different locations? Is there a way to resolve this? What does this signify?
2) It is observed that the Evidence file contains reads of uniform length - 70 bp. Post conversion using evidence2sam from cgatools, why does the read length reduce to 33 bp only?
3) Also, if the read sequence is compared between the evidence file and the SAM file,the read sequence is not at all matching to any part of the 70 bp sequence. Is this an error?

If anybody could help, it would be great.

Thanks in advance.
ritzriya is offline   Reply With Quote
Old 03-07-2012, 01:12 AM   #2
gtyrelle
Member
 
Location: the Netherlands

Join Date: Feb 2011
Posts: 16
Default

We create mate-pair libraries for sequencing, what you are seeing is each (~35bp) arm of the mate-pair represented in the sam file. While the SAM specification (PDF) is not exactly fun reading, it does contain the information to figure out what is going on here (from the spec):

Quote:
QNAME: Query template NAME. Reads/segments having identical QNAME are regarded to come from the same template. A QNAME * indicates the information is unavailable.
You can also use the flags for each read to figure out what is going on too, the explain SAM flags page is helpful for this. For example the first read shown has the following properties according to its flag (179): read paired, read mapped in proper pair, read reverse strand, mate reverse strand, second in pair.

The reads we generate are not always precisely 35bp in length. Each read is composed of sub-reads, which may contain both positive gaps and negative gaps (~ 2bp overlap). The overlapping bases are represented in the evidence files, but this feature is not supported by the SAM specification so the overlap is collapsed during conversion making the reads in the SAM files slightly shorter. There are additional flags generated by evidence2sam that specify which of the bases in the read were collapsed (GC/GS/GQ).

Using the explain SAM flags page, you can see that the reads are on the reverse strand, which is why they are not exactly the same between evidence and SAM files.

Your previous post, had an error generated by GATK when parsing the SAM files from the evidence2sam command in cgatools:

Quote:
ERROR MESSAGE: SAM/BAM file SAMFileReader{CGA_test/originalSAM/output_sorted.bam} is malformed: Adjacent I/D events in read GS27657-FS3-L02-8:1660459

This appears to be a GATK issue, discussed here (starting at post #23) and reported here. What was the specific GATK command you were using that generated this error ? And are you using public data or is this your own data ?

I haven't done much work on this particular error, but I would guess GATK is complaining about the first read with cigar 12M1I1P4I6M6N5M1I4M, which contains a number of insertions. I'm confident this is not an error with the read, as the local de novo assembly process can generate complex variant calls that may consider adjacent insertion/deletion events.

Greg
__________________
Bioinformatics Applications, Europe
Lifetech Inc. http://www.lifetech.com/
gtyrelle is offline   Reply With Quote
Old 03-07-2012, 03:04 AM   #3
ritzriya
Member
 
Location: INDIA

Join Date: Jun 2010
Posts: 46
Question

Oh, Okay. So what I finally understand is that the SAM file consists of a split-up of the mate-pair read, i.e. 2 sequences of each of the mate-pair. And even if the entire SAM file consists of such duplicate entries, it is normal and expected. I hope I am talking sense and correct.

Thank you for giving the in depth information about the SAM file tag - was really helpful and interesting.

Moving to the GATK error, I went to the forums which you have pointed to - I used -rf BadCigar option with GATK. It still gives me the error. I will not bother you with GATK error as it seems to be related to the tool; will get it solved independently.

Thanks once again for the prompt responses to my issues
ritzriya is offline   Reply With Quote
Old 03-07-2012, 03:40 AM   #4
gtyrelle
Member
 
Location: the Netherlands

Join Date: Feb 2011
Posts: 16
Default

Quote:
I hope I am talking sense and correct.
You are correct.

It is more appropriate to post issues with GATK to their support forum, however I would still like to know which GATK command you were using. The reason I ask is that in the forum thread, the error seemed to arise with the DINDEL (indel detection) command. Were you using this command, or simply parsing the BAM files ?

You are welcome to continue to ask questions regarding use of GATK with Complete data, it is not a bother, and we very much welcome the feedback on issues encountered with third party tools. This feedback helps us improve compatibility, which is not perfect at the moment, but we genuinely want to see improve.
__________________
Bioinformatics Applications, Europe
Lifetech Inc. http://www.lifetech.com/
gtyrelle is offline   Reply With Quote
Old 03-07-2012, 07:29 PM   #5
ritzriya
Member
 
Location: INDIA

Join Date: Jun 2010
Posts: 46
Question

Following is the GATK command I used:

Quote:
java -jar GenomeAnalysisTK-1.4-37-g0b29d54/GenomeAnalysisTK.jar -T UnifiedGenotyper -I output_sorted.bam -R hg1to24.fa -glm BOTH -o GATK_out -log logfile.txt
I have tried to run -glm SNP - it works. There lies a problem only with -glm INDEL and BOTH options. I used ValidateSam from Picard to check for errors - It said that "Padded characters are not allowed at the start or end of the CIGAR string". I am stuck at this point - whether to remove such records from the file and proceed or is there some solution? I have already posted this query to GATK forum. Will wait for the answer. If anyone could help, it would be great. Thanks Greg.
ritzriya is offline   Reply With Quote
Old 03-08-2012, 06:09 AM   #6
gtyrelle
Member
 
Location: the Netherlands

Join Date: Feb 2011
Posts: 16
Default

Excellent, thank you for the informative discussion and letting us know the commands you have been using. As I suspected, the issue is with using the INDEL detection methods in GATK. The "Padded characters are not allowed at the start..." issue was discussed in the previous thread, local de novo assembly of reads can introduce P operators at the beginning of the cigar string to deal with indels. My understanding is the SAM spec does not explicitly disallow P operators at the beginning of a cigar string, but most tools (GATK/samtools) do not support this.

I do not expect that the results of processing the evidence reads via GATK will produce better results than the native CG pipeline. Keep in mind, that the evidence reads have already been "realigned" (local de novo assembly) and in theory are the optimal set of alignments at a given position in the genome supporting a variant (SNP, SUB, INDEL). I encourage you to try processing these reads via GATK and reporting the results of this experiment, after comparing back to our results.
__________________
Bioinformatics Applications, Europe
Lifetech Inc. http://www.lifetech.com/
gtyrelle is offline   Reply With Quote
Reply

Tags
complete genomics, sam file

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 02:02 PM.


Powered by vBulletin® Version 3.8.6
Copyright ©2000 - 2014, Jelsoft Enterprises Ltd.