SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
HTSeq Script from DEXSeq Reports Assertion Fail in SAM file FuzzyCoder Bioinformatics 5 09-27-2011 08:52 AM
Assertion failed error in BFAST localalign seeker Bioinformatics 7 09-02-2011 09:33 PM
mpileup failed when using samtools.pl varFilter jianfeng.mao Genomic Resequencing 2 07-11-2011 08:49 AM
Samtools Pileup Assertion Error AnamikaDarwin Bioinformatics 2 06-29-2009 12:44 PM
maq map - Assertion 'matches' failed AnamikaDarwin Bioinformatics 1 12-10-2008 07:56 AM

Reply
 
Thread Tools
Old 01-09-2012, 05:39 AM   #1
SEQquestions
Member
 
Location: Leeds

Join Date: Jan 2010
Posts: 13
Default samtools: bam_pileup.c:112: resolve_cigar2: Assertion `s->k < c->n_cigar' failed

Hi,

I have a bam file that I have run through the GATK pipeline (i.e. de novo realignments around indels, duplicates removed and quality recalibration). I then wish to run mpileup on the processed bam files using the command:

samtools mpileup -s -f hg19.fa.fai sample.sorted.bam > sample.pileup

The programme seems to work initially but then aborts saying:
samtools: bam_pileup.c:112: resolve_cigar2: Assertion `s->k < c->n_cigar' failed.

I haven't learned to programme in C so am really struggling to understand what this means and how I can circumvent the issue. Can anyone help?

NB The original sequencing was done with Complete Genomics so I have had to convert from their proprietary alignment formats to bam using cgatools

Last edited by SEQquestions; 01-09-2012 at 05:40 AM. Reason: formatting
SEQquestions is offline   Reply With Quote
Old 01-09-2012, 08:34 AM   #2
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Did you try examining your .sam file? Maybe the .bam conversion didn't work correctly.
swbarnes2 is offline   Reply With Quote
Old 01-10-2012, 01:04 AM   #3
SEQquestions
Member
 
Location: Leeds

Join Date: Jan 2010
Posts: 13
Default

Hi,

Thanks for your response swbarnes2. I have worked it out: for some reason the conversion from CG to sam/bam ends up with the padding operator being allowed at the beginning or end of the cigar string - which pileup does not like. I need to go through and remove all alignments where that has happened.

Does anyone have any ideas about a way to do this that doesn't involve converting to bam to sam and deleting lines where the cigar string has \d+P at the beginning or end? I am looking at whole genomes here so could be a lengthy process

Cheers
SEQquestions is offline   Reply With Quote
Old 01-27-2012, 12:44 AM   #4
nora
Member
 
Location: Germany

Join Date: Dec 2009
Posts: 11
Default

I'm having the exact same problem (Complete Genomics data converted to BAM with CGA tools, same error with samtools).

I tried removing the reads with padding operators at the beginning or end of the cigar string, but this doesn't solve the problem. Some of these reads seem to cause problems indeed, but not all of them, and there must be an additional/different cause.

Has anybody found a workaround? I'll be checking which reads cause samtools to throw the error and I'll be contacting the Complete Genomics support, but I'd be glad if someone could help!

Thanks!
nora is offline   Reply With Quote
Old 01-27-2012, 01:10 AM   #5
SEQquestions
Member
 
Location: Leeds

Join Date: Jan 2010
Posts: 13
Default

Hi nora,

I found a way round it (a fix rather than a cure, as it were) by using the following awk statement:
awk '{if (\$6!~/^[[:digit:]]+N[[:digit:]]+D|^[[:digit:]]+P|[[:digit:]]+P\$|^[[:digit:]]+I/) print \$0}'

Basically meaning that you need to remove any line where the cigar string fulfils any of the following:
* starts with \d+N\dD
* starts with \d+P
* starts with \d+I
* ends with \d+P

where \d+ is the perl regex meaning 'any integer containing an unspecified number of digits'. Hope that helps
SEQquestions is offline   Reply With Quote
Old 01-27-2012, 01:52 AM   #6
nora
Member
 
Location: Germany

Join Date: Dec 2009
Posts: 11
Default

Thanks a lot! I just had a look at the CIGAR strings causing problems, and indeed it seems that everything that doesn't start with the match operator causes the samtools error.

This is less than 1% of the reads for my data though, so I'll go with removing those reads...
nora is offline   Reply With Quote
Old 01-30-2012, 01:50 AM   #7
gtyrelle
Member
 
Location: the Netherlands

Join Date: Feb 2011
Posts: 16
Default

Hi @Nora & @ SEQquestions, I work for the App Sci team at Complete. Would you mind posting some examples of the cigar strings causing samtools to crash ? Were these SAM/BAM files generated from initial mappings or evidence files ? Finally what are you plans for the resulting BAM files ? further downstream analysis or visualization ?

Conversion of our raw and realigned data to BAM is still an ongoing project. We are actively seeking feedback from users about their experiences using and working with the converted data.
__________________
Bioinformatics Applications, Europe
Lifetech Inc. http://www.lifetech.com/
gtyrelle is offline   Reply With Quote
Old 01-30-2012, 04:37 AM   #8
SEQquestions
Member
 
Location: Leeds

Join Date: Jan 2010
Posts: 13
Default

Hi Greg,

Basically, as nora said, any cigar string that does not start with the match operator is causing an issue. A few examples are:

105N1D23M6N10M
31P23I6P10I
39P23I6P10I
114N1D23M5N10M

It is necessary for us to convert to BAM files because we have some Illumina-sequenced samples that we need to be able to compare with our CG-sequenced sampled - and seeing as CG tools/methods are not applicable to Illumina data, we need to get the CG data in to bam format so we can run it through the same analyses and software.

Not ideal by any means but we are attempting, as much as possible, to compare apples with apples rather than apples with pears

Cheers
SEQquestions is offline   Reply With Quote
Old 01-31-2012, 12:24 AM   #9
nora
Member
 
Location: Germany

Join Date: Dec 2009
Posts: 11
Default

Hi Greg,

I'd like to add that, oddly enough, not *every* CIGAR string that is starting with a different operator is causing a problem. If I take, say, the first 200000 reads in my converted BAM file and run them through samtools mpileup, no error is thrown even though there are some CIGAR strings starting with other operators than M, e.g.:

11P2I8M5N23M
2P4I19M6N10M
6I4M6N10M2N12M
4P2I8M6N23M
5P1I9M6N23M
5P1I9M6N24M
6I4M6N23M
1I9M6N23M
1I9M6N10M1N12M
1I9M6N23M
2P7I16M6N10M
9I14M7N10M
5P4I19M4N10M
1P8I15M6N10M
7P2I22M5N10M
7P2I21M6N10M
7P2I10M2N10M6N10M
8P1I22M6N10M
1P8I15M6N10M

However, if I use the full BAM file, it fails, and if I remove all reads with a CIGAR string that does not start or end with the match operator, it solves the problem (at least for my 4 samples).

These make up around 0.01% of the reads in my case.

The BAM files were generated from all initial mappings and the evidence files (converted all with CGAtools and merged them to one file). The reason I convert to BAM files is also because I have a number of established workflows/pipelines that are tailored to use BAM, so I will use them for both downstream analysis and visalization.
nora is offline   Reply With Quote
Old 02-08-2012, 04:34 AM   #10
gtyrelle
Member
 
Location: the Netherlands

Join Date: Feb 2011
Posts: 16
Default

Hi Nora and SEQuestions,

Sorry for the late reply, we've been on the road quite a bit in the last two weeks. Let me start by saying that the specific issue raised in this thread is known to CGI. For any CG customers that happen upon this thread, and are having issues using BAM files generated by cgatools, get in touch with your local FAS. If you are not sure who that is, send an email to support@completegenomics.com, we'll get you connected. In many cases we can get you the analysis you want without needing to generate BAM files.

That being said, we very much want our data to be usable at different points in downstream processing pipelines, including our raw data (reads and mappings). GATK is of particular interest to us. However, problems arise because of an impedance missmatch due to our unique read structure and the approach that CG takes to mapping and calling variants vs. other platforms.

Briefly: we take a hybrid approach somewhere between reference guided assembly and de novo assembly. There is an initial mapping phase where reads are mapped to the reference genome, then sites that are discordant are flagged for local de novo (LDN) assembly. We provide conversion tools for the initial mappings, and also the results of the LDN, which we call evidence mappings. One common use case for converting to BAM is doing a head-to-head coverage comparison. There are subtleties in the way the pipeline deals with multiple read mappings, inclusion of unmapped reads during LDN etc. that can make the conversion to BAM and the subsequent coverage comparisons come out wonky. The details are too much for a forum post, so if anyone out there is interested, again feel free to email support@completegenomics.com and I'll get in touch with you. I'd be happy to answer questions in this thread also.

Finally, we don't do assembly this way just to be different, we approach mapping and calling this way because our research tells us it generates the most accurate consensus results given the characteristics of the data we generate. Our challenge is to work with the community to find the sweet spot for our tools and data to work with third-party academic tools.
__________________
Bioinformatics Applications, Europe
Lifetech Inc. http://www.lifetech.com/
gtyrelle is offline   Reply With Quote
Old 03-05-2012, 02:56 AM   #11
ritzriya
Member
 
Location: Canada

Join Date: Jun 2010
Posts: 49
Question samtools: bam_pileup.c:112: resolve_cigar2: Assertion `s->k < c->n_cigar' failed

Hi everyone,

I have the same problem as this thread started on. I have tried to remove the padded characters by using the awk command but still I get the same error. Following is the commands I have used to get the output vcf file:

Quote:
$./samtools mpileup -uf hg1to24.fa CGA/output_sorted.bam | bcftools/bcftools view -vcg - > mpileup_view.bcf
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
samtools: bam_pileup.c:112: resolve_cigar2: Assertion `s->k < c->n_cigar' failed.
[afs] 0:13637.821 1:1240.324 2:329.855
$bcftools/bcftools view mpileup_view.bcf | bcftools/vcfutils.pl varFilter -D100 > mpileup_view_filtered.vcf
[bcf_sync] incorrect number of fields (0 != 5) at 0:0
If anyone could help, I would be grateful.

Thanks.
ritzriya is offline   Reply With Quote
Old 03-05-2012, 03:16 AM   #12
nora
Member
 
Location: Germany

Join Date: Dec 2009
Posts: 11
Default

Hi ritzriya,

maybe you can post the commands you used for generating the BAM file? Did you remove only the padding operators or others too?

Cheers
nora is offline   Reply With Quote
Old 03-05-2012, 03:22 AM   #13
ritzriya
Member
 
Location: Canada

Join Date: Jun 2010
Posts: 49
Question

Quote:
Originally Posted by nora View Post
Hi ritzriya,

maybe you can post the commands you used for generating the BAM file? Did you remove only the padding operators or others too?

Cheers
Sure. Following are the commands I used to generate the SAM file and then using samtools I converted it into BAM file. I dont think this is what you are asking me, is it? Please let me know.

[QUOTE]awk '{if ($6!~/^[[:digit:]]+N[[:digit:]]+D|^[[:digit:]]+P|[[:digit:]]+P$|^[[:digit:]]+I/) print $0}' input.sam > edited.sam [QUOTE]

Last edited by ritzriya; 03-05-2012 at 03:29 AM. Reason: Wrong command
ritzriya is offline   Reply With Quote
Old 03-05-2012, 03:25 AM   #14
nora
Member
 
Location: Germany

Join Date: Dec 2009
Posts: 11
Default

Sorry, I meant the commands for generating the BAM file with the removed operators
nora is offline   Reply With Quote
Old 03-05-2012, 03:27 AM   #15
gtyrelle
Member
 
Location: the Netherlands

Join Date: Feb 2011
Posts: 16
Default

Hi ritzriya,

Using samtools mpileup to call variants off CG evidence reads is going to generate worse small variant calls calls than the native CG calls in the var and masterVar files. I would strongly advise using the native CG calls, unless you have very specific reasons for using samtools mpileup on the evidence reads.
__________________
Bioinformatics Applications, Europe
Lifetech Inc. http://www.lifetech.com/
gtyrelle is offline   Reply With Quote
Old 03-05-2012, 03:34 AM   #16
ritzriya
Member
 
Location: Canada

Join Date: Jun 2010
Posts: 49
Question

Quote:
Originally Posted by gtyrelle View Post
Hi ritzriya,

Using samtools mpileup to call variants off CG evidence reads is going to generate worse small variant calls calls than the native CG calls in the var and masterVar files. I would strongly advise using the native CG calls, unless you have very specific reasons for using samtools mpileup on the evidence reads.
Hi gtyrelle,

So you mean that its not advisable to use samtools mpileup for CG data to find variants? The command "generatemasterVar" (beta) from cgatools ONLY must be used to detect SNPs and INDELs?
ritzriya is offline   Reply With Quote
Old 03-05-2012, 03:39 AM   #17
nora
Member
 
Location: Germany

Join Date: Dec 2009
Posts: 11
Default

Hi ritzriya,

if this is the exact command you used then awk is missing the input and didn't remove anything... I would use samtools view to pipe the output to the awk command and then write the output file, like this:

samtools view -S myfile.sam | awk '{if ($6!~/^[[:digit:]]+N[[:digit:]]+D|^[[:digit:]]+P|[[:digit:]]+P$|^[[:digit:]]+I/) print $0}' > myeditedfile.sam
nora is offline   Reply With Quote
Old 03-05-2012, 05:21 AM   #18
gtyrelle
Member
 
Location: the Netherlands

Join Date: Feb 2011
Posts: 16
Default

Quote:
Originally Posted by ritzriya View Post
So you mean that its not advisable to use samtools mpileup for CG data to find variants?
That is exactly what I mean. We have our own pipeline for small variants that uses local de novo assembly, and exports the results in both the var and masterVar files. The generate masterVar function in cgatools does not call variants. It aggregates information from a genome assembly into a single file that is easy to parse and process. For example, small variant calls (multiple lines per variant) along with scores and cross-references are exported to the var file. The masterVar file includes the same variants present in the var file (one variant per line), but has additional columns with information like allele read counts, annotations etc.

Nora's commands will help you use the evidence files in samtools for visualization. I don't recommend using these files for variant calling, one reason is that there is a post-processing filter on the exported evidence reads, after we call variants, such that not all reads will be present for re-calling. In addition read depth is calculated by weighting reads that map in multiple locations, there is no mechanism in samtools to consider read depth calculations with multiple weighted mappings (at least as far as I know).
__________________
Bioinformatics Applications, Europe
Lifetech Inc. http://www.lifetech.com/
gtyrelle is offline   Reply With Quote
Old 03-05-2012, 11:51 PM   #19
ritzriya
Member
 
Location: Canada

Join Date: Jun 2010
Posts: 49
Question

Quote:
Originally Posted by gtyrelle View Post
We have our own pipeline for small variants that uses local de novo assembly, and exports the results in both the var and masterVar files.
Can you please throw some light on what kind of "pipeline" you are offering to call variants using CGAtools? Or other open source softwares like GATK need to be used to call variants only?

Thanks and Regards,
R.
ritzriya is offline   Reply With Quote
Old 03-05-2012, 11:57 PM   #20
ritzriya
Member
 
Location: Canada

Join Date: Jun 2010
Posts: 49
Question

Quote:
Originally Posted by nora View Post
Hi ritzriya,

if this is the exact command you used then awk is missing the input and didn't remove anything... I would use samtools view to pipe the output to the awk command and then write the output file, like this:

samtools view -S myfile.sam | awk '{if ($6!~/^[[:digit:]]+N[[:digit:]]+D|^[[:digit:]]+P|[[:digit:]]+P$|^[[:digit:]]+I/) print $0}' > myeditedfile.sam
I converted the sam file using the awk command mentioned above. But for downstream analysis, I have to convert it into BAM file using samtools view. When I do the conversion, I get the following error, nora:

Quote:
>samtools view -bh -S new.sam -o new.bam
[samopen] no @SQ lines in the header.
[sam_read1] missing header? Abort!
How to resolve this one?
ritzriya is offline   Reply With Quote
Reply

Tags
cigar, complete genomics, mpileup, samtools

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:39 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO