Seqanswers Leaderboard Ad

**swbarnes2** · 01-09-2012, 09:34 AM

Did you try examining your .sam file? Maybe the .bam conversion didn't work correctly.

**SEQquestions** · 01-10-2012, 02:04 AM

Hi,

Thanks for your response swbarnes2. I have worked it out: for some reason the conversion from CG to sam/bam ends up with the padding operator being allowed at the beginning or end of the cigar string - which pileup does not like. I need to go through and remove all alignments where that has happened.

Does anyone have any ideas about a way to do this that doesn't involve converting to bam to sam and deleting lines where the cigar string has \d+P at the beginning or end? I am looking at whole genomes here so could be a lengthy process

Cheers

**nora** · 01-27-2012, 01:44 AM

I'm having the exact same problem (Complete Genomics data converted to BAM with CGA tools, same error with samtools).

I tried removing the reads with padding operators at the beginning or end of the cigar string, but this doesn't solve the problem. Some of these reads seem to cause problems indeed, but not all of them, and there must be an additional/different cause.

Has anybody found a workaround? I'll be checking which reads cause samtools to throw the error and I'll be contacting the Complete Genomics support, but I'd be glad if someone could help!

Thanks!

**SEQquestions** · 01-27-2012, 02:10 AM

Hi nora,

I found a way round it (a fix rather than a cure, as it were) by using the following awk statement:
awk '{if (\$6!~/^[[:digit:]]+N[[:digit:]]+D|^[[:digit:]]+P|[[:digit:]]+P\$|^[[:digit:]]+I/) print \$0}'

Basically meaning that you need to remove any line where the cigar string fulfils any of the following:
* starts with \d+N\dD
* starts with \d+P
* starts with \d+I
* ends with \d+P

where \d+ is the perl regex meaning 'any integer containing an unspecified number of digits'. Hope that helps

**nora** · 01-27-2012, 02:52 AM

Thanks a lot! I just had a look at the CIGAR strings causing problems, and indeed it seems that everything that doesn't start with the match operator causes the samtools error.

This is less than 1% of the reads for my data though, so I'll go with removing those reads...

**gtyrelle** · 01-30-2012, 02:50 AM

Hi @Nora & @ SEQquestions, I work for the App Sci team at Complete. Would you mind posting some examples of the cigar strings causing samtools to crash ? Were these SAM/BAM files generated from initial mappings or evidence files ? Finally what are you plans for the resulting BAM files ? further downstream analysis or visualization ?

Conversion of our raw and realigned data to BAM is still an ongoing project. We are actively seeking feedback from users about their experiences using and working with the converted data.

**SEQquestions** · 01-30-2012, 05:37 AM

Hi Greg,

Basically, as nora said, any cigar string that does not start with the match operator is causing an issue. A few examples are:

105N1D23M6N10M
31P23I6P10I
39P23I6P10I
114N1D23M5N10M

It is necessary for us to convert to BAM files because we have some Illumina-sequenced samples that we need to be able to compare with our CG-sequenced sampled - and seeing as CG tools/methods are not applicable to Illumina data, we need to get the CG data in to bam format so we can run it through the same analyses and software.

Not ideal by any means but we are attempting, as much as possible, to compare apples with apples rather than apples with pears

Cheers

**nora** · 01-31-2012, 01:24 AM

Hi Greg,

I'd like to add that, oddly enough, not *every* CIGAR string that is starting with a different operator is causing a problem. If I take, say, the first 200000 reads in my converted BAM file and run them through samtools mpileup, no error is thrown even though there are some CIGAR strings starting with other operators than M, e.g.:

11P2I8M5N23M
2P4I19M6N10M
6I4M6N10M2N12M
4P2I8M6N23M
5P1I9M6N23M
5P1I9M6N24M
6I4M6N23M
1I9M6N23M
1I9M6N10M1N12M
1I9M6N23M
2P7I16M6N10M
9I14M7N10M
5P4I19M4N10M
1P8I15M6N10M
7P2I22M5N10M
7P2I21M6N10M
7P2I10M2N10M6N10M
8P1I22M6N10M
1P8I15M6N10M

However, if I use the full BAM file, it fails, and if I remove all reads with a CIGAR string that does not start or end with the match operator, it solves the problem (at least for my 4 samples).

These make up around 0.01% of the reads in my case.

The BAM files were generated from all initial mappings and the evidence files (converted all with CGAtools and merged them to one file). The reason I convert to BAM files is also because I have a number of established workflows/pipelines that are tailored to use BAM, so I will use them for both downstream analysis and visalization.

**gtyrelle** · 02-08-2012, 05:34 AM

Hi Nora and SEQuestions,

Sorry for the late reply, we've been on the road quite a bit in the last two weeks. Let me start by saying that the specific issue raised in this thread is known to CGI. For any CG customers that happen upon this thread, and are having issues using BAM files generated by cgatools, get in touch with your local FAS. If you are not sure who that is, send an email to [email protected], we'll get you connected. In many cases we can get you the analysis you want without needing to generate BAM files.

That being said, we very much want our data to be usable at different points in downstream processing pipelines, including our raw data (reads and mappings). GATK is of particular interest to us. However, problems arise because of an impedance missmatch due to our unique read structure and the approach that CG takes to mapping and calling variants vs. other platforms.

Briefly: we take a hybrid approach somewhere between reference guided assembly and de novo assembly. There is an initial mapping phase where reads are mapped to the reference genome, then sites that are discordant are flagged for local de novo (LDN) assembly. We provide conversion tools for the initial mappings, and also the results of the LDN, which we call evidence mappings. One common use case for converting to BAM is doing a head-to-head coverage comparison. There are subtleties in the way the pipeline deals with multiple read mappings, inclusion of unmapped reads during LDN etc. that can make the conversion to BAM and the subsequent coverage comparisons come out wonky. The details are too much for a forum post, so if anyone out there is interested, again feel free to email [email protected] and I'll get in touch with you. I'd be happy to answer questions in this thread also.

Finally, we don't do assembly this way just to be different, we approach mapping and calling this way because our research tells us it generates the most accurate consensus results given the characteristics of the data we generate. Our challenge is to work with the community to find the sweet spot for our tools and data to work with third-party academic tools.

**ritzriya** · 03-05-2012, 03:56 AM

samtools: bam_pileup.c:112: resolve_cigar2: Assertion `s->k < c->n_cigar' failed

Hi everyone,

I have the same problem as this thread started on. I have tried to remove the padded characters by using the awk command but still I get the same error. Following is the commands I have used to get the output vcf file:

$./samtools mpileup -uf hg1to24.fa CGA/output_sorted.bam | bcftools/bcftools view -vcg - > mpileup_view.bcf
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
samtools: bam_pileup.c:112: resolve_cigar2: Assertion `s->k < c->n_cigar' failed.
[afs] 0:13637.821 1:1240.324 2:329.855
$bcftools/bcftools view mpileup_view.bcf | bcftools/vcfutils.pl varFilter -D100 > mpileup_view_filtered.vcf
[bcf_sync] incorrect number of fields (0 != 5) at 0:0

If anyone could help, I would be grateful.

Thanks.

**nora** · 03-05-2012, 04:16 AM

Hi ritzriya,

maybe you can post the commands you used for generating the BAM file? Did you remove only the padding operators or others too?

Cheers

**ritzriya** · 03-05-2012, 04:22 AM

Originally posted by nora View Post

Hi ritzriya,

maybe you can post the commands you used for generating the BAM file? Did you remove only the padding operators or others too?

Cheers

Sure. Following are the commands I used to generate the SAM file and then using samtools I converted it into BAM file. I dont think this is what you are asking me, is it? Please let me know.

[QUOTE]awk '{if ($6!~/^[[:digit:]]+N[[:digit:]]+D|^[[:digit:]]+P|[[:digit:]]+P$|^[[:digit:]]+I/) print $0}' input.sam > edited.sam [QUOTE]

**nora** · 03-05-2012, 04:25 AM

Sorry, I meant the commands for generating the BAM file with the removed operators

**gtyrelle** · 03-05-2012, 04:27 AM

Hi ritzriya,

Using samtools mpileup to call variants off CG evidence reads is going to generate worse small variant calls calls than the native CG calls in the var and masterVar files. I would strongly advise using the native CG calls, unless you have very specific reasons for using samtools mpileup on the evidence reads.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

samtools: bam_pileup.c:112: resolve_cigar2: Assertion `s->k < c->n_cigar' failed

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News