Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • samtools: bam_pileup.c:112: resolve_cigar2: Assertion `s->k < c->n_cigar' failed

    Hi,

    I have a bam file that I have run through the GATK pipeline (i.e. de novo realignments around indels, duplicates removed and quality recalibration). I then wish to run mpileup on the processed bam files using the command:

    samtools mpileup -s -f hg19.fa.fai sample.sorted.bam > sample.pileup

    The programme seems to work initially but then aborts saying:
    samtools: bam_pileup.c:112: resolve_cigar2: Assertion `s->k < c->n_cigar' failed.

    I haven't learned to programme in C so am really struggling to understand what this means and how I can circumvent the issue. Can anyone help?

    NB The original sequencing was done with Complete Genomics so I have had to convert from their proprietary alignment formats to bam using cgatools
    Last edited by SEQquestions; 01-09-2012, 06:40 AM. Reason: formatting

  • #2
    Did you try examining your .sam file? Maybe the .bam conversion didn't work correctly.

    Comment


    • #3
      Hi,

      Thanks for your response swbarnes2. I have worked it out: for some reason the conversion from CG to sam/bam ends up with the padding operator being allowed at the beginning or end of the cigar string - which pileup does not like. I need to go through and remove all alignments where that has happened.

      Does anyone have any ideas about a way to do this that doesn't involve converting to bam to sam and deleting lines where the cigar string has \d+P at the beginning or end? I am looking at whole genomes here so could be a lengthy process

      Cheers

      Comment


      • #4
        I'm having the exact same problem (Complete Genomics data converted to BAM with CGA tools, same error with samtools).

        I tried removing the reads with padding operators at the beginning or end of the cigar string, but this doesn't solve the problem. Some of these reads seem to cause problems indeed, but not all of them, and there must be an additional/different cause.

        Has anybody found a workaround? I'll be checking which reads cause samtools to throw the error and I'll be contacting the Complete Genomics support, but I'd be glad if someone could help!

        Thanks!

        Comment


        • #5
          Hi nora,

          I found a way round it (a fix rather than a cure, as it were) by using the following awk statement:
          awk '{if (\$6!~/^[[:digit:]]+N[[:digit:]]+D|^[[:digit:]]+P|[[:digit:]]+P\$|^[[:digit:]]+I/) print \$0}'

          Basically meaning that you need to remove any line where the cigar string fulfils any of the following:
          * starts with \d+N\dD
          * starts with \d+P
          * starts with \d+I
          * ends with \d+P

          where \d+ is the perl regex meaning 'any integer containing an unspecified number of digits'. Hope that helps

          Comment


          • #6
            Thanks a lot! I just had a look at the CIGAR strings causing problems, and indeed it seems that everything that doesn't start with the match operator causes the samtools error.

            This is less than 1% of the reads for my data though, so I'll go with removing those reads...

            Comment


            • #7
              Hi @Nora & @ SEQquestions, I work for the App Sci team at Complete. Would you mind posting some examples of the cigar strings causing samtools to crash ? Were these SAM/BAM files generated from initial mappings or evidence files ? Finally what are you plans for the resulting BAM files ? further downstream analysis or visualization ?

              Conversion of our raw and realigned data to BAM is still an ongoing project. We are actively seeking feedback from users about their experiences using and working with the converted data.
              Bioinformatics Applications, Europe
              Lifetech Inc. http://www.lifetech.com/

              Comment


              • #8
                Hi Greg,

                Basically, as nora said, any cigar string that does not start with the match operator is causing an issue. A few examples are:

                105N1D23M6N10M
                31P23I6P10I
                39P23I6P10I
                114N1D23M5N10M

                It is necessary for us to convert to BAM files because we have some Illumina-sequenced samples that we need to be able to compare with our CG-sequenced sampled - and seeing as CG tools/methods are not applicable to Illumina data, we need to get the CG data in to bam format so we can run it through the same analyses and software.

                Not ideal by any means but we are attempting, as much as possible, to compare apples with apples rather than apples with pears

                Cheers

                Comment


                • #9
                  Hi Greg,

                  I'd like to add that, oddly enough, not *every* CIGAR string that is starting with a different operator is causing a problem. If I take, say, the first 200000 reads in my converted BAM file and run them through samtools mpileup, no error is thrown even though there are some CIGAR strings starting with other operators than M, e.g.:

                  11P2I8M5N23M
                  2P4I19M6N10M
                  6I4M6N10M2N12M
                  4P2I8M6N23M
                  5P1I9M6N23M
                  5P1I9M6N24M
                  6I4M6N23M
                  1I9M6N23M
                  1I9M6N10M1N12M
                  1I9M6N23M
                  2P7I16M6N10M
                  9I14M7N10M
                  5P4I19M4N10M
                  1P8I15M6N10M
                  7P2I22M5N10M
                  7P2I21M6N10M
                  7P2I10M2N10M6N10M
                  8P1I22M6N10M
                  1P8I15M6N10M

                  However, if I use the full BAM file, it fails, and if I remove all reads with a CIGAR string that does not start or end with the match operator, it solves the problem (at least for my 4 samples).

                  These make up around 0.01% of the reads in my case.

                  The BAM files were generated from all initial mappings and the evidence files (converted all with CGAtools and merged them to one file). The reason I convert to BAM files is also because I have a number of established workflows/pipelines that are tailored to use BAM, so I will use them for both downstream analysis and visalization.

                  Comment


                  • #10
                    Hi Nora and SEQuestions,

                    Sorry for the late reply, we've been on the road quite a bit in the last two weeks. Let me start by saying that the specific issue raised in this thread is known to CGI. For any CG customers that happen upon this thread, and are having issues using BAM files generated by cgatools, get in touch with your local FAS. If you are not sure who that is, send an email to [email protected], we'll get you connected. In many cases we can get you the analysis you want without needing to generate BAM files.

                    That being said, we very much want our data to be usable at different points in downstream processing pipelines, including our raw data (reads and mappings). GATK is of particular interest to us. However, problems arise because of an impedance missmatch due to our unique read structure and the approach that CG takes to mapping and calling variants vs. other platforms.

                    Briefly: we take a hybrid approach somewhere between reference guided assembly and de novo assembly. There is an initial mapping phase where reads are mapped to the reference genome, then sites that are discordant are flagged for local de novo (LDN) assembly. We provide conversion tools for the initial mappings, and also the results of the LDN, which we call evidence mappings. One common use case for converting to BAM is doing a head-to-head coverage comparison. There are subtleties in the way the pipeline deals with multiple read mappings, inclusion of unmapped reads during LDN etc. that can make the conversion to BAM and the subsequent coverage comparisons come out wonky. The details are too much for a forum post, so if anyone out there is interested, again feel free to email [email protected] and I'll get in touch with you. I'd be happy to answer questions in this thread also.

                    Finally, we don't do assembly this way just to be different, we approach mapping and calling this way because our research tells us it generates the most accurate consensus results given the characteristics of the data we generate. Our challenge is to work with the community to find the sweet spot for our tools and data to work with third-party academic tools.
                    Bioinformatics Applications, Europe
                    Lifetech Inc. http://www.lifetech.com/

                    Comment


                    • #11
                      samtools: bam_pileup.c:112: resolve_cigar2: Assertion `s-&gt;k &lt; c-&gt;n_cigar' failed

                      Hi everyone,

                      I have the same problem as this thread started on. I have tried to remove the padded characters by using the awk command but still I get the same error. Following is the commands I have used to get the output vcf file:

                      $./samtools mpileup -uf hg1to24.fa CGA/output_sorted.bam | bcftools/bcftools view -vcg - > mpileup_view.bcf
                      [mpileup] 1 samples in 1 input files
                      <mpileup> Set max per-file depth to 8000
                      samtools: bam_pileup.c:112: resolve_cigar2: Assertion `s->k < c->n_cigar' failed.
                      [afs] 0:13637.821 1:1240.324 2:329.855
                      $bcftools/bcftools view mpileup_view.bcf | bcftools/vcfutils.pl varFilter -D100 > mpileup_view_filtered.vcf
                      [bcf_sync] incorrect number of fields (0 != 5) at 0:0
                      If anyone could help, I would be grateful.

                      Thanks.

                      Comment


                      • #12
                        Hi ritzriya,

                        maybe you can post the commands you used for generating the BAM file? Did you remove only the padding operators or others too?

                        Cheers

                        Comment


                        • #13
                          Originally posted by nora View Post
                          Hi ritzriya,

                          maybe you can post the commands you used for generating the BAM file? Did you remove only the padding operators or others too?

                          Cheers
                          Sure. Following are the commands I used to generate the SAM file and then using samtools I converted it into BAM file. I dont think this is what you are asking me, is it? Please let me know.

                          [QUOTE]awk '{if ($6!~/^[[:digit:]]+N[[:digit:]]+D|^[[:digit:]]+P|[[:digit:]]+P$|^[[:digit:]]+I/) print $0}' input.sam > edited.sam [QUOTE]
                          Last edited by ritzriya; 03-05-2012, 04:29 AM. Reason: Wrong command

                          Comment


                          • #14
                            Sorry, I meant the commands for generating the BAM file with the removed operators

                            Comment


                            • #15
                              Hi ritzriya,

                              Using samtools mpileup to call variants off CG evidence reads is going to generate worse small variant calls calls than the native CG calls in the var and masterVar files. I would strongly advise using the native CG calls, unless you have very specific reasons for using samtools mpileup on the evidence reads.
                              Bioinformatics Applications, Europe
                              Lifetech Inc. http://www.lifetech.com/

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              30 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              32 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              28 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X