Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • replacing CIGAR string

    Hello,

    I want to replace the CIGAR strings of SAM/BAM file without changing anything else so that the soft-clipped bases are shown as matches/mismatches.

    e.g. 11S60M becomes 71M
    15M1D40M15S becomes 15M1D55M

    How can I do this?

    I have a genome with N's between the contigs. I want to fill these regions using overhangs. The length of N's may be incorrect. I aligned my reads with BWA-mem, and will get the consensus seq using samtools mpileup. I thought this would be the best approach.

    Thank you

  • #2
    This is an interesting idea, I've wondered about something similar in the context of how to visualise the soft clipping in an alignment viewer.

    I can see how to do this in code. In addition to replacing the S with an M operation, start soft trimming needs you to also adjust the start POS for the alignment - end soft trimming doesn't have that worry.

    In either case you ought to make sure the extended read does not map off the ends of the reference - some viewers might reject the file or at least ignore that read as invalid.

    Can you program? I would do this in Python working on the SAM format - E.g. https://github.com/peterjc/picobio/b..._strip_tags.py is a simplistic Python script for SAM file editing.

    However, there are now so many good SAM/BAM libraries available you can probably do this in your language of choice on either format.

    Comment


    • #3
      Since the read bases don't match, aren't you better off trimming the reads? What is your use case for considering mismatching bases aligned?

      Comment


      • #4
        I believe dayhan has soft trimmed bases at the edge of an NNNNN region in his reference, e.g.

        Code:
        ACGTNNNN Ref
        ACGT---- Part mapped read, say ACGTGGTTwith 4 bases soft clipped, CIGAR 4M4S
        To become:

        Code:
        ACGTNNNN Ref
        ACGTGGTT clipped bases mapped as GIGAR 8M
        If this happens with lots of reads, you would be able to see visually if there is a consensus over the NNNN regions which could be filled with a gap filling tool (or directly by calling the consensus on the modified SAM/BAM file as he suggests).

        Comment


        • #5
          That does make sense as a simple starting point for filling the Ns. Code-wise, it's pretty simple and would look something like:

          Code:
          for each read
              if cigar starts with S
                  replace S with M
                  alignment position = alignment position - length of S
              if cigar ends with S
                  replace S with M
          A more comprehensive approach to filling the gaps would be to assemble the reads next to the Ns as you would be able to extend from not only the soft clipped reads but also extend into the gaps with read pairs.

          Comment


          • #6
            Shameless plug: my structural variant caller GRIDSS performs genome-wide breakend assembly which, if you're looking for a turn-key solution to extend into the Ns, would fit that purpose quite nicely. If you ignore the subsequent variant calling steps and only look at the intermediate breakend assembly bam file, you'll have a file that contains at each loci, a contig extending into the Ns assembled from the adjacent soft clipped reads and read pairs.

            Comment


            • #7
              Originally posted by maubp View Post
              This is an interesting idea, I've wondered about something similar in the context of how to visualise the soft clipping in an alignment viewer.

              I can see how to do this in code. In addition to replacing the S with an M operation, start soft trimming needs you to also adjust the start POS for the alignment - end soft trimming doesn't have that worry.

              In either case you ought to make sure the extended read does not map off the ends of the reference - some viewers might reject the file or at least ignore that read as invalid.

              Can you program? I would do this in Python working on the SAM format - E.g. https://github.com/peterjc/picobio/b..._strip_tags.py is a simplistic Python script for SAM file editing.

              However, there are now so many good SAM/BAM libraries available you can probably do this in your language of choice on either format.
              Thank you for your response. I can see the soft-clipped parts in IGV and they can be assembled cleanly in most cases (attached). I am not good with programming but I can try what you suggested.

              Originally posted by dcameron View Post
              Shameless plug: my structural variant caller GRIDSS performs genome-wide breakend assembly which, if you're looking for a turn-key solution to extend into the Ns, would fit that purpose quite nicely. If you ignore the subsequent variant calling steps and only look at the intermediate breakend assembly bam file, you'll have a file that contains at each loci, a contig extending into the Ns assembled from the adjacent soft clipped reads and read pairs.
              Thank you. I'd actually tried your software recently, but didn't check the assembly.bam. In the picture attached, I'm showing the assembly in the middle track, and the alignments in the last track. The assembly-read also soft-clipped (the longest on right: 117M117S). So, I guess I still have to change the CIGAR/POS for getting the consensus seq.
              It's a really good program btw.
              Attached Files

              Comment


              • #8
                Originally posted by dayhan View Post
                Thank you. I'd actually tried your software recently, but didn't check the assembly.bam. In the picture attached, I'm showing the assembly in the middle track, and the alignments in the last track. The assembly-read also soft-clipped (the longest on right: 117M117S). So, I guess I still have to change the CIGAR/POS for getting the consensus seq.

                In that example can't you just take the assembled soft clipped bases as the consensus? If you take the final 117 bases of the assembly contig 'read'. The no-tools approach would be to right click on the long contig, copy read details to clipboard, paste into your favorite text editor (eg notepad++) and delete everything except the last 117 bases of the read sequence and you have your consensus sequence.

                For breakend oriented the other direction (eg 100S50M), then you just take the first n bases of the read sequence as your consensus.

                Originally posted by dayhan View Post
                It's a really good program btw.
                Thanks - it's good to see the effort put into usability has been well spent. I'm currently working on splitting out the components in the pipeline and make all intermediate files use the SAM tags defined in the specifications. My hope is this will encourage reuse for other purposes (eg such as this assembly-only scenario), and also encourage interoperability with other tool. There's no reason GRIDSS couldn't use Pindel's split-read detection approach, or CLEVER's discordant read pair identification other than the fact that none of these tools were designed to be modular or play nicely in alternate pipelines.
                Last edited by dcameron; 11-01-2016, 10:17 PM.

                Comment


                • #9
                  Originally posted by dcameron View Post
                  In that example can't you just take the assembled soft clipped bases as the consensus? If you take the final 117 bases of the assembly contig 'read'. The no-tools approach would be to right click on the long contig, copy read details to clipboard, paste into your favorite text editor (eg notepad++) and delete everything except the last 117 bases of the read sequence and you have your consensus sequence.
                  As a starting point, yes, I'm trying manual editing. But there are more than 1000 instances, and I was thinking to automate the process.

                  Thank you for your responses.

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Techniques and Challenges in Conservation Genomics
                    by seqadmin



                    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                    Avian Conservation
                    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                    03-08-2024, 10:41 AM
                  • seqadmin
                    The Impact of AI in Genomic Medicine
                    by seqadmin



                    Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                    02-26-2024, 02:07 PM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 03-14-2024, 06:13 AM
                  0 responses
                  33 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-08-2024, 08:03 AM
                  0 responses
                  72 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-07-2024, 08:13 AM
                  0 responses
                  81 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-06-2024, 09:51 AM
                  0 responses
                  68 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X