SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Parse CIGAR string in C/C++ tedwong Bioinformatics 4 04-08-2015 03:30 AM
[Bowtie2] CIGAR string calculation. Coryza Bioinformatics 6 03-14-2014 02:12 AM
Scripture doesn't recognize CIGAR string. thejustpark RNA Sequencing 0 08-27-2012 12:44 PM
CIGAR string from BWA-SW output incorrect ? robs Bioinformatics 13 01-13-2012 04:07 AM
generate CIGAR string from 2 sequences? bbimber Bioinformatics 0 03-20-2010 09:44 AM

Reply
 
Thread Tools
Old 10-27-2016, 10:23 AM   #1
dayhan
Junior Member
 
Location: Massachusetts

Join Date: Jun 2016
Posts: 3
Default replacing CIGAR string

Hello,

I want to replace the CIGAR strings of SAM/BAM file without changing anything else so that the soft-clipped bases are shown as matches/mismatches.

e.g. 11S60M becomes 71M
15M1D40M15S becomes 15M1D55M

How can I do this?

I have a genome with N's between the contigs. I want to fill these regions using overhangs. The length of N's may be incorrect. I aligned my reads with BWA-mem, and will get the consensus seq using samtools mpileup. I thought this would be the best approach.

Thank you
dayhan is offline   Reply With Quote
Old 10-30-2016, 01:56 AM   #2
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,540
Default

This is an interesting idea, I've wondered about something similar in the context of how to visualise the soft clipping in an alignment viewer.

I can see how to do this in code. In addition to replacing the S with an M operation, start soft trimming needs you to also adjust the start POS for the alignment - end soft trimming doesn't have that worry.

In either case you ought to make sure the extended read does not map off the ends of the reference - some viewers might reject the file or at least ignore that read as invalid.

Can you program? I would do this in Python working on the SAM format - E.g. https://github.com/peterjc/picobio/b..._strip_tags.py is a simplistic Python script for SAM file editing.

However, there are now so many good SAM/BAM libraries available you can probably do this in your language of choice on either format.
maubp is offline   Reply With Quote
Old 10-30-2016, 08:35 PM   #3
dcameron
Member
 
Location: Australia

Join Date: Mar 2013
Posts: 25
Default

Since the read bases don't match, aren't you better off trimming the reads? What is your use case for considering mismatching bases aligned?
dcameron is offline   Reply With Quote
Old 10-31-2016, 03:11 AM   #4
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,540
Default

I believe dayhan has soft trimmed bases at the edge of an NNNNN region in his reference, e.g.

Code:
ACGTNNNN Ref
ACGT---- Part mapped read, say ACGTGGTTwith 4 bases soft clipped, CIGAR 4M4S
To become:

Code:
ACGTNNNN Ref
ACGTGGTT clipped bases mapped as GIGAR 8M
If this happens with lots of reads, you would be able to see visually if there is a consensus over the NNNN regions which could be filled with a gap filling tool (or directly by calling the consensus on the modified SAM/BAM file as he suggests).
maubp is offline   Reply With Quote
Old 10-31-2016, 04:15 PM   #5
dcameron
Member
 
Location: Australia

Join Date: Mar 2013
Posts: 25
Default

That does make sense as a simple starting point for filling the Ns. Code-wise, it's pretty simple and would look something like:

Code:
for each read
    if cigar starts with S
        replace S with M
        alignment position = alignment position - length of S
    if cigar ends with S
        replace S with M
A more comprehensive approach to filling the gaps would be to assemble the reads next to the Ns as you would be able to extend from not only the soft clipped reads but also extend into the gaps with read pairs.
dcameron is offline   Reply With Quote
Old 10-31-2016, 04:30 PM   #6
dcameron
Member
 
Location: Australia

Join Date: Mar 2013
Posts: 25
Default

Shameless plug: my structural variant caller GRIDSS performs genome-wide breakend assembly which, if you're looking for a turn-key solution to extend into the Ns, would fit that purpose quite nicely. If you ignore the subsequent variant calling steps and only look at the intermediate breakend assembly bam file, you'll have a file that contains at each loci, a contig extending into the Ns assembled from the adjacent soft clipped reads and read pairs.
dcameron is offline   Reply With Quote
Old 11-01-2016, 08:23 AM   #7
dayhan
Junior Member
 
Location: Massachusetts

Join Date: Jun 2016
Posts: 3
Default

Quote:
Originally Posted by maubp View Post
This is an interesting idea, I've wondered about something similar in the context of how to visualise the soft clipping in an alignment viewer.

I can see how to do this in code. In addition to replacing the S with an M operation, start soft trimming needs you to also adjust the start POS for the alignment - end soft trimming doesn't have that worry.

In either case you ought to make sure the extended read does not map off the ends of the reference - some viewers might reject the file or at least ignore that read as invalid.

Can you program? I would do this in Python working on the SAM format - E.g. https://github.com/peterjc/picobio/b..._strip_tags.py is a simplistic Python script for SAM file editing.

However, there are now so many good SAM/BAM libraries available you can probably do this in your language of choice on either format.
Thank you for your response. I can see the soft-clipped parts in IGV and they can be assembled cleanly in most cases (attached). I am not good with programming but I can try what you suggested.

Quote:
Originally Posted by dcameron View Post
Shameless plug: my structural variant caller GRIDSS performs genome-wide breakend assembly which, if you're looking for a turn-key solution to extend into the Ns, would fit that purpose quite nicely. If you ignore the subsequent variant calling steps and only look at the intermediate breakend assembly bam file, you'll have a file that contains at each loci, a contig extending into the Ns assembled from the adjacent soft clipped reads and read pairs.
Thank you. I'd actually tried your software recently, but didn't check the assembly.bam. In the picture attached, I'm showing the assembly in the middle track, and the alignments in the last track. The assembly-read also soft-clipped (the longest on right: 117M117S). So, I guess I still have to change the CIGAR/POS for getting the consensus seq.
It's a really good program btw.
Attached Images
File Type: png igv_snapshot.png (140.7 KB, 8 views)
dayhan is offline   Reply With Quote
Old 11-01-2016, 10:09 PM   #8
dcameron
Member
 
Location: Australia

Join Date: Mar 2013
Posts: 25
Default

Quote:
Originally Posted by dayhan View Post
Thank you. I'd actually tried your software recently, but didn't check the assembly.bam. In the picture attached, I'm showing the assembly in the middle track, and the alignments in the last track. The assembly-read also soft-clipped (the longest on right: 117M117S). So, I guess I still have to change the CIGAR/POS for getting the consensus seq.

In that example can't you just take the assembled soft clipped bases as the consensus? If you take the final 117 bases of the assembly contig 'read'. The no-tools approach would be to right click on the long contig, copy read details to clipboard, paste into your favorite text editor (eg notepad++) and delete everything except the last 117 bases of the read sequence and you have your consensus sequence.

For breakend oriented the other direction (eg 100S50M), then you just take the first n bases of the read sequence as your consensus.

Quote:
Originally Posted by dayhan View Post
It's a really good program btw.
Thanks - it's good to see the effort put into usability has been well spent. I'm currently working on splitting out the components in the pipeline and make all intermediate files use the SAM tags defined in the specifications. My hope is this will encourage reuse for other purposes (eg such as this assembly-only scenario), and also encourage interoperability with other tool. There's no reason GRIDSS couldn't use Pindel's split-read detection approach, or CLEVER's discordant read pair identification other than the fact that none of these tools were designed to be modular or play nicely in alternate pipelines.

Last edited by dcameron; 11-01-2016 at 10:17 PM.
dcameron is offline   Reply With Quote
Old 11-02-2016, 05:50 AM   #9
dayhan
Junior Member
 
Location: Massachusetts

Join Date: Jun 2016
Posts: 3
Default

Quote:
Originally Posted by dcameron View Post
In that example can't you just take the assembled soft clipped bases as the consensus? If you take the final 117 bases of the assembly contig 'read'. The no-tools approach would be to right click on the long contig, copy read details to clipboard, paste into your favorite text editor (eg notepad++) and delete everything except the last 117 bases of the read sequence and you have your consensus sequence.
As a starting point, yes, I'm trying manual editing. But there are more than 1000 instances, and I was thinking to automate the process.

Thank you for your responses.
dayhan is offline   Reply With Quote
Reply

Tags
bam, cigar, sam, samtools, soft clipping

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 02:19 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO