![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Parse CIGAR string in C/C++ | tedwong | Bioinformatics | 4 | 04-08-2015 04:30 AM |
[Bowtie2] CIGAR string calculation. | Coryza | Bioinformatics | 6 | 03-14-2014 03:12 AM |
Scripture doesn't recognize CIGAR string. | thejustpark | RNA Sequencing | 0 | 08-27-2012 01:44 PM |
CIGAR string from BWA-SW output incorrect ? | robs | Bioinformatics | 13 | 01-13-2012 05:07 AM |
generate CIGAR string from 2 sequences? | bbimber | Bioinformatics | 0 | 03-20-2010 10:44 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Junior Member
Location: Massachusetts Join Date: Jun 2016
Posts: 3
|
![]()
Hello,
I want to replace the CIGAR strings of SAM/BAM file without changing anything else so that the soft-clipped bases are shown as matches/mismatches. e.g. 11S60M becomes 71M 15M1D40M15S becomes 15M1D55M How can I do this? I have a genome with N's between the contigs. I want to fill these regions using overhangs. The length of N's may be incorrect. I aligned my reads with BWA-mem, and will get the consensus seq using samtools mpileup. I thought this would be the best approach. Thank you |
![]() |
![]() |
![]() |
#2 |
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,543
|
![]()
This is an interesting idea, I've wondered about something similar in the context of how to visualise the soft clipping in an alignment viewer.
I can see how to do this in code. In addition to replacing the S with an M operation, start soft trimming needs you to also adjust the start POS for the alignment - end soft trimming doesn't have that worry. In either case you ought to make sure the extended read does not map off the ends of the reference - some viewers might reject the file or at least ignore that read as invalid. Can you program? I would do this in Python working on the SAM format - E.g. https://github.com/peterjc/picobio/b..._strip_tags.py is a simplistic Python script for SAM file editing. However, there are now so many good SAM/BAM libraries available you can probably do this in your language of choice on either format. |
![]() |
![]() |
![]() |
#3 |
Member
Location: Australia Join Date: Mar 2013
Posts: 26
|
![]()
Since the read bases don't match, aren't you better off trimming the reads? What is your use case for considering mismatching bases aligned?
|
![]() |
![]() |
![]() |
#4 |
Peter (Biopython etc)
Location: Dundee, Scotland, UK Join Date: Jul 2009
Posts: 1,543
|
![]()
I believe dayhan has soft trimmed bases at the edge of an NNNNN region in his reference, e.g.
Code:
ACGTNNNN Ref ACGT---- Part mapped read, say ACGTGGTTwith 4 bases soft clipped, CIGAR 4M4S Code:
ACGTNNNN Ref ACGTGGTT clipped bases mapped as GIGAR 8M |
![]() |
![]() |
![]() |
#5 |
Member
Location: Australia Join Date: Mar 2013
Posts: 26
|
![]()
That does make sense as a simple starting point for filling the Ns. Code-wise, it's pretty simple and would look something like:
Code:
for each read if cigar starts with S replace S with M alignment position = alignment position - length of S if cigar ends with S replace S with M |
![]() |
![]() |
![]() |
#6 |
Member
Location: Australia Join Date: Mar 2013
Posts: 26
|
![]()
Shameless plug: my structural variant caller GRIDSS performs genome-wide breakend assembly which, if you're looking for a turn-key solution to extend into the Ns, would fit that purpose quite nicely. If you ignore the subsequent variant calling steps and only look at the intermediate breakend assembly bam file, you'll have a file that contains at each loci, a contig extending into the Ns assembled from the adjacent soft clipped reads and read pairs.
|
![]() |
![]() |
![]() |
#7 | ||
Junior Member
Location: Massachusetts Join Date: Jun 2016
Posts: 3
|
![]() Quote:
Quote:
It's a really good program btw. |
||
![]() |
![]() |
![]() |
#8 | |
Member
Location: Australia Join Date: Mar 2013
Posts: 26
|
![]() Quote:
In that example can't you just take the assembled soft clipped bases as the consensus? If you take the final 117 bases of the assembly contig 'read'. The no-tools approach would be to right click on the long contig, copy read details to clipboard, paste into your favorite text editor (eg notepad++) and delete everything except the last 117 bases of the read sequence and you have your consensus sequence. For breakend oriented the other direction (eg 100S50M), then you just take the first n bases of the read sequence as your consensus. Thanks - it's good to see the effort put into usability has been well spent. I'm currently working on splitting out the components in the pipeline and make all intermediate files use the SAM tags defined in the specifications. My hope is this will encourage reuse for other purposes (eg such as this assembly-only scenario), and also encourage interoperability with other tool. There's no reason GRIDSS couldn't use Pindel's split-read detection approach, or CLEVER's discordant read pair identification other than the fact that none of these tools were designed to be modular or play nicely in alternate pipelines. Last edited by dcameron; 11-01-2016 at 11:17 PM. |
|
![]() |
![]() |
![]() |
#9 | |
Junior Member
Location: Massachusetts Join Date: Jun 2016
Posts: 3
|
![]() Quote:
Thank you for your responses. |
|
![]() |
![]() |
![]() |
Tags |
bam, cigar, sam, samtools, soft clipping |
Thread Tools | |
|
|