
**Brian Bushnell** · 04-28-2015, 01:31 PM

Since your reads overlap, I suggest you merge them first with BBMerge to get longer reads (a lower fraction of each read will be at the tips, which aids detection of indels), then map with BBMap, which will easily span those gaps. That will take 4 steps:

Code:

```shell
bbmerge.sh in=read1.fq in2=read2.fq out=merged.fq outu=unmerged.fq vloose
bbmap.sh k=10 ref=mito.fa
bbmap.sh k=10 in=merged.fq maxindel=400 tipsearch=300 slow out=mapped_merged.sam
bbmap.sh k=10 in=unmerged.fq maxindel=400 tipsearch=300 slow out=mapped_unmerged.sam
```

The "tipsearch" flag sets the maximum distance that will be used for brute-force search when reads align with lots of mismatches at the tip. "k=10" uses shorter-than-default kmers, which also increases the rate at which very short overlaps are correctly aligned. This will not completely solve your problem, but it should greatly reduce it.

**jwag** · 04-29-2015, 09:59 AM

Hi Brian,

Thanks for the reply. I used BBMap according to your suggestions. BBMerge merged ~95% of my reads, then I mapped them to my reference using k=10 (props for making the program very intuitive and easy to run). It definitely decreased the number of unmapped free ends around the deletion.

However, the program I've been using to call the % haplotype frequency (Geneious) now calls the deletion as being multiple smaller deletions, rather than one single large one. This is a known issue with the program and I don't think it has been solved yet. Do you happen to be familiar with any indel calling method that would give me a rate of deletion using the alignment file derived from BBMap?

Cheers,
Jo

**Brian Bushnell** · 04-29-2015, 05:12 PM

Hi Jo,

Sorry, I don't really do variant-calling anymore so I have no specific suggestions. You might try mpileup or GATK with indel-realignment disabled.

To calculate it semi-manually, you could calculate coverage in that region with and without reads containing deletions. BBMap's "pileup.sh" script will generate per-base coverage information, and it considers bases under deletion events to be covered. So if you run it twice, once on a sam file with all mapped reads and once on a sam file containing only deletion-free reads (you can run BBMap with the "delfilter=0" flag to ban alignments containing any deletions), then look at the coverage ratio before vs after in that region, that should be a good approximation of the haplotype frequency.

Actually, you can directly output coverage from BBMap using the "basecov" flag instead of generating a sam file.
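
A minimal sketch of the coverage-ratio arithmetic described above, in Python. It assumes each per-base coverage file has tab-separated lines of reference name, position, and depth, with `#` header lines; that layout is an assumption, so check it against your own files. The function names and the toy numbers are hypothetical and only illustrate the calculation.

```python
import io

def read_basecov(handle):
    """Parse per-base coverage lines of the form: ref<TAB>pos<TAB>depth."""
    cov = {}
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        ref, pos, depth = line.rstrip("\n").split("\t")[:3]
        cov[(ref, int(pos))] = int(depth)
    return cov

def deletion_frequency(cov_all, cov_nodel, ref, start, end):
    """Estimate deletion frequency over [start, end) as
    1 - (depth from deletion-free reads / depth from all reads)."""
    total = sum(cov_all.get((ref, p), 0) for p in range(start, end))
    kept = sum(cov_nodel.get((ref, p), 0) for p in range(start, end))
    if total == 0:
        raise ValueError("no coverage in region")
    return 1.0 - kept / total

# Toy data (made up): depth 100 with all reads, 62 with deletion-free reads only.
header = "#RefName\tPos\tCoverage\n"
all_reads = io.StringIO(header + "".join(f"mito\t{p}\t100\n" for p in range(5)))
no_del = io.StringIO(header + "".join(f"mito\t{p}\t62\n" for p in range(5)))

cov_all = read_basecov(all_reads)
cov_nodel = read_basecov(no_del)
freq = deletion_frequency(cov_all, cov_nodel, "mito", 0, 5)
print(f"estimated deletion frequency: {freq:.2f}")
```

Summing depth over the whole region before dividing, rather than averaging per-site ratios, keeps low-coverage positions from dominating the estimate. Note that a deletion-free filter also drops reads with unrelated deletions elsewhere in the read, so the result is an approximation.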

**Len Trigg** · 05-03-2015, 01:18 PM

Hi Jo,

You can try the caller from RTG Core, which will call the haplotypes over the region as a whole, and includes an allelic depth attribute in the VCF which should include exactly the counts that you want. Specifically referring to the unaligned free ends, the complex haplotype caller internally performs a realignment of each of the reads against the candidate haplotypes. Since those reads would presumably align best against the deletion haplotype, they will contribute to the AD count corresponding to the deletion allele. (Our caller is primarily set up for haploid and diploid calling rather than heteroplasmy, but if you tell it the MT is diploid it'll probably do OK.)

Cheers,
Len.

**jwag** · 05-04-2015, 09:57 AM

Hi all,

Here's an update. I used the pipeline suggested by Brian to allow for large gaps in bbmap (k=10 maxindel=400 tipsearch=300 slow). I then tried various callers to see what they would do with the ~220bp deletion.

CLC Genomics Workbench: Called it as a deletion at 19% frequency
VarScan indel: Called it at 23%, but missed a smaller obvious indel downstream.
SNVer: Called it as a large substitution at 34% frequency
Geneious: had a difficult time with the large deletion and instead called it as multiple smaller deletions. I believe this is a known problem with Geneious variant calling.

Then I used Brian's second suggestion of the pileup.sh of bbmap. I compared the region per-site using "k=10 maxindel=400 tipsearch=300 slow" in one instance and then "delfilter=0" in the second. This gave a reduction in coverage of about 38%, indicating a deletion frequency of approximately 38%.

So that's a spread of about 20 percentage points (19% to 38%) between calling methods. I'll give Len's suggestion a try next. Thanks all.

Best,
Josiah

**Matt Kearse** · 05-04-2015, 08:22 PM

Hi Josiah,

Prior to variant calling, I recommend you try de novo assembling all the reads that map, then map those contigs (along with any unassembled reads) to the reference sequence. By doing that, the reads in the contigs should be correctly aligned around large indels, which will give you a better estimate of the deletion frequency.

For the issue in Geneious where it sometimes splits the deletion into multiple smaller deletions, I've fixed this for the next Geneious release. If you're interested in seeing the results sooner, you could share your data with me and I'll run it through Geneious and send you back the results. But even with the split deletion you should still get a deletion frequency which should be accurate.

**jwag** · 05-06-2015, 10:08 AM

Hi Matt,

I will give that a try. I'm currently using Geneious for SNP calling -- it seems to be working pretty well. Do you know if people typically remove duplicate reads in an alignment prior to SNP calling with the Geneious caller? Thanks.

**Matt Kearse** · 05-06-2015, 10:03 PM

Geneious only provides the ability to remove duplicates prior to alignment. It's probably a good idea to do so, but I don't know whether or not people typically do that.

I can see from your screenshot that you're using quite an old version of Geneious. Geneious variant calling has had a few improvements since then, so you may want to upgrade.

**jwag** · 05-13-2015, 09:18 AM

Hi again, I wanted to update progress for anyone still following the thread. I've been playing with BBMap, which appears to be doing very well with gapped alignments. I looked into the caller from RTG Core but haven't used it yet, because according to the manual it does not call indels larger than 50bp (please correct me if I'm wrong, Len). Here's the pipeline that I think gives a good representation of both the indels and SNPs in a single alignment, in my case:

1. Map merged reads (>150bp after merging) to mtgenome with bbmap using k=10, maxindel=400, tipsearch=300, ambiguous=toss, minratio=0.7. I'm still debating if I should keep or toss ambiguously mapping reads... I thought since there are some highly repetitive regions, it might be best to toss them.

2. Remove duplicate mapping reads with Picard. This seems to be common practice (is done for example in the GATK variant calling pipeline), although people seem to get a bit fuzzy on whether this should be done or not when there is a high coverage saturation (as is the case with my extremely deep sequencing of a single mtgenome).

3. Trim mapped reads 10bp on each end in the deduped bam file. I found an older thread that Brian posted in suggesting this can improve deletion calling (not sure if this is still your opinion, Brian). It seemed to help mask the few remaining unmapped free ends around deletion sites. The other option would be indel realignment using something like the GATK pipeline, but I think trimming works fine too.

4. SNP/variant call with Geneious and SNVer. Geneious actually did pretty well with calling the deletions in the BBMap gapped alignment -- hopefully I can find some funds for the new update, because even my older version seems to work better than most of the callers I've tried.

If anyone sees a problem with this pipeline please let me know. Cheers!

**Brian Bushnell** · 05-13-2015, 09:50 AM

Originally posted by jwag:

> 2. Remove duplicate mapping reads with Picard. This seems to be common practice (is done for example in the GATK variant calling pipeline), although people seem to get a bit fuzzy on whether this should be done or not when there is a high coverage saturation (as is the case with my extremely deep sequencing of a single mtgenome).

Only remove duplicates if the library is PCR-amplified. And yes, with sufficiently high coverage, duplicate removal will still cause biases. When you're doing something quantitative - like RNA-seq expression, or determining the exact ratio of allelic frequencies - it's best to use unamplified data; if you only have amplified data for that kind of analysis, don't remove duplicates.

> 3. Trim mapped reads 10bp on each end in the deduped bam file. I found an older thread that Brian posted in suggesting this can improve deletion calling (not sure if this is still your opinion, Brian). It seemed to help mask the few remaining unmapped free ends around deletion sites. The other option would be indel realignment using something like the GATK pipeline, but I think trimming works fine too.

Yes, I still recommend trimming the ends of reads after alignment but prior to variant calling. Bear in mind that it's not straightforward, as you need to appropriately trim the cigar string too.
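
To illustrate why the cigar string matters, here is a minimal Python sketch that soft-clips `n` read bases from each end of an alignment and shifts POS to match. It is an assumption-laden illustration, not how BBDuk or any other tool does it: the function names are hypothetical, existing soft clips count toward `n`, a deletion or insertion left at the new alignment edge is dropped or clipped, and tags such as NM/MD are not updated.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
ALIGNED = set("M=X")  # operations that consume both read and reference bases

def parse_cigar(cigar):
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]

def _clip_left(ops, n):
    """Soft-clip at least n read bases from the left; return (ops, ref_shift)."""
    ops = list(ops)
    clipped = 0    # read bases turned into soft clip
    ref_shift = 0  # reference bases the alignment start moves by
    while ops:
        length, op = ops[0]
        if op in ALIGNED:
            if clipped >= n:
                break
            take = min(length, n - clipped)
            clipped += take
            ref_shift += take
            if take < length:
                ops[0] = (length - take, op)
                break
            ops.pop(0)
        elif op in "SI":   # read-only bases at the edge become soft clip
            clipped += length
            ops.pop(0)
        elif op in "DN":   # reference-only gap at the edge is dropped
            ref_shift += length
            ops.pop(0)
        else:              # padding
            ops.pop(0)
    if clipped:
        ops.insert(0, (clipped, "S"))
    return ops, ref_shift

def soft_clip_ends(pos, cigar, n):
    """Return (new_pos, new_cigar), or None if no aligned bases remain."""
    ops = parse_cigar(cigar)
    lead_h = [ops.pop(0)] if ops and ops[0][1] == "H" else []
    tail_h = [ops.pop()] if ops and ops[-1][1] == "H" else []
    ops, ref_shift = _clip_left(ops, n)
    ops, _ = _clip_left(ops[::-1], n)  # right end: same logic on reversed ops
    ops = ops[::-1]
    if not any(op in ALIGNED for _, op in ops):
        return None
    new_cigar = "".join(f"{length}{op}" for length, op in lead_h + ops + tail_h)
    return pos + ref_shift, new_cigar

print(soft_clip_ends(100, "50M", 10))
print(soft_clip_ends(100, "8M220D42M", 10))
```

In the second call the read has only 8 aligned bases before the deletion, so clipping 10 bases removes the deletion from the alignment entirely. Reads with short anchors on one side of a deletion stop supporting it after trimming, which is worth keeping in mind when estimating frequencies.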

**jwag** · 05-13-2015, 10:02 AM

Thanks for the reply Brian. The DNA was prepared using a Nextera kit, which I believe involves a few cycles of PCR. So just to double check, I shouldn't remove duplicates since I'm trying to quantify allelic frequencies?

For the post-alignment trim, I had the first 10 and last 10 bases in the aligned reads set to be N's -- I will have to look into how this would change the cigar string.
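
On the cigar question for N-masking: replacing bases with N leaves the read length unchanged, so a cigar written with M operations stays valid, since M covers both matches and mismatches. What can go stale is anything that records mismatches, such as NM/MD tags, or =/X operators if the aligner writes them. A minimal Python sketch of the masking step, with a hypothetical helper name:

```python
def mask_ends(seq, n):
    """Replace the first and last n bases of seq with N, keeping its length."""
    if n <= 0:
        return seq
    if len(seq) <= 2 * n:
        return "N" * len(seq)  # read too short: everything is masked
    return "N" * n + seq[n:-n] + "N" * n

read = "ACGTACGTACGT"
masked = mask_ends(read, 3)
print(masked, len(masked) == len(read))
```

Whether masked positions are then ignored depends on how the variant caller treats N bases, so this is worth checking for each caller.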

**Brian Bushnell** · 05-13-2015, 10:11 AM

Originally posted by jwag:

> Thanks for the reply Brian. The DNA was prepared using a Nextera kit, which I believe involves a few cycles of PCR. So just to double check, I shouldn't remove duplicates since I'm trying to quantify allelic frequencies?
>
> For the post-alignment trim, I had the first 10 and last 10 bases in the aligned reads set to be N's -- I will have to look into how this would change the cigar string.

Amplification will skew quantitative things whether or not you remove duplicates, but duplicate removal can potentially skew them even more if you have high coverage, so I'd skip it in this case.

I should really modify BBDuk to be able to trim mapped reads. Maybe I'll do that today.

**jwag** · 05-13-2015, 10:24 AM

Originally posted by Brian Bushnell:

> Amplification will skew quantitative things whether or not you remove duplicates, but duplicate removal can potentially skew them even more if you have high coverage, so I'd skip it in this case.
>
> I should really modify BBDuk to be able to trim mapped reads. Maybe I'll do that today.

Ah, that makes sense -- I will keep the duplicates in this case.

If BBDuk could trim mapped reads and also update the CIGAR string at the same time, that would be super useful. I've yet to find anything that can do that.

**Len Trigg** · 05-13-2015, 12:52 PM

Jo,

The 1-50bp note is more about the default RTG mapping settings (you'd have to start adjusting the aligner settings to look for longer indels, at somewhat of a speed penalty). The variant caller will happily process anything you throw at it. The mappings in your original screenshot looked fine: there were reads spanning the deletion, plus some hanging into it, all of which the caller correctly takes into account during its local realignment to the indel hypotheses. So I would say do *not* trim the ends of the reads after alignment.


Quantifying mitochondrial deletions in a single individual -- is it possible?
