SEQanswers

Old 04-28-2015, 10:12 AM   #1
jwag
Member
 
Location: USA

Join Date: Apr 2013
Posts: 42
Quantifying mitochondrial deletions in a single individual -- is it possible?

Hi all, I've sequenced the mitochondrial genome of an individual using Illumina (100bp PE, insert size 150-180nt). So far, I've determined that its mitochondria carry a heteroplasmic ~230bp deletion (estimated at 30-40%) relative to the reference.

I've yet to find an indel discovery program that also quantifies the relative allele frequency. Standard methods of indel/CNV quantification don't work here, because mitochondrial heteroplasmy doesn't come in the simple 50% steps you'd see for a genomic (diploid) deletion/CNV.

What I've done so far is map reads to the reference while allowing large gaps (thereby splitting reads at the deletion), then calculate the split fraction as a percent of coverage. However, many reads don't split across the deletion and instead have unmapped free ends, because the end of the read isn't long enough for the mapping software to call it as a deletion read. The result is an underestimate of the deletion frequency.

Here's an example of the mapping output that shows reads mapping to the reference, split reads, and reads with unaligned free ends on either side of the deletion (should actually be contributing to the deletion read pool):



I've also considered simply counting the raw reads that correspond to either the deletion or the reference and calculating the fraction that are deletion, e.g.

reference: AAAAABBBBBCCCCC
deletion read: AAAAACCCCC
non-deletion read: AAAAABBBBB

So for example, if I counted 2000 AAAAACCCCC (deletion) reads and 5000 AAAAABBBBB (non-deletion) reads, the overall deletion rate is 2000/7000 = 29%. I'm not sure if this is an oversimplification, but it would eliminate the bias caused by unaligned free ends.
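This junction-counting idea can be sketched in a few lines of Python (the toy sequences are from the example above; `deletion_fraction` is a hypothetical helper for illustration, not an existing tool):

```python
# Count reads containing the deletion junction ("AAAAACCCCC") versus the
# intact reference span ("AAAAABBBBB"); reads matching neither are ignored.

def deletion_fraction(reads, junction, intact):
    """Return the fraction of informative reads that support the deletion."""
    del_count = sum(1 for r in reads if junction in r)
    ref_count = sum(1 for r in reads if intact in r)
    total = del_count + ref_count
    return del_count / total if total else 0.0

# Toy data matching the 2000 vs 5000 example above.
reads = ["AAAAACCCCC"] * 2000 + ["AAAAABBBBB"] * 5000
frac = deletion_fraction(reads, junction="AAAAACCCCC", intact="AAAAABBBBB")
print(f"deletion frequency: {frac:.1%}")  # 2000/7000 ≈ 28.6%
```

On real data the "junction" and "intact" strings would be sequence taken from either side of the deletion breakpoint, short enough to fit inside one read.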

This topic has come up before, but I don't think a satisfactory answer has been found.

So my overall question is: is it possible to accurately quantify deletions larger than your read length/insert size in a population without fixed ploidy?
Old 04-28-2015, 02:31 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Since your reads overlap, I suggest you first merge them with BBMerge to get longer reads (a smaller fraction of each read sits at the tips, which aids indel detection), then map with BBMap, which will easily span those gaps. That takes four steps:

Code:
bbmerge.sh in=read1.fq in2=read2.fq out=merged.fq outu=unmerged.fq vloose

bbmap.sh k=10 ref=mito.fa

bbmap.sh k=10 in=merged.fq maxindel=400 tipsearch=300 slow out=mapped_merged.sam

bbmap.sh k=10 in=unmerged.fq maxindel=400 tipsearch=300 slow out=mapped_unmerged.sam
The "tipsearch" flag sets the max distance used for brute-force searching when reads align with lots of mismatches at the tip. "k=10" uses shorter-than-default kmers, which also increases the rate at which very short overlaps are correctly aligned. This will not completely solve your problem, but it should greatly reduce it.

Old 04-29-2015, 10:59 AM   #3
jwag

Hi Brian,

Thanks for the reply. I used BBMap according to your suggestions. BBMerge merged ~95% of my reads, and I then mapped them to my reference using k=10 (props for making the program very intuitive and easy to run). It definitely decreased the number of unmapped free ends around the deletion.

However, the program I've been using to call the % haplotype frequency (Geneious) now calls the deletion as multiple smaller deletions rather than one single large one. This is a known issue with the program and I don't think it has been solved yet. Do you happen to know of any indel calling method that would report a deletion rate from the BBMap alignment file?

Cheers,
Jo
Old 04-29-2015, 06:12 PM   #4
Brian Bushnell

Hi Jo,

Sorry, I don't really do variant calling anymore, so I have no specific suggestions. You might try mpileup or GATK with indel realignment disabled.

To calculate it semi-manually, you could compute coverage in that region with and without deletion-containing reads. BBMap's "pileup.sh" script generates per-base coverage information, and it counts bases under deletion events as covered. So run it twice -- once on a sam file with all mapped reads, and once on a sam file containing only deletion-free reads (run BBMap with the "delfilter=0" flag to ban alignments containing any deletions) -- and the before-vs-after coverage ratio in that region should be a good approximation of the haplotype frequency.

Actually, you can directly output coverage from BBMap using the "basecov" flag instead of generating a sam file.
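The coverage-ratio arithmetic can be sketched as follows. This is a rough illustration: the three-column basecov layout (ref, position, coverage, tab-separated, with a '#' header line) is my assumption about the pileup.sh/basecov output format, so verify it against your own files, and the 1000x/620x numbers are made up.

```python
# Estimate deletion frequency as the per-base coverage drop between an
# all-reads alignment and a deletion-free alignment (BBMap delfilter=0).
import io

def read_basecov(handle):
    """Parse assumed basecov format: ref<TAB>pos<TAB>coverage, '#' header."""
    cov = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        ref, pos, depth = line.rstrip("\n").split("\t")
        cov[(ref, int(pos))] = int(depth)
    return cov

def deletion_freq(all_cov, nodel_cov, ref, start, end):
    """Mean fractional coverage drop over [start, end)."""
    drops = []
    for pos in range(start, end):
        total = all_cov.get((ref, pos), 0)
        if total:
            drops.append(1 - nodel_cov.get((ref, pos), 0) / total)
    return sum(drops) / len(drops) if drops else 0.0

# Toy files: 1000x total coverage, 620x after excluding deletion-spanning reads.
all_txt = "#RefName\tPos\tCoverage\n" + "".join(
    f"chrM\t{p}\t1000\n" for p in range(100, 110))
nodel_txt = "#RefName\tPos\tCoverage\n" + "".join(
    f"chrM\t{p}\t620\n" for p in range(100, 110))

all_cov = read_basecov(io.StringIO(all_txt))
nodel_cov = read_basecov(io.StringIO(nodel_txt))
print(f"{deletion_freq(all_cov, nodel_cov, 'chrM', 100, 110):.0%}")  # 38%
```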
Old 05-03-2015, 02:18 PM   #5
Len Trigg
Registered Vendor
 
Location: New Zealand

Join Date: Jun 2011
Posts: 29

Hi Jo,

You can try the caller from RTG Core, which calls the haplotypes over the region as a whole and includes an allelic depth (AD) attribute in the VCF with exactly the counts you want. Regarding the unaligned free ends: the complex haplotype caller internally realigns each read against the candidate haplotypes, so reads that align best against the deletion haplotype will contribute to the AD count for the deletion allele. (Our caller is primarily set up for haploid and diploid calling rather than heteroplasmy, but if you tell it the MT is diploid it'll probably do OK.)

Cheers,
Len.
__________________
Len Trigg, Ph.D.
Real Time Genomics
www.realtimegenomics.com
Old 05-04-2015, 10:57 AM   #6
jwag

Hi all,

Here's an update. I used the pipeline Brian suggested to allow for large gaps in BBMap (k=10 maxindel=400 tipsearch=300 slow), then tried various callers to see what they would do with the ~220bp deletion:

CLC Genomics Workbench: called it as a deletion at 19% frequency.
VarScan indel: called it at 23%, but missed a smaller, obvious indel downstream.
SNVer: called it as a large substitution at 34% frequency.
Geneious: had a difficult time with the large deletion and instead called it as multiple smaller deletions. I believe this is a known problem with Geneious variant calling.

Then I used Brian's second suggestion, pileup.sh from BBMap. I compared per-site coverage in the region between the full alignment (k=10 maxindel=400 tipsearch=300 slow) and a deletion-free alignment (the same settings plus delfilter=0). Coverage dropped by about 38%, indicating a deletion frequency of approximately 38%.

So that's a spread of roughly 20 percentage points across calling methods. I'll give Len's suggestion a try next. Thanks all.

Best,
Josiah

Old 05-04-2015, 09:22 PM   #7
Matt Kearse
Member
 
Location: New Zealand

Join Date: Mar 2014
Posts: 20

Hi Josiah,

Prior to variant calling, I recommend de novo assembling all the reads that map, then mapping those contigs (along with any unassembled reads) to the reference sequence. That way, the reads in the contigs should align correctly around large indels, which will give you a better estimate of the deletion frequency.

As for the issue where Geneious sometimes splits the deletion into multiple smaller deletions: I've fixed this for the next Geneious release. If you're interested in seeing the results sooner, you could share your data with me and I'll run it through Geneious and send you back the results. But even with the split deletion, the deletion frequency you get should still be accurate.
Old 05-06-2015, 11:08 AM   #8
jwag

Hi Matt,

I will give that a try. I'm currently using Geneious for SNP calling, and it seems to be working pretty well. Do you know whether people typically remove duplicate reads from an alignment prior to SNP calling with the Geneious caller? Thanks.
Old 05-06-2015, 11:03 PM   #9
Matt Kearse

Geneious only provides the ability to remove duplicates prior to alignment. It's probably a good idea to do so, but I don't know whether or not people typically do that.

I can see from your screenshot that you're using quite an old version of Geneious. Geneious variant calling has had a few improvements since then, so you may want to upgrade.
Old 05-13-2015, 10:18 AM   #10
jwag

Hi again, I wanted to post an update for anyone still following the thread. I've been playing with BBMap, which appears to handle gapped alignments very well. I looked into the RTG Core caller but haven't used it yet because, according to the manual, it doesn't call indels larger than 50bp (please correct me if I'm wrong, Len). Here's the pipeline that, in my case, I think gives a good representation of both the indels and SNPs in a single alignment:

1. Map merged reads (>150bp after merging) to the mtgenome with BBMap using k=10, maxindel=400, tipsearch=300, ambiguous=toss, minratio=0.7. I'm still debating whether to keep or toss ambiguously mapping reads; since there are some highly repetitive regions, I thought it best to toss them.

2. Remove duplicate mapped reads with Picard. This seems to be common practice (it's done, for example, in the GATK variant calling pipeline), although people seem to get a bit fuzzy on whether this should be done when there is high coverage saturation (as is the case with my extremely deep sequencing of a single mtgenome).

3. Trim mapped reads 10bp on each end in the deduped bam file. I found an older thread that Brian posted in suggesting this can improve deletion calling (not sure if this is still your opinion, Brian). It seemed to help mask the few remaining unmapped free ends around deletion sites. The other option would be indel realignment using something like the GATK pipeline, but I think trimming works fine too.

4. SNP/variant call with Geneious and SNVer. Geneious actually did pretty well with calling the deletions in the BBmap gapped alignment -- hopefully I can find some funds for the new update, because even my older version seems to work better than most of the callers I've tried.

If anyone sees a problem with this pipeline please let me know. Cheers!
Old 05-13-2015, 10:50 AM   #11
Brian Bushnell

Quote:
Originally Posted by jwag View Post
2. Remove duplicate mapped reads with Picard. This seems to be common practice (it's done, for example, in the GATK variant calling pipeline), although people seem to get a bit fuzzy on whether this should be done when there is high coverage saturation (as is the case with my extremely deep sequencing of a single mtgenome).
Only remove duplicates if the library is PCR-amplified. And yes, with sufficiently high coverage, it will still cause biases. When you're doing something quantitative - like RNA-seq expression, or determining the exact ratio of allelic frequencies - it's best to use unamplified data, but with amplified data, don't remove duplicates.

Quote:
3. Trim mapped reads 10bp on each end in the deduped bam file. I found an older thread that Brian posted in suggesting this can improve deletion calling (not sure if this is still your opinion, Brian). It seemed to help mask the few remaining unmapped free ends around deletion sites. The other option would be indel realignment using something like the GATK pipeline, but I think trimming works fine too.
Yes, I still recommend trimming the ends of reads after alignment but prior to variant calling. Bear in mind that it's not straightforward: you need to trim the CIGAR string appropriately too.
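As an illustration of what trimming the CIGAR entails, here's a minimal sketch (the `soft_clip_start` helper is hypothetical, not part of BBTools or any existing library) that converts the first N query bases of an alignment to soft-clips and reports how far the mapping position must shift:

```python
# CIGAR ops as (op, length) tuples. Trimming the read start means turning
# the first n query bases into a soft-clip (S), dropping any deletion (D)
# left stranded at the new alignment edge, and shifting POS right by the
# number of reference bases consumed. Trimming the read end is symmetric.

QUERY_OPS = set("MIS=X")   # ops that consume query bases
REF_OPS = set("MDN=X")     # ops that consume reference bases

def soft_clip_start(cigar, n):
    """Return (new_cigar, ref_shift) with the first n query bases soft-clipped."""
    out, ref_shift, remaining = [], 0, n
    for op, length in cigar:
        if remaining and op in QUERY_OPS:
            take = min(length, remaining)
            remaining -= take
            if op in REF_OPS:
                ref_shift += take
            length -= take
            if length == 0:
                continue
        elif remaining and op in REF_OPS:
            # deletion at the trim boundary: drop it, advance reference
            ref_shift += length
            continue
        out.append((op, length))
    return [("S", n)] + out, ref_shift

# A plain 100M read, trimming 10 leading bases: POS shifts by 10.
new_cigar, shift = soft_clip_start([("M", 100)], 10)
print(new_cigar, shift)  # [('S', 10), ('M', 90)] 10
```

Note that if the trim boundary lands next to a deletion, e.g. 5M200D95M trimmed by 10, the deletion is discarded entirely and POS shifts by 210, which is exactly the subtlety that makes naive N-masking of bases insufficient.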
Old 05-13-2015, 11:02 AM   #12
jwag

Thanks for the reply, Brian. The DNA was prepared using a Nextera kit, which I believe involves a few cycles of PCR. So just to double-check: I shouldn't remove duplicates since I'm trying to quantify allelic frequencies?

For the post-alignment trim, I set the first and last 10 bases of each aligned read to N's -- I will have to look into how this changes the CIGAR string.
Old 05-13-2015, 11:11 AM   #13
Brian Bushnell

Amplification will skew quantitative things whether or not you remove duplicates, but duplicate removal can potentially skew them even more if you have high coverage, so I'd skip it in this case.

I should really modify BBDuk to be able to trim mapped reads. Maybe I'll do that today.
Old 05-13-2015, 11:24 AM   #14
jwag

Ah, that makes sense -- I'll keep the duplicates in this case.

If BBDuk could trim mapped reads and also update the CIGAR string at the same time, that would be super useful. I've yet to find anything that can do that.

Old 05-13-2015, 01:52 PM   #15
Len Trigg

Jo,

The 1-50bp note refers to the default RTG mapping settings (you'd have to adjust the aligner settings to look for longer indels, at some speed penalty); the variant caller will happily process anything you throw at it. The mappings in your original screenshot looked fine: there were reads spanning the deletion plus some hanging into it, all of which gets correctly taken into account by the caller during its local realignment against the indel hypotheses. So I would say do *not* trim the ends of the reads after alignment.
Old 05-13-2015, 02:13 PM   #16
jwag

Ok great, I'll give the RTG caller a go.
Old 05-14-2015, 12:56 PM   #17
jwag

Hi Len, I ran the RTG variant caller on the bam file produced by BBMap (without removing duplicate reads and without trimming ends after mapping) and it seemed to work well. To calculate the relative frequency of each haplotype from the VCF file, I took the allelic depths of the ref and alternate alleles (AD) and divided by the total read depth at the variant location (DP). Using this, I estimated the rate of the large (~200bp) deletion at 25%, which is close to the numbers put out by the other callers (19-34%).
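The AD/DP arithmetic looks like this as a quick sketch (the VCF record below is made up for illustration with depths chosen to give ~25%; real RTG output may have a different FORMAT field order and extra annotations):

```python
# Compute per-allele fractions from the AD and DP entries of one VCF
# sample column, given the matching FORMAT string.

def allele_fractions(format_keys, sample_field):
    """Return [ref_fraction, alt1_fraction, ...] from AD and DP."""
    fmt = dict(zip(format_keys.split(":"), sample_field.split(":")))
    depths = [int(x) for x in fmt["AD"].split(",")]
    dp = int(fmt["DP"])
    return [d / dp for d in depths]

# Illustrative record: REF depth 5200, deletion-allele depth 1730, DP 6930.
fracs = allele_fractions("GT:DP:AD", "0/1:6930:5200,1730")
print(f"deletion allele: {fracs[1]:.0%}")  # ~25%
```

In real VCFs, DP can exceed the sum of AD (uninformative reads are counted in DP but not AD), so dividing AD by the sum of the AD values instead is arguably the cleaner ratio for haplotype frequency.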