Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Quantifying mitochondrial deletions in a single individual -- is it possible?

    Hi all, I've sequenced the mitochondria of an individual using Illumina (100bp PE, insertion size is 150-180nt). So far, I've been able to determine that its mitochondria has a heteroplasmy (estimate is 30-40%) of a ~230bp deletion relative to its reference.

    I've yet to find a indel discovery program that also quantifies the relative allele frequency. Normal methods of indel/CNV quantification do not work because mitochondrial indels/CNV heteroplasmy would not occur as simple 50% increases/decreases, as would be seen in a genomic deletion/CNV.

    What I've done so far is mapped reads to the reference and allowed for large gaps (thereby splitting reads at the deletion) and calculated the fraction that split as a percent of coverage. However, there is an issue where many reads do not split across the deletion and instead have unmapped free ends, as the end of the read isn't long enough for the mapping software to call it as a deletion read. Therefore, there is an underestimate of the deletion frequency.

    Here's an example of the mapping output that shows reads mapping to the reference, split reads, and reads with unaligned free ends on either side of the deletion (should actually be contributing to the deletion read pool):



    I've also considered simply counting the raw reads that correspond to either the deletion or the reference and calculating the fraction that are deletion, e.g.

    reference: AAAAABBBBBCCCCC
    deletion read: AAAAACCCCC
    non-deletion read: AAAAABBBBB

    So for example, if I counted 2000 of AAAAACCCCC (deletion read) and 5000 of AAAAABBBBB (non-deletion read), the overall rate of the deletion is 2000/7000=29%. I'm not sure if this is an oversimplification, but it would take out the possibility of the bias caused by unaligned free ends.

    This topic has been visited before but I don't think a satisfactory answer has been found.

    So my overall question is, is it possible to accurately quantify deletions greater than your read length/insertion size in a non-ploidy population?

  • #2
    Since your reads overlap, I suggest you merge them first with BBMerge to get longer reads (lower fraction of read will be at tips, which aids detection of indels), then map with BBMap, which will easily span those gaps. That will take 4 steps:

    Code:
    bbmerge.sh in=read1.fq in2=read2.fq out=merged.fq outu=unmerged.fq vloose
    
    bbmap.sh k=10 ref=mito.fa
    
    bbmap.sh k=10 in=merged.fq maxindel=400 tipsearch=300 slow out=mapped_merged.sam
    
    bbmap.sh k=10 in=unmerged.fq maxindel=400 tipsearch=300 slow out=mapped_unmerged.sam
    The "tipsearch" flag indicates the max distance that will be used for brute-force search when reads align with lots of mismatches at the tip. "k=10" will use shorter than default kmers, also increasing the rate at which very short overlaps are correctly aligned. This will not completely solve your problem, but it should greatly reduce it.
    Last edited by Brian Bushnell; 04-28-2015, 01:34 PM.

    Comment


    • #3
      Hi Brian,

      Thanks for the reply. I used BBMap according to your suggestions. The BBmerge merged ~95% of my reads, then I mapped it to my reference using k=10 (props for making the program very intuitive and easy to run). It definitely decreased the number of unmapped free ends around the deletion.

      However, the program I've been using to call the % haplotype frequency (Geneious) now calls the deletion as being multiple smaller deletions, rather than one single large one. This is a known issue with the program and I don't think it has been solved yet. Do you happen to be familiar with any indel calling method that would give me a rate of deletion using the alignment file derived from BBMap?

      Cheers,
      Jo

      Comment


      • #4
        Hi Jo,

        Sorry, I don't really do variant-calling anymore so I have no specific suggestions. You might try mpileup or GATK with indel-realignnment disabled.

        To calculate it semi-manually, you could calculate coverage in that region with and without reads containing deletions. BBMap's "pileup.sh" script will generate per-base coverage information, and it considers bases under deletion events to be covered. So if you run it twice, once on a sam file with all mapped reads and once on a sam file containing only deletion-free reads (you can run BBMap with the "delfilter=0" flag to ban alignments containing any deletions), then look at the coverage ratio before vs after in that region, that should be a good approximation of the haplotype frequency.

        Actually, you can directly output coverage from BBMap using the "basecov" flag instead of generating a sam file.

        Comment


        • #5
          Hi Jo,

          You can try the caller from RTG Core, which will call the haplotypes over the region as a whole, and includes an allelic depth attribute in the VCF which should include exactly the counts that you want. Specifically referring to the unaligned free ends, the complex haplotype caller internally performs a realignment of each of the reads against the candidate haplotypes, since those reads would presumably align best against the deletion haplotype, they will contribute to the AD count corresponding to the deletion allele. (Our caller is primarily set up for haploid and diploid calling rather than heteroplasmy, but if you tell it the MT is diploid it'll probably do OK.)

          Cheers,
          Len.
          Len Trigg, Ph.D.
          Real Time Genomics
          www.realtimegenomics.com

          Comment


          • #6
            Hi all,

            Here's an update. I used the pipeline suggested by Brian to allow for large gaps in bbmap (k=10 maxindel=400 tipsearch=300 slow). I then tried various callers to see what they would do with the ~220bp deletion.

            CLC Genomics Workbench: Called it as a deletion at 19% frequency
            VarScan indel: Called it at 23%, but missed a smaller obvious indel downstream.
            SNVer: Called it as a large substitution at 34% frequency
            Geneious: had a difficult time with the large deletion and instead called it as multiple smaller deletions. I believe this is a known problem with Geneious variant calling.

            Then I used Brian's second suggestion of the pileup.sh of bbmap. I compared the region per-site using "k=10 maxindel=400 tipsearch=300 slow" in one instance and then "delfilter=0" in the second. This gave a reduction in coverage of about 38%, indicating a deletion frequency of approximately 38%.

            So that's about a 20% range difference between calling methods. I'll give Len's suggestion a try next. Thanks all.

            Best,
            Josiah
            Last edited by jwag; 05-06-2015, 10:08 AM.

            Comment


            • #7
              Hi Josiah,

              Prior to variant calling, I recommend you try de novo assembling all the reads that map, then map those contigs (along with any unassembled reads) to the reference sequence. By doing that, the reads in the contigs should be correctly aligned around large indels, which will give you a better estimate of the deletion frequency.

              For the issue in Geneious where it sometimes splits the deletion into multiple smaller deletions, I've fixed this for the next Geneious release. If you're interested in seeing the results sooner, you could share your data with me and I'll run it through Geneious and send you back the results. But even with the split deletion you should still get a deletion frequency which should be accurate.

              Comment


              • #8
                Hi Matt,

                I will give that a try . I'm currently using Geneious for SNP calling -- it seems to be working pretty well. Do you know if people typically remove duplicate reads in an alignment prior to SNP calling with the Geneious caller? Thanks.

                Comment


                • #9
                  Geneious only provides the ability to remove duplicates prior to alignment. It's probably a good idea to do so, but I don't know whether or not people typically do that.

                  I can see from your screenshot that you're using quite an old version of Geneious. Geneious variant calling has had a few improvements since then, so you may want to upgrade.

                  Comment


                  • #10
                    Hi again, I wanted to update progress for anyone still following the thread. I've been playing with BBmap, which appears to be doing very well with gapped alignments. I looked into the caller by RTG core but haven't used it yet, because according to the manual it does not call indels larger than 50bp (please correct me if I'm wrong, Len). Here's the pipeline that I think gives a good representation of both the indels and SNPs in a single alignment, in my case:

                    1. Map merged reads (>150bp after merging) to mtgenome with bbmap using k=10, maxindel=400, tipsearch=300, ambiguous=toss, minratio=0.7. I'm still debating if I should keep or toss ambiguously mapping reads... I thought since there are some highly repetitive regions, it might be best to toss them.

                    2. Remove duplicate mapping reads with Picard. This seems to be common practice (is done for example in the GATK variant calling pipeline), although people seem to get a bit fuzzy on whether this should be done or not when there is a high coverage saturation (as is the case my extremely deep sequencing of a single mtgenome).

                    3. Trim mapped reads 10bp on each end in the deduped bam file. I found an older thread that Brian posted in suggesting this can improve deletion calling (not sure if this is still your opinion, Brian). It seemed to help mask the few remaining unmapped free ends around deletion sites. The other option would be indel realignment using something like the GATK pipeline, but I think trimming works fine too.

                    4. SNP/variant call with Geneious and SNVer. Geneious actually did pretty well with calling the deletions in the BBmap gapped alignment -- hopefully I can find some funds for the new update, because even my older version seems to work better than most of the callers I've tried.

                    If anyone sees a problem with this pipeline please let me know. Cheers!

                    Comment


                    • #11
                      Originally posted by jwag View Post
                      2. Remove duplicate mapping reads with Picard. This seems to be common practice (is done for example in the GATK variant calling pipeline), although people seem to get a bit fuzzy on whether this should be done or not when there is a high coverage saturation (as is the case my extremely deep sequencing of a single mtgenome).
                      Only remove duplicates if the library is PCR-amplified. And yes, with sufficiently high coverage, it will still cause biases. When you're doing something quantitative - like RNA-seq expression, or determining the exact ratio of allelic frequencies - it's best to use unamplified data, but with amplified data, don't remove duplicates.

                      3. Trim mapped reads 10bp on each end in the deduped bam file. I found an older thread that Brian posted in suggesting this can improve deletion calling (not sure if this is still your opinion, Brian). It seemed to help mask the few remaining unmapped free ends around deletion sites. The other option would be indel realignment using something like the GATK pipeline, but I think trimming works fine too.
                      Yes, I still recommend trimming the ends of reads after alignment but prior to variant calling. Bear in mind that it's not straightforward, as you need to appropriately trim the cigar string too.

                      Comment


                      • #12
                        Thanks for the reply Brian. The DNA was prepared using a Nexetra kit which I believe involves a few cycles of PCR. So just to double check, I shouldn't remove duplicates since I'm trying to quantify allelic frequencies?

                        For the post-alignment trim, I had the first 10 and last 10 bases in the aligned reads set to be N's -- I will have to look into how this would change the cigar string.

                        Comment


                        • #13
                          Originally posted by jwag View Post
                          Thanks for the reply Brian. The DNA was prepared using a Nexetra kit which I believe involves a few cycles of PCR. So just to double check, I shouldn't remove duplicates since I'm trying to quantify allelic frequencies?

                          For the post-alignment trim, I had the first 10 and last 10 bases in the aligned reads set to be N's -- I will have to look into how this would change the cigar string.
                          Amplification will skew quantitative things whether or not you remove duplicates, but duplicate removal can potentially skew them even more if you have high coverage, so I'd skip it in this case.

                          I should really modify BBDuk to be able to trim mapped reads. Maybe I'll do that today.

                          Comment


                          • #14
                            Originally posted by Brian Bushnell View Post
                            Amplification will skew quantitative things whether or not you remove duplicates, but duplicate removal can potentially skew them even more if you have high coverage, so I'd skip it in this case.

                            I should really modify BBDuk to be able to trim mapped reads. Maybe I'll do that today.
                            Ah that makes sense-- I will keep the duplicates in this case.

                            If BBDuk could trim mapped reads and also update the CIGAR string at the same time, that would be super useful. I've yet to find anything that can do that.
                            Last edited by jwag; 05-13-2015, 10:29 AM.

                            Comment


                            • #15
                              Jo,

                              The 1-50bp note is more about the default RTG mapping settings (you'd have to start adjusting the aligner settings to look for longer indels, at somewhat of a speed penalty). The variant caller will happily process anything you throw at it. The mappings in your original screenshot looked fine (in that there were reads spanning the deletion, plus some hanging into it, which all gets correctly taken into account by the caller during it's local realignment to the indel hypotheses, so I would say do *not* trim the ends of the reads after alignment).
                              Len Trigg, Ph.D.
                              Real Time Genomics
                              www.realtimegenomics.com

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              18 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              17 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X