
  • General question: human CNV/Structural variants algorithms using next-gen data cannot

    This is a general question about the human CNV/structural variant field (with next-gen data, NOT arrays).

    As shown in the 1000 Genomes Project, groups have developed different algorithmic approaches to identify structural variants (mainly three: paired-end, read-depth and split-read).

    However, results from these approaches barely overlap with each other (of course they have different strengths; split-read, say, is powerful for small indels), and the false positive rate seems quite high (or we simply don't know the false positive rate, because we have no alternative approach to validate small structural variants the way we use array CGH for large ones).

    In simple words, I don't trust even the mainstream, widely used approaches like BreakDancer and CNVnator (I only have relative confidence in Pindel, because it provides nucleotide-resolution breakpoints). Do you trust them?

    If not, then what should we do? Carry out some post-processing or filtering to reduce the potential false positives? For example, adjust the read-depth threshold for read-depth-based approaches, or limit our attention to calls supported by uniquely mapping discordant paired-end reads for paired-end-based approaches?

    Or do we need to develop our own code for our specific research? What software do you use? (say CNVnator, BreakDancer)

    Personally, I would say that when sequencing someday becomes powerful enough to accurately produce long-enough reads, we can say goodbye to these mapping-based methods, because we can simply assemble all the reads without the problems caused by repetitive sequences in the human genome.

  • #2
    I usually take a 3-tiered approach, using CNVnator (read depth), BreakDancer (read pair) and CREST (split read). However, I too see a lot of false positives from each tool. What would be great is if we could get a consensus from the group on how to remove these issues.

    One approach I take is that I created a SV_BLacklist file. It comes from combining the Gaps, Segmental Duplications, and RepeatMasker tracks from UCSC. If either end of the SV intersects one of these features, I remove it. Undoubtedly, this removes some true positives, but if I don't, my Circos plots are full of intrachromosomal events.

    Anyone else want to post how they filter their results?
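
For anyone who wants to try the blacklist idea above, here is a minimal sketch in Python. It assumes the blacklist regions have already been read into `(chrom, start, end)` tuples with half-open coordinates (as in BED) and that each SV call is a `(chrom1, pos1, chrom2, pos2)` tuple; those layouts and all function names are made up for illustration and are not the poster's actual script.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(regions):
    """Merge (chrom, start, end) regions into sorted, non-overlapping lists per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        merged = [list(ivs[0])]
        for s, e in ivs[1:]:
            if s <= merged[-1][1]:          # overlapping or touching: extend
                merged[-1][1] = max(merged[-1][1], e)
            else:
                merged.append([s, e])
        index[chrom] = ([m[0] for m in merged], [m[1] for m in merged])
    return index


def in_blacklist(index, chrom, pos):
    """True if pos falls inside a half-open blacklist interval [start, end)."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect_right(starts, pos) - 1       # last interval starting at or before pos
    return i >= 0 and pos < ends[i]


def filter_calls(calls, index):
    """Keep SV calls (chrom1, pos1, chrom2, pos2) with neither end in the blacklist."""
    return [c for c in calls
            if not in_blacklist(index, c[0], c[1])
            and not in_blacklist(index, c[2], c[3])]
```

Merging the three tracks into one sorted list per chromosome keeps each lookup to a binary search, which matters once the repeat track contributes millions of intervals.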



    • #3
      "self chain" and "mapability" data for filtering problem regions ...

      For further filtering, or for flagging reads as problematic, you might try the "self chain" track:
      http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chainSelf.sql

      There are also "mapability" tracks available at the same place, for example:
      wgEncodeCrgMapabilityAlign75mer.sql



      • #4
        The main reason those tools don't overlap is that they cover different size ranges. Split-read methods like Pindel are more sensitive to small variants, from 1bp to several kb, while being less sensitive and having a high FDR for larger variants. Read-depth methods like CNVnator start to be useful for variants larger than hundreds of bp, and the larger the variant, the more sensitive and reliable they are. Read-pair methods like BreakDancer work well for variants larger than dozens of bp.

        All methods suffer in repetitive regions, where indels and SVs occur frequently. If you care more about FDR, remove all calls that overlap repetitive regions. If you want to understand the biology and to know all the changes in your samples of interest, such as cancer samples, you may not wish to filter calls based on repetitiveness alone, even though your validation experiments may fail.



        • #5
          One thing I was playing around with a little while back was trying to assemble the predicted breakpoints. I took all the reads in which either end of the pair mapped near the breakpoint and put them into Velvet. When it worked, it made a contig for either side of the breakpoint and one for the actual breakpoint. I didn't have time to pursue it further, but it tended to mostly agree with CREST.



          • #6
            Hi Kai, you are definitely right. Pindel is quite special in that it finds small indels that the others cannot. So I never compare Pindel results with CNVnator/BreakDancer/VariationHunter. But the problem comes when you compare CNVnator with BreakDancer/VH, which identify SVs of similar length: very few calls overlap. This is quite frustrating.



            Originally posted by KaiYe
            The main reason those tools don't overlap is that they cover different size ranges. Split-read methods like Pindel are more sensitive to small variants, from 1bp to several kb, while being less sensitive and having a high FDR for larger variants. Read-depth methods like CNVnator start to be useful for variants larger than hundreds of bp, and the larger the variant, the more sensitive and reliable they are. Read-pair methods like BreakDancer work well for variants larger than dozens of bp.

            All methods suffer in repetitive regions, where indels and SVs occur frequently. If you care more about FDR, remove all calls that overlap repetitive regions. If you want to understand the biology and to know all the changes in your samples of interest, such as cancer samples, you may not wish to filter calls based on repetitiveness alone, even though your validation experiments may fail.
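
When comparing two call sets of similar size range, how "overlap" is defined changes the answer a lot. A common convention is reciprocal overlap, sketched below in Python. The 50% threshold, the `(chrom, start, end)` half-open tuple layout and the function names are assumptions for illustration, not something any of the tools in this thread prescribes.

```python
def reciprocal_overlap(a, b, min_frac=0.5):
    """a, b are (chrom, start, end), half-open.

    True if the shared segment covers at least min_frac of BOTH calls.
    """
    if a[0] != b[0]:
        return False
    ov = min(a[2], b[2]) - max(a[1], b[1])
    if ov <= 0:
        return False
    return ov >= min_frac * (a[2] - a[1]) and ov >= min_frac * (b[2] - b[1])


def concordant(calls_a, calls_b, min_frac=0.5):
    """Return the calls in calls_a that have at least one reciprocal match in calls_b."""
    return [a for a in calls_a
            if any(reciprocal_overlap(a, b, min_frac) for b in calls_b)]
```

This all-against-all loop is quadratic, so it is only meant for a few thousand calls per set. Read-depth breakpoints are imprecise by nature, so it may be worth trying a looser `min_frac` before concluding that two callers disagree.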



            • #7
              Originally posted by henry.wood
              One thing I was playing around with a little while back was trying to assemble the predicted breakpoints. I took all the reads in which either end of the pair mapped near the breakpoint and put them into Velvet. When it worked, it made a contig for either side of the breakpoint and one for the actual breakpoint. I didn't have time to pursue it further, but it tended to mostly agree with CREST.

              Sounds interesting. Can you explain a little bit more?
              One problem for assembly: what if, say, the deletion is heterozygous? Then there will still be some reads mapping to the deleted part.
              So maybe you mean we assemble all the reads "outside" the calls? Since there could be soft-clipped reads, we could then assemble them into a contig that represents the real genome structure of our sample?
              thx



              • #8
                Speaking of assembly of SVs, you could also try Cortex (sorry for the plug, I am an author).

                Sensitivity drops with variant length (increased chance of a coverage gap, plus graph complexity), so it won't assemble the very large CNVs or segdups. Roughly speaking, you can call hets up to kbs in size and homs up to tens or hundreds of kb (depending on species/genome/read length). There's a lot of detail in the supp info about what you will be able to assemble for a given experiment.



                • #9
                  I should have said this explicitly: there is a trade-off. Cortex assembles full alleles, giving you flank, allele1, allele2, flank, rather than just breakpoints. That's the advantage over other SV callers: it is more precise (in our paper we validate the exact sequence of our alleles against finished fosmids). HOWEVER, it does not have the power to detect very large events (with current read lengths). So it depends on what you want to be able to detect. Don't waste your time with Cortex if you want to find 200kb het duplications, segdups, etc.



                  • #10
                    Originally posted by CNVboy
                    Sounds interesting. Can you explain a little bit more?
                    One problem for assembly: what if, say, the deletion is heterozygous? Then there will still be some reads mapping to the deleted part.
                    So maybe you mean we assemble all the reads "outside" the calls? Since there could be soft-clipped reads, we could then assemble them into a contig that represents the real genome structure of our sample?
                    thx

                    You're right. It's been a while since I did it and I've forgotten the details. I didn't use all the reads, only the reads where there wasn't a perfect alignment. So I kept the reads where one end of the pair aligned and the other didn't, as well as the soft-clipped reads. I fiddled around with it for a little while, but then I realised I wasn't meant to be writing breakpoint algorithms, and I was only doing it to put off writing a talk. It cut down the list from BreakDancer quite a bit, but I never got it to outperform CREST.
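
The read selection described above (pairs where only one end aligned, plus soft-clipped reads) can be sketched directly on SAM text. This is a rough Python illustration, not the poster's code: the function names and the window arguments are hypothetical, and it relies only on the standard SAM columns and FLAG bits (0x4 read unmapped, 0x8 mate unmapped). It also assumes the aligner follows the usual convention of giving an unmapped read the RNAME/POS of its mapped mate.

```python
import re

UNMAPPED, MATE_UNMAPPED = 0x4, 0x8
SOFT_CLIP = re.compile(r"\d+S")


def is_informative(flag, cigar):
    """True for soft-clipped reads or pairs where exactly one end is unmapped."""
    one_end_unmapped = bool(flag & UNMAPPED) != bool(flag & MATE_UNMAPPED)
    soft_clipped = not (flag & UNMAPPED) and bool(SOFT_CLIP.search(cigar))
    return one_end_unmapped or soft_clipped


def select_reads(sam_lines, chrom, start, end):
    """Yield (name, sequence) for informative reads placed in [start, end] on chrom.

    POS is 1-based, as in SAM; header lines starting with '@' are skipped.
    """
    for line in sam_lines:
        if line.startswith("@"):
            continue
        f = line.rstrip("\n").split("\t")
        name, flag, rname, pos, cigar, seq = f[0], int(f[1]), f[2], int(f[3]), f[5], f[9]
        if rname == chrom and start <= pos <= end and is_informative(flag, cigar):
            yield name, seq
```

The selected sequences would then be written out as FASTA/FASTQ and handed to an assembler; that step is left out here because it depends on the assembler and its options.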



                    • #11
                      Do all of these algorithms work on capture data? I'm having a hard time figuring this out.



                      • #12
                        @Heisman
                        Speaking from my own experience, which might be incorrect: read depth algorithms designed for WGS data tend to be very noisy on exome data, because the capture step adds another source of noise. Especially in regions with low capture efficiency and low read depth, the variation of log ratios is considerable. I'd try the few exome-specific read depth algorithms published recently. BreakDancer-type read-pair algorithms don't work on capture data: capture data spans only 1% or whatever portion of the genome, so simplistically speaking a given SV breakpoint has just a 1% chance of being spanned by your reads. Algorithms designed for finding short indels (<50bp) should, however, work as they are for detecting variants within your target regions.

                        @RockChalkJayhawk and others who have used BreakDancer
                        I'm running some 100GB whole genome Illumina sequencing files through BreakDancer. The program has been running for 2 weeks, has done 500 variants and has made it to chromosome 3. Have any of you encountered such slow running times?



                        • #13
                          @RockChalkJayhawk and others who have used BreakDancer
                          I'm running some 100GB whole genome Illumina sequencing files through BreakDancer. The program has been running for 2 weeks, has done 500 variants and has made it to chromosome 3. Have any of you encountered such slow running times?

                          We parallelize our runs by chromosome first. That should speed things up for you.
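
A generic way to set up the per-chromosome parallel runs mentioned above is a small Python wrapper that launches one external command per chromosome. Everything here is a sketch: `run_per_chromosome` and `make_command` are hypothetical names, and how to restrict a given SV caller to a single chromosome (a command-line option, or a pre-split BAM) has to be taken from that tool's own documentation.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_per_chromosome(chromosomes, make_command, max_workers=4):
    """Run one external command per chromosome concurrently; return {chrom: stdout}.

    make_command(chrom) must return the argument list for that chromosome.
    """
    def run_one(chrom):
        # check=True raises CalledProcessError if the command exits non-zero
        done = subprocess.run(make_command(chrom), capture_output=True,
                              text=True, check=True)
        return chrom, done.stdout

    # Threads are enough here: the heavy work happens in the child processes.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, chromosomes))
```

Note that calls with one end on another chromosome may need separate handling when runs are split this way, and `max_workers` should be chosen with memory and disk I/O in mind, not just the number of cores.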
