  • Filtering out false positive structural variants

    Hi,
    I'm using BreakDancer and GASV for SV prediction, but they output a huge number of SVs, most of which are false positives (e.g., for one whole genome I'm getting around 10,000 SVs). Is there any way to filter out the false positives?

    Thanks,

  • #2
    There seems to be no response for my post. Is there anybody who can help?
    Thanks,



    • #3
      It depends on your input data (what species is it, and do you have data from multiple samples?).
      If these things are genuine SVs segregating in the population, then you should see them in many samples, and the two alleles should behave roughly the way you might expect (some people are hom-ref, some are hom-alt, and some are het).
      This approach has been put into action in Genome Strip (I forget which letters they capitalise) and in the Cortex population/segregation filter, and it gives both of these methods a low FDR. So if you have data on many samples, or can go and genotype many samples (even 10 or 20 would do), look at the allele balance. Excess heterozygosity is a signal of artefacts caused by mismapping or by repeats missing from the reference genome.



      • #4
        Thanks Zam,
        I'm working on human data, with around 60 cancer samples. Finding the SVs common to all the samples is very difficult because, as I mentioned, I'm getting more than 10,000 variants per sample. How can I detect excess heterozygosity?
        Thanks,



        • #5
          Will anybody give more details regarding this?
          Thanks,



          • #6
            Hi there. I didn't realise you were talking about cancer samples.
            1. The thing that bothers you does not seem to me to be the most difficult problem you face. 10,000 variants per sample is not a big deal, especially if you have 60 samples and you are looking for something shared by them all. Just look to see which of the 600,000 calls are present in all of them. Have you genotyped all your samples at all these called sites?
            2. The fact that you are essentially sampling a pool from a population of cells, which presumably have different genomes, makes the problem much harder. Do you expect to have both normal and multiple tumour genomes in there?

            By excess heterozygosity, I meant: take one specific variant and look to see how many of your samples carry both alleles of that variant. But anyway, the test I proposed is really applicable to germline variants in a population of humans; I wasn't thinking of cancer. To be honest, I think there are people reading this who are better qualified than me to help.
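The excess-heterozygosity check described above can be sketched roughly as follows. This is only a minimal illustration: it assumes each sample has already been genotyped at a biallelic site and coded as the number of alternate alleles (0, 1 or 2), and it uses Hardy-Weinberg proportions as the baseline. The function name is hypothetical, and, as the post says, the test is meant for germline variants in a population rather than for tumour samples.

```python
from collections import Counter


def heterozygosity_excess(genotypes):
    """Return (observed_het, expected_het) fractions for one biallelic site.

    genotypes: iterable of 0, 1 or 2 (alternate-allele count per sample).
    expected_het is the Hardy-Weinberg expectation 2 * p * (1 - p).
    """
    counts = Counter(genotypes)
    n = counts[0] + counts[1] + counts[2]
    if n == 0:
        raise ValueError("no genotypes supplied")
    # Alternate-allele frequency across all 2 * n chromosomes.
    p_alt = (counts[1] + 2 * counts[2]) / (2 * n)
    observed = counts[1] / n
    expected = 2 * p_alt * (1 - p_alt)
    return observed, expected


# A site where every sample looks heterozygous is suspicious.
all_het = heterozygosity_excess([1] * 20)
# A site that segregates the way a real germline variant might.
balanced = heterozygosity_excess([0] * 5 + [1] * 10 + [2] * 5)
print(all_het, balanced)
```

A site whose observed heterozygous fraction sits far above the expected one is a candidate artefact; how large a gap to tolerate is a judgment call that depends on the number of samples.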

            Good luck!



            • #7
              Thanks,
              The problem I'm facing is that when I check individual SVs against the aligned BAM file in IGV, I can see that most of them are false positives. That means I would have to check all 60,000 variants one by one, which would take a very long time. Is there any way to screen out the false positives? I have both normal and cancer samples (60 pairs).

              Thanks,



              • #8
                So is that 10,000 variants per sample that are in the tumour but not in the normal?



                • #9
                  BreakDancer and GASV give only breakpoints for each SV, and most of these breakpoints (around 95%) do not match exactly between the cancer and normal samples. A slight difference in breakpoints (say 1-50 base pairs) still means that the variant is the same, but how can we identify that?
                  Thanks,



                  • #10
                    Originally posted by tahamasoodi
                    BreakDancer and GASV give only breakpoints for each SV, and most of these breakpoints (around 95%) do not match exactly between the cancer and normal samples. A slight difference in breakpoints (say 1-50 base pairs) still means that the variant is the same, but how can we identify that?
                    Convert them to bed files and use intersectBed from BedTools to identify regions with any overlap?
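The idea behind matching breakpoints that differ by a few base pairs can be sketched in plain Python. This is not how intersectBed is implemented, just a small illustration of overlap with padding; it assumes calls are BED-like `(chrom, start, end)` tuples, and the function names and the 50 bp default (taken from the figure quoted above) are illustrative choices.

```python
def overlaps_with_slop(a, b, slop=50):
    """True if intervals a and b overlap once a is padded by slop on each side.

    a, b: (chrom, start, end) tuples with 0-based, half-open coordinates.
    """
    chrom_a, start_a, end_a = a
    chrom_b, start_b, end_b = b
    if chrom_a != chrom_b:
        return False
    return start_a - slop < end_b and start_b < end_a + slop


def shared_with_normal(tumour, normal, slop=50):
    """Tumour calls that have at least one padded overlap with a normal call."""
    return [t for t in tumour
            if any(overlaps_with_slop(t, n, slop) for n in normal)]


tumour_calls = [("chr1", 1000, 1001), ("chr2", 5000, 5001)]
normal_calls = [("chr1", 1030, 1031)]

padded = shared_with_normal(tumour_calls, normal_calls, slop=50)
exact = shared_with_normal(tumour_calls, normal_calls, slop=0)
print(padded, exact)
```

The nested loop is quadratic, which is fine for a sketch but slow for tens of thousands of calls; a sorted or indexed comparison, which is what dedicated interval tools provide, scales much better.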



                    • #11
                      Are there any more suggestions?
                      Thanks,



                      • #12
                        1. BreakDancer [paired-end] + tigra_sv [local assembly] + cross_match [alignment] may filter out some false positives.
                        2. There is another program, CREST [split reads], with which you may get true-positive breakpoints for cancer research [somatic SV breakpoints]. You should also use paired-end reads to confirm the results. However, the SV type reported by CREST is sometimes not accurate. CREST can also detect SV breakpoints in a single sample, but I think you want somatic SVs.



                        • #13
                          My workflow usually goes like this:

                          1) convert calls to bedpe format (see the docs for BEDtools for examples)
                          2) use bedtools pairToPair to subtract germline SVs from the breakpoints from the cancer sample
                          3) Filter the remaining somatic SV candidates using BEDtools by removing any in which:

                          a) one end overlaps a simple or low complexity repeat
                          b) one end is in a segmental duplication
                          c) both ends match (with some slop) a breakpoint categorized in the normal population by the 1000genomes project (or your favorite set of previously validated SVs)

                          Then rank the remaining set by score/number of supporting read pairs and use a cutoff that gets them down to a reasonable number.
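Steps 2 and 3c and the final ranking can be illustrated with a small self-contained sketch. It is not a replacement for bedtools pairToPair: it assumes the calls are already parsed into BEDPE-like records, that a `support` field holds the number of supporting read pairs, and that both ends are listed in the same order in the two sets. The class name, function names, slop and cutoff are all hypothetical.

```python
from typing import NamedTuple


class Bedpe(NamedTuple):
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int
    name: str
    support: int  # assumed field: number of supporting read pairs


def _end_matches(chrom_a, start_a, end_a, chrom_b, start_b, end_b, slop):
    # Overlap test after padding end "a" by slop on each side.
    return chrom_a == chrom_b and start_a - slop < end_b and start_b < end_a + slop


def both_ends_match(x, y, slop):
    """True if both ends of x match the corresponding ends of y."""
    return (_end_matches(x.chrom1, x.start1, x.end1,
                         y.chrom1, y.start1, y.end1, slop)
            and _end_matches(x.chrom2, x.start2, x.end2,
                             y.chrom2, y.start2, y.end2, slop))


def somatic_candidates(tumour, germline, slop=50, min_support=0):
    """Drop tumour calls matched by a germline call, then rank by support."""
    kept = [t for t in tumour
            if t.support >= min_support
            and not any(both_ends_match(t, g, slop) for g in germline)]
    return sorted(kept, key=lambda r: r.support, reverse=True)


tumour = [
    Bedpe("chr1", 100, 200, "chr1", 5000, 5100, "a", 12),
    Bedpe("chr4", 10, 20, "chr4", 9000, 9010, "d", 8),
    Bedpe("chr2", 100, 200, "chr5", 900, 1000, "b", 30),
    Bedpe("chr3", 700, 800, "chr3", 4000, 4100, "c", 4),
]
germline = [Bedpe("chr1", 120, 220, "chr1", 5040, 5140, "g", 9)]

ranked = somatic_candidates(tumour, germline, slop=50, min_support=5)
print([r.name for r in ranked])
```

The same `both_ends_match` test could be run against a set of previously validated population SVs instead of the matched normal, which corresponds to step 3c.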

                          Finally, try a local assembly method (TIGRA, etc.) or a method that can refine calls using split-read mappings (DELLY, etc.) to validate in silico. Sometimes I will also use a more sensitive aligner (like MEGABLAST) to find alternative concordant mappings for the supporting read pairs of each candidate SV.



                          • #14
                            Originally posted by Bukowski
                            Convert them to bed files and use intersectBed from BedTools to identify regions with any overlap?
                            Yes, I would also use something like this to rank the SVs by number of supporting samples, and use that criterion in cwhelan's workflow above (after step c).
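Ranking by number of supporting samples might look something like the sketch below. It assumes each call is reduced to a `(chrom1, pos1, chrom2, pos2)` tuple of breakpoint positions and uses a simple greedy grouping in which the first call seen represents its group; the names and the 50 bp tolerance are illustrative, not taken from any tool.

```python
def same_sv(a, b, slop=50):
    """True if two (chrom1, pos1, chrom2, pos2) calls agree within slop bp."""
    return (a[0] == b[0] and a[2] == b[2]
            and abs(a[1] - b[1]) <= slop
            and abs(a[3] - b[3]) <= slop)


def sample_support(calls_by_sample, slop=50):
    """Group similar calls across samples and count distinct samples per group.

    Returns a list of (representative_call, n_samples), most recurrent first.
    """
    clusters = []  # each entry: [representative_call, set_of_sample_names]
    for sample, calls in calls_by_sample.items():
        for call in calls:
            for cluster in clusters:
                if same_sv(cluster[0], call, slop):
                    cluster[1].add(sample)
                    break
            else:
                # No existing group matched, so this call starts a new one.
                clusters.append([call, {sample}])
    counted = [(rep, len(samples)) for rep, samples in clusters]
    return sorted(counted, key=lambda item: item[1], reverse=True)


calls_by_sample = {
    "s1": [("chr1", 1000, "chr1", 9000), ("chr2", 50, "chr7", 700)],
    "s2": [("chr1", 1020, "chr1", 8990)],
    "s3": [("chr1", 990, "chr1", 9010)],
}
support = sample_support(calls_by_sample)
print(support)
```

Because the grouping is greedy, the result can depend on the order in which calls are seen when breakpoints drift by close to the tolerance; a proper clustering step would be more robust.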



                            • #15
                              I would like to use bedtools to complete step a as described in cwhelan's post above. In the bedtools manual the recommended step is:
                              6.2.3 Retain only paired-end BAM alignments where neither end overlaps simple sequence repeats.
                              $ pairToBed -abam reads.bam -b SSRs.bed -type neither > reads.noSSRs.bam

                              I have 'somatic' SVs in bedpe format from step 2. Can this be done with a bedpe file, or do I need to get that information into bam format? Thanks.
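As far as I recall from the bedtools documentation, pairToBed also accepts a BEDPE file through `-a` in place of `-abam`, so converting to BAM should not be needed; this is worth confirming against the manual for the installed version. Independently of the tool, the "neither end overlaps a repeat" logic of step a can be sketched directly on BEDPE rows. The sketch assumes the first six columns are chrom1, start1, end1, chrom2, start2, end2, and that the repeats are `(chrom, start, end)` tuples; the function names are hypothetical.

```python
def end_hits_repeat(chrom, start, end, repeats):
    """True if the interval overlaps any (chrom, start, end) repeat."""
    return any(chrom == r_chrom and start < r_end and r_start < end
               for r_chrom, r_start, r_end in repeats)


def neither_end_in_repeats(bedpe_rows, repeats):
    """Keep BEDPE rows where neither end overlaps a repeat interval."""
    kept = []
    for row in bedpe_rows:
        chrom1, start1, end1, chrom2, start2, end2 = row[:6]
        hit1 = end_hits_repeat(chrom1, int(start1), int(end1), repeats)
        hit2 = end_hits_repeat(chrom2, int(start2), int(end2), repeats)
        if not hit1 and not hit2:
            kept.append(row)
    return kept


repeats = [("chr1", 150, 300)]
rows = [
    ("chr1", 100, 200, "chr1", 5000, 5100, "sv1"),  # first end in repeat
    ("chr1", 400, 500, "chr2", 10, 20, "sv2"),      # neither end in repeat
    ("chr3", 1, 2, "chr1", 299, 310, "sv3"),        # second end in repeat
]
clean = neither_end_in_repeats(rows, repeats)
print([row[6] for row in clean])
```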
