Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • I wrote:
    "It's not at all clear to me how sam2pindel knows whether I am giving it alignments from mate-pair instead of alignments from paired-end."

    For anyone else who has a similar problem and happens upon this thread, it turns out sam2pindel has a 6th argument that isn't shown in the user manual, which is how you tell it this difference.

    I now notice that even though the URL says "current" the text says Pindel 0.2.4. I'm using 0.2.5a3. Web docs lag the source.

    T.Hattum

    Comment


    • Hi,

      As per AlisonF, I need to extract the reference sequencing depth (RD) at the position of the detected structural variant for subsequent computation.

      However, after running pindel on my novoalign generated .bam files and converting the pindel output into the vcf format using pindel2vcf, I only see the "GT:AD" fields (example below) and not the "GT:RD:AD" field as shown by the other users.


      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT T101a T101b
      chr1 5919343 . ATC A . PASS END=5919345;HOMLEN=5;HOMSEQ=TCTCT;SVLEN=-2;SVTYPE=DEL GT:AD 0/0:0,0 0/0:4,4
      chr1 16975201 . CT C . PASS END=16975202;HOMLEN=0;SVLEN=-1;SVTYPE=DEL GT:AD 0/0:57,0 0/0:53,4
      chr1 17084536 . TGGAACA T . PASS END=17084542;HOMLEN=2;HOMSEQ=GG;SVLEN=-6;SVTYPE=DEL GT:AD 0/0:55,0 0/0:71,4


      For my analysis, I am using "Pindel version 0.2.5a3, Oct 24 2013." with default settings and the -i option with the config file and "pindel2vcf 0.5.8" to convert the pindel file into the vcf format.

      Is anyone able to assist me on this?


      Thank you.

      Comment


      • Hi Kai Ye,
        I'm attempting to use Pindel (version 0.2.5a3) to annotate Internal Tandem Duplications in the gene FLT3. Others (PMID: 23159595) have had great success doing this with Pindel. I've run into a problem where I'm unable to limit the search to any region using the -c option. I'm particularly interested in the region chr13:28608000-28608600. Every time I execute Pindel with the -c option set to anything other than "ALL", I get a segmentation fault:

        "
        Initializing parameters...
        Pindel version 0.2.5a3, Oct 24 2013.
        Loading reference genome ...
        Loading reference genome done.
        Initializing parameters done.
        SearchRegion::SearchRegion
        Segmentation fault (core dumped)
        "

        However, when I execute with -c ALL, the program runs fine (and finds the desired ITD). I've tried setting -r, -l, and -k to false and I've tried setting -w to 1 to limit the amount of memory used. I've reproduced the command I'm using below (with absolute paths removed for clarity):

        pindel -f hg19.fa -i pindel_input.txt -c 13 -r false -l false -k false -t true -w 1 -o pindel_output

        I could just run the program on all chromosomes but would prefer to run it just on this defined region. Any idea where this seg fault is coming from? Unfortunately to core file isnt actually getting dumped so I cant reproduce that for you.
        Thanks in advance!
        -Ryan

        Comment


        • Originally posted by mcorces View Post
          Hi Kai Ye,
          I'm attempting to use Pindel (version 0.2.5a3) to annotate Internal Tandem Duplications in the gene FLT3. Others (PMID: 23159595) have had great success doing this with Pindel. I've run into a problem where I'm unable to limit the search to any region using the -c option. I'm particularly interested in the region chr13:28608000-28608600. Every time I execute Pindel with the -c option set to anything other than "ALL", I get a segmentation fault:

          "
          Initializing parameters...
          Pindel version 0.2.5a3, Oct 24 2013.
          Loading reference genome ...
          Loading reference genome done.
          Initializing parameters done.
          SearchRegion::SearchRegion
          Segmentation fault (core dumped)
          "

          However, when I execute with -c ALL, the program runs fine (and finds the desired ITD). I've tried setting -r, -l, and -k to false and I've tried setting -w to 1 to limit the amount of memory used. I've reproduced the command I'm using below (with absolute paths removed for clarity):

          pindel -f hg19.fa -i pindel_input.txt -c 13 -r false -l false -k false -t true -w 1 -o pindel_output

          I could just run the program on all chromosomes but would prefer to run it just on this defined region. Any idea where this seg fault is coming from? Unfortunately to core file isnt actually getting dumped so I cant reproduce that for you.
          Thanks in advance!
          -Ryan
          You have -c 13 and -c chr13. Please make sure that use the chromosome names as in your reference file. If you wish to check a set of genomic regions, you might use the new -j bed option. You would be able to specify chr start end per line and put as many genomic regions as you want. You could combine -c with -j so that Pindel compute regions also constrained by -c.

          Let me know whether this works.

          Kai

          Comment


          • Originally posted by kartong View Post
            Hi,

            As per AlisonF, I need to extract the reference sequencing depth (RD) at the position of the detected structural variant for subsequent computation.

            However, after running pindel on my novoalign generated .bam files and converting the pindel output into the vcf format using pindel2vcf, I only see the "GT:AD" fields (example below) and not the "GT:RD:AD" field as shown by the other users.


            #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT T101a T101b
            chr1 5919343 . ATC A . PASS END=5919345;HOMLEN=5;HOMSEQ=TCTCT;SVLEN=-2;SVTYPE=DEL GT:AD 0/0:0,0 0/0:4,4
            chr1 16975201 . CT C . PASS END=16975202;HOMLEN=0;SVLEN=-1;SVTYPE=DEL GT:AD 0/0:57,0 0/0:53,4
            chr1 17084536 . TGGAACA T . PASS END=17084542;HOMLEN=2;HOMSEQ=GG;SVLEN=-6;SVTYPE=DEL GT:AD 0/0:55,0 0/0:71,4


            For my analysis, I am using "Pindel version 0.2.5a3, Oct 24 2013." with default settings and the -i option with the config file and "pindel2vcf 0.5.8" to convert the pindel file into the vcf format.

            Is anyone able to assist me on this?


            Thank you.
            We also put RD as part of AD so that AD has two int.

            Comment


            • Hi Kai Ye,
              You're exactly right. Sorry for such a silly error!
              Thanks!
              -ryan

              Comment


              • Hi KaiYe,

                we're using Pindel to analyse exome data for quite a while now. We are starting to analyse more whole genome datasets now, and running pindel takes quite a while for these datasets (several days using a single thread), especially after switching from v0.2.4t to v0.2.5. So I'm wondering what is the recommended way to analyse large datasets? Currently we are using pindel with standard parameters (except -A which we set to 20) directly on BAM files from BWA and are just increasing the number of threads to 8-10. Is there something we can do better?

                Best,
                Thomas

                Comment


                • Originally posted by t.wieland View Post
                  Hi KaiYe,

                  we're using Pindel to analyse exome data for quite a while now. We are starting to analyse more whole genome datasets now, and running pindel takes quite a while for these datasets (several days using a single thread), especially after switching from v0.2.4t to v0.2.5. So I'm wondering what is the recommended way to analyse large datasets? Currently we are using pindel with standard parameters (except -A which we set to 20) directly on BAM files from BWA and are just increasing the number of threads to 8-10. Is there something we can do better?

                  Best,
                  Thomas
                  For exome data, you could use -x 2 or even -x 1 to speed up. Which version are you using? Recently I changed the code and remove one memory leak.

                  You'd better run jobs per chr and maximum 4 threads as you will not cut runtime further with more threads but it is better to increase the number of jobs.

                  Kai

                  Comment


                  • Hi Kai,

                    thanks for your reply!

                    Currently we're using v.2_5a3
                    Ok, I also thought about splitting by chromosome, but I wasn't sure if the detection of interchromosomal events still works?

                    Thomas

                    Comment


                    • Originally posted by t.wieland View Post
                      Hi Kai,

                      thanks for your reply!

                      Currently we're using v.2_5a3
                      Ok, I also thought about splitting by chromosome, but I wasn't sure if the detection of interchromosomal events still works?

                      Thomas
                      interchromosomal events will not be affected if split per chr.

                      Comment


                      • pindel variants counted multiple times

                        Originally posted by hshain View Post
                        Thanks for the quick reply. Let's discuss the duplication first. I now see in the output where the "unique" reads are reported for a given event. In the example below, there are 6 reads which are duplicates of each other and Pindel recognizes and reports this:

                        ####################################################################################################
                        4725 D 1 NT 0 "" ChrID chr3 BP 3016666 3016668 BP_range 3016666 3016668 Supports 6 1 + 6 1 - 0 0 S1 7 SUM_MS 314 1 NumSupSamples 1 1 28_1_GTGTTA 6 1 0 0
                        ATTGGATGCATAATAAAATTAAAACATTTTTTGTTTCTGGCATGGCCAATATTGCTATTTGTCTTATAGAAACCTCTTCTCATTACTAAATTATATATTCTgTATAGTGGGCCCCCCTTTCTAATTAATAATTAATATTGTCTTCCAGGCATTTTAGTTACCAAGTGGTAAAGGAAGCTTCTGTGATTTCAACTTCAAGTTA
                        CTTATAGAAACCTCTTCTCATTACTAAATTATATATTCT TATAGTGGGCCCCCCTTTCTAATTAATAATTAATATTGTCTTCCAGGCATTTTAGTTACCAA + 3016209 37 28_1_GTGTTA @DCDF8JN1:204:C0V4FACXX:4:2215:15528:50861/1
                        CTTATAGAAACCTCTTCTCATTACTAAATTATATATTCT TATAGTGGGCCCCCCTTTCTAATTAATAATTAATATTGTCTTCCAGGCATTTTAGTTACCAA + 3016209 37 28_1_GTGTTA @DCDF8JN1:204:C0V4FACXX:4:1312:8307:12733/1
                        CTTATAGAAACCTCTTCTCATTACTAAATTATATATTCT TATAGTGGGCCCCCCTTTCTAATTAATAATTAATATTGTCTTCCAGGCATTTTAGTTACCAA + 3016209 60 28_1_GTGTTA @DCDF8JN1:204:C0V4FACXX:4:1209:17429:19881/1
                        CTTATAGAAACCTCTTCTCATTGCTAAATTATATATTCT TATAGTGGGCCCCCCTTTCTAATTAATAATTAATATTGTCTTCCAGGCATTTTAGTTACCAA + 3016209 60 28_1_GTGTTA @DCDF8JN1:204:C0V4FACXX:4:1201:7760:69122/1
                        CTTATAGAAACCTCTTCTCATTACTAAATTATATATTCT TATAGTGGGCCCCCCTTTCTAATTAATAATTAATATTGTCTTCCAGGCATTTTAGTTACCAA + 3016209 60 28_1_GTGTTA @DCDF8JN1:204:C0V4FACXX:4:1109:9254:43555/1
                        CTTATAGAAACCTCTTCTCATTACTAAATTATATATTCT TATAGTGGGCCCCCCTTTCTAATTAATAATTAATATTGTCTTCCAGGCATTTTAGTTACCAA + 3016209 60 28_1_GTGTTA @DCDF8JN1:204:C0V4FACXX:4:1103:14378:72999/1
                        ####################################################################################################


                        However, Pindel does not seem to properly recognize duplicates when the paired end reads run into each other as in this example:

                        ####################################################################################################
                        4727 D 4 NT 0 "" ChrID chr3 BP 3910869 3910874 BP_range 3910869 3910877 Supports 4 4 + 2 2 - 2 2 S1 9 SUM_MS 240 1 NumSupSamples 1 1 28_1_GTGTTA 2 2 2 2
                        GCAGAAATAAAAAGAAAACATCAAATGCGGCTCTTCCATGACCTGCTAGGATCTGCTTCTACCAAATCATGGATATAGAAATAGGCCCAGCTGCACACCACtcttTCTAATTATCCTGTTCTTCCAATTCCTCTTTTATGCATTTTTTTTGCCCACTCTCTCTCGAAACACAGTAGCTCTGGGAGTTGAAAATTAAGTTTTA
                        AGAAATAGGCCCAGCTGCACACCAC TCTAATTATCCTGTTCTTCCAATTCCTCTTTTATGCATTTTTTTTGCCCACTCTCTCTCGAAACACAGTAGCTCTG + 3910628 60 28_1_GTGTTA @DCDF8JN1:204:C0V4FACXX:4:2214:15997:96580/1
                        CTCTTCTATGACCTGCTAGGATCTGCTTCTACCAAATCATGGATATAGAAATAGGCCCAGCTGCACACCAC TCTAATTATCCTGTTCTTCCAATTCCTCTT - 3911111 60 28_1_GTGTTA @DCDF8JN1:204:C0V4FACXX:4:2214:15997:96580/2
                        AGAAATAGGCCCAGCTGCACACCAC TCTAATTATCCTGTTCTTCCAATTCCTCTTTTATGCATTTTTTTTGCCCACTCTCTCTCGAAACACAGTAGCTCTG + 3910628 60 28_1_GTGTTA @DCDF8JN1:204:C0V4FACXX:4:1302:10814:73798/1
                        CTCTTCTATGACCTGCTAGGATCTGCTTCTACCAAATCATGGATATAGAAATAGGCCCAGCTGCACACCAC TCTAATTATCCTGTTCTTCCAATTCCTCTT - 3911111 60 28_1_GTGTTA @DCDF8JN1:204:C0V4FACXX:4:1302:10814:73798/2
                        ####################################################################################################

                        You can see by the read coordinates that the X and Y position are the same for the forward and reverse reads -- suggesting that they are in the same pair. It is apparent that this is one unique paired-end read, the forward and reverse reads overlap, there is a deletion in the overlapping portion, and the read was duplicated. This is one event that is counted 4 times. I also confirmed that Picard mark dups correctly flagged all 4 reads as duplicates.

                        When I rank order candidates by the reported number of unique reads, these types of events are enriched at the top of my list.


                        Hi Hshain, I have this same issue using Pindel 0.2. Did you ever figure out why this is happening or how to fix/avoid this problem?

                        Comment


                        • Pindel marks a read as duplicated if both read pairs are identical. For the first event, both read sequences and the their mates' mapping positions are the same. Thus they are all marked as duplicates. For the second case, the mates' positions differ. Of course, this could be modified depending what we consider as true duplicates.

                          kai

                          Comment


                          • Hi Kai

                            I am a little confused about "-u" option (maximum_allowed_mismatch_rate)

                            In 0.2.5a8 version, help says:
                            "Only reads with more than this fraction of mismatches than the reference genome will be considered as harboring potential SVs. (default 0.02) "

                            BUT, on the website (http://gmt.genome.wustl.edu/packages...er-manual.html), it says
                            "only reads with fewer mismatches with the reference genome than this fraction will be considered (default 0.1)"

                            1- Which version should I trust ?
                            2- If I have well understood, this parameter (and the ones below) allow me to play with sensitivity/specificity depending on the kind of result I want (SV discovery or validation)
                            but I am not sure about when do I want to modify one option over the others in the list below :

                            -u (maximum_allowed_mismatch_rate)
                            -H (min_distance_to_the_end)
                            -n (minimum number of edit distance between reads and ref genome)
                            -d (min_num_matched_bases)
                            -a (additional_mismatch)
                            -m (min_perfect_match_around_BP)


                            Thanks!
                            Yann

                            PS: (still on the website) The link to dispersed duplication 's documentation seems dead.

                            Comment


                            • Hi KaiYe,
                              Thanks so much for your attention to this. We are using the latest version 0.2.5b8, 20151210. But we are really confused about some variants with large insertion, for example the record in the xx_SI file:
                              (Line1)287 I.98 NT.98."AAAAAGATTTGACTTGTGTAAACGAACCCATTTTCAAGAACTCTACCATGGTTTTATATGGAGACACAGGTGATAAACAAGCAACCCAAGTGTCAATT" ChrID.13 BP.32911324 32911325 BP_range.32911324 32911334 Supports.40 40 +.40 40 -.0 0 S1.41 SUM_MS.3840 1 NumSupSamples.1 1 tumor.0.0.40.40.0.0
                              (Line2)AAA..................................................................................................................................................................................................AAAGATTTGGTTTATGTTCTTGCAGAGGAGAACAAAAATAGTGTAAAGCAGCATATAAAAATGACTCTAGGTCAAGATTTAAAATCGGACATCTCCTTGAATATAGATAAAATACCAGAAAAAAATAATGATTACATGAACAAATGGGCAG
                              (Line3)AAACAGACTTGACTTGTGTAAACGAACCCATTTTCAAGAACTCTACCATGGTTTTATATGGAGACACAGGTGATAAACAAGCAACCCAAGTGTCAATTATAAAGATTTGGTTTATGTTCTTGCAG............................ + 32911224 96 tumor @M03097:19:000000000-G05CH:1:2104:21078:24842
                              But it seems impossiple exist the insertion 'ACAGACTTGACTTGTGTAAACGAACCCATTTTCAAGAACTCTACCATGGTTTTATATGGAGACACAGGTGATAAACAAGCAACCCAAGTGTCAATTAT' based on the blat result for this read '@M03097:19:000000000-G05CH:1:2104:21078:24842':

                              Blat result:
                              Genomic chr13 :
                              gactctgaag aacttttctc agacaatgag aataattttg tcttccaagt 32911173
                              agctaatgaa aggaataatc ttgctttagg aaatactaag gaacttcatg 32911223
                              AAACAGACTT GACTTGTGTA AACGAACCCA TTTTCAAGAA CTCTACCATG 32911273
                              GTTTTATATG GAGACACAGG TGATAAACAA GCAACCCAAG TGTCAATTAa 32911323
                              aAAAGATTTG GTTTATGTTC TTGCAGagga gaacaaaaat agtgtaaagc 32911373
                              agcatataaa aatgactcta ggtcaagatt taaaatcgga catctccttg 32911423
                              aatatagata aaataccaga aaaaaa

                              We try to send the temp files (bam and the input file for pindel) to you, but the email address like [email protected] and [email protected] are already invalid.
                              Thanks in advance for your reply
                              my email is [email protected]
                              Yongyong Ren
                              Last edited by Yongyong.Ren; 07-17-2016, 07:29 PM.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM
                              • seqadmin
                                The Impact of AI in Genomic Medicine
                                by seqadmin



                                Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                02-26-2024, 02:07 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 03-14-2024, 06:13 AM
                              0 responses
                              33 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-08-2024, 08:03 AM
                              0 responses
                              72 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-07-2024, 08:13 AM
                              0 responses
                              81 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-06-2024, 09:51 AM
                              0 responses
                              68 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X