
  • processing pindel output files

    I have a massive Pindel deletion file that has too many rows to open in Excel.
    Does anyone have ideas on how I can analyze this? I probably need to convert this to a database?

    It would be nice if Pindel could output a list of deletions as a CSV file...

    EDIT: I was able to delete all lines from the output file not starting with a digit (using a regex), and that gives me a file I can start to format as a CSV. Any better ideas or methods out there?
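
The regex approach from the edit above can be sketched as a small shell function: keep only the lines that start with a digit and write them out as quoted CSV. The function name is made up for illustration, and the sketch assumes the fields on those lines are tab-separated, which may not hold for every Pindel version.

```shell
# Keep only lines starting with a digit and emit them as CSV.
# Assumption: fields on those lines are separated by tabs.
pindel_headers_to_csv() {
    grep '^[0-9]' "$1" | awk -F '\t' '{
        line = ""
        for (i = 1; i <= NF; i++) {
            field = $i
            gsub(/"/, "\"\"", field)   # double any embedded quotes for CSV
            line = line (i > 1 ? "," : "") "\"" field "\""
        }
        print line
    }'
}

# Example call (file names are placeholders):
#   pindel_headers_to_csv Pindel_output.txt > deletions.csv
```

Quoting every field keeps values that contain spaces or commas in one column when the file is imported into a spreadsheet or database.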

  • #2
    It comes with an executable to format output to a VCF file. This should trim it down a good deal, depending on how many variants you have versus supporting reads.

    VCF format is accepted by certain annotation programs as well, which is very nice.



    • #3
      Originally posted by odoyle81
      I have a massive pindel deletion file that has too many rows to open in excel..
      Anyone have any ideas on how I can analyze this? I probably need to convert this to a database?

      It would be nice if pindel could output a list of deletions as a nice csv file...

      EDIT: I was able to delete all lines from the output file not starting with a digit (using a regex) and that gives me file I can start to format as a csv. Any better ideas or methods out there?

      Indeed, you can convert the result to VCF. You can also grep the header lines containing the variant information and then print selected fields:
      grep ChrID Pindel_output.txt | awk '{print $.....}'
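
A sketch of the same grep-and-awk idea with the field numbers left as a parameter, since which columns matter depends on the output file. `pindel_select_fields` is a hypothetical helper name, and fields are split on whitespace as in the one-liner above.

```shell
# Print chosen whitespace-separated fields from lines containing "ChrID".
# Usage: pindel_select_fields FILE "N1 N2 ..."
# The field numbers are the caller's choice; none are prescribed here.
pindel_select_fields() {
    grep ChrID "$1" | awk -v cols="$2" '
        BEGIN { n = split(cols, want, " ") }
        {
            out = ""
            for (i = 1; i <= n; i++)
                out = out (i > 1 ? "\t" : "") $(want[i])
            print out
        }'
}
```

The result is tab-separated, so it can be redirected to a file and opened directly in a spreadsheet.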



      • #4
        Originally posted by bwubb
        It comes with an executable to format output to a VCF file. This should trim it down a good deal, depending on how many variants you have versus supporting reads.

        VCF format is accepted by certain annotation programs as well, which is very nice.

        Could you provide more documentation on this? I don't see this in the pindel source. I see vcfcreator.cpp, but I don't know how to use it.

        Thanks!



        • #5
          Originally posted by odoyle81
          Could you provide more documentation on this? I don't see this in the pindel source. I see vcfcreator.cpp but I don't know how to implement this.

          Thanks!

          After you type
          ./INSTALL <path to samtools folder>
          you will have binary programs, pindel, pindel2vcf...

          If you type
          ./pindel2vcf
          you will see documentation...



          • #6
            awesome thanks - this works great!



            • #7
              Problem running pindel2vcf

              Hello,

              I am having an issue running pindel2vcf, just for a particular reference genome. Another set of Pindel files using a different reference genome converted fine without a problem. I successfully ran Pindel and have what looks to be proper Pindel output files. The reference was indexed with samtools faidx. When I run pindel2vcf it looks like it can't find the scaffold sequences in the fasta. Are certain characters not allowed in the reference fasta, or something wrong with the ChrID naming for this particular fasta? I have them named S00, S01, so on, with the ChrIDs in the Pindel output files matching those in the reference so that doesn't seem to be the issue. Any help would be greatly appreciated. Thanks.

              pindel2vcf -p EXAMPLE_D -r ./EXAMPLE_reference.fasta -R example -d 20130728 -v EXAMPLE_deletions.vcf
              Samples:
              1. EXAMPLE
              Chromosomes in which SVs have been found:
              1. S00
              2. S01
              3. S02
              4. S04
              5. S05
              6. S06
              7. S07
              8. S08
              9. S09
              10. S10
              11. S11
              12. S12
              13. S13
              14. S14
              15. S15
              16. S16
              17. S17
              18. S18
              19. S19
              20. S20
              21. S21
              22. S22
              23. S23
              24. S26
              25. S28
              26. S29
              27. S36
              28. S37
              29. S39
              Scanning chromosome: S00
              Scanning chromosome: S01
              Scanning chromosome: S02
              Scanning chromosome: S03
              Scanning chromosome: S04
              Scanning chromosome: S05
              Scanning chromosome: S06
              Scanning chromosome: S07
              Scanning chromosome: S08
              Scanning chromosome: S09
              Scanning chromosome: S10
              Scanning chromosome: S11
              Scanning chromosome: S12
              Scanning chromosome: S13
              Scanning chromosome: S14
              Scanning chromosome: S15
              Scanning chromosome: S16
              Scanning chromosome: S17
              Scanning chromosome: S18
              Scanning chromosome: S19
              Scanning chromosome: S20
              Scanning chromosome: S21
              Scanning chromosome: S22
              Scanning chromosome: S23
              Scanning chromosome: S24
              Scanning chromosome: S25
              Scanning chromosome: S26
              Scanning chromosome: S27
              Scanning chromosome: S28
              Scanning chromosome: S29
              Scanning chromosome: S30
              Scanning chromosome: S31
              Scanning chromosome: S32
              Scanning chromosome: S33
              Scanning chromosome: S34
              Scanning chromosome: S35
              Scanning chromosome: S36
              Scanning chromosome: S37
              Scanning chromosome: S38
              Scanning chromosome: S39
              Exiting reference scanning.
              , skipping it.hromosome S00
              from memory.mosome S00
              , skipping it.hromosome S01
              from memory.mosome S01
              , skipping it.hromosome S02
              from memory.mosome S02
              , skipping it.hromosome S03
              from memory.mosome S03
              , skipping it.hromosome S04
              from memory.mosome S04
              , skipping it.hromosome S05
              from memory.mosome S05
              , skipping it.hromosome S06
              from memory.mosome S06
              , skipping it.hromosome S07
              from memory.mosome S07
              , skipping it.hromosome S08
              from memory.mosome S08
              , skipping it.hromosome S09
              from memory.mosome S09
              , skipping it.hromosome S10
              from memory.mosome S10
              , skipping it.hromosome S11
              from memory.mosome S11
              , skipping it.hromosome S12
              from memory.mosome S12
              , skipping it.hromosome S13
              from memory.mosome S13
              , skipping it.hromosome S14
              from memory.mosome S14
              , skipping it.hromosome S15
              from memory.mosome S15
              , skipping it.hromosome S16
              from memory.mosome S16
              , skipping it.hromosome S17
              from memory.mosome S17
              , skipping it.hromosome S18
              from memory.mosome S18
              , skipping it.hromosome S19
              from memory.mosome S19
              , skipping it.hromosome S20
              from memory.mosome S20
              , skipping it.hromosome S21
              from memory.mosome S21
              , skipping it.hromosome S22
              from memory.mosome S22
              , skipping it.hromosome S23
              from memory.mosome S23
              , skipping it.hromosome S24
              from memory.mosome S24
              , skipping it.hromosome S25
              from memory.mosome S25
              , skipping it.hromosome S26
              from memory.mosome S26
              , skipping it.hromosome S27
              from memory.mosome S27
              , skipping it.hromosome S28
              from memory.mosome S28
              , skipping it.hromosome S29
              from memory.mosome S29
              , skipping it.hromosome S30
              from memory.mosome S30
              , skipping it.hromosome S31
              from memory.mosome S31
              , skipping it.hromosome S32
              from memory.mosome S32
              , skipping it.hromosome S33
              from memory.mosome S33
              , skipping it.hromosome S34
              from memory.mosome S34
              , skipping it.hromosome S35
              from memory.mosome S35
              , skipping it.hromosome S36
              from memory.mosome S36
              , skipping it.hromosome S37
              from memory.mosome S37
              , skipping it.hromosome S38
              from memory.mosome S38
              , skipping it.hromosome S39
              from memory.mosome S39
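
The overwritten-looking lines in the log above (", skipping it.hromosome S00") resemble what a terminal shows when a printed name ends in a carriage return, which can happen when a FASTA file has Windows line endings. This is only a guess and the thread does not confirm the cause. A minimal sketch for checking it, with hypothetical function names:

```shell
# Count FASTA header lines that end in a carriage return (CR).
count_cr_headers() {
    awk '/^>/ && /\r$/ { n++ } END { print n + 0 }' "$1"
}

# Write a copy of the file with all carriage returns removed to stdout.
strip_cr() {
    tr -d '\r' < "$1"
}

# Example (file names are placeholders):
#   count_cr_headers EXAMPLE_reference.fasta
#   strip_cr EXAMPLE_reference.fasta > EXAMPLE_reference.clean.fasta
#   samtools faidx EXAMPLE_reference.clean.fasta   # re-index the cleaned copy
```

If the count is zero, the line endings are not the problem and the cause lies elsewhere.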



              • #8
                This is the first time I have seen this issue. Could you share a subset of your output and your reference file somewhere, for example via FTP?



                • #9
                  Originally posted by KaiYe
                  first time have this issue. can you provide a subset of your output and your reference file somewhere like ftp?

                  Hello,
                  I would like to ask about preprocessing the input files and running Pindel.
                  Some background first: DNA from five unrelated patients was sequenced on a MiSeq using an Illumina kit that covers 12 Mb of genomic content.
                  We chose Pindel to refine and complete our analysis by detecting the breakpoints of large deletions, medium-sized insertions, inversions, tandem duplications and other structural variants in the next-gen sequence data.
                  I have run into the following problems:
                  1- Preprocessing of the input files:
                  The input for Pindel consists of the reference genome sequence and the BAM files from our high-throughput sequencing run. Do I need to download the entire human reference genome, or can I simply run './pindel -f hs_ref_GRCh37.fa -p my_input_name_files.txt -c ALL -o my_output-name_files'?
                  Also, in my case, should the search for indels and SVs be limited to the genomic regions covered by the TruSight One kit? Can mapping paired-end reads to the entire human reference genome produce false results?
                  2- Insert size:
                  Which tools can be used to obtain the insert size metrics for each sample?
                  3- Running Pindel on five BAM files:
                  I have five BAM files from sequencing the DNA of five unrelated affected patients. Do you recommend running Pindel on the BAM files one by one, or on all of them at the same time? And what are the differences between the output files in each case?
                  4- What computational infrastructure (memory size, hard disk) is recommended for running Pindel?
                  I look forward to your response.



                  • #10
                    Hello Myriem,

                    this is Eric-Wubbo Lameijer from Kai Ye's (Pindel's) lab.

                    To answer your questions:

                    1) you need the reference genome/fasta file that has been used to generate the BAM file, and give the name of that file (and the path to it) as the -f parameter. If another bioinformatician has created the BAM file, they should be able to provide you with the correct fasta file. If you can't get that fasta file, you need to do some extra work; some people in the forum may know where you can download a 'proper' reference genome, I myself have not found a ready-made reference genome yet and had to use ftp://ftp.ncbi.nlm.nih.giv/genomes/H...romosomes/seq/ and of those the hs_ref_ files. Gunzip, merge, possibly change the chromosome names (after the >) to chr1, chr2 etc., and use samtools to index the reference file. There is also a file on the UCSC website – you can check hgdownload.cse.ucsc.edu/downloads.html . But easiest (and best) is if you can use the fasta file that has been used for creating the BAM files.
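
The gunzip, merge and rename steps described in 1) could look roughly like the sketch below. It assumes each downloaded file holds exactly one chromosome sequence. The helper name and all file names are placeholders, not the actual names on the FTP site.

```shell
# Replace the header line of a single-sequence FASTA with ">NAME".
# Assumption: the input file contains exactly one sequence.
rename_fasta_header() {
    awk -v name="$2" '/^>/ { print ">" name; next } { print }' "$1"
}

# Example workflow (placeholder file names):
#   gunzip -c first_chromosome.fa.gz  > chr1.raw.fa
#   gunzip -c second_chromosome.fa.gz > chr2.raw.fa
#   rename_fasta_header chr1.raw.fa chr1 >  reference.fa
#   rename_fasta_header chr2.raw.fa chr2 >> reference.fa
#   samtools faidx reference.fa    # writes reference.fa.fai
```

Whatever names are chosen, they need to match the chromosome names used in the BAM files.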

                    1b) Yes, Pindel can generate (more) false positives if the whole genome reference is used, as it could be that a region outside the scanned area provides a more exact match. The ways I would personally handle this are first to limit the size of indels to seek (-x option with 1 or 2), and second to be wary of all indels that have very low coverage/support, though what counts as low support will depend on your dataset. You can use an option in pindel2vcf (the -e option) if there seem to be too many indel calls with very low support. Where to set the cutoff depends on the coverage of your original data set; in my experience, calls with a total support of less than something like 20-25% of the median coverage tend to be relatively unreliable.
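
As a worked example of the 20-25% rule of thumb in 1b), the sketch below turns a median coverage into a minimum support count, rounding up. The function name is hypothetical, and the median coverage has to come from your own data.

```shell
# Minimum total support = PERCENT % of the median coverage, rounded up.
# PERCENT defaults to 20 (the lower end of the 20-25% rule of thumb).
min_support() {
    median_cov=$1
    percent=${2:-20}
    echo $(( (median_cov * percent + 99) / 100 ))
}

# Example: with a median coverage of 60x
#   min_support 60      # 20% of 60
#   min_support 60 25   # 25% of 60
# The result could then be passed to the pindel2vcf -e option mentioned above.
```

With a median coverage of 60, this gives 12 at 20% and 15 at 25%.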

                    2) Insert size metrics: at the moment, Pindel assumes that the user knows the insert size of the library he/she used/ordered. If you don't know: according to some discussions on biostars (https://www.biostars.org/p/14339/ and https://www.biostars.org/p/94246/ ) some BAM/SAM files have this information, otherwise you need to copy/use some script to deduce it.
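
One possible script for deducing the insert size, as mentioned in 2), is sketched below. It reads SAM text on standard input and takes the median of the positive TLEN values (column 9) from reads flagged as properly paired (FLAG bit 0x2). Using TLEN as a stand-in for insert size is an assumption, and the function name is made up.

```shell
# Read SAM records on stdin; print the median positive TLEN (column 9)
# among reads whose FLAG has the "properly paired" bit (0x2) set.
median_insert_size() {
    awk -F '\t' '!/^@/ && int($2 / 2) % 2 == 1 && $9 > 0 { print $9 }' |
        sort -n |
        awk '{ v[NR] = $1 }
             END {
                 if (NR == 0) exit 1          # no usable pairs found
                 if (NR % 2) print v[(NR + 1) / 2]
                 else print int((v[NR / 2] + v[NR / 2 + 1]) / 2)
             }'
}

# Example (placeholder file name):
#   samtools view sample.bam | median_insert_size
```

Only positive TLEN values are used so that each pair is counted once. The median is less sensitive than the mean to a few very distant or mis-mapped pairs.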

                    3) Running on the patients separately or as one group: in general, I would recommend running Pindel on the full set of samples in one go; this increases Pindel's sensitivity somewhat, and makes downstream processing easier. And if you see (in all unrelated patients) an indel at a certain position with low allele frequency (say 10-20%), then you can be reasonably certain that this is a false call caused by measurement errors or problems with genomic repetitiveness or such. So in general, try to run Pindel on the entire set in one go. As for the differences: running samples together increases the sensitivity of Pindel (chance that it finds a relatively difficult-to-find indel), though it decreases the specificity (larger chance to find a 'fake' indel). So it is a tradeoff, but generally I think it more useful to throw away bad indels later than not to find real indels in the first place.
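
For running all five samples in one go, Pindel can, as far as I know, take a BAM configuration file through its -i option, with one line per sample giving the BAM path, the insert size and a sample label. That option and the file layout are not mentioned in this thread, so check them against the usage text printed by ./pindel. The helper below is a hypothetical sketch that writes such a file.

```shell
# Write one config line per sample: "<bam path> <insert size> <sample label>".
# Arguments are given in triples: BAM INSERT_SIZE LABEL [BAM INSERT_SIZE LABEL ...]
make_pindel_config() {
    while [ "$#" -ge 3 ]; do
        printf '%s\t%s\t%s\n' "$1" "$2" "$3"
        shift 3
    done
}

# Example (paths, insert sizes and labels are placeholders):
#   make_pindel_config p1.bam 250 patient1 p2.bam 260 patient2 > bam_config.txt
#   ./pindel -f reference.fa -i bam_config.txt -c ALL -o all_patients
```

Giving each sample its own label means the per-sample support stays distinguishable in the combined output.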

                    4) One does not need special hardware for Pindel; basically, if a computer runs Linux (OSX can also work, but getting Pindel to work on OSX can be a bit trickier) it can run Pindel; even on a normal system (say PC with 2 GB of memory) Pindel should not run out of memory and should be finished in a time between 10 minutes and a day, for your exome I'd estimate an hour at most. If there is a problem with lack of memory, please consult the FAQ file in the Pindel main directory, that should generally work. If that does not work, please contact us directly on our contact e-mail addresses or by raising an issue on GitHub. But basically, I would not expect any problems with extreme running times or out-of-memory errors.

                    Best regards,

                    Eric-Wubbo
