**bwubb** · 03-22-2012, 06:41 AM

It comes with an executable to format output to a VCF file. This should trim it down a good deal, depending on how many variants you have versus supporting reads.

VCF format is accepted by certain annotation programs as well, which is very nice.

**KaiYe** · 03-22-2012, 08:04 AM

Originally posted by odoyle81

I have a massive Pindel deletion file that has too many rows to open in Excel.
Does anyone have any ideas on how I can analyze this? I probably need to convert this to a database?

It would be nice if Pindel could output a list of deletions as a CSV file.

EDIT: I was able to delete all lines from the output file not starting with a digit (using a regex), and that gives me a file I can start to format as a CSV. Any better ideas or methods out there?

Indeed, you can convert the result to VCF. You can also grep the header lines that contain the variant information and then print selected fields:
grep ChrID Pindel_output.txt | awk '{print $.....}'
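To make the grep/awk idea concrete, here is a minimal sketch that turns the header lines into CSV rows. It does not fill in the elided field list above. Instead it looks values up by their labels, assuming the header lines carry the labels ChrID, BP and Supports and that the second and third whitespace-separated fields are the variant type and size. Please check those assumptions against your own Pindel output. The function name is made up for illustration.

```shell
# Print one CSV row per Pindel header line (the lines containing "ChrID").
# Assumed layout: field 2 = variant type, field 3 = size, and labelled
# values "ChrID <name>", "BP <start> <end>", "Supports <n>".
pindel_headers_to_csv() {
  grep ChrID "$1" | awk '
    BEGIN { OFS = ","; print "type,size,chrom,bp_start,bp_end,supports" }
    {
      chrom = "NA"; bp_start = "NA"; bp_end = "NA"; supports = "NA"
      for (i = 1; i <= NF; i++) {
        if ($i == "ChrID")    chrom = $(i + 1)
        if ($i == "BP")       { bp_start = $(i + 1); bp_end = $(i + 2) }
        if ($i == "Supports") supports = $(i + 1)
      }
      print $2, $3, chrom, bp_start, bp_end, supports
    }'
}
```

Usage would be something like: pindel_headers_to_csv Pindel_output.txt > deletions.csv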

**odoyle81** · 03-27-2012, 04:57 PM

Originally posted by bwubb

It comes with an executable to format output to a VCF file. This should trim it down a good deal, depending on how many variants you have versus supporting reads.

VCF format is accepted by certain annotation programs as well, which is very nice.

Could you provide more documentation on this? I don't see this in the pindel source. I see vcfcreator.cpp but I don't know how to implement this.

Thanks!

**KaiYe** · 03-28-2012, 12:11 AM

Originally posted by odoyle81

Could you provide more documentation on this? I don't see this in the pindel source. I see vcfcreator.cpp but I don't know how to implement this.

Thanks!

After you type
./INSTALL <path to samtools folder>
you will have the binary programs: pindel, pindel2vcf...

If you then type
./pindel2vcf
you will see its documentation.
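As a rough sketch of what a conversion call can look like, the wrapper below uses only the flags that appear in a pindel2vcf command posted later in this thread (-p, -r, -R, -d, -v). The meaning given to each flag in the comments is a reading of that command, not a substitute for the documentation printed by ./pindel2vcf. The function name, the PINDEL2VCF variable and the output file naming are hypothetical.

```shell
# Hypothetical wrapper around pindel2vcf.
#   $1 = Pindel output file (for example the _D deletion file)
#   $2 = reference FASTA that Pindel was run against
#   $3 = name of the reference, $4 = date of the reference (YYYYMMDD)
# The VCF is written next to the input file as <input>.vcf.
convert_pindel_to_vcf() {
  "${PINDEL2VCF:-./pindel2vcf}" \
    -p "$1" \
    -r "$2" \
    -R "$3" \
    -d "$4" \
    -v "$1.vcf"
}
```

For example: convert_pindel_to_vcf EXAMPLE_D ./EXAMPLE_reference.fasta example 20130728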

**odoyle81** · 04-03-2012, 10:18 AM

Awesome, thanks. This works great!

**tgoldman** · 07-28-2013, 01:30 PM

Problem running pindel2vcf

Hello,

I am having an issue running pindel2vcf with one particular reference genome. Another set of Pindel files, which used a different reference genome, converted without a problem. Pindel itself ran successfully and produced what look like proper output files. The reference was indexed with samtools faidx. When I run pindel2vcf, it looks like it can't find the scaffold sequences in the FASTA file. Are certain characters not allowed in the reference FASTA, or is something wrong with the ChrID naming for this particular file? The scaffolds are named S00, S01, and so on, and the ChrIDs in the Pindel output files match those in the reference, so that doesn't seem to be the issue. Any help would be greatly appreciated. Thanks.

pindel2vcf -p EXAMPLE_D -r ./EXAMPLE_reference.fasta -R example -d 20130728 -v EXAMPLE_deletions.vcf
Samples:
1. EXAMPLE
Chromosomes in which SVs have been found:
1. S00
2. S01
3. S02
4. S04
5. S05
6. S06
7. S07
8. S08
9. S09
10. S10
11. S11
12. S12
13. S13
14. S14
15. S15
16. S16
17. S17
18. S18
19. S19
20. S20
21. S21
22. S22
23. S23
24. S26
25. S28
26. S29
27. S36
28. S37
29. S39
Scanning chromosome: S00
Scanning chromosome: S01
Scanning chromosome: S02
Scanning chromosome: S03
Scanning chromosome: S04
Scanning chromosome: S05
Scanning chromosome: S06
Scanning chromosome: S07
Scanning chromosome: S08
Scanning chromosome: S09
Scanning chromosome: S10
Scanning chromosome: S11
Scanning chromosome: S12
Scanning chromosome: S13
Scanning chromosome: S14
Scanning chromosome: S15
Scanning chromosome: S16
Scanning chromosome: S17
Scanning chromosome: S18
Scanning chromosome: S19
Scanning chromosome: S20
Scanning chromosome: S21
Scanning chromosome: S22
Scanning chromosome: S23
Scanning chromosome: S24
Scanning chromosome: S25
Scanning chromosome: S26
Scanning chromosome: S27
Scanning chromosome: S28
Scanning chromosome: S29
Scanning chromosome: S30
Scanning chromosome: S31
Scanning chromosome: S32
Scanning chromosome: S33
Scanning chromosome: S34
Scanning chromosome: S35
Scanning chromosome: S36
Scanning chromosome: S37
Scanning chromosome: S38
Scanning chromosome: S39
Exiting reference scanning.
, skipping it.hromosome S00
from memory.mosome S00
, skipping it.hromosome S01
from memory.mosome S01
, skipping it.hromosome S02
from memory.mosome S02
, skipping it.hromosome S03
from memory.mosome S03
, skipping it.hromosome S04
from memory.mosome S04
, skipping it.hromosome S05
from memory.mosome S05
, skipping it.hromosome S06
from memory.mosome S06
, skipping it.hromosome S07
from memory.mosome S07
, skipping it.hromosome S08
from memory.mosome S08
, skipping it.hromosome S09
from memory.mosome S09
, skipping it.hromosome S10
from memory.mosome S10
, skipping it.hromosome S11
from memory.mosome S11
, skipping it.hromosome S12
from memory.mosome S12
, skipping it.hromosome S13
from memory.mosome S13
, skipping it.hromosome S14
from memory.mosome S14
, skipping it.hromosome S15
from memory.mosome S15
, skipping it.hromosome S16
from memory.mosome S16
, skipping it.hromosome S17
from memory.mosome S17
, skipping it.hromosome S18
from memory.mosome S18
, skipping it.hromosome S19
from memory.mosome S19
, skipping it.hromosome S20
from memory.mosome S20
, skipping it.hromosome S21
from memory.mosome S21
, skipping it.hromosome S22
from memory.mosome S22
, skipping it.hromosome S23
from memory.mosome S23
, skipping it.hromosome S24
from memory.mosome S24
, skipping it.hromosome S25
from memory.mosome S25
, skipping it.hromosome S26
from memory.mosome S26
, skipping it.hromosome S27
from memory.mosome S27
, skipping it.hromosome S28
from memory.mosome S28
, skipping it.hromosome S29
from memory.mosome S29
, skipping it.hromosome S30
from memory.mosome S30
, skipping it.hromosome S31
from memory.mosome S31
, skipping it.hromosome S32
from memory.mosome S32
, skipping it.hromosome S33
from memory.mosome S33
, skipping it.hromosome S34
from memory.mosome S34
, skipping it.hromosome S35
from memory.mosome S35
, skipping it.hromosome S36
from memory.mosome S36
, skipping it.hromosome S37
from memory.mosome S37
, skipping it.hromosome S38
from memory.mosome S38
, skipping it.hromosome S39
from memory.mosome S39
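One thing that may be worth ruling out (a guess, not something confirmed anywhere in this thread): messages whose tail is printed over their own beginning, as in the lines above, are what a terminal shows when the printed name ends in a carriage return, for example because a file was saved with Windows line endings. The sketch below counts lines containing a carriage return and writes a cleaned copy. Both function names are made up for illustration.

```shell
# Count the lines of a file that contain a carriage return (\r).
count_cr_lines() {
  awk '/\r/ { n++ } END { print n + 0 }' "$1"
}

# Write a copy of $1 to $2 with every carriage return removed.
strip_cr() {
  tr -d '\r' < "$1" > "$2"
}
```

If count_cr_lines reports a non-zero number for the reference FASTA or for a Pindel output file, a cleaned copy can be made with strip_cr; a cleaned reference would need to be re-indexed with samtools faidx.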

**KaiYe** · 07-29-2013, 08:24 AM

This is the first time I have seen this issue. Can you provide a subset of your output and your reference file somewhere, for example via FTP?

**Myriem** · 01-14-2017, 10:51 AM

Originally posted by KaiYe

This is the first time I have seen this issue. Can you provide a subset of your output and your reference file somewhere, for example via FTP?

Hello,
I would like to ask about preprocessing the input files for Pindel and about running the program.
First, some background: DNA from five unrelated patients was sequenced on a MiSeq using an Illumina kit that covers 12 Mb of genomic content.
Pindel was chosen to refine and complete the analysis by detecting the breakpoints of large deletions, medium-sized insertions, inversions, tandem duplications and other structural variants in the next-generation sequencing data.
I have run into the following problems:
1- Preprocessing of the input files:
The input for Pindel consists of the reference genome sequence and the BAM files produced by our high-throughput sequencing run. Do I need to download the entire human reference genome myself, or can I simply run './pindel -f hs_ref_GRCh37.fa -p my_input_name_files.txt -c ALL -o my_output-name_files' and let the software handle it?
Also, in my case, should the search for indels and SVs be limited to the genomic regions covered by the TruSight One kit? Could mapping paired-end reads to the entire human reference genome generate false results?
2- Insert size:
Which tools can be used to obtain the insert size metrics for each sample?
3- Running Pindel on five BAM files:
I have five BAM files generated by sequencing the DNA of five unrelated affected patients. Do you recommend running Pindel on the BAM files one by one, or on all of them at the same time? And what are the differences between the output files in each case?
4- What computational infrastructure (memory, hard disk) is recommended for running Pindel?
I look forward to your response.

**EWLameijer_XJTU** · 01-17-2017, 11:10 PM

Hello Myriem,

This is Eric-Wubbo Lameijer from Kai Ye's (Pindel's) lab.

To answer your questions:

1) You need the reference genome (FASTA file) that was used to generate the BAM file, and you pass its name and path as the -f parameter. If another bioinformatician created the BAM file, they should be able to provide the correct FASTA file. If you can't get that file, you need to do some extra work. Some people on the forum may know where to download a 'proper' reference genome; I have not found a ready-made one myself and had to use the hs_ref_ files from ftp://ftp.ncbi.nlm.nih.gov/genomes/H...romosomes/seq/ . Gunzip them, merge them, possibly change the chromosome names (after the >) to chr1, chr2, etc., and index the resulting reference file with samtools. There is also a file on the UCSC website; you can check hgdownload.cse.ucsc.edu/downloads.html . But the easiest (and best) option is to use the FASTA file that was used to create the BAM files.
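The renaming step could look roughly like the sketch below. It assumes you have written a two-column, tab-separated mapping file yourself (old sequence name, new name such as chr1), because the exact header format of the downloaded files is not given here. The function name and the mapping file are hypothetical.

```shell
# Rename FASTA headers using a tab-separated map: old_name<TAB>new_name.
#   $1 = mapping file, $2 = merged FASTA file; result goes to stdout.
# Only the first word after ">" is compared; headers without a mapping
# entry are passed through unchanged.
rename_fasta_headers() {
  awk -F '\t' '
    NR == FNR { new_name[$1] = $2; next }      # first file: load the map
    /^>/ {
      split(substr($0, 2), parts, " ")         # parts[1] = sequence name
      if (parts[1] in new_name) { print ">" new_name[parts[1]]; next }
    }
    { print }
  ' "$1" "$2"
}
```

After something like rename_fasta_headers names.tsv merged.fa > reference.fa, the result can be indexed with samtools faidx reference.fa.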

1b) Yes, Pindel can generate more false positives if the whole-genome reference is used, because a region outside the scanned area may provide a more exact match. I would personally handle this by first limiting the size of the indels to search for (the -x option with a value of 1 or 2) and by being wary of all indels with very low coverage/support, though what counts as low support depends on your dataset. If there seem to be too many indel calls with very low support, you can use the -e option of pindel2vcf. Which support threshold to use depends on the coverage of your original dataset; in my experience, calls with a total support below roughly 20-25% of the median coverage tend to be relatively unreliable.
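To illustrate the 20-25% rule of thumb, here is a small sketch that turns a median coverage into a minimum support and keeps only the Pindel header lines that reach it. It is a stand-in for a quick look at the data, not a description of what the -e option of pindel2vcf does. It also assumes that header lines contain ChrID and a label Supports followed by the total support. Both function names are made up.

```shell
# Minimum support for a given median coverage.
#   $1 = median coverage, $2 = percentage (default 20); result is rounded up.
min_support_for() {
  awk -v cov="$1" -v pct="${2:-20}" 'BEGIN {
    t = cov * pct / 100
    c = (t == int(t)) ? t : int(t) + 1
    printf "%d\n", c
  }'
}

# Keep only Pindel header lines whose "Supports" value is at least $2.
filter_pindel_headers() {
  grep ChrID "$1" | awk -v min="$2" '{
    for (i = 1; i < NF; i++)
      if ($i == "Supports" && $(i + 1) + 0 >= min + 0) { print; break }
  }'
}
```

For a median coverage of 40, min_support_for 40 gives 8, which can then be passed on: filter_pindel_headers Pindel_output.txt 8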

2) Insert size metrics: at the moment, Pindel assumes that the user knows the insert size of the library they used or ordered. If you don't know it: according to some discussions on Biostars (https://www.biostars.org/p/14339/ and https://www.biostars.org/p/94246/ ), some BAM/SAM files contain this information; otherwise you need to copy or write a script to deduce it.
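One way such a script could look, as a sketch only: the ninth column of a SAM record is the observed template length (TLEN), so the median of its positive values gives a rough insert size estimate. The function name and the default cut-off of 10000 for discarding outliers are arbitrary choices for illustration.

```shell
# Median of the positive template lengths (SAM column 9) read from stdin.
#   $1 = largest value to accept (default 10000), to drop extreme outliers.
# Returns a non-zero status if no usable record was seen.
median_insert_size() {
  awk -F '\t' -v max="${1:-10000}" '$9 > 0 && $9 <= max { print $9 }' |
    sort -n |
    awk '
      { v[NR] = $1 }
      END {
        if (NR == 0) exit 1
        if (NR % 2) print v[(NR + 1) / 2]
        else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
      }'
}
```

With samtools installed, the records of properly paired reads can be piped in, for example: samtools view -f 2 sample.bam | median_insert_size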

3) Running on the patients separately or as one group: in general, I would recommend running Pindel on the full set of samples in one go; this increases Pindel's sensitivity somewhat and makes downstream processing easier. And if you see an indel at a certain position with a low allele frequency (say 10-20%) in all of the unrelated patients, you can be reasonably certain that it is a false call caused by measurement errors, genomic repetitiveness, or similar problems. As for the differences: running samples together increases Pindel's sensitivity (the chance that it finds a relatively difficult-to-find indel) but decreases its specificity (a larger chance of finding a 'fake' indel). So it is a tradeoff, but I generally think it is more useful to throw away bad indels later than to miss real indels in the first place.
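For running all five samples in one go, the BAM files are usually listed in a single configuration file. The sketch below writes one, assuming the common three-column layout (BAM path, insert size, sample label) and one shared insert size. Both the layout and the helper name are assumptions to verify against Pindel's own help text.

```shell
# Print one configuration line per BAM file: path, insert size, label.
#   $1 = insert size shared by all samples; remaining arguments = BAM paths.
# The label is the file name without its directory and ".bam" suffix.
make_pindel_config() {
  insert_size=$1
  shift
  for bam in "$@"; do
    label=$(basename "$bam" .bam)
    printf '%s\t%s\t%s\n' "$bam" "$insert_size" "$label"
  done
}
```

For example: make_pindel_config 250 patient1.bam patient2.bam > bam_config.txt. To my knowledge Pindel reads such a file through its -i option, but that flag does not appear in this thread, so please confirm it in the program's help output.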

4) Pindel does not need special hardware. Basically, any computer that runs Linux can run Pindel (OS X can also work, but getting Pindel to work there can be a bit trickier). Even on a normal system (say, a PC with 2 GB of memory), Pindel should not run out of memory and should finish in somewhere between 10 minutes and a day; for your exome I'd estimate an hour at most. If you do run into a lack of memory, please consult the FAQ file in the Pindel main directory; its advice should generally work. If it does not, please contact us directly at our contact e-mail addresses or raise an issue on GitHub. But basically, I would not expect any problems with extreme running times or out-of-memory errors.

Best regards,

Eric-Wubbo

processing pindel output files