SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Structural Variants jkozubek Bioinformatics 4 08-22-2016 07:38 AM
Calling structural variants (CNVs) with single-end reads agwe Genomic Resequencing 5 01-18-2016 07:30 AM
Strangely high proportion of variants are INDELs PeteH Bioinformatics 2 05-09-2012 12:13 AM
Calling structural variants from capture data Heisman Bioinformatics 3 04-16-2012 07:01 AM
cufflinks 1.2.0 version got me significantly different results than the old version slowsmile Bioinformatics 9 02-01-2012 01:26 AM

Reply
 
Thread Tools
Old 09-24-2014, 04:38 AM   #141
t.wieland
Junior Member
 
Location: London

Join Date: Mar 2010
Posts: 8
Default

Hi KaiYe,

we're using Pindel to analyse exome data for quite a while now. We are starting to analyse more whole genome datasets now, and running pindel takes quite a while for these datasets (several days using a single thread), especially after switching from v0.2.4t to v0.2.5. So I'm wondering what is the recommended way to analyse large datasets? Currently we are using pindel with standard parameters (except -A which we set to 20) directly on BAM files from BWA and are just increasing the number of threads to 8-10. Is there something we can do better?

Best,
Thomas
t.wieland is offline   Reply With Quote
Old 09-24-2014, 06:33 AM   #142
KaiYe
Senior Member
 
Location: amsterdam

Join Date: Jun 2009
Posts: 133
Default

Quote:
Originally Posted by t.wieland View Post
Hi KaiYe,

we're using Pindel to analyse exome data for quite a while now. We are starting to analyse more whole genome datasets now, and running pindel takes quite a while for these datasets (several days using a single thread), especially after switching from v0.2.4t to v0.2.5. So I'm wondering what is the recommended way to analyse large datasets? Currently we are using pindel with standard parameters (except -A which we set to 20) directly on BAM files from BWA and are just increasing the number of threads to 8-10. Is there something we can do better?

Best,
Thomas
For exome data, you could use -x 2 or even -x 1 to speed up. Which version are you using? Recently I changed the code and remove one memory leak.

You'd better run jobs per chr and maximum 4 threads as you will not cut runtime further with more threads but it is better to increase the number of jobs.

Kai
KaiYe is offline   Reply With Quote
Old 09-24-2014, 06:38 AM   #143
t.wieland
Junior Member
 
Location: London

Join Date: Mar 2010
Posts: 8
Default

Hi Kai,

thanks for your reply!

Currently we're using v.2_5a3
Ok, I also thought about splitting by chromosome, but I wasn't sure if the detection of interchromosomal events still works?

Thomas
t.wieland is offline   Reply With Quote
Old 09-24-2014, 07:43 AM   #144
KaiYe
Senior Member
 
Location: amsterdam

Join Date: Jun 2009
Posts: 133
Default

Quote:
Originally Posted by t.wieland View Post
Hi Kai,

thanks for your reply!

Currently we're using v.2_5a3
Ok, I also thought about splitting by chromosome, but I wasn't sure if the detection of interchromosomal events still works?

Thomas
interchromosomal events will not be affected if split per chr.
KaiYe is offline   Reply With Quote
Old 05-26-2015, 06:19 AM   #145
BGould
Member
 
Location: New York

Join Date: May 2010
Posts: 14
Default pindel variants counted multiple times

Quote:
Originally Posted by hshain View Post
Thanks for the quick reply. Let's discuss the duplication first. I now see in the output where the "unique" reads are reported for a given event. In the example below, there are 6 reads which are duplicates of each other and Pindel recognizes and reports this:

####################################################################################################
4725 D 1 NT 0 "" ChrID chr3 BP 3016666 3016668 BP_range 3016666 3016668 Supports 6 1 + 6 1 - 0 0 S1 7 SUM_MS 314 1 NumSupSamples 1 1 28_1_GTGTTA 6 1 0 0
ATTGGATGCATAATAAAATTAAAACATTTTTTGTTTCTGGCATGGCCAATATTGCTATTTGTCTTATAGAAACCTCTTCTCATTACTAAATTATATATTCTgTATAGTGGGCCCCCCTTTCTAATTAATAATTAATATTGTCTTCCAGGCATTTTAGTTACCAAGTGGTAAAGGAAGCTTCTGTGATTTCAACTTCAAGTTA
CTTATAGAAACCTCTTCTCATTACTAAATTATATATTCT TATAGTGGGCCCCCCTTTCTAATTAATAATTAATATTGTCTTCCAGGCATTTTAGTTACCAA + 3016209 37 28_1_GTGTTA @DCDF8JN1:204:C0V4FACXX:4:2215:15528:50861/1
CTTATAGAAACCTCTTCTCATTACTAAATTATATATTCT TATAGTGGGCCCCCCTTTCTAATTAATAATTAATATTGTCTTCCAGGCATTTTAGTTACCAA + 3016209 37 28_1_GTGTTA @DCDF8JN1:204:C0V4FACXX:4:1312:8307:12733/1
CTTATAGAAACCTCTTCTCATTACTAAATTATATATTCT TATAGTGGGCCCCCCTTTCTAATTAATAATTAATATTGTCTTCCAGGCATTTTAGTTACCAA + 3016209 60 28_1_GTGTTA @DCDF8JN1:204:C0V4FACXX:4:1209:17429:19881/1
CTTATAGAAACCTCTTCTCATTGCTAAATTATATATTCT TATAGTGGGCCCCCCTTTCTAATTAATAATTAATATTGTCTTCCAGGCATTTTAGTTACCAA + 3016209 60 28_1_GTGTTA @DCDF8JN1:204:C0V4FACXX:4:1201:7760:69122/1
CTTATAGAAACCTCTTCTCATTACTAAATTATATATTCT TATAGTGGGCCCCCCTTTCTAATTAATAATTAATATTGTCTTCCAGGCATTTTAGTTACCAA + 3016209 60 28_1_GTGTTA @DCDF8JN1:204:C0V4FACXX:4:1109:9254:43555/1
CTTATAGAAACCTCTTCTCATTACTAAATTATATATTCT TATAGTGGGCCCCCCTTTCTAATTAATAATTAATATTGTCTTCCAGGCATTTTAGTTACCAA + 3016209 60 28_1_GTGTTA @DCDF8JN1:204:C0V4FACXX:4:1103:14378:72999/1
####################################################################################################


However, Pindel does not seem to properly recognize duplicates when the paired end reads run into each other as in this example:

####################################################################################################
4727 D 4 NT 0 "" ChrID chr3 BP 3910869 3910874 BP_range 3910869 3910877 Supports 4 4 + 2 2 - 2 2 S1 9 SUM_MS 240 1 NumSupSamples 1 1 28_1_GTGTTA 2 2 2 2
GCAGAAATAAAAAGAAAACATCAAATGCGGCTCTTCCATGACCTGCTAGGATCTGCTTCTACCAAATCATGGATATAGAAATAGGCCCAGCTGCACACCACtcttTCTAATTATCCTGTTCTTCCAATTCCTCTTTTATGCATTTTTTTTGCCCACTCTCTCTCGAAACACAGTAGCTCTGGGAGTTGAAAATTAAGTTTTA
AGAAATAGGCCCAGCTGCACACCAC TCTAATTATCCTGTTCTTCCAATTCCTCTTTTATGCATTTTTTTTGCCCACTCTCTCTCGAAACACAGTAGCTCTG + 3910628 60 28_1_GTGTTA @DCDF8JN1:204:C0V4FACXX:4:2214:15997:96580/1
CTCTTCTATGACCTGCTAGGATCTGCTTCTACCAAATCATGGATATAGAAATAGGCCCAGCTGCACACCAC TCTAATTATCCTGTTCTTCCAATTCCTCTT - 3911111 60 28_1_GTGTTA @DCDF8JN1:204:C0V4FACXX:4:2214:15997:96580/2
AGAAATAGGCCCAGCTGCACACCAC TCTAATTATCCTGTTCTTCCAATTCCTCTTTTATGCATTTTTTTTGCCCACTCTCTCTCGAAACACAGTAGCTCTG + 3910628 60 28_1_GTGTTA @DCDF8JN1:204:C0V4FACXX:4:1302:10814:73798/1
CTCTTCTATGACCTGCTAGGATCTGCTTCTACCAAATCATGGATATAGAAATAGGCCCAGCTGCACACCAC TCTAATTATCCTGTTCTTCCAATTCCTCTT - 3911111 60 28_1_GTGTTA @DCDF8JN1:204:C0V4FACXX:4:1302:10814:73798/2
####################################################################################################

You can see by the read coordinates that the X and Y position are the same for the forward and reverse reads -- suggesting that they are in the same pair. It is apparent that this is one unique paired-end read, the forward and reverse reads overlap, there is a deletion in the overlapping portion, and the read was duplicated. This is one event that is counted 4 times. I also confirmed that Picard mark dups correctly flagged all 4 reads as duplicates.

When I rank order candidates by the reported number of unique reads, these types of events are enriched at the top of my list.


Hi Hshain, I have this same issue using Pindel 0.2. Did you ever figure out why this is happening or how to fix/avoid this problem?
BGould is offline   Reply With Quote
Old 05-26-2015, 06:48 AM   #146
KaiYe
Senior Member
 
Location: amsterdam

Join Date: Jun 2009
Posts: 133
Default

Pindel marks a read as duplicated if both read pairs are identical. For the first event, both read sequences and the their mates' mapping positions are the same. Thus they are all marked as duplicates. For the second case, the mates' positions differ. Of course, this could be modified depending what we consider as true duplicates.

kai
KaiYe is offline   Reply With Quote
Old 06-01-2015, 06:06 AM   #147
Yann G.
Junior Member
 
Location: Europe

Join Date: Jun 2015
Posts: 1
Default

Hi Kai

I am a little confused about "-u" option (maximum_allowed_mismatch_rate)

In 0.2.5a8 version, help says:
"Only reads with more than this fraction of mismatches than the reference genome will be considered as harboring potential SVs. (default 0.02) "

BUT, on the website (http://gmt.genome.wustl.edu/packages...er-manual.html), it says
"only reads with fewer mismatches with the reference genome than this fraction will be considered (default 0.1)"

1- Which version should I trust ?
2- If I have well understood, this parameter (and the ones below) allow me to play with sensitivity/specificity depending on the kind of result I want (SV discovery or validation)
but I am not sure about when do I want to modify one option over the others in the list below :

-u (maximum_allowed_mismatch_rate)
-H (min_distance_to_the_end)
-n (minimum number of edit distance between reads and ref genome)
-d (min_num_matched_bases)
-a (additional_mismatch)
-m (min_perfect_match_around_BP)


Thanks!
Yann

PS: (still on the website) The link to dispersed duplication 's documentation seems dead.
Yann G. is offline   Reply With Quote
Old 07-17-2016, 07:14 PM   #148
Yongyong.Ren
Junior Member
 
Location: China

Join Date: Jul 2016
Posts: 1
Default

Hi KaiYe,
Thanks so much for your attention to this. We are using the latest version 0.2.5b8, 20151210. But we are really confused about some variants with large insertion, for example the record in the xx_SI file:
(Line1)287 I.98 NT.98."AAAAAGATTTGACTTGTGTAAACGAACCCATTTTCAAGAACTCTACCATGGTTTTATATGGAGACACAGGTGATAAACAAGCAACCCAAGTGTCAATT" ChrID.13 BP.32911324 32911325 BP_range.32911324 32911334 Supports.40 40 +.40 40 -.0 0 S1.41 SUM_MS.3840 1 NumSupSamples.1 1 tumor.0.0.40.40.0.0
(Line2)AAA..................................................................................................................................................................................................AAAGATTTGGTTTATGTTCTTGCAGAGGAGAACAAAAATAGTGTAAAGCAGCATATAAAAATGACTCTAGGTCAAGATTTAAAATCGGACATCTCCTTGAATATAGATAAAATACCAGAAAAAAATAATGATTACATGAACAAATGGGCAG
(Line3)AAACAGACTTGACTTGTGTAAACGAACCCATTTTCAAGAACTCTACCATGGTTTTATATGGAGACACAGGTGATAAACAAGCAACCCAAGTGTCAATTATAAAGATTTGGTTTATGTTCTTGCAG............................ + 32911224 96 tumor @M03097:19:000000000-G05CH:1:2104:21078:24842
But it seems impossiple exist the insertion 'ACAGACTTGACTTGTGTAAACGAACCCATTTTCAAGAACTCTACCATGGTTTTATATGGAGACACAGGTGATAAACAAGCAACCCAAGTGTCAATTAT' based on the blat result for this read '@M03097:19:000000000-G05CH:1:2104:21078:24842':

Blat result:
Genomic chr13 :
gactctgaag aacttttctc agacaatgag aataattttg tcttccaagt 32911173
agctaatgaa aggaataatc ttgctttagg aaatactaag gaacttcatg 32911223
AAACAGACTT GACTTGTGTA AACGAACCCA TTTTCAAGAA CTCTACCATG 32911273
GTTTTATATG GAGACACAGG TGATAAACAA GCAACCCAAG TGTCAATTAa 32911323
aAAAGATTTG GTTTATGTTC TTGCAGagga gaacaaaaat agtgtaaagc 32911373
agcatataaa aatgactcta ggtcaagatt taaaatcgga catctccttg 32911423
aatatagata aaataccaga aaaaaa

We try to send the temp files (bam and the input file for pindel) to you, but the email address like kye@genome.wustl.edu and k.ye@lumc.nl are already invalid.
Thanks in advance for your reply
my email is yongyong.ren2014@gmail.com
Yongyong Ren

Last edited by Yongyong.Ren; 07-17-2016 at 07:29 PM.
Yongyong.Ren is offline   Reply With Quote
Reply

Tags
pindel

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:04 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO