SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Somatic SNP wanguan2000 Bioinformatics 2 03-13-2012 07:59 PM
Characterization and improvement of RNA-Seq precision NGSfan Literature Watch 3 09-15-2011 03:00 AM
Concerned about precision... dawe Bioinformatics 3 09-24-2009 07:08 AM
PubMed: High-precision, whole-genome sequencing of laboratory strains facilitates gen Newsbot! Literature Watch 0 08-02-2008 05:11 AM

Reply
 
Thread Tools
Old 01-24-2012, 01:13 PM   #1
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Question more precision about VarScan somatic

Hello,
I have used Varscan somatic today for the first time. I have provided the "normal" and the "tumoral" bam files with a reference, hg19.fasta. I'm looking for certain information, that I haven't found yet.

1) First, I would like to know if there exists a documentation for this tool, as a publication or a pdf which describe each option?

2) Then, I am very interested in understanding clearly all the p-values.
In the inputs, we can specify 2 p-values:
  1. p-value: P-value threshold to call a heterozygote
    So, I guess the null hypothesis is: H0 "the individual is homozygous at this site". We test H0 at the risk 0.1 (default value)%.
    If the p-value is less than 0.1% then, H0 is rejected: the individual is heterozygous.

    Am I right? I found a default value of 0.9% when reading the explanation given when using the software. Are they equivalent?

  2. somatic-p-value: P-value threshold to call a somatic site
    I guess the null hypothesis is: H0 "the variant is not somatic". We test H0 at the risk 0.0001 (default value)%.
    If the p-value is less than 0.0001% then, H0 is rejected: the variant is somatic.

In the outputs, there are also 2 p-values:
  1. somatic-p-value: Significance of tumor read count vs. normal read count
  2. variant-p-value: Significance of variant read count vs. baseline error rate

Somatic-p-value is most likely the computed p-value of the test: H0 "the variant is not somatic", for wich we gave a threhold in the inputs.

I don't understand very much the variant-p-value. When checking the output file, I understood that this is the result of the statistical test: H0 "the variant is not germline", but I'm not sure...

I wonder about the way to compute those p-values. In the website, I read that an exact Fisher test is computed. But for the tumor read count vs. normal read count, we have only one count per sample, so do they compare 1 value in each condition? How is it possible to use this test?

3) Third, I would like more precision about the inputs "purity" and strand filter". I have kept the default values because I dont know what they are exaclty.

4) Finally, I have the impression that the problem of false discovery rate is not taken into account... Is there a way to compute the adjuted p-values to control it?

Thank you for the help you can give me for any of these points
Jane

Last edited by Jane M; 01-24-2012 at 01:15 PM.
Jane M is offline   Reply With Quote
Old 01-25-2012, 12:51 AM   #2
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default

I will try to clarify a bit my questions.

About 2), for the somatic-p-value, I guess now that the null hypothesis is: H0 "the proportion of variants between the normal and tumoral samples is equal". We test H0 at the risk 0.0001 (default value)%.
Of course, we are looking for the positions which do not fit this hypothesis. And I understand why the exact fisher test is possible here.

I still do not see what is the null hypothesis for the p-value... H0: "the proportion of variants between the normal and tumoral samples is equal"?
And if the test is rejected, then the individual is heterozygous at this position?

In the website, there are:
Quote:
--p-value P-value threshold to call a heterozygote [1.0e-01]
--somatic-p-value P-value threshold to call a somatic site [1.0e-04]
And the description of the command in the software gives:
Quote:
--p-value - P-value threshold to call a heterozygote [0.99]
--somatic-p-value - P-value threshold to call a somatic site [0.05]
So which default values should be taken for p-value?

3)
Quote:
Can someone explains me what means purity here?
--normal-purity - Estimated purity (non-tumor content) of normal sample [1.00]
--tumor-purity - Estimated purity (tumor content) of tumor sample [1.00]
For strand filter:
Quote:
--strand-filter - If set to 1, removes variants with >90% strand bias
I understand that if the difference in the number of reads per strand is over 90%, then this variant is removed. Am I right? Is it a good threshod? I have no idea... I would say 95% when the region is well covered, but when the sequencing depth is short, it's tricky... For example in the case where there are 10 reads for the reference and 1 read for the variant. Should we remove it...

I hope everyone has already studied these issues and could provide me some help. I must admit that my analysis is a bit suspended...
Jane M is offline   Reply With Quote
Old 01-25-2012, 08:19 AM   #3
david.tamborero
Member
 
Location: spain

Join Date: Feb 2011
Posts: 60
Default

Hi, I was playing with Varscan and I would want to give you you some feedback in case it helps:

Quote:
2) Then, I am very interested in understanding clearly all the p-values.
Quote:
for the somatic-p-value, I guess now that the null hypothesis is: H0 "the proportion of variants between the normal and tumoral samples is equal". We test H0 at the risk 0.0001 (default value)%.
Of course, we are looking for the positions which do not fit this hypothesis. And I understand why the exact fisher test is possible here.
Yes, I would say that about the Ho. On the other hand, the cut-off value for definining the somatic status (column 13 in the output) should be editable by the user by the --somatic-p-value argument (i did not check it). The default p value should be the one from which the somatic status is declared as somatic in the output. However, in my results varscan has defined as somatic several positions with a p larger than 5% (up to a p=0.48!), and it has no sense to me...

Quote:
I don't understand very much the variant-p-value. When checking the output file, I understood that this is the result of the statistical test: H0 "the variant is not germline", but I'm not sure...
I guess so, since all the variants declared as somatic has a variant p-value of 1.0.

Quote:
I wonder about the way to compute those p-values. In the website, I read that an exact Fisher test is computed. But for the tumor read count vs. normal read count, we have only one count per sample, so do they compare 1 value in each condition? How is it possible to use this test?
Note that the reference genome has a single value for the base, but the germline variant of the normal sample can has some reads supporting the variant but also some others supporting the reference. Anyway, I'm not statistician, but Fisher should be able to lead with a contingency table with a 0 in some of its positions.

Quote:
Can someone explains me what means purity here?
--normal-purity - Estimated purity (non-tumor content) of normal sample [1.00]
--tumor-purity - Estimated purity (tumor content) of tumor sample [1.00]
I would say that there are samples that can be significantly more pure than others, for instance an hematological tumor can reach almost 100% of purity, whereas solid tumors has lower purity figures. Contamination between tumor/non tumor cells (and normal/non normal) in each sample should be somehow taken into account. How it is done by Varscan? Mistery for me...

Quote:
Finally, I have the impression that the problem of false discovery rate is not taken into account... Is there a way to compute the adjuted p-values to control it?
I also agree with you. I would say that when choosing a cut-off value for considering the p as significant, it should be corrected by multiple testing. In my case, in which I look for somatic mutations, I take all the variants declared as somatic by Varscan as the number to be used for correcting by Bonferroni. It's not very elegant, but it is the easiest and in my case (exomes of leukemia) I do not deal with a large number of variants so it's not that stringent. If it is not your case, you should take inhto account other multiple test correction techniques, as the FDR.

Really hope it helps, I am not very happy with some aspects of the Varscan performance/doc (which in overall is a great tool).
david.tamborero is offline   Reply With Quote
Old 01-25-2012, 12:27 PM   #4
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default

Thanks a lot for your answer David!
It helps me a lot to have feedbacks on your experience with Varscan and about the test interpretations.

Like you, I'm looking for somatic mutations between fibroblast and leukemia in exomes.

I would like to add one question on my list
Is it possible to avoid the germline variants in the output of Varscan? In my case, they represent the huge majority of the variants. The run could be faster...

Quote:
Originally Posted by david.tamborero View Post
Yes, I would say that about the Ho. On the other hand, the cut-off value for definining the somatic status (column 13 in the output) should be editable by the user by the --somatic-p-value argument (i did not check it). The default p value should be the one from which the somatic status is declared as somatic in the output. However, in my results varscan has defined as somatic several positions with a p larger than 5% (up to a p=0.48!), and it has no sense to me...
I think that the p-values of the test H0 "the proportion of variants between the normal and tumoral samples is equal" is given in the column "somatic_p_value" (Significance of tumor read count vs. normal read count). For this, we can choose a threshold: --somatic_p_value
The output "variant_p_value" (Significance of variant read count vs. baseline error rate) is probably for the germline status. But we cannot choose a threshold for that. Anyway, this tool is for somatic detection so nevermind.
We can also choose the "--p-value" (P-value threshold to call a heterozygote [0.99]), but we do not see its value in the output. Unless it's the one given in the "somatic-p-value" for LOH

Can you tell me which thresholds do you use for these 2 p-values and in particular for --p-value. For the latter (heterozygote call), is it rather 0.01 or 0.99? I have found contradictory information for that



Quote:
Note that the reference genome has a single value for the base, but the germline variant of the normal sample can has some reads supporting the variant but also some others supporting the reference. Anyway, I'm not statistician, but Fisher should be able to lead with a contingency table with a 0 in some of its positions.
I checked out the exact fisher test and it's ok now. It's indeed a contingency table so we have the data needed for this table in VarScan's output.


Quote:
I would say that there are samples that can be significantly more pure than others, for instance an hematological tumor can reach almost 100% of purity, whereas solid tumors has lower purity figures. Contamination between tumor/non tumor cells (and normal/non normal) in each sample should be somehow taken into account. How it is done by Varscan? Mistery for me...
I got some information from people in my (wet) lab. It seems that the purety is related to the absorbency and it's a ratio. The threshold of the samples that they usually have is around 1.8. We read also on the internet that it's usually between (at minimum I think) 1.5, 2. That's why I chose 1.5 for both purety thresholds for running VarScan, but I'm not convinced

I wonder if it is the ratio between the number of "real" tumoral cells and the number of "other" cells, which contaminate the tumoral samples.
In the case of tumoral sample, I can see approximately the meaning of purety. But it's less clear in the case of normal sample. Could it be, that among the fibroblasts, we have a proportion of something else?

Also, since we are working on the same kind of data, can I ask you which thresholds do you use?

Quote:
I also agree with you. I would say that when choosing a cut-off value for considering the p as significant, it should be corrected by multiple testing. In my case, in which I look for somatic mutations, I take all the variants declared as somatic by Varscan as the number to be used for correcting by Bonferroni. It's not very elegant, but it is the easiest and in my case (exomes of leukemia) I do not deal with a large number of variants so it's not that stringent. If it is not your case, you should take inhto account other multiple test correction techniques, as the FDR.
I will try to adjust the p-values after a reliable analysis (analysis with interesting inputs) with VarScan. If I find something interesting, I will come back to this topic to explain what I did.


Quote:
Really hope it helps, I am not very happy with some aspects of the Varscan performance/doc (which in overall is a great tool).
Yes, thanks a lot
Jane M is offline   Reply With Quote
Old 01-30-2012, 12:51 PM   #5
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default

Hello everybody,

I come back to this topic because I made some progress: I figured out almost every input, except the --p-value parameter. I don't see the interest to have 2 tests here.
For each potential mutation, we test if the proportion of variants in the normal sample equals if the proportion of variants in the tumoral sample. Ok

For the germline variant, the test will be accepted. What is the interest of the second test? With the variant-p-value, "this other test" can be accepted... What is the conclusion of the second test?
Jane M is offline   Reply With Quote
Old 01-31-2012, 02:37 PM   #6
dkoboldt
Member
 
Location: St. Louis

Join Date: Mar 2009
Posts: 62
Default

Hello Jane and David,

Thank you for posting this discussion; I'm the developer of VarScan and hope to be able to offer you some clarity. First and foremost, you can find the usage for the tool here:
http://varscan.sourceforge.net/using-varscan.html

An overview of how somatic mutation calling is performed can be found here:
http://varscan.sourceforge.net/somatic-calling.html

Finally, a manuscript describing the algorithm was accepted by Genome Research and should be out in the next couple of weeks.

Regarding the P-values

There are two p-value thresholds used by VarScan somatic:
1.) The "--p-value" is the FET p-value threshold required to call a variant in a single sample (normal or tumor) based upon the read counts. The null hypothesis is that the sample is homozygous-reference; under this hypothesis, all reads should support the reference base. By default, the --p-value threshold is set to 0.98, which tells VarScan to skip the Fisher's Exact Test when running in somatic mode, and instead use the heuristic thresholds (--min-coverage, --min-var-freq) to determine whether or not a sample is variant at a given position. This saves computation time, because a more informative p-value is computed by comparing the two samples.

2.) The "--somatic-p-value" is the FET p-value threshold required for a difference in read counts between tumor and normal to be deemed significant. Here, the null hypothesis is that the genotype of the tumor matches that of the normal. The "observed" read counts are the number of reference-supporting reads and variant-supporting reads in the tumor, and the "expected" are the number of reference-supporting reads and variant-supporting reads in the normal. If the proportion of variant-supporting reads differs between tumor and normal (e.g. LOH or somatic mutation), this p-value will get very small.


Regarding Purity

The --normal-purity and --tumor-purity input parameters are perhaps not well explained in the documentation. First of all, the default value for each of these is 1, and if you change them, it should be to a value between 0 and 1. These are designed to adjust somatic mutation calling thresholds for two situations:
1.) Tumor contamination of the normal (--normal-purity < 1), which happens in AML and some other liquid tumors. In other words, this is the proportion of cells in your "normal" sample that are infiltrating tumor cells, not normal tissue. When this is set to <1, we increase the minimum variant allele frequency for the normal sample to "ignore" variant-containing reads in the normal, which would otherwise suggest that it's a germline variant.

2.) Low tumor content (--tumor-purity < 1). This is for solid tumors, when you know that the purity of the tumor sample is low due to stromal cells, infiltrating immune cells, etc. When this is set to <1, we reduce the minimum variant allele frequency for the tumor sample to allow detection of mutations only present in a fraction of the cells in the "tumor" sample.

David, if you have suggestions or feedback on VarScan performance or documentation, please shoot me an e-mail to dkoboldt (at) genome (dot) wustl (dot) edu.
dkoboldt is offline   Reply With Quote
Old 02-01-2012, 06:47 AM   #7
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default

Thanks a lot for your explanation The tests are a bit clearer now for me, it will be easier to understand what is done !

I have 3 more questions:
1) I am using VarScan in somatic mode with these options:

Quote:
java -Xmx10g -jar VarScan.v2.2.8.jar somatic /data/patient2/sorted_vicjefibros.mpileup /data/patient2/sorted_vicje565.mpileup --output-snp /data/patient2/output_varsan_vicje.snp --output-indel /data/patient2/output_varsan_vicje.indel --min-coverage 10 --min-coverage-normal 10 --min-coverage-tumor 10 --min-var-freq 0.3 --min-freq-for-hom 0.75 --normal-purity 1 --tumor-purity 1 --min-avg-qual 25 --min-strands2 2 --min-reads2 3 --p-value 0.01 --somatic-p-value 0.01 --strand-filter 1
And then I use SomaticFilter:
Quote:
java -Xmx20g -jar VarScan.v2.2.8.jar somaticFilter /data/patient2/output_varsan_vicje.snp --min-strands2 2 --min-avg-qual 25 --min-var-freq 0.3 --p-value 0.01 --min-strands2 2 --min-reads2 3 --indel-file /data/patient2/output_varsan_vicje.indel --output-file /data/patient2/output_somaticFilter_varsan_vicje.snp
Nervertheless, I have still results with tumor_reads2_plus=0 or tumor_reads2_minus=0. Is there something wrong with this option?

2) As output, I get for example this:
Code:
161526258 positions in tumor
160697950 positions shared in normal
81328216 had sufficient coverage for comparison
81249700 were called Reference
4 were mixed SNP-indel calls and filtered
2888 were removed by the strand filter
39667 were called Germline
12065 were called LOH
17183 were called Somatic
6709 were called Unknown
0 were called Variant
Overall, I have 75624 variants, but when I read the file in R, I have only around 65000 variants. Do you have an explanation for these missing values? I thought I lost them when opening it with some line-limitated softwares, but I ran Varscan two times and I got the same problem.

3)Finally, why do we get all p-values, even the ones over the p-values thresholds, when setting the --p-value and --somatic-p-value?

Thank you for your help and for your work on VarScan !
Jane
Jane M is offline   Reply With Quote
Old 02-02-2012, 08:46 AM   #8
david.tamborero
Member
 
Location: spain

Join Date: Feb 2011
Posts: 60
Default

Thanks for your answer, dkoboldt.

I've re-sent you a mail explaining what it seems a bug in Varscan performance during the count of ref/variant bases depending on base quality (ie. the --min-avg-qual argument).

There is also an incomplete discussion about it in http://seqanswers.com/forums/showthread.php?t=16599.

Thanks again for the answer and for all the work in Varscan!

cheers,
david
david.tamborero is offline   Reply With Quote
Old 02-10-2012, 05:21 AM   #9
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default

Hello everybody!

I have 2 more questions:

1) Why are the default parameters
Code:
min-coverage-normal - Minimum coverage in normal to call somatic [8] 
min-coverage-tumor - Minimum coverage in tumor to call somatic [6]
Why are we less stringent about the minimum number of reads in tumor than about the minimum number of reads in normal?

2) I chose to use varscan somatic rather varscan "alone" because it's faster. I can analyze my two samples (normal and tumoral) at once and have a "direct" comparison.
Are there advantages more technical rather than easiness ? Like no error propagation?

Thanks,
Jane
Jane M is offline   Reply With Quote
Old 02-24-2012, 01:25 AM   #10
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default

I would like to come back to this topic, especially to my last 2 questions.
I read the publication about VarScan 2 and I have a beginning of answer. It seems that my 2 questions are related.

At the very beginning of the article, there is "Simply subtracting variants in the matched normal from variants in the tumor is poorly suited for the analysis of exome sequence data, because it fails to account for regions that were undersampled in the normal".

I do not understand clearly this argument but it seems to me to be the reason why we are asking for more reads in the normal than in the tumoral sample.

Can someone clarify the reason for me please? I understand that if the normal sample has not been "read a lot", then the variant won't be seen by a comparison normal/reference, whereas it would be seen in a normal/tumoral comparison...

To my point of view, the advantage of doing a direct comparison is the following:
When doing a normal/reference comparison (A1), we introduce errors. Let us call it E1.
When doing a tumoral/reference comparison (A2), we introduce errors. Let us call it E2.
When doing the difference A2-A1, we introduce again errors, which is as big as the errors E1 and E2 or a combination of both. Let us call it E3.
Finally, we have 3 "degrees of errors".

When doing the direct comparison, we test the difference between the proportion of variants in normal and tumoral samples, at a certain level of significance. We introduce an error ERR1. Here we only have 1 "degree of error". So this process "produces" less error because it introduces less occasions to do such errors.

Can someone tell me if I am understanding right?
Thanks for your help
Jane M is offline   Reply With Quote
Old 03-16-2012, 10:29 AM   #11
dkoboldt
Member
 
Location: St. Louis

Join Date: Mar 2009
Posts: 62
Default Answers to VarScan questions

Jane M,

Thank you for posting your detailed questions, and I apologize for my delay in replying. Here is my summary of the outstanding issues with some answers:

1.) somaticFilter: Overall, I have 75624 variants, but when I read the file in R, I have only around 65000 variants. Do you have an explanation for these missing values? Why are we less stringent about the minimum number of reads in tumor than about the minimum number of reads in normal?
The counters in the somaticFilter tool are just rough estimates. Some things are counted twice, e.g. a variant that was initially called Somatic but then removed by the filter. The ultimate, accurate counts should be obtained from the resulting files.

2.) I chose to use varscan somatic rather varscan "alone" because it's faster. I can analyze my two samples (normal and tumoral) at once and have a "direct" comparison. Are there advantages more technical rather than easiness ? Like no error propagation?
Yes, there are a couple of advantages to running VarScan in the somatic mode as opposed to single-sample. First, it increases your power to detect germline variants and somatic mutations, especially at positions of low coverage. Second, it achieves a more accurate result by directly comparing read counts between tumor and normal rather than between each sample and the null hypothesis (same coverage, no variant). Finally, with SAMtools mpileup of normal and tumor BAMs input to VarScan (with the --mpileup parameter), it should run faster than running each sample individually

3.) At the very beginning of the article, there is "Simply subtracting variants in the matched normal from variants in the tumor is poorly suited for the analysis of exome sequence data, because it fails to account for regions that were undersampled in the normal". I do not understand clearly this argument but it seems to me to be the reason why we are asking for more reads in the normal than in the tumoral sample. Can someone clarify the reason for me please? I understand that if the normal sample has not been "read a lot", then the variant won't be seen by a comparison normal/reference, whereas it would be seen in a normal/tumoral comparison...
Yes, the reason that we recommend slightly higher coverage for the normal sample is a simple one: 99% of the variants in a tumor genome are inherited (not acquired). These are likely to be mistaken for somatic mutations if the normal coverage was low enough that the variant allele was not observed. In contrast, if the tumor coverage were low, we might mistake it as LOH but we'd be unlikely to call it somatic because the variant would be observed in the normal.

Please let me know if you have any additional questions!

~Dan Koboldt
dkoboldt is offline   Reply With Quote
Old 04-17-2012, 12:40 AM   #12
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default

Thank you Dan for your help and all your explanations.
I made big advances with the analysis of my data. I was finalizing my results, when I noticed something weird in the .indel file output for each variant.
Here is an exmaple:

Quote:
chrom position ref var normal_reads1 normal_reads2 normal_var_freq normal_gt tumor_reads1 tumor_reads2 tumor_var_freq tumor_gt somatic_status variant_p_value somatic_p_value tumor_reads1_plus tumor_reads1_minus tumor_reads2_plus tumor_reads2_minus
chr1 7861153 C +T 35 164 82,41% +T/+T 31 129 80,62% +T/+T Germline 6.46318091323176E-137 0.7167904206848322 144 12 118 11
As you can notice, tumor_reads1 is different from tumor_reads1_plus + tumor_reads1_minus

One can see that nevertheless, tumor_reads2 is equal to tumor_reads2_plus + tumor_reads2_minus. I don't understand why I see this difference and why they concern only the reads supporting the reference and not those supporting the variant.
I assume that I have the same problem in the reference sample but I cannot see it since I do not have the details of the strand.

I would like to know more specifically how do you count the reads supporting indels. Do you consider only the reads of the first nucleotide involved in the indel?

This problem can be important if the "right" number of reads is provided by tumor_reads1_plus + tumor_reads1_minus and not by tumor_reads1, because it could change the status (somatic, germiline, loh)

Has everyone noticed this? How can I solved this issue?
Thanks,
Jane
Jane M is offline   Reply With Quote
Old 04-19-2012, 08:04 AM   #13
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default

I would like to ask one more question about the parameters min-reads2=2 and min-strands2=2.

When I specified min-strands2=2 and min-strands2 (with --min-avg-qual 20), I get for example:
Quote:
chr2 152107899 T C nonsynonymous SNV 97 56 36,6% Y 173 0 0% T LOH 1 5.7297e-22 107 66 0 0
In IGV:
Quote:
99 (reference in normal sample) 56 (variant in normal sample) 183 (reference in tumoral sample) 1 (variant in tumoral sample)
The only variant in the tumoral sample has a phred of 19.
So, when I use --min-avg-qual 20, I do not detect this read and this position is variant.
When using --min-avg-qual 19, I do detect it and this position does not appear anymore in my output file.

So my question concerns the 2 parameters min-reads et min-strands: If the number of reads is different from 0, then should I have min-reads and min-strands over the specified threshold in all cases?
In the case that I show, the only variant read is most likely an error and because I do not have 2 reads and or 2 strands, the position is not anymore considered as variant...

That's a pity because I have to take min-reads=1 or 0 and min-strands=0 or 1 in order to avoid losing true variants (like the one I am loosing), but I will probably increase the number of false variants.

What do you think?
Can someone give me more information about both parameters min-reads et min-strands?

Thank you in advance
Jane M is offline   Reply With Quote
Old 04-30-2012, 06:55 AM   #14
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default

Any suggestion? Any feedback?
Jane M is offline   Reply With Quote
Old 05-17-2012, 11:28 AM   #15
dkoboldt
Member
 
Location: St. Louis

Join Date: Mar 2009
Posts: 62
Default

Jane,

Thank you for following up; I apologize for my delay in replying.

Indel Representation in SAMtools Mpileup
To understand how VarScan counts reads for indel variants, it's important to know how SAMtools represents these variants in its pileup output. SAMtools pileup output contains one line per genomic position. At the first base of an indel variant, insertions are represented as +NNN and deletions as -NNN (where NNN represents the bases that are inserted or deleted) and reference bases are represented as "." or ",".

VarScan Read Counting
At a specific position, VarScan computes the number of reads supporting the reference base ("." or ",") as well the number supporting each variant allele observed. In Somatic mode, the most frequently observed variant allele is what's evaluated between normal and tumor.

Read Count Columns in VarScan Output
The normal_reads1, normal_reads2, tumor_reads1, and tumor_reads2 columns should be considered the most accurate counts of reference and variant supporting reads. These are computed independently of the reads1plus/reads1minus/reads2plus/reads2minus tabulations. The latter were never really intended for use by end users. While there does seem to be an issue in your example, it should not affect the somatic mutation calling because strand-specific counts are not used for the mutation calling. Could you give me the normal and tumor pileup for the position in your example?

Min-Reads2 and Min-Strands2
The min-reads2 parameter specifies the minimum number of variant-supporting reads required for VarScan to call a variant. This is a useful parameter for improving the specificity of mutation calling; internally, we require at least 4 supporting reads (in most cases) to call a variant.

The min-strands2 parameter is a more blunt parameter that I generally recommend against using. By setting this to 2, you tell VarScan only to report variants observed on both strands. In theory, this might remove some false positives but will certainly remove some true variants that are under-represented on one strand.

I recommend that you don't use this parameter, and instead use the false positive filter described in the VarScan 2 publication (it requires an accessory script available on VarScan's SourceForge page, and a bam-readcount utility available on GitHub).
dkoboldt is offline   Reply With Quote
Old 05-22-2012, 12:13 AM   #16
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default

Dan, thank you a lot for your detailed answer.
1) I would like to come back to the Min-Reads2 and Min-Strands2 parameters.

Quote:
Originally Posted by dkoboldt View Post
Min-Reads2 and Min-Strands2
The min-reads2 parameter specifies the minimum number of variant-supporting reads required for VarScan to call a variant. This is a useful parameter for improving the specificity of mutation calling; internally, we require at least 4 supporting reads (in most cases) to call a variant.
I understand why they are used for, but I'm wondering what is happening in some cases, for example:
  1. 50 (reference in normal sample) 50 (variant in normal sample) 100 (reference in tumor sample) 0 (variant in tumoral sample)
  2. 0 (reference in normal sample) 100 (variant in normal sample) 100 (reference in tumor sample) 0 (variant in tumoral sample)

Since there is no variant in the tumor samples, if we use Min-Reads2>0, the position won't be considered as "LOH", whereas it is, am I right?


2) Then, I would like to know if VarScan could be use to compare 2 tumor samples. For several patients, I have a normal sample, an untreated sample and a treated sample.
I want to study the effect of the treatment.
I am comparing normal/untreated (1), normal/treated (2) and finally (1) vs (2). But it would be much faster to directly compare untreated/treated. I am wondering if there is something which prevents to do so in VarScan. What do you think ?


3) Finally, for the issue concerning the number of reads in .indel file, here are 2 postions:

Varscan output:
Code:
chrom	position	ref	var	normal_reads1	normal_reads2	normal_var_freq	normal_gt	tumor_reads1	tumor_reads2	tumor_var_freq	tumor_gt	somatic_status	variant_p_value	somatic_p_value	tumor_reads1_plus	tumor_reads1_minus	tumor_reads2_plus	tumor_reads2_minus																																		
chr1	20098	C	-AG	10	0	0%	C	6	2	25%	*/-AG	Somatic	1.0	0.1830065359477151	7	1	1	1																				
chr1	3418647	G	-C	16	0	0%	G	7	5	41,67%	*/-C	Somatic	1.0	0.008058608058608	4	6	4	1
Mpileup output:
Code:
$ grep -w 20098 fibros.mpileup
chr1	20098	c	10	.$..,,,.,.,	EJJIH?EEDD

$ grep -w 20098 tumor.mpileup
chr1	20098	c	8	....-2AG.,-2ag..	JJIEJEHF
Thank you for your help,
Jane
Jane M is offline   Reply With Quote
Old 06-11-2012, 03:18 PM   #17
cdry7ue
Member
 
Location: Austin

Join Date: Feb 2011
Posts: 20
Default

Dan,
I was wondering how I could get each position specific hypothesis outputted from
pileup2snp, irrespective of quality,%variant etc. So absolutely no filtering of any kind.

I tried
--variant 1
--validation 1
--variant 0

etc but I still get filtered output only.

Also is there a way for the p-value to not come out as 0.98 by default. That problem is
solved by a ghetto trick of setting --p-value 0.9779 but is there an official way.

Thanks

-Ashish
cdry7ue is offline   Reply With Quote
Old 06-18-2012, 09:52 AM   #18
shyam_la
Member
 
Location: California

Join Date: Mar 2012
Posts: 97
Default

Quote:
Originally Posted by Jane M View Post
Dan, thank you a lot for your detailed answer.
1) I would like to come back to the Min-Reads2 and Min-Strands2 parameters.



I understand why they are used for, but I'm wondering what is happening in some cases, for example:
  1. 50 (reference in normal sample) 50 (variant in normal sample) 100 (reference in tumor sample) 0 (variant in tumoral sample)
  2. 0 (reference in normal sample) 100 (variant in normal sample) 100 (reference in tumor sample) 0 (variant in tumoral sample)

Since there is no variant in the tumor samples, if we use Min-Reads2>0, the position won't be considered as "LOH", whereas it is, am I right?


2) Then, I would like to know if VarScan could be use to compare 2 tumor samples. For several patients, I have a normal sample, an untreated sample and a treated sample.
I want to study the effect of the treatment.
I am comparing normal/untreated (1), normal/treated (2) and finally (1) vs (2). But it would be much faster to directly compare untreated/treated. I am wondering if there is something which prevents to do so in VarScan. What do you think ?


3) Finally, for the issue concerning the number of reads in .indel file, here are 2 postions:

Varscan output:
Code:
chrom	position	ref	var	normal_reads1	normal_reads2	normal_var_freq	normal_gt	tumor_reads1	tumor_reads2	tumor_var_freq	tumor_gt	somatic_status	variant_p_value	somatic_p_value	tumor_reads1_plus	tumor_reads1_minus	tumor_reads2_plus	tumor_reads2_minus																																		
chr1	20098	C	-AG	10	0	0%	C	6	2	25%	*/-AG	Somatic	1.0	0.1830065359477151	7	1	1	1																				
chr1	3418647	G	-C	16	0	0%	G	7	5	41,67%	*/-C	Somatic	1.0	0.008058608058608	4	6	4	1
Mpileup output:
Code:
$ grep -w 20098 fibros.mpileup
chr1	20098	c	10	.$..,,,.,.,	EJJIH?EEDD

$ grep -w 20098 tumor.mpileup
chr1	20098	c	8	....-2AG.,-2ag..	JJIEJEHF
Thank you for your help,
Jane
Hi Jane,

I am new to bioinformatics. Infact, Im a biologist.
Since you appear to have dissected varscan over months, what is your opinion of it now? Do you think it fares well? I have used MuTect (beta) for my mutation calls on tumor-normal paired exomes and it seems to work well..
Have you used it?
Which one do you think is more reliable, from your experience?

I am confused about the varscan output too - particularly the somatic p-value column, which goes all the way upto "1" from 10 orders down, for sites that have been called somatic.
Even after removing all the sites with somatic p > .05 manually from the output - varscan2 still outputs 25% more sites than MuTect. I don't know which one to choose, if it possible to say one is better than the other at all..

Thanks.

Last edited by shyam_la; 06-18-2012 at 09:58 AM.
shyam_la is offline   Reply With Quote
Old 06-18-2012, 10:39 AM   #19
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 239
Default

Hello,

I haven't used MuTect, so I cannot compare. I would like to hear your experience with it. Could you give me the range of the number of mutations that you have detected wit MuTect? Did you check it with classical sequencing?

I am using VarScan for a few months, but only VarScan2. I have detected somatic mutations and LOH, but we haven't verified them by sequencing yet.
I have added some filters to reduce my list of variants, as the biologists wanted. To increase reliability, we can tune the different thresholds (and it takes time) but it would be great to make a cautious choice based on a solid argument. I am rather satisfied with the results now. Overall, the tool is good, besides some potential issues that I reported on this website.

Instead of removing all the sites with somatic p > .05, I could advise you to take a look at the adjusted p-values. You will remove much more variants and it's more logical to use that. Here again, the threshold will be "an arbitrary" choice...

When a site is "somatic", the somatic p-value shouldn't be 1, it's weird... By the way, did you read the post of dkoboldt regarding the p-values?
Jane M is offline   Reply With Quote
Old 06-18-2012, 01:37 PM   #20
shyam_la
Member
 
Location: California

Join Date: Mar 2012
Posts: 97
Default

Hi,

Well, I started on this project only 3 weeks back and learnt everything from scratch. Right now I have only 1 sample; more raw reads will be coming in soon. The mutation rate was ~100 / mbasepair on this sample which even though high, is consistent with this patients medical history.

I trust MuTect with its calls now, because the SNPs it gave me were over 95% C > T and G > A which is expected for this type of tumor and also there were stop codons in genes that were expected to have them in this type of tumor. But no not verified by classical sequencing yet.. Will do that when there are many more samples..

The only reason I was unhappy was I had to downgrade to older ref databases (dbsnp132 and the fasta file available on the mutect ftp), to get it to work in the first place, and so was searching for alternatives. I tested SomaticSniper yesterday and VarScan2 today.. I think both give too many false positives and require too much tweaking (for a biologist)..
But I managed to get MuTect to work with GRCh37.67 and dbSNP135 today (I just like to use the most updated stuff :P ). So my search ends here.. I am sticking with MuTect.
In my opinion, you should give it a shot - its much easier to use than VarScan, runs faster (-Xmx8g for both) and definitely makes great calls. But it doesn't provide info about LoH, if that is important to you..

Thanks.

PS: MuTect doesn't run with java 7.. Just discovered it today. You will have to use JRE 6..

Last edited by shyam_la; 06-18-2012 at 01:39 PM.
shyam_la is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:15 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO