SEQanswers

Old 03-13-2014, 09:04 AM   #1
lethalfang
Member
 
Location: San Francisco, CA

Join Date: Aug 2011
Posts: 91
Copy number variation in cancer WGS

Hi guys,

I am looking at some whole genome sequencing (WGS) for tumor-normal pairs, and I want to find somatic copy number alteration in the tumors.
What tools do you guys recommend for these?

I have read about a few, e.g., BIC-Seq, OncoSNP-SEQ, CREST, etc., but have no experience. Any recommendations?

Thanks in advance.
Old 03-14-2014, 05:50 AM   #2
TiborNagy
Senior Member
 
Location: Budapest

Join Date: Mar 2010
Posts: 329
Default

Control-FREEC is a good choice.
And an interesting article:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3875755/
Old 03-14-2014, 12:39 PM   #3
lethalfang
Member
 
Location: San Francisco, CA

Join Date: Aug 2011
Posts: 91
Default

Thanks. Yeah, I read that article just recently, and I'd like to hear about people's experiences with the different software.

Control-FREEC seems pretty good. What's your experience with it? I also want to try out BIC-Seq, OncoSNP-SEQ, and CNAnorm. I want to see how they perform. Does anyone have any experience with those?
Old 03-17-2014, 06:11 AM   #4
TiborNagy
Senior Member
 
Location: Budapest

Join Date: Mar 2010
Posts: 329
Default

Control-FREEC is easy to use if you are working with human or mouse. If you are working with non-model organisms it is a bit tricky, because you need to generate the mappability files. Read the documentation carefully, because there are a lot of settings. But we were happy with the results.
Old 03-17-2014, 09:33 AM   #5
lethalfang
Member
 
Location: San Francisco, CA

Join Date: Aug 2011
Posts: 91
Default

Quote:
Originally Posted by TiborNagy View Post
Control-FREEC is easy to use if you are working with human or mouse. If you are working with non-model organisms it is a bit tricky, because you need to generate the mappability files. Read the documentation carefully, because there are a lot of settings. But we were happy with the results.
I'm working with some tumor-normal WGS pairs. Finished some runs, trying to make the plot........ which is taking some time.

I ran Control-FREEC on a pair of simulated tumor-normal data sets from this Synapse project: https://www.synapse.org/#!Synapse:syn312572/wiki/60702

It took 3 hours to complete. The *_ratio.txt file is 113M. It took 3 minutes to produce the graph.
For the real tumor-normal pair (the data file is slightly larger), it took 14 hours to complete. The *_ratio.txt file is 169M. It took about an hour to finish plotting.

Why that discrepancy in time?

Last edited by lethalfang; 03-17-2014 at 09:43 AM.
Old 03-17-2014, 10:48 AM   #6
lethalfang
Member
 
Location: San Francisco, CA

Join Date: Aug 2011
Posts: 91
Default

Just checking, it's a bad idea to split copy number analysis jobs chromosome by chromosome, right? It would be bad for the statistical power.
Old 03-17-2014, 12:27 PM   #7
lethalfang
Member
 
Location: San Francisco, CA

Join Date: Aug 2011
Posts: 91
Default

I'm wondering, for tumor-normal paired WGS, what are the recommended settings for some of the following:

coefficientOfVariation:
The test file had 0.062. The manual has an "example" of 0.05. What's a good parameter from people's experience?

forceGCcontentNormalization:
Does GC content normalization improve the results for tumor-normal pairs? If so, is 1 or 2 better?
1: normalize GC content first, and then calculate sample/control ratio.
2: calculate sample/control ratio first, and then normalize GC content.

intercept:
1 - with GC content
0 - with a control dataset.
What if I have both?

minCNAlength:
What's a good setting based on people's experience?

Any other setting that people find particularly better for tumor-normal pairs?

Thanks.
Old 03-18-2014, 06:28 AM   #8
TiborNagy
Senior Member
 
Location: Budapest

Join Date: Mar 2010
Posts: 329
Default

Quote:
Originally Posted by lethalfang View Post
Why that discrepancy in time?
It can be anything: slow hard disk, old computer, etc.

Quote:
Just checking, it's a bad idea to split copy number analysis jobs chromosome by chromosome, right?
I am not an expert, but I think it is not a bad idea.

Last edited by TiborNagy; 03-18-2014 at 06:29 AM. Reason: spell correction
Old 04-09-2014, 02:20 PM   #9
ffavero
Member
 
Location: London

Join Date: Apr 2014
Posts: 12
Default

Quote:
Originally Posted by lethalfang View Post
Hi guys,

I am looking at some whole genome sequencing (WGS) for tumor-normal pairs, and I want to find somatic copy number alteration in the tumors.
What tools do you guys recommend for these?

I have read about a few, e.g., BIC-Seq, OncoSNP-SEQ, CREST, etc., but have no experience. Any recommendations?

Thanks in advance.
Hi, you could try sequenza; it is available on CRAN.

It consists of a Python script that generates a suitable input file for the R package; alternatively, VarScan2 output can be used.

It was developed for exome sequencing, but it works even better with whole-genome data.

It's available from the institute page or from CRAN. The Python script is bundled with the R package, along with the documentation and example data.

As usual, higher depth and higher tumor content are a good thing, but I managed to analyse tumor samples with relatively low depth (10x), as well as samples with around 20% tumor content, with satisfying results.
Old 04-09-2014, 03:51 PM   #10
lethalfang
Member
 
Location: San Francisco, CA

Join Date: Aug 2011
Posts: 91
Default

Quote:
Originally Posted by ffavero View Post
Hi, you could try sequenza; it is available on CRAN.

It consists of a Python script that generates a suitable input file for the R package; alternatively, VarScan2 output can be used.

It was developed for exome sequencing, but it works even better with whole-genome data.

It's available from the institute page or from CRAN. The Python script is bundled with the R package, along with the documentation and example data.

As usual, higher depth and higher tumor content are a good thing, but I managed to analyse tumor samples with relatively low depth (10x), as well as samples with around 20% tumor content, with satisfying results.
Cool.
Is it possible to see the pre-print of your submitted paper, or do you have to wait until it's published?
Old 04-10-2014, 01:01 AM   #11
ffavero
Member
 
Location: London

Join Date: Apr 2014
Posts: 12
Default

:-).
Well, I could but I'd like to see what the reviewers/editors have to say first.

We just describe the algorithm and compare the results of running it on exomes with the respective SNP arrays from TCGA, and we also compare with the results of other similar algorithms.
Sequenza's results, when not exactly the same, were pretty close to the SNP array predictions...
Old 04-10-2014, 07:16 AM   #12
lethalfang
Member
 
Location: San Francisco, CA

Join Date: Aug 2011
Posts: 91
Default

Quote:
Originally Posted by ffavero View Post
:-).
Well, I could but I'd like to see what the reviewers/editors have to say first.

We just describe the algorithm and compare the results of running it on exomes with the respective SNP arrays from TCGA, and we also compare with the results of other similar algorithms.
Sequenza's results, when not exactly the same, were pretty close to the SNP array predictions...
That sounds like OncoSNP-SEQ.
Old 04-10-2014, 09:03 AM   #13
ffavero
Member
 
Location: London

Join Date: Apr 2014
Posts: 12
Default

Quote:
Originally Posted by lethalfang View Post
That sounds like OncoSNP-SEQ.
Well, OncoSNP-SEQ has a different inference implementation, plus it uses dbSNP to set heterozygous positions, while we use information from the germline.

It's written in MATLAB, while we have implemented ours in R, which should make it easier to use, I suppose.

Anyway, I started the software from scratch, to have something that works properly with exomes, without borrowing any concept from OncoSNP. They are pretty different from each other, I would say.

I've tried to use OncoSNP-SEQ on exomes, but it doesn't work well, as warned in the manual.
Sequenza, however, also works pretty well with WGS.
If you want to do some testing, it shouldn't be difficult to try them both and benchmark the difference.
Old 04-10-2014, 09:07 AM   #14
lethalfang
Member
 
Location: San Francisco, CA

Join Date: Aug 2011
Posts: 91
Default

Quote:
Originally Posted by ffavero View Post
Well, OncoSNP-SEQ has a different inference implementation, plus it uses dbSNP to set heterozygous positions, while we use information from the germline.

It's written in MATLAB, while we have implemented ours in R, which should make it easier to use, I suppose.

Anyway, I started the software from scratch, to have something that works properly with exomes, without borrowing any concept from OncoSNP. They are pretty different from each other, I would say.

I've tried to use OncoSNP-SEQ on exomes, but it doesn't work well, as warned in the manual.
Sequenza, however, also works pretty well with WGS.
If you want to do some testing, it shouldn't be difficult to try them both and benchmark the difference.
Yeah, installing the MATLAB runtime was a pain.
It's much better to have software that doesn't require installing a proprietary third-party language.
Old 05-02-2014, 12:13 AM   #15
lethalfang
Member
 
Location: San Francisco, CA

Join Date: Aug 2011
Posts: 91
Default

Was trying

Quote:


sequenza-utils.py pileup2seqz \
-gc ~/references/hg19.gc50Base.txt.gz \
-n ~/pileups_benchmark/wgs.normal.pileup.gz \
-t ~/pileups_benchmark/wgs.tumor.pileup.gz \
| bgzip > ~/out.seqz.gz

And I got an empty output file, with only the header. No error message.
The two pileups have chromosomes labeled as 1, 2, 3, ..., X, Y, M.

Any idea why?

Thanks.
Old 05-02-2014, 05:26 AM   #16
ffavero
Member
 
Location: London

Join Date: Apr 2014
Posts: 12
Default

Hi lethalfang,
the chromosomes in the 2 pileups and in the GC file are in the same order, right?
Also note that the pileups have to be generated with the FASTA reference (the -f argument); otherwise there might be problems.
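
A quick way to check the ordering before running the conversion is to list the chromosome names of each file in order of first appearance and compare them. The following is only a minimal sketch and not part of sequenza: the function names are made up for illustration, and it assumes whitespace-separated pileup columns and a variableStep wiggle GC file like the ones discussed in this thread.

```python
def pileup_chrom_order(lines):
    """Chromosome names from column 1 of a pileup, in order of first appearance."""
    order = []
    seen = set()
    for line in lines:
        fields = line.split(None, 1)
        if not fields:
            continue  # skip blank lines
        chrom = fields[0]
        if chrom not in seen:
            seen.add(chrom)
            order.append(chrom)
    return order


def wig_chrom_order(lines):
    """Chromosome names from the 'chrom=' field of wiggle declaration lines."""
    order = []
    seen = set()
    for line in lines:
        if not line.startswith(("variableStep", "fixedStep")):
            continue  # data line, not a declaration
        for token in line.split()[1:]:
            if token.startswith("chrom="):
                chrom = token[len("chrom="):]
                if chrom not in seen:
                    seen.add(chrom)
                    order.append(chrom)
    return order


if __name__ == "__main__":
    # Tiny made-up inputs, just to show the comparison.
    normal = ["1\t10000\tN\t2\t^$A^]A\t<<", "2\t500\tA\t1\t.\tC"]
    gc = ["variableStep chrom=chr1 span=50", "1 50.0",
          "variableStep chrom=chr2 span=50", "1 44.0"]
    print(pileup_chrom_order(normal))
    print(wig_chrom_order(gc))
    print(pileup_chrom_order(normal) == wig_chrom_order(gc))
```

For real files, the lines can come from `gzip.open(path, "rt")`, since bgzip output is readable as ordinary gzip. A mismatch in either the order or the names themselves (for example `1` versus `chr1`) would show up in the comparison.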

You could try lowering the '-n' parameter, so that positions with less depth are taken into account.
The default is 20, so to be included a position needs at least 10 reads in the normal and 10 in the tumor (or any other combination where the sum is 20). This might be too high for low-pass WGS.
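
To make the threshold concrete, the filter described above can be sketched as follows. This is an illustration rather than sequenza's actual code: the function names are hypothetical, and whether the real comparison is inclusive (>=) is an assumption.

```python
def pileup_depth(pileup_line):
    """Read depth is the 4th column of a samtools pileup line."""
    return int(pileup_line.split()[3])


def passes_depth_filter(normal_line, tumor_line, min_total=20):
    """Keep a position when normal depth + tumor depth reaches the threshold.

    Assumption: the sum is compared with >=; the real script may differ.
    """
    return pileup_depth(normal_line) + pileup_depth(tumor_line) >= min_total


if __name__ == "__main__":
    # Made-up lines for one position in a low-pass WGS pair.
    normal = "1\t10001\tT\t7\t.......\tCC@CBDE"
    tumor = "1\t10001\tT\t9\t.........\tCC@CBDEDE"
    print(passes_depth_filter(normal, tumor))                # default threshold
    print(passes_depth_filter(normal, tumor, min_total=10))  # relaxed threshold
```

With 7 reads in the normal and 9 in the tumor the sum is 16, so the position is dropped at the default of 20 but kept once the threshold is lowered to 10.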

If you have a chance to paste part of the content of your 3 files (eg in pastebin or similar) I could have a look and see if there is something clearly wrong.

EDIT: additionally you could have a look here https://bitbucket.org/ffavero/sequen...Sequenza_Utils, for tips on how to use sequenza-utils.

Last edited by ffavero; 05-02-2014 at 07:22 AM.
Old 05-02-2014, 07:47 AM   #17
lethalfang
Member
 
Location: San Francisco, CA

Join Date: Aug 2011
Posts: 91
Default

Yes, they're in the same order. They were generated using the following command (the pileups were first created for VarScan2).
The reference used here is the Broad Institute's version B37, i.e., the chromosomes written as 1, 2, 3, ..., X, Y, MT.

Quote:
samtools mpileup -B -q 1 -f b37.fa WGS.sorted.aligned.bam > WGS.pileup
That pileup was then bgzipped into .pileup.gz

The GC content file was generated using the hg19 version, i.e., chromosomes written as chr1, chr2, chr3, ..., chrX, chrY, chrM.

I tried to create the GC content file with the b37.fa, but that b37 FASTA file has a small number (maybe no more than 1000) of characters that are not G, C, T, A, or N. The script failed when it tried to count M, R, etc., due to a dictionary key error. In any case, the GC content file shouldn't be the cause of an empty seqz file.

First 5 lines of the normal pileup.gz
Quote:
1 10000 N 2 ^$A^]A <<
1 10001 T 17 CC^0.^-.^0.^*.^-.^+.^>.^2.^>.^>.^+.^>.^,.^&.^0. CC@CBDEDED@??E<<C
1 10002 A 33 C.................^2.^>.^-.^*.^>.^>.^2.^-.^/.^-.^0.^".^-.^$.^2. DDACBB@DBCB@@=DCCCB::A@C?:A9?:<=D
1 10003 A 43 .................................^2.^*.^>.^&.^+.^$.^$.^:.^8.^*. CEDBEDCE=EEDB6EEEEDCCCBD3DC?@CEDE>A::==AC=<
1 10004 C 58 ...........................................^".^>.^5.^8.^5.^>.^*.^$.^*.^'.^5.^*.^-.^$.^>. BBBABBBC?AB@BACBBCBBBBC?ABB>;BCBB?ABBAB@BBC?:A@:AA<C=D?B

First 5 lines of the gc content file:
Quote:
variableStep chrom=chrM span=50
1 50.0
51 56.0
101 52.0
151 36.0
201 22.0
251 44.0
301 62.0
351 40.0
401 42.0

All the pileups and gc-content files are bgzipped (into gz). Hope that isn't the problem.


Never mind, it may be due to the gc-content file with wrong chromosome orders. Let me fix that and try again.

Last edited by lethalfang; 05-02-2014 at 01:09 PM.
Old 05-02-2014, 10:33 AM   #18
lethalfang
Member
 
Location: San Francisco, CA

Join Date: Aug 2011
Posts: 91
Default

Update:

Simply re-ordering the chromosomes wasn't enough. I converted chrom=chr1 into chrom=1, etc, and now it's working.

Because the pileup files were generated using the b37 format (i.e., chromosomes were named 1, 2, 3, ..., X, Y, MT), and the gc-content file was generated using the hg19 format (i.e., chromosomes were named chr1, chr2, chr3, ..., chrX, chrY, chrM), the chromosome names did not match.
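
For anyone hitting the same mismatch, the renaming can be scripted. This is a minimal sketch of the kind of conversion I did, not something shipped with sequenza; the function names are made up, and it only handles the declaration lines of a wiggle file like the GC file shown earlier.

```python
def hg19_to_b37(name):
    """Map an hg19-style chromosome name to the b37 style (chr1 -> 1, chrM -> MT)."""
    if name == "chrM":
        return "MT"
    if name.startswith("chr"):
        return name[len("chr"):]
    return name  # already b37-style, leave untouched


def rename_wig_line(line):
    """Rewrite the chrom= field of a wiggle declaration line; pass data lines through."""
    line = line.rstrip("\n")
    if not line.startswith(("variableStep", "fixedStep")):
        return line
    tokens = []
    for token in line.split():
        if token.startswith("chrom="):
            token = "chrom=" + hg19_to_b37(token[len("chrom="):])
        tokens.append(token)
    return " ".join(tokens)


if __name__ == "__main__":
    for example in ["variableStep chrom=chrM span=50", "1 50.0",
                    "variableStep chrom=chr1 span=50"]:
        print(rename_wig_line(example))
```

One caveat: hg19's chrM and b37's MT are different mitochondrial reference sequences, so renaming fixes the label but the GC values for that contig will not correspond exactly. The chromosome order still has to match the pileups after renaming.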

Two things:
1) The b37 FASTA file has characters like M and R in the sequence, and because of that the Python script failed when trying to generate a GC-content file from it. There aren't many of them, so the script could simply treat them as "N".

2) I guess you could modify the Python script so that it doesn't matter which chromosome naming format is used.
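
As a rough idea of what I mean in point 1, here is a minimal sketch of windowed GC counting in which anything other than A, C, G, or T (so N, M, R, and the other ambiguity codes) is simply ignored. It is not the sequenza script; the function name is made up, and dividing by the number of called bases rather than by the window size is my assumption.

```python
def gc_percent_windows(sequence, span=50):
    """Yield (1-based window start, GC percent) for consecutive windows.

    Only A, C, G and T are counted; N and ambiguity codes such as M or R
    are treated as uncalled instead of raising an error.
    """
    for offset in range(0, len(sequence), span):
        window = sequence[offset:offset + span].upper()
        gc = window.count("G") + window.count("C")
        at = window.count("A") + window.count("T")
        called = gc + at
        if called == 0:
            continue  # window is entirely N / ambiguous, nothing to report
        yield offset + 1, 100.0 * gc / called


if __name__ == "__main__":
    # Made-up sequence containing two ambiguity codes (M and R).
    for start, gc in gc_percent_windows("GGCCAATTMRNNNNNGCGCA", span=10):
        print(start, gc)
```

Counting with explicit `count` calls avoids the dictionary lookup that fails on unexpected letters.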

Last edited by lethalfang; 05-02-2014 at 01:09 PM.
Old 05-02-2014, 10:42 AM   #19
lethalfang
Member
 
Location: San Francisco, CA

Join Date: Aug 2011
Posts: 91
Default

By the way, in the user guide (http://cran.r-project.org/web/packag...s/sequenza.pdf), you had "-r" to flag normal.pileup and "-s" to flag tumor.pileup.
Old 05-04-2014, 06:18 AM   #20
ffavero
Member
 
Location: London

Join Date: Apr 2014
Posts: 12
Default

Oops, you are right!
That's because from version 1.* to 2.0 I changed all the arguments from reference/sample (-r/-s) to normal/tumor (-n/-t), which is more in keeping with cancer research. I was careful to change it everywhere, but clearly not there.

I haven't tested it with the b37 FASTA; I will add a way to handle M and R.
Thanks for bringing this to my attention! Both your points were really relevant.

Tags
cancer, cna, cnv, copy number, somatic


All times are GMT -8.
