Seqanswers Leaderboard Ad

**ffavero** · 05-02-2014, 05:26 AM

Hi lethalfang,
the chromosomes in the 2 pileups and in the GC file are in the same order, right?
Also note that the pileup have to be generated with the fasta reference (-f argument), otherwise there might be problems.

you could try to diminish the '-n' parameter to allow consensus position with less depth to be taken into account.
The default is 20, so to be included you need to have at least 10 reads in the normal and 10 in the tumor at a given position (or any other configuration where the sum is 20). This might might be too high for low pass WGS.

If you have a chance to paste part of the content of your 3 files (eg in pastebin or similar) I could have a look and see if there is something clearly wrong.

EDIT: additionally you could have a look here https://bitbucket.org/ffavero/sequen...Sequenza_Utils, for tips on how to use sequenza-utils.

**lethalfang** · 05-02-2014, 07:47 AM

Yes, they're in the same order. They were generated using the following command (the pileups were first created for VarScan2).
The reference used here is the Broad Institute's version B37, i.e., the chromosomes written as 1, 2, 3, ..., X, Y, MT.

samtools mpileup -B -q 1 -f b37.fa WGS.sorted.aligned.bam > WGS.pileup

That pileup was then bgzipped into .pileup.gz

The GC content file was generated using the hg19 version, i.e., chromosomes written as chr1, chr2, chr3, ..., chrX, chrY, chrM.

I tried to create the GC content file with the B37.fa, but that b37 fasta file has a smaller number (maybe no more than 1000) of characters that are not G, C, T, A, or N. The script failed when it tried to count M, R, etc. due to dictionary key error. In any case, the GC content file shouldn't be the cause of an empty seqz file.

First 5 lines of the normal pileup.gz

1 10000 N 2 ^$A^]A <<
1 10001 T 17 CC^0.^-.^0.^*.^-.^+.^>.^2.^>.^>.^+.^>.^,.^&.^0. CC@CBDEDED@??E<<C
1 10002 A 33 C.................^2.^>.^-.^*.^>.^>.^2.^-.^/.^-.^0.^".^-.^$.^2. DDACBB@DBCB@@=DCCCB::A@C?:A9?:<=D
1 10003 A 43 .................................^2.^*.^>.^&.^+.^$.^$.^:.^8.^*. CEDBEDCE=EEDB6EEEEDCCCBD3DC?@CEDE>A::==AC=<
1 10004 C 58 ...........................................^".^>.^5.^8.^5.^>.^*.^$.^*.^'.^5.^*.^-.^$.^>. BBBABBBC?AB@BACBBCBBBBC?ABB>;BCBB?ABBAB@BBC

?:A@:AA<C=D?B

First 5 lines of the gc content file:

variableStep chrom=chrM span=50
1 50.0
51 56.0
101 52.0
151 36.0
201 22.0
251 44.0
301 62.0
351 40.0
401 42.0

All the pileups and gc-content files are bgzipped (into gz). Hope that isn't the problem.

Never mind, it may be due to the gc-content file with wrong chromosome orders. Let me fix that and try again.

**lethalfang** · 05-02-2014, 10:33 AM

Update:

Simply re-ordering the chromosomes wasn't enough. I converted chrom=chr1 into chrom=1, etc, and now it's working.

Because the pileup files were generated using the b37 format (i.e., chromosomes were named 1, 2, 3, ..., X, Y, MT), and the gc-content file was generated using the hg19 format (i.e., chromosomes were named chr1, chr2, chr3, ..., chrX, chrY, chrM), the chromosome names did not match.

Two things:
1) The b37's fasta file has characters like M and R in the sequence. There aren't many of those, you can simply consider those as "N" in the script. Due to that, the python script failed trying to generate a gc-content file from b37 fasta file.

2) I guess you can modify the python script, so it doesn't matter which chromosome formats were used.

**lethalfang** · 05-02-2014, 10:42 AM

By the way, in the user guide (http://cran.r-project.org/web/packag...s/sequenza.pdf), you had "-r" to flag normal.pileup and "-s" to flag tumor.pileup.

**ffavero** · 05-04-2014, 06:18 AM

Ops, you are right!
That's because from version 1.* to 2.0 I've changed all the arguments from reference/sample (-r/-s) to, more on-topic withe cancer research, normal/tumor (-n/-t). I was carefull to change it everywere, but clearly not there.

I haven't test it with b37 fasta, I will add a way to handle M and R.
Thanks for taking this to my attention! Both your points were really relevant.

**lethalfang** · 05-04-2014, 11:20 AM

Originally posted by ffavero View Post

Ops, you are right!
That's because from version 1.* to 2.0 I've changed all the arguments from reference/sample (-r/-s) to, more on-topic withe cancer research, normal/tumor (-n/-t). I was carefull to change it everywere, but clearly not there.

I haven't test it with b37 fasta, I will add a way to handle M and R.
Thanks for taking this to my attention! Both your points were really relevant.

The b37 fasta has all the letters in the IUPAC code, (i.e., R, Y, S, W, K, M, B, D, H, V):

IUPAC codes

http://www.bioinformatics.org/sms2/iupac.html

Some are capitalized and some are lower cases.

**lethalfang** · 05-04-2014, 04:37 PM

I had trouble getting sequenza.fit command to work with the chromosome 22 of HCC1143, though it works with the example.seqz.txt.gz included in the package.

Here are the results in the R shell:

Code:

> test22 <- sequenza.extract('22.seqz.gz')
Processing 22: 1293 variant calls; 36826 heterozygous positions; 33837423 homozygous positions.


> names(test22)
[1] "BAF"         "ratio"       "mutations"   "segments"    "chromosomes"
[6] "gc"          "avg.depth"  


> CP.example22 <- sequenza.fit( test22 )
  |                                                                      |   0%
Error in result$L - max.lik : non-numeric argument to binary operator
In addition: There were 50 or more warnings (use warnings() to see the first 50)


> warnings()
Warning messages:
1: In mclapply(X, wrapper, mc.cores = mc.cores, ...) :
  all scheduled cores encountered errors in user code
2: In mean.default(X[[1L]], ...) :
  argument is not numeric or logical: returning NA
3: In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA
4: In mean.default(X[[3L]], ...) :
  argument is not numeric or logical: returning NA
5: In mean.default(X[[4L]], ...) :
  argument is not numeric or logical: returning NA

First 10 lines of 22.seqz.gz:

Code:


$ gunzip -c 22.seqz.gz | head
chromosome	position	base.ref	depth.normal	depth.tumor	depth.ratio	Af	Bf	zygosity.normal	GC.percent	good.reads	AB.normal	AB.tumor	tumor.strand
22	16050047	A	9	13	1.444	1.0	0	hom	50.0	13	A	.	0
22	16050048	A	11	13	1.182	1.0	0	hom	50.0	13	A	.	0
22	16050049	G	12	13	1.083	1.0	0	hom	50.0	13	G	.	0
22	16050050	T	13	13	1.0	1.0	0	hom	50.0	12	T	.	0
22	16050051	C	13	14	1.077	1.0	0	hom	58.0	13	C	.	0
22	16050052	A	13	15	1.154	1.0	0	hom	58.0	15	A	.	0
22	16050053	C	13	16	1.231	1.0	0	hom	58.0	16	C	.	0
22	16050054	T	14	16	1.143	1.0	0	hom	58.0	16	T	.	0
22	16050055	T	15	17	1.133	1.0	0	hom	58.0	17	T	.	0

If you'd like to see my file yourself, it's at http://caprica.thruhere.net:8080/22.seqz.gz

**lethalfang** · 05-04-2014, 11:46 PM

Also tried it with chromosome 1, which is much larger than chromosome 22:

Code:

> test1 <- sequenza.extract('1.seqz.gz')
Processing 1: 5572 variant calls; 185400 heterozygous positions; 220663060 homozygous positions.


> CP.sample1 <- sequenza.fit(test1)
  |                                                                      |   0%Error in mcfork() : 
  unable to fork, possible reason: Cannot allocate memory

The computer has 96 GB of memory. After test1 <- sequenza.extract('1.seqz.gz'), there were about 20GB memory left.

**ffavero** · 05-05-2014, 01:33 AM

Have you tried binning the resulting seqz.gz file with seqz-binning?

The original goal was targeting exome sequencing, which generates smaller data then WGS. With whole genome you need to binning every 100~200nt in order to use it without huge memory requirement (R needs to load the file in memory if you want adequate speed). I tested this also on a 8Gb desktop.
Also while segmenting (on sequenza.extract function) you need to use a kmin bigger then 100nt (minimal number of probe to call a segment, argument from the copynumber package), otherwise you will generate a lot of not really relevant segments, and running sequenza.fit with a huge number of segments can be really slow.

I see now that I have to mention some suggested workflow for WGS in the manual (for me that I made the software it's difficult to pinpoint what it's not obvious and NEEDS to be documented).

I have a script that looks like:

Code:

#!/bin/bash

SCRATCH=$1
OUT=$2
BIN_SZ=$3
lll=""
## Loop over all chrs.
for i in $(cat $HOME/references/chromosomes/characters.txt)
   do
      lll=${lll}" "$SCRATCH\_$i.seqz.gz
   done
zcat $lll | gawk '{if (NR==1 || $1!="chromosome") {print $0}}' | \
   ~/src/pypy-favero01-2.1/bin/pypy \
   ~/local/bin/sequenza-utils.py seqz-binning -w $BIN_SZ -a - | \
   gzip > $OUT

that performs the binning of a desired size and merge the various chromosomes (stored all in the same directory).

you could run it like this

Code:

bash ./merge_bin.sh /path/of/sample/sample /path/output/sample.seqz.gz 200

where the various chromosomes are in /path/of/sample/sample_chr1.seqz.gz, ..., /path/of/sample/sample_chr22.seqz.gz ..., for example. And the "characters.txt" file mentioned in the script it's just a list of name like chr1, chr2 and so on - on chromosome for line -.

The normalization and analysis have to be performed genome wide, so analyzing chromosome by chromosome might cause incorrect estimation.

So, I suggest you to run a similar script to have a file comprising all the chromosomes binned by ~100 or 200nt, and then try again the sequenza R package (this time with a smaller file will be a lot faster/memory efficient), and also increase the kmin with a value around 500, then decrease it if you see the sensibility of the segmentation is to low (or play with the gamma kmin argument, see ?copynumber manual)

Thanks again for the valuable feedback!

EDIT: I've tried yout 22.seqz.gz file on my laptop, binning every 100nt (it reduces the size to 35Mb) and using kmin of 500. It works - although the cellularity/ploidy estimation is wrong, since no gain or losses are visible on this chromosome -
With cell-line it is somehow easier since the cellularity is always 100%, so any changes should be well visible. However the mere estimation might be a little more tricky, I've tested sequenza with TCGA benchmark 4, and it was working well, hope you will find the same.

**lethalfang** · 05-05-2014, 12:21 PM

Originally posted by ffavero View Post

Have you tried binning the resulting seqz.gz file with seqz-binning?

The original goal was targeting exome sequencing, which generates smaller data then WGS. With whole genome you need to binning every 100~200nt in order to use it without huge memory requirement (R needs to load the file in memory if you want adequate speed). I tested this also on a 8Gb desktop.
Also while segmenting (on sequenza.extract function) you need to use a kmin bigger then 100nt (minimal number of probe to call a segment, argument from the copynumber package), otherwise you will generate a lot of not really relevant segments, and running sequenza.fit with a huge number of segments can be really slow.

I see now that I have to mention some suggested workflow for WGS in the manual (for me that I made the software it's difficult to pinpoint what it's not obvious and NEEDS to be documented).

I have a script that looks like:

Code:

#!/bin/bash

SCRATCH=$1
OUT=$2
BIN_SZ=$3
lll=""
## Loop over all chrs.
for i in $(cat $HOME/references/chromosomes/characters.txt)
   do
      lll=${lll}" "$SCRATCH\_$i.seqz.gz
   done
zcat $lll | gawk '{if (NR==1 || $1!="chromosome") {print $0}}' | \
   ~/src/pypy-favero01-2.1/bin/pypy \
   ~/local/bin/sequenza-utils.py seqz-binning -w $BIN_SZ -a - | \
   gzip > $OUT

that performs the binning of a desired size and merge the various chromosomes (stored all in the same directory).

you could run it like this

Code:

bash ./merge_bin.sh /path/of/sample/sample /path/output/sample.seqz.gz 200

where the various chromosomes are in /path/of/sample/sample_chr1.seqz.gz, ..., /path/of/sample/sample_chr22.seqz.gz ..., for example. And the "characters.txt" file mentioned in the script it's just a list of name like chr1, chr2 and so on - on chromosome for line -.

The normalization and analysis have to be performed genome wide, so analyzing chromosome by chromosome might cause incorrect estimation.

So, I suggest you to run a similar script to have a file comprising all the chromosomes binned by ~100 or 200nt, and then try again the sequenza R package (this time with a smaller file will be a lot faster/memory efficient), and also increase the kmin with a value around 500, then decrease it if you see the sensibility of the segmentation is to low (or play with the gamma kmin argument, see ?copynumber manual)

Thanks again for the valuable feedback!

EDIT: I've tried yout 22.seqz.gz file on my laptop, binning every 100nt (it reduces the size to 35Mb) and using kmin of 500. It works - although the cellularity/ploidy estimation is wrong, since no gain or losses are visible on this chromosome -
With cell-line it is somehow easier since the cellularity is always 100%, so any changes should be well visible. However the mere estimation might be a little more tricky, I've tested sequenza with TCGA benchmark 4, and it was working well, hope you will find the same.

My chromosome 22 came from this cell line:

Page not found | Department of Pathology

http://www.path.cam.ac.uk/~pawefish/BreastCellLineDescriptions/HCC1143.html

The bam files were downloaded from cghub/TCGA public data set.

Chromosome 21 has a lot more variations, if you want to try that:

http://caprica.thruhere.net:8080/21.seqz.gz

**ffavero** · 05-05-2014, 10:29 PM

Originally posted by lethalfang View Post

My chromosome 22 came from this cell line:

Page not found | Department of Pathology

http://www.path.cam.ac.uk/~pawefish/BreastCellLineDescriptions/HCC1143.html

The bam files were downloaded from cghub/TCGA public data set.

Chromosome 21 has a lot more variations, if you want to try that:
http://caprica.thruhere.net:8080/21.seqz.gz

I have already processed HCC1143, it results triploids and 100% of cellularity.
The aberration detected with sequenza seems to be matching the SKY profile (although is difficult to detect the color sparse around in translocations)

This is chromosome 7, for example:

Top panel shows the mutation (not filtered by any means, each dotted line is the expected frequency of the mutated, if CN = 4, we would have 4 lines and so on...);
Middle panel shows B-allelle frequency (the dotted lines are on top of the expected frequency of the possible number of B alleles);
Bottom panel shows total copy number information.

This plot was obtained some time ago, using only the heterozygous positions and variant alleles on homozygous - so discarding non variant position instead of binning them - Therefore it loses accuracy on the depth ratio. ( few thousands position used for the depth ratio instead of 150 millions).
As a matter of facts I'm working on the same cell lines now, for an add-on required by reviewers

, so I'm running the data with the current sequenza version. Tell me if something is still wrong/too cumbersome.

**pratima** · 05-06-2014, 12:08 AM

hiii!!!!! hv no idea just felt like posting

**lethalfang** · 05-06-2014, 10:08 AM

Okay I'll give it a try.

**lethalfang** · 05-06-2014, 12:08 PM

I binned the chromosome 22, but still get this error:

Code:

> library('sequenza')

> data.file <- "binned_22.seqz.gz"
> seqz.data <- read.seqz(data.file, chr.name = "22")
> test <- sequenza.extract(data.file)
Processing 22: 1293 variant calls; 36633 heterozygous positions; 2311953 homozygous positions.

> cp22.example=sequenza.fit(test)
  |                                                                      |   0%
Error in result$L - max.lik : non-numeric argument to binary operator
In addition: There were 50 or more warnings (use warnings() to see the first 50)

The file is at: http://caprica.thruhere.net:8080/binned_22.seqz.gz

It was created with the following command:
sequenza-utils.py seqz-binning -w 200 -a 22.seqz | /home/ltfang/apps/tabix/bgzip > binned_22.seqz.gz

Chromosome 1 ran fine. However, some commands did not work:

Code:

> sequenza.results( sequenza.extract=test, sequenza.fit=cp1.example, sample.id="Test1", out.dir="." )
Error in sequenza.results(sequenza.extract = test, sequenza.fit = cp1.example,  : 
  unused argument (sequenza.fit = cp1.example)
> names(cp1.example)



> chromosome.view( mut.tab=test$mutations[[3]], baf.windows=test$BAF[[3]], ratio.windows=test$ratio[[3]], min.N.ratio=1, segments=seg.tab[seg.tab$chromosome==test$chromosomes[3],], main=test$chromosomes[3], cellularity=cellularity, ploidy=ploidy, avg.depth.ratio=avg.depth.ratio)
Error in seq.default(from = 1, to = CNt, by = 1) : 
  'to' cannot be NA, NaN or infinite
In addition: Warning message:
In max(segments$CNt, na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf

The ploidy and cellularity may be off:

Code:

> cellularity
[1] 0.76

> ploidy
[1] 2.6

**ffavero** · 05-07-2014, 06:56 AM

Hi lethalfang,
try with kmin=400 (or similar, bigger the 100) in the sequenza.extract function.

Also you should merge all the files, since the algorithm need to find a median ploidy across all the genome.

also the

Code:

sequenza.results( sequenza.extract=test, sequenza.fit=cp1.example...

Now it's

Code:

sequenza.results( sequenza.extract=test, cp.table=cp1.example

I had to change argument name to be consistent with all the other function in the package using the cp.table object.

I spotted a bug in the binning function this morning, so it will not work as expected... I'm working on it, I'll have a fix in the next few days.

In the meantime you could use a script like this to merge the seqz.gz files:

Code:

#!/bin/bash

SCRATCH=$1
OUT=$2
lll=""
## Loop over all chrs.
for i in $(cat $HOME/references/chromosomes/numerics.txt)
   do
      lll=${lll}" "$SCRATCH\_$i.seqz.gz
   done
gzip -dc $lll | gawk '{if (NR==1 || $1!="chromosome") {print $0}}' | \
 gawk '{if (NR==1 || $7<= 0.965 || $9 == "het") {print $0}}' | gzip > $OUT

The bug consist on removing part of heterozygous positions, I'm debugging the code to spot the flaw.
Thanks for the patience

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Yesterday, 11:49 AM	0 responses 13 views 0 likes	Last Post by seqadmin Yesterday, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News