Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Hi lethalfang,
    the chromosomes in the 2 pileups and in the GC file are in the same order, right?
    Also note that the pileup have to be generated with the fasta reference (-f argument), otherwise there might be problems.

    you could try to diminish the '-n' parameter to allow consensus position with less depth to be taken into account.
    The default is 20, so to be included you need to have at least 10 reads in the normal and 10 in the tumor at a given position (or any other configuration where the sum is 20). This might might be too high for low pass WGS.

    If you have a chance to paste part of the content of your 3 files (eg in pastebin or similar) I could have a look and see if there is something clearly wrong.

    EDIT: additionally you could have a look here https://bitbucket.org/ffavero/sequen...Sequenza_Utils, for tips on how to use sequenza-utils.
    Last edited by ffavero; 05-02-2014, 07:22 AM.

    Comment


    • #17
      Yes, they're in the same order. They were generated using the following command (the pileups were first created for VarScan2).
      The reference used here is the Broad Institute's version B37, i.e., the chromosomes written as 1, 2, 3, ..., X, Y, MT.

      samtools mpileup -B -q 1 -f b37.fa WGS.sorted.aligned.bam > WGS.pileup
      That pileup was then bgzipped into .pileup.gz

      The GC content file was generated using the hg19 version, i.e., chromosomes written as chr1, chr2, chr3, ..., chrX, chrY, chrM.

      I tried to create the GC content file with the B37.fa, but that b37 fasta file has a smaller number (maybe no more than 1000) of characters that are not G, C, T, A, or N. The script failed when it tried to count M, R, etc. due to dictionary key error. In any case, the GC content file shouldn't be the cause of an empty seqz file.

      First 5 lines of the normal pileup.gz
      1 10000 N 2 ^$A^]A <<
      1 10001 T 17 CC^0.^-.^0.^*.^-.^+.^>.^2.^>.^>.^+.^>.^,.^&.^0. CC@CBDEDED@??E<<C
      1 10002 A 33 C.................^2.^>.^-.^*.^>.^>.^2.^-.^/.^-.^0.^".^-.^$.^2. DDACBB@DBCB@@=DCCCB::A@C?:A9?:<=D
      1 10003 A 43 .................................^2.^*.^>.^&.^+.^$.^$.^:.^8.^*. CEDBEDCE=EEDB6EEEEDCCCBD3DC?@CEDE>A::==AC=<
      1 10004 C 58 ...........................................^".^>.^5.^8.^5.^>.^*.^$.^*.^'.^5.^*.^-.^$.^>. BBBABBBC?AB@BACBBCBBBBC?ABB>;BCBB?ABBAB@BBC?:A@:AA<C=D?B

      First 5 lines of the gc content file:
      variableStep chrom=chrM span=50
      1 50.0
      51 56.0
      101 52.0
      151 36.0
      201 22.0
      251 44.0
      301 62.0
      351 40.0
      401 42.0

      All the pileups and gc-content files are bgzipped (into gz). Hope that isn't the problem.


      Never mind, it may be due to the gc-content file with wrong chromosome orders. Let me fix that and try again.
      Last edited by lethalfang; 05-02-2014, 01:09 PM.

      Comment


      • #18
        Update:

        Simply re-ordering the chromosomes wasn't enough. I converted chrom=chr1 into chrom=1, etc, and now it's working.

        Because the pileup files were generated using the b37 format (i.e., chromosomes were named 1, 2, 3, ..., X, Y, MT), and the gc-content file was generated using the hg19 format (i.e., chromosomes were named chr1, chr2, chr3, ..., chrX, chrY, chrM), the chromosome names did not match.

        Two things:
        1) The b37's fasta file has characters like M and R in the sequence. There aren't many of those, you can simply consider those as "N" in the script. Due to that, the python script failed trying to generate a gc-content file from b37 fasta file.

        2) I guess you can modify the python script, so it doesn't matter which chromosome formats were used.
        Last edited by lethalfang; 05-02-2014, 01:09 PM.

        Comment


        • #19
          By the way, in the user guide (http://cran.r-project.org/web/packag...s/sequenza.pdf), you had "-r" to flag normal.pileup and "-s" to flag tumor.pileup.

          Comment


          • #20
            Ops, you are right!
            That's because from version 1.* to 2.0 I've changed all the arguments from reference/sample (-r/-s) to, more on-topic withe cancer research, normal/tumor (-n/-t). I was carefull to change it everywere, but clearly not there.

            I haven't test it with b37 fasta, I will add a way to handle M and R.
            Thanks for taking this to my attention! Both your points were really relevant.

            Comment


            • #21
              Originally posted by ffavero View Post
              Ops, you are right!
              That's because from version 1.* to 2.0 I've changed all the arguments from reference/sample (-r/-s) to, more on-topic withe cancer research, normal/tumor (-n/-t). I was carefull to change it everywere, but clearly not there.

              I haven't test it with b37 fasta, I will add a way to handle M and R.
              Thanks for taking this to my attention! Both your points were really relevant.
              The b37 fasta has all the letters in the IUPAC code, (i.e., R, Y, S, W, K, M, B, D, H, V):

              Some are capitalized and some are lower cases.
              Last edited by lethalfang; 05-04-2014, 12:36 PM.

              Comment


              • #22
                I had trouble getting sequenza.fit command to work with the chromosome 22 of HCC1143, though it works with the example.seqz.txt.gz included in the package.


                Here are the results in the R shell:
                Code:
                > test22 <- sequenza.extract('22.seqz.gz')
                Processing 22: 1293 variant calls; 36826 heterozygous positions; 33837423 homozygous positions.
                
                
                > names(test22)
                [1] "BAF"         "ratio"       "mutations"   "segments"    "chromosomes"
                [6] "gc"          "avg.depth"  
                
                
                > CP.example22 <- sequenza.fit( test22 )
                  |                                                                      |   0%
                Error in result$L - max.lik : non-numeric argument to binary operator
                In addition: There were 50 or more warnings (use warnings() to see the first 50)
                
                
                > warnings()
                Warning messages:
                1: In mclapply(X, wrapper, mc.cores = mc.cores, ...) :
                  all scheduled cores encountered errors in user code
                2: In mean.default(X[[1L]], ...) :
                  argument is not numeric or logical: returning NA
                3: In mean.default(X[[2L]], ...) :
                  argument is not numeric or logical: returning NA
                4: In mean.default(X[[3L]], ...) :
                  argument is not numeric or logical: returning NA
                5: In mean.default(X[[4L]], ...) :
                  argument is not numeric or logical: returning NA


                First 10 lines of 22.seqz.gz:
                Code:
                
                $ gunzip -c 22.seqz.gz | head
                chromosome	position	base.ref	depth.normal	depth.tumor	depth.ratio	Af	Bf	zygosity.normal	GC.percent	good.reads	AB.normal	AB.tumor	tumor.strand
                22	16050047	A	9	13	1.444	1.0	0	hom	50.0	13	A	.	0
                22	16050048	A	11	13	1.182	1.0	0	hom	50.0	13	A	.	0
                22	16050049	G	12	13	1.083	1.0	0	hom	50.0	13	G	.	0
                22	16050050	T	13	13	1.0	1.0	0	hom	50.0	12	T	.	0
                22	16050051	C	13	14	1.077	1.0	0	hom	58.0	13	C	.	0
                22	16050052	A	13	15	1.154	1.0	0	hom	58.0	15	A	.	0
                22	16050053	C	13	16	1.231	1.0	0	hom	58.0	16	C	.	0
                22	16050054	T	14	16	1.143	1.0	0	hom	58.0	16	T	.	0
                22	16050055	T	15	17	1.133	1.0	0	hom	58.0	17	T	.	0

                If you'd like to see my file yourself, it's at http://caprica.thruhere.net:8080/22.seqz.gz
                Last edited by lethalfang; 05-04-2014, 04:40 PM.

                Comment


                • #23
                  Also tried it with chromosome 1, which is much larger than chromosome 22:
                  Code:
                  > test1 <- sequenza.extract('1.seqz.gz')
                  Processing 1: 5572 variant calls; 185400 heterozygous positions; 220663060 homozygous positions.
                  
                  
                  > CP.sample1 <- sequenza.fit(test1)
                    |                                                                      |   0%Error in mcfork() : 
                    unable to fork, possible reason: Cannot allocate memory
                  The computer has 96 GB of memory. After test1 <- sequenza.extract('1.seqz.gz'), there were about 20GB memory left.

                  Comment


                  • #24
                    Have you tried binning the resulting seqz.gz file with seqz-binning?

                    The original goal was targeting exome sequencing, which generates smaller data then WGS. With whole genome you need to binning every 100~200nt in order to use it without huge memory requirement (R needs to load the file in memory if you want adequate speed). I tested this also on a 8Gb desktop.
                    Also while segmenting (on sequenza.extract function) you need to use a kmin bigger then 100nt (minimal number of probe to call a segment, argument from the copynumber package), otherwise you will generate a lot of not really relevant segments, and running sequenza.fit with a huge number of segments can be really slow.

                    I see now that I have to mention some suggested workflow for WGS in the manual (for me that I made the software it's difficult to pinpoint what it's not obvious and NEEDS to be documented).

                    I have a script that looks like:

                    Code:
                    #!/bin/bash
                    
                    SCRATCH=$1
                    OUT=$2
                    BIN_SZ=$3
                    lll=""
                    ## Loop over all chrs.
                    for i in $(cat $HOME/references/chromosomes/characters.txt)
                       do
                          lll=${lll}" "$SCRATCH\_$i.seqz.gz
                       done
                    zcat $lll | gawk '{if (NR==1 || $1!="chromosome") {print $0}}' | \
                       ~/src/pypy-favero01-2.1/bin/pypy \
                       ~/local/bin/sequenza-utils.py seqz-binning -w $BIN_SZ -a - | \
                       gzip > $OUT
                    that performs the binning of a desired size and merge the various chromosomes (stored all in the same directory).

                    you could run it like this

                    Code:
                    bash ./merge_bin.sh /path/of/sample/sample /path/output/sample.seqz.gz 200
                    where the various chromosomes are in /path/of/sample/sample_chr1.seqz.gz, ..., /path/of/sample/sample_chr22.seqz.gz ..., for example. And the "characters.txt" file mentioned in the script it's just a list of name like chr1, chr2 and so on - on chromosome for line -.

                    The normalization and analysis have to be performed genome wide, so analyzing chromosome by chromosome might cause incorrect estimation.

                    So, I suggest you to run a similar script to have a file comprising all the chromosomes binned by ~100 or 200nt, and then try again the sequenza R package (this time with a smaller file will be a lot faster/memory efficient), and also increase the kmin with a value around 500, then decrease it if you see the sensibility of the segmentation is to low (or play with the gamma kmin argument, see ?copynumber manual)

                    Thanks again for the valuable feedback!

                    EDIT: I've tried yout 22.seqz.gz file on my laptop, binning every 100nt (it reduces the size to 35Mb) and using kmin of 500. It works - although the cellularity/ploidy estimation is wrong, since no gain or losses are visible on this chromosome -
                    With cell-line it is somehow easier since the cellularity is always 100%, so any changes should be well visible. However the mere estimation might be a little more tricky, I've tested sequenza with TCGA benchmark 4, and it was working well, hope you will find the same.
                    Last edited by ffavero; 05-05-2014, 02:51 AM.

                    Comment


                    • #25
                      Originally posted by ffavero View Post
                      Have you tried binning the resulting seqz.gz file with seqz-binning?

                      The original goal was targeting exome sequencing, which generates smaller data then WGS. With whole genome you need to binning every 100~200nt in order to use it without huge memory requirement (R needs to load the file in memory if you want adequate speed). I tested this also on a 8Gb desktop.
                      Also while segmenting (on sequenza.extract function) you need to use a kmin bigger then 100nt (minimal number of probe to call a segment, argument from the copynumber package), otherwise you will generate a lot of not really relevant segments, and running sequenza.fit with a huge number of segments can be really slow.

                      I see now that I have to mention some suggested workflow for WGS in the manual (for me that I made the software it's difficult to pinpoint what it's not obvious and NEEDS to be documented).

                      I have a script that looks like:

                      Code:
                      #!/bin/bash
                      
                      SCRATCH=$1
                      OUT=$2
                      BIN_SZ=$3
                      lll=""
                      ## Loop over all chrs.
                      for i in $(cat $HOME/references/chromosomes/characters.txt)
                         do
                            lll=${lll}" "$SCRATCH\_$i.seqz.gz
                         done
                      zcat $lll | gawk '{if (NR==1 || $1!="chromosome") {print $0}}' | \
                         ~/src/pypy-favero01-2.1/bin/pypy \
                         ~/local/bin/sequenza-utils.py seqz-binning -w $BIN_SZ -a - | \
                         gzip > $OUT
                      that performs the binning of a desired size and merge the various chromosomes (stored all in the same directory).

                      you could run it like this

                      Code:
                      bash ./merge_bin.sh /path/of/sample/sample /path/output/sample.seqz.gz 200
                      where the various chromosomes are in /path/of/sample/sample_chr1.seqz.gz, ..., /path/of/sample/sample_chr22.seqz.gz ..., for example. And the "characters.txt" file mentioned in the script it's just a list of name like chr1, chr2 and so on - on chromosome for line -.

                      The normalization and analysis have to be performed genome wide, so analyzing chromosome by chromosome might cause incorrect estimation.

                      So, I suggest you to run a similar script to have a file comprising all the chromosomes binned by ~100 or 200nt, and then try again the sequenza R package (this time with a smaller file will be a lot faster/memory efficient), and also increase the kmin with a value around 500, then decrease it if you see the sensibility of the segmentation is to low (or play with the gamma kmin argument, see ?copynumber manual)

                      Thanks again for the valuable feedback!

                      EDIT: I've tried yout 22.seqz.gz file on my laptop, binning every 100nt (it reduces the size to 35Mb) and using kmin of 500. It works - although the cellularity/ploidy estimation is wrong, since no gain or losses are visible on this chromosome -
                      With cell-line it is somehow easier since the cellularity is always 100%, so any changes should be well visible. However the mere estimation might be a little more tricky, I've tested sequenza with TCGA benchmark 4, and it was working well, hope you will find the same.
                      My chromosome 22 came from this cell line:

                      The bam files were downloaded from cghub/TCGA public data set.

                      Chromosome 21 has a lot more variations, if you want to try that:
                      Last edited by lethalfang; 05-05-2014, 12:23 PM.

                      Comment


                      • #26
                        Originally posted by lethalfang View Post
                        My chromosome 22 came from this cell line:

                        The bam files were downloaded from cghub/TCGA public data set.

                        Chromosome 21 has a lot more variations, if you want to try that:
                        http://caprica.thruhere.net:8080/21.seqz.gz
                        I have already processed HCC1143, it results triploids and 100% of cellularity.
                        The aberration detected with sequenza seems to be matching the SKY profile (although is difficult to detect the color sparse around in translocations)

                        This is chromosome 7, for example:



                        Top panel shows the mutation (not filtered by any means, each dotted line is the expected frequency of the mutated, if CN = 4, we would have 4 lines and so on...);
                        Middle panel shows B-allelle frequency (the dotted lines are on top of the expected frequency of the possible number of B alleles);
                        Bottom panel shows total copy number information.

                        This plot was obtained some time ago, using only the heterozygous positions and variant alleles on homozygous - so discarding non variant position instead of binning them - Therefore it loses accuracy on the depth ratio. ( few thousands position used for the depth ratio instead of 150 millions).
                        As a matter of facts I'm working on the same cell lines now, for an add-on required by reviewers , so I'm running the data with the current sequenza version. Tell me if something is still wrong/too cumbersome.
                        Last edited by ffavero; 05-05-2014, 10:56 PM.

                        Comment


                        • #27
                          hiii!!!!! hv no idea just felt like posting

                          Comment


                          • #28
                            Okay I'll give it a try.

                            Comment


                            • #29
                              I binned the chromosome 22, but still get this error:

                              Code:
                              > library('sequenza')
                              
                              > data.file <- "binned_22.seqz.gz"
                              > seqz.data <- read.seqz(data.file, chr.name = "22")
                              > test <- sequenza.extract(data.file)
                              Processing 22: 1293 variant calls; 36633 heterozygous positions; 2311953 homozygous positions.
                              
                              > cp22.example=sequenza.fit(test)
                                |                                                                      |   0%
                              Error in result$L - max.lik : non-numeric argument to binary operator
                              In addition: There were 50 or more warnings (use warnings() to see the first 50)

                              The file is at: http://caprica.thruhere.net:8080/binned_22.seqz.gz

                              It was created with the following command:
                              sequenza-utils.py seqz-binning -w 200 -a 22.seqz | /home/ltfang/apps/tabix/bgzip > binned_22.seqz.gz



                              Chromosome 1 ran fine. However, some commands did not work:

                              Code:
                              > sequenza.results( sequenza.extract=test, sequenza.fit=cp1.example, sample.id="Test1", out.dir="." )
                              Error in sequenza.results(sequenza.extract = test, sequenza.fit = cp1.example,  : 
                                unused argument (sequenza.fit = cp1.example)
                              > names(cp1.example)
                              
                              
                              
                              > chromosome.view( mut.tab=test$mutations[[3]], baf.windows=test$BAF[[3]], ratio.windows=test$ratio[[3]], min.N.ratio=1, segments=seg.tab[seg.tab$chromosome==test$chromosomes[3],], main=test$chromosomes[3], cellularity=cellularity, ploidy=ploidy, avg.depth.ratio=avg.depth.ratio)
                              Error in seq.default(from = 1, to = CNt, by = 1) : 
                                'to' cannot be NA, NaN or infinite
                              In addition: Warning message:
                              In max(segments$CNt, na.rm = TRUE) :
                                no non-missing arguments to max; returning -Inf


                              The ploidy and cellularity may be off:

                              Code:
                              > cellularity
                              [1] 0.76
                              
                              > ploidy
                              [1] 2.6
                              Last edited by lethalfang; 05-06-2014, 12:56 PM.

                              Comment


                              • #30
                                Hi lethalfang,
                                try with kmin=400 (or similar, bigger the 100) in the sequenza.extract function.

                                Also you should merge all the files, since the algorithm need to find a median ploidy across all the genome.

                                also the

                                Code:
                                sequenza.results( sequenza.extract=test, sequenza.fit=cp1.example...
                                Now it's

                                Code:
                                sequenza.results( sequenza.extract=test, cp.table=cp1.example
                                I had to change argument name to be consistent with all the other function in the package using the cp.table object.

                                I spotted a bug in the binning function this morning, so it will not work as expected... I'm working on it, I'll have a fix in the next few days.

                                In the meantime you could use a script like this to merge the seqz.gz files:
                                Code:
                                #!/bin/bash
                                
                                SCRATCH=$1
                                OUT=$2
                                lll=""
                                ## Loop over all chrs.
                                for i in $(cat $HOME/references/chromosomes/numerics.txt)
                                   do
                                      lll=${lll}" "$SCRATCH\_$i.seqz.gz
                                   done
                                gzip -dc $lll | gawk '{if (NR==1 || $1!="chromosome") {print $0}}' | \
                                 gawk '{if (NR==1 || $7<= 0.965 || $9 == "het") {print $0}}' | gzip > $OUT
                                The bug consist on removing part of heterozygous positions, I'm debugging the code to spot the flaw.
                                Thanks for the patience
                                Last edited by ffavero; 05-07-2014, 10:48 PM.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Essential Discoveries and Tools in Epitranscriptomics
                                  by seqadmin




                                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                  04-22-2024, 07:01 AM
                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 11:49 AM
                                0 responses
                                13 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-24-2024, 08:47 AM
                                0 responses
                                16 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                61 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                60 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X