
  • MethylCoder: software for bisulfite treated reads

    hi, i have been working on a pipeline that takes bisulfite-treated reads and returns a useful methylation summary and output as simply as possible. i'm posting it here to get feedback. the best summary is to read the page here: http://github.com/brentp/methylcode

    it's available for download

    directly from the git repository as: git clone git://github.com/brentp/methylcode.git

    and via tarball: http://github.com/brentp/methylcode/tarball/master


    you'll need:
    * numpy from here: http://sourceforge.net/projects/numpy/files/
    * cython from here: http://pypi.python.org/packages/sour...-0.12.1.tar.gz
    * pyfasta from here: http://pypi.python.org/pypi/pyfasta/
    * bowtie from here: http://bowtie-bio.sourceforge.net/index.shtml

    MethylCoder uses the well-known method of converting all C's to T's in both the reads and the reference in order to map the bisulfite treated reads. Bowtie is used to do the alignments. It requires a FASTQ file for input, but if you have raw reads, you can convert them to FASTQ and use 'I' or whatever for the quality values and adjust the bowtie params and it will work fine.
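The two ideas in the paragraph above (in-silico C-to-T conversion, and wrapping raw reads in FASTQ with a constant quality character) can be sketched roughly as follows. This is only an illustration, not MethylCoder's actual code: the function names `c_to_t`, `raw_to_fastq` and `call_methylation` are hypothetical, the read is assumed to align to the reference at offset 0 without gaps, and only the forward strand is shown (reads from the opposite strand are usually handled with the complementary G-to-A conversion, omitted here).

```python
def c_to_t(seq):
    """Replace every C with T (keeping case), as done in silico before alignment."""
    return seq.replace("C", "T").replace("c", "t")


def raw_to_fastq(reads, qual_char="I"):
    """Yield one FASTQ record per raw read, using a constant quality character."""
    for i, read in enumerate(reads):
        read = read.strip()
        yield "@read_%d\n%s\n+\n%s\n" % (i, read, qual_char * len(read))


def call_methylation(reference, read):
    """For each reference C, report (position, True if the read still has a C).

    Assumes the read aligns to the reference at offset 0 with no gaps.
    """
    return [(i, read[i] == "C") for i, base in enumerate(reference) if base == "C"]


reference = "ACGTCCGA"
read = "ATGTTCGA"  # unmethylated C's were converted to T; the C at index 5 was protected

# After conversion, read and reference are identical, so the read can be mapped.
converted_ref = c_to_t(reference)
converted_read = c_to_t(read)

# Methylation is then called by going back to the unconverted sequences.
calls = call_methylation(reference, read)

# A raw read wrapped as FASTQ with 'I' as the quality value for every base.
fastq_record = next(raw_to_fastq([read]))
```

The point of aligning the converted sequences is that a T in the read no longer counts as a mismatch against a reference C; the original sequences are only consulted afterwards to decide which C's were protected.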

    We have been using it in the lab for quite a while and I have tested it against published analyses and other software and it matches very closely (but uses less memory and less CPU time), but use at your own risk.

    Currently, it does not handle paired-end reads. If someone needs this and provides me with a set of paired-end BS-treated reads, I will likely implement it.
    I would appreciate any feedback in terms of usability or features.

    this work is supported by the fischer lab (http://epmb.berkeley.edu/facPage/dispFP.php?I=8) at uc berkeley but any problems are my fault. please contact me directly with any questions or problems.

  • #2
    Please put an entry in the software Wiki for this tool! Otherwise, I'll have to do it :-)



    • #3
      done. thanks.



      • #4
        hi, i've updated MethylCoder with the following:

        + supports paired end reads
        + can use either bowtie or gsnap for the aligner
        + can take either fasta or fastq files as input
        + prints a nice, per-chromosome summary along with the per-base text and binary format and the SAM format.
        + better documented analysis scripts for finding differentially methylated regions between 2 runs of the pipeline. (fisher's exact test)
        + full tracking of the command used to generate each output file.
        + growing test suite.
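As a rough illustration of the Fisher's exact test mentioned in the list above, the sketch below compares methylated and unmethylated read counts for one region in two runs. It is not taken from MethylCoder's analysis scripts: the function name `fisher_exact_two_sided` and the example counts are made up, and the test is implemented directly from the hypergeometric distribution using only the standard library (Python 3.8+ for `math.comb`).

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Here a, b are methylated / unmethylated counts in run 1 and
    c, d are methylated / unmethylated counts in run 2.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x methylated counts in run 1
        # given the fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables that are no more likely than the
    # observed one (small relative tolerance for floating-point ties).
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))


# Hypothetical region: run 1 has 8 methylated / 2 unmethylated reads,
# run 2 has 1 methylated / 5 unmethylated reads.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

A small p-value suggests the methylated fraction differs between the two runs for that region; when many regions are tested, some correction for multiple testing would normally be applied on top.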


        please let me know if you have any questions, comments, or feature requests ( [email protected] )
        code is available at github as before:

        directly from the git repository as: git clone git://github.com/brentp/methylcode.git
        and via tarball: http://github.com/brentp/methylcode/tarball/master



        • #5
          MethylCoder has been published as a bioinformatics applications note:

          MethylCoder: Software Pipeline for Bisulfite-Treated Sequences
          Brent Pedersen; Tzung-Fu Hsieh; Christian Ibarra; Robert L. Fischer
          Bioinformatics 2011; doi: 10.1093/bioinformatics/btr394

          PDF Link

          Let me know of any questions.



          • #6
            Hi Brent,

            what are the differences in the alignment between basespace and colorspace data i.e. how do you solve the problem that one can't apply the in-silico conversion of C's to T's in reads for colorspace?



            • #7
              Originally posted by bisol
              Hi Brent,

              what are the differences in the alignment between basespace and colorspace data i.e. how do you solve the problem that one can't apply the in-silico conversion of C's to T's in reads for colorspace?
              Hi Bisol,
              I basically side-step the problem. I recommend that you do the following:
              1) quality trim your reads
              2) map with methylcoder (+bowtie) allowing 0 (you can also try 1) mismatches.
              3) map the unmapped reads with solid's SOCS tool: http://solidsoftwaretools.com/gf/project/socs/

              MethylCoder does a naive translation of C=>T by converting to base-space, then converting C=>T, then converting back to color-space. So it doesn't solve the problem, just tries to provide a solution to quickly map reads with no errors. I welcome suggestions for improvement in that regard.

              -Brent
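The naive translation described above can be sketched as follows. This is an illustration only, not MethylCoder's code: it assumes csfasta-style reads (a primer base followed by color digits 0-3), uses the standard two-base encoding in which a color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3), and the function names are hypothetical.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}


def color_to_base(csread):
    """Decode a read like 'T31312' (primer base + colors) into bases.

    The primer base itself is not part of the returned sequence.
    """
    prev = CODE[csread[0]]
    out = []
    for ch in csread[1:]:
        prev ^= int(ch)  # each color is the XOR of two adjacent base codes
        out.append(BASES[prev])
    return "".join(out)


def base_to_color(seq, primer="T"):
    """Encode a base sequence back into color space, starting from the primer base."""
    prev = CODE[primer]
    colors = []
    for base in seq:
        cur = CODE[base]
        colors.append(str(prev ^ cur))
        prev = cur
    return primer + "".join(colors)


def naive_c_to_t_colorspace(csread):
    """Decode to base space, convert C to T, and re-encode to color space."""
    bases = color_to_base(csread)
    return base_to_color(bases.replace("C", "T"), primer=csread[0])


cs_read = base_to_color("ACGTC")               # "T31312"
converted = naive_c_to_t_colorspace(cs_read)   # color encoding of "ATGTT"
```

Because every base is decoded from the previous one, this round trip is only trustworthy when the color read contains no measurement errors, which is why mapping with 0 mismatches is recommended above.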



              • #8
                Comparison with BisMark?

                Brent,

                Nice software and publication. Have you tried comparing MethylCoder and BisMark on the H1 ES cell line MethylC-seq dataset from Lister et al (2009)?

                Thanks,
                Derek



                • #9
                  Originally posted by dychiang
                  Have you tried comparing MethylCoder and BisMark on the H1 ES cell line MethylC-seq dataset from Lister et al (2009)?
                  Hi Derek, there is a comparison to other BS-Seq software here:

                  It uses some Arabidopsis thaliana data and shows time, (approximate) memory use, and reads mapped.

                  Felix Kreuger, one of the authors of BisMark suggested some changes to BisMark parameters that I could use to improve its performance, but I have not yet updated the benchmark with those changes.



                  • #10
                    As both MethylCoder and Bismark employ a very similar strategy, I would imagine that the results are very similar. By the way my last name is spelled Krueger :P.



                    • #11
                      Originally posted by fkrueger
                      By the way my last name is spelled Krueger :P.
                      As someone who repeatedly has their last name misspelled, I sincerely apologize.

                      And yes, the results between MethylCoder (with bowtie) and BisMark are quite similar.



                      • #12
                        Originally posted by fkrueger
                        As both MethylCoder and Bismark employ a very similar strategy, I would imagine that the results are very similar. By the way my last name is spelled Krueger :P.
                        Brent and Felix -- thanks very much for your helpful replies. The epigenetics sequencing community needs some good benchmarks, such as RGASP, to test the plethora of algorithms being developed.

                        Will either or both of you be attending the HiTSeq SIG at ISMB next week? I would be delighted to meet up with you.



                        • #13
                          Hi dychiang,
                          I won't be attending the HiTSeq SIG but I'll be at ISMB from Sunday til Wednesday. I'm happy to meet up with you, either drop me an email ([email protected]) or find me at my poster (Poster U59: Analysing allele-specific NGS datasets using ASAP)



                          • #14
                            Hi Derek, I won't be at that conference, but feel free to send me an email.



                            • #15
                              So essentially MethylCoder can map only perfect color space reads which align
                              perfectly against the genome.

                              As it has been pointed out in a previous thread you started
                              (http://seqanswers.com/forums/showthread.php?t=7979), it is generally problematic to
                              do a naive translation of color to base space, then converting C=>T and translating
                              back to color space, as a single measurement error in the color space read will be
                              translated into a false nucleotide sequence. Depending on where in the read this
                              measurement error occurs, the sequence either can't be mapped anymore (which means a
                              low mapping efficiency) or it will map to a wrong position in the genome and thus
                              result in false methylation calls.
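The error propagation described above can be shown with a small sketch. It assumes csfasta-style reads (primer base followed by color digits) and the standard two-base encoding where a color is the XOR of adjacent 2-bit base codes (A=0, C=1, G=2, T=3); the function names and the example read are made up for illustration.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}


def color_to_base(csread):
    """Decode a read like 'T31312' (primer base + colors) into bases."""
    prev = CODE[csread[0]]
    out = []
    for ch in csread[1:]:
        prev ^= int(ch)  # each base depends on all colors before it
        out.append(BASES[prev])
    return "".join(out)


def mismatches(x, y):
    """Count positions at which two equal-length strings differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


true_read = "T31312"   # encodes ACGTC
noisy_read = "T32312"  # same read with a single color measurement error

true_bases = color_to_base(true_read)
noisy_bases = color_to_base(noisy_read)

color_errors = mismatches(true_read, noisy_read)   # one wrong color
base_errors = mismatches(true_bases, noisy_bases)  # every base from the error onwards
```

In this example one wrong color turns ACGTC into AGCAG: four of the five decoded bases are wrong, even though the reads differ at a single position in color space.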

                              You are now suggesting that all unmapped reads should instead be aligned with
                              SOCS-B, which - even though it is a good tool - is incredibly slow for complex
                              genomes and many reads.

                              Therefore, isn't it quite a bold statement to state in the paper that "MethylCoder
                              is a novel tool that allows ... mapping in both color and nucleotide-space -
                              something that no other BS-Seq software allows"?
