  • Programs for GC content and CpG Islands

    Hi everyone,

    I am interested in determining G+C rich regions in a whole genome sequence as well as identifying possible CpG Islands.

    Can anyone recommend their favourite resources for either of these tasks?

    So far, for G+C content, I have tried Picard's CollectGCBiasMetrics (doesn't give me the right info) and GATK's GCContentByInterval walker (gives me a persistent error message) and I am just in the process of trying to run GCProfile.

    If anyone has used the GCContentByInterval walker could you perhaps give me an example of your code so that I might be able to compare and see where mine is going wrong.

    For CpG Islands I have found 'CpGIslands' but have not yet tried it.

    I am new to programming so any help would be much appreciated.

    Many thanks
    Helen
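
For the G+C question above, a minimal sketch of the underlying calculation, assuming the sequence is already in memory as a plain string. The function names, window size, and step are hypothetical choices for illustration, not part of Picard, GATK, or GCProfile:

```python
from typing import Iterator, Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G/C among unambiguous bases (A, C, G, T); N and other symbols are ignored."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def windowed_gc(seq: str, window: int, step: int) -> Iterator[Tuple[int, int, float]]:
    """Yield (start, end, gc_fraction) for each full window.

    Coordinates are 0-based and end-exclusive; a trailing partial window is skipped.
    """
    for start in range(0, len(seq) - window + 1, step):
        end = start + window
        yield start, end, gc_fraction(seq[start:end])


if __name__ == "__main__":
    toy = "ATATATATGCGCGCGCATGCATGC"  # made-up sequence, three 8 bp blocks
    for start, end, gc in windowed_gc(toy, window=8, step=8):
        print(f"{start}-{end}\t{gc:.2f}")
```

For a whole genome you would probably want much larger windows and to read one chromosome at a time rather than holding everything in a single string.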

  • #2
    If you are interested in identifying CpG islands I can recommend reading Wu et al. Biostatistics (2010) (http://www.ncbi.nlm.nih.gov/pubmed/20212320). The paper argues that some common definitions of CpG islands are too restrictive (such as the definition used by the UCSC genome browser). The authors develop a hidden Markov model to define CpG islands for arbitrary genomes.

    The paper is accompanied by software that implements their method and tables of pre-computed CpG islands using their software for many popular genomes (see http://rafalab.jhsph.edu/CGI/index.html).
    Pete
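
For context, the conventional threshold-style definitions that the paper argues are too restrictive can be sketched as below. This is not the hidden Markov model from Wu et al. The thresholds (length of at least 200 bp, GC content of at least 50%, observed/expected CpG ratio above 0.6) are the commonly quoted ones and are assumptions here, as are the function names:

```python
def cpg_stats(seq: str) -> dict:
    """Return length, GC content, and observed/expected CpG ratio for one sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() finds every CpG
    gc_content = (c + g) / n if n else 0.0
    # Observed CpG count divided by the count expected from the C and G frequencies.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return {"length": n, "gc_content": gc_content, "obs_exp": obs_exp}


def meets_threshold_criteria(seq: str,
                             min_length: int = 200,
                             min_gc: float = 0.5,
                             min_obs_exp: float = 0.6) -> bool:
    """Check a candidate region against threshold-style CpG island criteria (assumed values)."""
    stats = cpg_stats(seq)
    return (stats["length"] >= min_length
            and stats["gc_content"] >= min_gc
            and stats["obs_exp"] > min_obs_exp)


if __name__ == "__main__":
    print(cpg_stats("CGCGCGCGAT"))
    print(meets_threshold_criteria("CG" * 100))
```

A region-finding tool would also need to scan and merge windows; this only scores a single candidate region.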


    • #3
      Pete,

      Great, I think this will be very useful indeed!
      I had been trying to find an existing set of CpG Islands for Bos taurus as well.
      Many thanks!


      • #4
        Hi Helen

        I used "makeCGI" for Sus scrofa and got an .rda file in the results folder. Did you use this software for Bos taurus, and if so, how did you extract the results from the .rda file?
        Thank you in advance.

        Jamal


        • #5
          The GATK command worked for me (did you make the picard ".dict" file for your reference fasta file?):

          % java -Xmx2g -Djava.io.tmpdir=/path/to/tmp -jar /path/to/GenomeAnalysisTK-1.1-23-g8072bd9/GenomeAnalysisTK.jar -T GCContentByInterval -R /path/to/human_g1k_v37.fasta -L 1:1-100000 -o chr1_1_100000_gc.txt

          ...

          % cat chr1_1_100000_gc.txt
          1:1-100000 0.38207

          Chris
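
If you run the command over many intervals, the output can be read back into a script. A minimal sketch, assuming every line looks like the single example above (an interval of the form contig:start-end, whitespace, then the GC fraction); the `IntervalGC` type and `parse_gc_line` function are hypothetical helpers, not part of GATK:

```python
from typing import NamedTuple


class IntervalGC(NamedTuple):
    contig: str
    start: int
    end: int
    gc: float


def parse_gc_line(line: str) -> IntervalGC:
    """Parse one line such as '1:1-100000 0.38207' into its fields."""
    interval, value = line.split()
    # Split on the last ':' so contig names that contain ':' are kept whole.
    contig, _, span = interval.rpartition(":")
    start, end = span.split("-")
    return IntervalGC(contig, int(start), int(end), float(value))


if __name__ == "__main__":
    record = parse_gc_line("1:1-100000 0.38207")
    print(record.contig, record.start, record.end, record.gc)
```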


          • #6
            Hi Chris,

            I didn't make the Picard file for my genome. Please tell me how I can do that,
            and please tell me more about GATK.

            Thanks a lot

            Jamal


            • #7
              There is a link here about making the picard dict file for GATK:



              Download the latest picard from here into a new directory (for me $HOME/src on a Linux machine) and unzip it:



              Something like this works for me:

              java -jar /home/cjp64/src/picard-tools-1.53/CreateSequenceDictionary.jar R=/data/refs/archive/hg19/bowtie/hg19.fasta O=/data/refs/archive/hg19/bowtie/hg19.dict

              GATK help starts here (it's on many pages though and is more for doing SNP calls):



              Chris
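
To sanity-check the resulting .dict file, it can help to list the sequence names and lengths in the FASTA yourself. As far as I understand the format, the .dict is a SAM-style header whose @SQ lines carry the name (SN) and length (LN) of each sequence, alongside other fields. A minimal sketch, with `fasta_lengths` as a hypothetical helper and the printed lines meant only for eyeballing, not as a replacement for CreateSequenceDictionary:

```python
from typing import Iterable, Iterator, Tuple


def fasta_lengths(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (name, length) per FASTA record; the name is the first word of the header."""
    name = None
    length = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            fields = line[1:].split()
            name = fields[0] if fields else ""
            length = 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


if __name__ == "__main__":
    toy = [">chr1 some description\n", "ACGT\n", "AC\n", ">chr2\n", "NNNN\n"]
    for name, length in fasta_lengths(toy):
        # Only SN and LN are shown; the real file has more fields per line.
        print(f"@SQ\tSN:{name}\tLN:{length}")
```

With a real reference you would pass an open file handle instead of the toy list.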


              • #8
                Hi all,

                Did anyone try "makeCGI" recently?

                I am having some problems with this package.

                First, it has a lot of trouble reading chromosome/scaffold headers from the FASTA files and crashes. I reduced the headers to just the chromosome/scaffold name (deleting the rest), and it seemed to work, but then it crashed with a new warning message:

                Warning message:
                In rm(pattern = "Ngc") : object 'Ngc' not found

                Apparently, it doesn't much like finding "Ns" along the sequence.

                It creates the result file, but the file appears to be empty.

                Any suggestions? I am really new to all this, so any advice will be very welcome.

                Thanks in advance.

                Jamal, maybe it is a bit late, but I have found this to convert RDA to CSV. I thought it might be useful for other people.
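
The header trimming described above can be scripted rather than done by hand. A minimal sketch, assuming the chromosome/scaffold name is the first whitespace-delimited word of each header; `simplify_fasta` and `n_count` are hypothetical helpers, and nothing here is claimed to fix the 'Ngc' warning itself:

```python
from typing import Iterable, Iterator


def simplify_header(line: str) -> str:
    """Trim a FASTA header to '>' plus its first word; other lines pass through unchanged."""
    if not line.startswith(">"):
        return line
    fields = line[1:].split()
    return ">" + (fields[0] if fields else "")


def simplify_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines (without trailing newlines) with simplified headers."""
    for line in lines:
        yield simplify_header(line.rstrip("\n"))


def n_count(lines: Iterable[str]) -> int:
    """Count N/n characters in sequence lines, to see how much of the input is ambiguous."""
    return sum(line.upper().count("N") for line in lines if not line.startswith(">"))


if __name__ == "__main__":
    toy = [">scaffold_1 length=100 organism=x\n", "ACNNGT\n"]
    print("\n".join(simplify_fasta(toy)))
    print("N bases:", n_count(toy))
```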


                • #9
                  makeCGI: object 'Ngc' not found

                  Hi
                  I tried this program recently and ran into the same problem as you:

                  Warning message:
                  In rm(pattern = "Ngc") : object 'Ngc' not found

                  I would like to know whether you found a solution for this program.
                  Thank you in advance.

                  Originally posted by oria34 (post #8 above, quoted in full)
