Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Introducing CGAT: computational genomics analysis toolkit

    We have just published CGAT: computational genomics analysis toolkit in Bioinformatics.

    The toolkit grew out of the code base that we have developed over the years working on many genomics and NGS projects. It currently contains over 50 tools covering tasks such as converting, filtering, comparison, conversion, summrization and annotation of genomic intervals, genesets and sequences.

    The tools follow a common command-line interface, and can be integrated into workflow tools such as galaxy.

    Full documentation, including examples, recipes and tool reference is here.

    Installation of the released version on a linux system should be as easy as:
    Code:
    pip install cgat
    For more detailed instructions and troubleshooting see here.

    The project is being actively developed and we welcome community input. The actively developed code is hosted on the github repository at:
    https://github.com/CGATOxford/cgat
    Last edited by sudders; 01-09-2014, 06:02 AM. Reason: Fixed documentation link

  • #2
    FYI: The documentation link is not working. http://www.cgat.org/~andreas/documen.../cgat/cgat.htm

    Comment


    • #3
      Sorry, should now be fixed.

      Comment


      • #4
        Hi thanks for making this tool
        could you please give the R script for making the average plot and the heatmap for bam2geneprofile


        Tommy

        Comment


        • #5
          Hi Tommy,

          Thanks for your interest.

          Simple average profiles are produced automatically by the script (using matplotlib code rather than R).

          You can use R to produce more complex plots (like showing more than one profile on the same plot). A simple example is given on the page for the recipe:

          What is the binding profile of NFKB across gene models?

          If you wanted to plot more than one line on the plot (for example the data for the input) you could do some like the following (assuming that you'd already run bam2geneprofile for each sample)

          Code:
          > profile_chip <- read.csv("nfkb_profile.geneprofile.matrix.tsv.gz", header = T, stringsAsFactors = F, sep = "\t")
          
          > profile_input <- read.csv("input_profile.geneprofile.matrix.tsv.gz", header = T, stringsAsFactors = F, sep="\t")
          
          > plot(profile_chip$bin, profile_chip$counts, cex = 0, xaxt = "none")
          
          > lines(profile_chip$bin, profile_chip$counts, col = "blue")
          
          > lines(profile_input$bin, profile_input$counts, col = "red")
          
          > abline(v = c(1000, 2000), lty = 2)
          
          > mtext("upstream", adj = 0.1)
          
          > mtext("exons", adj = 0.5)
          
          > mtext("downstream", adj = 0.9)
          The data for the heatmaps is produced by bam2peakshape rather than bam2geneprofile.

          bam2peakshape doesn't produce currenlty the plots itself, but the R code to do so isn't difficult. It is also on the recipe linked above, at the bottom of the pages. I reproduce it below.

          Assuming you've run bam2peakshape with a test and control bam and output to the pattern peakshap.%s you can do the following in R

          Code:
          > library( gplots )
          
          > library( RColorBrewer )
          
          > # read the H3K4me3 matrix into R
          > me3 <- read.csv( "peakshape.matrix_peak_height.gz", header=TRUE, sep="\t", row.names=1 )
          
          > # convert to matrix
          > me3.matrix <- as.matrix( me3 )
          
          > # A proportion of NFkB intervals have no discernable H3K4me3 or H3K4me1 coverage. These are removed before plotting.
          > me3.matrix <- me3.matrix[ c( 4000, 14906 ), ]
          
          > # the remainder are plotted
          > cols <- brewer.pal( 9, "Blues" )
          
          > heatmap.2( me3.matrix, col=cols, Rowv=F, Colv=F, labRow="", key=FALSE, labCol="", trace="none", dendrogram="none", breaks=seq(0, 1000, 101) )
          
          > # A second plot can be produced for the H3K4me1 data
          > me1 <- read.csv( "peakshape.control_peak_height.gz", header=T, sep="\t", row.names=1 )
          
          > me1.matrix <- as.matrix( me3 )
          
          > me1.matrix <- me1.matrix[ c( 4000, 14906 ), ]
          
          > cols <- brewer.pal( 9, "Greens" )
          
          > heatmap.2( me1.matrix, col=cols, Rowv=F, Colv=F, labRow="", key=FALSE, labCol="", trace="none", dendrogram="none", breaks=seq(0, 100, 11))
          I hope this helps. Do let me know if I can help further.

          Ian
          ---

          Comment


          • #6
            Hi Ian,

            Thank you so much for your kind reply. meta gene plot is becoming very routine in the NGS papers. Your tool is very helpful.

            I was using Homer + R for heatmap and HTSeq for meta-gene profile.

            Thank you again.

            Tommy




            Originally posted by sudders View Post
            Hi Tommy,

            Thanks for your interest.

            Simple average profiles are produced automatically by the script (using matplotlib code rather than R).

            You can use R to produce more complex plots (like showing more than one profile on the same plot). A simple example is given on the page for the recipe:

            What is the binding profile of NFKB across gene models?

            If you wanted to plot more than one line on the plot (for example the data for the input) you could do some like the following (assuming that you'd already run bam2geneprofile for each sample)

            Code:
            > profile_chip <- read.csv("nfkb_profile.geneprofile.matrix.tsv.gz", header = T, stringsAsFactors = F, sep = "\t")
            
            > profile_input <- read.csv("input_profile.geneprofile.matrix.tsv.gz", header = T, stringsAsFactors = F, sep="\t")
            
            > plot(profile_chip$bin, profile_chip$counts, cex = 0, xaxt = "none")
            
            > lines(profile_chip$bin, profile_chip$counts, col = "blue")
            
            > lines(profile_input$bin, profile_input$counts, col = "red")
            
            > abline(v = c(1000, 2000), lty = 2)
            
            > mtext("upstream", adj = 0.1)
            
            > mtext("exons", adj = 0.5)
            
            > mtext("downstream", adj = 0.9)
            The data for the heatmaps is produced by bam2peakshape rather than bam2geneprofile.

            bam2peakshape doesn't produce currenlty the plots itself, but the R code to do so isn't difficult. It is also on the recipe linked above, at the bottom of the pages. I reproduce it below.

            Assuming you've run bam2peakshape with a test and control bam and output to the pattern peakshap.%s you can do the following in R

            Code:
            > library( gplots )
            
            > library( RColorBrewer )
            
            > # read the H3K4me3 matrix into R
            > me3 <- read.csv( "peakshape.matrix_peak_height.gz", header=TRUE, sep="\t", row.names=1 )
            
            > # convert to matrix
            > me3.matrix <- as.matrix( me3 )
            
            > # A proportion of NFkB intervals have no discernable H3K4me3 or H3K4me1 coverage. These are removed before plotting.
            > me3.matrix <- me3.matrix[ c( 4000, 14906 ), ]
            
            > # the remainder are plotted
            > cols <- brewer.pal( 9, "Blues" )
            
            > heatmap.2( me3.matrix, col=cols, Rowv=F, Colv=F, labRow="", key=FALSE, labCol="", trace="none", dendrogram="none", breaks=seq(0, 1000, 101) )
            
            > # A second plot can be produced for the H3K4me1 data
            > me1 <- read.csv( "peakshape.control_peak_height.gz", header=T, sep="\t", row.names=1 )
            
            > me1.matrix <- as.matrix( me3 )
            
            > me1.matrix <- me1.matrix[ c( 4000, 14906 ), ]
            
            > cols <- brewer.pal( 9, "Greens" )
            
            > heatmap.2( me1.matrix, col=cols, Rowv=F, Colv=F, labRow="", key=FALSE, labCol="", trace="none", dendrogram="none", breaks=seq(0, 100, 11))
            I hope this helps. Do let me know if I can help further.

            Ian
            ---

            Comment


            • #7
              Hi Ian,

              I do have another question. for bam2geneprofile, one needs to provide a gtf file.

              for bam2peakshape, one needs to provide a bed file (generated from MACS)

              If I want to generate an average plot with the ChIP-seq data ( similar to the meta-gene plot, but I am plotting the average on the peak intervals rather than the gene-model), I can still use bam2peakshape, get the matrix (the matrix is used for the heatmap), and calculate the colMeans for each bin, and then plot a line graph.

              Or I can use the bam2geneprofile, but I need to convert the MACS bed file to gtf file first.

              Am I correct?

              Thanks!
              Tommy




              Originally posted by sudders View Post
              Hi Tommy,

              Thanks for your interest.

              Simple average profiles are produced automatically by the script (using matplotlib code rather than R).

              You can use R to produce more complex plots (like showing more than one profile on the same plot). A simple example is given on the page for the recipe:

              What is the binding profile of NFKB across gene models?

              If you wanted to plot more than one line on the plot (for example the data for the input) you could do some like the following (assuming that you'd already run bam2geneprofile for each sample)

              Code:
              > profile_chip <- read.csv("nfkb_profile.geneprofile.matrix.tsv.gz", header = T, stringsAsFactors = F, sep = "\t")
              
              > profile_input <- read.csv("input_profile.geneprofile.matrix.tsv.gz", header = T, stringsAsFactors = F, sep="\t")
              
              > plot(profile_chip$bin, profile_chip$counts, cex = 0, xaxt = "none")
              
              > lines(profile_chip$bin, profile_chip$counts, col = "blue")
              
              > lines(profile_input$bin, profile_input$counts, col = "red")
              
              > abline(v = c(1000, 2000), lty = 2)
              
              > mtext("upstream", adj = 0.1)
              
              > mtext("exons", adj = 0.5)
              
              > mtext("downstream", adj = 0.9)
              The data for the heatmaps is produced by bam2peakshape rather than bam2geneprofile.

              bam2peakshape doesn't produce currenlty the plots itself, but the R code to do so isn't difficult. It is also on the recipe linked above, at the bottom of the pages. I reproduce it below.

              Assuming you've run bam2peakshape with a test and control bam and output to the pattern peakshap.%s you can do the following in R

              Code:
              > library( gplots )
              
              > library( RColorBrewer )
              
              > # read the H3K4me3 matrix into R
              > me3 <- read.csv( "peakshape.matrix_peak_height.gz", header=TRUE, sep="\t", row.names=1 )
              
              > # convert to matrix
              > me3.matrix <- as.matrix( me3 )
              
              > # A proportion of NFkB intervals have no discernable H3K4me3 or H3K4me1 coverage. These are removed before plotting.
              > me3.matrix <- me3.matrix[ c( 4000, 14906 ), ]
              
              > # the remainder are plotted
              > cols <- brewer.pal( 9, "Blues" )
              
              > heatmap.2( me3.matrix, col=cols, Rowv=F, Colv=F, labRow="", key=FALSE, labCol="", trace="none", dendrogram="none", breaks=seq(0, 1000, 101) )
              
              > # A second plot can be produced for the H3K4me1 data
              > me1 <- read.csv( "peakshape.control_peak_height.gz", header=T, sep="\t", row.names=1 )
              
              > me1.matrix <- as.matrix( me3 )
              
              > me1.matrix <- me1.matrix[ c( 4000, 14906 ), ]
              
              > cols <- brewer.pal( 9, "Greens" )
              
              > heatmap.2( me1.matrix, col=cols, Rowv=F, Colv=F, labRow="", key=FALSE, labCol="", trace="none", dendrogram="none", breaks=seq(0, 100, 11))
              I hope this helps. Do let me know if I can help further.

              Ian
              ---

              Comment


              • #8
                Hi Tommy,

                Sorry for the slow reply, i've been away. In future, if you use the CGAT user group (https://groups.google.com/forum/?fro...gat-user-group), your message will go to more people, so someone will reply to you even if i'm not around.

                As to your question:
                If I want to generate an average plot with the ChIP-seq data ( similar to the meta-gene plot, but I am plotting the average on the peak intervals rather than the gene-model), I can still use bam2peakshape, get the matrix (the matrix is used for the heatmap), and calculate the colMeans for each bin, and then plot a line graph.

                Or I can use the bam2geneprofile, but I need to convert the MACS bed file to gtf file first.
                I would recommend using the second of these two methods. The tool bed2gff will do the conversion for you.

                Code:
                  zcat my_bed_file.bed.gz 
                | cgat bed2gff -a 
                | cgat bam2geneprofile --bamfile=my_bam_file.bam --gtffile=- --method=intervalprofile --reporter=transcript
                Along with what ever normalisation and output options you want. The - in --gtfile tells bam2geneprofile to use stdin for the interval file, and the -a on bed2gff tells it to output gtf.

                Let me know if you have any further problems.

                Ian
                ---

                Comment


                • #9
                  Hi Ian,

                  Thank you very much!

                  Tommy
                  Originally posted by sudders View Post
                  Hi Tommy,

                  Sorry for the slow reply, i've been away. In future, if you use the CGAT user group (https://groups.google.com/forum/?fro...gat-user-group), your message will go to more people, so someone will reply to you even if i'm not around.

                  As to your question:


                  I would recommend using the second of these two methods. The tool bed2gff will do the conversion for you.

                  Code:
                    zcat my_bed_file.bed.gz 
                  | cgat bed2gff -a 
                  | cgat bam2geneprofile --bamfile=my_bam_file.bam --gtffile=- --method=intervalprofile --reporter=transcript
                  Along with what ever normalisation and output options you want. The - in --gtfile tells bam2geneprofile to use stdin for the interval file, and the -a on bed2gff tells it to output gtf.

                  Let me know if you have any further problems.

                  Ian
                  ---

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Current Approaches to Protein Sequencing
                    by seqadmin


                    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                    04-04-2024, 04:25 PM
                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 04-11-2024, 12:08 PM
                  0 responses
                  25 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 10:19 PM
                  0 responses
                  28 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 09:21 AM
                  0 responses
                  24 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-04-2024, 09:00 AM
                  0 responses
                  52 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X