  • Pearson correlation between replicates

    Hi all,

    I'm analyzing some ChIP-seq data, and I'd like to see how different my replicate samples are (pairwise comparisons are fine). To be clear, I want to compare the read density across the whole genome, not just the peaks. I know the process of comparing peaks has been covered in other threads, but I couldn't find a description of how to generate a simple scatterplot of read density at all genomic positions (after normalization), with the corresponding Pearson correlation coefficient. It seems to be a common form of analysis, and it looks like people generally use ~200 bp windows. It also seems important to eliminate regions with no reads in either sample, to avoid artificially increasing the correlation. Any suggestions on how to tackle this would be much appreciated!

    I'm dealing with FASTQ files, already mapped to hg19.
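
The approach described in this post (fixed windows, normalization, dropping windows that are empty in both samples, then a Pearson correlation) can be sketched roughly as follows. This is only an illustration in plain Python, not a method given in the thread: the function names, the reads-per-million scaling, and the toy read positions are all assumptions, and real data would first have to be read from the mapped files.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes both sequences vary (non-zero variance).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def window_counts(read_starts, chrom_length, window=200):
    """Count reads per fixed-size window from 0-based read start positions."""
    counts = [0] * ((chrom_length + window - 1) // window)
    for pos in read_starts:
        counts[pos // window] += 1
    return counts

def replicate_correlation(counts_a, counts_b):
    """Scale to reads per million, drop windows empty in both, then correlate."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    pairs = [
        (a * 1e6 / total_a, b * 1e6 / total_b)
        for a, b in zip(counts_a, counts_b)
        if a > 0 or b > 0  # keep a window only if at least one sample has reads
    ]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)

# Toy data: hypothetical read start positions on a 1,000 bp chromosome.
rep1 = window_counts([10, 50, 150, 420, 430, 810], 1000)
rep2 = window_counts([20, 60, 410, 450, 460, 820], 1000)
r = replicate_correlation(rep1, rep2)
```

On this toy input the filtered correlation is about 0.5, whereas keeping the two windows that are empty in both replicates would give about 0.85, which illustrates the inflation mentioned above.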

  • #2
    Sounds like you'll be needing R and one of the packages from Bioconductor that can handle SAM/BAM files (unless you prefer to use the C or whatever API).

    • #3
      Thanks, that's definitely a good start, although a little more detail would be helpful. I have used R in the past, and I've looked around Bioconductor's website, but I'm still not sure how to go forward.

      Does anyone know if Galaxy's pre-loaded SAMtools offer a way to do this?

      • #4
        That's interesting. Why would you need that correlation? Why not just use a statistical model, like the ones in USeq or MACS, input the replicates as case and control, and check that you get few significantly different regions?
        --
        bioinfosm

        • #5
          Hi bioinfosm,

          I'm fairly new to this, so correct me if I'm wrong, but although the two approaches are superficially similar (i.e., they both measure the similarity between two samples), the Pearson correlation coefficient (PCC) is a much more robust measurement because it doesn't rely on an arbitrary significance cutoff. It is also more quantitative, which allows for simpler comparisons. I'd like to assess the reproducibility of my samples between experiments, conditions, etc., and I believe it's easier to do that with the PCC. From what I've read, it's a fairly standard approach.

          • #6
            I think it's very easy to go wrong with simple Pearson correlations on genome-wide ChIP-seq tag count profiles. I would advise going for something like this recent method: http://genomebiology.com/2012/13/3/R16/abstract

            Note that I haven't tried it myself yet.

            • #7
              If you want to do genome-wide correlation in R, you will need two Bioconductor packages, Rsamtools and GenomicRanges, as well as the chromosome lengths for the corresponding genome.

              The pipeline should go something like this:

              readBam -> convert bam to GRanges -> extend the reads (resize function) -> make the coverage (coverage function). Now you need to smooth the coverage a bit, because otherwise the resulting vector will be too big. This can be done easily using the first function in the Rle tips and tricks manual.
              Repeat that for each sample, put all of the data in a data frame, and use the cor function on the resulting object.

              I would advise you to go chromosome by chromosome (if you have an indexed bam file), and aggregate the data only before doing the correlation analysis.
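
The steps in this post (extend the reads, compute coverage, smooth it, work chromosome by chromosome, aggregate at the end) might look something like the sketch below. It is written in plain Python purely for illustration and does not use the Bioconductor functions named above. The input layout (5' end position plus strand per read), the function names, and the tiny toy genome are assumptions.

```python
def extended_coverage(reads, chrom_length, fragment_length=200):
    """Per-base coverage after extending each read to the fragment length.

    `reads` holds (five_prime_position, strand) pairs, 0-based.
    Plus-strand reads extend to the right, minus-strand reads to the left.
    """
    diff = [0] * (chrom_length + 1)  # difference array: +1 at start, -1 past end
    for five_prime, strand in reads:
        if strand == "+":
            start, end = five_prime, five_prime + fragment_length
        else:
            start, end = five_prime - fragment_length + 1, five_prime + 1
        start, end = max(start, 0), min(end, chrom_length)  # clip to chromosome
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    coverage, depth = [], 0
    for delta in diff[:chrom_length]:
        depth += delta
        coverage.append(depth)
    return coverage

def window_means(coverage, window=200):
    """Smooth by averaging coverage over consecutive fixed-size windows."""
    means = []
    for i in range(0, len(coverage), window):
        chunk = coverage[i:i + window]  # the last chunk may be shorter
        means.append(sum(chunk) / len(chunk))
    return means

def genome_profile(reads_by_chrom, chrom_lengths, fragment_length=200, window=200):
    """Process one chromosome at a time, then concatenate the smoothed vectors."""
    profile = []
    for chrom in sorted(chrom_lengths):
        cov = extended_coverage(
            reads_by_chrom.get(chrom, []), chrom_lengths[chrom], fragment_length
        )
        profile.extend(window_means(cov, window))
    return profile

# Toy genome and reads (hypothetical values, tiny sizes so the result is easy to check).
chrom_lengths = {"chr1": 10, "chr2": 6}
sample = {"chr1": [(0, "+"), (9, "-")], "chr2": [(2, "+")]}
profile = genome_profile(sample, chrom_lengths, fragment_length=4, window=5)
```

Building one such profile per sample with the same chromosome order and window size gives vectors of equal length, which can then be passed to a correlation function (`cor` in R).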

              If you want to do a comparison of peak regions, a very cool method was published in Bioinformatics recently: An effective statistical evaluation of ChIPseq dataset similarity.

              Cheers!

              • #8
                Thanks very much, tir_al, I'll give that a try!

                • #9
                  Hi all, I'm also analyzing some next-gen data. Right now, I'm handling FASTQ files from three experimental replicates and want to calculate their correlation (i.e., the correlation between the three replicates).
                  Can someone please guide me on how to go about this in R, with or without Bioconductor?

                  I'm from a math background and know how to calculate the Pearson correlation coefficient between two sets of numbers, but how do I do that for FASTQ data?


                  Thanks in advance.

                  • #10
                    The short answer is that you don't.

                    Firstly, map the reads to the appropriate genome. Then perform whatever type of quantification you need (for ChIP-seq, this would be peak calling; for RNA-seq this would be counting reads per gene/transcript/whatever; etc.) and then calculate the correlation from the resulting metrics.
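
As a rough illustration of the last step for three replicates: once each replicate has been reduced to a vector of per-feature metrics, the pairwise correlations can be computed directly. The sketch below is plain Python rather than R, and the per-gene counts, sample names, and function names are made up for the example.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def pairwise_correlations(samples):
    """Return {(name_a, name_b): r} for every pair of samples."""
    return {
        (a, b): pearson(samples[a], samples[b])
        for a, b in combinations(sorted(samples), 2)
    }

# Hypothetical read counts for five features (e.g. genes or peaks) in three replicates.
counts = {
    "rep1": [10, 0, 250, 31, 7],
    "rep2": [20, 0, 500, 62, 14],
    "rep3": [12, 1, 240, 35, 5],
}
result = pairwise_correlations(counts)
for pair, r in sorted(result.items()):
    print(pair, round(r, 4))
```

Because Pearson correlation is unaffected by rescaling a sample, a replicate sequenced at twice the depth (rep2 here) still correlates perfectly with rep1. A few very large counts can dominate the coefficient, though, so counts are often log-transformed before correlating.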
