
  • Pearson correlation between replicates

    Hi all,

    I'm analyzing some ChIP-seq data, and I'd like to see how different my replicate samples are (pairwise comparisons are fine). To be clear, I want to compare the read density across the whole genome, not just the peaks. I know that comparing peaks has been covered in other threads, but I couldn't find a description of how to generate a simple scatterplot of read density at all genomic positions (after normalization), with the corresponding Pearson correlation coefficient. It seems to be a common form of analysis, and it looks like people generally use ~200 bp windows. It also seems important to exclude regions with no reads in either sample, since they would artificially increase the correlation. Any suggestions on how to tackle this would be much appreciated!

    My reads started out as FASTQ files and have already been mapped to hg19.
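
For illustration, the calculation described above can be sketched in a few lines of plain Python. This is only a minimal sketch under stated assumptions, not a tested pipeline: it assumes the mapped reads have already been reduced to (chromosome, start position) pairs, it normalizes to reads per million (one common choice, not the only one), and the function names (`bin_counts`, `normalize`, `replicate_correlation`) are made up for this example.

```python
from collections import Counter
from math import sqrt

BIN_SIZE = 200  # window width in bp, as mentioned in the post


def bin_counts(read_starts, bin_size=BIN_SIZE):
    """Count reads per fixed-width window.

    read_starts is an iterable of (chrom, pos) pairs (hypothetical input format).
    """
    return Counter((chrom, pos // bin_size) for chrom, pos in read_starts)


def normalize(counts):
    """Scale counts to reads per million so libraries of different depth compare."""
    total = sum(counts.values())
    return {key: value * 1e6 / total for key, value in counts.items()}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / sqrt(sxx * syy)


def replicate_correlation(reads_a, reads_b, bin_size=BIN_SIZE):
    """Return (r, x, y) for two replicates; x and y can be used for a scatterplot."""
    a = normalize(bin_counts(reads_a, bin_size))
    b = normalize(bin_counts(reads_b, bin_size))
    # Bins that are empty in both samples never appear in the union of keys,
    # so they are excluded automatically.
    bins = sorted(set(a) | set(b))
    x = [a.get(key, 0.0) for key in bins]
    y = [b.get(key, 0.0) for key in bins]
    return pearson(x, y), x, y
```

The effect of shared empty bins is easy to check with `pearson` alone: the vectors [1, 2, 3, 4] and [2, 1, 4, 3] give r = 0.6, but appending 100 zero-zero bins to both pushes r above 0.9.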

  • #2
    Sounds like you'll be needing R and one of the packages from Bioconductor that can handle SAM/BAM files (unless you prefer to use the C or whatever API).



    • #3
      Thanks, that's definitely a good start, although a little more detail would be helpful. I have used R in the past, and I've looked around Bioconductor's website, but I'm still not sure how to go forward.

      Does anyone know if Galaxy's pre-loaded SAMtools offer a way to do this?



      • #4
        That's interesting. Why would you need that correlation? Why not just use a statistical model, like the ones in USeq or MACS? If you input the replicates as case and control, you should obtain few significantly different regions.
        --
        bioinfosm



        • #5
          Hi bioinfosm,

          I'm fairly new to this, so correct me if I'm wrong, but although the two approaches are superficially similar (i.e. they both are a measure of the similarity between two samples), the Pearson correlation coefficient (PCC) is a much more robust measurement because it doesn't rely on an arbitrary significance cutoff, as well as being more quantitative, allowing for simpler comparisons. I'd like to assess the reproducibility of my samples between experiments, conditions, etc, and I believe it's easier to do that with the PCC. From what I've read, it's a fairly standard approach.



          • #6
            I think it's very easy to go wrong with simple Pearson correlations on genome-wide ChIP-seq tag count profiles. I would advise going for something like this recent method: http://genomebiology.com/2012/13/3/R16/abstract

            Note, I haven't tried it myself yet



            • #7
              If you want to do genome-wide correlation in R, you will need two Bioconductor packages, Rsamtools and GenomicRanges, and also the chromosome lengths for the corresponding genome.

              The pipeline should go something like this:

              readBam -> convert bam to GRanges -> extend the reads (resize function) -> make the coverage (coverage function). Now you need to smooth the coverage a bit, because otherwise the resulting vector will be too big. This can be done easily using the first function in the Rle tips and tricks manual.
              Repeat that for each sample, put all of the data in a data frame, and use the cor function on the resulting object.

              I would advise you to go chromosome by chromosome (if you have an indexed bam file), and aggregate the data only before doing the correlation analysis.
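
To make the logic of these steps concrete, here is a rough, language-neutral sketch in plain Python. It is not the Bioconductor API: `extend_read`, `coverage`, `window_means`, and `genome_profile` are hypothetical helpers that only mimic what the resize, coverage, and smoothing steps do, and the read length (36) and fragment length (200) defaults are assumptions. Reads are assumed to be (start, strand) pairs in 0-based coordinates, grouped by chromosome.

```python
def extend_read(start, strand, read_len, fragment_len):
    """Extend a read to the assumed fragment length, keeping its 5' end fixed.

    Returns a 0-based, half-open (start, end) interval.
    """
    if strand == "+":
        return start, start + fragment_len
    end = start + read_len  # 5' end of a minus-strand read
    return max(0, end - fragment_len), end


def coverage(reads, chrom_len, read_len=36, fragment_len=200):
    """Per-base coverage of extended reads, built with a difference array."""
    diff = [0] * (chrom_len + 1)
    for start, strand in reads:
        s, e = extend_read(start, strand, read_len, fragment_len)
        e = min(e, chrom_len)  # clip fragments at the chromosome end
        if s >= e:
            continue
        diff[s] += 1
        diff[e] -= 1
    cov, depth = [], 0
    for delta in diff[:chrom_len]:
        depth += delta
        cov.append(depth)
    return cov


def window_means(cov, window=200):
    """Smooth coverage by averaging over consecutive fixed-width windows."""
    means = []
    for i in range(0, len(cov), window):
        chunk = cov[i:i + window]
        means.append(sum(chunk) / len(chunk))
    return means


def genome_profile(reads_by_chrom, chrom_lengths, window=200, **kwargs):
    """Process one chromosome at a time and aggregate only the smoothed values."""
    profile = []
    for chrom in sorted(chrom_lengths):
        cov = coverage(reads_by_chrom.get(chrom, []), chrom_lengths[chrom], **kwargs)
        profile.extend(window_means(cov, window))
    return profile
```

Running `genome_profile` once per sample with the same `chrom_lengths` and `window` yields vectors of equal length and identical bin order, which is what a Pearson correlation function needs as input. Only the smoothed windows are kept in memory across chromosomes, which is the point of going chromosome by chromosome.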

              If you want to do a comparison of peak regions, a very cool method was published in Bioinformatics recently: An effective statistical evaluation of ChIPseq dataset similarity.

              Cheers!



              • #8
                Thanks very much, tir_al, I'll give that a try!



                • #9
                  Hi all, I'm also analyzing some next-gen data. Right now, I'm handling FASTQ files from three experimental replicates and want to calculate their correlation (i.e. the correlation among the three replicates).
                  Can someone please guide me on how to go about this in R, with or without Bioconductor?

                  I'm from a math background and know how to calculate the Pearson correlation coefficient between two sets of numbers, but how do I do that for FASTQ data?


                  Thanks in advance.



                  • #10
                    The short answer is that you don't.

                    Firstly, map the reads to the appropriate genome. Then perform whatever type of quantification you need (for ChIP-seq, this would be peak calling; for RNA-seq this would be counting reads per gene/transcript/whatever; etc.) and then calculate the correlation from the resulting metrics.
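
As a small illustration of that last step, here is a sketch in plain Python that computes all pairwise Pearson correlations once each replicate has been reduced to a vector of counts over the same features (peaks, genes, or windows, in the same order). The sample names and numbers are invented toy values, and `pairwise_correlations` is a hypothetical helper, not part of any package.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / sqrt(sxx * syy)


def pairwise_correlations(samples):
    """Map each pair of sample names to the correlation of their count vectors.

    samples: dict of name -> list of counts, all over the same ordered features.
    """
    return {
        (a, b): pearson(samples[a], samples[b])
        for a, b in combinations(sorted(samples), 2)
    }


# Toy counts per feature for three replicates (made-up numbers).
counts = {
    "rep1": [10, 0, 5, 20],
    "rep2": [20, 0, 10, 40],
    "rep3": [20, 5, 0, 10],
}

for pair, r in pairwise_correlations(counts).items():
    print(pair, round(r, 3))
```

With three replicates this gives three coefficients, one per pair. Note that rep2 is exactly twice rep1 in the toy data, so that pair correlates perfectly regardless of sequencing depth.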

