  • How to handle duplicate data?

    I wonder how you handle duplicate data (e.g. ChIP-seq) generated from two biological replicates.

    Do you map them individually first to get their mapping locations in the genome, convert the alignments to a format such as BED, and then merge the BED files?

    Or do you merge the two FASTQ files first and then map the merged file?

    Thanks!

  • #2
    You could consider removing those duplicates using SAMTOOLS (rmdup).


    • #3
      Originally posted by kushald:
      You could consider removing those duplicates using SAMTOOLS (rmdup).
      I don't think he/she is referring to PCR duplicates, but rather to biological replicates.

      gene_x, could you explain your experiment in a bit more detail? More specifically, describe how many samples you've sequenced, how many lanes of data you have for each sample, and how many libraries were generated for each sample (if there is more than one lane of data for each sample). Then please briefly describe what you're hoping to do (find peaks in both samples, find peaks present in one set of samples vs. another, etc.).


      • #4
        I just realized my latest post is asking roughly the same question.

        To Heisman: I wanted to reanalyze/replicate some ENCODE ChIP-seq data, and I couldn't be sure how they did it based on their description. Basically, the goal is to find peaks in replicated sequencing data.


        • #5
          I see the similarity between this and your other post I just responded to:

          If the same sample was sequenced multiple times from the exact same library, then the only differences in the data would come from the sequencing itself. In this case you could align each lane of data separately and give each lane its own RG ID but the same LB and SM (library and sample) IDs.

          If the same sample was sequenced multiple times with different libraries (i.e., you prepped the sample twice), then you can do the above but make sure the LB ID differs in addition to the RG ID.
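The two cases above can be sketched as SAM @RG header lines. This is only an illustration: the helper `rg_line`, the IDs, and the default platform value are made up for the example, and how the line is passed to an aligner depends on the tool (for example, bwa mem takes such a line through its -R option).

```python
def rg_line(rg_id, library, sample, platform="ILLUMINA"):
    """Build one SAM @RG header line (tab-separated TAG:VALUE fields).

    The helper name and the default platform are assumptions for this sketch.
    """
    fields = [("ID", rg_id), ("LB", library), ("SM", sample), ("PL", platform)]
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


# Case 1: one library sequenced on two lanes.
# Different RG IDs, same LB and SM.
same_library = [
    rg_line("sampleA.lane1", "libA1", "sampleA"),
    rg_line("sampleA.lane2", "libA1", "sampleA"),
]

# Case 2: the sample was prepped twice, giving two libraries.
# Different RG IDs and different LB, same SM.
different_libraries = [
    rg_line("sampleA.lib1", "libA1", "sampleA"),
    rg_line("sampleA.lib2", "libA2", "sampleA"),
]

for line in same_library + different_libraries:
    print(line)
```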

          If you have completely different samples that are true biological replicates, then you probably don't want to merge the raw or aligned data at all; rather, you'll probably want to call peaks on the two samples separately and then compare the results in some capacity (e.g., using IDR: https://sites.google.com/site/anshul...e/projects/idr)

          Honestly, I'm not experienced enough with peak calling to give great advice, but the above should be solid.
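To make "compare the results in some capacity" concrete, here is a minimal sketch that keeps the peaks from one replicate that overlap a peak in the other. It is a plain interval overlap, not IDR, and it is far cruder than IDR. The function name, the coordinates, and the choice of BED-style half-open intervals are assumptions for the example.

```python
from collections import defaultdict


def overlapping_peaks(rep1, rep2):
    """Return peaks from rep1 that overlap at least one peak in rep2.

    Peaks are (chrom, start, end) tuples with BED-style half-open
    coordinates, so peaks that merely touch do not count as overlapping.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in rep2:
        by_chrom[chrom].append((start, end))

    shared = []
    for chrom, start, end in rep1:
        candidates = by_chrom.get(chrom, [])
        if any(start < e2 and s2 < end for s2, e2 in candidates):
            shared.append((chrom, start, end))
    return shared


# Hypothetical peak calls from two biological replicates.
rep1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
rep2 = [("chr1", 150, 250), ("chr2", 150, 300)]

print(overlapping_peaks(rep1, rep2))  # [('chr1', 100, 200)]
```

The scan is quadratic per chromosome, which is fine for a toy example but not for real peak sets; sorted intervals or a dedicated interval tool would be the usual choice there.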


          • #6
            What do RG and LB ID stand for, and what are they?


            • #7
              Originally posted by gene_x:
              What do RG and LB ID stand for, and what are they?
              RG = read group
              LB = library
              SM = sample
              PL = platform
              ID = ID (identification, haha)

              So RG:ID is shorthand for read group ID, for example.

              If you're getting into this stuff for the first time and it's not a one-off, I'd glance/read through this: http://samtools.sourceforge.net/SAM1.pdf

              The library matters when removing duplicate reads. If you sequence the same sample with different libraries, you don't want to remove reads that appear as duplicates between different libraries (because they come from different biological template strands). If you sequence the same library multiple times, though, people typically do want to remove reads that appear as duplicates, as they are more likely due to PCR amplification of the same original biological template strand (with some exceptions, particularly if you have high coverage).

              Read group can be important independently of library if some of the sequencing runs were of bad quality, and because many tools in the GATK toolset use or require RG to be set.
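The role of the library in duplicate removal can be sketched as follows. This is a simplified illustration, not how any particular tool works: the read records, the field names, and the rule "same library, chromosome, position, and strand" are assumptions for the example, and real duplicate-marking tools apply more elaborate criteria.

```python
def mark_duplicates(reads):
    """Flag a read as a duplicate only if an earlier read from the SAME
    library has the same chromosome, position, and strand.

    Returns a list of (read_name, is_duplicate) pairs in input order.
    """
    seen = set()
    flagged = []
    for read in reads:
        key = (read["library"], read["chrom"], read["pos"], read["strand"])
        flagged.append((read["name"], key in seen))
        seen.add(key)
    return flagged


# Hypothetical aligned reads; LB comes from each read's read group.
reads = [
    {"name": "r1", "library": "libA", "chrom": "chr1", "pos": 100, "strand": "+"},
    {"name": "r2", "library": "libA", "chrom": "chr1", "pos": 100, "strand": "+"},
    {"name": "r3", "library": "libB", "chrom": "chr1", "pos": 100, "strand": "+"},
    {"name": "r4", "library": "libA", "chrom": "chr1", "pos": 100, "strand": "-"},
]

for name, is_dup in mark_duplicates(reads):
    print(name, "duplicate" if is_dup else "kept")
```

Here r2 is flagged because it repeats r1 within libA, while r3 is kept even though it sits at the same position, because it comes from a different library.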
