  • Is my ChIP-seq data garbage?

    I received some ChIP-seq data with a very high level of sequence duplication (over 90% of the reads). The experiment was looking at H3K4me3. I aligned with bowtie2 and ran samtools rmdup, and ended up with only about 1 million unique mapped reads. Most of the peaks MACS is calling have only 5 reads in them. I'm wondering if the data is complete garbage or if I can get something legitimate out of these peaks?
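    For reference, a minimal sketch of that kind of pipeline (file names and the bowtie2 index are hypothetical, and the samtools commands use the old 0.1.x-era syntax that still had rmdup):

      # align single-end reads with bowtie2
      bowtie2 -x genome_index -U chipseq_reads.fastq.gz -S chipseq.sam
      # convert, sort, and remove duplicates (-s = single-end mode)
      samtools view -bS chipseq.sam > chipseq.bam
      samtools sort chipseq.bam chipseq_sorted
      samtools rmdup -s chipseq_sorted.bam chipseq_dedup.bam
      # count how many unique mapped reads survive
      samtools flagstat chipseq_dedup.bam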

  • #2
    It seems so. Check the MACS model file: if the Watson/Crick peak distance (d) is small, the data is probably useless. You may also want to check the reads with FastQC; duplication this high could well be due to adapters.
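    For example (a sketch; MACS names the model script NAME_model.r, and the FASTQ name is hypothetical):

      # screen the raw reads for adapter contamination and duplication
      fastqc chipseq_reads.fastq.gz
      # plot the Watson/Crick shifted tag densities and the estimated d
      R --vanilla < E_2_model.r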



    • #3
      Originally posted by xubeisi View Post
      ... check the MACS model file: if the Watson/Crick peak distance (d) is small, the data is probably useless...
      How small are we talking about?



      • #4
        Originally posted by Tobikenobi View Post
        How small are we talking about?
        ~100 should be fine; to me, samples with d less than 50 are trash.



        • #5
          Have you actually looked at your data (both before and after deduplication)?

          Simply looking at the pattern of mapped reads will very quickly tell you whether you're wasting your time spending more effort on your analysis.
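          One quick way to get such a view (a sketch assuming bedtools and position-sorted BAMs with hypothetical names; load the tracks in a genome browser such as IGV):

            # coverage track from the raw (non-deduplicated) alignments
            bedtools genomecov -ibam chipseq_sorted.bam -bg > before_dedup.bedgraph
            # the same track after duplicate removal, for comparison
            bedtools genomecov -ibam chipseq_dedup.bam -bg > after_dedup.bedgraph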



          • #6
            Sorry to hijack this thread...

            Originally posted by xubeisi View Post
            ~100 should be fine; to me, samples with d less than 50 are trash.
            Depending on what number I enter as mfold in MACS (>10), I can get anything from d=51 to d=118. Does that tell me anything, and is it desirable to go for the highest d possible?
            Thank you very much!
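            One way to see how d moves with mfold is simply to rerun MACS across a range of values (a sketch in MACS2 syntax, where --mfold takes a lower and an upper bound; MACS 1.x, as used in this thread, took a single number; file names are hypothetical, and the genome size is taken from the output quoted below):

              # d is reported in the header of each NAME_peaks.xls
              for m in 10 15 20 30; do
                  macs2 callpeak -t chip.bam -c input.bam -n mfold_$m \
                      -g 1.87e9 --mfold 5 $m
              done
              grep "# d" mfold_*_peaks.xls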



            • #7
              Originally posted by simonandrews View Post
              Have you actually looked at your data (both before and after deduplication)?

              Simply looking at the pattern of mapped reads will very quickly tell you whether you're wasting your time spending more effort on your analysis.
              Could you please specify what you mean by `before and after deduplication`?

              Also, what would I expect to see in the case of high duplication levels (I am looking at ~75% duplication according to FastQC myself)?



              • #8
                Originally posted by Tobikenobi View Post
                Could you please specify what you mean by `before and after deduplication`?

                Also, what would I expect to see in the case of high duplication levels (I am looking at ~75% duplication according to FastQC myself)?
                High duplication can come from a few different sources. It could be that you've got very well enriched data and have saturated the coverage of the enriched regions, in which case it would be bad to deduplicate. It could be that you have a very small number of regions with enormous coverage, or you could have more general low-level duplication across your whole library. A 75% duplication level could come from most sequences appearing exactly twice in your data, or from a small number of sites with huge coverage.

                If you look at the mapped data before you've done any deduplication you will be able to see whether there is a problem. If you see nicely enriched peaks with even coverage over them then the data might well be OK as it is. If you can see obviously biased coverage with more isolated towers of reads where you have duplication then you would need to deduplicate to stand any chance of getting sensible results out of your data.

                Don't think that you should always deduplicate your data. There are definite down sides to doing so - for high coverage regions you can end up compressing the dynamic range of your data and reducing the amount of information you have to work with. It can help in some cases, but when we're analysing data we would only deduplicate if we could see that there was a problem with the data which deduplication would help to fix.
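                A quick way to tell those scenarios apart (a rough sketch that keys reads on chromosome and start position only, ignoring strand; the BAM name is hypothetical): histogram the per-position read counts. Mostly 2s suggests general low-level duplication; a long tail of very large counts suggests isolated towers.

                  # left column = number of positions, right = reads per position
                  samtools view -F 4 chipseq_sorted.bam | cut -f 3,4 | sort | uniq -c \
                      | awk '{print $1}' | sort -n | uniq -c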



                • #9
                  Originally posted by simonandrews View Post
                  High duplication can come from a few different sources. It could be that you've got very well enriched data and have saturated the coverage of the enriched regions, in which case it would be bad to deduplicate. It could be that you have a very small number of regions with enormous coverage, or you could have more general low-level duplication across your whole library. A 75% duplication level could come from most sequences appearing exactly twice in your data, or from a small number of sites with huge coverage.

                  If you look at the mapped data before you've done any deduplication you will be able to see whether there is a problem. If you see nicely enriched peaks with even coverage over them then the data might well be OK as it is. If you can see obviously biased coverage with more isolated towers of reads where you have duplication then you would need to deduplicate to stand any chance of getting sensible results out of your data.

                  Don't think that you should always deduplicate your data. There are definite down sides to doing so - for high coverage regions you can end up compressing the dynamic range of your data and reducing the amount of information you have to work with. It can help in some cases, but when we're analysing data we would only deduplicate if we could see that there was a problem with the data which deduplication would help to fix.
                  Thank you very much for your help!
                  I actually looked at the data before and after filtering for duplicates and have attached a picture of my four samples before (top four tracks) and after deduplication (lower four tracks). Your second suggestion of isolated towers seems to be the case, as I saw similar things across all chromosomes.
                  I then went on to try peak calling on my original files (only clipped the adapters and trimmed a little off the 3' end), for which I randomly selected and omitted lines in the input to get equal numbers of tags. MACS then gives the following output in the peaks.xls file:

                  # This file is generated by MACS
                  # ARGUMENTS LIST:
                  # name = E_2_mfold_20
                  # format = SAM
                  # ChIP-seq file = /galaxy/main_pool/pool7/files/005/979/dataset_5979847.dat
                  # control file = /galaxy/main_pool/pool7/files/005/965/dataset_5965128.dat
                  # effective genome size = 1.87e+09
                  # tag size = 50
                  # band width = 300
                  # model fold = 20
                  # pvalue cutoff = 1.00e-05
                  # Ranges for calculating regional lambda are : peak_region,1000,5000,10000
                  # unique tags in treatment: 2868667
                  # total tags in treatment: 22927127
                  # unique tags in control: 8014554
                  # total tags in control: 22927127

                  # d = 51

                  The unique tags in the treatment are especially low compared to the control (2,868,667 / 22,927,127 ≈ 12.5% unique, i.e. roughly 87.5% duplication, versus 8,014,554 / 22,927,127 ≈ 35% unique in the control). This makes the FDR unreliable.

                  Is it advisable to deduplicate the data and then try peak calling?
                  Also, as I have two replicates, would it be reasonable to combine them to obtain more unique reads, and then try the peak calling again?

                  Again, thank you very much for your input!
                  Attached Files
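                  Should pooling turn out to be justified, combining the replicates is straightforward (a sketch with hypothetical file names):

                    # pool the two replicates into one BAM before peak calling
                    samtools merge combined.bam replicate1.bam replicate2.bam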



                  • #10
                    It might be worth noting that MACS does an internal deduplication of your data whilst peak calling. It works out the likely duplication level in your data and then removes any tags which are duplicated above that level when calling peaks. It may not remove as much data as doing a complete strict deduplication, but it does look at this information.
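                    In MACS2, this behaviour is exposed through the --keep-dup option (a sketch; file names are hypothetical). --keep-dup auto lets MACS estimate the acceptable duplication level as described above, --keep-dup 1 keeps one tag per position, and --keep-dup all disables the filter.

                      # let MACS decide how many duplicates per position to keep
                      macs2 callpeak -t chip.bam -c input.bam -n sample --keep-dup auto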

                    I had a look at the image you posted, but at that resolution it's hard to see what's going on. It's not unusual to see a few huge outliers in the data (which can skew the scale on the y-axis); it's what happens at a more local level that matters, especially the actual pattern of mapped reads rather than quantitated values.



                    • #11
                      So if I understand correctly, it may not be necessary to deduplicate the data before using MACS at all, as it attempts this on its own.
                      Moreover, if I deduplicated myself, I would also be removing true duplicates that simply arise from sequencing depth. So deduplicating would only make sense if I wanted the accurate FDR from MACS, which I can only get if I adjust the unique tag number beforehand?

