Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Hi,

    I've been using Seqmonk to process and visualize my RRBS data. The data I've imported has replicates for treatment and control conditions, and I want to export a list of all the probes (CpGs) and their q-values to plot the distribution of q-values. I've tried filtering my data using the Replicate Set Stats and setting the p-value cutoff to 1 but this generates a list of the probes with p-values < 1.

    Is there anyway a full list of probes and q-values?

    Thanks

    Comment


    • Originally posted by edere View Post
      Hi,
      I've been using Seqmonk to process and visualize my RRBS data. The data I've imported has replicates for treatment and control conditions, and I want to export a list of all the probes (CpGs) and their q-values to plot the distribution of q-values. I've tried filtering my data using the Replicate Set Stats and setting the p-value cutoff to 1 but this generates a list of the probes with p-values < 1.
      In all of the statistical tests in SeqMonk you should be able to set the cutoff to 1 to generate a full list of hits regardless of significance. I just tried this in a few tests and whilst it worked OK in most of them, it failed for the replicate set stats in a few cases. Looking at the cases where it failed it was probes where all of the values in all cases were exactly the same, so there was no absolute difference and no variance. This is probably a corner case which produces an infinite value which doesn't get caught and I'll look into that.

      Since you're trying to run this on RRBS data then you're much more likely to hit this condition since I guess it will be fairly common to have repeated methylation values (especially 0 or 100%). Quantitative tests such as the replicate set stats aren't really suitable for this kind of data unless you have huge numbers of replicates (central limit theorem and all that). We'd normally use the contingency based tests (Chi-Square in SeqMonk) to do a count based significance assessment, along with a subsequent filter for a sensible level of absolute difference.

      Comment


      • I'm not exactly sure how to run the Chi-squared test in Seqmonk. When I add the two data stores that I want to compare, in this case they are my replicate sets for my control and treatment conditions, they get populated into the "Pairs" box, but I can't run the filter.

        Clearly I haven't done something correctly. How do I set up the Chi-squared test to compare the my treatment replicates against the control replicates for each probe (cytosine)?


        Originally posted by simonandrews View Post
        In all of the statistical tests in SeqMonk you should be able to set the cutoff to 1 to generate a full list of hits regardless of significance. I just tried this in a few tests and whilst it worked OK in most of them, it failed for the replicate set stats in a few cases. Looking at the cases where it failed it was probes where all of the values in all cases were exactly the same, so there was no absolute difference and no variance. This is probably a corner case which produces an infinite value which doesn't get caught and I'll look into that.

        Since you're trying to run this on RRBS data then you're much more likely to hit this condition since I guess it will be fairly common to have repeated methylation values (especially 0 or 100%). Quantitative tests such as the replicate set stats aren't really suitable for this kind of data unless you have huge numbers of replicates (central limit theorem and all that). We'd normally use the contingency based tests (Chi-Square in SeqMonk) to do a count based significance assessment, along with a subsequent filter for a sensible level of absolute difference.

        Comment


        • ChIA-PET Hi-C maps

          Hi Simon,

          I have been using Seqmonk succesfully for analysis of ChIP-seq data and now I have ChIA-PET data to analyse. I have the reads mapped in a BAM file, I have seperate BAM files for the different linker combinations generated by ChIA-PET. I manage to visualise them in a Hi-C heatmap in Seqmonk but got a bit worried here because there is some statistics applied before the heatmap is generated. I think the statistics used for Hi-C experiments vs statistics for ChIA-PET experiments must be different as ChIA-PET is a ChIP (and therefore enriched for loci to begin with) derived method. I have been looking for the exact statistics but I cannot find them anywhere, only a brief mentioning in one of the tutorial video's. Can you tell me where I can find what statistics are applied? Is there any option to visualize the data in a heatmap without applying the statistics?
          I would also like to substract interactions found for one linker combination from interactions found for another linker combination. Is this possible in Seqmonk?
          Thank you very much for your help!

          Comment


          • Sorry to take so long to reply to you - I'm a bit behind on my mails right now!

            The statistical model for HiC in SeqMonk is based on the expectation that random interactions should be distributed in line with the level of fragment end coverage within each region, so that regions which have a large number of fragment ends in them should also be more likely to form part of any random interactions. This works well for methods where you expect fairly even fragment end coverage (such as normal HiC) but is less good where you have uneven distributions (any kind of intentional enrichment). In these cases the method still works, but if your enrichment is very strong then you may find that you get highly significant results from small numbers of interactions with generally depleted regions, which may not be biologically sensible.

            If you want to view the heatmap without applying any filtering then you should be able to set the cutoffs to not filter (p<1 and diff >0). I might be mis-remembering but I may have changed some of this code recently to allow this kind of construction (I think the filters may have been restricted to showing only interactions with Obs/Exp > 1 whatever you selected). You could try the development snapshot at http://www.bioinformatics.babraham.a...28.0_devel.zip which has some fixes and other HiC improvements which might be useful.

            Comment


            • Hi Simon,

              I had a look at the v28 development of Seqmonk and indeed I was able to visualize the heatmap without any filtering (even obs/exp = 0) which is great and useful for my application!
              As I mentioned in my previous post I would like to substract interactions found for one dataset (one kind of linker combination) from interactions found for another dataset (another kind of linker combination).
              Am I right that the interactions that seqmonk finds are only 'present' in the HiC heatmap and not in e.g. the genome view? Because in the genome view with the probes showing enrichment I can substract one dataset from the other but this is then not visualised in the HiC heatmap (or the report from the HiC heatmap). Is there any other way of getting a report for the interactions found besides the report from the heatmap?
              Please let me know if it is possible to substract interactions for one dataset from another (and how)!

              Thank you very much in advance!

              Comment


              • Originally posted by Aspadia View Post
                Hi Simon,

                I had a look at the v28 development of Seqmonk and indeed I was able to visualize the heatmap without any filtering (even obs/exp = 0) which is great and useful for my application!
                As I mentioned in my previous post I would like to substract interactions found for one dataset (one kind of linker combination) from interactions found for another dataset (another kind of linker combination).
                Am I right that the interactions that seqmonk finds are only 'present' in the HiC heatmap and not in e.g. the genome view? Because in the genome view with the probes showing enrichment I can substract one dataset from the other but this is then not visualised in the HiC heatmap (or the report from the HiC heatmap). Is there any other way of getting a report for the interactions found besides the report from the heatmap?
                Please let me know if it is possible to substract interactions for one dataset from another (and how)!

                Thank you very much in advance!
                You're right that the normal subtraction of probe lists won't be of any use for subtracting HiC interaction lists. At the moment there isn't a way to do this within SeqMonk. I've been planning a proper fix for this for a while - the idea would be that the calculation of interaction lists would be a separate process which would introduce another folder in the data view of interaction sets. These could then be put into views such as the heatmap view (meaning you don't have to re-calculate each time) and would also then be able to be combined and filtered in different ways. Unfortunately this is a lot of work as it messes quite deeply with the core structure of seqmonk and I just haven't had time to work on it.

                For the time being therefore any subtraction would have to be done by exporting the lists from the heatmap view and then using an external script to match up the ids to find the common and different points.

                Comment


                • I’ve been working with a mRRBS dataset and have a bit of time using both methylKit and Seqmonk. I’ve used Bismark to align my bisulfite sequence reads, sorted the outputted SAM files by genomic location and then loaded the data into R using the read.bismark function in methylKit and looking only at the CpG context.

                  Using the same SAM files, I used Bismark’s methylation extractor to generate the CpG context OT and OB methylation calls, and then loaded those files into Seqmonk.

                  When I randomly look at the methylRaw values can compare the locations of methylated or unmethylated cytosines and compare them with data loaded into Seqmonk, I’m finding that there are methylRaw values have data for locations that do not correspond with the data loaded into Seqmonk.

                  Has anyone else experienced this and/or have a reason why methylKit and Seqmonk don’t have the same results?


                  Thanks,
                  Ed

                  Comment


                  • Originally posted by edere View Post
                    When I randomly look at the methylRaw values can compare the locations of methylated or unmethylated cytosines and compare them with data loaded into Seqmonk, I’m finding that there are methylRaw values have data for locations that do not correspond with the data loaded into Seqmonk.
                    That's interesting - the seqmonk data should be pretty unambiguous since it will show all of the raw calls put out by the methylation extractor. Just to be sure have you done a simple read count in the seqmonk data track and confirmed that the counts for those regions really are 0? I know that internally we fixed a bug where some reads didn't show up, but I'm pretty sure that only affected the development version which isn't released.

                    There are plenty of options for how to turn a set of calls into methylation percentages (and there will be some new options in the next seqmonk release) but there shouldn't be any ambiguity over the presence/absence of reads.

                    Another sanity check would be to take the location of one of the disputed regions and then use something like samtools to extract the reads for that region from the BAM files produced by bismark. That would let you see if there should have been CpG calls in the region and if there were then we could track down at which subsequent point they disappeared and why.

                    Comment


                    • Originally posted by simonandrews View Post
                      That's interesting - the seqmonk data should be pretty unambiguous since it will show all of the raw calls put out by the methylation extractor. Just to be sure have you done a simple read count in the seqmonk data track and confirmed that the counts for those regions really are 0? I know that internally we fixed a bug where some reads didn't show up, but I'm pretty sure that only affected the development version which isn't released.

                      There are plenty of options for how to turn a set of calls into methylation percentages (and there will be some new options in the next seqmonk release) but there shouldn't be any ambiguity over the presence/absence of reads.

                      Another sanity check would be to take the location of one of the disputed regions and then use something like samtools to extract the reads for that region from the BAM files produced by bismark. That would let you see if there should have been CpG calls in the region and if there were then we could track down at which subsequent point they disappeared and why.
                      Hi Simon,

                      I managed to track down the problem. It was something that I did that I forgot to take into account when comparing the analysis methods. I forgot that I used the --ignore argument when running the methylation extraction.

                      When I ran the methylation extraction again without the --ignore option and compared the calls in Seqmonk and methylkit, everything matched.

                      Thanks for your suggestions though.

                      Ed

                      Comment


                      • Importing Bismark txt file into Seqmonk

                        I had a quick question about importing Bismark methylation extractor output into SeqMonk (as in Under File -> Import Data -> Bismark). Is this feature not included in version 27 anymore? I am sure we have used this feature in our previous publications with earlier versions. Is there a workaround?

                        thanks
                        --Kartik

                        Comment


                        • Originally posted by kshankar View Post
                          I had a quick question about importing Bismark methylation extractor output into SeqMonk (as in Under File -> Import Data -> Bismark). Is this feature not included in version 27 anymore? I am sure we have used this feature in our previous publications with earlier versions. Is there a workaround?

                          thanks
                          --Kartik
                          The File -> Import Data -> Bismark import only ever worked with the vanilla Bismark output (pre-SAM/BAM). Files from the methylation extractor may be imported via the Generic Text Import filter. Once there, select col 3 as chromosome, col 4 as both start and end positions, and col 2 as strand (= methylation state).

                          Comment


                          • Hi all, are there any options that forces SeqMonk to use multiple processor cores?

                            I was running a intensity difference filter over a methylation call datasets from bismark. I was comparing two different individuals with a genome size of approx 1Gb. The entire comparison took two weeks and it gave me zero hits, haha. Just want to trial and error on this software so I was wondering any ways to force the analysis to use up more resources so that it is faster?

                            Also, is this the correct way of doing differentially methylated region analysis on SeqMonk?

                            All guidance and criticisms are highly welcomed! Newbie here just transitioning between wet and dry lab.

                            BTW, my computing cluster is allocated 8 cores of [email protected]. Total memory was 31.4Gb.

                            Comment


                            • Originally posted by kentawan View Post
                              Hi all, are there any options that forces SeqMonk to use multiple processor cores?

                              I was running a intensity difference filter over a methylation call datasets from bismark. I was comparing two different individuals with a genome size of approx 1Gb. The entire comparison took two weeks and it gave me zero hits, haha. Just want to trial and error on this software so I was wondering any ways to force the analysis to use up more resources so that it is faster?

                              Also, is this the correct way of doing differentially methylated region analysis on SeqMonk?
                              Wow! I'm kind of impressed you waited that long for it to finish, but unfortunately, as you found, it wasn't really an appropriate test to use anyway, and if it took that long to run I guess you were using a huge number of probes, so even if you had found things which were somewhat significant they would have been extremely unlikely to survive multiple testing correction for that many tests.

                              The intensity difference filter is only useful in a case where:
                              1. You have a measure where the size of the measure and the noise are correlated - generally anything which is a direct manipulation of a sequence count
                              2. You expect that the majority of the data will not be changing such that it makes sense to look for outliers after creating a model over the rest of the data.

                              ..and unfortunately neither of these really applies to methlyation data.

                              There are a few different ways to look at methlyation data depending on what you're looking for, but a typical recipe we'd use would be something like this:
                              1. Decide where you're going to measure. It almost never makes sense to analyse individual cytosines as the coverage for each will generally be poor and differences will never be significant. If you're targeting something specific like gene bodies, promoters or CpG islands then put probes over those, if not then tile probes of an appropriate size (our default is 3-5kb if we have reasonably good coverage - go larger if your coverage isn't great). Use whichever probe generator is appropriate to make probes over these regions.
                              2. Quantitate your data using the bisulphite quantitation pipeline using the probes you made in the step above. I'd suggest setting the minimum coverage depth and the measures per feature both to 1 to start with (this isn't the default at the moment, but it will be in the next release)
                              3. Decide on the minimum absolute methylation change you're prepared to care about. In well measured regions you might find that a change of 1% or less will be significant, but you want to consider the biological context and whether this is likely to mean anything. We'd probably normally start looking at a minimum 5-10% change.
                              4. Run a contingency based test (Filtering > Statistical Test > Chi Square > For/Rev) and select the two datasets you want to compare. Set the minimum difference to the value you selected in the last step. This will give you an initial hit list.
                              5. Select the hit list you got from the contingency test and then filter this using your current quantitation using the differences filter (Filter > Value difference > Individual probes). Select both of your datasets in both option lists and set the minimum difference to be the cutoff you selected before.


                              You should now have a list of probes which show a methylation change which is both statistically and biologically significant. If you did tiled probes then the next step would be to try to relate the positions you found to biological features to try to understand why they were selected.

                              There are many other ways to go about this, but hopefully this will at least get you started.

                              Comment


                              • Originally posted by simonandrews View Post
                                Wow! I'm kind of impressed you waited that long for it to finish, but unfortunately, as you found, it wasn't really an appropriate test to use anyway, and if it took that long to run I guess you were using a huge number of probes, so even if you had found things which were somewhat significant they would have been extremely unlikely to survive multiple testing correction for that many tests.

                                The intensity difference filter is only useful in a case where:
                                1. You have a measure where the size of the measure and the noise are correlated - generally anything which is a direct manipulation of a sequence count
                                2. You expect that the majority of the data will not be changing such that it makes sense to look for outliers after creating a model over the rest of the data.

                                ..and unfortunately neither of these really applies to methlyation data.

                                There are a few different ways to look at methlyation data depending on what you're looking for, but a typical recipe we'd use would be something like this:
                                1. Decide where you're going to measure. It almost never makes sense to analyse individual cytosines as the coverage for each will generally be poor and differences will never be significant. If you're targeting something specific like gene bodies, promoters or CpG islands then put probes over those, if not then tile probes of an appropriate size (our default is 3-5kb if we have reasonably good coverage - go larger if your coverage isn't great). Use whichever probe generator is appropriate to make probes over these regions.
                                2. Quantitate your data using the bisulphite quantitation pipeline using the probes you made in the step above. I'd suggest setting the minimum coverage depth and the measures per feature both to 1 to start with (this isn't the default at the moment, but it will be in the next release)
                                3. Decide on the minimum absolute methylation change you're prepared to care about. In well measured regions you might find that a change of 1% or less will be significant, but you want to consider the biological context and whether this is likely to mean anything. We'd probably normally start looking at a minimum 5-10% change.
                                4. Run a contingency based test (Filtering > Statistical Test > Chi Square > For/Rev) and select the two datasets you want to compare. Set the minimum difference to the value you selected in the last step. This will give you an initial hit list.
                                5. Select the hit list you got from the contingency test and then filter this using your current quantitation using the differences filter (Filter > Value difference > Individual probes). Select both of your datasets in both option lists and set the minimum difference to be the cutoff you selected before.


                                You should now have a list of probes which show a methylation change which is both statistically and biologically significant. If you did tiled probes then the next step would be to try to relate the positions you found to biological features to try to understand why they were selected.

                                There are many other ways to go about this, but hopefully this will at least get you started.
                                Thank you very much Simon for the headstart guide. I made significant progress with my data analysis over the week.

                                Just one question though, what does the output value of the DNA bisulphite quantitation pipeline actually means? Does higher value means higher methylation counts on a specific probes or does it mean higher percentage of methylation on that basepair out of all calls (methylated + non-methylated)? My input files are actually CpG methylation call files from bismark. From my understanding, the output files of bismark CpG calles are all 1bp long, where the + strand reads are methylated C while - strand reads are unmethylated c, am I right?

                                Thanks and regards, hope to hear from you soon.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM
                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 06:37 PM
                                0 responses
                                10 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 06:07 PM
                                0 responses
                                9 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2024, 10:03 AM
                                0 responses
                                49 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-21-2024, 07:32 AM
                                0 responses
                                67 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X