SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
ChIP-Seq: Enabling Data Analysis on High-Throughput Data in Large Data Depository Usi Newsbot! Literature Watch 1 04-18-2018 10:50 PM
Cufflinks - Nature Biotech data sets adrian Bioinformatics 1 04-16-2011 05:40 PM
public data sets muchomaas Bioinformatics 2 06-08-2010 02:48 AM
sff_extract: combining data from 454 Flx and Titanium data sets agroster Bioinformatics 7 01-14-2010 11:19 AM
SeqMonk - Flexible analysis of mapped reads simonandrews Bioinformatics 7 07-24-2009 05:12 AM

Reply
 
Thread Tools
Old 04-22-2014, 01:58 AM   #261
Aspadia
Junior Member
 
Location: Europe

Join Date: Aug 2013
Posts: 4
Default

Hi Simon,

I had a look at the v28 development of Seqmonk and indeed I was able to visualize the heatmap without any filtering (even obs/exp = 0) which is great and useful for my application!
As I mentioned in my previous post I would like to substract interactions found for one dataset (one kind of linker combination) from interactions found for another dataset (another kind of linker combination).
Am I right that the interactions that seqmonk finds are only 'present' in the HiC heatmap and not in e.g. the genome view? Because in the genome view with the probes showing enrichment I can substract one dataset from the other but this is then not visualised in the HiC heatmap (or the report from the HiC heatmap). Is there any other way of getting a report for the interactions found besides the report from the heatmap?
Please let me know if it is possible to substract interactions for one dataset from another (and how)!

Thank you very much in advance!
Aspadia is offline   Reply With Quote
Old 04-28-2014, 01:19 AM   #262
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by Aspadia View Post
Hi Simon,

I had a look at the v28 development of Seqmonk and indeed I was able to visualize the heatmap without any filtering (even obs/exp = 0) which is great and useful for my application!
As I mentioned in my previous post I would like to substract interactions found for one dataset (one kind of linker combination) from interactions found for another dataset (another kind of linker combination).
Am I right that the interactions that seqmonk finds are only 'present' in the HiC heatmap and not in e.g. the genome view? Because in the genome view with the probes showing enrichment I can substract one dataset from the other but this is then not visualised in the HiC heatmap (or the report from the HiC heatmap). Is there any other way of getting a report for the interactions found besides the report from the heatmap?
Please let me know if it is possible to substract interactions for one dataset from another (and how)!

Thank you very much in advance!
You're right that the normal subtraction of probe lists won't be of any use for subtracting HiC interaction lists. At the moment there isn't a way to do this within SeqMonk. I've been planning a proper fix for this for a while - the idea would be that the calculation of interaction lists would be a separate process which would introduce another folder in the data view of interaction sets. These could then be put into views such as the heatmap view (meaning you don't have to re-calculate each time) and would also then be able to be combined and filtered in different ways. Unfortunately this is a lot of work as it messes quite deeply with the core structure of seqmonk and I just haven't had time to work on it.

For the time being therefore any subtraction would have to be done by exporting the lists from the heatmap view and then using an external script to match up the ids to find the common and different points.
simonandrews is offline   Reply With Quote
Old 05-12-2014, 08:53 AM   #263
edere
Junior Member
 
Location: US

Join Date: Feb 2014
Posts: 4
Default

I’ve been working with a mRRBS dataset and have a bit of time using both methylKit and Seqmonk. I’ve used Bismark to align my bisulfite sequence reads, sorted the outputted SAM files by genomic location and then loaded the data into R using the read.bismark function in methylKit and looking only at the CpG context.

Using the same SAM files, I used Bismark’s methylation extractor to generate the CpG context OT and OB methylation calls, and then loaded those files into Seqmonk.

When I randomly look at the methylRaw values can compare the locations of methylated or unmethylated cytosines and compare them with data loaded into Seqmonk, I’m finding that there are methylRaw values have data for locations that do not correspond with the data loaded into Seqmonk.

Has anyone else experienced this and/or have a reason why methylKit and Seqmonk don’t have the same results?


Thanks,
Ed
edere is offline   Reply With Quote
Old 05-12-2014, 08:59 AM   #264
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by edere View Post
When I randomly look at the methylRaw values can compare the locations of methylated or unmethylated cytosines and compare them with data loaded into Seqmonk, Iím finding that there are methylRaw values have data for locations that do not correspond with the data loaded into Seqmonk.
That's interesting - the seqmonk data should be pretty unambiguous since it will show all of the raw calls put out by the methylation extractor. Just to be sure have you done a simple read count in the seqmonk data track and confirmed that the counts for those regions really are 0? I know that internally we fixed a bug where some reads didn't show up, but I'm pretty sure that only affected the development version which isn't released.

There are plenty of options for how to turn a set of calls into methylation percentages (and there will be some new options in the next seqmonk release) but there shouldn't be any ambiguity over the presence/absence of reads.

Another sanity check would be to take the location of one of the disputed regions and then use something like samtools to extract the reads for that region from the BAM files produced by bismark. That would let you see if there should have been CpG calls in the region and if there were then we could track down at which subsequent point they disappeared and why.
simonandrews is offline   Reply With Quote
Old 05-14-2014, 01:13 PM   #265
edere
Junior Member
 
Location: US

Join Date: Feb 2014
Posts: 4
Default

Quote:
Originally Posted by simonandrews View Post
That's interesting - the seqmonk data should be pretty unambiguous since it will show all of the raw calls put out by the methylation extractor. Just to be sure have you done a simple read count in the seqmonk data track and confirmed that the counts for those regions really are 0? I know that internally we fixed a bug where some reads didn't show up, but I'm pretty sure that only affected the development version which isn't released.

There are plenty of options for how to turn a set of calls into methylation percentages (and there will be some new options in the next seqmonk release) but there shouldn't be any ambiguity over the presence/absence of reads.

Another sanity check would be to take the location of one of the disputed regions and then use something like samtools to extract the reads for that region from the BAM files produced by bismark. That would let you see if there should have been CpG calls in the region and if there were then we could track down at which subsequent point they disappeared and why.
Hi Simon,

I managed to track down the problem. It was something that I did that I forgot to take into account when comparing the analysis methods. I forgot that I used the --ignore argument when running the methylation extraction.

When I ran the methylation extraction again without the --ignore option and compared the calls in Seqmonk and methylkit, everything matched.

Thanks for your suggestions though.

Ed
edere is offline   Reply With Quote
Old 06-13-2014, 09:40 AM   #266
kshankar
Member
 
Location: Little Rock AR

Join Date: Jul 2010
Posts: 12
Default Importing Bismark txt file into Seqmonk

I had a quick question about importing Bismark methylation extractor output into SeqMonk (as in Under File -> Import Data -> Bismark). Is this feature not included in version 27 anymore? I am sure we have used this feature in our previous publications with earlier versions. Is there a workaround?

thanks
--Kartik
kshankar is offline   Reply With Quote
Old 06-13-2014, 12:56 PM   #267
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 620
Default

Quote:
Originally Posted by kshankar View Post
I had a quick question about importing Bismark methylation extractor output into SeqMonk (as in Under File -> Import Data -> Bismark). Is this feature not included in version 27 anymore? I am sure we have used this feature in our previous publications with earlier versions. Is there a workaround?

thanks
--Kartik
The File -> Import Data -> Bismark import only ever worked with the vanilla Bismark output (pre-SAM/BAM). Files from the methylation extractor may be imported via the Generic Text Import filter. Once there, select col 3 as chromosome, col 4 as both start and end positions, and col 2 as strand (= methylation state).
fkrueger is offline   Reply With Quote
Old 10-14-2014, 06:17 PM   #268
kentawan
Member
 
Location: Singapore

Join Date: Apr 2014
Posts: 14
Default

Hi all, are there any options that forces SeqMonk to use multiple processor cores?

I was running a intensity difference filter over a methylation call datasets from bismark. I was comparing two different individuals with a genome size of approx 1Gb. The entire comparison took two weeks and it gave me zero hits, haha. Just want to trial and error on this software so I was wondering any ways to force the analysis to use up more resources so that it is faster?

Also, is this the correct way of doing differentially methylated region analysis on SeqMonk?

All guidance and criticisms are highly welcomed! Newbie here just transitioning between wet and dry lab.

BTW, my computing cluster is allocated 8 cores of E5405@2.00GHz. Total memory was 31.4Gb.
kentawan is offline   Reply With Quote
Old 10-15-2014, 01:53 AM   #269
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by kentawan View Post
Hi all, are there any options that forces SeqMonk to use multiple processor cores?

I was running a intensity difference filter over a methylation call datasets from bismark. I was comparing two different individuals with a genome size of approx 1Gb. The entire comparison took two weeks and it gave me zero hits, haha. Just want to trial and error on this software so I was wondering any ways to force the analysis to use up more resources so that it is faster?

Also, is this the correct way of doing differentially methylated region analysis on SeqMonk?
Wow! I'm kind of impressed you waited that long for it to finish, but unfortunately, as you found, it wasn't really an appropriate test to use anyway, and if it took that long to run I guess you were using a huge number of probes, so even if you had found things which were somewhat significant they would have been extremely unlikely to survive multiple testing correction for that many tests.

The intensity difference filter is only useful in a case where:
  1. You have a measure where the size of the measure and the noise are correlated - generally anything which is a direct manipulation of a sequence count
  2. You expect that the majority of the data will not be changing such that it makes sense to look for outliers after creating a model over the rest of the data.
..and unfortunately neither of these really applies to methlyation data.

There are a few different ways to look at methlyation data depending on what you're looking for, but a typical recipe we'd use would be something like this:
  1. Decide where you're going to measure. It almost never makes sense to analyse individual cytosines as the coverage for each will generally be poor and differences will never be significant. If you're targeting something specific like gene bodies, promoters or CpG islands then put probes over those, if not then tile probes of an appropriate size (our default is 3-5kb if we have reasonably good coverage - go larger if your coverage isn't great). Use whichever probe generator is appropriate to make probes over these regions.
  2. Quantitate your data using the bisulphite quantitation pipeline using the probes you made in the step above. I'd suggest setting the minimum coverage depth and the measures per feature both to 1 to start with (this isn't the default at the moment, but it will be in the next release)
  3. Decide on the minimum absolute methylation change you're prepared to care about. In well measured regions you might find that a change of 1% or less will be significant, but you want to consider the biological context and whether this is likely to mean anything. We'd probably normally start looking at a minimum 5-10% change.
  4. Run a contingency based test (Filtering > Statistical Test > Chi Square > For/Rev) and select the two datasets you want to compare. Set the minimum difference to the value you selected in the last step. This will give you an initial hit list.
  5. Select the hit list you got from the contingency test and then filter this using your current quantitation using the differences filter (Filter > Value difference > Individual probes). Select both of your datasets in both option lists and set the minimum difference to be the cutoff you selected before.

You should now have a list of probes which show a methylation change which is both statistically and biologically significant. If you did tiled probes then the next step would be to try to relate the positions you found to biological features to try to understand why they were selected.

There are many other ways to go about this, but hopefully this will at least get you started.
simonandrews is offline   Reply With Quote
Old 10-19-2014, 08:07 PM   #270
kentawan
Member
 
Location: Singapore

Join Date: Apr 2014
Posts: 14
Default

Quote:
Originally Posted by simonandrews View Post
Wow! I'm kind of impressed you waited that long for it to finish, but unfortunately, as you found, it wasn't really an appropriate test to use anyway, and if it took that long to run I guess you were using a huge number of probes, so even if you had found things which were somewhat significant they would have been extremely unlikely to survive multiple testing correction for that many tests.

The intensity difference filter is only useful in a case where:
  1. You have a measure where the size of the measure and the noise are correlated - generally anything which is a direct manipulation of a sequence count
  2. You expect that the majority of the data will not be changing such that it makes sense to look for outliers after creating a model over the rest of the data.
..and unfortunately neither of these really applies to methlyation data.

There are a few different ways to look at methlyation data depending on what you're looking for, but a typical recipe we'd use would be something like this:
  1. Decide where you're going to measure. It almost never makes sense to analyse individual cytosines as the coverage for each will generally be poor and differences will never be significant. If you're targeting something specific like gene bodies, promoters or CpG islands then put probes over those, if not then tile probes of an appropriate size (our default is 3-5kb if we have reasonably good coverage - go larger if your coverage isn't great). Use whichever probe generator is appropriate to make probes over these regions.
  2. Quantitate your data using the bisulphite quantitation pipeline using the probes you made in the step above. I'd suggest setting the minimum coverage depth and the measures per feature both to 1 to start with (this isn't the default at the moment, but it will be in the next release)
  3. Decide on the minimum absolute methylation change you're prepared to care about. In well measured regions you might find that a change of 1% or less will be significant, but you want to consider the biological context and whether this is likely to mean anything. We'd probably normally start looking at a minimum 5-10% change.
  4. Run a contingency based test (Filtering > Statistical Test > Chi Square > For/Rev) and select the two datasets you want to compare. Set the minimum difference to the value you selected in the last step. This will give you an initial hit list.
  5. Select the hit list you got from the contingency test and then filter this using your current quantitation using the differences filter (Filter > Value difference > Individual probes). Select both of your datasets in both option lists and set the minimum difference to be the cutoff you selected before.

You should now have a list of probes which show a methylation change which is both statistically and biologically significant. If you did tiled probes then the next step would be to try to relate the positions you found to biological features to try to understand why they were selected.

There are many other ways to go about this, but hopefully this will at least get you started.
Thank you very much Simon for the headstart guide. I made significant progress with my data analysis over the week.

Just one question though, what does the output value of the DNA bisulphite quantitation pipeline actually means? Does higher value means higher methylation counts on a specific probes or does it mean higher percentage of methylation on that basepair out of all calls (methylated + non-methylated)? My input files are actually CpG methylation call files from bismark. From my understanding, the output files of bismark CpG calles are all 1bp long, where the + strand reads are methylated C while - strand reads are unmethylated c, am I right?

Thanks and regards, hope to hear from you soon.
kentawan is offline   Reply With Quote
Old 10-20-2014, 12:59 AM   #271
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by kentawan View Post
Thank you very much Simon for the headstart guide. I made significant progress with my data analysis over the week.

Just one question though, what does the output value of the DNA bisulphite quantitation pipeline actually means? Does higher value means higher methylation counts on a specific probes or does it mean higher percentage of methylation on that basepair out of all calls (methylated + non-methylated)? My input files are actually CpG methylation call files from bismark. From my understanding, the output files of bismark CpG calles are all 1bp long, where the + strand reads are methylated C while - strand reads are unmethylated c, am I right?

Thanks and regards, hope to hear from you soon.
The output of the bisulphite pipeline is percentage methylation. The only differences from the pipeline and a simple percent for/all is that the pipeline allows you to remove Cs with very low coverage, and then gives each C equal weight when calculating the overall percentage for the region.

You're correct that the output of the methylation extractor in bismark are 1bp long methylation calls, where the strand quoted in the file actually represents the methylation state and not the strand of origin (which can be determined by looking at the file name, OT and OB strands are reported separately, but we usually recombine them for analysis).

Simon.
simonandrews is offline   Reply With Quote
Old 10-23-2014, 08:21 AM   #272
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

I've just released a new version of seqmonk (v0.28). This is now available from the project site.

The new release makes a fairly large change in the way bisulphite data is handled by removing the old flag values, and allowing for individual probes to be listed as not having a value.

We've also added some new QC plots for RNA-Seq and Small RNA data which will be useful in the early stage assessment of new data. There are also a few new options for how data is displayed in the chromosome view.

We've added some new options and modules to both the probe generators and the quantitation methods, and there is a new "proportion of library" statistical filter which will be of use to some people.

A full list of the changes can be found in the release notes which along with the software itself is available from the project web site.

Please have a play with the new version and let me know if you hit any problems.
simonandrews is offline   Reply With Quote
Old 11-24-2014, 10:06 PM   #273
kentawan
Member
 
Location: Singapore

Join Date: Apr 2014
Posts: 14
Default

Hi Simon, I ran the RNAseq Quantification Pipeline with these descriptions. Log value is on. "Transcript features over mRNA. Quantitated with RNA-Seq pipeline quantitation counting reads over exons. Log transformed. Assuming a Non-strand specific library."

What does the heatmap coloured peaks represents, especially the value of the peaks? Does it mean a relative abundance calculation of RNA expressed?

Thanks in advance and I hope to hear from you soon.

regards,

Ziyi
kentawan is offline   Reply With Quote
Old 11-25-2014, 01:06 AM   #274
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by kentawan View Post
Hi Simon, I ran the RNAseq Quantification Pipeline with these descriptions. Log value is on. "Transcript features over mRNA. Quantitated with RNA-Seq pipeline quantitation counting reads over exons. Log transformed. Assuming a Non-strand specific library."

What does the heatmap coloured peaks represents, especially the value of the peaks? Does it mean a relative abundance calculation of RNA expressed?

Thanks in advance and I hope to hear from you soon.
Yes, in effect it's a relative abundance measure, it's actually a log transformed normalised read count. With the default settings you're getting values which are comparable for the same gene across different sample, but aren't comparable for different genes within the same sample, which is what people normally want.
simonandrews is offline   Reply With Quote
Old 12-09-2014, 06:49 AM   #275
m.olsson
Junior Member
 
Location: Gothenburg, Sweden

Join Date: Jun 2013
Posts: 1
Default

Hi Simon,
I'm using Seqmonk v0.27.0 to visualize the methylation level in individual CpGs in three individual data sets. The genomic distance relative to the reference genome, and thus also between the samples, appear to be offset by a couple of bases (possibly due to indels?). This, of course, makes Seqmonk "call" CpGs where there should not be any, when quantifying. Can I avoid this in some way..? (See attached image)

Thanks,
Martina
Attached Images
File Type: png forum question.PNG (19.7 KB, 7 views)
m.olsson is offline   Reply With Quote
Old 12-12-2014, 12:55 PM   #276
ctstackh
Junior Member
 
Location: Alabama

Join Date: Dec 2013
Posts: 7
Default

Simon or anyone else,

I am trying to create and quantitate probes between two features which I have already defined (Gene End and CDS End ie the 3' UTR). How can I do this in Seqmonk?

Best,
Christian
ctstackh is offline   Reply With Quote
Old 12-13-2014, 07:58 AM   #277
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by ctstackh View Post
Simon or anyone else,

I am trying to create and quantitate probes between two features which I have already defined (Gene End and CDS End ie the 3' UTR). How can I do this in Seqmonk?

Best,
Christian
Hi Christian,

I had a think about this, but couldn't come up with a way to easily do this directly in SeqMonk. If it was me doing this I'd probably export the two tracks you already have, work out the differences outside the program (you could probably do it easily enough in Excel), and then import that back through the generic text annotation import.

If you're stuck with this then feel free to send me the coordinates of the tracks you have and I'll make up a file you can re-import directly.

Simon.
simonandrews is offline   Reply With Quote
Old 12-17-2014, 03:07 AM   #278
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

We have just released seqmonk v0.29.0 onto the project web site.

This release adds a bunch of new features. Many of these are improvements to the chromosome view, specifically targeted at studies with large numbers of samples (we're getting a lot of single cell datasets in these days). You can now display your quantitated data in some new ways, and can also look at the variability in the quantitations shown when you're displaying a replicate set.

We've also added some new probe generation and quantitation options which have proved to be useful for projects we've worked on recently.

A related release which prompted some of the new features is that we have also put up the documentation for our methylation analysis course. We'll be running this fairly regularly starting in the new year, but all of the material for the course is available for anyone who wants to look. The course isn't solely focussed on seqmonk for the visualisation and analysis, but this does make up the majority of the practicals, so anyone wanting to use seqmonk to look at methylation data might want to take a look at the material we've put up.
simonandrews is offline   Reply With Quote
Old 12-17-2014, 03:12 AM   #279
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by m.olsson View Post
Hi Simon,
I'm using Seqmonk v0.27.0 to visualize the methylation level in individual CpGs in three individual data sets. The genomic distance relative to the reference genome, and thus also between the samples, appear to be offset by a couple of bases (possibly due to indels?). This, of course, makes Seqmonk "call" CpGs where there should not be any, when quantifying. Can I avoid this in some way..? (See attached image)

Thanks,
Martina
Sorry Martina - I missed this post.

Where did you get the data for the data which you're importing into seqmonk? Normally for bs-seq applications the data you import would be individual methylation calls, so the positions seqmonk would read would be the actual bases of the calls, so there wouldn't be any chance to get the positioning wrong. We would normally use the output of the bismark methylation extractor as input into seqmonk, and we've never seen an issue like this.

Have you looked in the genome to see whether the positions you're seeing are CpGs or not? The other possibility would simply be that the different samples are seeing differnet subsets of the genome. If you're really getting non-C positions imported then you'd need to go back to the imported file and check whether the positions there were wrong. If you're really seeing seqmonk change positions from the imported data to the project file then that would obviously be a bug and we can look at it, but that would seem to be unlikely with data as simple as this.

If you can reduce this problem to a small dataset which illustrates the error then I'd be happy to take a look at it for you.
simonandrews is offline   Reply With Quote
Old 12-17-2014, 09:10 AM   #280
ctstackh
Junior Member
 
Location: Alabama

Join Date: Dec 2013
Posts: 7
Default Thanks!

Quote:
Originally Posted by simonandrews View Post
Hi Christian,

I had a think about this, but couldn't come up with a way to easily do this directly in SeqMonk. If it was me doing this I'd probably export the two tracks you already have, work out the differences outside the program (you could probably do it easily enough in Excel), and then import that back through the generic text annotation import.

If you're stuck with this then feel free to send me the coordinates of the tracks you have and I'll make up a file you can re-import directly.

Simon.
Thank you for the suggestion! I was able to export the necessary tracks, edit, and save them as bed files. I then used the interval and subtract functions in Galaxy to isolate the regions of interest. Next step will be to import into Seqmonk for quantitation.

Just thought I would post my solution in case anyone was interested or has a similar problem to solve.

Best,
Christian
ctstackh is offline   Reply With Quote
Reply

Tags
analysis, desktop, seqmonk, visualization

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:40 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO