**rmorey** · 07-13-2012, 03:54 PM

Out of Memory Issues

I have 8 GB of RAM allocated to SeqMonk and, although it takes a while, I am able to import all of my methylation extractor data and generate probes using the contig probe generator with a depth of one. However, when I try to quantitate the probes I get an out-of-memory error. I have three samples that I would like to analyze together, but I have also tried quantitating just one sample and still ran out of memory. Any suggestions?

**fkrueger** · 07-14-2012, 12:10 AM

Designing probes over every single C is a very memory-intensive way of going about the analysis. I don't know the exact figures, but all CpGs (top and bottom strand) in a human genome may well come to around 50 million probes (and the number would be a lot higher for non-CG context), so one could easily imagine that 8 GB is not enough for this many calculations.

What we normally use to compare different samples is the Bisulfite Quantitation pipeline over features. This lets you select the genomic features you are interested in (basically any type of annotation or even the current probe set), e.g. CGIs, promoters, gene bodies etc., and calculates a percentage methylation for each feature. While doing so you may select a coverage threshold for a position to be considered, or a minimum number of observations per feature.

To get an overall idea about your different datasets you should be able to design 1 kb tiles over the entire genome, then quantify using appropriate coverage and observations per 1 kb tile, e.g. 5 methylation counts per C and 10 Cs per tile, and then you can start looking at differences using appropriate filtering options.
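To make the thresholds in the example above concrete, here is a minimal Python sketch of tile-based quantitation done outside SeqMonk. It is not SeqMonk's implementation: the function name, the input layout `(chrom, pos, meth_count, unmeth_count)` and the choice to average per-cytosine percentages are all assumptions made for illustration.

```python
from collections import defaultdict

TILE_SIZE = 1000      # 1 kb tiles
MIN_CALLS_PER_C = 5   # minimum methylation calls for a cytosine to be considered
MIN_CS_PER_TILE = 10  # minimum qualifying cytosines for a tile to be reported


def tile_methylation(cytosines, tile_size=TILE_SIZE,
                     min_calls=MIN_CALLS_PER_C, min_cs=MIN_CS_PER_TILE):
    """Return {(chrom, tile_start): percent methylation} for qualifying tiles.

    `cytosines` is an iterable of (chrom, pos, meth_count, unmeth_count)
    with 1-based positions (a hypothetical input layout).
    """
    per_tile = defaultdict(list)
    for chrom, pos, meth, unmeth in cytosines:
        total = meth + unmeth
        if total < min_calls:
            continue  # not enough calls at this cytosine
        tile_start = (pos - 1) // tile_size * tile_size + 1  # 1-based tile start
        per_tile[(chrom, tile_start)].append(100.0 * meth / total)

    # Keep only tiles with enough qualifying cytosines; average their percentages.
    return {
        tile: sum(values) / len(values)
        for tile, values in per_tile.items()
        if len(values) >= min_cs
    }
```

Tiles that fail either threshold are simply absent from the result, so when comparing samples you would normally restrict the comparison to tiles present in all of them.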

**JacquesWvdH** · 08-14-2012, 06:07 AM

Hi Simon,
Thanks for making SeqMonk available. I would like to use it for Bismark output. I am using the test data that came with Bismark.
There is an item in the import data menu for Bismark files, but no data tracks show, just errors.
I can import the BAM file that is there, but that does not show the methylation of the Cs that I was hoping to be able to view with SeqMonk.
I have seen the instruction videos on YouTube (great and helpful), but nothing about methylation data.
Looking forward to your answer,
Jacques

**fkrueger** · 08-14-2012, 08:28 AM

Dear Jacques,

SeqMonk is currently only able to import methylation information directly if Bismark was run in '--vanilla' mode. SAM files need to be run through the Bismark methylation extractor and can then be imported into SeqMonk using the Generic Text Import option.

Enabling methylation import directly from SAM/BAM files has been on Simon's to-do list for a while, but unfortunately there are quite a few other things on it, too.

**honey** · 10-15-2012, 04:47 PM

Custom genome in SeqMonk

I have a bacterial genome that is well documented at NCBI, meaning it has a FASTA file as well as annotation. However, this particular organism is not available in SeqMonk. Can I use a custom annotation for my analysis?

Thanks

**mathew** · 10-20-2012, 06:47 PM

Transcript expression in SeqMonk

Is it possible to calculate differential expression of transcripts with SeqMonk? I see exon, UTR and other feature selections for probes, but not transcript. Thanks for your help.

**simonandrews** · 10-22-2012, 02:44 AM

Yes - I've been meaning to put up a new instruction video for this on our YouTube site, but there are now reasonably good tools to do differential expression analysis in SeqMonk. Most of the details for this can be found in our Advanced Course Manual, which actually focuses on the analysis of RNA-Seq data, but the general pipeline is:

1. Select a set of transcripts you want to analyse and make these into an annotation track (or use the whole of the mRNA feature track).
2. Use the RPKM quantitation pipeline to quantitate your data using default parameters. I need to change the name of this since it doesn't actually do RPKM calculations by default (more like LRPM).
3. Check the normalisation of your samples and use the available tools (percentile distribution normalisation, match distributions normalisation) to correct any significant differences if required.
4. Make replicate sets from any groups of biological replicates you have (optional if you don't have them).
5. Use the intensity difference filter to find a set of differentially expressed transcripts.
6. If you have biological replicates, use the replicate set stats filter to remove variable transcripts from the initial candidate list.
7. Use the deduplication filter to select the most significantly changing transcript for each gene (optional).
8. If you have more than two conditions, use the hierarchical clustering plot to separate your hits into related clusters.
9. Report your list of hits and proceed to other downstream analyses.
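As a rough illustration of the quantitation step in the list above, the following Python sketch computes a log2 reads-per-million value per transcript and ranks transcripts by the difference between two samples. It assumes that "LRPM" means log reads per million, which the post does not spell out. The function names and the pseudocount are hypothetical, and the plain difference used here is only a stand-in: it is not the statistical test behind SeqMonk's intensity difference filter.

```python
import math


def log2_rpm(counts, pseudocount=0.5):
    """Return {transcript: log2 reads per million} for one sample.

    `counts` maps transcript name to raw read count. The pseudocount
    (an assumption) avoids taking the log of zero.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {
        name: math.log2((count + pseudocount) * 1e6 / total)
        for name, count in counts.items()
    }


def rank_by_difference(sample_a, sample_b):
    """Rank transcripts present in both samples by |log2 RPM difference|."""
    a = log2_rpm(sample_a)
    b = log2_rpm(sample_b)
    diffs = {name: b[name] - a[name] for name in a.keys() & b.keys()}
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)
```

Because the values are scaled by each sample's total read count, a transcript whose raw count is unchanged can still show a difference if the library sizes differ, which is one reason the normalisation check in step 3 matters.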

We've been using this methodology for a while now and it seems to be pretty robust for us.

**krespim** · 10-25-2012, 01:39 AM

Originally posted by simonandrews

If you're interested in looking at alternative splicing and haven't seen this already, a really neat option is to import just the spliced introns into your project. If you have a splice-mapped SAM/BAM file (e.g. from TopHat), import it and select "Split Spliced Reads" and "Import Introns rather than exons", and you'll see just the splices which you've observed. You can quantitatively analyse these by using the Read Position Probe Generator followed by the Exact Overlap Count Quantitation. We've found this way of looking at the data to be really helpful in deciding if there really is a change in the splicing pattern between samples.

This might be a silly question, but does this mean that one could look at and identify retained introns using SeqMonk?
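For readers who want to see what "importing introns rather than exons" amounts to, here is a small Python sketch that extracts the spliced-out intervals from a SAM alignment's CIGAR string and counts how often each one is observed. It shows the general idea only, not SeqMonk's importer; the function names and the `(chrom, pos, cigar)` input layout are assumptions.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def introns_from_alignment(pos, cigar):
    """Return 1-based inclusive (start, end) intervals for each N operation.

    `pos` is the 1-based leftmost mapping position (the SAM POS field).
    """
    introns = []
    ref_pos = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((ref_pos, ref_pos + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref_pos += length
    return introns


def count_junctions(alignments):
    """Count how many alignments support each (chrom, start, end) intron.

    `alignments` is an iterable of (chrom, pos, cigar) tuples.
    """
    counts = Counter()
    for chrom, pos, cigar in alignments:
        for start, end in introns_from_alignment(pos, cigar):
            counts[(chrom, start, end)] += 1
    return counts
```

Counting exact intervals in this way is comparable in spirit to an exact overlap count: two reads support the same junction only if both intron boundaries agree.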

**simonandrews** · 10-25-2012, 01:45 AM

Yes, you could specifically quantitate over introns (even excluding any overlapping exons) and see whether the reads in them give significant levels of coverage. There are no pipelines set up for this type of analysis, but you should be able to build something suitable from the generic sets of tools which are available.
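The idea of quantitating over introns while excluding overlapping exons can be sketched in Python as follows. This is a hypothetical illustration rather than anything SeqMonk provides: the function names are made up, and intervals are assumed to be 1-based inclusive `(start, end)` tuples on a single chromosome.

```python
def subtract_intervals(region, blockers):
    """Return the parts of `region` not covered by any interval in `blockers`."""
    start, end = region
    pieces = []
    cursor = start
    for b_start, b_end in sorted(blockers):
        if b_end < cursor or b_start > end:
            continue  # blocker does not touch the remaining region
        if b_start > cursor:
            pieces.append((cursor, b_start - 1))
        cursor = max(cursor, b_end + 1)
        if cursor > end:
            break
    if cursor <= end:
        pieces.append((cursor, end))
    return pieces


def count_overlapping_reads(pieces, reads):
    """Count reads (start, end) that overlap at least one remaining piece."""
    return sum(
        1
        for r_start, r_end in reads
        if any(r_start <= p_end and r_end >= p_start for p_start, p_end in pieces)
    )
```

A read count on its own says little about retention; you would still need to compare it with coverage in the flanking exons and across samples before calling an intron retained.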

**sschmidt** · 11-16-2012, 07:18 AM

Hi Simon,
I'm currently using SeqMonk to analyze a set of RNA-seq libraries with read depths ranging from 12-100 million. I've used the RPKM pipeline and the cumulative distribution plot lines are close, but still parallel to each other. How different should these lines be before doing further normalization, and which normalization would work the best for datasets with this difference in total reads? Thanks!

**simonandrews** · 11-16-2012, 08:34 AM

After the RPKM pipeline (which will have its name changed in the next release given that by default it doesn't do RPKM calculations!), you should look at the cumulative distribution plot and see how your samples look. If they're well overlaid then you can stop there. If they're different then you'd normally run the percentile normalisation quantitation method to get them to match up.

Within the percentile normalisation you have the choice to match your datasets by either adding or multiplying by a factor to get the data to match. If your samples have a small number of highly differential regions then you can end up with profiles which fall parallel to each other. In this case you'd use the 'add' option to make them line up. If your samples have differing degrees of duplication in them then you can get profiles where the degree of separation is proportional to the read count (the lines get further apart through the plot). In this case you'd use the 'multiply' option.

If you're not sure you can always try both and see which works best.
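The difference between the two options can be seen in a small Python sketch. It is not SeqMonk's code: the nearest-rank percentile, the choice of the mean percentile value as the common target, and all names are assumptions for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of `values` (pct between 0 and 100)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def normalise_to_percentile(samples, pct=75.0, mode="add"):
    """Align each sample's pct-th percentile to the mean across samples.

    mode="add" shifts every value by a constant (for parallel profiles);
    mode="multiply" scales every value by a constant (for profiles whose
    separation grows with the value).
    """
    points = {name: percentile(vals, pct) for name, vals in samples.items()}
    target = sum(points.values()) / len(points)
    result = {}
    for name, vals in samples.items():
        if mode == "add":
            offset = target - points[name]
            result[name] = [v + offset for v in vals]
        elif mode == "multiply":
            if points[name] == 0:
                raise ValueError("cannot scale a sample whose percentile is 0")
            factor = target / points[name]
            result[name] = [v * factor for v in vals]
        else:
            raise ValueError("mode must be 'add' or 'multiply'")
    return result
```

With two samples that differ by a constant offset, the 'add' mode makes them coincide; with two samples that differ by a constant ratio, the 'multiply' mode does. Trying both and re-plotting the cumulative distributions is a quick way to see which model fits your data.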

Hope this helps

Simon.

**shadow19c** · 11-20-2012, 08:18 AM

Hello,
I'm working with bisulfite data. After mapping with Bismark and running the methylation extraction, I have 5 files: CHG, CG, CHH, bedGraph and the genome-wide report.
I tried to load the data into SeqMonk using the Text (Generic) import option.

I created one file per chromosome because the genome-wide report file is really huge...

I want to know how to visualize the peaks (the number of methylation calls per position). I tried the probe window generator with 20 bp windows, but I did not see any difference between windows...

So if someone has an idea how to analyse or visualize this kind of data...

Thanks

Mohamed

**simonandrews** · 11-21-2012, 12:58 AM

Firstly I'd have a think about exactly which contexts you care about in your data. CHH and CHG are normally huge datasets, and if you're only really interested in CpG then just working with that should make your life much easier. On a reasonable PC you should be able to load a whole CpG dataset fairly easily. You'll need a reasonable amount of RAM to do a whole CHH/CHG dataset.

Once you've loaded the data the way to calculate methylation levels is to use one of the quantitation pipelines (Data > Quantitation Pipelines), specifically the 'Bisulphite Methylation over feature' pipeline. This gives you several options to calculate an overall methylation value for each one of a class of features (CpG islands, promoters, exons, genes, whatever...). Once you have the methylation values you can start to compare these between datasets. Exactly how you do this will depend on the datasets you have and what it is you want to look for.

Hopefully this is enough to get you started.

**fkrueger** · 11-21-2012, 01:17 AM

Further to what Simon wrote, you can't import the genome wide cytosine report as this basically has only one entry per cytosine. This means that the entire genome has a coverage of only a single read which is probably why you don't see any differences. SeqMonk regards every line of an input file as an extra read, which is why you have to import the output of the methylation extractor (CpG_*, CHG_* or CHH_*) and not the cytosine report.
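To illustrate why the extractor output carries coverage while the cytosine report does not, here is a short Python sketch that tallies calls per position from methylation extractor lines. It assumes the usual tab-separated layout of read ID, methylation state (`+` or `-`), chromosome, position and call letter, preceded by a header line starting with "Bismark"; check this against your own files. The function name is hypothetical.

```python
from collections import defaultdict


def count_calls(lines):
    """Return {(chrom, pos): (methylated_calls, unmethylated_calls)}.

    Each input line is one call from one read, so the per-position
    totals correspond to the coverage at that cytosine.
    """
    counts = defaultdict(lambda: [0, 0])
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("Bismark"):
            continue  # skip blank lines and the version header
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # not a call line in the assumed layout
        _read_id, state, chrom, pos, _call = fields[:5]
        if state == "+":
            counts[(chrom, int(pos))][0] += 1
        elif state == "-":
            counts[(chrom, int(pos))][1] += 1
    return {key: tuple(value) for key, value in counts.items()}
```

In practice you would pass an open file handle for one of the CpG_*, CHG_* or CHH_* files as `lines`. The cytosine report, by contrast, already contains one summarised line per position, which is why importing it line by line looks like single-read coverage.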

**Neuromancer** · 11-26-2012, 06:26 AM

SE vs PE counting

Hey all,

How does SeqMonk count paired-end reads? Is each pair only counted once (or once per gene?), or is each read counted individually? In either case, is there any way to switch between these two modes?

And if you allow a second question: paired-end reads are shown as "complement" on mouse-over, but is there also a way to tell which reads belong together, maybe similar to how duplicate reads are visualized (they turn green on mouse-over)?

The reason I ask is that I have mixed samples from SE and PE sequencing and want to compare raw count numbers (i.e. I use SeqMonk for preparing count tables for DESeq).

Thanks,

Roman
