SEQanswers

Old 03-11-2011, 10:51 AM   #21
ttnguyen
Member
 
Location: Ireland

Join Date: Mar 2010
Posts: 41
Default

Is it true that Ensembl have several versions of GRCh37 and that the latest is now GRCh37.61?
I've just checked the difference between GRCh37.61 and hg19 and found that there are some differences in chromosomes Y and MT.

I am wondering whether I can create a 'genome' from a GTF file plus chromosome lengths, so that I can use different sources of gene annotation?
Old 03-11-2011, 11:17 AM   #22
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

Quote:
Originally Posted by ttnguyen View Post
Is it true that Ensembl have several versions of GRCh37 and that the latest is now GRCh37.61?
I've just checked the difference between GRCh37.61 and hg19 and found that there are some differences in chromosomes Y and MT.

I am wondering whether I can create a 'genome' from a GTF file plus chromosome lengths, so that I can use different sources of gene annotation?
There are two things here, the genome assembly and the annotation set. UCSC don't really distinguish these - they just refer to hg18 or mm9 which are a combination of assembly and annotation. I presume they update their annotation sets through the life of an assembly but don't specifically advertise this in their nomenclature.

Ensembl specifically separate the two things, so their current human version is based on the GRCh37 assembly but with an annotation set from Ensembl release 61. The release numbers relate to Ensembl releases, which occur across all of their genomes, so sometimes a particular genome will get updated annotation, but often it won't. This means that every GRCh37 genome will have the same underlying sequence and there's no need to remap data between these releases.

As for the differences you saw between hg19 and GRCh37 - for the Y chromosome Ensembl mask out the pseudo-autosomal region which would otherwise produce a huge stretch of exact identity with ChrX. The coordinates which remain should match between the two assemblies and arguably you should be mapping against the masked version.

I'm not sure about the mitochondrial sequence - that may not be part of the official genome assembly in the first place so might differ slightly.
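The assembly/annotation split described above can be sketched in a few lines of Python. This is only an illustration: the helper names `parse_genome_label` and `needs_remapping` are hypothetical, and the assumption is that a label such as GRCh37.61 is always written as assembly name, a dot, then the Ensembl release number.

```python
import re
from typing import NamedTuple


class EnsemblGenome(NamedTuple):
    assembly: str  # underlying sequence, e.g. "GRCh37"
    release: int   # Ensembl annotation release, e.g. 61


def parse_genome_label(label: str) -> EnsemblGenome:
    """Split a label like 'GRCh37.61' into assembly and release (assumed format)."""
    match = re.fullmatch(r"(?P<assembly>.+)\.(?P<release>\d+)", label)
    if match is None:
        raise ValueError(f"Unrecognised genome label: {label!r}")
    return EnsemblGenome(match["assembly"], int(match["release"]))


def needs_remapping(old_label: str, new_label: str) -> bool:
    """Reads only need remapping when the assembly changes, not the annotation release."""
    return parse_genome_label(old_label).assembly != parse_genome_label(new_label).assembly
```

With this, moving between GRCh37.60 and GRCh37.61 would be reported as not needing a remap, because only the annotation release differs.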
Old 03-15-2011, 10:57 AM   #23
psabelli
Junior Member
 
Location: Arizona

Join Date: Feb 2011
Posts: 2
Default Thanks for the info

Quote:
Originally Posted by simonandrews View Post
I'm actually in the process of working with just this kind of data! SeqMonk wasn't really designed with this in mind, but you can make a pseudo genome out of shorter contigs where you concatenate them into groups of a few thousand. It's not ideal but if you want to have a go then I'm happy to share the code I've written for my job.

If it doesn't start at all then it's normally one of two things;
  1. Java isn't installed properly, or the java binary isn't in your path. Open a command prompt and type 'java -version'; if you get an error saying this isn't a recognised command then this is the problem.
  2. You don't have enough RAM in your machine to run the default configuration. SeqMonk ships with a configuration which assumes you have 2GB RAM. If you have less than that you can still run the program for smaller datasets but you'll need to change the memory settings.

If it's neither of these things then try starting seqmonk from a command prompt (move to the seqmonk directory and just run the bat file directly from the command line). It will still fail to launch but should leave a useful error in the window; if you post it I can see what's going wrong.

Thanks for your help Simon - I appreciate it. As far as the first issue (using cDNA reference sequences) is concerned, I might try building a pseudogenome as you suggested to do the alignment of the tags, and in that case I'll contact you again. However, right now mapping is a bit of a bottleneck for us, and if we are going to do a brand new mapping we might opt for using the genome instead.
On to the second issue: SeqMonk did not start in Windows, and I solved it by lowering the memory requirements to 1GB, as you indicated. (By the way, it seemed that by default my version of SeqMonk was set to 1.5GB of memory.) Although I have 4GB installed, the available memory is much less and I might need to free some for SeqMonk to run properly. Thanks. Paolo
Old 03-15-2011, 11:36 AM   #24
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

Quote:
Originally Posted by psabelli View Post
On to the second issue: SeqMonk did not start in Windows, and I solved it by lowering the memory requirements to 1GB, as you indicated. (By the way, it seemed that by default my version of SeqMonk was set to 1.5GB of memory.) Although I have 4GB installed, the available memory is much less and I might need to free some for SeqMonk to run properly. Thanks. Paolo
The 1.5GB in the config file is right for 2GB total usage. The config file specifies the amount of memory the program can use, but there is an overhead for the java virtual machine which runs the program - we reckon that 1.5GB of program memory ends up using around 2GB.

Glad you've managed to get everything running though.
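As a rough illustration of the overhead described above, the following Python sketch turns a total memory budget into a Java heap size. The 0.75 ratio is simply 1.5GB / 2GB from the estimate above, not a measured constant, and the helper names are hypothetical. `-Xmx` is the standard JVM option for maximum heap size; how the SeqMonk config file itself expresses the value is not shown here.

```python
def heap_for_budget(total_mb: int, heap_fraction: float = 0.75) -> int:
    """Return a heap size (MB) that leaves room for JVM overhead.

    heap_fraction=0.75 is an assumption taken from the rule of thumb
    that 1.5GB of program memory ends up using around 2GB in total.
    """
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    return int(total_mb * heap_fraction)


def xmx_flag(heap_mb: int) -> str:
    """Format the heap size as a JVM maximum-heap option."""
    return f"-Xmx{heap_mb}m"


if __name__ == "__main__":
    # A machine where about 2GB can be given to the whole Java process.
    print(xmx_flag(heap_for_budget(2048)))
```

On a machine where less memory is actually free, the same calculation can be repeated with the smaller budget.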
Old 05-20-2011, 08:13 AM   #25
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

SeqMonk v0.15.0 has just been released.

This release adds some new tools which are useful for the analysis of differential splicing. These are:
  1. An option to import introns from spliced SAM/BAM files
  2. A probe generator to put probes over every different read position in a dataset
  3. A quantitation method to count exact overlaps between probes and reads

Using this combination of tools you can get a count of the number of times a particular splice junction was used in a dataset, and can then use the existing tools to compare these counts between different datasets.
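For readers who want to see the idea outside the program, here is a minimal Python sketch of counting splice junction usage from spliced alignments. It is not SeqMonk's implementation: it assumes each alignment is given as (chromosome, 1-based SAM POS, CIGAR string), treats every `N` operation in the CIGAR as an intron, and counts exact matches of the intron coordinates. The function names are hypothetical.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")


def introns_from_alignment(chrom, pos, cigar):
    """Yield (chrom, start, end) for each N operation, 1-based inclusive."""
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            yield (chrom, ref, ref + length - 1)
        if op in REF_CONSUMING:
            ref += length


def count_junctions(alignments):
    """Count how many alignments use each exact intron."""
    counts = Counter()
    for chrom, pos, cigar in alignments:
        counts.update(introns_from_alignment(chrom, pos, cigar))
    return counts


if __name__ == "__main__":
    reads = [("chr1", 100, "10M50N20M"), ("chr1", 105, "5M50N25M")]
    for junction, n in count_junctions(reads).items():
        print(junction, n)
```

Running the same count on two datasets gives per-junction values that can then be compared between them.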

In addition other changes in this release are:
  • A change to the way empty probes are handled in log transformed quantitations
  • A new probe generator which can deduplicate and merge overlapping probes in an existing probe set
  • An option to import features in GFFv3 or GTF format
  • An option to create probe trends where each probe gets the same weight in the final trend plot
  • An option to zoom in on histogram plots

You can get the new release from the project website at:

http://www.bioinformatics.bbsrc.ac.uk/projects/seqmonk/

[If you don't see the option to download v0.15.0 press shift+refresh in your browser to force our cache to give you the latest version]
Old 06-17-2011, 05:35 AM   #26
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

SeqMonk v0.16.0 has just been released onto our website.

This adds a wrapper script for Linux users which makes it easy to launch the program and saves the bother of having to construct your own launch command.

Improvements have also been made to the probe trend plot and the percentile normalisation quantitation.

We've also (finally!) put in place a work round for the problem where paired end read data from tophat files could not be imported due to a missing field in their BAM/SAM files.

You can get SeqMonk from:

http://www.bioinformatics.bbsrc.ac.uk/projects/seqmonk/
Old 08-28-2011, 08:12 AM   #27
Neuromancer
Member
 
Location: Goettingen, Germany

Join Date: Aug 2011
Posts: 28
Default Seqmonk probe trend plots

I think Seqmonk is really great to visualize and analyze mapped read data, all in one package! (btw: are there any recent papers that used/cited Seqmonk for their analysis?)

I'm currently trying to make sense out of ChIP-seq data of a particular histone modification. Actually I have two different conditions with three replicates each that I want to compare in a quantitative way.

So here's my question to the community:
Is it feasible to make a quantitative statement between the two groups, in a sense like: "is there more (or less) of this histone modification in one or the other condition in a particular genomic region (like genes, promoters, etc.)?"
For example, using Seqmonk, I created probes over each gene (incl. their promoter) and looked at the probe trend plot. At the moment I'm a bit stuck on which of the probe trend plot types/options to use for this kind of question. I reasoned that the cumulative type is more suitable 'cause it has an option to correct for total read count (which is indispensable for a quantitative statement, I guess?). I tried to look at the Seqmonk help for these options, but have to admit that it confused me a little... (esp. the sentence "A relative distribution plot will weight each probe equally in the final profile, whereas the cumulative count plot will weight the probes according to the number of bases of read falling into each probe. The cumulative count plot is more susceptible to high read count outliers skewing your result, but will give you results in real read depths")
What is your advice?
Thanks very much in advance!
Old 09-12-2011, 03:23 AM   #28
Neuromancer
Member
 
Location: Goettingen, Germany

Join Date: Aug 2011
Posts: 28
Default

I just came across another problem: Is it possible to export imported data from Seqmonk back to BED or any other format (i.e. the format that it was imported from?)

Thx,

Nrmncr
Old 09-12-2011, 03:27 AM   #29
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

Quote:
Originally Posted by Neuromancer View Post
I just came across another problem: Is it possible to export imported data from Seqmonk back to BED or any other format (i.e. the format that it was imported from?)
You can export quantitated data from SeqMonk as BedGraph files and you can take tabulated data in the various reports which it offers.

There's no current option to export out raw data. SeqMonk only stores the sets of mapped positions (not the sequence or qualities), so it would only be possible to export out a limited set of data if we were to add that ability.

I suppose the question would be why you wanted to export the raw data back out of SeqMonk rather than just use the files which you imported from in the first place?
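For reference, a BedGraph file is just tab-separated text, so quantitated values can also be written out by hand. The sketch below is not SeqMonk's exporter; it assumes probes are available as (chromosome, 1-based start, inclusive end, value) tuples and converts them to BedGraph's 0-based, half-open coordinates. The function name is hypothetical.

```python
def to_bedgraph_lines(probes, track_line="track type=bedGraph"):
    """Convert (chrom, start, end, value) probes to BedGraph text lines.

    Input coordinates are assumed to be 1-based and inclusive;
    BedGraph uses 0-based starts with an exclusive end.
    """
    lines = [track_line]
    for chrom, start, end, value in probes:
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{value}")
    return lines


if __name__ == "__main__":
    # Illustrative values only, not real data.
    for line in to_bedgraph_lines([("chr1", 1, 100, 2.5), ("chr1", 101, 200, 0.0)]):
        print(line)
```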
Old 09-13-2011, 08:43 AM   #30
Neuromancer
Member
 
Location: Goettingen, Germany

Join Date: Aug 2011
Posts: 28
Default

Thanks, that is more or less what I needed. Thanks!

Any suggestions about my former question about the probe trend plot and the quantitation?
Old 09-14-2011, 12:52 AM   #31
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

If you're looking to compare modification enrichment in genomic features then there are a couple of ways to do this.

You could put probes over your feature of interest and then do an enrichment quantitation and compare either the means or the distributions between your two samples. This would tell you if one sample was more enriched than another on average. The problem with this approach is that you may well see overall differences in enrichment which come from technical effects (how well the ChIP worked) rather than biological. These effects should be global though, so you could, for example, compare enrichment in promoters vs exons.

Alternatively you could make a simpler comparison by simply counting the number of promoters which showed enrichment and then comparing values between your samples. In many cases a simple quantitation of corrected read counts will show a nice bimodal distribution where you can easily set a threshold to separate the enriched from non-enriched populations. You could then apply this to your two samples and compare the number of promoters which pass the filter. This might not work well if there isn't a clear distinction between enriched and non-enriched in your sample though.
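The threshold-based comparison described above can be sketched as follows. This is a minimal Python illustration, assuming the quantitated promoter values for each sample have been exported as plain lists of numbers; the threshold and the example values are made up, and the function names are hypothetical.

```python
def count_enriched(values, threshold):
    """Number of probes whose quantitated value exceeds the threshold."""
    return sum(1 for v in values if v > threshold)


def compare_enriched(sample_a, sample_b, threshold):
    """Count enriched probes in each sample using the same threshold."""
    if not sample_a or not sample_b:
        raise ValueError("both samples need at least one value")
    n_a = count_enriched(sample_a, threshold)
    n_b = count_enriched(sample_b, threshold)
    return {
        "enriched_a": n_a,
        "enriched_b": n_b,
        "fraction_a": n_a / len(sample_a),
        "fraction_b": n_b / len(sample_b),
    }


if __name__ == "__main__":
    # Illustrative values only: low values are background, high values enriched.
    a = [0.1, 0.2, 5.0, 6.1]
    b = [0.1, 4.9, 5.5, 6.0]
    print(compare_enriched(a, b, threshold=2.0))
```

The threshold should be chosen by looking at the distribution of values first; if the enriched and non-enriched populations overlap, the counts will be sensitive to where the cut is placed.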

The probe trend plot probably isn't best suited to this kind of analysis. Its strength is in showing the pattern of enrichment to see if that changes, rather than judging the strength of enrichment which is normally better handled by the conventional quantitation tools. If you do want to use the trend plot to do this then you will need to use the cumulative distribution plot, but beware that (as the docs you quoted state), this is susceptible to bias from extreme outliers since it just sums the counts across all probes and makes no distinction between them in the final plot.
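To make the difference between the two trend plot types concrete, here is a small Python sketch. It is a simplified reading of the documentation sentence quoted earlier in the thread, not SeqMonk's actual code: each probe is assumed to be a list of read counts over the same number of bins, the cumulative trend sums raw counts across probes, and the relative trend first scales each probe to sum to 1 so that every probe carries equal weight. The function names are hypothetical.

```python
def cumulative_trend(probes):
    """Sum raw counts per bin across all probes (outliers dominate)."""
    return [sum(bin_counts) for bin_counts in zip(*probes)]


def relative_trend(probes):
    """Average per-probe proportions so each probe has equal weight."""
    scaled = []
    for probe in probes:
        total = sum(probe)
        if total == 0:
            continue  # an empty probe has no shape to contribute
        scaled.append([count / total for count in probe])
    if not scaled:
        raise ValueError("no probe contains any reads")
    return [sum(bin_values) / len(scaled) for bin_values in zip(*scaled)]


if __name__ == "__main__":
    # One ordinary probe and one extreme outlier (illustrative values only).
    probes = [[1, 1, 2], [100, 0, 0]]
    print(cumulative_trend(probes))
    print(relative_trend(probes))
```

In the example the outlier probe determines almost the whole cumulative profile, while in the relative profile it counts no more than the ordinary probe.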
Old 09-14-2011, 01:53 AM   #32
Neuromancer
Member
 
Location: Goettingen, Germany

Join Date: Aug 2011
Posts: 28
Default

Thanks for this comprehensive answer!!
I'll try that and let you know what has worked.

Many Thanks!
Old 09-22-2011, 03:12 AM   #33
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

I've just released SeqMonk v0.17.0 onto our project's web site. This is the biggest release we've made for an awfully long time and has lots of improvements and new toys to play with. The biggest changes are:
  • Support for HiC data sets, and a new HiC heatmap view to visualise them
  • New program launchers (now with a proper native windows exe) which will automatically configure optimal memory settings.
  • Support for gzipped data in all import filters
  • A new MA plot view
  • Support for very large annotation sets (millions of features)
  • A z-score transformation option in the quantitation tools
  • An option to match distributions exactly in the quantitation options
  • A new statistical filter for pairwise comparison of data stores without the requirement for replicates.

..plus many other smaller improvements and general tidying up. I'll hopefully be adding some more videos to our site in the near future to help illustrate the usage of some of the new tools.
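For anyone unfamiliar with two of the terms in the list above, this Python sketch shows the standard definitions of MA values and z-scores. It is not taken from SeqMonk: whether the program uses base-2 logs or the population rather than the sample standard deviation is an assumption here, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def ma_values(a, b):
    """Return (M, A) for two positive quantitated values.

    M is the log2 ratio and A is the mean of the two log2 values.
    """
    log_a, log_b = math.log2(a), math.log2(b)
    return log_a - log_b, (log_a + log_b) / 2


def z_scores(values):
    """Centre on the mean and scale by the population standard deviation."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("all values are identical, z-scores are undefined")
    return [(v - mu) / sd for v in values]


if __name__ == "__main__":
    print(ma_values(8, 2))
    print(z_scores([1, 2, 3]))
```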
Old 09-26-2011, 01:35 AM   #34
Neuromancer
Member
 
Location: Goettingen, Germany

Join Date: Aug 2011
Posts: 28
Default

Dear Simon,

When I want to start seqmonk v0.17.0 on my iMac, it simply does not start. When I looked in the console I saw the following error message:

9/26/11 10:27:00 AM [0x0-0x1b01b].SeqMonk[971] Could't parse physical memory from the output of top at /Users/Shared/NGS/Programs/SeqMonk/SeqMonk.app/Contents/MacOS/seqmonk line 72.

However on my MacBook the v0.17.0 works fine...!
The iMac is a managed workstation (16GB RAM), so I'm using it with limited read/write access; could that be a problem? Based on the error, I guess it has to do with configuring memory settings by the new automatic launcher...?

edit:
When I launched the seqmonk binary that is mentioned in the error message it says the following:

/Users/Shared/NGS/Programs/SeqMonk/SeqMonk.app/Contents/MacOS/seqmonk ; exit;
$ /Users/Shared/NGS/Programs/SeqMonk/SeqMonk.app/Contents/MacOS/seqmonk ; exit;
Memory ceiling is 8192
Could't parse physical memory from the output of top at /Users/Shared/NGS/Programs/SeqMonk/SeqMonk.app/Contents/MacOS/seqmonk line 72.

Last edited by Neuromancer; 09-26-2011 at 01:39 AM.
Old 09-26-2011, 01:43 AM   #35
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

Sorry to hear this failed. Can you please try running the following command in a terminal and let me know what output you get:

top -l 1 -n 0

I thought top was always available on a mac, which may not be true, or it might be that the formatting is substantially different on some systems.
Old 09-26-2011, 01:58 AM   #36
Neuromancer
Member
 
Location: Goettingen, Germany

Join Date: Aug 2011
Posts: 28
Default

Quote:
Originally Posted by simonandrews View Post
Sorry to hear this failed. Can you please try running the following command in a terminal and let me know what output you get:

top -l 1 -n 0

I thought top was always available on a mac, which may not be true, or it might be that the formatting is substantially different on some systems.
bash-3.2$ top -l 1 -n 0
Processes: 54 total, 2 running, 52 sleeping, 260 threads
2011/09/26 10:57:39
Load Avg: 0.11, 0.07, 0.06
CPU usage: 0.0% user, 25.0% sys, 75.0% idle
SharedLibs: 4944K resident, 12M data, 0B linkedit.
MemRegions: 6256 total, 543M resident, 12M private, 291M shared.
PhysMem: 599M wired, 664M active, 822M inactive, 2085M used, 14G free.
VM: 126G vsize, 1041M framework vsize, 46601(0) pageins, 0(0) pageouts.
Networks: packets: 19761/13M in, 11216/1895K out.
Disks: 40572/1282M read, 25716/882M written.



edit:
runs on SnowLeopard, if that is of any help!
Old 09-26-2011, 02:03 AM   #37
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

Ah OK. When you have that much memory some of the values are reported in GB rather than MB, so the parser fails to recognise the memory settings.

It should be an easy fix. I'll put out an updated version which fixes this.

In the mean time I think you can work round it by running:

/Users/Shared/NGS/Programs/SeqMonk/SeqMonk.app/Contents/MacOS/seqmonk -m 8000

..which should bypass the automatic memory calibration.
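The kind of unit-aware parsing needed here can be sketched in Python. This is not the launcher's real code, which is not shown in this thread; it assumes the PhysMem line has the layout posted above (numbers followed by K, M or G, then a label) and that total physical memory can be approximated as used plus free. The function name is hypothetical.

```python
import re

# Conversion factors to megabytes for the unit suffixes top may print.
UNIT_TO_MB = {"K": 1 / 1024, "M": 1, "G": 1024}


def physical_memory_mb(top_output: str) -> int:
    """Estimate physical memory (MB) from the PhysMem line of `top -l 1 -n 0`."""
    for line in top_output.splitlines():
        if not line.startswith("PhysMem:"):
            continue
        fields = {
            name: float(number) * UNIT_TO_MB[unit]
            for number, unit, name in re.findall(r"(\d+)([KMG]) (\w+)", line)
        }
        if "used" in fields and "free" in fields:
            return int(fields["used"] + fields["free"])
    raise ValueError("Couldn't find a usable PhysMem line in the output of top")


if __name__ == "__main__":
    sample = "PhysMem: 599M wired, 664M active, 822M inactive, 2085M used, 14G free."
    print(physical_memory_mb(sample))
```

The layout of this line differs between OS X versions, so a parser like this would still need a fallback such as the -m option shown above.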
Old 09-26-2011, 02:11 AM   #38
Neuromancer
Member
 
Location: Goettingen, Germany

Join Date: Aug 2011
Posts: 28
Default

Great! That works! Thanks a lot!
Old 09-27-2011, 03:11 AM   #39
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

I've just put out an update to SeqMonk (v0.17.1) which fixes the OSX launcher bug on systems with large amounts of RAM. It also fixes a crash in the HiC plot when using more than 45k probes and adds some more controls to the HiC plot view.
Old 10-14-2011, 09:50 AM   #40
kshankar
Member
 
Location: Little Rock AR

Join Date: Jul 2010
Posts: 12
Default

I am trying to import a large file (~450-500 million Illumina single-end 36 bp reads) into SeqMonk. We have 48 GB of memory on the machine and have assigned 8 GB to SeqMonk. However, after ~330 million reads, we inevitably find 99% of memory being used up and the software slowing down considerably. Is there any way to increase the memory any further, perhaps with the latest Java environment? We are using JRE b1.6.0_24 and the latest SeqMonk (v0.17.1). BTW, the software is immensely useful. Great work, Simon.
Tags
analysis, desktop, seqmonk, visualization
