SEQanswers




Old 10-14-2011, 10:18 AM   #41
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 611
Default

The biggest problem with very large datasets is the initial data import, since all reads have to be held in memory temporarily until the reads mapping to the displayed chromosomes can be cached to disk. Once a file has been cached I don't think that 450M reads would be a considerable problem to deal with (BS-Seq data is much larger than that). So the easiest option would probably be to split the file into 2-4 smaller chunks and import the files individually. Once imported, you can then create a data group in SeqMonk and 'merge' the file parts into a single dataset (group) again.
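The splitting step is easy to script. Here is a minimal Python sketch, assuming an uncompressed SAM file (for BAM you would convert with samtools first); the function name, chunk size and output naming are just illustrative:

```python
# Split a large SAM file into fixed-size chunks so each chunk can be
# imported (and cached) separately, then recombined as a data group.
# Any '@' header lines are repeated in every chunk so each part
# remains a valid SAM file on its own.

def split_sam(path, reads_per_chunk=50_000_000, prefix="chunk"):
    header = []       # '@' header lines seen so far
    chunk_idx = 0
    count = 0
    out = None
    with open(path) as src:
        for line in src:
            if line.startswith("@"):
                header.append(line)
                continue
            # Start a new chunk when needed
            if out is None or count >= reads_per_chunk:
                if out:
                    out.close()
                chunk_idx += 1
                count = 0
                out = open(f"{prefix}_{chunk_idx}.sam", "w")
                out.writelines(header)
            out.write(line)
            count += 1
    if out:
        out.close()
    return chunk_idx   # number of chunks written
```

Each chunk file can then be imported on its own, keeping the peak memory bound by the chunk size rather than the whole run.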

The trouble with Java (according to Simon) is that if you allow it to use stupidly high amounts of RAM it will spend ages on garbage collection while trying to free memory, effectively making everything slower the more memory you give it to play with (I've got 16GB of memory on my machine and Simon wouldn't 'allow' me to use more than 8GB either). Splitting files up should definitely work though.
Old 10-14-2011, 12:02 PM   #42
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

Quote:
Originally Posted by kshankar View Post
I am trying to import a large file (~450-500 million Illumina single-end 36bp reads) into SeqMonk. We have 48 GB of memory on the machine and have assigned 8 GB to SeqMonk. However, after ~330 million reads we inevitably find 99% of memory used up and the software slowing down considerably. Is there any way to increase the memory any more, perhaps in the latest Java environment? We are using JRE 1.6.0_24 and the latest SeqMonk (v0.17.1). BTW, the software is immensely useful. Great work Simon.
If you have a dataset with that many reads then I'm guessing that you've merged several runs into a single file. Rather than doing this outside the program, the way to handle it is to import the files individually and then merge them together within SeqMonk by creating a Data Group. This will be hugely more memory efficient than trying to import everything from one file.

Basically the reason for this is that SeqMonk has an efficient caching mechanism which reduces the amount of data which needs to be held in memory. During normal operation only one chromosome's worth of data is in memory. Whilst loading data, however, the program needs to temporarily store all of the data for one dataset in memory so it can sort it and write out the cache files. If all of your data comes in one dataset then it will all end up in memory whilst being loaded. If the data comes in smaller chunks then these can be cached separately, which will reduce the overhead. As you've found, with 8GB RAM you'll start getting problems over about 250 million sequences in one dataset, but if you split your file into 10 datasets of 50 million sequences each and then imported these you could handle this on a ~2GB machine.
Old 10-14-2011, 12:03 PM   #43
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

I really should read to the end of a thread before replying. I should have known Felix would have got there before me :-)
Old 11-22-2011, 02:43 AM   #44
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

I've just put SeqMonk v0.18.0 up onto the project web site. This release greatly improves the tools for HiC analysis which were a little clunky in their initial incarnation. It also adds a specific RNA-Seq analysis pipeline which allows for simple analysis of RNA-Seq data at the level of transcripts rather than exons.

I've also made changes so that people on multi-CPU machines should see a noticeable decrease in data loading time, as well as making numerous other improvements throughout the program.
Old 11-23-2011, 01:24 PM   #45
kshankar
Member
 
Location: Little Rock AR

Join Date: Jul 2010
Posts: 12
Default

Is there any way for SeqMonk to show the % methylation calls in the .txt file (coming out of Bismark's methylation_extractor)? The calls can be seen in IGV but not in SeqMonk. Any way to input this information?
Old 11-23-2011, 01:33 PM   #46
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 611
Default

The methylation information can be imported into SeqMonk whereby '+' reads are methylated and '-' reads are non-methylated cytosines. Use the position value for both start and end of the cytosine methylation calls. You can then perform a probe generation over individual C positions (e.g. read position probe generation) and do a relative quantitation of 'FORWARD' reads 'as percentage of' 'ALL READS'. You could also look at other genomic features such as CGIs, promoters etc.
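For anyone scripting this reshaping, a minimal Python sketch of the conversion described above, assuming the usual five-column methylation_extractor layout (read ID, +/- state, chromosome, position, call letter); the function name and output layout are illustrative, so check the columns against your own files:

```python
# Convert Bismark methylation_extractor output into a simple
# chromosome / start / end / strand table, where strand '+' marks a
# methylated call and '-' an unmethylated one, and start == end so
# each entry covers just the single cytosine position.
# Assumed input columns: read ID, state (+/-), chromosome,
# position, call letter.

def calls_to_positions(extractor_path, out_path):
    written = 0
    with open(extractor_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith("Bismark"):      # skip the version header line
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 5:
                continue
            _read_id, state, chrom, pos, _call = fields[:5]
            dst.write(f"{chrom}\t{pos}\t{pos}\t{state}\n")
            written += 1
    return written
```

The resulting table can be imported as generic text, after which the forward-as-percentage-of-all-reads quantitation gives % methylation per probe.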

If you are not necessarily interested in strand specific methylation you can also import *bismark.txt files directly into SeqMonk using the Bismark import filter where you can select the context you are interested in. Hope this helps.
Old 01-03-2012, 05:45 AM   #47
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default SeqMonk v0.19.0 released

As a somewhat belated Christmas present I've just put up the release of SeqMonk v0.19.0 onto our project web page. In this release we've made some fairly major changes to the core data model which mean that we get a significant increase in loading and saving speeds (load times are around half what they were before), along with a big decrease in the running memory footprint (down by around 4X) as well as a nice speed increase in many of the analysis functions.

Along with this we've improved some of the plots (aligned probes and probe trend), and have put some new display options into the main chromosome view (greater raw read density, fixed colours for individual datasets).

We've been running this build internally for a while and have seen large increases in the amount of data we've been able to handle - along with a pleasant reduction in the amount of time we spend watching little red bars slowly crawl across the screen.

The updated version is available from our project page. As always, if you don't see the new version try pressing shift+refresh in your browser to bypass the annoying BBSRC proxy server.

If you have any problems, either add a note to this thread, or report them in our bugzilla system.
Old 01-17-2012, 07:20 AM   #48
beajorrin
Junior Member
 
Location: Madrid

Join Date: Jan 2012
Posts: 6
Default

I'm trying to visualize my data with SeqMonk. My data is Illumina paired-end sequences; I worked first with Galaxy and ran Bowtie there, so now I have SAM and BAM files. I could import my reference genome after changing the AC, product and locus_tag entries. I tried first with the BAM file, but when I import this data SeqMonk tells me "Couldn't extract a valid name from <name>".
So I went back to the reference genome I used in Galaxy (the same one I used in SeqMonk) and changed the AC/ID in the fasta file to the one used in the SeqMonk reference genome, and the answer is the same.

I haven't tried the SAM yet, but I think the problem is the reference genome used in Galaxy.

Thanks
Old 01-17-2012, 07:38 AM   #49
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

If your reference genome is chromosome based, but the identifiers are not chromosome names but accession numbers or something similar then you need to define some custom chromosome name mappings so SeqMonk can figure out which accession refers to which chromosome. Once you have the mappings set up then the import should work.
Old 01-18-2012, 06:37 PM   #50
goofy
Junior Member
 
Location: Melbourne

Join Date: Jan 2012
Posts: 1
Default Total Read Count Difference

Hi,

I'm trying to quantitate the percentage distribution of TF enrichment in my control and treated samples, but I got a massive total read count difference between the samples, and I'm wondering what it means. I've designed probes based on promoter, intron and exon regions; they're OK, but I want to normalise against the total read count. My other ChIP-Seqs' total read counts are relatively similar between control and treated samples, but just this TF ChIP-Seq has a massive difference. Does anyone know what this means?
Old 01-19-2012, 12:28 AM   #51
beajorrin
Junior Member
 
Location: Madrid

Join Date: Jan 2012
Posts: 6
Default

Quote:
Originally Posted by simonandrews View Post
Thanks, it works!
Old 01-20-2012, 12:51 AM   #52
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

Quote:
Originally Posted by goofy View Post
Total read count isn't always a great thing to normalise to. In some cases (particularly in ChIP samples) you can get a huge number of sequences mapping to a small number of loci. Often these will be mis-mappings, maybe even of regions which aren't in the assembly (telomeric or centromeric repeats for example). We've seen cases where 40% of reads in a ChIP (a MeDIP actually) came from this kind of sequence and mapped to just 12 locations. This kind of bias can hugely throw off your normalisation.

Within SeqMonk you can use the cumulative distribution plot to look at how well your samples are normalised. If your total count has thrown off the normalisation then you'll probably see lines running parallel to each other. In this case you can then use the percentile normalisation quantitation method to correct your normalisation to a specific point in your distribution where the distributions look to be equivalent, and this should remove any odd biases in the total counts.
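As a rough illustration of what percentile normalisation does (a sketch of the general idea, not SeqMonk's actual implementation): on a log2 scale, matching samples at a chosen percentile amounts to subtracting each sample's own percentile value and adding back a common reference.

```python
# Percentile normalisation sketch: on a log2 scale, matching two
# samples at (say) the 75th percentile lines their distributions up
# at that point, regardless of a few extreme pileups inflating one
# sample's total read count.

import statistics

def percentile(values, pct):
    """Value below which pct% of the sorted data falls (nearest rank)."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, int(round(pct / 100 * (len(s) - 1)))))
    return s[k]

def percentile_normalise(samples, pct=75):
    """samples: dict of name -> list of log2 quantitated probe values."""
    ref = statistics.mean(percentile(v, pct) for v in samples.values())
    return {
        name: [x - percentile(vals, pct) + ref for x in vals]
        for name, vals in samples.items()
    }
```

A simple additive offset on the log scale is exactly the kind of parallel shift the cumulative distribution plot makes visible.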

I'm actually going to be releasing our Advanced SeqMonk course documentation in the next couple of weeks, and there will be a whole section on sorting out data normalisation which will go through these kinds of issues in much more detail.
Old 01-24-2012, 01:59 AM   #53
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

I've just released SeqMonk v0.20.0 onto our repositories. This addresses a potentially nasty bug in v0.19 which may have truncated some filtered probe lists in any projects saved with that version.

The bug would affect you if your probe set contained multiple probes at exactly the same genomic position. In practice this only really happens if you make feature based probes and don't select the option to remove exact duplicates. If you made probe sets like this in v0.19.0 you should recalculate any filtered lists you have made with that version. Most of these won't actually have been affected, but since we can't spot a truncated list automatically it's better to be safe than sorry.

The gory details of the bug can be found on our bugzilla server.

Other changes in this release are:
  • We fixed a bug in the Intensity Difference Filter which was adding the same hit multiple times. All reported hits were real hits, but some may have been duplicated.
  • We fixed a display bug for deduplicated HiC data when it was first imported. Saving and reloading the project would fix the problem.
  • We added a new quantitation pipeline to allow you to easily make 'wiggle' type plots.

The new version is now available from our project page and all users of the previous version are strongly advised to upgrade immediately.
Old 01-30-2012, 11:36 AM   #54
mediator
Member
 
Location: New England

Join Date: Nov 2010
Posts: 27
Default

Hi Simon,
I am using SeqMonk to analyze my RNA-Seq data right now. It's very straightforward and intuitive. I just have a question: after using the quantitation pipeline to perform the RPKM calculation on my data, how do I save the RPKM for all the probes in an export file? Thank you!


Quote:
Originally Posted by simonandrews View Post
Old 01-31-2012, 12:29 AM   #55
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

Quote:
Originally Posted by mediator View Post
Simply create an annotated probe report (Reports > Create Annotated Probe Report). You don't actually need to add any additional annotation as the probes themselves will be named after the transcript to which they relate.
Old 02-02-2012, 07:31 AM   #56
beajorrin
Junior Member
 
Location: Madrid

Join Date: Jan 2012
Posts: 6
Default

I really think that SeqMonk is very useful, but I have a problem. I'm working with Illumina paired-end reads: I've trimmed my reads by quality, mapped them with Bowtie, and finally converted them from SAM to BAM. I've visualized the result with SeqMonk and observed that my reads appear assembled (and Bowtie doesn't assemble, just maps). This doesn't happen if I don't trim my data. What could be the problem?
Thanks
Thanks
Old 02-02-2012, 07:46 AM   #57
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

Quote:
Originally Posted by beajorrin View Post
I really think that SeqMonk is very useful, but I have a problem. I'm working with Illumina paired-end reads: I've trimmed my reads by quality, mapped them with Bowtie, and finally converted them from SAM to BAM.
OK, I'm with you so far (but for the record you could have left out the last step since SeqMonk would have read the SAM files directly - and doesn't care whether they're sorted or not).

Quote:
Originally Posted by beajorrin View Post
I've visualized the result with SeqMonk and observed that my reads appear assembled (and Bowtie doesn't assemble, just maps). This doesn't happen if I don't trim my data. What could be the problem?
I'm not sure what you mean here when you say your reads are assembled. SeqMonk will pack your mapped reads together so you can see as many as possible on the screen, but this isn't an assembly - it's just showing the positions of the reads in the existing genome assembly you mapped against with bowtie. You should have got this whether your data was trimmed or not (except that your untrimmed data might have been more spread out since the mapping efficiency might have been much lower). Could you describe (or post small pictures of) exactly what you're seeing which concerns you?
Old 02-02-2012, 08:26 AM   #58
beajorrin
Junior Member
 
Location: Madrid

Join Date: Jan 2012
Posts: 6
Default

Quote:
Originally Posted by simonandrews View Post
Hi!
First, thanks for your quick answer.

What I see is a different read length in my data. In my original data I have reads of at least 100bp, but when I visualize it with SeqMonk I see reads of 9,000bp or more. Could it be because of the maximum insert size for valid paired-end alignments? I've set it to 10,000. Could SeqMonk join reads that are far away from each other, or is it how I mapped the reads?
thanks

(I upload an image)
Attached Images
File Type: png Captura de pantalla 2012-02-02 a las 17.20.00.png (16.2 KB, 26 views)
Old 02-02-2012, 08:35 AM   #59
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 870
Default

Quote:
Originally Posted by beajorrin View Post
Ah, OK. When you import paired end data SeqMonk displays the inferred insert from the paired set of reads. If you have two reads from the same transcript which mapped 100,000 bases apart then you'll see a read which is 100,000 bases long. Because of this SeqMonk sets a limit on how far apart paired end reads can be. The default is 1kb which is about the limit for insert sizes on the Illumina platform. Unless you're working on a platform which can actually work with much longer insert sizes then you probably don't want to increase this.

Looking at the screenshot you posted you seem to have a big discrepancy between the number of reads mapped before and after trimming your data. This leads me to suspect that something may have gone wrong with your mapping of the trimmed data. When you trim your data you do need to ensure that you keep the sequences in your two fastq files exactly paired - ie if you trim one sequence down to no bases, then you still need to leave it in the file - or remove it completely from both fastq files so that bowtie always sees correctly paired sequences when it does the paired end mapping. My initial guess would be that your fastq files have ended up with different numbers of reads in them causing your data to be mispaired - which will lead to this odd kind of pairing.
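A quick way to catch this kind of mispairing before mapping is to compare the read IDs of the two trimmed fastq files record by record. A minimal Python sketch (the /1 and /2 suffix handling is an assumption about how your reads are named):

```python
# Sanity check that two trimmed fastq files are still exactly paired:
# same number of records, and matching read IDs record by record
# (ignoring the /1 and /2 suffixes some pipelines add).

from itertools import zip_longest

def check_fastq_pairing(path_r1, path_r2):
    def ids(path):
        with open(path) as fh:
            for i, line in enumerate(fh):
                if i % 4 == 0:                      # each record's @header line
                    rid = line.split()[0].lstrip("@")
                    if rid.endswith(("/1", "/2")):  # drop the pair-number suffix
                        rid = rid[:-2]
                    yield rid
    for n, (a, b) in enumerate(zip_longest(ids(path_r1), ids(path_r2)), 1):
        if a is None or b is None:
            raise ValueError(f"files differ in length at record {n}")
        if a != b:
            raise ValueError(f"mismatched read IDs at record {n}: {a!r} vs {b!r}")
    return True
```

Running this after trimming and before mapping makes it obvious whether the trimmer has dropped reads from one file but not the other.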
Old 02-06-2012, 12:33 AM   #60
beajorrin
Junior Member
 
Location: Madrid

Join Date: Jan 2012
Posts: 6
Default

Quote:
Originally Posted by simonandrews View Post
OK! In fact I have an insert size of 500bp, so I have to change it. I'll check the trimmed fastq files to reduce the mispairing.
Thanks

Tags
analysis, desktop, seqmonk, visualization
