SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
ChIP-Seq: Enabling Data Analysis on High-Throughput Data in Large Data Depository Usi Newsbot! Literature Watch 1 04-18-2018 10:50 PM
Cufflinks - Nature Biotech data sets adrian Bioinformatics 1 04-16-2011 05:40 PM
public data sets muchomaas Bioinformatics 2 06-08-2010 02:48 AM
sff_extract: combining data from 454 Flx and Titanium data sets agroster Bioinformatics 7 01-14-2010 11:19 AM
SeqMonk - Flexible analysis of mapped reads simonandrews Bioinformatics 7 07-24-2009 05:12 AM

Reply
 
Thread Tools
Old 01-27-2013, 03:50 AM   #141
mjp
Member
 
Location: USA

Join Date: Mar 2011
Posts: 25
Default DNase I footprints with SeqMonk

Has anybody been successful in finding DNase I footprints using SeqMonk?

Can we use SeqMonk to do this sort of analysis?
mjp is offline   Reply With Quote
Old 01-27-2013, 03:58 AM   #142
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

We've never tried as far as I'm aware. Since it's (as I understand it) just enrichment data you should be able to use the same sort of methods as for ChIP-Seq. If there's something more specific which applies to this kind of data then if you can provide me with a pointer I can look at adding it.

I'd be interested to hear how you get on.
simonandrews is offline   Reply With Quote
Old 01-27-2013, 05:05 AM   #143
mjp
Member
 
Location: USA

Join Date: Mar 2011
Posts: 25
Default DNase I footprints with SeqMonk

To not to cite to many fragments from this short paper I will let you read it if you are interested.

The biggest difference between ChIP and DNase, in my opinion, is that while ChIP looks at the underlying peaks of usually well known proteins, DNase looks at the enrichment of much wider regions. There are also other differences between these two methodologies (see the paper).
Overall, with this kind of experiment people would be actually looking for the depleted (from sequencing tags) regions which were protected by bound proteins.

I guess that could be done in SeqMonk by creating continuous probes across the genome and then looking at the depleted regions instead of the enriched. Having said that, I would probably have some problems setting some kind of cut-off level between enrichment and depletion.

At the moment I'm trying to evaluate the options from the paper I mentioned but it would be nice to have SeqMonk to do that as well, as I already have done some simple analysis in it. If I will succeed to do that with SeqMonk I'll let you know.
mjp is offline   Reply With Quote
Old 02-06-2013, 04:39 AM   #144
mjp
Member
 
Location: USA

Join Date: Mar 2011
Posts: 25
Default DNase I footprints with SeqMonk

So I managed to come up with similar results with SeqMonk as with other software. The one I have in mind here is F-seq, which is fairly commonly used for analysis of DNase data.
In SeqMonk I have used contig probe generator to do this. Actually, SeqMonk returned more reasonable enriched regions that F-seq but many of them overlapped.

Furthermore I managed to relate my data in SeqMonk to the protein binding sites (PBS) I'm interested in.

What I would like to achieve now, is to come up with some kind of systematic way of identifying the depleted regions within the enriched regions. These would correspond to the protected sites where my protein was bound to. I have attached an example of such region.

What you see there is DNase Hypersensitive Sites (DNase_HS), underlying protein biding site (PBS) and at the very bottom the sinlge base-paired probes created within DNase_HS. The underlying dip within the probes would ideally correspond to the depleted region (there might be another one to the right of the first one - between 400 and 500 bp in displayed region).

I have tried to use Z-score re-quantitation to see how different the probes are from the mean but that didn't yield anything informative at the moment.

At the moment I can't think of anything I could use in SeqMonk to annotate probes that have significantly lower values than surrounding probes within a window (which would be what I'm essentially looking for).

Is there any systematic way I could identify such short stretches of depletion in SeqMonk?

Thanks in advance for any input.
Attached Images
File Type: png example_of_footprint.png (11.7 KB, 14 views)
mjp is offline   Reply With Quote
Old 02-06-2013, 01:26 PM   #145
sschmidt
Junior Member
 
Location: USA

Join Date: Nov 2012
Posts: 2
Default

I have RNA-seq libraries made from cell lines that express a transgene, and was able to quantitate using the RPKM pipeline for all existing probes. Is it possible to design a probe for the transgene as well, if so how, and can that be done using the RPKM pipeline?
sschmidt is offline   Reply With Quote
Old 02-07-2013, 12:44 AM   #146
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by sschmidt View Post
I have RNA-seq libraries made from cell lines that express a transgene, and was able to quantitate using the RPKM pipeline for all existing probes. Is it possible to design a probe for the transgene as well, if so how, and can that be done using the RPKM pipeline?
So I'm assuming that you're talking about a novel gene inserted into the main genome which you're also measuring in your data. If that's right then you'll need to find some way to represent the transgene in your genome. You could do this by modifying the existing genome sequence and inserting the novel sequence at the correct position, or you could take a shortcut and make a short extra fake chromosome which just contained the transgene sequence. You'd need to do this for the mapping stage as well as the downstream analysis since otherwise the hits to the transgene won't be mapped. The process for adding a new fake chromosome would be the same as for making a custom genome except that you'd just add the new dat file to the chromosomes in an existing assembly rather than making a new one.

Once you've done that then the transgene should show up the same as any other gene and it would just be a case of picking the value for that gene out of the full set of quantitated data.
simonandrews is offline   Reply With Quote
Old 02-07-2013, 12:53 AM   #147
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by mjp View Post
What I would like to achieve now, is to come up with some kind of systematic way of identifying the depleted regions within the enriched regions. These would correspond to the protected sites where my protein was bound to. I have attached an example of such region.

What you see there is DNase Hypersensitive Sites (DNase_HS), underlying protein biding site (PBS) and at the very bottom the sinlge base-paired probes created within DNase_HS. The underlying dip within the probes would ideally correspond to the depleted region (there might be another one to the right of the first one - between 400 and 500 bp in displayed region).

I have tried to use Z-score re-quantitation to see how different the probes are from the mean but that didn't yield anything informative at the moment.

At the moment I can't think of anything I could use in SeqMonk to annotate probes that have significantly lower values than surrounding probes within a window (which would be what I'm essentially looking for).

Is there any systematic way I could identify such short stretches of depletion in SeqMonk?

Thanks in advance for any input.
Looking at the result you have I think what you'd need would be a new quantitation normalisation method which would do a local subtraction of a smoothed value running through the data. This would remove the large scale effects of the enrichment and leave you with a measure of the local difference to the general enrichment level of the area. Once you had this you could then use the windowed replicate filter to find regions which had a value which was consistently different from 0 over whatever window size you chose. This would then find sets of adjacent depleted probes which would hopefully correspond to your binding sites.

Setting this up would need the addition of a new quantitation method, but it would basically be an adaption of the existing smoothing quantitation so it would be really easy to add. If you want to contact me off list (simon.andrews@babraham.ac.uk) I can give you a development snapshot with this in to test, and if you could let me have some example data of this type it would be really helpful in making sure it's working properly.
simonandrews is offline   Reply With Quote
Old 02-08-2013, 02:54 AM   #148
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

I've just added a new smoothing subtraction quantitation which should do what you need to allow a better systematic analysis of the DNase data. Drop me an email and I can give you a test release containing this code so you can try it out before I put it into an official release.
simonandrews is offline   Reply With Quote
Old 02-11-2013, 09:21 AM   #149
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

I've just release SeqMonk v0.24.0. I've included the smoothing subtraction quantitation I described above which should help for DNase type analyses but there's also lots of other new stuff listed below:
  • Added the ability to export all probe reports in GFF format
  • Added a pipeline to detect antisense transcription from directional RNA-Seq libraries.
  • Added a system which can provide immediate feedback to submitted crash reports if they're ones we've seen before and for which we can offer useful feedback.
  • Added a chi-square based contingency test filter which is useful for bisulphite sequencing libraries (and possibly others too).
  • Added an ID field to reports for cases where the name of a feature isn't useful or unique
  • Added a probe length quantitation option
  • Added a probe name filter which allows you to specify a large list of names and selects probes which match any of them
  • Added an option to merge all transcripts in the RNA-Seq pipeline to create a single gene level measure of transcription
  • Changed the active store parser to a visible stores parse to allow the easy re-import of multiple datasets in a single operation
  • Added an option to generate raw counts to the RNA-Seq quantitation pipeline to allow for easy interfacing with tools such as DESeq which require this
  • Added a smoothing subtraction quantitation method which can be used to detect sudden local changes in quantitation
  • Added the ability to select the order of highlighted probe lists in the scatterplot


Some changes have also been made to address problems in previous versions:
  • We fixed a bug which would produce incorrect p-values following multiple testing correction, but only affected p-values which were initially very high (p>~0.3)
  • We fixed an unnecessary level of multiple testing correction in the intensity difference filter which meant that some candidates which could have been reported were not. Typically we see around a 10% increase in the number of candidates in the new correction method over the previous version.
  • We changed the behaviour of the BAM import filter for paired end data which were mapped with a spliced read mapper. We now show the second read of the pair with the same direction as the first read to indicate the direction of the fragment and preserve the direction in strand specific libraries.
  • The "load probes from file" probe generator has been removed. It was never very well supported and its functionality is better performed by importing the data into an annotation track and using the feature probe generator.
  • A couple of timing bugs were fixed which prevented the import of extra annotation on some linux installations.
  • In HiC analysis we have removed some optimisations in the testing which were leading to unrealistically low p-values for some interactions. We now test against the full set of possible interactions, only making an exception to correct for only cis interactions when all trans interactions have been specifically excluded.
simonandrews is offline   Reply With Quote
Old 02-22-2013, 08:50 AM   #150
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

I've just released v0.24.1 to fix a few bugs which were discovered in the last release. The new release is up on our site now.
  • The RNA-Seq quantitation incorrectly corrected for total read count if you used unmerged transcripts (it corrected against total base count instead of total read count). Counts from versions before 0.24.0 are not affected.
  • The MACS probe generator didn't actually read the user options but always used the defaults. This has always been broken but we didn't notice!
  • A crash was fixed in the visible store parser which was introduced in the last release.
  • A crash was fixed if you ran the intensity difference filter with a very small number of probes
  • A occasional drawing bug on the axis labels of the probe trend plot was fixed. I think this one wins a record for being the longest standing bug in the code. It's been there since the trend plot was first introduced in v0.6 and we've finally found and fixed it.
simonandrews is offline   Reply With Quote
Old 02-22-2013, 12:39 PM   #151
kshankar
Member
 
Location: Little Rock AR

Join Date: Jul 2010
Posts: 12
Default Average read coverage

Is there a way to calculate the average read coverage / exon for RNA-seq datasets using SeqMonk. I imagine it would be similar to the Coverage Depth quantitation except that it gives min or max coverage. We are trying to find genes that have a meaningful RPKM value. Based on this paper (McIntyre et al, BMC Genomics, 2011, 12:293), an average of 5 reads per base (for each exon) was needed to have reliable RPKMs. Didnt know if there was a way to compute this for all exons and then filter ones with insufficient quatitation.
kshankar is offline   Reply With Quote
Old 02-24-2013, 06:40 AM   #152
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by kshankar View Post
Is there a way to calculate the average read coverage / exon for RNA-seq datasets using SeqMonk. I imagine it would be similar to the Coverage Depth quantitation except that it gives min or max coverage. We are trying to find genes that have a meaningful RPKM value. Based on this paper (McIntyre et al, BMC Genomics, 2011, 12:293), an average of 5 reads per base (for each exon) was needed to have reliable RPKMs. Didnt know if there was a way to compute this for all exons and then filter ones with insufficient quatitation.
If you put probes over exons (do feature probes over mRNA and split into subfeatures) and then do a base pair quantitation which corrects for length then you'll get a base count per base of probe, which I think is the measure you wanted.

It would also be pretty easy to extend the coverage depth quantitation to allow mean or median value as a valid measure so I'll add that to the next release, but I think you can do what you want with the existing tools.
simonandrews is offline   Reply With Quote
Old 03-03-2013, 07:35 AM   #153
mathew
Member
 
Location: australia

Join Date: Jan 2011
Posts: 81
Default time course RNA-seq with replicate

I was wondering is it possible to analyze (exon level) RNA-seq data with three replicates over a time course in Seqmonk.

Thanks
mathew is offline   Reply With Quote
Old 03-05-2013, 01:53 AM   #154
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by mathew View Post
I was wondering is it possible to analyze (exon level) RNA-seq data with three replicates over a time course in Seqmonk.
Yes, you can certainly do this kind of analysis with SeqMonk. In essence the analysis would be the same as for any complex experiment type. We did a tutorial video which covers some of the options for this at the transcript level, but you could do the same thing starting from exons.

For the initial quantitation you would just put probes over all exons, but probably filtering your set of transcripts to remove odd biotypes to reduce your set of exons somewhat. We normally just use protein coding mRNAs for our analysis now which about halves the number of transcripts we have to consider from the full Ensembl set. Once you have the probes you'd use a read count quantitation correcting for total count and log transforming. You would do the quantitation normalisation in the same way as for whole transcript or gene quantitation.

For the differential analysis I'd suggest doing an intensity difference filter using replicate sets of all of your replicate groups. I'd then do a replicate set stats analysis using the same replicate groups and then focus on exons which were found to be significant by both of these methods. Once you have the full set of changing exons you could use the clustering tools to find sets of exons which behave similarly to make the interpretation of your analysis more simple.

The problem you're always fighting against with this type of analysis is the number of tests you are performing. If you're looking at all exons in the genome you will have a huge number of probes and potentially low observation levels which will make it difficult to achieve a sufficient level of significance to survive multiple testing correction. You can't pre-filter on difference before doing the intensity difference filter since this will break the statistical background model, but if there's any other way to focus on exon which are more likely to be of interest then you could think about that.

The other option you have if you want to look at alternative splicing it to analyse the splice events rather than the exon expression. When you import your data you would choose to import spliced reads and then import introns rather than exons. You could then use the read position probe generator to put a probe over each different splice event, and the exact overlap quantitation method to count the number of occurrences of each splice event in your different samples. Analysing these data will allow you to focus more explicitly on changes in splicing as opposed to the more mixed signals you might get from looking at read counts in overlapping exons.

On my todo is is another tutorial which goes over the options in the program for looking at alternative splicing since this is something we're increasingly doing and which is getting more feasible with longer read lengths and higher coverage depths.
simonandrews is offline   Reply With Quote
Old 03-05-2013, 08:01 PM   #155
renext
Junior Member
 
Location: U.S.

Join Date: Mar 2013
Posts: 3
Default Raw counts difference between v0.24.0 and v0.24.1

Hi Simon Andrews,

I have really enjoyed to use SeqMonk to analyze my mRNA-seq data. It is certainly a great program.

Recently, I noticed that new version of SeqMonk has been released (v0.24.1) and just simply rerun the analysis of my mRNA-seq data using the new version.

I used 'Quantitation Pipelines - RNA Seq quantitation pipelines' and selected following options - mRNA / Non-strand speicif / Merge transcript isoforms / Generate Raw Counts. Although most of probes gave same counting numbers, surprisingly some probes gave different counting numbers compared to the results from old (v0.24.0).

Changes in counting number seems to be consistent between samples, but it slightly changes p-value when I do intensity Difference filtering or Replicate Set filtering, giving a slightly different set of probes.

FYI, I used exactly same options. I assumed that raw counts should be remained same between v0.24.0 and v0.24.1 because v0.24.0 only incorrectly corrected for total read count if I used unmerged transcripts based on your note - Yes, this is the reason I rerun my analysis using v0.24.1 just in case. I am wondering what's going on. Do you have any idea why I got different read counts for some probes (not all)?

Thanks for your help in advance!
renext is offline   Reply With Quote
Old 03-06-2013, 05:11 AM   #156
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

The code which was changed in 0.24.1 only affected the global correction applied, so if you've been using the 'generate raw counts' option then this wouldn't have been applied so you shouldn't have got anything different between the two versions. I've just been back and checked the logs for that code and apart from deleting a commented line, the only change was in that correction code which wouldn't be executed if you were generating raw counts.

If you're seeing consistent differences for only a small number of probes is it possible that your annotation has changed between the analyses? Could you have been using a filtered set of transcripts for your analysis before, or could your annotation for this assembly have been updated?

I should also point out that in your mail it sounded like you were using raw counts for the statistical testing within SeqMonk (you may not be, it wasn't entirely clear). If you're using the tests inside the program you really want to use normalised log transformed counts for the analyses. Raw counts are only intended to be used for external analyses (DESeq for example) which require them.
simonandrews is offline   Reply With Quote
Old 03-06-2013, 07:21 AM   #157
renext
Junior Member
 
Location: U.S.

Join Date: Mar 2013
Posts: 3
Default

Thanks for the reply!

I have played around what is different between my previous analysis and current one more than ver. of program. In fact, there was a difference other than ver. of program.

In previous analysis, when I imported my mapped files, I selected the option 'split spliced reads'. Then when I run the pipeline, I selected 'merge transcript isoform'. In current one, I didn't check 'split spliced reads' when I imported my data. Although I choose same option for the pipeline (merge transcript isoform), it gave different results as I mentioned depending on 'split spliced reads' option in Import data step.

Since I did a pretty much regular mRNA-seq mapping, not a spliced mapping, I guess I shouldn't check 'split spliced reads'. Is it correct?

In addition, you are right. I was not clear before but I actually used normalized log transformed counts for the statistical analyses in SeqMonk. Main reason I also want to get raw counts is for external analyses as you pointed out.

Thank you again for your wonderful help!
renext is offline   Reply With Quote
Old 03-07-2013, 12:51 AM   #158
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

If you've not done a spliced mapping then the 'split spliced reads' option won't do an awful lot to your data. The one change it will force is to split paired end reads into individual reads rather than generating a single read which covers the pair. It will fix the strand attributes to be correct for directional paired end libraries.

For any RNA-Seq data we'd strongly recommend using the split reads option, even if you didn't use a spliced mapper since it will do the right thing with your data. In particular, the RNA-Seq quantitation needs to use individually mapped reads rather than joined read pairs. The reason for this is that although you want to count reads, if you've done a spliced import some reads will have been split up, and would therefore count double if simple read counts were used. The program therefore counts bases of overlap and then works out the read length for the sample and then uses this to convert the base counts back to read counts. If you've used joined read pairs then you'll get huge (and incorrect) base counts, and you'll get a predicted read length which is also huge (and incorrect).
simonandrews is offline   Reply With Quote
Old 03-07-2013, 08:33 AM   #159
renext
Junior Member
 
Location: U.S.

Join Date: Mar 2013
Posts: 3
Default

[QUOTE=simonandrews;98404]For any RNA-Seq data we'd strongly recommend using the split reads option, even if you didn't use a spliced mapper since it will do the right thing with your data.

Thanks! I got it. So I should use 'split spliced reads' option in data import step almost all the time for RNA-seq data. Then I guess my first analysis (using split spliced reads option) is correct one. I may be confused since I read 'Spliced import. If you have done a spliced mapping you will be offered the option to split spliced reads into their component parts' in the SeqMonk manual. I thought, I can choose that option if I have done a spliced mapping.

But certainly, I found that in your youtube tutorial that you selected 'split spliced reads' option for RNA-seq analysis when you import your data!

I really appreciate your clarification. Thanks!
renext is offline   Reply With Quote
Old 03-07-2013, 01:46 PM   #160
griffon42
Member
 
Location: New York

Join Date: Jan 2009
Posts: 23
Default

I've recently installed SeqMonk - it really looks great, and should be very useful for some of the analyses I'm interested in running. Many thanks to the developers!

However, I've run into a rather strange bug. Looking at the menubar, the "File" "Edit" and "View" pulldowns all work just fine. However, all other menus, such as "Data" "Plots" "Filtering" and so on, can be pulled down, but the options are not accessible - they are all completely greyed out.

Nothing seems to change this. I can import large BAM files just fine, visualize reads just fine, and so forth - just cannot access menu items. I've run into the same problem with trying the sample data set from the Seqmonk website.

Searching the web, I haven't encountered any reports of similar behavior - can anyone comment on this? I feel like i must be missing something obvious.

I'm running SeqMonk on Mac OS X 10.8.2, 32GB RAM (10GB available to SeqMonk), with the more recent version of Java.

Thanks for the help.
griffon42 is offline   Reply With Quote
Reply

Tags
analysis, desktop, seqmonk, visualization

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:56 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO