SEQanswers

Old 09-10-2013, 07:04 AM   #221
crazyhottommy
Senior Member
 
Location: Gainesville

Join Date: Apr 2012
Posts: 140
Default

Quote:
Originally Posted by simonandrews View Post
There shouldn't really be a consensus as the size you use will depend on the nature of the enrichment you're looking at and the insert size of your library among other factors.

When using multiple probe lists (not sets) in SeqMonk you now draw all of the plots in a single window and the slider adjusts all of them simultaneously so they're directly comparable. I'm never really sure how valuable it is to compare the strength of enrichment in these plots since this can be affected by technical artefacts, but it's a really good way to show differences in the patterning or extent (proportion of probes) of the enrichment.
Do you mean that for two different probe lists, it is hard to compare the enrichment of certain marks?
Let's say I have two lists of promoter regions (one list contains the active promoters, the other the inactive promoters, based on the RNA-seq data).

One would expect H3K4me3 to be enriched at active promoters but not at inactive promoters.

Do you mean the aligned probe plot can only show the pattern, and cannot be used to compare the signal strength (the colour intensity in the plot)?


I agree that the aligned probe plot gives the most information about the data set. The probe trend plot is also very good, but it only gives an averaged view. I have seen many papers use (only) a box plot of tag intensity to compare treatment and control, which hides a lot of information. Ideally, one should show the trend plot and the aligned probe plot together, so that readers can see both whether the mark (a TF or a histone modification) is enriched and what proportion of the probes is enriched for it.

Thanks!
Old 09-13-2013, 02:46 AM   #222
Mokinhas
Junior Member
 
Location: Amsterdam

Join Date: Sep 2013
Posts: 4
Default

Hi Simon,

I am a real fan of SeqMonk!! It is great!
However, I am quite new to bioinformatic analysis and I have a small question. I am analysing RNA-seq data and I followed the YouTube video (very useful for starters, btw), but I do not understand what the differential expression in the report means. How can I get a normal fold change? Is that possible?

Thanks in advance.
Old 09-13-2013, 02:54 AM   #223
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by Mokinhas View Post
Hi Simon,

I am a real fan of SeqMonk!! It is great!
However, I am quite new to bioinformatic analysis and I have a small question. I am analysing RNA-seq data and I followed the YouTube video (very useful for starters, btw), but I do not understand what the differential expression in the report means. How can I get a normal fold change? Is that possible?

Thanks in advance.
If you've taken the defaults for the RNA-Seq quantitation then the values recorded for each sample will be log2 RPM (reads per million reads of library). The reports will simply show the quantitated value rather than differences since the quantitation works the same if you have 1, 2 or 100 samples.

If you want to get a fold change from the quantitated values then it's a simple calculation from the log2RPM values. The fold change will just be 2 to the power of the difference in log2RPM, so if you had a value of 3 in one dataset and 5.5 in the other then the difference would be 2.5 and the fold change would be 2^2.5 = 5.7 fold.

If you want to have the differences included in the report then you can do a value differences filter on your data. This will record the difference value against the list so it will show up in the report and you won't have to calculate it afterwards (it will be log2RPM difference though, not fold change).
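The fold change calculation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not anything SeqMonk runs itself, and the function name is made up for the example.

```python
def fold_change(log2_rpm_a: float, log2_rpm_b: float) -> float:
    """Linear fold change of sample B relative to sample A.

    Both inputs are log2 RPM values, so the fold change is
    2 raised to the power of their difference.
    """
    return 2 ** (log2_rpm_b - log2_rpm_a)


# The worked example from the post: 3 in one dataset, 5.5 in the other.
print(round(fold_change(3.0, 5.5), 1))  # 5.7
```

A negative difference gives a value below 1 (for example, a difference of -1 is a fold change of 0.5), which is one reason the log2 difference is often the more convenient number to report.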
Old 09-18-2013, 12:15 AM   #224
Mokinhas
Junior Member
 
Location: Amsterdam

Join Date: Sep 2013
Posts: 4
Default

Thanks for your quick reply Simon. I understand now
Old 09-26-2013, 07:10 AM   #225
Neuromancer
Member
 
Location: Goettingen, Germany

Join Date: Aug 2011
Posts: 28
Default

Hi Simon,

Just a short question about genome versions:
As far as I know, SeqMonk genomes are derived from Ensembl genome releases, right?
So is the current SeqMonk mouse genome (GRCm38) the same as the annotation and coordinates in Ensembl release 73 (i.e. GRCm38.p1 + new annotations by Ensembl)?

[This current release has 38561 genes (Ensembl gene IDs), whereas SeqMonk's probe generator (v0.25.0) generates 32029 genes (feature probes over genes, nothing removed)...]

What's the status of the SeqMonk (mouse) genome then?
Old 10-03-2013, 10:27 AM   #226
mathew
Member
 
Location: australia

Join Date: Jan 2011
Posts: 81
Correlation coefficient in cluster

I have a question about SeqMonk. When we perform clustering on analysed data, we get different subsets of clusters depending on the correlation coefficient. How is the correlation coefficient calculated? Is the correlation calculated over whatever the data matrix is? In that case there should be one correlation value per row, so how can one single correlation coefficient be used to cut the rows? I may be missing something. Any suggestions, please?
Old 10-04-2013, 12:04 AM   #227
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by mathew View Post
I have a question about SeqMonk. When we perform clustering on analysed data, we get different subsets of clusters depending on the correlation coefficient. How is the correlation coefficient calculated? Is the correlation calculated over whatever the data matrix is? In that case there should be one correlation value per row, so how can one single correlation coefficient be used to cut the rows? I may be missing something. Any suggestions, please?
The correlation clustering is an iterative process where you start by making a set of clusters with only one probe in each. In each round the program finds the two most correlated clusters and joins them together. It keeps doing this until all of the clusters are joined together. Since the most strongly correlated clusters are always joined in each round, the level of correlation decreases as the clustering continues. It also means that every cluster join has a specific R value associated with it.

When you adjust the clustering stringency with the slider in SeqMonk what you're actually doing is moving through the cluster tree to find the largest cluster set for which the R value which joined that cluster is at or above the R value that you set. High R values will most likely be found early on in the clustering but will generate only small clusters, smaller or negative R values will be late stage joins of large clusters, so adjusting this threshold allows you to define the stringency of clustering.
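The iterative joining described above can be sketched roughly as follows. This is a minimal illustration and not SeqMonk's actual code: it assumes that two clusters are compared by the Pearson correlation of their mean profiles, and it simply stops joining once the best available R falls below the threshold instead of building the full tree and cutting it afterwards. The probe names and values are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient (R) of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mean_profile(rows):
    """Column-wise mean of the profiles belonging to one cluster."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def cluster(profiles, r_threshold):
    """Repeatedly join the two most correlated clusters.

    Joining stops when the best remaining pair has R below r_threshold.
    """
    clusters = [[name] for name in profiles]  # start with one probe per cluster
    while len(clusters) > 1:
        best_r, (i, j) = max(
            (pearson(mean_profile([profiles[p] for p in a]),
                     mean_profile([profiles[p] for p in b])), (i, j))
            for (i, a), (j, b) in combinations(enumerate(clusters), 2)
        )
        if best_r < r_threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # this join has R == best_r
        del clusters[j]
    return clusters


# Hypothetical quantitated values for four probes across four samples.
profiles = {
    "p1": [1, 2, 3, 4],
    "p2": [2, 4, 6, 9],
    "p3": [4, 3, 2, 1],
    "p4": [9, 6, 5, 2],
}
print(cluster(profiles, 0.9))   # [['p1', 'p2'], ['p3', 'p4']]
print(cluster(profiles, -1.0))  # [['p1', 'p2', 'p3', 'p4']]
```

With a high threshold only the tightly correlated probes are joined, giving small clusters; lowering the threshold allows the late, weakly or negatively correlated joins through, giving fewer and larger clusters.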

Hope this clears things up.
Old 10-04-2013, 12:09 AM   #228
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by Neuromancer View Post
Hi Simon,

Just a short question about genome versions:
As far as I know, SeqMonk genomes are derived from Ensembl genome releases, right?
So is the current SeqMonk mouse genome (GRCm38) the same as the annotation and coordinates in Ensembl release 73 (i.e. GRCm38.p1 + new annotations by Ensembl)?

[This current release has 38561 genes (Ensembl gene IDs), whereas SeqMonk's probe generator (v0.25.0) generates 32029 genes (feature probes over genes, nothing removed)...]

What's the status of the SeqMonk (mouse) genome then?
In general we only update the genomes for new assemblies and the gene builds we distribute are the initial builds for that assembly. GRCm38 hasn't changed its sequence since the initial Ensembl build so the gene models are still on Ensembl v68. If there is a significant improvement in the gene builds then we can update these and SeqMonk will pick up the updates, but we didn't build in a place to record the specific annotation version when we built the back end (would have been nice in retrospect) so we're generally reluctant to do this.

If you want a newer gene build you can always download the GTF file for any specific build and import that as an additional annotation set. You can prefix all of the features with a specific string so you can tell them apart from the core features.
Old 10-24-2013, 01:52 AM   #229
Neuromancer
Member
 
Location: Goettingen, Germany

Join Date: Aug 2011
Posts: 28
Default

Quote:
Originally Posted by simonandrews View Post
If you want a newer gene build you can always download the GTF file for any specific build and import that as an additional annotation set. You can prefix all of the features with a specific string so you can tell them apart from the core features.
Thank you, that was what I had in mind as a solution as well! Thanks for the quick answer.
Old 10-24-2013, 07:46 AM   #230
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

I've just released SeqMonk v0.26.0 onto our project web site. The immediate reason for this is to fix a problem which occurred with the program launcher in the new OS X Mavericks release. We have also included another tool we've been working on which makes it much easier to create and work with custom genomes, so if you just have a collection of fasta files or a GTF file, it's now much easier to use these with SeqMonk.

Please try out the new release and send your experiences either back to us directly or post them in this forum.
Old 10-25-2013, 10:35 AM   #231
rajeshgazara
Junior Member
 
Location: delhi

Join Date: Oct 2013
Posts: 1
How to predict genes from transcriptome data by mapping the transcriptome to the genome

Hi everyone,
I want to ask how to predict genes from transcripts by mapping them to a genome. Please reply if you know how to do this. I am new to this field.
Old 10-25-2013, 12:35 PM   #232
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by rajeshgazara View Post
Hi everyone,
I want to ask how to predict genes from transcripts by mapping them to a genome. Please reply if you know how to do this. I am new to this field.
Probably the most commonly used tool for this would be cufflinks. Since you're asking in a SeqMonk thread I should point out that we've done this kind of analysis and then loaded the raw mapped data and the GTF file from cufflinks into SeqMonk to check the results. We've found that it's been very variable whether the predictions it made matched what we expected from looking at the data ourselves.
Old 11-21-2013, 02:18 PM   #233
tirohia
Member
 
Location: Auckland, NZ

Join Date: Nov 2011
Posts: 46
Default

Hi Simon.

I'm trying to use the tool for constructing genomes via gff and fasta files. I'm trying to get a draft version of the kiwifruit genome (from http://bioinfo.bti.cornell.edu/cgi-b...i/download.cgi) into seqmonk.

I've been loading the gff file (Kiwifruit_pseudomolecule.gff3); it appears to accept it, and then I get nothing when I create a new project. Just 25 blank chromosomes. I've tried adding in the various fasta files, with little success. When I load the scaffold file into the genome creation tool alongside the others, it'll show me the scaffold track, but that is of little value.

Is there some modification to the gff file I need to make that I'm missing? The seqmonk version I'm using was downloaded a week or so ago, so that should be current.

Any thoughts/pointers would be much appreciated.

Cheers
Ben.
Old 11-22-2013, 02:07 AM   #234
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by tirohia View Post
Hi Simon.

I'm trying to use the tool for constructing genomes via gff and fasta files. I'm trying to get a draft version of the kiwifruit genome (from http://bioinfo.bti.cornell.edu/cgi-b...i/download.cgi) into seqmonk.

I've been loading the gff file (Kiwifruit_pseudomolecule.gff3); it appears to accept it, and then I get nothing when I create a new project. Just 25 blank chromosomes. I've tried adding in the various fasta files, with little success. When I load the scaffold file into the genome creation tool alongside the others, it'll show me the scaffold track, but that is of little value.
Hi Ben,

I had a look at this. It's a bug in SeqMonk - it doesn't pick up files with .gff3 extensions when it creates the default annotation set for a new custom genome. If you change the extension to just .gff and rebuild the custom genome it should work.
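If you have several annotation files to fix, the rename can be scripted. A minimal sketch, assuming nothing beyond the extension change described above; the function name is made up for the example.

```python
from pathlib import Path


def rename_gff3(path: str) -> Path:
    """Rename a file ending in .gff3 so that it ends in .gff instead."""
    source = Path(path)
    if source.suffix != ".gff3":
        raise ValueError(f"{source} does not have a .gff3 extension")
    target = source.with_suffix(".gff")
    source.rename(target)  # only the name changes, the content is untouched
    return target


# Example call, left commented out so that nothing is renamed by accident:
# rename_gff3("Kiwifruit_pseudomolecule.gff3")
```

The custom genome still has to be rebuilt afterwards so that the renamed file is picked up.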

I'll fix this in the next release. Thanks for spotting and reporting this.

Cheers

Simon.
Old 11-24-2013, 02:16 PM   #235
tirohia
Member
 
Location: Auckland, NZ

Join Date: Nov 2011
Posts: 46
Default

Ah. That works. Brilliant.

Ta muchly.

Ben.
Old 12-05-2013, 08:50 PM   #236
tirohia
Member
 
Location: Auckland, NZ

Join Date: Nov 2011
Posts: 46
Default

Hi Simon.

Possibly a rehash (of sorts) of an old question, if I may. I'm getting an error when I try to import a bam file into seqmonk: "Couldn't extract a valid name from <name>". It's the same one that was in this post a while back.

I've read the article that you linked to in your response.

So my gff file, where the reference data came from, has entries like this:

Chr6 glean gene 14845357 14856602 . + . ID=Achn215061; status=novel;
Chr6 glean mRNA 14845357 14856602 0.240873 + . ID=Achn215061-TA; Parent=Achn215061; status=novel;
Chr6 glean CDS 14845357 14845647 . + 0 Parent=Achn215061-TA;
Chr6 glean CDS 14848222 14849031 . + 0 Parent=Achn215061-TA;

The corresponding line in the header of the sam file that I have been attempting to import (actually been trying in bam, but this is the sam file it is derived from), has an entry like this:

@SQ SN:Achn215061 LN:4998

Both files have data from all the available chromosomes in them but the sam file only contains accession numbers, not chromosome details. It's all just Achn******. So how does one go about setting up aliases as indicated in the article that you linked to?

I'm assuming at this point that I'll have to add something to the names of the various genes in the sam file, but the article indicates that no regexp matching is attempted on the aliases provided in the file, which would mean that if I added a chromosome name as a prefix/suffix to all the entries in the sam file (Achn215061chr6 maybe), it wouldn't pick them up. I'm not sure where/how I would add the chromosome information in the SAM file.

Am I missing something obvious?

Cheers
Ben.
Old 12-06-2013, 01:32 AM   #237
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by tirohia View Post
I'm getting an error when I try and import a bam file into seqmonk - "Couldn't extract a valid name from <name>". It's the same one that was in this post a while back.

I've read the article that you linked to in your response.

So my gff file, where the reference data came from, has entries like this:

Chr6 glean gene 14845357 14856602 . + . ID=Achn215061; status=novel;
Chr6 glean mRNA 14845357 14856602 0.240873 + . ID=Achn215061-TA; Parent=Achn215061; status=novel;
Chr6 glean CDS 14845357 14845647 . + 0 Parent=Achn215061-TA;
Chr6 glean CDS 14848222 14849031 . + 0 Parent=Achn215061-TA;

The corresponding line in the header of the sam file that I have been attempting to import (actually been trying in bam, but this is the sam file it is derived from), has an entry like this:

@SQ SN:Achn215061 LN:4998

Both files have data from all the available chromosomes in them but the sam file only contains accession numbers, not chromosome details. It's all just Achn******. So how does one go about setting up aliases as indicated in the article that you linked to?
Setting up chromosome aliases is fairly simple. You simply need to create a file called aliases.txt in the folder containing your seqmonk genome and then add in alias[tab]chromosome name pairs to allow seqmonk to do the lookup when importing.
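As a rough sketch of what that file looks like, the snippet below writes alias[tab]chromosome name pairs, one per line, to aliases.txt in a given folder. The aliases and the helper name are hypothetical; the example writes into a temporary folder, and in practice you would point it at your own SeqMonk genome folder.

```python
import tempfile
from pathlib import Path


def write_aliases(genome_dir, aliases):
    """Write alias<TAB>chromosome pairs, one per line, to aliases.txt."""
    target = Path(genome_dir) / "aliases.txt"
    lines = [f"{alias}\t{chromosome}" for alias, chromosome in aliases.items()]
    target.write_text("\n".join(lines) + "\n")
    return target


# Hypothetical aliases: names used in the mapped data -> names in the genome.
example = {"chr6": "Chr6", "6": "Chr6"}

with tempfile.TemporaryDirectory() as genome_dir:
    print(write_aliases(genome_dir, example).read_text())
```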

However, in this case I suspect you might have a different problem. It's difficult to tell from the information you've supplied but I think your data might have been mapped against a transcriptome rather than a genome, so although your genome has assembled chromosomes, the coordinates in your BAM file might be within transcripts rather than being genomic positions. If this is the case then it's not just a case of adding an alias since the positions will be offset in the genome. The aliases file does allow for supplying an offset position as well as an alias, but if you're working in a species which does splicing then even this isn't going to be enough since you will have a different offset for each exon.

It's theoretically possible to translate transcriptome coordinates to genomic coordinates (tophat does this internally, for example), but I've never actually tried it and don't know of a simple way to do it. If your BAM file is mapped against a transcriptome and you want to view the data on a genome, though, this is what you'd need to do.
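To illustrate why a single offset is not enough, here is a minimal sketch of translating one position within a spliced transcript into a genomic position. It is not how tophat or SeqMonk do it; it assumes 1-based inclusive coordinates, and it uses only the two CDS segments of Achn215061-TA quoted above (the real transcript may have more segments). The function name is invented.

```python
def transcript_to_genome(position, exons, strand="+"):
    """Map a 1-based position within a transcript to a genomic position.

    exons is a list of (start, end) genomic intervals, 1-based inclusive.
    The position is counted along the transcript in its direction of
    transcription, so exons are walked in reverse on the minus strand.
    """
    if position < 1:
        raise ValueError("transcript positions are 1-based")
    remaining = position
    for start, end in sorted(exons, reverse=(strand == "-")):
        length = end - start + 1
        if remaining <= length:
            if strand == "+":
                return start + remaining - 1
            return end - remaining + 1
        remaining -= length  # skip this exon and the intron that follows it
    raise ValueError("position lies beyond the end of the transcript")


# The two CDS segments on Chr6 quoted from the gff file in this thread.
cds = [(14845357, 14845647), (14848222, 14849031)]

print(transcript_to_genome(1, cds))    # 14845357
print(transcript_to_genome(291, cds))  # 14845647, last base of the first segment
print(transcript_to_genome(292, cds))  # 14848222, first base of the second segment
```

Positions 291 and 292 are adjacent in the transcript but 2575 bases apart in the genome, so every exon needs its own offset, and a read that spans the junction would have to be split as well.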

If you can give us a bit more information about where this data came from and how the mapping was done we can probably give a more concrete answer.
Old 12-06-2013, 05:34 PM   #238
tirohia
Member
 
Location: Auckland, NZ

Join Date: Nov 2011
Posts: 46
Default

So the reference genome that I've loaded into Seqmonk was the one that I was trying to load a week or two ago, from the gff file at http://bioinfo.bti.cornell.edu/cgi-b...i/download.cgi. (Your fix worked well, thanks for that).

I've taken the corresponding file of coding sequences from that site (ftp://bioinfo.bti.cornell.edu/pub/ki...ruit_cds.fa.gz), which correspond to the 39040 genes in the gff file, and used it to create an index for bwa.
I then used bwa to map transcriptome data against those coding sequences, ending up with a sam file. I imagine this is where the problem is. The genes get split up into chromosomes when the reference is loaded into seqmonk, but there's no chromosome information in the cds file, and thus no chromosome information when the mapped reads are put into the sam file.

Though, for reference, when I first started I mapped my transcriptome data against the kiwifruit pseudomolecule sequences (which are the chromosomes), and I was getting the same error, though far fewer of them.

Any other info that would be helpful?
Old 12-06-2013, 05:43 PM   #239
tirohia
Member
 
Location: Auckland, NZ

Join Date: Nov 2011
Posts: 46
Default

Sorry. I may have answered my own question. When I was using the kiwifruit pseudomolecule sequences (i.e. the chromosome sequences) to map my transcriptome data against, I was getting the same error, but at the time I hadn't found the link about the mapping of the chromosomes.

I'm repeating the mapping with the chromosome sequences, and I'll see whether I get results from which I can figure out how to set up the aliases.

Cheers
Ben.
Old 12-07-2013, 06:16 AM   #240
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Hi Ben,

You should be fine if you map against the chromosome sequences. You'll need to use a splice aware mapper such as tophat to do the mapping, but you can also pass in your GTF file to the mapper so that it will effectively map against the transcriptome first, but will give you genomic coordinates.

Let me know if it works out OK, but hopefully this batch of mapped reads will be OK.

Simon.

Tags
analysis, desktop, seqmonk, visualization
