**mbblack** · 05-23-2013, 09:23 AM

Under the circumstances you describe, I would want to start from reads and do my own mapping to my own (and up to date) reference. Assuming you are talking about the same species in each case, I would at least want them mapped to a common reference. I would also want to run them with my own mapping parameters so I could evaluate the quality of mapping in each for myself.

Also, unless your reference has hundreds of thousands of genes, I'm highly skeptical of that many differentially expressed genes. Those numbers are extremely high, at least for a typical mammalian genome. It is not uncommon to have several thousand significantly differentially expressed genes (applying both a statistical cutoff and a fold change cutoff), but tens of thousands sounds unusually large.
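For illustration, applying both cutoffs might look like the minimal sketch below. The gene names, values, and thresholds (adjusted p-value of 0.05, absolute log2 fold change of 1) are hypothetical and only show the shape of the filter, not recommended settings.

```python
# Hypothetical DE results: (gene, log2 fold change, adjusted p-value).
results = [
    ("GeneA", 2.3, 0.001),
    ("GeneB", 0.4, 0.0005),  # significant, but the change is small
    ("GeneC", -1.8, 0.03),
    ("GeneD", 3.1, 0.20),    # large change, but not significant
]


def filter_de(rows, max_padj=0.05, min_abs_log2fc=1.0):
    """Keep genes that pass BOTH the statistical and the fold change cutoff."""
    return [
        gene
        for gene, log2fc, padj in rows
        if padj <= max_padj and abs(log2fc) >= min_abs_log2fc
    ]


de_genes = filter_de(results)
print(de_genes)
```

Only genes that clear both thresholds survive, which is why a list filtered this way is usually much shorter than the raw list of genes a tool reports.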

**priya** · 05-23-2013, 09:40 AM

Thanks for your reply. Of course, both were sequenced using the same reference genome (mouse). To be clearer: those were not statistically significant DE genes, they were the genes reported after Cufflinks analysis (a huge list of genes). Further DE analysis still has to be done on the counts to get statistically significant DE genes. I will focus on that later in my analysis, once I am clear whether I can proceed with the datasets I have now.

However, as you said, it is better to start with mapping again for both datasets and then perform the downstream analysis.

**mbblack** · 05-23-2013, 09:44 AM

Were they both mapped to the same release of the mouse genome? I would want them mapped to the exact same version and build of the mouse genome. And were they mapped with the same mapping algorithm with the same parameter settings? Again, I would want those identical if I was comparing across the two, or looking to combine them. Even if they were both mapped with Illumina tools, I would want it to be the exact same version of the same software.

**priya** · 05-23-2013, 09:50 AM

One sequencing run was done two years ago and the other recently, so I guess the same version of the reference genome was not used for mapping. Both were mapped using TopHat, but I don't have much information about the parameter settings.

If the same TopHat and Cufflinks tools were used, does it make much difference what the parameter settings were?

**mbblack** · 05-23-2013, 10:07 AM

Well, the genome will have changed over those two years, so that would be a primary reason to re-map all the reads. And sure, software versions and parameters can have a significant effect on final results. None of these tools is very old, and they have been steadily changing and improving over the years, so older versions may be far inferior to current versions. Analysis parameters for mapping algorithms can also have a large effect on the final mapped reads: things like gap and overhang penalties, and mapping QC limits or cutoffs, can all make significant differences.

These are all things you need to control and account for if you are going to analyze these data.

**priya** · 05-23-2013, 10:34 AM

Thanks for your ideas. I will definitely look into those things.

**swbarnes2** · 05-23-2013, 10:35 AM

Researchers are going to spend how many hours working on the results you give them? Hundreds? More?

Spend a few hours to do things right from the beginning. Then when people ask you exactly what you are giving them, you know exactly what it is, because you controlled the work from fastqs on.

**mbblack** · 05-23-2013, 10:42 AM

If nothing else, you say one dataset is reporting 38,000 DE genes from a genome that currently has only 23,158 protein-coding genes in the primary assembly. Are you sure those are actually DE genes? Or did someone or something mess up the mapped reads summary or the Cufflinks analysis?

**pmgr** · 06-04-2013, 01:08 AM

Are there any good tools or methods you would suggest for evaluating mapping quality?

**mbblack** · 06-04-2013, 04:06 AM

There are lots. Some mappers will themselves report useful mapping QC metrics. The Broad Institute has a Java app called RNA-SeQC that produces a number of reports and measures of mapping QC. There is also a tool called RSeQC, and I'm sure there are others as well.

see http://www.broadinstitute.org/cancer/cga/rna-seqc

and https://code.google.com/p/rseqc/
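As a rough idea of the simplest numbers such tools report, here is a minimal sketch that tallies mapped records and records above a mapping quality cutoff from SAM text lines. It is not how RNA-SeQC or RSeQC work internally. It relies only on the SAM columns FLAG (column 2) and MAPQ (column 5) and the flag bits 0x4 (unmapped), 0x100 (secondary) and 0x800 (supplementary). The example records and the `min_mapq=20` cutoff are hypothetical.

```python
def mapping_summary(sam_lines, min_mapq=20):
    """Tally primary SAM records: total, mapped, and mapped with MAPQ >= min_mapq."""
    total = mapped = confident = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2: bitwise FLAG
        mapq = int(fields[4])  # column 5: mapping quality
        if flag & 0x100 or flag & 0x800:
            continue  # ignore secondary and supplementary alignments
        total += 1
        if not flag & 0x4:  # 0x4 set means the record is unmapped
            mapped += 1
            if mapq >= min_mapq:
                confident += 1
    return {
        "primary_records": total,
        "mapped": mapped,
        "mapped_mapq_ok": confident,
        "mapped_fraction": mapped / total if total else 0.0,
    }


# Hypothetical records, for illustration only.
example = [
    "@HD\tVN:1.4",
    "r1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchr2\t200\t3\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t256\tchr3\t300\t3\t4M\t*\t0\t0\tACGT\tIIII",
]
summary = mapping_summary(example)
print(summary)
```

Note that this counts records rather than fragments, so for paired-end data each mate is tallied separately. Comparing these numbers between two datasets mapped with identical settings is one quick way to spot a dataset that behaves differently.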

**priya** · 06-24-2013, 02:40 AM

Dear mbblack,
Thanks for your previous suggestions. I hope you have some suggestions for my current problem.
My samples were sequenced at two different places and at different time points, but on the same platform (Illumina). Even the genome versions used were different. I have now remapped both datasets against the same genome version using TopHat and Cufflinks, and obtained the count and FPKM values.

I am trying to compare samples across the two datasets and look at the gene expression values before going on to differential expression analysis (edgeR).

One sample in the first dataset acts as a replicate of a sample in the second dataset, so in theory their gene expression values should be closely related. But their count values are far apart, and in the hierarchical clustering tree they also sit far apart rather than close together.

I ran TopHat with the same settings on each dataset individually and generated the alignment files. When comparing the two datasets, do we need to run TopHat with any extra parameters so that a comparison is possible?

Or is it due to experimental variation, since the samples were prepared at different time points? Is that the reason the samples cluster apart in the hierarchical clustering?

Looking forward to your suggestions.
Thank you

**dpryan** · 06-24-2013, 02:56 AM

It sounds like you have a significant batch effect, which is pretty common when things are done at different times. You can confirm this from the clustering: if the samples prepared together cluster together, batch is the likely driver. One common way to deal with this is to simply use something like DESeq(2) or edgeR and put the sequencing batch in as a factor in the generalized linear model.
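To see why a batch term helps, here is a toy sketch. It is not what edgeR or DESeq(2) actually compute (they fit negative binomial GLMs on counts); it is a simple additive model on hypothetical log2 expression values for one gene, in a balanced design where groups A and B each appear in both batches. In that balanced case, the batch-adjusted group effect is just the average of the within-batch differences.

```python
# Hypothetical log2 expression for one gene. Batch 2 is shifted by +3
# relative to batch 1; the true B - A difference is 2 in both batches.
log2_expr = {
    ("A", "batch1"): 5.0,
    ("B", "batch1"): 7.0,
    ("A", "batch2"): 8.0,
    ("B", "batch2"): 10.0,
}


def group_effect_within_batch(expr, batches=("batch1", "batch2")):
    """Average the B - A difference computed separately inside each batch."""
    diffs = [expr[("B", b)] - expr[("A", b)] for b in batches]
    return sum(diffs) / len(diffs)


# Naive comparison across batches mixes the batch shift into the result.
naive = log2_expr[("B", "batch2")] - log2_expr[("A", "batch1")]

# Comparing within each batch cancels the shared batch shift.
adjusted = group_effect_within_batch(log2_expr)

print(naive, adjusted)
```

The cross-batch comparison reports a difference of 5, while the within-batch estimate recovers 2. The group difference is kept and only the shift shared by everything in a batch is taken out, which works only because both groups are present in both batches.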

**priya** · 06-24-2013, 04:35 AM

Thanks for the suggestion. But won't removing the batch effect also affect the actual biological variation?

My datasets don't contain any technical or biological replicates. I have read some papers where 2x2 contingency tables (Fisher's exact test) are said to work fine with unreplicated samples. In the edgeR and DESeq package examples, I only see situations with at least one replicate.

**dpryan** · 06-24-2013, 06:30 AM

If your samples don't have biological replicates then you can't calculate any meaningful statistics to start with. Exactly how many samples do you have from each group sequenced in each batch? If you have a single sample from group A sequenced as one batch and then single samples from both groups A and B sequenced as a second batch then you're in a tight spot. If, on the other hand, you sequenced a single sample each from groups A and B as one batch and then again later as another batch then you might be able to control for the batch effect.
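A small sketch can make the difference between those layouts concrete. It builds a design matrix with an intercept, a group term and a batch term, then reports its rank and the residual degrees of freedom (samples minus rank). The column coding and the helper names `design_matrix`, `matrix_rank` and `check_design` are assumptions made for illustration, not part of edgeR or DESeq.

```python
from fractions import Fraction


def design_matrix(samples):
    """samples: list of (group, batch). Columns: intercept, group == 'B', batch == 2."""
    return [[1, int(group == "B"), int(batch == 2)] for group, batch in samples]


def matrix_rank(rows):
    """Rank via Gauss-Jordan elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def check_design(samples):
    """Return (rank, number of coefficients, residual degrees of freedom)."""
    x = design_matrix(samples)
    rank = matrix_rank(x)
    return rank, len(x[0]), len(x) - rank


# One A sample in batch 1, then one A and one B in batch 2.
tight_spot = check_design([("A", 1), ("A", 2), ("B", 2)])
# One A and one B in batch 1, and again in batch 2.
balanced = check_design([("A", 1), ("B", 1), ("A", 2), ("B", 2)])
# One A in batch 1 and one B in batch 2: group and batch are inseparable.
confounded = check_design([("A", 1), ("B", 2)])

print(tight_spot, balanced, confounded)
```

In the first layout all three coefficients can be fitted, but no degrees of freedom are left to estimate variability. The balanced layout leaves one. When each group appears in only one batch, the rank falls below the number of coefficients, so the group and batch effects cannot be told apart at all.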

Analysis of RNA-seq data