  • priya
    Member
    • Apr 2013
    • 57

    Analysis of RNA-seq data

    Hi,
    I have two RNA-Seq datasets, both sequenced on Illumina HiSeq machines, and I received the data as FPKM tables. The samples were sequenced at different places and at different times. I don't have much information on one dataset, as it was done two years ago. The other was processed recently using TopHat and Cufflinks. I need to combine both sets of results, focus on the list of differentially expressed genes, and then apply some bioinformatic approaches for module detection.
    One dataset lists 20,000 DE genes and the second lists 38,000. I am wondering whether it is reasonable to use the data as they are for further analysis, since the library protocols may have differed, which could explain the different numbers of genes in the two datasets.
    Is it a good idea to start again from the raw sequence files and redo the analysis (mapping, Cufflinks) from scratch?

    Any ideas are warmly welcome.
  • mbblack
    Senior Member
    • Aug 2009
    • 245

    #2
    Under the circumstances you describe, I would want to start from reads and do my own mapping to my own (and up to date) reference. Assuming you are talking about the same species in each case, I would at least want them mapped to a common reference. I would also want to run them with my own mapping parameters so I could evaluate the quality of mapping in each for myself.

    Also, unless your reference has hundreds of thousands of genes, I'm highly skeptical of that many differentially expressed genes; those numbers are extremely high for a typical mammalian genome at least. It is not uncommon to have several thousand significantly differentially expressed genes (applying both a statistical cutoff and a fold-change cutoff), but tens of thousands sounds unusually large.
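
To illustrate the double cutoff described here (a statistical cutoff plus a fold-change cutoff), a minimal Python sketch follows. The gene names, adjusted p-values, log2 fold changes, and both thresholds (0.05 and a two-fold change) are made up for illustration and do not come from the datasets discussed in this thread.

```python
# Hypothetical DE results: (gene, adjusted p-value, log2 fold change).
results = [
    ("GeneA", 0.001, 2.3),
    ("GeneB", 0.200, 3.1),   # large change, but not significant
    ("GeneC", 0.010, 0.4),   # significant, but a small change
    ("GeneD", 0.030, -1.7),
]

def filter_de(rows, max_padj=0.05, min_abs_log2fc=1.0):
    """Keep genes passing both the statistical and the fold-change cutoff."""
    return [
        gene
        for gene, padj, log2fc in rows
        if padj <= max_padj and abs(log2fc) >= min_abs_log2fc
    ]

de_genes = filter_de(results)
print(de_genes)
```

Applying either cutoff alone would keep three of the four toy genes; requiring both keeps only the genes that are significant and change substantially in either direction.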
    Michael Black, Ph.D.
    ScitoVation LLC. RTP, N.C.

    • priya
      Member
      • Apr 2013
      • 57

      #3
      Thanks for your reply. Of course, both datasets used the same reference genome (mouse). To be clearer, those were not statistically significant DE genes; they were the genes reported after the Cufflinks analysis (a huge list of genes). Further DE analysis still has to be done on counts to get statistically significant DE genes, but I will focus on that later in my analysis, once I am clear whether I can proceed with the datasets I have now.

      However, as you said, it is better to start with mapping again for both datasets and then perform the downstream analysis.

      • mbblack
        Senior Member
        • Aug 2009
        • 245

        #4
        Were they both mapped to the same release of the mouse genome? I would want them mapped to the exact same version and build of the mouse genome. And were they mapped with the same mapping algorithm with the same parameter settings? Again, I would want those identical if I was comparing across the two, or looking to combine them. Even if they were both mapped with Illumina tools, I would want it to be the exact same version of the same software.
        Michael Black, Ph.D.
        ScitoVation LLC. RTP, N.C.

        • priya
          Member
          • Apr 2013
          • 57

          #5
          One sequencing run was done two years ago and the other recently, so I guess the same version of the reference genome was not used for mapping. Both were mapped using TopHat, but I don't have much information about the parameter settings.

          If the same TopHat and Cufflinks tools were used, irrespective of parameter settings, does it make much difference?

          • mbblack
            Senior Member
            • Aug 2009
            • 245

            #6
            Well, the reference genome assembly and its annotation will have changed over those two years, so that would be a primary reason to re-map all the reads. And sure, software versions and parameters can have a significant effect on final results. None of these tools is very old, and they have been steadily changing and improving over the years, so older versions may be much inferior to current versions. Analysis parameters for mapping algorithms can also have a large effect on the final mapped reads: things like gap and overhang penalties and mapping QC limits or cutoffs can all make significant differences.

            These are all things you need to control and account for if you are going to analyze these data.
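
One simple way to keep track of these things is to write down the settings of each run and diff them before combining anything. The Python sketch below assumes a hand-made record per run; the field names and values are hypothetical placeholders, not the actual settings of either dataset in this thread.

```python
def diff_settings(run_a, run_b):
    """Return {key: (value_a, value_b)} for every setting that differs."""
    keys = sorted(set(run_a) | set(run_b))
    return {
        k: (run_a.get(k), run_b.get(k))
        for k in keys
        if run_a.get(k) != run_b.get(k)
    }

# Hypothetical records; None marks information that was never written down.
old_run = {"genome_build": "build_X", "aligner": "tophat", "aligner_version": None}
new_run = {"genome_build": "build_Y", "aligner": "tophat", "aligner_version": "v_new"}

mismatches = diff_settings(old_run, new_run)
for key, (a, b) in mismatches.items():
    print(f"{key}: {a!r} vs {b!r}")
```

Any key that shows up in `mismatches`, including ones recorded as unknown, is a reason to reprocess both datasets with one documented set of settings.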
            Michael Black, Ph.D.
            ScitoVation LLC. RTP, N.C.

            • priya
              Member
              • Apr 2013
              • 57

              #7
              Thanks for your ideas. I will definitely look into those things.

              • swbarnes2
                Senior Member
                • May 2008
                • 910

                #8
                Researchers are going to spend how many hours working on the results you give them? Hundreds? More?

                Spend a few hours to do things right from the beginning. Then when people ask you exactly what you are giving them, you know exactly what it is, because you controlled the work from fastqs on.

                • mbblack
                  Senior Member
                  • Aug 2009
                  • 245

                  #9
                  If nothing else, you say one dataset is reporting 38000 DE genes from a genome that only currently has 23,158 protein coding genes in the primary assembly. Are you sure those are actually DE genes? Or did someone or something mess up the mapped reads summary or the cufflinks analysis?
                  Michael Black, Ph.D.
                  ScitoVation LLC. RTP, N.C.

                  • pmgr
                    Junior Member
                    • Jun 2012
                    • 5

                    #10
                    Are there any good tools or methods you would propose for evaluating mapping quality?

                    • mbblack
                      Senior Member
                      • Aug 2009
                      • 245

                      #11
                      There are lots. Some mappers themselves will give you some useful mapping QC metrics. The Broad Institute has a Java app called RNA-SeQC that produces a number of reports and measures of mapping QC. There is also a tool called RSeQC, and I'm sure there are others as well.

                      see http://www.broadinstitute.org/cancer/cga/rna-seqc

                      and https://code.google.com/p/rseqc/
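
For a rough first look before running a dedicated QC tool, basic mapping counts can be pulled straight from SAM records. The Python sketch below is not how RNA-SeQC or RSeQC work; it only reads the standard FLAG and MAPQ columns. The toy records and the `min_mapq=10` threshold are assumptions for illustration, and the meaning of MAPQ values differs between aligners, so the threshold would need adjusting for a given mapper.

```python
def mapping_summary(sam_lines, min_mapq=10):
    """Count primary reads, mapped reads, and mapped reads with MAPQ >= min_mapq."""
    total = mapped = confident = 0
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x100 or flag & 0x800:  # skip secondary/supplementary records
            continue
        total += 1
        if not flag & 0x4:                # 0x4 set means the read is unmapped
            mapped += 1
            if mapq >= min_mapq:
                confident += 1
    return {"total": total, "mapped": mapped, "confident": confident}

# Toy SAM records (hypothetical reads, 11 mandatory columns each).
toy_sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t50\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t16\tchr1\t200\t3\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r1\t256\tchr1\t500\t0\t10M\t*\t0\t0\t*\t*",
]

summary = mapping_summary(toy_sam)
print(summary)
```

Comparing these counts between the two datasets, processed with identical settings, gives a quick indication of whether one library maps markedly worse than the other.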
                      Michael Black, Ph.D.
                      ScitoVation LLC. RTP, N.C.

                      • priya
                        Member
                        • Apr 2013
                        • 57

                        #12
                        Dear mbblack,
                        Thanks for your previous suggestions. I hope you have more suggestions for my current problem.
                        My samples were sequenced at two different places and at different times, but on the same platform (Illumina). Even the genome versions used were different. I have now remapped both datasets with TopHat and Cufflinks against the same genome version and obtained the count and FPKM values.

                        I am trying to compare the samples from the two datasets and examine the gene expression values before going on to differential expression analysis (edgeR).

                        One sample in one dataset acts as a replicate of a sample in the second dataset, so in theory the gene expression values of the two samples should be closely related. But when I look at the count values they are far apart, and in the hierarchical clustering tree they also appear far apart rather than closely related.

                        I ran TopHat with the same settings on both datasets individually and generated the alignment files. When we want to compare the two datasets, do we need to run TopHat with any extra parameters to make them comparable?

                        Or is it due to experimental variation, since the samples were prepared at different times? Is that the reason the samples cluster apart in the hierarchical clustering?

                        Looking forward to your suggestions.
                        Thank you

                        • dpryan
                          Devon Ryan
                          • Jul 2011
                          • 3478

                          #13
                          It sounds like you have a significant batch effect, which is pretty common when things are done at different times. This should be apparent if in the clustering, the samples prepared together cluster together. One common way to deal with this is to simply use something like DESeq(2) or edgeR and just put the sequencing batch in as a factor in the generalized linear model.
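
To show what including batch as a factor achieves, here is a deliberately simplified Python sketch: an additive model on log2 expression for a single gene in a balanced design, where both groups appear in both batches. This is not what DESeq2 or edgeR do internally (they fit negative binomial generalized linear models to counts); the expression values and the labels `batch1`, `batch2`, `A`, and `B` are made up for illustration.

```python
# Hypothetical log2 expression of one gene, keyed by (batch, group).
log_expr = {
    ("batch1", "A"): 5.0,
    ("batch1", "B"): 6.0,
    ("batch2", "A"): 8.0,
    ("batch2", "B"): 9.2,
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def additive_effects(y):
    """Least-squares effects for y = mu + batch + group in a balanced 2x2 design."""
    group_effect = mean(y[(b, "B")] - y[(b, "A")] for b in ("batch1", "batch2"))
    batch_effect = mean(y[("batch2", g)] - y[("batch1", g)] for g in ("A", "B"))
    return group_effect, batch_effect

group_effect, batch_effect = additive_effects(log_expr)
# Naive comparison that ignores batch: batch1 A versus batch2 B.
naive = log_expr[("batch2", "B")] - log_expr[("batch1", "A")]
print(f"group effect (batch-adjusted): {group_effect:.2f}")
print(f"batch effect: {batch_effect:.2f}")
print(f"naive cross-batch difference: {naive:.2f}")
```

In this toy case the group difference is estimated from within-batch comparisons, so the large shift between batches does not leak into it, whereas the naive cross-batch comparison mixes the two effects together.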

                          • priya
                            Member
                            • Apr 2013
                            • 57

                            #14
                            Thanks for the suggestion. But won't removing the batch effect also affect the actual biological variation?

                            My datasets don't contain any technical or biological replicates. For such cases I have read some papers where 2x2 contingency tables (Fisher's exact test) work fine with unreplicated samples. In the edgeR and DESeq packages, the examples I see all have at least one replicate...
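
For reference, the 2x2 contingency test mentioned here can be sketched in plain Python. For one gene, each row of the table holds the reads assigned to that gene and all other reads in one sample. The counts below are invented. Note that a test of this kind only accounts for sampling noise between two libraries, not for biological variability between individuals, so a small p-value on its own does not show that a difference would replicate.

```python
from math import exp, lgamma

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def log_prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    observed = log_prob(a)
    # Sum every table that is no more likely than the observed one.
    total = sum(
        exp(log_prob(x))
        for x in range(lo, hi + 1)
        if log_prob(x) <= observed + 1e-7
    )
    return min(1.0, total)

# Hypothetical gene: 30 of 1,000,000 reads in sample 1, 60 of 1,200,000 in sample 2.
p_value = fisher_exact_two_sided(30, 1_000_000 - 30, 60, 1_200_000 - 60)
print(f"p = {p_value:.4g}")
```

The probabilities are computed in log space with `lgamma` so that library sizes in the millions do not produce enormous intermediate integers.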
                            Last edited by priya; 06-24-2013, 04:47 AM.

                            • dpryan
                              Devon Ryan
                              • Jul 2011
                              • 3478

                              #15
                              If your samples don't have biological replicates then you can't calculate any meaningful statistics to start with. Exactly how many samples do you have from each group sequenced in each batch? If you have a single sample from group A sequenced as one batch and then single samples from both groups A and B sequenced as a second batch then you're in a tight spot. If, on the other hand, you sequenced a single sample each from groups A and B as one batch and then again later as another batch then you might be able to control for the batch effect.
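
The difference between these two layouts can be made concrete by counting what an additive batch + group model leaves over. The Python sketch below treats each sample as a link between its batch and its group; the rank of the additive design equals batches + groups minus the number of connected components, and the residual degrees of freedom are samples minus that rank. The function name and the batch and group labels are hypothetical.

```python
def design_summary(samples):
    """Summarise an additive batch + group design given (batch, group) pairs."""
    batches = sorted({b for b, _ in samples})
    groups = sorted({g for _, g in samples})
    # Union-find over batch and group labels; a sample links its batch to its group.
    parent = {("batch", b): ("batch", b) for b in batches}
    parent.update({("group", g): ("group", g) for g in groups})

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for b, g in samples:
        parent[find(("batch", b))] = find(("group", g))

    components = len({find(node) for node in parent})
    rank = len(batches) + len(groups) - components
    comparable = len({find(("group", g)) for g in groups}) == 1
    return {
        "groups_comparable": comparable,
        "residual_df": len(samples) - rank,
    }

# A alone in batch 1, then A and B together in batch 2.
tight_spot = [("1", "A"), ("2", "A"), ("2", "B")]
# A and B sequenced together in batch 1 and again in batch 2.
balanced = [("1", "A"), ("1", "B"), ("2", "A"), ("2", "B")]
# A only in batch 1, B only in batch 2: group and batch are fully confounded.
confounded = [("1", "A"), ("2", "B")]

for name, design in [("tight spot", tight_spot), ("balanced", balanced), ("confounded", confounded)]:
    print(name, design_summary(design))
```

With zero residual degrees of freedom the model fits the data exactly and leaves nothing from which to estimate variability, which is why the first layout is a tight spot; the balanced layout leaves a single degree of freedom, and the fully confounded layout cannot separate group from batch at all.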
