Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Tophat + Cufflinks vs. CLC Bio

    Dear all,

    I'm currently analysing RNAseq data (Illumina paired end, read length 101 bases) in order to investigate different cancer stages.

    So far, I ran the Tuxedo pipeline (Tophat + Cufflinks) in two different versions and my colleaque ran CLC Bio in the most recent version in order to compare the results.

    What bothers me is:
    performing the whole Tuxedo pipeline twice and then analysing the results using cummeRbund (as described in the workflow published in NBT), there are NO genes found to be significantly different between two distinct conditions.
    However, CLC Bio nicely finds genes that are differentially expressed and one of them could already be validated in the lab.

    I now switched to GSNAP for the alignments, hoping to get better results, still this cannot be the main reason for these differences.

    So my question is: what am I doing wrong? My best guess is that the Cufflinks steps are somehow not working properly, but since everything runs through without any error messages, I don't really have a starting point for backtracking mistakes.

    Any help is greatly appreciated!

    Best regards


    M&M:

    Alignments were performed with the stable version against Hg18 using the following commands:

    Code:
    tophat -p 8 -o mapping1 bowtie-0.12.7/genomes/hg18/hg18 R1_001.fastq R2_001.fastq
    
    cufflinks --GTF bowtie-0.12.7/genomes/hg18/hg18_annotations.gtf -p 8 -o annotation1 mapping1/accepted_hits.bam
    
    cuffmerge -g bowtie-0.12.7/genomes/hg18/hg18_annotations.gtf -s bowtie-0.12.7/genomes/hg18/hg18.fa -p 8 assemblies.txt
    
    cuffdiff -m 199 -s 38 -o diff_out -b bowtie-0.12.7/genomes/hg18/hg18.fa -p 8 -L G1,G2 -u merged_asm/merged.gtf mapping1/accepted_hits.bam,mapping2/accepted_hits.bam,mapping3/accepted_hits.bam mapping4/accepted_hits.bam,mapping5/accepted_hits.bam,mapping6/accepted_hits.bam

    and against Hg19 with the beta versions of Tophat (2.0.1), bowtie beta 5 and Cufflinks 2.0.1 using the following commands:

    Code:
    tophat --b2-sensitive -p 8 -o mapping1 bowtie2-2.0.0-beta6/hg19/hg19 R1_001.fastq R2_001.fastq
    
    cufflinks --GTF UCSC_annotation_hg19.gtf -p 8 -o annotation1 mapping1/accepted_hits.bam
    
    cuffmerge -g UCSC_annotation_hg19.gtf -s bowtie2-2.0.0-beta6/hg19/hg19.fa -p 8 assemblies.txt
    
    cuffdiff -m 199 -s 38 -o diff_out -b bowtie2-2.0.0-beta6/hg19/hg19.fa -p 9 -L G1,G2 -u merged_asm/merged.gtf mapping1/accepted_hits.bam,mapping2/accepted_hits.bam,mapping3/accepted_hits.bam mapping4/accepted_hits.bam,mapping5/accepted_hits.bam,mapping6/accepted_hits.bam

  • #2
    2nd Code)
    cuffdiff -p 8 ( if you have more than 8 cores it's not a problem)

    did you verified how many are significant by looking at gene_exp.diff
    do you have rpkm values !! or it was zero.
    Krishna

    Comment


    • #3
      In cuffdiff you have too mention space for different conditions. all 1,2,3 and 4,5,6 you mentioned are at the same conditions respectively!!
      Last edited by Krish_143; 07-19-2012, 02:46 AM.
      Krishna

      Comment


      • #4
        Originally posted by Krish_143 View Post
        2nd Code)
        cuffdiff -p 8 ( if you have more than 8 cores it's not a problem)

        did you verified how many are significant by looking at gene_exp.diff
        do you have rpkm values !! or it was zero.
        #1: 12 cores

        On the FPKMs, some are 0, others are not, what I found strange was that there were so many loci having NOTEST or FAILED...

        Here's the head of my gene_exp.diff

        Code:
        test_id	gene_id	gene	locus	sample_1	sample_2	status	value_1	value_2	log2(fold_change)	test_stat	p_value	q_value	significant
        XLOC_000001	XLOC_000001	uc010nxq.1	chr1:11873-29961	G1	G2	NOTEST	0.0190985	0.0252863	0.404898	-0.0177977	0.9858	1	no
        XLOC_000002	XLOC_000002	uc001aal.1	chr1:69090-70008	G1	G2	NOTEST	0	0.00746344	1.79769e+308	1.79769e+308	0.419618	1	no
        XLOC_000003	XLOC_000003	uc001aaq.2	chr1:321083-321115	G1	G2	NOTEST	0	0	0	0	1	1	no
        XLOC_000004	XLOC_000004	uc001aar.2	chr1:321145-321207	G1	G2	NOTEST	0	0	0	0	1	1	no
        XLOC_000005	XLOC_000005	uc001aau.3,uc009vjk.2,uc021oeh.1,uc021oei.1	chr1:322036-328581	G1	G2	OK	5.93524	3.37971	-0.812406	0.256386	0.797653	0.999991	no
        XLOC_000006	XLOC_000006	uc010nxu.2	chr1:367658-368597	G1	G2	NOTEST	0.0907753	0.0181609	-2.32146	0.42867	0.668164	1	no
        XLOC_000007	XLOC_000007	uc001aax.1	chr1:420205-421839	G1	G2	NOTEST	0	0	0	0	1	1	no
        XLOC_000008	XLOC_000008	uc001abb.3	chr1:568843-568913	G1	G2	NOTEST	0	0	0	0	1	1	no
        XLOC_000009	XLOC_000009	uc001abp.1,uc001abq.1,uc001abr.1,uc009vjn.1,uc009vjo.1,uc021oem.1	chr1:763063-789740	G1	G2	OK	19.5893	16.9383	-0.209779	0.133145	0.894079	0.999991	no
        XLOC_000010	XLOC_000010	uc001abu.1	chr1:846814-850328	G1	G2	NOTEST	0.00863467	0.0516752	2.58126	-0.475898	0.634147	1	no
        XLOC_000011	XLOC_000011	uc001abv.1,uc001abw.1,uc001abx.1	chr1:860529-894679	G1	G2	NOTEST	0.35576	0.258042	-0.463299	0.02397	0.980877	1	no
        XLOC_000012	XLOC_000012	uc001aca.2,uc001acb.1,uc010nyb.1	chr1:895966-901099	G1	G2	NOTEST	0.408548	0.294803	-0.470752	0.133242	0.894002	1	no
        XLOC_000013	XLOC_000013	uc001acd.3,uc001ace.3,uc001acf.3	chr1:901876-910484	G1	G2	NOTEST	0.471644	0.213305	-1.14478	0.168669	0.866057	1	no


        Originally posted by Krish_143 View Post
        In cuffdiff you have too mention space for different conditions. all 1,2,3 and 4,5,6 you mentioned are at the same conditions respectively!!
        Yes, they are biological replicates which makes it two distinct groups (1,2,3 and 4,5,6). I just shortened the rest of the commands because it would serve no purpose to post the same commands with different file / folder names. Sorry if this caused any confusion.

        Comment


        • #5
          You can also provide annotation file to tophat using -G option.

          Your gene_exp.diff head looks fine for me.
          I guess it was due to too many 3-3conditons.. may be one occurred warning at any stage of cufflinks then result might be altered.

          You can try this with 2 sets for trial.
          cuffdiff -o CuffDiff_out annotation.gtf top1/assembly_hits.bam top4/assembly_hits.bam

          whether you can find any significant differences or not.

          no need to use cuffmerge untill unless your interested with novel transcripts and novel splice junctions. For only expression leves go with direct step tophat, cufflinks and cuffdiff.
          Krishna

          Comment


          • #6
            Hey guys,

            It has been a while that I posted in this thread, in the meantime I worked on my analysis but I still did not get one step further...

            @Krish_143: Thanks for the recommendation, I redid my analysis now with a subset of only 6 samples (3 vs. 3) that have been aligned via GSNAP. Alignments look good, however my cufflinks output does not change a bit.

            Checking the results for significance via cummeRbund always returns FALSE for any comparison , be it genes, isoforms, CDS or TSS.

            > any(DEgenes[,"significant"]=="yes")
            > [1] FALSE

            I will now run cufflinks v1.3 with the same samples, as I am aware that cufflinks2 is much more conservative in it's calculations (see several threads here in the forum).

            Does anyone have any other suggestions?
            My alternative plan is to use easyRNASeq instead of cufflinks, maybe someone has experiences with that?

            Best regards

            Comment


            • #7
              limma and voom

              you could use the R-package limma and the function voom on raw count data (HTseq-count from the HTseq framework)...

              Comment


              • #8
                Well seing that easyRNASeq actually offers to output the read counts in ready-to-use objects for both DESeq as well as edgeR, I think I will skip HTseq-count.

                However, I have a different question which still concerns cufflinks:

                Does anyone know whether it is possible to obtain cufflinks' FPKM values for each sample? So in the end I would end up with a matrix of x columns (number of samples) and y rows (number of transcripts) similar to a microarray experiment?
                My first idea would be to use all transcript.gtf files after cufflinks and merge them somehow into one file, however, I suppose that again requires some normalization.
                I know this matrix can be obtained by easyRNASeq where it is based on read counts, so how about cufflinks?

                Comment


                • #9
                  Hi rboettcher,

                  To obtain a matrix of x columns (number of samples) and y rows (number of transcripts), check the cuffdiff output directory and locate files ending with ".read_group_tracking".

                  Best regards,
                  Douglas

                  Comment


                  • #10
                    Hi DZhang,

                    thank you for your reply. I got the table of FPKM values, however, the sample names get lost in this file. All I have are the group names + replicate number (0 based), however, I need to provide the sample names themselves to a colleague.

                    Any recommendations?

                    Comment


                    • #11
                      rboettcher, are you referring to the Columns 2 & 3 in the file? You can use simple scripts or even Excel to annotate the sample names (mostly likely in Column 3).

                      Best regards
                      Douglas

                      Comment


                      • #12
                        Yes that is what I am referring to.
                        I already converted the file in a table format similar to a microarray matrix, nevertheless there is no information on which replicate of group 1 is my actual patient number 1 in group 1.

                        My assumption so far is based on my command line, but I have no way to prove it since there seems to be no documentation on this matter. In short, based on the sample order I would assume the replicate numbers are assigned, so when running:

                        Code:
                        cuffdiff [options] -L G1,G2 \
                        G1sample_1,G1sample_2,G1sample_3 \
                        G2sample_1,G2sample_2,G2sample_3
                        
                        and obtaining this genes.read_group_tracking:
                        
                        tracking_id    condition    replicate
                        XLOC_000001    G1    1
                        XLOC_000001    G1    0
                        XLOC_000001    G1    2
                        XLOC_000001    G2    2
                        XLOC_000001    G2    1
                        XLOC_000001    G2    0
                        my guess is that I have the following correspondence table:

                        G1sample_1 = G1 replicate 0
                        G1sample_2 = G1 replicate 1
                        G1sample_3 = G1 replicate 2
                        G2sample_1 = G2 replicate 0
                        G2sample_2 = G2 replicate 1
                        G2sample_3 = G2 replicate 2

                        Is this correct?
                        I do need this information, as we can see strong variation between the different samples and we need to identify which samples are actually expressing our gene of interest.

                        Best regards

                        Comment


                        • #13
                          Hi rboettcher,

                          According to my experience, yes, the numbering is according to the input order. Please verify with the developers directly, though.

                          You may want to analyze your data with a count-based method (DeSeq or EdgeR) as they are quite transparent considering your specific needs in this project. (I am not even implying count-based methods are in any way superior to FPKM-based methods.)

                          Best regards,
                          Douglas

                          Comment


                          • #14
                            I would run your data through a different pipeline as it seems the cufflinks/cuffdiff estimations of expression and differential expression are very poorly correlated with true vales in simulation tests. Right now it looks like EM based tools that utilize transcriptome alignments are winning. RSEM and eXpress are good tools right now that have been evaluated to outperform cufflinks by "a lot" in terms of consistency and accuracy. The pipeline might go like this:

                            @ each sample align to a transcriptome reference with bowtie or BWA
                            Quantify expression with eXpress for each alignment
                            Quantify differential expression with DESeq or try a newer tool: EBSeq
                            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                            Salk Institute for Biological Studies, La Jolla, CA, USA */

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM
                            • seqadmin
                              Techniques and Challenges in Conservation Genomics
                              by seqadmin



                              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                              Avian Conservation
                              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                              03-08-2024, 10:41 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, 03-27-2024, 06:37 PM
                            0 responses
                            13 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-27-2024, 06:07 PM
                            0 responses
                            11 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-22-2024, 10:03 AM
                            0 responses
                            53 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-21-2024, 07:32 AM
                            0 responses
                            69 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X