Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • running Cuffdiff with biological replicates

    Hello,

    I have posted a question about Cuffdiff last week in this forum, in the subpart "rna-sequencing", but I think this place is more appropriate, since I have seen several topics about it.

    Nevertheless, I haven't found information about the use of Cuffdiff with biological replicates.

    I try to use Cuffdiff to find differentially expressed isoforms between 2 conditions (Raman - HM1). There are 3 replicates per condition. There are 5295 genes and 6432 isoforms.
    In the documentation, it's written "If you have more than one replicate for a sample, supply the SAM files for the sample as a single comma-separated list".
    So I ran Cuffdiff this way:

    cuffdiff -L Raman,HM1 -N --FDR 0.05 /path/fichier.gtf /path/Cufflinks/L2/accepted_hits.bam,/path/Cufflinks/L4/accepted_hits.bam,/path/Cufflinks/L6/accepted_hits.bam /path/Cufflinks/L3/accepted_hits.bam,/path/Cufflinks/L7/accepted_hits.bam,/path/Cufflinks/L8/accepted_hits.bam
    The replicates for the condition Raman are L2, L4 and L6. The replicates for the condition HM1 are L3, L7 and L8.
    I'm interested in the output files isoform_exp.diff:

    Code:
    test_id	gene	locus	sample_1	sample_2	status	value_1	value_2	ln(fold_change)	test_stat	p_value	significant
    EHI_000010.ref	-	DS571600:2419-3622	q1	q2	NOTEST	8.28385	21.4211	0.950069	-2.32208	0.0202284	no
    EHI_000130.alt1	-	DS571600:7792-8309	q1	q2	OK	108.42	6.20207	-2.86113	4.25521	2.08856e-05	yes
    EHI_000130.ref	-	DS571600:7792-8309	q1	q2	OK	1152.64	2299.79	0.690763	-16.1849	0	yes
    EHI_000240.ref	-	DS571186:1554-2669	q1	q2	OK	558.654	857.323	0.428284	-7.87676	3.33067e-15	yes
    EHI_000250.ref	-	DS571186:2850-3551	q1	q2	OK	134.444	301.066	0.80618	-7.77203	7.77156e-15	yes
    ...
    EHI_C00159.ref	-	EH4264:45-392	q1	q2	NOTEST	11.8076	27.8643	0.858599	-2.4726	0.0134133	no
    EHI_000010.ref	-	DS571600:2419-3622	q1	q3	NOTEST	8.28385	14.2453	0.542123	-1.24073	0.214705	no
    EHI_000130.alt1	-	DS571600:7792-8309	q1	q3	OK	108.42	4.26615	-3.2353	4.30991	1.63319e-05	yes
    EHI_000130.ref	-	DS571600:7792-8309	q1	q3	OK	1152.64	1736.74	0.409956	-9.25409	0	yes
    ...
    EHI_C00155.ref	-	DS571588:780-994	q5	q6	OK	4382.71	4702.45	0.0704184	-3.35392	0.000796741	yes
    EHI_C00156.ref	-	DS571646:5447-6016	q5	q6	NOTEST	12.2637	40.6463	1.19826	-3.67794	0.000235123	no
    EHI_C00157.ref	-	DS571705:1482-1899	q5	q6	OK	776.201	1585.64	0.714329	-16.3066	0	yes
    EHI_C00159.ref	-	EH4264:45-392	q5	q6	NOTEST	9.16532	4.74531	-0.65827	1.16396	0.244442	no
    Here is my problem. The file contains 96480 (15*6432) lines (instead of 6432). 15 is the number of combinations between the 6 data sets... I think Cuffdiff did not consider the 3 biological replicates as one condition !
    Moreover, I don't know why I don't have the adjusted p-value information...

    Did someone use Cuffdiff with replicates to get differentially expressed isoforms ?
    Thanks for your help,
    Jane

  • #2
    My problem came from the version of Cuffdiff. To handle replicates, the version 1.0.0 at least should be used !
    Finally, I got the "right files" and the following output, with the command line :

    Code:
     cuffdiff /path/fichier_isoem.gtf -N -u -L Raman,HM1 /path/Cufflinks/L2/accepted_hits.bam,/path/Cufflinks/L4/accepted_hits.bam,/path/Cufflinks/L6/accepted_hits.bam /path/Cufflinks/L3/accepted_hits.bam,/path/Cufflinks/L7/accepted_hits.bam,/path/Cufflinks/L8/accepted_hits.bam
    I remember that I have 6 samples: 2 conditions of 3 replicates each. I'm suprised by the output: it seems that "Inspecting maps and determining fragment length distributions" is done only for 3 samples and "Modeling fragment count overdispersion" for the 3 other samples... How do you interpret this output?

    Code:
    [10:47:14] Loading reference annotation.
    [10:47:15] Inspecting maps and determining fragment length distributions.
    > Map Properties:
    >       Upper Quartile: 1053.00
    >       Number of Multi-Reads: 131 (with 234 total hits)
    >       Read Type: 100bp x 100bp
    >       Fragment Length Distribution: Empirical (learned)
    >                     Estimated Mean: 144.16
    >                  Estimated Std Dev: 16.97
    > Map Properties:
    >       Upper Quartile: 2148.00
    >       Number of Multi-Reads: 1932 (with 3473 total hits)
    >       Read Type: 100bp x 100bp
    >       Fragment Length Distribution: Empirical (learned)
    >                     Estimated Mean: 136.57
    >                  Estimated Std Dev: 15.99
    > Map Properties:
    >       Upper Quartile: 2419.00
    >       Number of Multi-Reads: 10954 (with 23240 total hits)
    >       Read Type: 100bp x 100bp
    >       Fragment Length Distribution: Empirical (learned)
    >                     Estimated Mean: 124.10
    >                  Estimated Std Dev: 19.20
    [12:39:11] Modeling fragment count overdispersion.
    > Map Properties:
    >       Upper Quartile: 3427.00
    >       Number of Multi-Reads: 4714 (with 8673 total hits)
    >       Read Type: 100bp x 100bp
    >       Fragment Length Distribution: Empirical (learned)
    >                     Estimated Mean: 137.82
    >                  Estimated Std Dev: 24.89
    > Map Properties:
    >       Upper Quartile: 1900.00
    >       Number of Multi-Reads: 9871 (with 19464 total hits)
    >       Read Type: 100bp x 100bp
    >       Fragment Length Distribution: Empirical (learned)
    >                     Estimated Mean: 139.56
    >                  Estimated Std Dev: 28.99
    > Map Properties:
    >       Upper Quartile: 2061.00
    >       Number of Multi-Reads: 2252 (with 3480 total hits)
    >       Read Type: 100bp x 100bp
    >       Fragment Length Distribution: Empirical (learned)
    >                     Estimated Mean: 146.45
    >                  Estimated Std Dev: 18.72
    [13:48:50] Modeling fragment count overdispersion.
    [13:48:50] Calculating initial abundance estimates for multi-read correction.
    > Processed 5272 loci.                         [*************************] 100%
    [17:20:35] Testing for differential expression and regulation in locus.
    > Processed 5272 loci.                         [*************************] 100%
    Performed 6425 isoform-level transcription difference tests
    Performed 0 tss-level transcription difference tests
    Performed 5289 gene-level transcription difference tests
    Performed 0 CDS-level transcription difference tests
    Performed 0 splicing tests
    Performed 0 promoter preference tests
    Performing 0 relative CDS output tests
    Writing isoform-level FPKM tracking
    Writing TSS group-level FPKM tracking
    Writing gene-level FPKM tracking
    Writing CDS-level FPKM tracking

    Comment


    • #3
      Wierd.
      I suggest you could use the latest release Cufflinks-1.0.3 package and run it again.

      Good luck!

      Comment


      • #4
        It's already the latest release

        Comment


        • #5
          Hi Jane,

          I remember that I have 6 samples: 2 conditions of 3 replicates each. I'm suprised by the output: it seems that "Inspecting maps and determining fragment length distributions" is done only for 3 samples and "Modeling fragment count overdispersion" for the 3 other samples... How do you interpret this output?
          I also encounter this problem when running Cuffdiff on 2 condition samples (3 replicates for each). But my output is an another variant, which "Inspecting maps and determining fragment length distributions" for 5 replicates before one "Modeling fragment count overdispersion", and then "Inspecting maps and determining fragment length distributions" for the last replicate, followed by another "Modeling fragment count overdispersion". The following posts my output.

          Code:
          You are using Cufflinks v1.0.3, which is the most recent release.
          [15:13:54] Loading reference annotation.
          [15:13:59] Inspecting maps and determining fragment length distributions.
          Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges.  It is recommended that correct paramaters (--frag-len-mean and --frag-len-std-dev) be provided.
          > Map Properties:
          >	Total Map Mass: 5543986.19
          >	Read Type: 81bp single-end
          >	Fragment Length Distribution: Truncated Gaussian (default)
          >	              Default Mean: 200
          >	           Default Std Dev: 80
          Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges.  It is recommended that correct paramaters (--frag-len-mean and --frag-len-std-dev) be provided.
          > Map Properties:
          >	Total Map Mass: 6792178.65
          >	Read Type: 80bp single-end
          >	Fragment Length Distribution: Truncated Gaussian (default)
          >	              Default Mean: 200
          >	           Default Std Dev: 80
          Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges.  It is recommended that correct paramaters (--frag-len-mean and --frag-len-std-dev) be provided.
          > Map Properties:
          >	Total Map Mass: 5448746.17
          >	Read Type: 81bp single-end
          >	Fragment Length Distribution: Truncated Gaussian (default)
          >	              Default Mean: 200
          >	           Default Std Dev: 80
          Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges.  It is recommended that correct paramaters (--frag-len-mean and --frag-len-std-dev) be provided.
          > Map Properties:
          >	Total Map Mass: 10225346.76
          >	Read Type: 82bp single-end
          >	Fragment Length Distribution: Truncated Gaussian (default)
          >	              Default Mean: 200
          >	           Default Std Dev: 80
          Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges.  It is recommended that correct paramaters (--frag-len-mean and --frag-len-std-dev) be provided.
          > Map Properties:
          >	Total Map Mass: 6787293.17
          >	Read Type: 80bp single-end
          >	Fragment Length Distribution: Truncated Gaussian (default)
          >	              Default Mean: 200
          >	           Default Std Dev: 80
          [15:18:25] Modeling fragment count overdispersion.
          Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges.  It is recommended that correct paramaters (--frag-len-mean and --frag-len-std-dev) be provided.
          > Map Properties:
          >	Total Map Mass: 17084659.40
          >	Read Type: 82bp single-end
          >	Fragment Length Distribution: Truncated Gaussian (default)
          >	              Default Mean: 200
          >	           Default Std Dev: 80
          [15:19:53] Modeling fragment count overdispersion.
          [15:19:53] Testing for differential expression and regulation in locus.
          > Processed 20237 loci.                        [*************************] 100%
          Performed 20389 isoform-level transcription difference tests
          Performed 17515 tss-level transcription difference tests
          Performed 15748 gene-level transcription difference tests
          Performed 16804 CDS-level transcription difference tests
          Performed 1876 splicing tests
          Performed 1210 promoter preference tests
          Performing 1790 relative CDS output tests
          Writing isoform-level FPKM tracking
          Writing TSS group-level FPKM tracking
          Writing gene-level FPKM tracking
          Writing CDS-level FPKM tracking
          I guess it is possible that the "Modeling fragment count overdispersion" program is running in another process, which is independent of the "Inspecting maps and determining fragment length distributions". If so, it could happen that the their printings mix with each other. But I am not sure.
          I will try to check the source codes later in the day.

          Cheers,

          Comment


          • #6
            Please let me know Hunny if you find some explanations in the source codes

            I am also wondering about something else. With the command that I ran, I got 358 differentially expressed genes and 504 differentially expressed isoforms. Since it seems to be very few, I reran without any options. Then, I got 483 differentially expressed genes and 632 differentially expressed isoforms. That's quite a big difference... Does it mean that I should "trust" the results with upper quartile normalisation and multi reads correction?
            I've done the analysis with DESeq and I got 1830 differentially expressed genes. Can I conclude that Cuffdiff is not powerful enough? Has someone encountered such a difference?
            Thanks for your feedback.

            Comment


            • #7
              Hi Jane,

              Your Cuffdiff output printing is actually normal !

              [10:47:15] Inspecting maps and determining fragment length distributions.
              This sentence is printed before all processes below.

              > Map Properties:
              .......
              > Map Properties:
              .......
              > Map Properties:
              .......
              [12:39:11] Modeling fragment count overdispersion.
              The first half block above might be for your first condition sample.
              And the following block is likely for your second condition sample. Two "Modeling fragment count overdispersion" are run for two condition samples respectively.
              > Map Properties:
              .......
              > Map Properties:
              .......
              > Map Properties:
              .......
              [13:48:50] Modeling fragment count overdispersion.
              To my problem, it's all because I was using the -p option of Cuffdiff for running multithreads, which affect the printing order. That could be the point.

              I am also wondering about something else. With the command that I ran, I got 358 differentially expressed genes and 504 differentially expressed isoforms. Since it seems to be very few, I reran without any options. Then, I got 483 differentially expressed genes and 632 differentially expressed isoforms. That's quite a big difference... Does it mean that I should "trust" the results with upper quartile normalisation and multi reads correction?
              You mean you got different numbers of "significant yes" genes or isoforms? If so, why don't you run it again only with quartile normalisation to see what will happen?

              I've done the analysis with DESeq and I got 1830 differentially expressed genes. Can I conclude that Cuffdiff is not powerful enough? Has someone encountered such a difference?
              I haven't used DESeq before. Nevertheless, I think DESeq generating much more genes than Cuffdiff doesn't mean the former is better or not.

              Cheers,

              Comment


              • #8
                Originally posted by Hunny View Post
                Hi Jane,

                Your Cuffdiff output printing is actually normal !


                This sentence is printed before all processes below.


                The first half block above might be for your first condition sample.
                And the following block is likely for your second condition sample. Two "Modeling fragment count overdispersion" are run for two condition samples respectively.
                Ok, I see, thanks for your explanations !


                You mean you got different numbers of "significant yes" genes or isoforms? If so, why don't you run it again only with quartile normalisation to see what will happen?
                Yes, that is what I mean. I'm currently running with on one hand, the UQ normalisation and on the other hand, the pre-estimation for multi reads allocation. I will compare on Monday the results.

                I haven't used DESeq before. Nevertheless, I think DESeq generating much more genes than Cuffdiff doesn't mean the former is better or not.

                Cheers,
                I agree, but it shows that either both or one of them are/is wrong....

                Comment


                • #9
                  Similar results

                  Just wanted to say that I've also been comparing the results of DESeq and Cuffdiff in my experiment with biological replicates over two conditions (7/5 samples each). I'm also seeing a much greater number of significant results with DESeq (in my case ~40 vs 2). The normalised mean counts from DESeq and the FPKM values from Cuffdiff are roughly proportional, and fold-changes near identical, and the difference is not due to variations in multiple testing as I'd initially feared. It's in the initial estimation of significance that the difference lies, and I'm having trouble deciding which is the more sensible approach.

                  Comment


                  • #10
                    Thanks for sharing ! I haven't made any progress on this topic yet, but if I do, I will update this post

                    Comment


                    • #11
                      Hi,
                      I have some problem with the cuffdiff running. I run the cuffdiff for two conditions (13 samples vs. 8 samples) with the following code:

                      cuffdiff -o /path/result /path/reference.gtf sample1,sample2,...sample13 sample14,...sample21

                      I want to look at their gene expression difference. The server kept giving me the error message. And I only got 3 rows of the results after the running stopped. The error message is like the following:
                      PBS Job Id: 3324389.mast
                      Job Name: test
                      Exec host: comp116/11+comp116/10+comp116/9+comp116/8+comp116/7+comp116/6+comp116/5+comp116/4+comp116/3+comp116/2+comp116/1+comp116/0
                      An error has occurred processing your job, see below.
                      Post job file processing error; job 3324389.mast on host comp116/11+comp116/10+comp116/9+comp116/8+comp116/7+comp116/6+comp116/5+comp116/4+comp116/3+comp116/2+comp116/1+comp116/0


                      The cuffdiff is running at the back stage. I am just wondering where I can get the error message from the cuffdiff? which file?

                      Comment


                      • #12
                        Hi,
                        your error is a PBS error. I get the same type of message every once in a while and it is generally due to high load on the server. I try to copy all the files needed to process a job to the local node and make sure all the temp files etc point to a local directory to reduce the network load during processing. You could also run your job interactively to see what the exact error is. If you run interactively and there is a network error, you will get disconnected, if it is a cuffdiff issue, cuffdiff will error out and you will get a cuffdiff message.
                        Bob

                        Comment


                        • #13
                          Originally posted by bioBob View Post
                          Hi,
                          your error is a PBS error. I get the same type of message every once in a while and it is generally due to high load on the server. I try to copy all the files needed to process a job to the local node and make sure all the temp files etc point to a local directory to reduce the network load during processing. You could also run your job interactively to see what the exact error is. If you run interactively and there is a network error, you will get disconnected, if it is a cuffdiff issue, cuffdiff will error out and you will get a cuffdiff message.
                          Bob
                          Thanks so much! I will try that.

                          Comment


                          • #14
                            Originally posted by bioBob View Post
                            Hi,
                            your error is a PBS error. I get the same type of message every once in a while and it is generally due to high load on the server. I try to copy all the files needed to process a job to the local node and make sure all the temp files etc point to a local directory to reduce the network load during processing. You could also run your job interactively to see what the exact error is. If you run interactively and there is a network error, you will get disconnected, if it is a cuffdiff issue, cuffdiff will error out and you will get a cuffdiff message.
                            Bob
                            Hi, Bob,
                            After I ran this job for another time. There is no error message like before. However the result is still confusing. There is only 3 rows in the gene_exp.diff file. The first part of the log looks OK. They are about some locus processing... And the last part of the log is:
                            Performed 188097 isoform-level transcription difference tests
                            Performed 0 tss-level transcription difference tests
                            Performed 3 gene-level transcription difference tests
                            Performed 0 CDS-level transcription difference tests
                            Performed 0 splicing tests
                            Performed 0 promoter preference tests
                            Performing 0 relative CDS output tests
                            Writing isoform-level FPKM tracking
                            Writing TSS group-level FPKM tracking
                            Writing gene-level FPKM tracking
                            Writing CDS-level FPKM tracking
                            Writing isoform-level count tracking
                            Writing TSS group-level count tracking
                            Writing gene-level count tracking
                            Writing CDS-level count tracking
                            Writing isoform-level read group tracking
                            Writing TSS group-level read group tracking
                            Writing gene-level read group tracking
                            Writing CDS-level read group tracking
                            Writing read group info
                            Writing run info

                            I also posted my question in another post. Do you have any comments for that? Thanks!

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Current Approaches to Protein Sequencing
                              by seqadmin


                              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                              04-04-2024, 04:25 PM
                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, 04-11-2024, 12:08 PM
                            0 responses
                            24 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-10-2024, 10:19 PM
                            0 responses
                            25 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-10-2024, 09:21 AM
                            0 responses
                            21 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-04-2024, 09:00 AM
                            0 responses
                            52 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X