Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cuffdiff output - how reliable?

    I would really appreciate explanations as I try to understand my cuffdiff output -

    I ran cuffdiff through galaxy on my RNA-seq reads (paired end 50x50, Illumina Hiseq, no replicates - 1 control + 3 treated) pre-aligned to hg19 using cufflinks. I first ran a cuffcompare and then I inputted my bam files to run cuffdiff, using ensembl for reference annotation. I am confused by my cuffdiff output and would really appreciate some help -

    1. several genes that I am interested are not present in my differential expression of transcripts file. I am unsure as to why but genes such as SFRP1, XAGE1, SPANXA12 etc are missing from the output. Is there a filtering step early that dictates which genes are included in the output and which are not? I thought cuffdiff performed statistical analysis and listed all transcripts, irrespective of significance or not?

    2. as far as FPKM values go, I realized after attempting q RT-PCR analyses that low FPKM does not seem to correspond with low expression? If I am interested in genes that are not expressed in control and get expressed in treated, how would I set a cutoff to filter this list out? How do I correlate expression and FPKM values?

    3. I got an unexpected result from my cuffdiff that told me that all my treatments resulted in the significant upregulation of a similar number of genes. This is not what I expected at all and on attempting to validate results in the lab I had very poor validation. Also, I am unable to visually appreciate any big differences using IGV, despite cuffdiff output indicating that I have a significant, several fold changes between my treated and control. Again, why is there such a poor correlation?

    I would really appreciate your input on these questions so that I know how to proceed with my analyses.

  • #2
    Hi,
    Did you get an answer ?
    I am troubling for the same.
    Thank you

    Originally posted by whoopster101 View Post
    I would really appreciate explanations as I try to understand my cuffdiff output -

    I ran cuffdiff through galaxy on my RNA-seq reads (paired end 50x50, Illumina Hiseq, no replicates - 1 control + 3 treated) pre-aligned to hg19 using cufflinks. I first ran a cuffcompare and then I inputted my bam files to run cuffdiff, using ensembl for reference annotation. I am confused by my cuffdiff output and would really appreciate some help -

    1. several genes that I am interested are not present in my differential expression of transcripts file. I am unsure as to why but genes such as SFRP1, XAGE1, SPANXA12 etc are missing from the output. Is there a filtering step early that dictates which genes are included in the output and which are not? I thought cuffdiff performed statistical analysis and listed all transcripts, irrespective of significance or not?

    2. as far as FPKM values go, I realized after attempting q RT-PCR analyses that low FPKM does not seem to correspond with low expression? If I am interested in genes that are not expressed in control and get expressed in treated, how would I set a cutoff to filter this list out? How do I correlate expression and FPKM values?

    3. I got an unexpected result from my cuffdiff that told me that all my treatments resulted in the significant upregulation of a similar number of genes. This is not what I expected at all and on attempting to validate results in the lab I had very poor validation. Also, I am unable to visually appreciate any big differences using IGV, despite cuffdiff output indicating that I have a significant, several fold changes between my treated and control. Again, why is there such a poor correlation?

    I would really appreciate your input on these questions so that I know how to proceed with my analyses.

    Comment


    • #3
      Cufflinks/Cuffmerge pipeline

      I am running the TopHat/Cufflinks/Cuffmerge/Cuffdiff pipeline in 45 samples (the specie is a plant), I don’t have annotation, RNA-Seq analysis (Illumina 2x100bp reads).

      I don’t understand the results obtained. The numbers of genes obtained from Cufflinks are quite different from the number of genes output of Cuffmerge/Cufflinks.

      What I mean is: when I count the numbers of genes expressed per sample present in the file genes.read_group_tracking (Cuffdiff), the numbers of genes is much greater than the number obtained initially from Cufflinks (genes.fpkm_tracking file). For example, Cufflinks predicts 25,000 genes to the sample "X" but when I verify the predicted genes in the same sample in the file genes.read_group_tracking the number of genes is 35,000. And this is consistent through all samples.

      As I understood, there are two “moments of reconstruction” of the genes/transcripts – first Cufflinks performs the first reconstruction (individually sample by sample), second Cuffmerge/Cuffdiff performs again a kind of meta-reconstruction based on the overall information (all samples).

      I would expect less genes in the final assembly (after Cuffmerge/Cuffdiff) or at least a very similar number but that apparently doesn’t happen.

      Could you give me your opinion ?

      Regards,

      P.

      Comment


      • #4
        I think, 45 samples does not make any sense, does it ?
        categorise your samples and you can see the diff_genes ?
        what is the problem with that ?
        cufflinks makes an internal reference ID for a analysis you may not track in different run.
        Or possibly, I dont understand your problem,
        Please explain what you wish to achieve ..

        Thank you


        Originally posted by paulorapazote View Post
        I am running the TopHat/Cufflinks/Cuffmerge/Cuffdiff pipeline in 45 samples (the specie is a plant), I don’t have annotation, RNA-Seq analysis (Illumina 2x100bp reads).

        I don’t understand the results obtained. The numbers of genes obtained from Cufflinks are quite different from the number of genes output of Cuffmerge/Cufflinks.

        What I mean is: when I count the numbers of genes expressed per sample present in the file genes.read_group_tracking (Cuffdiff), the numbers of genes is much greater than the number obtained initially from Cufflinks (genes.fpkm_tracking file). For example, Cufflinks predicts 25,000 genes to the sample "X" but when I verify the predicted genes in the same sample in the file genes.read_group_tracking the number of genes is 35,000. And this is consistent through all samples.

        As I understood, there are two “moments of reconstruction” of the genes/transcripts – first Cufflinks performs the first reconstruction (individually sample by sample), second Cuffmerge/Cuffdiff performs again a kind of meta-reconstruction based on the overall information (all samples).

        I would expect less genes in the final assembly (after Cuffmerge/Cuffdiff) or at least a very similar number but that apparently doesn’t happen.

        Could you give me your opinion ?

        Regards,

        P.

        Comment


        • #5
          I am performing differential analysis with 15 samples times 3 replicates, 45 in total.
          When I run Cufflinks to assembly the genes/transcripts, I have 45 inputs and 45 outputs , right? At this stage I checked the number of genes predicted to one of the samples (“X” sample) and the number was 25,000.

          After, I ran Cuffmerge that “is essentially a meta-assembler”, and after that Cuffdiff (I labelled the samples). I checked again the number of genes predicted (genes.read_group_tracking file) to the sample “X” and now the quantity is 35,000.

          In resume, Cufflinks predicts initially 25,000 genes to this specific “X” sample but after the “meta-assembly” (Cuffmerge/Cuffdiff) the number is 35,000 genes. If Cuffmerge “filters a number of transfrags that are probably artfifacts” I would expect less genes in the final, not more. That’s a huge difference.

          As I understood, basically we have to assemblies moments in the pipeline: first Cufflinks (assembles each sample individually), and second occurs a meta-assembly with the information from the 15x3 samples. Am I right?

          The different output formats are very clear to me in this moment, but I cannot explain this difference in the number of predicted genes.

          Final question – Why the numbers of genes predicted at the different stages of the pipeline is not similar?

          Thanks for your time.

          Regards,

          P

          Comment


          • #6
            Please:
            1. check the ID/name of the genes, are there duplicates or more copies of same genes? It happnes usually due to transcripts of genes..
            2. Please post the detailed commands starting from tophat to cuffdiff then only i can understand what is going on with your data.
            Thank you

            Originally posted by paulorapazote View Post
            I am performing differential analysis with 15 samples times 3 replicates, 45 in total.
            When I run Cufflinks to assembly the genes/transcripts, I have 45 inputs and 45 outputs , right? At this stage I checked the number of genes predicted to one of the samples (“X” sample) and the number was 25,000.

            After, I ran Cuffmerge that “is essentially a meta-assembler”, and after that Cuffdiff (I labelled the samples). I checked again the number of genes predicted (genes.read_group_tracking file) to the sample “X” and now the quantity is 35,000.

            In resume, Cufflinks predicts initially 25,000 genes to this specific “X” sample but after the “meta-assembly” (Cuffmerge/Cuffdiff) the number is 35,000 genes. If Cuffmerge “filters a number of transfrags that are probably artfifacts” I would expect less genes in the final, not more. That’s a huge difference.

            As I understood, basically we have to assemblies moments in the pipeline: first Cufflinks (assembles each sample individually), and second occurs a meta-assembly with the information from the 15x3 samples. Am I right?

            The different output formats are very clear to me in this moment, but I cannot explain this difference in the number of predicted genes.

            Final question – Why the numbers of genes predicted at the different stages of the pipeline is not similar?

            Thanks for your time.

            Regards,

            P

            Comment


            • #7
              Diplication of genes Cufflinks/Cuffmerge

              Thanks for your time.

              I didnt find duplication of genes until now, but I am going to check with more detail.

              I dont know why but the column with the coverage is normally empty in all files, only in the isoforms.fpkm_tracking file the coverage results are present.

              Below, the commands I used:

              # Running TopHat (45 times)
              /tophat-2.0.7.Linux_x86_64/tophat2 -o $fileName -p 16 -I 50000 -i 65 -r $mate_inner_dist /reference_genome/Barley_Bowtie_Index/assembly3_WGSMorex_rbca ./$Pair_R1 ./$Pair_R2


              # Running Cufflinks (45 times)
              /cufflinks-2.1.1.Linux_x86_64/cufflinks -p 16 –b /reference_genome/Barley_Bowtie_Index/assembly3_WGSMorex_rbca.fa –u /$fileName/accepted_hits.sam


              # Running Cuffmerge (once), assemblies.txt file with the 45 paths to the transcripts.gft files
              /cufflinks-2.1.1.Linux_x86_64/cuffmerge -p 16 -s /reference_genome/Barley_Bowtie_Index/assembly3_WGSMorex_rbca.fa ./assemblies.txt


              ### Running Cuffdiff (once)
              /cufflinks-2.1.1.Linux_x86_64/cuffdiff -p 16 -b /reference_genome/Barley_Bowtie_Index/assembly3_WGSMorex_rbca.fa -L A,B,C,D,E,F,G,H,I,J,K,L,M,N,O -u /merged_asm/merged.gtf $A1/accepted_hits.sam,$A2/accepted_hits.sam,$A3/accepted_hits.sam $B1/accepted_hits.sam,$B2/accepted_hits.sam,$B3/accepted_hits.sam $C1/accepted_hits.sam,$C2/accepted_hits.sam,$C3/accepted_hits.sam $D1/accepted_hits.sam,$D2/accepted_hits.sam,$D3/accepted_hits.sam $E1/accepted_hits.sam,$E2/accepted_hits.sam,$E3/accepted_hits.sam $F1/accepted_hits.sam,$F2/accepted_hits.sam,$F3/accepted_hits.sam $G1/accepted_hits.sam,$G2/accepted_hits.sam,$G3/accepted_hits.sam $H1/accepted_hits.sam,$H2/accepted_hits.sam,$H3/accepted_hits.sam $I1/accepted_hits.sam,$I2/accepted_hits.sam,$I3/accepted_hits.sam $J1/accepted_hits.sam,$J2/accepted_hits.sam,$J3/accepted_hits.sam $K1/accepted_hits.sam,$K2/accepted_hits.sam,$K3/accepted_hits.sam $L1/accepted_hits.sam,$L2/accepted_hits.sam,$L3/accepted_hits.sam $M1/accepted_hits.sam,$M2/accepted_hits.sam,$M3/accepted_hits.sam $N1/accepted_hits.sam,$N2/accepted_hits.sam,$N3/accepted_hits.sam $O1/accepted_hits.sam,$O2/accepted_hits.sam,$O3/accepted_hits.sam

              Many Thanks,

              P

              Comment


              • #8
                Before I comment anything, Please tell me how much experience do you have working under linux and tophat and cufflinks. is this your first time ?
                Thanks.
                jp.
                Originally posted by paulorapazote View Post
                I didnt find duplication of genes until now, but I am going to check with more detail.

                I dont know why but the column with the coverage is normally empty in all files, only in the isoforms.fpkm_tracking file the coverage results are present.

                Below, the commands I used:

                # Running TopHat (45 times)
                /tophat-2.0.7.Linux_x86_64/tophat2 -o $fileName -p 16 -I 50000 -i 65 -r $mate_inner_dist /reference_genome/Barley_Bowtie_Index/assembly3_WGSMorex_rbca ./$Pair_R1 ./$Pair_R2


                # Running Cufflinks (45 times)
                /cufflinks-2.1.1.Linux_x86_64/cufflinks -p 16 –b /reference_genome/Barley_Bowtie_Index/assembly3_WGSMorex_rbca.fa –u /$fileName/accepted_hits.sam


                # Running Cuffmerge (once), assemblies.txt file with the 45 paths to the transcripts.gft files
                /cufflinks-2.1.1.Linux_x86_64/cuffmerge -p 16 -s /reference_genome/Barley_Bowtie_Index/assembly3_WGSMorex_rbca.fa ./assemblies.txt


                ### Running Cuffdiff (once)
                /cufflinks-2.1.1.Linux_x86_64/cuffdiff -p 16 -b /reference_genome/Barley_Bowtie_Index/assembly3_WGSMorex_rbca.fa -L A,B,C,D,E,F,G,H,I,J,K,L,M,N,O -u /merged_asm/merged.gtf $A1/accepted_hits.sam,$A2/accepted_hits.sam,$A3/accepted_hits.sam $B1/accepted_hits.sam,$B2/accepted_hits.sam,$B3/accepted_hits.sam $C1/accepted_hits.sam,$C2/accepted_hits.sam,$C3/accepted_hits.sam $D1/accepted_hits.sam,$D2/accepted_hits.sam,$D3/accepted_hits.sam $E1/accepted_hits.sam,$E2/accepted_hits.sam,$E3/accepted_hits.sam $F1/accepted_hits.sam,$F2/accepted_hits.sam,$F3/accepted_hits.sam $G1/accepted_hits.sam,$G2/accepted_hits.sam,$G3/accepted_hits.sam $H1/accepted_hits.sam,$H2/accepted_hits.sam,$H3/accepted_hits.sam $I1/accepted_hits.sam,$I2/accepted_hits.sam,$I3/accepted_hits.sam $J1/accepted_hits.sam,$J2/accepted_hits.sam,$J3/accepted_hits.sam $K1/accepted_hits.sam,$K2/accepted_hits.sam,$K3/accepted_hits.sam $L1/accepted_hits.sam,$L2/accepted_hits.sam,$L3/accepted_hits.sam $M1/accepted_hits.sam,$M2/accepted_hits.sam,$M3/accepted_hits.sam $N1/accepted_hits.sam,$N2/accepted_hits.sam,$N3/accepted_hits.sam $O1/accepted_hits.sam,$O2/accepted_hits.sam,$O3/accepted_hits.sam

                Many Thanks,

                P

                Comment


                • #9
                  Hi

                  Intermediate/comfortable experience in Linux and TopHat (Tophat, Mapsplice, SpliceMap, Supersplat, Bowtie) beginner in Cufflinks (started 3 months ago).

                  Regards,

                  P

                  Comment


                  • #10
                    Hi

                    I decided to investigate genes that could be repeated (more than one gene exactly in the same locus) in the Cuffdiff output and I found around 2000 genes in these circumstances. Most of cases mean 2 genes in the same locus, i.e. XLOC_000128 and XLOC_000129 at the locus contig_101175:16-3372.

                    Apparently this could be explained because genes could be in opposite strands, however there are case where 3,4 and 5 genes are exactly in the same locus...

                    Below, the “worst case”, the locus contig_135019:2018-11663 with 5 genes.

                    XLOC_002673 - - XLOC_002673 - TSS2805,TSS2806 contig_135019:2018-11663
                    XLOC_002674 - - XLOC_002674 - TSS2807 contig_135019:2018-11663
                    XLOC_002675 - - XLOC_002675 - TSS2808 contig_135019:2018-11663
                    XLOC_002676 - - XLOC_002676 - TSS2809 contig_135019:2018-11663
                    XLOC_002677 - - XLOC_002677 - TSS2810 contig_135019:2018-11663

                    Any explanation ?

                    Many thanks,

                    P

                    Comment


                    • #11
                      Hi again,

                      I believe I understood why the number of genes reported is different at Cufflinks and Cuffmerge/Cuffdiff level.

                      I found that in many loci there are very few reads aligned (one, two…) and by this reason Cufflinks doesn’t report genes in these places. However, after merging the replicates ( after Cuffmerge) genes are reported.
                      In a specific case, a sample has 5 reads aligned across the 3 replicates (1,2,2), Cufflinks doesn’t report any gene but after Cuffmerge the gene is reported (based on these 5 reads).
                      Strangely, according with the manual:
                      #####
                      --min-frags-per-transfrag <int> Assembled transfrags supported by fewer than this many aligned RNA-Seq fragments are not reported. Default: 10.
                      #####
                      This means that the minimum number of fragments aligned to report a gene is 10 (20 reads) ??
                      In conclusion, I’d like to understand:
                      a) How many reads are necessary to generate a gene? Is it dependent of a minimum FPKM value?
                      b) Apparently, singletons (not properly paired reads) are not reported/considered by the Cufflinks pipeline. Is this true?
                      Thanks in advance,
                      P.

                      Comment


                      • #12
                        How do you investigate genes that can be repeated? I have same problem. The output shows few genes change comparing my G2 and S phase samples!

                        Comment


                        • #13
                          And also how do you list TSS that are present in same locus?

                          Thanks

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin


                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM
                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          30 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          32 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 09:21 AM
                          0 responses
                          28 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-04-2024, 09:00 AM
                          0 responses
                          53 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X