  • Differential gene expression analysis

    I'm trying to establish the best differential gene expression analysis for my experiment: 2 genotypes, 2 experimental conditions, 3 biological replicates, and 25 million reads per sample (sequenced RNA-seq libraries).
    I'm currently using TopHat and Cufflinks, following the protocol published by Trapnell:
    Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks

    I'm working with a well-annotated model organism, Arabidopsis thaliana.

    I have two goals:

    First: look for differential expression in the already annotated transcriptome (TAIR10).
    Second: I'm interested in the possibility that previously unannotated genes are differentially expressed between the two genotypes in one of the experimental conditions.
    I have a protocol in mind, but I would like to draw on the expertise of this community:

    Remember: two genotypes, two experimental conditions, and triplicates mean 12 independent libraries (25 million reads each).

    FIRST PROTOCOL FOR DIFFERENTIAL EXPRESSION:
    1) Run TopHat on each library.
    2) Merge all the libraries in a single Cufflinks assembly (do I need to include TAIR10.gtf?).
    3) Use the final assembly from step 2 together with the 12 accepted_hits files from step 1 in cuffdiff.
    4) Use cuffcompare to identify the locations of new genes.

    How can I automatically extract the new genes that are also differentially expressed?

    I would appreciate feedback on the protocol I have in mind and answers to my questions.
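
One possible way to approach the extraction question is sketched below. It assumes that the merged annotation (for example merged.gtf from Cuffmerge run with a reference GTF) marks unannotated intergenic transfrags with class_code "u", and that cuffdiff was run on that same merged annotation so that gene_exp.diff uses the same XLOC gene IDs. The function name novel_de_genes and the file names are illustrative, not part of the Cufflinks suite.

```shell
#!/bin/sh
# Sketch: list genes that are both novel (class_code "u") and significant.
# Assumed inputs: a merged GTF with class_code attributes, and the
# gene_exp.diff table written by cuffdiff for that same GTF.

novel_de_genes() {
    # $1 = merged GTF, $2 = gene_exp.diff
    awk -F'\t' '
        # First file (GTF): remember gene_ids of transfrags with class_code "u"
        NR == FNR {
            if ($9 ~ /class_code "u"/ && match($9, /gene_id "[^"]+"/)) {
                id = substr($9, RSTART + 9, RLENGTH - 10)
                novel[id] = 1
            }
            next
        }
        # Second file (gene_exp.diff): keep the header line
        FNR == 1 { print; next }
        # Keep rows whose gene_id is novel and whose last column is "yes"
        ($2 in novel) && $NF == "yes"
    ' "$1" "$2"
}
```

A call such as `novel_de_genes merged.gtf cuffdiff_results/gene_exp.diff > novel_de_genes.tsv` would then write the header plus the matching rows. A gene is treated as novel here if any of its transfrags carries class code "u", which is a simplification.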

  • #2
    What you are describing is exactly what the Cuffmerge program from the Cufflinks suite does. It takes the .gtf files from a set of Cufflinks runs and merges them into a single, non-redundant set of transfrags. You may optionally provide a GTF file containing a set of already known transcripts, and the output mapping file will classify the transfrags as known, novel, etc. See the class code documentation in the manual.

    Now, if I may offer a perspective as one who has done a lot of RNA-Seq analysis in Arabidopsis: the A. thaliana genome has been analyzed and annotated to death. The odds of finding a new gene are very, very slim, and it is probably not worth spending time on unless you have time to waste. I can guarantee that you will find groups of reads mapping to intergenic space, but I would also bet large amounts of money that these will not be new genes. They most likely arise from mis-mapping or spurious transcriptional events.

    • #3
      no gene/ transcript discovery

      Assuming that, as you said, the Arabidopsis genome is very well annotated, do I need to run Cufflinks?
      Is it better in any way to perform the analysis by just combining TopHat, the reference transcriptome, and cuffdiff?
      Do you suggest running TopHat with "no gene/transcript discovery"?




      • #4
        Originally posted by colaneri View Post
        Assuming that, as you said, the Arabidopsis genome is very well annotated, do I need to run Cufflinks?
        No.
        Is it better in any way to perform the analysis by just combining TopHat, the reference transcriptome, and cuffdiff?
        Yes. It's faster because you are skipping an unnecessary step, and the IDs used for the cuffdiff analysis will be the normal TAIR AT IDs instead of Cufflinks transfrag IDs (XLOCs), which you would then need to correlate to their TAIR IDs.
        Do you suggest running TopHat with "no gene/transcript discovery"?
        That's what I normally do.
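
As a rough illustration of the reference-only workflow described above, the sketch below prints (rather than runs) one TopHat command per library, using the -G and --no-novel-juncs options together with the annotation and index paths that appear elsewhere in this thread. The sample names, the .fastq file names, and the helper tophat_cmd are assumptions made for the example.

```shell
#!/bin/sh
# Dry-run sketch: print one TopHat command per library, restricted to
# junctions from the reference annotation. Nothing is executed.
GTF=/proj/seq/data/TAIR10_Ensembl/Annotation/Archives/archive-2013-03-06-09-54-25/Genes/genes.gtf
INDEX=/proj/seq/data/TAIR10_Ensembl/Sequence/Bowtie2Index/genome

tophat_cmd() {
    # $1 = sample name (hypothetical), $2 = FASTQ file (hypothetical)
    printf 'tophat -p 4 --no-novel-juncs -G %s -o %s.tophat_out %s %s\n' \
        "$GTF" "$1" "$INDEX" "$2"
}

# Example sample names only; replace with the real library names.
for sample in C_ctrl_rep1 C_ABA_rep1; do
    tophat_cmd "$sample" "$sample.fastq"
done
```

The same genes.gtf would then be passed directly to cuffdiff along with the resulting accepted_hits.bam files, so the results carry TAIR gene IDs rather than XLOC IDs.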

        • #5
          Originally posted by colaneri
          I want to use TopHat in Galaxy with the parameter --no-novel-juncs.

          How can I set this parameter?
          Sorry, I don't use Galaxy so can't help you there.

          • #6
            I think this was solved elsewhere

            HTH

            • #7
              Cuffdiff did not perform promoter preference tests

              I ran cuffdiff 2 using this command:

              cuffdiff -p 4 -c 4 --no-update-check /proj/seq/data/TAIR10_Ensembl/Annotation/Archives/archive-2013-03-06-09-54-25/Genes/genes.gtf -o cuffdiff_results ./C_ctrl_rep1_Trim37.tophat_out/accepted_hits.bam ./C_ABA_rep1_Trim40.tophat_out/accepted_hits.bam

              Even though the run completed successfully, I received the output shown below. My question is: why were no tests performed for promoter preference or splicing? Is that the default behavior?

              Performed 27529 isoform-level transcription difference tests
              Performed 25428 tss-level transcription difference tests
              Performed 21976 gene-level transcription difference tests
              Performed 24788 CDS-level transcription difference tests
              Performed 0 splicing tests
              Performed 0 promoter preference tests
              Performing 0 relative CDS output tests

              Writing isoform-level FPKM tracking
              Writing TSS group-level FPKM tracking
              Writing gene-level FPKM tracking
              Writing CDS-level FPKM tracking
              Writing isoform-level count tracking
              Writing TSS group-level count tracking
              Writing gene-level count tracking
              Writing CDS-level count tracking
              Writing isoform-level read group tracking
              Writing TSS group-level read group tracking
              Writing gene-level read group tracking
              Writing CDS-level read group tracking
              Writing read group info
              Writing run info

              • #8
                Concatenating files before TopHat?

                Hi,
                I have an RNA-seq library that has been sequenced multiple times, so I have four FASTQ files.
                Do I need to concatenate them before aligning with TopHat?
                Can I just list the four files at the end of the tophat command, like this?

                If my files are fastq1, fastq2, fastq3 and fastq4,

                and I run:

                tophat -p 4 --segment-length 20 --no-novel-juncs -G /proj/seq/data/TAIR10_Ensembl/Annotation/Archives/archive-2013-03-06-09-54-25/Genes/genes.gtf -o C_ctrl_rep1_THout_6 /proj/seq/data/TAIR10_Ensembl/Sequence/Bowtie2Index/genome fastq1 fastq2 fastq3 fastq4
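
For reference, TopHat's usage line describes the reads argument as a comma-separated list of files, with a second, space-separated list reserved for the mates of paired-end reads. A minimal sketch of the single-sample case, assuming all four files are single-end reads from the same library, is shown below. It only prints the command; join_by_comma is a helper defined here for the example.

```shell
#!/bin/sh
# Sketch: pass several FASTQ files for ONE sample as a single
# comma-separated argument. The command is printed, not executed.
GTF=/proj/seq/data/TAIR10_Ensembl/Annotation/Archives/archive-2013-03-06-09-54-25/Genes/genes.gtf
INDEX=/proj/seq/data/TAIR10_Ensembl/Sequence/Bowtie2Index/genome

# Join all arguments with commas (runs in a subshell so IFS is not leaked).
join_by_comma() ( IFS=,; printf '%s\n' "$*" )

reads=$(join_by_comma fastq1 fastq2 fastq3 fastq4)

printf 'tophat -p 4 --segment-length 20 --no-novel-juncs -G %s -o C_ctrl_rep1_THout_6 %s %s\n' \
    "$GTF" "$INDEX" "$reads"
```

Concatenating the four files into one with cat before alignment should give TopHat the same set of reads; the comma-separated form simply avoids writing an extra copy.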

                • #9
                  Selecting the appropriate range to trim

                  Most of my sequenced RNA-seq libraries look like this in the FastQC report (please see the images below).
                  Do I need to trim the first 10 bases?
                  Only the first 10?
                  Is it going to improve the results?

                  [Images of the FastQC report not preserved]

                  • #10
                    Defining replicates and different conditions in Cuffdiff2

                    Hi everyone,
                    I'm trying to get cuffdiff 2 to compare RNA-seq data from
                    2 different genotypes in two different conditions, with 3 biological replicates for each genotype in each condition.
                    So I have 12 different libraries, which I aligned separately with TopHat.
                    My problem is running cuffdiff from the command line: I cannot get it to work the way I would like, and I do not know what I'm doing wrong. PLEASE, SOME HELP HERE, GUYS!!!

                    I ran cuffdiff with this command:
                    cuffdiff -p 8 -c 20 --no-update-check /proj/seq/data/TAIR10_Ensembl/Annotation/Archives/archive-2013-03-06-09-54-25/Genes/genes.gtf -o cuffdiff_ABA_whole_set_results_week \
                    ./C_ctrl_rep1_abaexp.tophat_out/accepted_hits.bam, ./C_ctrl_rep2_abaexp.tophat_out/accepted_hits.bam, ./C_ctrl_rep3_abaexp.tophat_out/accepted_hits.bam \
                    ./C_ABA_rep1_abaexp.tophat_out/accepted_hits.bam, ./C_ABA_rep2_abaexp.tophat_out/accepted_hits.bam, ./C_ABA_rep3_abaexp.tophat_out/accepted_hits.bam \
                    ./B_ctrl_rep1_abaexp.tophat_out/accepted_hits.bam, ./B_ctrl_rep2_abaexp.tophat_out/accepted_hits.bam, ./B_ctrl_rep3_abaexp.tophat_out/accepted_hits.bam \
                    ./B_ABA_rep1_abaexp.tophat_out/accepted_hits.bam, ./B_ABA_rep2_abaexp.tophat_out/accepted_hits.bam, ./B_ABA_rep3_abaexp.tophat_out/accepted_hits.bam


                    BUT THE RESULT IS THAT ALL THE FILES ARE COMPARED AGAINST EACH OTHER, so all samples are treated as different conditions instead of 4 groups with triplicates.

                    CAN SOMEONE TELL ME WHAT IS WRONG WITH MY COMMAND LINE?

                    • #11
                      Originally posted by colaneri View Post
                      Hi everyone,
                      I'm trying to get cuffdiff 2 to compare RNA-seq data from
                      2 different genotypes in two different conditions, with 3 biological replicates for each genotype in each condition.
                      So I have 12 different libraries, which I aligned separately with TopHat.
                      My problem is running cuffdiff from the command line: I cannot get it to work the way I would like, and I do not know what I'm doing wrong. PLEASE, SOME HELP HERE, GUYS!!!

                      I ran cuffdiff with this command:
                      Code:
                      cuffdiff -p 8 -c 20 --no-update-check /proj/seq/data/TAIR10_Ensembl/Annotation/Archives/archive-2013-03-06-09-54-25/Genes/genes.gtf -o cuffdiff_ABA_whole_set_results_week \
                      ./C_ctrl_rep1_abaexp.tophat_out/accepted_hits.bam, ./C_ctrl_rep2_abaexp.tophat_out/accepted_hits.bam, ./C_ctrl_rep3_abaexp.tophat_out/accepted_hits.bam \
                      ./C_ABA_rep1_abaexp.tophat_out/accepted_hits.bam, ./C_ABA_rep2_abaexp.tophat_out/accepted_hits.bam, ./C_ABA_rep3_abaexp.tophat_out/accepted_hits.bam \
                      ./B_ctrl_rep1_abaexp.tophat_out/accepted_hits.bam, ./B_ctrl_rep2_abaexp.tophat_out/accepted_hits.bam, ./B_ctrl_rep3_abaexp.tophat_out/accepted_hits.bam \
                      ./B_ABA_rep1_abaexp.tophat_out/accepted_hits.bam, ./B_ABA_rep2_abaexp.tophat_out/accepted_hits.bam, ./B_ABA_rep3_abaexp.tophat_out/accepted_hits.bam

                      BUT THE RESULT IS THAT ALL THE FILES ARE COMPARED AGAINST EACH OTHER, so all samples are treated as different conditions instead of 4 groups with triplicates.

                      CAN SOMEONE TELL ME WHAT IS WRONG WITH MY COMMAND LINE?
                      You have spaces after the commas in your command line. The list of BAM files for your biological replicates should be separated by commas WITHOUT SPACES, with spaces only between the different condition groups. Since you put a space after every BAM file name, cuffdiff interpreted them as twelve conditions.

                      BTW if you are posting blocks of command text or code please use the CODE tag formatting as I have done with your text above. It makes reading lines of code or output much easier and thus easier to spot the errors.
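
To make the fix concrete, here is a sketch that assembles the same command with the replicates of each condition joined by commas (no spaces) and the four condition groups separated by spaces. It prints the command instead of running it. The helpers join_by_comma and bams_for are defined only for this example, and the options are placed before the GTF file, following the cuffdiff usage line.

```shell
#!/bin/sh
# Sketch: build the cuffdiff command with four condition groups of
# three replicates each. The command is printed, not executed.
GTF=/proj/seq/data/TAIR10_Ensembl/Annotation/Archives/archive-2013-03-06-09-54-25/Genes/genes.gtf

# Join all arguments with commas (subshell keeps the IFS change local).
join_by_comma() ( IFS=,; printf '%s\n' "$*" )

# Print the three replicate BAM paths of one condition as one argument.
bams_for() {
    join_by_comma \
        "./$1_rep1_abaexp.tophat_out/accepted_hits.bam" \
        "./$1_rep2_abaexp.tophat_out/accepted_hits.bam" \
        "./$1_rep3_abaexp.tophat_out/accepted_hits.bam"
}

cmd="cuffdiff -p 8 -c 20 --no-update-check -o cuffdiff_ABA_whole_set_results_week $GTF $(bams_for C_ctrl) $(bams_for C_ABA) $(bams_for B_ctrl) $(bams_for B_ABA)"
printf '%s\n' "$cmd"
```

With this layout cuffdiff should see four space-separated sample arguments, each containing three comma-separated BAM files.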

                      • #12
                        Naming samples in cuffdiff

                        Thank you very much, kmcarr!

                        By the way, when I use the -L option to name the different samples,
                        do I also have to separate the names with commas without spaces?
                        In terms of the names, do I need to use exactly the same name as the BAM file each one refers to? Or is it just the order of the names after the -L option that matters?

                        • #13
                          Originally posted by colaneri View Post
                          when I use the -L option to name the different samples, do I also have to separate them with commas without spaces?
                          Yes.

                          Originally posted by colaneri View Post
                          In terms of the names, do I need to use exactly the same name as the BAM file each one refers to? Or is it just the order of the names after the -L option that matters?
                          It's only the order that matters.
                          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                          Salk Institute for Biological Studies, La Jolla, CA, USA */
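
A small sketch of how the labels line up with the sample groups: the Nth label given to -L names the Nth space-separated group of BAM files, whatever the files are called. The label names and the shortened BAM file names below are placeholders chosen for the example, and the two counting helpers are only a sanity check, not part of cuffdiff.

```shell
#!/bin/sh
# Sketch: labels are matched to condition groups by position only.
# Label and BAM names are placeholders; the command is printed, not run.
labels="C_ctrl,C_ABA,B_ctrl,B_ABA"
groups="cc1.bam,cc2.bam,cc3.bam ca1.bam,ca2.bam,ca3.bam bc1.bam,bc2.bam,bc3.bam ba1.bam,ba2.bam,ba3.bam"

# Count comma-separated labels (subshell keeps the IFS change local).
count_labels() ( IFS=,; set -- $1; echo "$#" )
# Count space-separated groups; call with the variable UNQUOTED.
count_groups() { echo "$#"; }

if [ "$(count_labels "$labels")" -ne "$(count_groups $groups)" ]; then
    echo "number of labels and number of condition groups differ" >&2
else
    printf 'cuffdiff -L %s genes.gtf %s\n' "$labels" "$groups"
fi
```

Here the first label, C_ctrl, would be attached to cc1.bam, cc2.bam and cc3.bam simply because both come first.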

                          • #14
                            I do not understand this TopHat error

                            I have a FASTQ file that I aligned with TopHat v1.3 (on a Galaxy server) without any problem, but when I align the same FASTQ file with TopHat 2 on the command line I get this error.

                            Can you please explain why this happens and what it means?

                            This is the output (the error is at the bottom):

                            [2013-06-13 00:43:09] Checking for Bowtie
                            Bowtie version: 2.1.0.0
                            [2013-06-13 00:43:09] Checking for Samtools
                            Samtools version: 0.1.19.0
                            [2013-06-13 00:43:09] Checking for Bowtie index files
                            [2013-06-13 00:43:09] Checking for reference FASTA file
                            [2013-06-13 00:43:09] Generating SAM header for /proj/seq/data/TAIR10_Ensembl/Sequence/Bowtie2Index/genome
                            format: fastq
                            quality scale: phred33 (default)
                            [2013-06-13 00:43:12] Reading known junctions from GTF file
                            [2013-06-13 00:43:16] Preparing reads
                            [FAILED]
                            Error running 'prep_reads'
                            Error: qual length (19) differs from seq length (41) for fastq record !

                            • #15
                              What it means to me is that the quality string and the read string for at least one of the reads in your FASTQ file are not the same length. It doesn't explain why it worked on Galaxy, though.
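
One way to locate such a record is sketched below. It assumes the common four-lines-per-record FASTQ layout (header, sequence, "+" line, qualities), so it would not work on files with wrapped sequence lines. The function name check_fastq is made up for this example.

```shell
#!/bin/sh
# Sketch: report FASTQ records whose quality string length differs from
# the sequence length. Assumes exactly four lines per record.
check_fastq() {
    awk '
        NR % 4 == 1 { name = $0 }            # header line
        NR % 4 == 2 { seqlen = length($0) }  # sequence line
        NR % 4 == 0 && length($0) != seqlen {
            printf "record %d (%s): seq length %d, qual length %d\n", \
                NR / 4, name, seqlen, length($0)
            bad++
        }
        END {
            if (NR % 4 != 0) {
                print "file ends in the middle of a record (" NR " lines)"
                bad++
            }
            exit (bad > 0)   # non-zero exit status if any problem was found
        }
    ' "$1"
}
```

Running `check_fastq reads.fastq` prints each offending record and returns a non-zero status if any were found. If the only problem turns out to be at the very end of the file, that might point to a copy that was cut short, though that is a guess rather than something the error message confirms.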