  • Robin
    Member
    • Nov 2009
    • 10

    Running a cufflinks, cuffcompare and cuffdiff workflow

    Hello All:

    I am not sure whether I am running cufflinks, cuffcompare and cuffdiff, and using their output files, correctly.
    Here are my two workflows:
    1) Run cufflinks with no gene model (de novo):
    cufflinks -p 4 -m 149 accepted_hits_s1.sam
    cufflinks -p 4 -m 149 accepted_hits_s3.sam
    2) Run cuffcompare using the UCSC annotation GTF file:
    cuffcompare -r refseqGene.gtf -R -s /indexes/ transcripts_s1.gtf transcripts_s3.gtf
    3) Run cuffdiff using combined.gtf from the cuffcompare output:
    cuffdiff -p 4 -m 149 combined.gtf accepted_hits_s1.sam accepted_hits_s3.sam
    4) The cuffdiff output file 0_1_splicing.diff contains about 75 records with test_stat=OK

    Workflow two, with a known gene model (refseqGene.gtf):
    1) Run cufflinks with the annotation supplied through -G:
    cufflinks -p 4 -m 149 -G refseqGene.gtf accepted_hits_s1.sam
    cufflinks -p 4 -m 149 -G refseqGene.gtf accepted_hits_s3.sam
    2) Run cuffcompare using the UCSC annotation GTF file:
    cuffcompare -r refseqGene.gtf -R -s /indexes/ transcripts_s1.gtf transcripts_s3.gtf
    3) Run cuffdiff using combined.gtf from the cuffcompare output:
    cuffdiff -p 4 -m 149 combined.gtf accepted_hits_s1.sam accepted_hits_s3.sam
    4) The cuffdiff output file 0_1_splicing.diff contains about 445 records with test_stat=OK

    I expected more records from the first workflow than from the second, because the first runs cufflinks de novo, without any annotation GTF file. Instead I get about 75 records in the splicing.diff file from the de novo run (workflow one) and about 445 records from the known gene model run (workflow two).
    Am I doing any step wrong in these two workflows?

    Thanks for any comments!
    R
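To reproduce the record counts quoted above (about 75 versus 445), here is a minimal Python sketch that tallies the status values in a cuffdiff .diff file. It assumes the file is tab-delimited with a header row and that the status column is called `status`. The function name `count_status` and that column name are assumptions for illustration, so check the header line of your own file and pass the right name.

```python
import csv
from collections import Counter


def count_status(lines, status_column="status"):
    """Tally the values found in the status column of a tab-delimited table.

    `lines` is any iterable of text lines, for example an open file handle.
    The default column name is an assumption; pass the name your file uses.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    if reader.fieldnames is None or status_column not in reader.fieldnames:
        raise ValueError(f"column {status_column!r} not found in header")
    return Counter(row[status_column] for row in reader)


# Example use on a real file (path taken from the post above):
# with open("0_1_splicing.diff", newline="") as handle:
#     print(count_status(handle)["OK"])
```

Running it on the 0_1_splicing.diff file from each workflow gives the two numbers to compare, along with how many records fell into the other status values.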
  • thinkRNA
    Member
    • Jan 2010
    • 94

    #2
    When you ran cufflinks without the -G option, did you see it detect expression of more novel transcripts than when you ran it with -G?
    If the answer is no, then that explains your results.

    I am interested in knowing how cufflinks assembles novel transcripts. Does it look for ORFs and splice sites when assembling them? It is important to know this if we want to trust its "novel" transcripts.

    I am also interested in exploring tophat/cufflinks/cuffcompare workflow parameters, so here are a few other "exploratory" parameters.

    I am using 75bp reads and want to determine differential expression between 2 states:
    1) tophat's -g parameter (--max-multihits) is set by default to 40, which means reads hitting up to 40 places in the genome are kept. I think this is way too high and I bring it down to 10. I would also like to know how my results would differ if -g were set to 1. Has anyone tried this?
    2) I also increase tophat's -a parameter from 8 to 10, which means a read must overlap each side of a splice site by at least 10 bases for the junction to be considered. I wonder whether, for 75bp reads, 10 is strict enough.
    3) tophat calls bowtie without its "--best" parameter. I am interested in calling bowtie (from tophat) with the --best option to see if the alignments differ. Does anyone have any thoughts on this?
    4) I know for certain that the last 15 bases of my reads have bad quality overall. Should I trim these before aligning them? Does trimming make a difference, or do bowtie/tophat take the quality scores into consideration?
    5) Cufflinks' -Q parameter is set to 0 by default. This is a critical parameter, since it lets cufflinks take the alignment quality into consideration. I don't know what a good value to try would be.
    6) Your ideas?
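On point 4, if you do decide to trim, removing a fixed number of bases from the 3' end is simple to do before alignment. Below is a minimal Python sketch, assuming plain four-line FASTQ records (no wrapped sequence lines). The function name `trim_fastq_3prime` is made up for illustration, and this says nothing about how bowtie or tophat treat quality scores internally.

```python
def trim_fastq_3prime(lines, n_bases=15):
    """Yield FASTQ lines with the last n_bases cut from each read.

    Assumes four lines per record: header, sequence, '+' line, qualities.
    The sequence and quality lines are trimmed by the same amount so they
    stay the same length. Reads shorter than n_bases end up empty.
    """
    for index, line in enumerate(lines):
        text = line.rstrip("\n")
        # Positions 1 and 3 within each record are sequence and qualities.
        if index % 4 in (1, 3) and n_bases > 0:
            text = text[:-n_bases]
        yield text + "\n"


# Example use (file names are placeholders):
# with open("reads.fastq") as src, open("reads.trimmed.fastq", "w") as dst:
#     dst.writelines(trim_fastq_3prime(src, n_bases=15))
```

Aligning both the trimmed and untrimmed reads with otherwise identical settings and comparing the number of aligned reads and junctions is one way to see whether trimming matters for a given data set.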

    I think an important discussion would be: how does one evaluate these runs? What are some plots or statistics that can give an idea of whether the program is doing a good job? Some ideas:
    1) If you know certain genes will be down/up in your experimental conditions, check them.
    2) Evaluate tophat's junctions.bed to determine how splice site support changes with different parameters.
    3) If you know of certain "problem" genes that have complicated isoform expression or belong to paralogous gene families, evaluate whether reads are being correctly aligned to them and whether splicing changes are being detected.
    4) Look at the uniformity of coverage within a gene. There should be genes that have even coverage through all exons.
    5) If you have replicate sample runs, you should theoretically see no differences in gene expression between them using your workflow. A dot plot of FPKM between replicates should be more or less a diagonal line. Of course, the same plot between your control and treated samples should exhibit anomalies.
    6) Any other ideas that have helped you?
    Last edited by thinkRNA; 05-28-2010, 02:23 PM.
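The replicate check in point 5 can also be summarised as a single number. Here is a minimal Python sketch that computes the Pearson correlation of log-transformed FPKM values between two runs. It assumes you have already extracted two lists of FPKM values in the same gene order. The function name `log_fpkm_correlation` and the pseudocount of 1 are illustrative choices, not anything cufflinks prescribes.

```python
import math


def log_fpkm_correlation(fpkm_a, fpkm_b, pseudocount=1.0):
    """Pearson correlation of log2(FPKM + pseudocount) between two samples.

    Both lists must hold values for the same genes in the same order.
    The pseudocount avoids taking the log of zero for unexpressed genes.
    """
    if len(fpkm_a) != len(fpkm_b):
        raise ValueError("both samples must have the same number of genes")
    if len(fpkm_a) < 2:
        raise ValueError("need at least two genes")
    xs = [math.log2(value + pseudocount) for value in fpkm_a]
    ys = [math.log2(value + pseudocount) for value in fpkm_b]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant values")
    return cov / math.sqrt(var_x * var_y)
```

Replicates should give a value close to 1, while a control versus treated comparison would be expected to come out lower. The plot is still worth looking at, since a single correlation value can hide a small set of strongly deviating genes.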

    • gtb
      Junior Member
      • May 2010
      • 5

      #3
      Robin,
      You say the refseqGene.gtf file contains the UCSC annotations. Where exactly did you get the refseqGene.gtf file?

      • dariober
        Senior Member
        • May 2010
        • 311

        #4
        Originally posted by thinkRNA
        3) tophat calls bowtie without its "--best" parameter. I am interested in calling bowtie (from tophat) with the --best option to see if the alignments differ. Does anyone have any thoughts on this?
        See if this thread can give you some help http://seqanswers.com/forums/showthr...=tophat+strata

        Dario

        • pinki999
          Member
          • Oct 2010
          • 37

          #5
          Hi ThinkRNA,

          Did you figure out how TopHat deals with this situation?

          4) I know for certain that the last 15 bases of my reads have bad quality overall. Should I trim these before aligning them? Does trimming make a difference, or do bowtie/tophat take the quality scores into consideration?

          • IBseq
            Member
            • Jul 2012
            • 56

            #6
            refseqGene.gtf file

            Originally posted by gtb
            Robin,
            You say the refseqGene.gtf file contains the UCSC annotations. Where exactly did you get the refseqGene.gtf file?
            In Galaxy, go to Get Data, then UCSC Main, and set your parameters (clade, genome, assembly). Select group "Genes and Gene Prediction Tracks" and track "RefSeq Genes", set the output format to GTF, and send the output to Galaxy. It will appear as a new job in your history.
