  • tophat/cufflinks no gene names or annotations showing up

    Hi everyone,

    I am working on a top hat /cufflinks differential expression pipeline and after I run through the whole pipeline, the resulting gene_exp.diff file does not contain any gene names. Also, there are about 13000 records in the transcript file, but the resulting diff file only contains about 2000. The rest of the entries are all CUFF identifiers. Following is my pipeline, transcript file and diff output. Any help is appreciated.

    Tophat:
    Code:
    tophat -p 16 -r 175 --no-coverage-search -o $Path/run1/nacre/ --transcriptome-index=/transcriptome/ucsc/zv9_transcriptome /genomes/bwt2/danRer7 /fastq_files/nacre_R1_filtered.fastq /fastq_files/nacre_R2_filtered.fastq
    
    tophat -p 16 -r 175 --no-coverage-search -o $Path/run1/tub/ --transcriptome-index=/transcriptome/ucsc/zv9_transcriptome /genomes/bwt2/danRer7 /fastq_files/tub_R1_filtered.fastq /fastq_files/tub_R2_filtered.fastq
    Cufflinks:
    Code:
    nohup cufflinks -o $Path/run1/nacre/cuff1 -g /transcriptome/ucsc/zv9_transcriptome.gtf -p 16 $Path/run1/nacre/accepted_hits.bam
    
    nohup cufflinks -o $Path/run1/tub/cuff1 -g /transcriptome/ucsc/zv9_transcriptome.gtf -p 16 $Path/run1/tub/accepted_hits.bam
    Assembly1.txt file:
    Code:
    $path/tophat_run/full_test_runs/run1/nacre/cuff1/transcripts.gtf
    $path/tophat_run/full_test_runs/run1/tub/cuff1/transcripts.gtf
    Cuffmerge:
    Code:
    cuffmerge -o $path/run1/cuff_merge/cuff1 -g /scratchLocal/sac2026/transcriptome/ucsc/zv9_transcriptome.gtf -p 16 -s /genomes/bwt2/danRer7.fa $path/run1/assembly1.txt &
    CuffDiff:
    Code:
    cuffdiff -o $path/run1/cuff_diff/cuff1/ -L nacre,tub -p 8 $path/run1/cuff_merge/cuff1/transcripts.gtf $path/run1/nacre/accepted_hits.bam $path/run1/tub/accepted_hits.bam
    Transcript GTF file downloaded from UCSC (excerpt):
    Code:
    chr1    danRer7_refGene start_codon     50322025        50322027        0.000000        +       .       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene CDS     50322025        50322231        0.000000        +       0       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene exon    50321634        50322231        0.000000        +       .       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene CDS     50323685        50323751        0.000000        +       0       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene exon    50323685        50323751        0.000000        +       .       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene CDS     50327723        50327850        0.000000        +       2       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene exon    50327723        50327850        0.000000        +       .       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene CDS     50376642        50376774        0.000000        +       0       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene exon    50376642        50376774        0.000000        +       .       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene CDS     50384689        50384782        0.000000        +       2       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene exon    50384689        50384782        0.000000        +       .       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene CDS     50384996        50385109        0.000000        +       1       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene exon    50384996        50385109        0.000000        +       .       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene CDS     50387282        50387444        0.000000        +       1       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene exon    50387282        50387444        0.000000        +       .       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene CDS     50388022        50388129        0.000000        +       0       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene exon    50388022        50388129        0.000000        +       .       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene CDS     50392531        50392579        0.000000        +       0       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene exon    50392531        50392579        0.000000        +       .       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene CDS     50393548        50393579        0.000000        +       2       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene stop_codon      50393580        50393582        0.000000        +       .       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene exon    50393548        50393588        0.000000        +       .       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene exon    50409290        50410568        0.000000        +       .       gene_id "NM_131426"; transcript_id "NM_131426"; 
    chr1    danRer7_refGene stop_codon      58701201        58701203        0.000000        -       .       gene_id "NM_001110522"; transcript_id "NM_001110522"; 
    chr1    danRer7_refGene CDS     58701204        58701468        0.000000        -       1       gene_id "NM_001110522"; transcript_id "NM_001110522"; 
    chr1    danRer7_refGene exon    58701201        58701468        0.000000        -       .       gene_id "NM_001110522"; transcript_id "NM_001110522";
    Output gene_exp.diff file (excerpt):
    Code:
    CUFF.21460      CUFF.21460      -       chr15:42401169-42414185 nacre   tub     OK      0.30098 0.192342        -0.645988       0.93529 0.349639        0.999981        no
    CUFF.21461      CUFF.21461      -       chr15:42517544-42517876 nacre   tub     OK      0.303951        0.0349624       -3.11996        0.710738        0.477247        0.999981        no
    CUFF.21462      CUFF.21462      -       chr15:42593781-42597957 nacre   tub     OK      1.06523 1.85185 0.797809        -1.28757        0.197895        0.999981        no
    CUFF.21463      CUFF.21463      -       chr15:42567449-42568700 nacre   tub     NOTEST  0.0441381       0.0433716       -0.0252743      0.0151731       0.987894        1       no
    CUFF.21464      CUFF.21464      -       chr15:42572428-42593418 nacre   tub     OK      2.26891 18.0882 2.99498 -6.08449        1.1686e-09      1.9611e-06      yes
    CUFF.21465      CUFF.21465      -       chr15:42624106-42624606 nacre   tub     OK      2.78658 2.24085 -0.314451       0.375988        0.706925        0.999981        no
    CUFF.21466      CUFF.21466      -       chr15:41251756-41266370 nacre   tub     OK      0.819343        1.03169 0.332465        -0.386342       0.699243        0.999981        no
    CUFF.21467      CUFF.21467      -       chr15:41999382-42013139 nacre   tub     OK      0.13403 0.484079        1.85268 -1.61461        0.106394        0.999981        no
    CUFF.21468      CUFF.21468      -       chr15:42636714-42637489 nacre   tub     OK      0.245696        0.00871635      -4.81701        1.12025 0.262609        0.999981        no
    CUFF.21469      CUFF.21469      -       chr15:41251756-41266370 nacre   tub     OK      0.120829        0.186014        0.622448        -0.179106       0.857854        0.999981        no
    CUFF.2147       CUFF.2147       -       19:6835973-6925393      nacre   tub     NOTEST  0       0       0       0       1       1       no
    CUFF.21470      CUFF.21470      -       chr15:41999382-42013139 nacre   tub     NOTEST  0.0487298       0.0200532       -1.28098        0.244489        0.806852        1       no
    CUFF.21471      CUFF.21471      -       chr15:42663333-42663506 nacre   tub     OK      0.264006        23.4892 6.47528 -1.43214        0.152105        0.999981        no
    CUFF.21472      CUFF.21472      -       chr15:41478958-41496849 nacre   tub     OK      68.4197 60.2869 -0.182566       0.416749        0.676862        0.999981        no

    Some NM IDs do show up in the file, but as I said, only about 2000 of them out of about 13000. Some of the CUFF entries should be annotated, since the transcriptome contains them. For example, CUFF.21464 in the file above is the Tyr gene, which is very well annotated in UCSC, yet it shows up with a CUFF identifier. What am I doing wrong? How can I get this pipeline to include the gene names and other annotations?

    Please also feel free to comment on the pipeline. This is for zebrafish reads.

    Thank you in advance.

  • #2
    Me too

    I'm having the same problem and am somewhat surprised that a solution seems hard to find. I was thinking one option might be to match the entries on something else, such as chromosome location, to salvage the data. It is frustrating because Cufflinks takes so long to run, much longer than STAR.


    • #3
      Have you tried running a few of your BAM files from TopHat directly through cuffdiff with no local transcriptome assembly (i.e. skipping cufflinks and cuffmerge)? This might at least give you some data to look at while you sort out the Cufflinks problem. Also, what does your cuffmerge'd transcripts file look like? How many NM records are there?
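A rough way to answer the "how many NM records" question, sketched as a small shell helper. The function name count_transcript_ids is made up for illustration. It assumes a tab-separated GTF whose ninth column carries a transcript_id "..." attribute, and it treats IDs starting with NM_ or NR_ as reference records. A merged file may keep the original reference ID in a different attribute, so take the counts as a first look only.

```shell
# Count distinct transcript_id values in a GTF, split into
# RefSeq-style IDs (NM_/NR_ prefix, an assumption) and everything else.
count_transcript_ids() {
    awk -F'\t' '
        $0 !~ /^#/ && match($9, /transcript_id "[^"]+"/) {
            # "transcript_id \"" is 15 characters; drop it and the closing quote
            id = substr($9, RSTART + 15, RLENGTH - 16)
            if (!(id in seen)) {
                seen[id] = 1
                if (id ~ /^N[MR]_/) ref++; else other++
            }
        }
        END { printf "reference=%d other=%d\n", ref + 0, other + 0 }
    ' "$1"
}

# Usage, with the paths from the first post:
#   count_transcript_ids /transcriptome/ucsc/zv9_transcriptome.gtf
#   count_transcript_ids $path/run1/cuff_merge/cuff1/transcripts.gtf
#
# Skipping the assembly would mean handing the reference GTF straight to
# cuffdiff (the ref_only output directory is a hypothetical name):
#   cuffdiff -o $path/run1/cuff_diff/ref_only/ -L nacre,tub -p 8 \
#       /transcriptome/ucsc/zv9_transcriptome.gtf \
#       $path/run1/nacre/accepted_hits.bam $path/run1/tub/accepted_hits.bam
```

If the merged file reports far fewer reference IDs than the UCSC file, that points at the merge step rather than at cuffdiff.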

      I think if the goal is to get an idea of expression from known loci, you may be able to skip the de novo transcriptome assembly. The CUFF annotations are generated where new transcripts are found, and these may (as you say) be very similar to existing transcripts in the zebrafish GTF. You could use bedtools to rename your CUFF transcripts with the original names, based on percentage of overlap, shared strand, etc.
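A minimal sketch of the bedtools renaming idea, under some assumptions: both the CUFF loci and the reference transcripts have already been converted to BED6 (the file names cuff_genes.bed and ref_genes.bed are hypothetical), and the 50% overlap cutoff is arbitrary. The helper best_overlap_map is also an illustrative name. It post-processes the output of bedtools intersect and keeps, for each CUFF ID, the same-strand reference feature with the largest overlap.

```shell
# Expected input: the output of
#   bedtools intersect -a cuff_genes.bed -b ref_genes.bed -s -f 0.5 -wo
# With BED6 on both sides each line has 13 tab-separated columns:
# A fields in 1-6, B fields in 7-12, overlap in bp in column 13.
# Output: "<CUFF id><TAB><reference name>", one line per CUFF id,
# in order of first appearance.
best_overlap_map() {
    awk -F'\t' -v OFS='\t' '
        !($4 in best) || $13 + 0 > best[$4] {
            if (!($4 in best)) order[++n] = $4   # remember first-seen order
            best[$4] = $13 + 0                   # largest overlap so far
            name[$4] = $10                       # reference name for that overlap
        }
        END { for (i = 1; i <= n; i++) print order[i], name[order[i]] }
    ' "$1"
}

# Usage:
#   bedtools intersect -a cuff_genes.bed -b ref_genes.bed -s -f 0.5 -wo > hits.tsv
#   best_overlap_map hits.tsv > cuff_to_ref.tsv
```

The resulting two-column table can then be joined against the first column of gene_exp.diff. A CUFF locus that overlaps several reference transcripts gets only the best hit here, which may hide real ambiguity, so spot-check a few known genes such as the CUFF.21464 example.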
