
  • cufflinks errors of duplicates

    Hi,

    I searched for a while for my problem running cufflinks, sounds no answer yet.

    I run tophat + bowtie for RNA-seq data (single end read), and got the widetype .sam file plus treated .sam file. The -G GFF option was supplied for tophat, which file was converted from Danio rerio GTF file and downloaded from http://www.ensembl.org/info/data/ftp/index.html.

    Then I try to run cufflinks with the following command:

    [mMi@devaP Felipa]$ cufflinks -G /home/RNASeq/FishGenome/Danio_rerio_Zv8_57.gtf ./WT_accepted_hits.sam

    Counting hits in map
    Error: duplicate GFF ID 'ENSDART00000099599' (or exons too far apart)!


    #####################

    I cannot find the string 'ENSDART00000099599' in the WT_accepted_hits.sam file, so I wrote a Perl script to look for it in the Danio_rerio_Zv8_57.gtf file:

    mMi@mMi-Ubuntu:/A01 RNA-seq$ perl FindTargetRecord.pl
    18 protein_coding exon 16261480 16262025 . - . gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "1"; gene_name "zgc:162977"; transcript_name "zgc:162977";
    **
    18 protein_coding CDS 16261480 16262025 . - 0 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "1"; gene_name "zgc:162977"; transcript_name "zgc:162977"; protein_id "ENSDARP00000090373";
    **
    18 protein_coding start_codon 16262023 16262025 . - 0 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "1"; gene_name "zgc:162977"; transcript_name "zgc:162977";
    **
    18 protein_coding exon 14234408 14234520 . - . gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "2"; gene_name "zgc:162977"; transcript_name "zgc:162977";
    **
    18 protein_coding CDS 14234408 14234520 . - 0 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "2"; gene_name "zgc:162977"; transcript_name "zgc:162977"; protein_id "ENSDARP00000090373";
    **
    18 protein_coding exon 14234169 14234325 . - . gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "3"; gene_name "zgc:162977"; transcript_name "zgc:162977";
    **
    18 protein_coding CDS 14234169 14234325 . - 1 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "3"; gene_name "zgc:162977"; transcript_name "zgc:162977"; protein_id "ENSDARP00000090373";
    **
    18 protein_coding exon 14231851 14232003 . - . gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "4"; gene_name "zgc:162977"; transcript_name "zgc:162977";
    **
    18 protein_coding CDS 14231851 14232003 . - 0 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "4"; gene_name "zgc:162977"; transcript_name "zgc:162977"; protein_id "ENSDARP00000090373";
    **
    18 protein_coding exon 14223590 14224135 . - . gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "5"; gene_name "zgc:162977"; transcript_name "zgc:162977";
    **
    18 protein_coding CDS 14223593 14224135 . - 0 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "5"; gene_name "zgc:162977"; transcript_name "zgc:162977"; protein_id "ENSDARP00000090373";
    **
    18 protein_coding stop_codon 14223590 14223592 . - 0 gene_id "ENSDARG00000068779"; transcript_id "ENSDART00000099599"; exon_number "5"; gene_name "zgc:162977"; transcript_name "zgc:162977";
    **

    Does this mean I need to delete some lines from the reference GTF file?

    #################

    Then I removed the -G option:

    [mMi@devaP Felipa]$ cufflinks WT_accepted_hits.sam

    Now it runs fine and produces the .gtf, genes.expr and transcripts.expr files, but all IDs are Cufflinks-generated IDs (cuffID), not gene or transcript IDs.

    #####################

    Any suggestions for sorting this out?

    Cheers

  • #2
    It looks like there is an abnormally large intron there, over 2 Mb long, between the 1st and 2nd exons of that transcript.
    Removing that transcript from your reference annotation file (yes, deleting all lines mentioning ENSDART00000099599) should solve the problem.
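
For an annotation with many such transcripts, the check described above could be scripted rather than done by hand. The following is only a rough sketch in Python, not part of Cufflinks: the function names are made up for illustration, it assumes a standard tab-separated GTF (the lines quoted in this thread appear space-separated only because of how they were pasted), and the 1,000,000 bp cutoff in the usage comment is an arbitrary assumption, not the limit Cufflinks applies internally.

```python
import re
from collections import defaultdict

# Matches the transcript_id attribute in column 9 of a GTF line.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def max_intron_per_transcript(gtf_lines):
    """Return {transcript_id: longest gap (bp) between consecutive exons}."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            exons[match.group(1)].append((int(fields[3]), int(fields[4])))

    longest = {}
    for tid, spans in exons.items():
        spans.sort()  # order exons by start coordinate
        gaps = [nxt[0] - cur[1] - 1 for cur, nxt in zip(spans, spans[1:])]
        longest[tid] = max(gaps, default=0)  # single-exon transcripts get 0
    return longest


def drop_transcripts(gtf_lines, bad_ids):
    """Return the lines that do not mention any transcript in bad_ids."""
    kept = []
    for line in gtf_lines:
        match = TRANSCRIPT_ID.search(line)
        if match and match.group(1) in bad_ids:
            continue
        kept.append(line)
    return kept


# Hypothetical usage (threshold is an assumption, adjust as needed):
# with open("Danio_rerio_Zv8_57.gtf") as handle:
#     lines = handle.readlines()
# longest = max_intron_per_transcript(lines)
# bad = {tid for tid, gap in longest.items() if gap > 1_000_000}
# with open("filtered.gtf", "w") as out:
#     out.writelines(drop_transcripts(lines, bad))
```

For the transcript quoted in the first post, the gap between exon 2 (ending at 14234520) and exon 1 (starting at 16261480) comes out at 2026959 bp, which matches the "over 2 Mb" intron noted above. Note that this sketch groups exons by transcript ID only, so an ID reused on another chromosome would also show up as a large gap.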



    • #3
      Thanks, gpertea and others. I have tried modifying the genome GTF file from Ensembl (deleting all lines mentioning ENSDART00000099599), but other duplicated IDs are then reported, and there are too many to delete manually. In addition, the raw SAM file was generated by TopHat, and I sorted it again, but Cufflinks still reports:

      "Processing bundle [ chr1:1203-1254 ] with 1 non-redundant alignments".

      Can anyone working with human RNA-seq data suggest which reference GTF file should be used here?

      Cheers



      • #4
        What I did was to change the names of the duplicated genes to ENSGxxxxxxxxxxx_dup1 in the GFF file I downloaded from Ensembl for the human genome.

        Once you have no records with the same name but in different positions you should be able to run Cufflinks without any problems.
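
The renaming described above could be automated along these lines. This is a minimal sketch, not the script actually used: `rename_duplicates` is a hypothetical helper, it assumes a tab-separated GTF, and it takes "different positions" to mean a different chromosome or strand, which is an assumption. The first location seen keeps the original ID, and later locations get `_dup1`, `_dup2`, and so on.

```python
import re


def rename_duplicates(gtf_lines, attribute="transcript_id"):
    """Append _dupN to IDs that reappear on another chromosome or strand.

    `attribute` is the GTF attribute to rename, e.g. "transcript_id" or
    "gene_id". Lines without that attribute are passed through unchanged.
    """
    pattern = re.compile(re.escape(attribute) + r' "([^"]+)"')
    seen = {}  # ID -> list of (seqname, strand), in order of first appearance
    out = []
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        match = pattern.search(fields[8]) if len(fields) >= 9 else None
        if line.startswith("#") or match is None:
            out.append(line)
            continue

        ident = match.group(1)
        location = (fields[0], fields[6])
        locations = seen.setdefault(ident, [])
        if location not in locations:
            locations.append(location)
        index = locations.index(location)

        if index > 0:  # second or later location: rename this record
            new_attr = '%s "%s_dup%d"' % (attribute, ident, index)
            fields[8] = pattern.sub(lambda m: new_attr, fields[8], count=1)
            line = "\t".join(fields) + "\n"
        out.append(line)
    return out


# Hypothetical usage:
# with open("annotation.gtf") as handle:
#     fixed = rename_duplicates(handle.readlines(), attribute="transcript_id")
# with open("annotation.dedup.gtf", "w") as out:
#     out.writelines(fixed)
```

This does not catch an ID reused at two distant places on the same chromosome and strand; for that case a check on exon spacing would be needed as well.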

        Cheers
