Hi!
I am using the cufflinks/cuffmerge workflow to assemble transcripts for 12 libraries in total. I now want to extract nucleotide sequences from the “merged.gtf” file produced by cuffmerge in order to create a blastX database, so that I can easily find homologues of genes I am interested in. To do this I wanted to use the “gffread” tool that comes with the cufflinks suite. However, running it on the “merged.gtf” file does not produce anything, i.e. the output file is empty.
Here’s the command:
gffread merged.gff -g genome.fasta -w transcripts.fasta
However, using the GFF file that was fed into cuffmerge does produce a proper FASTA file:
gffread augustus.gff -g genome.fasta -w transcripts.fasta
I guess that the “merged.gtf” file is missing a “gene structure”, i.e. it has only exons annotated, although I did feed cuffmerge with the proper gff3 file:
MERGED.GTF
scaffold00001 Cufflinks exon 26 275 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "g1"; oId "CUFF.1.1"; nearest_ref "g1.t1"; class_code "="; tss_id "TSS1"; p_id "P1";
scaffold00001 Cufflinks exon 444 602 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; gene_name "g1"; oId "CUFF.1.1"; nearest_ref "g1.t1"; class_code "="; tss_id "TSS1"; p_id "P1";
scaffold00001 Cufflinks exon 874 1038 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "3"; gene_name "g1"; oId "CUFF.1.1"; nearest_ref "g1.t1"; class_code "="; tss_id "TSS1"; p_id "P1";
scaffold00001 Cufflinks exon 1285 2083 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "4"; gene_name "g1"; oId "CUFF.1.1"; nearest_ref "g1.t1"; class_code "="; tss_id "TSS1"; p_id "P1";
scaffold00001 Cufflinks exon 210 275 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000002"; exon_number "1"; gene_name "g1"; oId "g1.t1"; nearest_ref "g1.t1"; class_code "="; tss_id "TSS2"; p_id "P1";
scaffold00001 Cufflinks exon 444 602 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000002"; exon_number "2"; gene_name "g1"; oId "g1.t1"; nearest_ref "g1.t1"; class_code "="; tss_id "TSS2"; p_id "P1";
scaffold00001 Cufflinks exon 874 1038 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000002"; exon_number "3"; gene_name "g1"; oId "g1.t1"; nearest_ref "g1.t1"; class_code "="; tss_id "TSS2"; p_id "P1";
scaffold00001 Cufflinks exon 1285 1377 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000002"; exon_number "4"; gene_name "g1"; oId "g1.t1"; nearest_ref "g1.t1"; class_code "="; tss_id "TSS2"; p_id "P1";
AUGUSTUS.GFF
# Predicted genes for sequence number 1 on both strands
# start gene g1
scaffold00001 AUGUSTUS gene 1 1377 1 + . g1
scaffold00001 AUGUSTUS transcript 1 1377 1 + . g1.t1
scaffold00001 AUGUSTUS intron 1 209 1 + . transcript_id "g1.t1"; gene_id "g1";
scaffold00001 AUGUSTUS intron 276 443 1 + . transcript_id "g1.t1"; gene_id "g1";
scaffold00001 AUGUSTUS intron 603 873 1 + . transcript_id "g1.t1"; gene_id "g1";
scaffold00001 AUGUSTUS intron 1039 1284 1 + . transcript_id "g1.t1"; gene_id "g1";
scaffold00001 AUGUSTUS CDS 210 275 1 + 0 transcript_id "g1.t1"; gene_id "g1";
scaffold00001 AUGUSTUS CDS 444 602 1 + 0 transcript_id "g1.t1"; gene_id "g1";
scaffold00001 AUGUSTUS CDS 874 1038 1 + 0 transcript_id "g1.t1"; gene_id "g1";
scaffold00001 AUGUSTUS CDS 1285 1377 1 + 0 transcript_id "g1.t1"; gene_id "g1";
scaffold00001 AUGUSTUS stop_codon 1375 1377 . + 0 transcript_id "g1.t1"; gene_id "g1";
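To check my guess about the missing gene structure, the feature types (column 3) in each file can be tallied. The following is only a minimal sketch: the function name `count_feature_types` is made up for illustration, the sample lines are abbreviated copies of the records above, and it splits on any whitespace because the listings here are space-separated (real GTF/GFF files are tab-delimited).

```python
from collections import Counter


def count_feature_types(lines):
    """Tally column 3 (feature type) of GTF/GFF records, skipping comments."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment lines carry no feature
        fields = line.split(None, 8)  # assumption: whitespace-separated columns
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


# Abbreviated records taken from the two listings above.
merged_sample = [
    'scaffold00001 Cufflinks exon 26 275 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001";',
    'scaffold00001 Cufflinks exon 444 602 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001";',
]
augustus_sample = [
    "# start gene g1",
    "scaffold00001 AUGUSTUS gene 1 1377 1 + . g1",
    "scaffold00001 AUGUSTUS transcript 1 1377 1 + . g1.t1",
    'scaffold00001 AUGUSTUS CDS 210 275 1 + 0 transcript_id "g1.t1"; gene_id "g1";',
]

print("merged:", dict(count_feature_types(merged_sample)))
print("augustus:", dict(count_feature_types(augustus_sample)))
```

On the full files, passing an open file handle instead of the sample lists should show whether “merged.gtf” really contains nothing but exon records.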
Do you have any suggestions on how I can extract the transcript sequences (nucleotides) from the “merged.gtf” file? Do I have to manually add the gene and mRNA structures to the “merged.gtf” file?
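To make clear what output I am after, here is a rough sketch of the splicing step done by hand. It is not how gffread works internally, only an illustration under several assumptions: every record of interest is an exon line carrying a `transcript_id` attribute, coordinates are 1-based and inclusive, and the genome is already loaded into a dictionary of sequences. The function names and the toy genome and records are invented for the example.

```python
import re
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def spliced_transcripts(gtf_lines, genome):
    """Join the exon sequences of each transcript_id into one sequence."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split(None, 8)  # assumption: whitespace-separated
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records contribute to the transcript
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        exons[match.group(1)].append(
            (fields[0], int(fields[3]), int(fields[4]), fields[6])
        )

    sequences = {}
    for transcript_id, parts in exons.items():
        parts.sort(key=lambda part: part[1])  # order exons by start coordinate
        chrom, strand = parts[0][0], parts[0][3]
        # GTF coordinates are 1-based inclusive, Python slices are 0-based half-open.
        seq = "".join(genome[chrom][start - 1:end] for _, start, end, _ in parts)
        sequences[transcript_id] = reverse_complement(seq) if strand == "-" else seq
    return sequences


# Toy data, invented purely to exercise the function.
toy_genome = {"chr1": "AACCGGTTAC"}
toy_gtf = [
    'chr1 toy exon 2 4 . + . gene_id "G1"; transcript_id "T1";',
    'chr1 toy exon 7 9 . + . gene_id "G1"; transcript_id "T1";',
    'chr1 toy exon 5 6 . - . gene_id "G2"; transcript_id "T2";',
]
toy_result = spliced_transcripts(toy_gtf, toy_genome)
for transcript_id, seq in toy_result.items():
    print(f">{transcript_id}\n{seq}")
```

For a whole genome I would much rather use gffread than something like this, so any hint on why it returns an empty file is welcome.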
Thanks a lot in advance!
D.