Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How do I get a genes.gtf file for RNA Seq Analysis

    Hello all,

    I am interested in using the TopHat Cufflinks workflow to analyze my RNA seq data.

    The first step's command line is:
    tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1.fq

    From the NCBI website, I can get a multi-fasta file of the coding sequence, but how can I get a genes.gtf?

    God bless,
    Jason

  • #2
    If you are working with a genome that is available from iGenomes site you will find the gtf file in the big tarball there. Also bundled in are various sequence and corresponding index files (bwa, bowtie etc) so you may want to stick with this for entire analysis.



    What genome are you working with?
    Last edited by GenoMax; 06-19-2013, 07:40 AM.

    Comment


    • #3
      I ended up converting the genbank file to a GTF one with the python script from the link below. It worked great! Thanks so much GenoMax for all your extremely helpful advice!!

      Comment


      • #4
        Well, I thought it worked...

        A GTF file was created and looks similar to the one from the example data (FruitFly). The first line of the head command for my gtf file gives:
        CP001666.1 gb2gtf gene 101 1453 . + . gene_id "<firstgene>"; transcript_id "<firsttranscript>"; gene_id "dnaA"
        <The gene_id and transcript_id were left out above.>
        Notice that the value of the first column was "CP001666.1", and gb2gtf is the program used to do the conversion.

        The first line of the head of the example .gtf file is:
        2L protein_coding exon 7529 8116 . + . exon_number "1"; gene_id "FBgn0031208"; gene_name "CG11023"; p_id "P9062"; transcript_id "FBtr0300689"; transcript_name "CG11023-RB"; tss_id "TSS8382";
        In this file, "2L" is the first column for each line of data. Here they use "protein_coding" instead of "gene".

        Now transitioning from the example .gtf file to the example .fa file.
        Notice that the example .fa file has a "2L" as the first line.
        jmwhitha@jmwhitha-OptiPlex-755:~/my_rnaseq_exp$ head genome.fa
        >2L
        CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCC
        ATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTT
        GATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCC
        GCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGC
        GAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCG
        CAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCT
        ATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGAC
        TGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAG
        CAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGAT

        My .fa file had a long string of identification so I manually changed it to CP001666.1.

        Why is this important?
        According to the TopHat manual, "the values in the first column of the provided GTF/GFF file (column which indicates the chromosome or contig on which the feature is located), must match the name of the reference sequence in the Bowtie index you are using with TopHat."
        So did I do that correctly? I'm not sure.

        What I know is that before I made that change the tophat command would give the following warning. Then, it would give an error and quit after a while of processing.

        tophat -p 8 -G <mygtf>.gtf -o 0111-01_1_thout genome *0111-01_1.fq
        # Bowtie version: 1.0.0.0
        #[2013-06-19 22:22:06] Checking for Samtools
        # Samtools version: 0.1.19.0
        #[2013-06-19 22:22:06] Checking for Bowtie index files
        #[2013-06-19 22:22:06] Checking for reference FASTA file
        #[2013-06-19 22:22:06] Generating SAM header for genome
        # format: fastq
        # quality scale: phred33 (default)
        #[2013-06-19 22:22:06] Reading known junctions from GTF file
        # Warning: TopHat did not find any junctions in GTF file

        Though this warning still showed up after I altered my .fa file, instead of quitting, it continued on with the following processing:

        #[2013-06-20 08:59:15] Preparing reads
        # left reads: min. length=100, max. length=100, 24223083 kept reads (10522 discarded)
        #[2013-06-20 09:08:52] Creating transcriptome data files..
        #[2013-06-20 09:08:52] Building Bowtie index from <myfasta>.fa
        #[2013-06-20 09:08:59] Mapping left_kept_reads to transcriptome <mybase> with Bowtie
        #[2013-06-20 10:14:48] Resuming TopHat pipeline with unmapped reads
        #[2013-06-20 10:14:48] Mapping left_kept_reads.m2g_um to genome genome with Bowtie
        #[2013-06-20 10:33:05] Mapping left_kept_reads.m2g_um_seg1 to genome genome with Bowtie (1/4)
        #[2013-06-20 10:37:18] Mapping left_kept_reads.m2g_um_seg2 to genome genome with Bowtie (2/4)
        #[2013-06-20 10:42:02] Mapping left_kept_reads.m2g_um_seg3 to genome genome with Bowtie (3/4)
        #[2013-06-20 10:46:34] Mapping left_kept_reads.m2g_um_seg4 to genome genome with Bowtie (4/4)
        #[2013-06-20 10:50:18] Searching for junctions via segment mapping
        #[2013-06-20 10:55:31] Retrieving sequences for splices
        #[2013-06-20 10:55:32] Indexing splices
        #[2013-06-20 10:55:35] Mapping left_kept_reads.m2g_um_seg1 to genome segment_juncs with Bowtie (1/4)
        #[2013-06-20 10:57:48] Mapping left_kept_reads.m2g_um_seg2 to genome segment_juncs with Bowtie (2/4)
        #[2013-06-20 11:00:36] Mapping left_kept_reads.m2g_um_seg3 to genome segment_juncs with Bowtie (3/4)
        #[2013-06-20 11:03:20] Mapping left_kept_reads.m2g_um_seg4 to genome segment_juncs with Bowtie (4/4)
        #[2013-06-20 11:05:31] Joining segment hits
        #[2013-06-20 11:07:08] Reporting output tracks
        #-----------------------------------------------
        #[2013-06-20 11:35:01] Run complete: 02:35:45 elapsed

        After this process was done, I had a new .thout file and a new transcriptone_data directory with known.3.ebwt, known.4.ebwt, known.fa, and known.gff. I was surprised there was no known.1.ebwt or known.2.ebwt. I'm not sure, but this may be the reason why I had trouble with the following step.

        According to the TopHat manual:
        "When providing TopHat with a known transcript file (-G/--GTF option above), a transcriptome sequence file is built and a Bowtie index has to be created for it in order to align the reads to the known transcripts. Creating this Bowtie index can be time consuming and in many cases the same transcriptome data is being used for aligning multiple samples with TopHat. A transcriptome index and the associated data files (the original GFF file) can be thus reused for multiple TopHat runs with this option, so these files are only created for the first run with a given set of transcripts. If multiple TopHat runs are planned with the same transcriptome data, TopHat should be first run with the -G option and with the --transcriptome-index option pointing to a directory and a name prefix which will indicate where the transcriptome data files will be stored. Then subsequent TopHat runs using the same --transcriptome-index option value will directly use the transcriptome data created in the first run (no -G option needed for subsequent runs).
        For example the first TopHat run could look like this:
        tophat -o out_sample1 -G known_genes.gtf \
        --transcriptome-index=transcriptome_data/known \
        hg19 sample1_1.fq.z
        In this example the first run will create the transcriptome_data directory if it doesn't exist, and files known.fa, known.gff and known.*ebwt (Bowtie index files) will be generated in that directory. Then for subsequent runs with the same genome and known transcripts but different reads (e.g. sample2_2.fq.z etc.), TopHat will no longer spend time building the transcriptome index because it can directly use the previously built transcriptome index, so the -G option can be discarded for subsequent runs (however using it again will not force TopHat to build the transcriptome index files again if they are already present)
        tophat -o out_sample2 --transcriptome-index=transcriptome_data/known hg19 sample2_1.fq.z"

        Therefore, instead of the command I used for the first one I tried:
        (original command) tophat -p 8 -G <mygtf>.gtf -o 0111-01_1_thout genome *0111-01_1.fq
        # <mygtf> is just a place holder
        (new one) tophat -p 8 --transcriptome-index=transcriptome_data/known -o 0111-02_1_thout genome *0112-01_1.fq
        (variant of the new one) tophat -p 8 transcriptome-index=transcriptome_data/known -o 0111-02_1_thout genome *0112-01_1.fq
        (variant of the new one) tophat -p 8 transcriptome_data/known -o 0111-02_1_thout genome *0112-01_1.fq
        (variant of the new one) tophat -p 8 -o 0111-02_1_thout transcriptome_data/known *0112-01_1.fq
        (variant of the new one) tophat -p 8 -o 0111-02_1_thout transcriptome-index=transcriptome_data/known *0112-01_1.fq
        (variant of the new one) tophat -p 8 -o 0111-02_1_thout ./transcriptome_data/known *0111-02_1.fq

        A similar error message keeps comes up:
        Error: Could not find Bowtie index files (transcriptome-index=transcriptome_data/known.*.ebwt)

        When I use the original command that worked, as far as I can tell, Bowtie is building another transcriptome_data directory (or writing over it). Each time it takes several hours.

        Any ideas on why this is happening?

        Thanks and God bless,
        Jason

        Comment


        • #5
          By the way all,

          I am working with a bacterial genome.

          Comment


          • #6
            Wait, I think I'm figuring out the problem...

            I noticed that when making a transcriptome index, TopHat only used "left reads". This might be why only a known.3.ebwt and a known.4.ebwt.

            When I searched known.3.ebwt and known.4.ebwt, I discovered the bowtie-build options. I had to use bowtie-build since I don't have ebwt files. But I should have used the -r option.

            "No reference indexes. Do not build the NAME.3.ebwt and NAME.4.ebwt portions of the index. Used only for paired-end alignment. [off]"

            This is probably why known.ebwt files were made. I am therefore going to redo the bowtie-build and start over.

            Let you know what happens!

            Comment


            • #7
              To be clear. I have single reads, not paired end.

              Comment


              • #8
                Did you use a fasta formatted genome sequence file for the bowtie-build?
                Last edited by GenoMax; 06-20-2013, 11:39 AM.

                Comment


                • #9
                  Yes, it is the whole genome fasta from NCBI (not multifasta coding sequence).

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Essential Discoveries and Tools in Epitranscriptomics
                    by seqadmin




                    The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                    04-22-2024, 07:01 AM
                  • seqadmin
                    Current Approaches to Protein Sequencing
                    by seqadmin


                    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                    04-04-2024, 04:25 PM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, Yesterday, 08:47 AM
                  0 responses
                  14 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-11-2024, 12:08 PM
                  0 responses
                  60 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 10:19 PM
                  0 responses
                  60 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 09:21 AM
                  0 responses
                  54 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X