Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • TopHat and the GFF3 file

    Hello all,

    I've been trying to make TopHat align some reads to the known human exome fore some time now, with no success so far. In the web manual, they state that with the -G option you can supply a reference gff3 file for the program to map the reads to. I've found it impossible to find a human exome gff3 file, so I proceeded to download the hg18 exome gtf file from the UCSC browser, and tried to make it into a gff3 file, since the formats are very close already.

    The UCSC gtf file reads as
    Code:
    chr1	hg18_knownGene	exon	24476	25037	0.000000	-	.	gene_id "uc001aak.1"; transcript_id "uc001aak.1"; 
    chr1	hg18_knownGene	exon	25140	25344	0.000000	-	.	gene_id "uc001aak.1"; transcript_id "uc001aak.1"; 
    chr1	hg18_knownGene	exon	25584	25940	0.000000	-	.	gene_id "uc001aak.1"; transcript_id "uc001aak.1"; 
    chr1	hg18_knownGene	exon	55425	55436	0.000000	+	.	gene_id "uc009vjh.1"; transcript_id "uc009vjh.1"; 
    chr1	hg18_knownGene	exon	55752	55834	0.000000	+	.	gene_id "uc009vjh.1"; transcript_id "uc009vjh.1"; 
    chr1	hg18_knownGene	start_codon	58954	58956	0.000000	+	.	gene_id "uc009vjh.1"; transcript_id "uc009vjh.1"; 
    chr1	hg18_knownGene	CDS	58954	59691	0.000000	+	0	gene_id "uc009vjh.1"; transcript_id "uc009vjh.1"; 
    chr1	hg18_knownGene	exon	58900	59692	0.000000	+	.	gene_id "uc009vjh.1"; transcript_id "uc009vjh.1"; 
    chr1	hg18_knownGene	start_codon	58954	58956	0.000000	+	.	gene_id "uc001aal.1"; transcript_id "uc001aal.1"; 
    chr1	hg18_knownGene	CDS	58954	59871	0.000000	+	0	gene_id "uc001aal.1"; transcript_id "uc001aal.1"; 
    chr1	hg18_knownGene	exon	58954	59871	0.000000	+	.	gene_id "uc001aal.1"; transcript_id "uc001aal.1"; 
    chr1	hg18_knownGene	exon	124780	126351	0.000000	-	.	gene_id "uc009vjj.1"; transcript_id "uc009vjj.1";
    From the gff3 format specification found in the Sequence Ontology webpage, one can see that the main difference resides in the last column, the ID tags. Following their indications, as well as the TopHat requirements as stated in the manual, I've manually converted the gtf file to this hopefully-gff3 format:

    Code:
    chr1	hg18_knownGene	gene	24476	25940	.	-	.	ID=uc001aak.1
    chr1	hg18_knownGene	mRNA	24476	25940	.	-	.	ID=uc001aak.1_mRNA;Parent=uc001aak.1
    chr1	hg18_knownGene	exon	24476	25037	.	-	.	ID=exon0001_uc001aak.1;Parent=uc001aak.1_mRNA
    chr1	hg18_knownGene	exon	25140	25344	.	-	.	ID=exon0002_uc001aak.1;Parent=uc001aak.1_mRNA
    chr1	hg18_knownGene	exon	25584	25940	.	-	.	ID=exon0003_uc001aak.1;Parent=uc001aak.1_mRNA
    chr1	hg18_knownGene	gene	58954	59871	.	+	.	ID=uc001aal.1
    chr1	hg18_knownGene	mRNA	58954	59871	.	+	.	ID=uc001aal.1_mRNA;Parent=uc001aal.1
    chr1	hg18_knownGene	exon	58954	59871	.	+	.	ID=exon0001_uc001aal.1;Parent=uc001aal.1_mRNA
    chr1	hg18_knownGene	gene	127587	128558	.	-	.	ID=uc001aam.2
    chr1	hg18_knownGene	mRNA	127587	128558	.	-	.	ID=uc001aam.2_mRNA;Parent=uc001aam.2
    chr1	hg18_knownGene	exon	127587	128558	.	-	.	ID=exon0001_uc001aam.2;Parent=uc001aam.2_mRNA
    which I thought was correct. However, whenever I try to make an alignment using this pseudo-gff3 file, TopHat does not undesrtand it and the alignment does not proceed correctly:
    Code:
    $ tophat --solexa-quals -p 6 --coverage-search -o tophat_exome --no-novel-juncs -G ~/UCSC_genome.gff3 ~/bowtie-0.11.3/indexes/hg18 RNA_sequence.fastq
    
    [Mon Jan 11 11:30:24 2010] Beginning TopHat run (v1.0.12)
    -----------------------------------------------
    [Mon Jan 11 11:30:24 2010] Preparing output location tophat_exome/
    [Mon Jan 11 11:30:24 2010] Checking for Bowtie index files
    [Mon Jan 11 11:30:24 2010] Checking for reference FASTA file
    [Mon Jan 11 11:30:24 2010] Checking for Bowtie
            Bowtie version:          0.11.3.0
    [Mon Jan 11 11:30:24 2010] Checking reads
            seed length:     35bp
            format:          fastq
            quality scale:   --solexa-quals
    [Mon Jan 11 11:31:58 2010] Reading known junctions from GFF file
            Warning: TopHat did not find any junctions in GFF file
    [Mon Jan 11 11:32:22 2010] Mapping reads against hg18 with Bowtie
    [Mon Jan 11 11:39:06 2010] Joining segment hits
    Warning: junction database is empty!
    [Mon Jan 11 11:40:26 2010] Joining segment hits
    [Mon Jan 11 11:41:38 2010] Reporting output tracks
    -----------------------------------------------
    Run complete [00:15:34 elapsed]
    Once this is solved, the other question that remains is if TopHat needs the gff3 file to be zero or one-based; I think UCSC works with zero-based coordinates but the web-manual does not specify which convention does TopHat use, and in splice-junction finding, that single bp is what really makes or breaks the deal.

    So, if anyone has any expertise on this or has an idea on why this is not working at all, I'm very interested to know.


    Thanks a lot!
    Last edited by Ender985; 01-11-2010, 07:42 AM.

  • #2
    I am having the exact same problem and, in fact, was doing a search on the forum on this topic.

    Does anyone have a working, tested UCSC hg18 or hg19 GFF3 file that they could pass along to me or post somewhere?

    One problem with searching for the terms "GFF" or "GFT" on this forum is that the words are too short to be included as a search term

    Thanks,
    Sam Darko

    Comment


    • #3
      have you thought of trying the cufflinks module?

      Getting GFF3-formatted files, or constructing them from gtfs, is a real pain. Been there, done that, for the mouse genome. Have you considered using Cufflinks, from the authors that produced Tophat? (Disclosure: I'm not at all affiliated with them but their tools have worked very well for me!) While the software is advertised as being of most use for isoform quantitation / splicing junction work, you'll find that you can feed Cufflinks a gtf file (like those you already have) and it will annotate your reads against genes and their separate transcripts...

      For mouse work, I downloaded a gtf file from Ensembl and have been using it with no issues with Cufflinks.

      Your workflow becomes:
      - run Tophat as per usual, and without the -G switch
      - take the accepted_hits.sam file that Tophat produces, and give it to cufflinks together with your gtf annotation file
      - look in the genes.expr and transcripts.expr files for your annotated RPKM data

      Comment


      • #4
        Sorry I cannot give you an obvious answer, but maybe you can try to use Ensembl gtf format files:
        ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/
        Then change the chromosome name:
        1--> chr1
        MT-->chrM
        And then use gtf2gff3.pl to transform the gtf format into gff3 format.
        This works well for me.

        By the way, I think the version of start site(0 based or 1 based) is dependent on the reference genome you use.

        The tophat code is looks like this:
        tophat -r 50 --mate-std-dev 20 -a 8 -m 0 -I 1000000 -p 4 -g 100 -o ~/gene/labs/phase16/result/tophat/human/s_1/withgff3_r50 --solexa1.3-quals --coverage-search --microexon-search --segment-mismatches 2 --segment-length 25 --max-segment-intron 1000000 -G ~/share/database/ensembl/Ensembl56/GRCh37/GTF/GFF3/v1/Homo_sapiens.GRCh37.56_v1.gff3 ~/share/database/tophat/index/hg19 ~/data/Processed_Solexa/Long_Solexa_Seq/s_1_1.fq ~/data/Processed_Solexa/Long_Solexa_Seq/s_1_2.fq
        The gff3 files looks like this:
        ##gff-version 3
        chr7 protein_coding gene 143701022 143702068 . + . ID=ENSG00000221813; NAME=OR6B1;
        chr7 protein_coding mRNA 143701022 143702068 . + . ID=ENST00000408922; NAME=OR6B1-001; PARENT=ENSG00000221813;
        chr7 protein_coding five_prime_utr 143701022 143701089 . + . ID=five_prime_utr:ENST00000408922:1; PARENT=ENST00000408922;
        chr7 protein_coding exon 143701022 143702068 . + . ID=exon:ENST00000408922:1; PARENT=ENST00000408922;
        Last edited by Gangcai; 01-13-2010, 05:58 PM.

        Comment


        • #5
          Glad to see I'm not the only one with the problem!

          @sjm: I've been using Cufflinks already for some data analysis, but in this case I was trying to directly map my reads against the human transcriptome (hoping to find more splice junctions), rather than mapping them to the whole genome and looking at the transcriptome afterwards, like you suggest. I've already done that, I was interested in taking this alternative approach and seeing how different the results would be.

          @Gangcai: I'll try to convert the Ensembl gtf file as you suggest and hopefully it will work, even though I can not see any differences between the gff3 file you posted and the one I produced.

          Thanks for the suggestions people, keep them coming; I'll keep this thread updated if I find a working solution.

          Comment


          • #6
            Okay, I ended up using the solution posted in this thread. That is, I downloaded the gtf from Ensembl and converted it for use with UCSC reference using the program in the thread. I tested it yesterday and it looked good.

            @sjm: I'm also going to try your approach in the future. Thanks so much for the suggestion!

            Also a I want to give a general thanks to the good people of the forum who are happy to help strangers in need

            Cheers,
            Sam
            Last edited by sdarko; 01-14-2010, 08:07 AM.

            Comment


            • #7
              Hi sjm,

              Do you know that how to specify Tophat produce accepted_hits.sam?
              After I run Tophat, why it only generate accepted_hits.bam
              Thanks for advice.

              Comment


              • #8
                Hi sjm,

                Do you know that how to specify Tophat produce accepted_hits.sam?
                After I run Tophat, why it only generate accepted_hits.bam
                Thanks for advice.

                Comment


                • #9
                  Tophat bam and sam files

                  I could be wrong, but my understanding is that the current version of tophat only produces files in .bam format. However, it's easy enough to convert back later - you already have the samtools package installed, otherwise tophat wouldn't work, so look in the help for samtools and use the command to convert .bam back to .sam :

                  samtools view input.bam > output.sam
                  Last edited by sjm; 05-27-2011, 01:36 PM.

                  Comment


                  • #10
                    Thanks sjm
                    In general, should we need to include the "-h" option as well?
                    eg.
                    Code:
                    samtools view -h -o input.bam output.sam
                    My purpose is wanna to use the *.sam as an input for Cufflink after then.
                    In order to run Cufflink in default, I just not sure weather I needed to include header or not.

                    Do you familiar about Cufflink as well?
                    I'm using the latest version of Cufflink 1.0.1 and run the following command"
                    Code:
                    Cufflink/cufflinks-1.0.1.Linux_x86_64/cufflinks -p 4 -G Homo_sapiens.GRCh37.62.gtf --library-type fr-unstranded tophat_out/accepted_hits.bam 
                    Cufflink/cufflinks-1.0.1.Linux_x86_64/cufflinks: /lib64/libz.so.1: no version information available (required by Cufflink/cufflinks-1.0.1.Linux_x86_64/cufflinks)
                    Warning: Could not connect to update server to verify current version. Please check at the Cufflinks website (http://cufflinks.cbcb.umd.edu).
                    [19:26:23] Loading reference annotation.
                    [19:26:37] Inspecting reads and determining fragment length distribution.
                    Processed 33455 loci.                        [*************************] 100%
                    Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges.  It is recommended that correct paramaters (--frag-len-mean and --frag-len-std-dev) be provided.
                    Map Properties:
                           Total Map Mass: 77344272.10
                           Read Type: [B]0bp single-end[/B]
                           Fragment Length Distribution: Truncated Gaussian (default)
                                         Default Mean: 200
                                      Default Std Dev: 80
                    [19:32:30] Estimating transcript abundances.
                    Processed 33455 loci.                        [*************************] 100%
                    My input file is 2X50bp, paired-end read.
                    I'm not sure why Cufflink will detect it as "0bp single-end".
                    Do you have any idea regarding the above error message?
                    If I'm using the "-g" option, the progress worked fine but the result is the same as the default Cufflink (without -g and Homo_sapiens.GRCh37.62.gtf in the command) running result
                    Code:
                    Cufflink/cufflinks-1.0.1.Linux_x86_64/cufflinks -p 4 -g Homo_sapiens.GRCh37.62.gtf --library-type fr-unstranded tophat_out/accepted_hits.bam
                    I was thinking just to assemble back the annotate transcript in my sample.
                    Thanks for any advice to improve the progress.

                    Comment


                    • #11
                      Hi Gangcai,

                      Do you know how to calculate "--mate-std-dev" of RNA-seq data?
                      My input file is 2X50bp, sequencing by Illumina Hiseq, library insert size is 210bp, adaptor sequence is 100bp.
                      Apart frm that, do you know how to calculate "--frag-len-mean and --frag-len-std-dev" as well?
                      I'm facing the problem as shown in thread above.
                      Thanks.

                      Comment


                      • #12
                        paired-end data in correct format?

                        @edge: are you sure that your 2x50 bp paired-end data are in the correct format for Cufflinks, i.e. are you choosing the right --library-type option? Sorry I can't be more helpful, as I haven't used paired-end data with Cufflinks data yet. I don't think converting between .bam and .sam is going to help you much if you don't have the right --library-type option set.

                        Comment


                        • #13
                          Hi sjm,

                          I believe my input data should be in correct format.
                          When I type the following command:
                          Code:
                          Cufflink/cufflinks-1.0.1.Linux_x86_64/cufflinks -p 4 -g Homo_sapiens.GRCh37.62.gtf --library-type fr-unstranded tophat_out/accepted_hits.bam
                          
                          [10:44:16] Loading reference annotation.
                          [10:44:31] Inspecting reads and determining fragment length distribution.
                          > Processed 217311 loci.                       [*************************] 100%
                          > Map Properties:
                          >       Total Map Mass: 77344272.10
                          >       Read Type: 50bp x 50bp
                          >       Fragment Length Distribution: Empirical (learned)
                          >                     Estimated Mean: 181.46
                          >                  Estimated Std Dev: 32.00
                          [11:07:40] Assembling transcripts and estimating abundances.
                          > Processed 217311 loci.                       [*************************] 100%
                          The above command shown my input data is 50bp X 50bp.
                          Thus I not sure why "-g" work fined while "-G" got error

                          Comment


                          • #14
                            Hi All,

                            I just wanted to add one point: the old versions of Cullinks worked with only the gtf format and the latest version works with the both gff/gtf format. Check the cufflinks manual for details.

                            Thank you,
                            Douglas

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Current Approaches to Protein Sequencing
                              by seqadmin


                              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                              04-04-2024, 04:25 PM
                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, 04-11-2024, 12:08 PM
                            0 responses
                            27 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-10-2024, 10:19 PM
                            0 responses
                            31 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-10-2024, 09:21 AM
                            0 responses
                            27 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 04-04-2024, 09:00 AM
                            0 responses
                            52 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X