
  • .gtf or .gff file for TopHat and Cufflinks (and bowtie2)

    I am a newcomer to the RNA-seq field. I have been trying to get (differential) expression values from my experiments (rat and mouse). To achieve this I want to run TopHat and Cufflinks with annotated transcripts/genes (i.e. the -G option in TopHat). This requires a GTF or GFF file. I have searched different sources: iGenomes, the Seq_gene.md.gz files from NCBI, refFlat.txt.gz from UCSC and genes.GTF from Ensembl (the latter is missing mitochondrial RNA, for instance). However, these either have the wrong file extension or format, and/or the chromosome names do not match.

    Does anybody know what currently is the best place to obtain both a reference genome and annotated transcripts/genes (possibly even index files)?
    Or are there certain scripts I need to run to convert either my genome file or my GTF file to the correct format?

    The previous post on this seems to be somewhat outdated: http://seqanswers.com/forums/showthread.php?t=12694

    Thanks in advance,

    Rob

  • #2
    You have to get a gtf file that matches the genome you aligned to. For example, UCSC designates chromosome 1 as "chr1" and Ensembl designates it as "1". If they do not match, most programs will throw an error. You can either convert the labels or get a GTF file that matches the genome you aligned to.
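
    A minimal sketch of the label conversion mentioned above, assuming a tab-separated GTF with Ensembl-style names that should become UCSC-style names. The function name `ensembl_to_ucsc_gtf` is hypothetical, and the simple prefix rule is only expected to work for the main chromosomes and the mitochondrial genome; unplaced scaffolds are named differently by the two sources and would need a proper lookup table.

    ```shell
    # Sketch: rewrite Ensembl-style sequence names in column 1 of a GTF to UCSC style.
    # Assumption: the input is tab-separated; lines starting with '#' are headers.
    # ensembl_to_ucsc_gtf is a hypothetical helper name, not part of any tool.
    ensembl_to_ucsc_gtf() {
        awk 'BEGIN { FS = OFS = "\t" }
             /^#/ { print; next }              # pass header lines through untouched
             $1 == "MT" { $1 = "chrM" }        # Ensembl "MT" corresponds to UCSC "chrM"
             $1 !~ /^chr/ { $1 = "chr" $1 }    # "1" becomes "chr1", "X" becomes "chrX"
             { print }' "$1"
    }
    ```

    It could be used as `ensembl_to_ucsc_gtf genes.gtf > genes.ucsc.gtf`, where both file names are placeholders.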



    • #3
      The iGenomes site has (almost) all the things you are likely to need for a genome build (bwa, bowtie and bowtie2 indexes, GTF annotations, sequence). If the genome/build of your choice is not on that list, then that may be the only reason you would not be able to use iGenomes data (other than the one pointed out by pbluescript above).



      • #4
        iGenomes does indeed provide index files, the full genomes and a .gtf file (and Seq_gene.md.gz + refFlat.txt.gz). This is true for both my organisms of interest: rat and mouse. However when I checked the names in the first column of the gtf file, they do not match the index files.

        GTF: 'chr1' 'chr10' 'chr10_random' 'chr11' 'chr11_random' 'chr12'
        'chr12_random' 'chr13' 'chr13_random' 'chr14' 'chr14_random' 'chr15'
        'chr15_random' 'chr16' 'chr16_random' 'chr17' 'chr17_random' 'chr18'
        'chr18_random' 'chr19' 'chr19_random' 'chr1_random' 'chr2' 'chr20'
        'chr20_random' 'chr2_random' 'chr3' 'chr3_random' 'chr4' 'chr4_random'
        'chr5' 'chr5_random' 'chr6' 'chr6_random' 'chr7' 'chr7_random' 'chr8'
        'chr8_random' 'chr9' 'chr9_random' 'chrUn' 'chrUn_random' 'chrX'
        'chrX_random'

        Index files: chr10
        chr11
        chr12
        chr13
        chr14
        chr15
        chr16
        chr17
        chr18
        chr19
        chr1
        chr20
        chr2
        chr3
        chr4
        chr5
        chr6
        chr7
        chr8
        chr9
        chrM
        chrX

        As you can see, the mitochondrial data are missing from the GTF file.



        • #5
          If you truly need all the data that is missing from the files then you are going to have to modify some files/build your own.
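
          As one possible way of modifying the files, the sketch below keeps only the GTF records whose sequence name also appears in the reference. It assumes a hypothetical text file with one reference sequence name per line (for example taken from the FASTA headers); the function name `filter_gtf_by_names` is also made up for illustration. Note that this only removes unmatched records such as the `_random` entries. It cannot add annotation that is absent from the GTF, such as the mitochondrial genes.

          ```shell
          # Sketch: keep only GTF records whose sequence name appears in the reference.
          # Assumption: the first argument lists one reference sequence name per line
          # and is not empty; the second argument is a tab-separated GTF.
          filter_gtf_by_names() {
              names_file=$1
              gtf_file=$2
              awk 'BEGIN { FS = "\t" }
                   NR == FNR { keep[$1]; next }   # first file: remember reference names
                   /^#/ || ($1 in keep)           # second file: print headers and matches
                  ' "$names_file" "$gtf_file"
          }
          ```

          A call such as `filter_gtf_by_names names.txt genes.gtf > genes.filtered.gtf` (placeholder file names) would write the reduced annotation.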

          Have you looked at the mouse data available from JAX: ftp://ftp.informatics.jax.org/pub/re...index.html#seq Not sure if that is complete (just in terms of annotation).
          Last edited by GenoMax; 06-13-2013, 06:29 AM.



          • #6
            Ensembl has great annotations and genome references. You'll just need a translation between their chromosome names and those used by UCSC and the IGV browser if you use those tools. Use their "primary" genome and not the "toplevel" version.
            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
            Salk Institute for Biological Studies, La Jolla, CA, USA */



            • #7
              Thanks, I'll have a look at JAX.

              A colleague of mine just told me that the mitochondrial reference might be missing since the mitochondrial DNA can have many copies, and might therefore be difficult to work with. I don't know if this is necessarily true, but they might have intentionally left it out.

              PS. the data I reported are from rat



              • #8
                I've heard of people leaving those features out of their gene expression analysis simply because they are expressed at very high levels, which could skew normalized expression values such as FPKM. It makes more sense to me for everyone to use everything so that our expression values are slightly more comparable.
                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                Salk Institute for Biological Studies, La Jolla, CA, USA */
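
                The effect on normalization can be illustrated with the plain FPKM definition, FPKM = 10^9 * C / (N * L), where C is the number of fragments on the gene, N the total number of mapped fragments and L the gene length in bp. The numbers below are hypothetical, and this is only the textbook formula, not necessarily what Cufflinks computes with its own normalization options.

                ```shell
                # FPKM = 1e9 * fragments_on_gene / (total_mapped_fragments * gene_length_bp)
                # fpkm is a hypothetical helper for this worked example.
                fpkm() {
                    LC_ALL=C awk -v c="$1" -v n="$2" -v l="$3" \
                        'BEGIN { printf "%.3f\n", 1e9 * c / (n * l) }'
                }

                # Hypothetical numbers: 500 fragments on a 2000 bp gene, 20 million mapped
                # fragments in total, 4 million of them mitochondrial.
                fpkm 500 20000000 2000    # mitochondrial fragments counted: prints 12.500
                fpkm 500 16000000 2000    # mitochondrial fragments excluded: prints 15.625
                ```

                The same gene gets a different value depending on whether the highly expressed features are part of the denominator, so values from analyses that treat them differently are not directly comparable.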



                • #9
                  I got this same error. I am using Cufflinks with iGenomes. I used the Ensembl reference from the beginning of the TopHat run through Cufflinks and got the error. I then used the UCSC genome for cuffmerge and cuffdiff, but got the same error:
                  Warning: couldn't find fasta record for 'NT_166469'!
                  This contig will not be bias corrected.
                  Warning: couldn't find fasta record for 'X'!
                  This contig will not be bias corrected.
                  Warning: couldn't find fasta record for 'Y'!
                  This contig will not be bias corrected.

                  Can somebody explain this to me?



                  • #10
                    Your GTF and FASTA files do not have the same reference sequences. The issue is likely from combining Ensembl and UCSC. For example, in Ensembl, chromosome X is referred to as "X" while in UCSC, it is referred to as "chrX".
                    When you mix them up, the programs get confused. I would go back and stick with one source for the reference genome and annotation.



                    • #11
                      Thank you pbluescript
                      First, I used the Ensembl reference for everything (GTF, bowtie index, .fa) and got this error. Then I used UCSC and also got this error. So I think the source of the reference is not the problem... I am still not sure?




                      • #12
                        Originally posted by jp.:
                        Thank you pbluescript
                        First, I used the Ensembl reference for everything (GTF, bowtie index, .fa) and got this error. Then I used UCSC and also got this error. So I think the source of the reference is not the problem... I am still not sure?
                        If you got that error, then one of the files you used originally is either missing data or has data with names that do not match.

                        To get this information from your GTF files, you could use the command line.
                        Something like this:

                        Code:
                        grep -v "^#" your_file.gtf | cut -f1 | sort -u
                        For the FASTA file, use this command:

                        Code:
                        grep "^>" your_file.fasta | sed 's/^>//' | cut -d" " -f1 | sort -u
                        Compare the results to see which has different information. Sometimes these errors can be ignored, but not if one of the chromosomes it can't find is the X chromosome.
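
                        The comparison itself can also be scripted. The sketch below is one possible way to do it, assuming a tab-separated GTF and a FASTA file whose sequence name is the first word of each header line; `compare_seq_names` is a hypothetical helper name.

                        ```shell
                        # Sketch: report sequence names that occur in only one of the two files.
                        # compare_seq_names is a hypothetical helper; paths are passed as arguments.
                        compare_seq_names() {
                            gtf_file=$1
                            fasta_file=$2
                            tmpdir=$(mktemp -d)
                            # Column 1 of every non-comment GTF line, de-duplicated and sorted.
                            grep -v '^#' "$gtf_file" | cut -f1 | LC_ALL=C sort -u > "$tmpdir/gtf.names"
                            # FASTA headers with ">" and any description after the first space removed.
                            grep '^>' "$fasta_file" | sed 's/^>//' | cut -d' ' -f1 \
                                | LC_ALL=C sort -u > "$tmpdir/fasta.names"
                            echo "Only in GTF:"
                            LC_ALL=C comm -23 "$tmpdir/gtf.names" "$tmpdir/fasta.names"
                            echo "Only in FASTA:"
                            LC_ALL=C comm -13 "$tmpdir/gtf.names" "$tmpdir/fasta.names"
                            rm -r "$tmpdir"
                        }
                        ```

                        Running `compare_seq_names your_file.gtf your_file.fasta` lists the names that have no counterpart in the other file; any name under "Only in GTF" is a candidate for the "couldn't find fasta record" warning.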
