  • How to run Tophat2 with GRCh38?

    Hi,

    This is a very simple question that I'm hopeful someone has resolved already.

    How does one run Tophat2 with GRCh38?

    I've downloaded the reference genome from Ensembl.
    I've indexed the reference genome with bowtie2-build.

    The problem is that bowtie2-build generates large index files with the extension .bt2l, which TopHat does not recognize.

    What should I do?
    Would an older version of Bowtie2 allow me to generate bt2 files?

    Someone must have resolved this problem.
    iGenomes does not yet provide indexes for GRCh38.
    I'm happy with Tophat, and don't want to switch to STAR, although I find this issue annoying and perplexing.

    The problem has been reported in the Tuxedo user group, but no solution has been provided.


    TopHat v2.0.12
    Bowtie2 version 2.2.3

    Error: Could not find Bowtie 2 index files (/stockage/genomes/Homo_sapiens/Ensembl/GRCh38/Sequence/Bowtie2Index/Homo_sapiens.GRCh38.dna.toplevel.*.bt2)

    Thank you for your help.

  • #2
    Although the TopHat web page no longer says so explicitly (it was last mentioned for v2.0.11), TopHat most likely still does not support 64-bit (large) Bowtie 2 indexes. I think that is what you have generated.

    According to the manual, bowtie2-build should generate normal (small) indexes when the reference is under 4 gigabases. I am not sure why you are getting large indexes.
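
A quick way to confirm which kind of index was built is to look at the file extensions: judging from the error message and the original post, small (32-bit) index files end in .bt2 and large (64-bit) ones in .bt2l. A minimal sketch, assuming the index was written next to the given basename; the function name index_kind is made up for illustration, and it only looks at the first index file:

```shell
# Report whether a Bowtie 2 index basename points at a small (.bt2)
# or a large (.bt2l) index. index_kind is a hypothetical helper name.
index_kind() {
    base="$1"
    if [ -e "${base}.1.bt2" ]; then
        echo "small"
    elif [ -e "${base}.1.bt2l" ]; then
        echo "large"
    else
        echo "missing"
    fi
}

# Example call (path is a placeholder):
# index_kind /path/to/Bowtie2Index/Homo_sapiens.GRCh38.dna.toplevel
```

If this prints "large", the TopHat error above is expected, since TopHat is looking for *.bt2 files.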
    Last edited by GenoMax; 10-07-2014, 07:58 AM.



    • #3
      How do I generate the smaller (32-bit) index files?

      There is no option in bowtie2-build.

      I used the following simple command to generate the index files.

      Code:
      bowtie2-build Homo_sapiens.GRCh38.dna.toplevel.fa Homo_sapiens.GRCh38.dna.toplevel \
      &> bowtie2_build.sh.log
      Code:
      Bowtie 2 version 2.2.3 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)
      Usage: bowtie2-build [options]* <reference_in> <bt2_index_base>
          reference_in            comma-separated list of files with ref sequences
          bt2_index_base          write bt2 data to files with this dir/basename
      *** Bowtie 2 indexes work only with v2 (not v1).  Likewise for v1 indexes. ***
      Options:
          -f                      reference files are Fasta (default)
          -c                      reference sequences given on cmd line (as
                                  <reference_in>)
          --large-index           force generated index to be 'large', even if ref
                                  has fewer than 4 billion nucleotides
          -a/--noauto             disable automatic -p/--bmax/--dcv memory-fitting
          -p/--packed             use packed strings internally; slower, less memory
          --bmax <int>            max bucket sz for blockwise suffix-array builder
          --bmaxdivn <int>        max bucket sz as divisor of ref len (default: 4)
          --dcv <int>             diff-cover period for blockwise (default: 1024)
          --nodc                  disable diff-cover (algorithm becomes quadratic)
          -r/--noref              don't build .3/.4 index files
          -3/--justref            just build .3/.4 index files
          -o/--offrate <int>      SA is sampled every 2^<int> BWT chars (default: 5)
          -t/--ftabchars <int>    # of chars consumed in initial lookup (default: 10)
          --seed <int>            seed for random number generator
          -q/--quiet              verbose output (for debugging)
          -h/--help               print detailed description of tool and its options
          --usage                 print this usage message
          --version               print version information and quit
      Last edited by blancha; 10-07-2014, 08:07 AM.



      • #4
        Originally posted by blancha
        How do I generate the smaller (32-bit) index files?

        bowtie2-build is really just a small wrapper script that calls either bowtie2-build-s (for 'small' genomes) or bowtie2-build-l (for 'large' ones). While it is not recommended, you could try calling bowtie2-build-s directly, e.g.
        Code:
        bowtie2-build-s Homo_sapiens.GRCh38.dna.toplevel.fa Homo_sapiens.GRCh38.dna.toplevel \
        &> bowtie2_build.sh.log
        I do not know if this will work for GRCh38.



        • #5
          Homo_sapiens.GRCh38.dna.toplevel.fa from Ensembl is 36 GB in size. It appears to contain alternate haplotypes for a number of locations/scaffolds in addition to the chromosomes. No wonder bowtie2-build is producing large indexes.

          I am going to see if I can find a link for just the chromosomes.
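
To check a reference before indexing, one can count its sequence characters and compare the total with the 4 billion nucleotide cutoff quoted in the bowtie2-build help text. A rough sketch, assuming an uncompressed FASTA file; count_bases and expected_index are hypothetical helper names, N padding is counted like any other base, and how the bowtie2-build wrapper actually makes its decision is not confirmed here:

```shell
# Count sequence characters in a FASTA file: everything except header
# lines (starting with '>') and line endings. Hypothetical helper name.
count_bases() {
    grep -v '^>' "$1" | tr -d '\n\r' | wc -c | tr -d ' '
}

# Guess which index type to expect, using the 4 billion nucleotide
# cutoff from the help text (an assumption about the real behaviour).
expected_index() {
    n=$(count_bases "$1")
    if [ "$n" -lt 4000000000 ]; then
        echo "small"
    else
        echo "large"
    fi
}

# Example call (file name taken from the thread):
# expected_index Homo_sapiens.GRCh38.dna.toplevel.fa
```

Counting a file of this size takes a while, but it shows directly whether N-padded haplotype and patch regions push the reference over the cutoff.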



          • #6
            @GenoMax, @kmcarr

            Thank you both for your help.

            I've downloaded Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz, which excludes haplotypes and patches.
            bowtie2-build built the smaller bt2 index files on this file.
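
The steps described above can be sketched as follows. This is only an outline under stated assumptions: the reads file and output directory are placeholders, the run and build_and_align names are made up for illustration, and with DRY_RUN=1 (the default here) the commands are printed rather than executed:

```shell
# Basename of the primary assembly downloaded from Ensembl.
GENOME=Homo_sapiens.GRCh38.dna.primary_assembly
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Decompress the reference, build the index, then align with TopHat.
build_and_align() {
    reads="$1"     # placeholder FASTQ file
    outdir="$2"    # placeholder TopHat output directory
    run gunzip "${GENOME}.fa.gz"
    run bowtie2-build "${GENOME}.fa" "${GENOME}"
    run tophat2 -o "$outdir" "${GENOME}" "$reads"
}

# Example call (arguments are placeholders):
# build_and_align reads.fastq tophat_out
```

Setting DRY_RUN=0 runs the same three commands for real, provided gunzip, bowtie2-build and tophat2 are on the PATH.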

            Since I was interested in novel transcript discovery in addition to gene expression quantification, I wanted to use the most complete genome version available, so I was using Homo_sapiens.GRCh38.dna.toplevel.fa.gz. In hindsight, Homo_sapiens.GRCh38.dna.primary_assembly.fa was probably more appropriate.

            The following description of the files says GRCh37, but it was downloaded from the GRCh38 directory on the Ensembl FTP site.
            Code:
            ---------
            TOPLEVEL
            ---------
            These files contain all sequence regions flagged as toplevel in an Ensembl
            schema. This includes chromosomes, regions not assembled into chromosomes and
            N padded haplotype/patch regions.
            
            EXAMPLES
            
              Toplevel sequences unmasked:
                Homo_sapiens.GRCh37.dna.toplevel.fa.gz
              
              Toplevel soft/hard masked sequences:
                Homo_sapiens.GRCh37.dna_sm.toplevel.fa.gz
                Homo_sapiens.GRCh37.dna_rm.toplevel.fa.gz
            
            -----------------
            PRIMARY ASSEMBLY
            -----------------
            Primary assembly contains all toplevel sequence regions excluding haplotypes
            and patches. This file is best used for performing sequence similarity searches
            where patch and haplotype sequences would confuse analysis.   
            
            EXAMPLES
            
              Primary assembly sequences unmasked:
                Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
              
              Primary assembly soft/hard masked sequences:
                Homo_sapiens.GRCh37.dna_sm.primary_assembly.fa.gz
                Homo_sapiens.GRCh37.dna_rm.primary_assembly.fa.gz
            Last edited by blancha; 10-07-2014, 09:35 AM.
