  • Indexing very large genomes

    Hi, I was wondering if anyone could help me.

    I have a diploid plant genome of around 10Gb split up into around 500,000 contigs.

    I have run into a problem indexing with Bowtie2: the genome exceeds its character limit, so unless I recompile Bowtie2-build as 64-bit it won't be possible.

    I resorted to using STAR, but I hit the same problem because of the genome size. STAR tells me to set the memory limit to at least 65Gb of RAM and to make sure that much is available. I have 65Gb of RAM in total (around 55Gb free), but of course I can't use all of the system's resources, and I'm not able to free up any more. I cannot upgrade the RAM, so I'm stuck with what I have.

    The other alternative is splitting the genome up, perhaps in half, and indexing each half.

    Is there a way to do this and merge the indexes? Or, if this is possible, will I simply hit the same problem when I load the merged index into memory and be short of RAM again?

    I could also just align to each half, but this will result in biases and an increase in false positives with the alignments, which I would prefer to avoid. My aim is to identify novel isoforms, so this will just throw doubt on any novelty I find. Unfortunately, this may be my only option if I cannot get any more RAM, merge indexes, or tweak STAR or Bowtie2 parameters to work with the large genome.

    Thanks in advance for any help. It is much appreciated.
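
On tweaking STAR parameters for an assembly with this many contigs: the STAR manual suggests scaling `--genomeChrBinNbits` as min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))). A rough sketch of that arithmetic for the numbers in this post (10 Gb, about 500,000 contigs). The 100 bp read length and the choice to round down are assumptions, not something stated in the thread:

```python
import math

def genome_chr_bin_nbits(genome_length: int, n_references: int, read_length: int) -> int:
    """STAR manual heuristic: min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))).

    The result is rounded down to an integer (an assumption of this sketch).
    """
    mean_ref_length = genome_length / n_references
    return min(18, int(math.log2(max(mean_ref_length, read_length))))

# 10 Gb genome, ~500,000 contigs, hypothetical 100 bp reads
nbits = genome_chr_bin_nbits(10_000_000_000, 500_000, 100)
print(nbits)  # 14
```

A smaller value reduces the padding STAR reserves per contig. `--genomeSAsparseD` (a sparser suffix array) is the other documented option for trading mapping speed for RAM. Whether either brings a 10 Gb index under 55 Gb is something to test; this sketch does not show it.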

  • #2
    Bowtie2 does not work with >4GB genomes yet, as far as I know. It is limited by its internal integer types. Correct me if I am wrong. I don't know about STAR. If you are working with genomic reads (not RNA-seq), try BWA. A 10GB genome should need <20GB RAM.
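
To make the integer-type limit concrete: assuming reference offsets are stored as unsigned 32-bit integers (an assumption, though consistent with the >4GB cutoff mentioned above), the largest addressable position is 2^32 - 1. A quick check against a 10 Gb genome:

```python
UINT32_MAX = 2**32 - 1  # largest value an unsigned 32-bit offset can hold

def fits_in_uint32(genome_length_bp: int) -> bool:
    """True if every base position can be addressed with a 32-bit unsigned offset."""
    return genome_length_bp <= UINT32_MAX

print(UINT32_MAX)                      # 4294967295
print(fits_in_uint32(10_000_000_000))  # False
```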

    • #3
      Thanks for your feedback. Yes, that's what I found with Bowtie2. I'm working with RNA-seq reads and I need to use an aligner which does spliced alignment, that's why I chose TopHat2 and STAR. Is there a way to get BWA to do spliced alignments?

      • #4
        Originally posted by Brett_CCG
        I'm working with RNA-seq reads and I need to use an aligner which does spliced alignment
        What about Subread/Subjunc? It handles RNA-seq data and it might be able to work with your reference. (But I've never used it.)

        Dario

        • #5
          Thanks. I've heard of subread. I'll try it out over the next few days and update this thread.

          • #6
            I'd just split the genome into chunks, build the index on each chunk, align to each, and merge the results. You'll need to do a little post-processing if you're looking to find a single best hit for each read, but sorting reads by score isn't that hard.
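
A minimal sketch of the chunking step, assuming the contig lengths are already known (for example from a FASTA index) and that each chunk only needs to stay under a size budget. The function name `split_contigs` and the greedy largest-first strategy are illustrative choices, not something any aligner prescribes:

```python
from typing import Dict, List

def split_contigs(contig_lengths: Dict[str, int], max_chunk_bp: int) -> List[List[str]]:
    """Greedily pack contigs into chunks whose total length stays <= max_chunk_bp.

    A contig longer than max_chunk_bp ends up in a chunk of its own.
    """
    chunks: List[List[str]] = []
    chunk_sizes: List[int] = []
    # Largest contigs first; ties broken by name so the result is deterministic.
    for name, length in sorted(contig_lengths.items(), key=lambda kv: (-kv[1], kv[0])):
        for i, size in enumerate(chunk_sizes):
            if size + length <= max_chunk_bp:
                chunks[i].append(name)
                chunk_sizes[i] += length
                break
        else:
            # No existing chunk has room: open a new one.
            chunks.append([name])
            chunk_sizes.append(length)
    return chunks

# Toy lengths (hypothetical), budget of 100 bp per chunk
example = {"c1": 60, "c2": 50, "c3": 40, "c4": 30, "c5": 20}
print(split_contigs(example, max_chunk_bp=100))
# [['c1', 'c3'], ['c2', 'c4', 'c5']]
```

Each resulting list of contig names would then be written to its own FASTA file and indexed separately.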

            "I could also just align to each half, but this will result in biases and an increase in false positives with the alignments, which I would prefer to avoid."
            How does aligning to each half cause false positives or bias? You need to merge your results properly (e.g., decide which half has the better score and whether it's better by enough to be unique), but that's all doable.
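
The merge step described above could look roughly like this. It assumes each per-chunk run has already been reduced to (read name, chunk label, alignment score) tuples, for example taken from the `AS` tag, and that all chunks were aligned with identical scoring parameters so the scores are comparable. The `min_margin` threshold and the function name are hypothetical:

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Hit = Tuple[str, str, int]  # (read name, chunk label, alignment score)

def merge_best_hits(hits: Iterable[Hit], min_margin: int = 5) -> Dict[str, Optional[str]]:
    """Pick the best-scoring chunk per read.

    A read is assigned to a chunk only if its best score beats the runner-up
    by at least min_margin; otherwise it is reported as ambiguous (None).
    """
    by_read = defaultdict(list)
    for read, chunk, score in hits:
        by_read[read].append((score, chunk))

    result: Dict[str, Optional[str]] = {}
    for read, scored in by_read.items():
        scored.sort(reverse=True)  # highest score first
        best_score, best_chunk = scored[0]
        if len(scored) == 1 or best_score - scored[1][0] >= min_margin:
            result[read] = best_chunk
        else:
            result[read] = None
    return result

hits = [
    ("read1", "half_A", 98), ("read1", "half_B", 60),  # clear winner
    ("read2", "half_A", 90), ("read2", "half_B", 88),  # too close to call
    ("read3", "half_B", 75),                           # aligned in one half only
]
print(merge_best_hits(hits))
# {'read1': 'half_A', 'read2': None, 'read3': 'half_B'}
```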

            • #7
              Thanks. I hadn't thought of this approach, although I may end up filtering out true positives by filtering on score. I'll try both approaches: Subread (if it runs with the whole genome) and running STAR on a split genome with quality score filtering.

              • #8
                OK. I've tried Subread, and it has a limit of 4Gb for the genome, so I can't use that. I'm now using STAR, splitting up the genome (the genome is hexaploid, so I'm splitting on each genome) and running alignments on each. I won't merge them and filter on score; instead I'll work with each individually and identify splice junctions in each.

                Now my main concern: I'm using annotation-guided alignment, but the annotation contains a number of exons/CDS in the wrong frame because their start/end positions are out by 1-2 bp. Will this affect alignments? My understanding is that STAR would align to regions based on the annotated GFF, and then attempt realignment of the reads which didn't align using the annotated GFF file. From what I've read this is what TopHat2 does. Is this the case for all GFF-guided RNA-seq alignments? I cannot get a better annotated GFF unless I wait for the group who did the assembly and annotation to improve it. I need to wrap up this project and move on. It's a part of my PhD (I have 1 year left on my scholarship), so I don't have the luxury of waiting around.

                Thanks for any help anyone can provide.
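
One way to gauge how widespread the off-by-1-2 bp problem is: sum the CDS lengths per transcript and flag totals that are not a multiple of 3. This sketch assumes GFF3-style lines with a `Parent=` attribute on CDS features; the function names are hypothetical. It only catches errors that change the total CDS length, and genuinely partial gene models will be flagged too:

```python
from collections import defaultdict
from typing import Dict, Iterable

def cds_length_by_parent(gff_lines: Iterable[str]) -> Dict[str, int]:
    """Sum CDS feature lengths per Parent ID from GFF3 text lines."""
    totals: Dict[str, int] = defaultdict(int)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        start, end = int(cols[3]), int(cols[4])
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        parent = attrs.get("Parent")
        if parent is None:
            continue
        # A CDS feature may list several parents, comma-separated.
        for p in parent.split(","):
            totals[p] += end - start + 1  # GFF coordinates are 1-based, inclusive
    return dict(totals)

def out_of_frame(gff_lines: Iterable[str]) -> Dict[str, int]:
    """Return {parent: total CDS length} for totals not divisible by 3."""
    return {p: n for p, n in cds_length_by_parent(gff_lines).items() if n % 3}

# Made-up records for illustration only
example_gff = [
    "ctg1\tdraft\tCDS\t101\t190\t.\t+\t0\tID=cds1;Parent=tx1",
    "ctg1\tdraft\tCDS\t301\t360\t.\t+\t0\tID=cds2;Parent=tx1",
    "ctg1\tdraft\tCDS\t501\t591\t.\t+\t0\tID=cds3;Parent=tx2",
]
print(out_of_frame(example_gff))  # {'tx2': 91}
```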

                • #9
                  I'm not sure if this is what you want, but BWA-MEM (v0.7.5a) reports chimeric alignments in the SAM output with SA tags. For example:

                  Code:
                  HWI-ST226:220:D0AU7ACXX:5:1101:11456:146339	2193	Oa_Locus_2615_Transcript_18	2194	44	42M46H	=	2182	-54	AATTGAGCTACCAAAAACCCTAACCCAAAAATTTGTAGCGTC	*	NM:i:2	AS:i:35	XS:i:0	SA:Z:Oa_Locus_2615_Transcript_14,1923,+,21S43M24S,60,0;Oa_Locus_2615_Transcript_14,1976,-,50S38M,55,0;
                  See the BWA manual and the latest SAM format specification for details about the SA tag.

                  BWA should also be able to index your large genome.
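
For reference, the SAM specification defines each SA entry as `rname,pos,strand,CIGAR,mapQ,NM`, with entries separated by semicolons. A minimal sketch of pulling those entries out of a record like the one above; the class and function names are illustrative:

```python
from typing import List, NamedTuple

class SupplementaryAlignment(NamedTuple):
    rname: str
    pos: int      # 1-based leftmost position
    strand: str   # "+" or "-"
    cigar: str
    mapq: int
    nm: int       # edit distance

def parse_sa_tag(sam_line: str) -> List[SupplementaryAlignment]:
    """Extract the SA:Z tag from one SAM record and split it into its entries."""
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:  # optional tags start at column 12
        if tag.startswith("SA:Z:"):
            parsed = []
            for entry in tag[5:].split(";"):
                if not entry:
                    continue  # the tag value ends with a trailing ';'
                rname, pos, strand, cigar, mapq, nm = entry.rsplit(",", 5)
                parsed.append(
                    SupplementaryAlignment(rname, int(pos), strand, cigar, int(mapq), int(nm))
                )
            return parsed
    return []

# The record quoted in the post above, rebuilt with explicit tab separators.
record = "\t".join([
    "HWI-ST226:220:D0AU7ACXX:5:1101:11456:146339", "2193",
    "Oa_Locus_2615_Transcript_18", "2194", "44", "42M46H", "=", "2182", "-54",
    "AATTGAGCTACCAAAAACCCTAACCCAAAAATTTGTAGCGTC", "*",
    "NM:i:2", "AS:i:35", "XS:i:0",
    "SA:Z:Oa_Locus_2615_Transcript_14,1923,+,21S43M24S,60,0;"
    "Oa_Locus_2615_Transcript_14,1976,-,50S38M,55,0;",
])
for sa in parse_sa_tag(record):
    print(sa.rname, sa.pos, sa.strand, sa.cigar, sa.mapq, sa.nm)
```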

                  • #10
                    I think BWA-MEM works with RNA-seq only to a lesser extent. It might be useful in certain non-typical analyses. However, for typical RNA-seq, it would be better to use a standard RNA-seq mapper if possible.

                    • #11
                      Thanks for your input. I'm going to be using particular scripts to extract out spliced reads. These scripts look for the CIGAR string, so for now at least I can't use BWA since it uses SA tags.

                      What I'd really like to find out, though, is whether guided RNA-seq alignment is a problem when the GFF file comes from a rough draft annotation containing errors, because that is all I have to work with. The details are in my post above. If it is a problem, I'll take the unguided approach. Thanks for your help.
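
For context on the CIGAR-based extraction mentioned above: spliced aligners encode an intron as an `N` operation in the CIGAR string. A minimal sketch of deriving junction coordinates from a 1-based alignment start and a CIGAR string. The function name is illustrative and this is not the script referred to in the post:

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def splice_junctions(pos: int, cigar: str) -> List[Tuple[int, int]]:
    """Return (intron_start, intron_end) pairs, 1-based inclusive, one per N operation."""
    junctions = []
    ref_pos = pos  # reference coordinate of the next base to be consumed
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "N":
            junctions.append((ref_pos, ref_pos + n - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref_pos += n
    return junctions

print(splice_junctions(1000, "50M200N50M"))  # [(1050, 1249)]
print(splice_junctions(1000, "10S40M"))      # []
```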
