  • Visualization Tools for Large Datasets

    We have a whole-transcriptome dataset from a SOLiD sequencer: a 100GB .bam file. In some places the read depth is greater than 1x10^7 reads. We have not been able to find a tool able to visualize this amount of data; IGV, MagicViewer, Tablet and Artemis have all died when looking at those portions of the genome (which, for this experiment, contain our genes of interest). Our visualization testing was done allowing up to 12GB of RAM, though we could probably push that up close to 20GB for a few tests.

    Is there a tool that can visualize this sort of data directly? If so, what kind of memory requirements would it have for this much data?
    Are there tools to pre-process or distill the data down to a visualizable summary?

  • #2
    I am unaware of a tool that can handle 10-million-fold coverage. How about doing some data reduction by collapsing reads that start at the same position (and/or have the same sequence)? You can do the former with Picard's MarkDuplicates. Then you would at least be able to visualize some of the data.
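
    If it helps, this is roughly what that reduction looks like in pysam - a minimal sketch only, not a substitute for MarkDuplicates (which also handles mate pairs and keeps the best-quality copy); the file names are placeholders:

    import pysam

    # Keep only the first read seen at each (reference, strand, start) position.
    # Rough stand-in for duplicate removal; note the set of keys can itself
    # grow large over a whole genome.
    with pysam.AlignmentFile("full.bam", "rb") as src, \
         pysam.AlignmentFile("collapsed.bam", "wb", template=src) as dst:
        seen = set()
        for read in src:
            if read.is_unmapped:
                continue
            key = (read.reference_id, read.is_reverse, read.reference_start)
            if key not in seen:
                seen.add(key)
                dst.write(read)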



    • #3
      The problem is that a lot of the layout algorithms really slow down on deep data. I'm intrigued to know how well my own code copes with this, so I'll experiment a bit, but I suspect that with that much depth you're going to struggle with just about every tool.

      One solution is simply random sampling of the deep regions, so you get a representative subset. Duplicate removal, as suggested above, may be closer to optimal, but it could take a long time to run.
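
      For what it's worth, the sampling step is easy to sketch in pysam. This assumes a placeholder 1% keep rate and file names, and hashes the read name so both mates of a pair get the same keep/drop decision:

      import zlib
      import pysam

      KEEP_FRACTION = 0.01  # placeholder sampling rate
      threshold = int(KEEP_FRACTION * 0xFFFFFFFF)

      with pysam.AlignmentFile("deep.bam", "rb") as src, \
           pysam.AlignmentFile("sampled.bam", "wb", template=src) as dst:
          for read in src:
              # crc32 of the name is deterministic, so mates travel together
              if zlib.crc32(read.query_name.encode()) <= threshold:
                  dst.write(read)

      (Recent samtools can do much the same in one go with view -s.)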

      James



      • #4
        So I did a test using gap5 on a short repeated section of a genome, made artificially by replicating the same sequences, so it compresses unusually well and isn't an ideal test.

        It was 94bp long, with 5.8 million sequences (mostly 36bp) and a peak depth of around 4 million.

        To open up the assembly (note: NOT bam format, but gap5's own) and view the "template display" showing all 5.8 million reads in a LookSeq-style plot took 5 seconds and a shade under 1GB of memory. I'm guessing LookSeq itself would be similarly fast if you convert the bam file to LookSeq's own sqlite format instead. Note that the speed of this plot is proportional to the number of objects visible, not their depth: 10 million deep over 50bp is fine, as long as it's not 10 million deep over a 10kb region. How many sequences do you think would be visible on your plots?

        The template display's "stacking" mode, where sequences are given a Y coordinate to stop them all overlapping, takes longer as it has to run a layout algorithm to allocate Y values; but not unusably so - an extra 20 seconds to display.
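
        For the curious, that layout step is essentially greedy interval stacking. A minimal sketch of the idea (my own illustration, not gap5's actual code):

        import heapq

        def stack_reads(intervals):
            """Assign a Y row to each (start, end) interval, given in start order."""
            active = []  # min-heap of (end, row) for rows still occupied
            rows, next_row = [], 0
            for start, end in intervals:
                if active and active[0][0] < start:
                    _, row = heapq.heappop(active)  # reuse the earliest-freed row
                else:
                    row = next_row                  # no free row yet: open a new one
                    next_row += 1
                heapq.heappush(active, (end, row))
                rows.append(row)
            return rows

        The heap keeps this O(n log n), but the cost still scales with the number of reads being laid out, which is why deep regions hurt every viewer.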

        Displaying the contig editor, however, was tragically sluggish: over 2 minutes just to come up and, rather annoyingly, 40-ish seconds to highlight what's under the mouse, or a minute to scroll one base to the right. Not particularly usable at that depth.

        The reason there's a difference between graphically drawing the sequences and actually displaying the alignments is down to how gap5 stores the data. The location, orientation, read-pairing and a few flags per sequence are stored together in the recursive binning system. The actual sequences, quality values and read names live elsewhere, in the main sequence structures. Typically only around 5% of the database is consumed by the positional binning arrays, although this may differ for extreme-depth cases - I haven't checked.

        In contrast, when viewing a bam file you'd need to load all of the data, as it's all mingled together. This is why I think both Gap5 and LookSeq are worth investigating. (I believe LookSeq, if not running in bam mode, also stores only location information, not the seqs/quals.)
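
        Even without gap5 you can get part of that benefit by hand: make one pass over the BAM to pull out just the positional fields, then plot from those. A sketch with pysam - the region and file names are placeholders, and the BAM is assumed coordinate-sorted and indexed:

        import pysam

        starts, ends, reverse = [], [], []
        with pysam.AlignmentFile("full.bam", "rb") as bam:
            for read in bam.fetch("chr1", 1000000, 1010000):
                starts.append(read.reference_start)
                ends.append(read.reference_end)  # end of the aligned span
                reverse.append(read.is_reverse)
        # A few integers per read now stand in for the full record when plotting.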

        James



        • #5
          I've just run a quick test with Tablet - again with simulated data (I made it load the same set of reads over and over again until it had 10 million of them) - and although it took a minute or so to load them from a BAM file, packing only took 5 or 6 seconds, and it was fine during display too.
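
          If anyone wants to reproduce that kind of stress test, the simulated file is straightforward to build with pysam. A sketch of the approach just described, with placeholder file names and target count:

          import pysam

          TARGET = 10000000  # placeholder record count

          with pysam.AlignmentFile("small.bam", "rb") as src:
              header = src.header
              reads = [(r, r.query_name) for r in src]

          written, pass_no = 0, 0
          with pysam.AlignmentFile("stacked.bam", "wb", header=header) as dst:
              while written < TARGET:
                  for read, name in reads:
                      read.query_name = "%s.%d" % (name, pass_no)  # keep names unique
                      dst.write(read)
                      written += 1
                      if written >= TARGET:
                          break
                  pass_no += 1

          # Viewers expect the result coordinate-sorted and indexed:
          pysam.sort("-o", "stacked.sorted.bam", "stacked.bam")
          pysam.index("stacked.sorted.bam")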

          Without access to the actual data though, it'll be hard for any of us to genuinely replicate the problem and try to fix it for you.

          Iain
          Our software: Tablet | Flapjack | Strudel | CurlyWhirly | TOPALi
