Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Efficient Data Structure for Managing Ribosome Profiling Data

    Hello all,

    New to forum here, so please correct me if posted in wrong section.

    I am working with ribosome profiling (see http://goo.gl/hvN8PP for a review) libraries generated on an Illumina HiSeq platform. For a lot of the analyses, I need to know precisely where the reads are aligning. Specifically, I need to know if the reads aligning to the coding sequence of a given mRNA are in the 0, +1, or +2 reading frame.

    Right now, my workflow is:
    1.) Use tophat (with a GFF file as a transcript guide) to map reads to the genome**
    2.) Use samtools to generate a sam file of the mapped reads
    3.) Read the allignment data into a R table, read the corresponding species GTF file into a table, then create a "local transcript coordinate" and a "CDS coordinate" for each read.

    So all 48 million of my reads have a number assigned to them which indicates where the 5' end of the read alligns to in a given transcript relative to the 5' most nucleotide of a given transcript, and then protein-encoding mRNA reads have a number assigned to them that which indicates where the 5' end of a read alligns to in a given transcript relative to the 'A' of the annotated 'AUG' start codon for a given ORF in the transcript

    Now, the issue is that the 'local transcript coordinate' and 'local CDS coordinate' for a given read can have different values depending on which isoform a read maps to. So if a given gene has 2 isoforms that happen to share some exons, a read may be alligning to the 50th position in a given annotated CDS and the 110th position in another annotated CDS.

    My workflow currently creates an index of all transcripts and their corresponding exons, then makes a table for each transcript with all the reads that could map to that transcript as well as their 'local transcript' and 'local CDS coordinates'.

    This seems vastly inefficient because instead of keeping track of 48 million reads, I have to keep track of ~500 million pseudo-reads - i.e. all of the reads that could align to multiple isoforms. I need advice for how to make some kind of efficient data structure that would allow me to keep track of 'local CDS' and 'local transcript' coordinates for all reads without having to duplicate the read objects.

    Something like:
    Gene ABC:
    Transcript_isoform_1:
    Transcript_isoform_2
    Read_238: (local_position_112, local_position_213), Read_239: (local_position_45, local_position_105), Read_240: (local_position_431, NA)

    Right now, I am thinking about making a custom class for the ribosome profiling reads, where each read contains a unique gene object, a unique genomic index, and then a list of objects corresponding to the transcript/'local CDS coordinate'/'local transcript coordinate' to which a read maps to.

    Can you recommend a more memory efficient way of doing the aforementioned task (or potentially some work-around that negates the problem entirely)?

    Please let me know if you require any clarification about my problem or my question.

    *** We decided against mapping reads to the transcriptome because we were concerned that tophat would arbitrarily assign reads to only one specific isoform, when in reality they could be coming from multiple isoforms. My supervisor advised me to also avoid using CuffLinks for our ribosome profiling analyses.

    Thanks in advance for your help!

  • #2
    What's the most efficient really depends on what you want to then do with the data. For example, keeping track of individual reads doesn't make much sense if you just want the # of reads/frame/transcript/gene. Anyway, assuming you're not dealing with genomic multimappers, I would think that a more efficient manner would be to simply store the number of alignments starting at each position (you can do this conveniently with HTSlib) and then converting those positions to frame/transcript/gene afterward.

    Comment


    • #3
      I don't understand why you would care which frame a read aligns to, unless you are translating them into amino acids afterward. But if you already know where a read aligns, there doesn't seem to be much point in translating the the read to AA space. Can you elaborate?

      Comment


      • #4
        @Brian: Presumably they're interested in alternate reading frames, which ribosomal profiling would allow determining.

        Comment


        • #5
          Thank dpryan and Brian for the quick replies!

          It is very important for us to know the framing of the data, as it allows us to examine ribosomal frameshifting, overlapping/alternate reading frames, as well as cases of ribosomes engaged in eukaryotic non-sense mediated decay. Converting the expressed sequences to amino acid-space is one of the tangential goals, but not one of the primary aims. Hope that clarifies things!

          @dpryan - thanks for the earlier suggestion. So in addition to wanting to know the # of reads/frame/transcript/gene, we also want to know WHERE and when changes in framing occur. For example, it appears that in one particular transcript, the ribosomes translate the entire coding sequence in the annotated 0-frame from start to stop codon, but when exposed to a chemical stimulus, the ribosomes elongating along this transcript shift into the -1 frame at about 450 nucleotides downstream of the annotated start codon, translate for about 60 nucleotides in the -1 frame, and then hit a -1 frame premature stop codon. So effectively, the stimulus causes a change in translation patterns that shifts the ribosome into an alternate reading frame along a specific sub-section of the mRNA. That is an example of one of our analyses that we would like to be able to globally and why we need to have framing data for each individual read so that we can do downstream automated analyses and visualization of sub-regions with unusual framing. Does that address your comment?

          I will try to implement your suggested solution using HTSlib and downstream frame/transcript/gene conversion tomorrow and let you know how that goes.

          Comment


          • #6
            Yes, it does indeed. In that case, remember that only one end of each alignment is informative (probably the start of read #1, but this will depend on how the libraries are made), so there's no need to store both. There's also no need to store this information per-read. You'll want to run the downstream analysis per-transcript anyway, so either store things as counts/genomic position and then map that to frame within each read or store everything in transcript coordinates (this would have been simpler had the mapping been done directly against the transcriptome with bowtie2/bwa/bbmap/etc.).

            BTW, I should note that htslib isn't the worlds fastest method, but it's the reference library for SAM/BAM/CRAM so it will mostly just work (and do so efficiently enough). This is also assuming that you're using something like C or C++. Other convenient options would be htsjdk (java) and pysam (a python interface to htslib). Given the wording in your post, I'm assuming that one of those is appropriate (i.e., you're not some crazy perl person).

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Current Approaches to Protein Sequencing
              by seqadmin


              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
              04-04-2024, 04:25 PM
            • seqadmin
              Strategies for Sequencing Challenging Samples
              by seqadmin


              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
              03-22-2024, 06:39 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, 04-11-2024, 12:08 PM
            0 responses
            30 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 10:19 PM
            0 responses
            32 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 09:21 AM
            0 responses
            28 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-04-2024, 09:00 AM
            0 responses
            52 views
            0 likes
            Last Post seqadmin  
            Working...
            X