  • MAQ and short read length (DGE)

    We are currently looking into the viability of Digital Gene Expression (DGE) or mRNA-seq as a possible replacement for expression microarrays in our breast cancer studies. DGE generates reads that are only 17 bases long, so allowing even 1 mismatch is questionable when aligning against the human genome. MAQ doesn't seem to allow the -n flag to be set to anything less than 1. Is this something that can be altered easily? I would love to align my short reads via MAQ but keep only those that align perfectly.

    Along those lines, if a read maps to more than 1 location, MAQ will randomly pick one of those locations for the placement of that read. Is there any way to customize this function so that it checks against a coordinate file or something like that so we can at least have MAQ select a location for that read that is only in the transcriptome to raise our chances of the placement being 'correct'?

    Thank you for your help
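
One possible workaround for the perfect-match question, sketched here under stated assumptions: align with -n 1 as usual, then post-filter the text output of `maq mapview` and keep only reads whose best hit has zero mismatches. The column index below (`MISMATCH_COL`) is an assumption about the mapview layout and should be checked against the manual for your MAQ version; the function name and the two example records are made up for illustration.

```python
import io

# ASSUMPTION: in `maq mapview` text output, the 10th tab-separated column
# (0-based index 9) holds the number of mismatches of the best hit.
MISMATCH_COL = 9


def keep_perfect(lines):
    """Yield only alignment lines whose best hit has zero mismatches."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > MISMATCH_COL and fields[MISMATCH_COL] == "0":
            yield line


# Two made-up records in the assumed layout: tag1 is a perfect hit,
# tag2 has one mismatch.
example = io.StringIO(
    "tag1\tchr1\t100\t+\t0\t0\t30\t30\t30\t0\t0\t1\t0\t17\t"
    "ACGTACGTACGTACGTA\tIIIIIIIIIIIIIIIII\n"
    "tag2\tchr2\t250\t-\t0\t0\t10\t10\t10\t1\t20\t0\t1\t17\t"
    "TTGCATGCATGCATGCA\tIIIIIIIIIIIIIIIII\n"
)
perfect = list(keep_perfect(example))
```

This does not change how MAQ places multi-mapping reads; it only discards alignments after the fact.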

  • #2
    I have looked at DGE data, and even with 16/17 bp, more than 90% map to the tag sequences (all possible 16mers with the enzyme specificity).
    I am curious to see how MAQ can be modified as well. Quite a few other tools have tag-specific algorithms to take care of such aspects.
    --
    bioinfosm


    • #3
      You really see >90% mapping to "canonical" regions?
      I've been aligning with MAQ with -n set to 1, and map >99% to the genome. I then extend all reads 4bp off the 5' end and only keep reads that contain CATG (we cut with NlaIII) - we're only keeping 50% of our mapped reads at this step. Then after that we check to see the overlap with genic regions, and it is certainly not as high as you report. What do you do differently?
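
The CATG check described above could look roughly like the following. This is a minimal sketch, not the poster's actual script: the function name, the in-memory `genome` dictionary, and the 0-based leftmost-coordinate convention are all assumptions (MAQ reports 1-based positions, so a conversion would be needed).

```python
def has_upstream_catg(genome, chrom, start, length, strand):
    """Return True if the 4 bases immediately 5' of the read are CATG.

    ASSUMPTION: `start` is the 0-based leftmost forward-strand coordinate
    of the alignment and `genome` maps chromosome names to sequences.
    """
    seq = genome[chrom]
    if strand == "+":
        lo, hi = start - 4, start
    else:
        # For a reverse-strand read, the read's 5' side lies just past the
        # right end of the alignment on the forward strand.
        lo, hi = start + length, start + length + 4
    if lo < 0 or hi > len(seq):
        return False
    # CATG is its own reverse complement, so no strand flip is needed.
    return seq[lo:hi].upper() == "CATG"


# Toy reference: a 17 bp tag flanked by CATG on both sides (chr1)
# and by no NlaIII site at all (chr2).
tag = "ACGTACGTACGTACGTA"
genome = {
    "chr1": "AACATG" + tag + "CATGTT",
    "chr2": "GGGGGG" + tag + "TTTTTT",
}
forward_ok = has_upstream_catg(genome, "chr1", 6, 17, "+")
reverse_ok = has_upstream_catg(genome, "chr1", 6, 17, "-")
no_site = has_upstream_catg(genome, "chr2", 6, 17, "+")
```

Reads too close to a sequence end to be extended by 4 bp are rejected rather than raising an error.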


      • #4
        Technically you should not be trying to align your DGE reads to the genome. The tags may not exist as contiguous sequence in the genome; they may span splice sites or polyadenylation sites. To properly interpret DGE data you should first generate a complete set of predicted tags from the genome and transcriptome and then align your reads to that set. To do this you need a well-annotated genome. Please see the thread linked below for the software stack created by Ariel Paulson at the Stowers Institute for creating these tag tables, along with scripts to interpret the Eland alignments.

        Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


        I have used this pipeline for a couple of DGE projects. In one project with Arabidopsis I was able to map 97% of my filtered reads to predicted tags. This was allowing for up to 2 mismatches in the alignment. Counting only perfect matches, the hit rate was ~90%. Not all of these mapped to annotated genes, though. Roughly 63% mapped to genes; the remainder mapped to intergenic or repetitive regions.
        Last edited by kmcarr; 02-23-2009, 03:29 PM. Reason: correct spelling error
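
To illustrate the idea of a predicted tag table (this is not the Stowers pipeline, just a minimal sketch), the code below scans transcript sequences for NlaIII sites and records the bases immediately downstream of each one. The function name, the default of keeping only the 3'-most usable site per transcript, and the toy sequences are assumptions made for illustration; because the input is spliced transcript sequence, tags that span exon junctions come out naturally.

```python
def predicted_tags(transcripts, site="CATG", tag_len=17, three_prime_only=True):
    """Map each predicted tag to the set of transcript IDs that can produce it.

    ASSUMPTION: a tag is the `tag_len` bases immediately 3' of a `site`
    occurrence. With three_prime_only=True, only the last site that still
    has a full-length tag downstream is used for each transcript.
    """
    table = {}
    for name, seq in transcripts.items():
        seq = seq.upper()
        starts = []
        pos = seq.find(site)
        while pos != -1:
            tag_start = pos + len(site)
            if tag_start + tag_len <= len(seq):
                starts.append(tag_start)
            pos = seq.find(site, pos + 1)
        if three_prime_only:
            starts = starts[-1:]
        for s in starts:
            table.setdefault(seq[s:s + tag_len], set()).add(name)
    return table


# Toy transcripts (hypothetical): tx1 has two NlaIII sites, tx2 has one,
# and both share the same 3'-most tag.
shared = "ACGTACGTACGTACGTA"
transcripts = {
    "tx1": "GGCATG" + "A" * 17 + "CCCATG" + shared + "TT",
    "tx2": "CATG" + shared,
}
last_site_table = predicted_tags(transcripts)
all_sites_table = predicted_tags(transcripts, three_prime_only=False)
```

A tag that maps to more than one transcript, as `shared` does here, shows up as a set with several IDs, which makes ambiguous tags easy to flag before counting.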


        • #5
          Originally posted by jms1223
          Along those lines, if a read maps to more than 1 location, MAQ will randomly pick one of those locations for the placement of that read. Is there any way to customize this function so that it checks against a coordinate file or something like that so we can at least have MAQ select a location for that read that is only in the transcriptome to raise our chances of the placement being 'correct'?
          If you only want to have reads mapped to your transcriptome, perhaps just make your reference sequences the transcripts themselves, rather than the genome sequence?

          --Torst


          • #6
            kmcarr and Torst answered that for me, jms1223.
            --
            bioinfosm
