  • Can NGS learn something from its older brother Microarrays?

    Greetings all,

    I like next-gen sequencing as much as the next person, but I find it curious that it might suffer from some limitations not seen in microarrays. I have two cases in mind:

    (1) longer reads seem great but they make aligning to reference genomes more difficult, due to things such as more mismatches and reads crossing multiple exon-exon boundaries or structural variations.

    And (2) as has been pointed out in some papers, statistical methods to detect differential expression have higher power with higher read counts, and so there could be a bias in detecting differential expression in longer genes and transcripts. Some genes might show up at the top of your differential list just because they are long.
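
    To put a rough number on case (2), here is a toy simulation (made-up depth and fold change, and a deliberately simple test, just to show the length effect): if counts scale roughly with transcript length, a longer transcript with the same fold change gets more reads and is easier to call significant.

    ```python
    # Toy sketch: same 1.5x fold change in two conditions, counts drawn as
    # Poisson with mean proportional to transcript length. Longer transcripts
    # reach significance more often under a simple two-sample test.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    reads_per_bp = 0.05   # assumed depth per base (made up)
    fold_change = 1.5
    n_sims = 1000

    def detection_rate(length_bp, alpha=0.05):
        base = length_bp * reads_per_bp
        hits = 0
        for _ in range(n_sims):
            a = rng.poisson(base, size=3)                 # condition A, 3 replicates
            b = rng.poisson(base * fold_change, size=3)   # condition B, 3 replicates
            if stats.ttest_ind(a, b).pvalue < alpha:
                hits += 1
        return hits / n_sims

    for length in (500, 2000, 8000):
        print(length, detection_rate(length))   # detection rate grows with length
    ```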

    Microarrays do not seem to have these problems and maybe that is because they are measuring intensity values of equal length probes. So my question is, would it make sense to detect expression levels from a set of probes, similar to how microarrays do things? Here is how I would imagine this would go . . .

    - Generate a list of "pseudo-probes" of the same length, as many as you want (hey, no need to worry about actually manufacturing a custom array). Ideally, each probe would uniquely identify some genomic feature. For example, the probe might cross an exon-exon boundary specific to a particular transcript so any reads aligning to that probe would be evidence for that transcript.
    - Align your reads to the probes. Should be pretty fast. And longer reads are now an advantage rather than a nuisance since they will have a greater chance of hitting one of your probes.
    - Run your differential analysis on these probe counts (without having to worry about transcript length biases) and relate them back to the genomic features of interest.
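
    Here is a rough sketch of that pipeline in code (purely illustrative: the probe sequences are placeholders and exact string matching stands in for a real aligner and a curated probe set):

    ```python
    # Sketch of the pseudo-probe idea: fixed-length probes (e.g. spanning
    # transcript-specific exon-exon junctions); a read supports a probe if any
    # probe-length window of the read matches it. Exact matching is used here
    # only to keep the example short.
    from collections import Counter

    PROBE_LEN = 50  # every probe the same length

    # Hypothetical probe set: probe sequence -> feature it identifies
    probes = {
        "ACGT" * 12 + "AC": "geneA_transcript1_junction3",  # 50 bp placeholder
        "TTGCA" * 10:       "geneB_transcript2_junction1",  # 50 bp placeholder
    }

    def count_probe_hits(reads, probes, probe_len=PROBE_LEN):
        """Count reads supporting each probe by sliding a probe-length window."""
        counts = Counter()
        for read in reads:
            for i in range(len(read) - probe_len + 1):
                if read[i:i + probe_len] in probes:
                    counts[probes[read[i:i + probe_len]]] += 1
                    break  # one hit per read is enough evidence
        return counts

    # Longer reads only help: more windows, more chances to cover a probe.
    reads = ["N" * 10 + "ACGT" * 12 + "AC" + "N" * 40]
    print(count_probe_hits(reads, probes))
    ```

    The probe counts could then go straight into whatever count-based differential test you like, and be related back to the features of interest via the probe-to-feature map.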

    Here are the advantages as I see them:
    - Alignment would be quick - only aligning against a probe-set
    - Alignment would not get harder as read length increases. For example, the number of allowed mismatches in the probes could remain constant even as your read length increases.
    - Potentially eliminate the effects of gene length on statistical power since each probe will be of equal length
    - "Cross-hybridization" could be explicitly measured as the sequence similarity (such as edit distance) of two probes.
    - Would not have to worry about modeling binding affinities for each probe since we can explicitly read the sequence and determine if it is complementary to the probe instead of relying on the physical properties of binding between nucleotides.
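
    As a concrete example of the cross-hybridization point (just a toy check; real probe design would also screen against the whole transcriptome), pairwise edit distance over the probe set could flag probes that a read might support ambiguously:

    ```python
    # Toy "cross-hybridization" check: flag probe pairs whose edit (Levenshtein)
    # distance falls below a chosen threshold.
    def edit_distance(a, b):
        """Standard dynamic-programming Levenshtein distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def ambiguous_pairs(probe_seqs, max_dist=5):
        names = list(probe_seqs)
        return [(x, y) for i, x in enumerate(names) for y in names[i + 1:]
                if edit_distance(probe_seqs[x], probe_seqs[y]) <= max_dist]

    probe_seqs = {"p1": "ACGTACGTAC", "p2": "ACGTTCGTAC", "p3": "GGGGGGGGGG"}
    print(ambiguous_pairs(probe_seqs, max_dist=2))   # [('p1', 'p2')]
    ```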

    Of course, you wouldn't be able to use this approach to find structural variation or novel isoforms. But if what you are interested in is differential expression of known transcripts (and maybe that is a good place to start), then why not make your alignment and analysis job easier?

    Just some thoughts. I may be way off base but I wanted to pitch that idea, and all baseball analogies aside, I would be interested to hear others' comments and thoughts.

    Thanks!
    BAMseek

  • #2

    Hi,
    This is a situation where you can choose from previously well-characterized gene expression biology to see if your expectations pan out. Just do a good literature search to identify a particular cell-based system and the characteristics you would expect, then reconfirm them using the techniques you propose above. More time in the library = less time at the bench. Never let a few weeks at the bench save you from spending a few hours in the library.

    • #3
      Hi Joann,

      Thanks for the suggestions. No doubt the best way to see if something works is to give it a try. I am definitely glad to hear suggestions from biologically-minded people like you, since my background is more in the computer sciences. I decided to just post the question to see if others thought the approach seemed reasonable before fully embarking down that path. Thanks!

      • #4
        As with microarrays, it may be risky to have the expression level of a specific transcript rely on only a handful of supposedly isoform-specific probes. The higher the number of reads/probes, the lower the impact of probe-specific biases (GC content, relative position within the transcript, etc.). I believe that there is still too much unappreciated bias in RNA-seq experiments to let us define a reliable gold-standard probe set.

        Plus you have to assume that the whole transcriptome of your species of interest is entirely known and well-established, which is almost systematically disproved by exploratory transcriptomic studies, even in model organisms. OK, I know this is not crucial in the context of measuring the expression of known genes, but I like to repeat that statement.

        Sounds like an alternative normalization approach to me: do not consider all of the signal, but just the part you predefined as significant/specific. Like Digital Gene Expression or SAGE. Not sure a reduction of the information is the best strategy, or at least not for now. I prefer what Cufflinks and Scripture are doing. Just my 2 cents.

        • #5
          Hi steven,
          Thanks for the very useful comments! I agree that looking at only a portion of a transcript may not give the best idea of what is going on, due to non-uniform coverage of reads across the transcript and differences in GC-content. From my experiences looking at visualizations of alternative splicing, the eye is usually drawn to exons or splice junctions that only appear in one of the isoforms to determine which isoform is being expressed. So I thought it might be nice to simplify things and measure those interesting locations directly. I guess my concern is that when the data is transformed by doing a dash of GC-correction here and a pinch of quantile normalization there, the model gets pretty complex and it is difficult to get useful statistics out of it.
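
          Just to spell out what I mean by the "pinch of quantile normalization" (the mechanics are simple; the complexity I worry about is knowing when its assumptions hold), here is a bare-bones version on a genes x samples matrix, with toy numbers and ignoring tie handling:

          ```python
          # Minimal quantile normalization sketch: force every sample (column) to
          # share the same distribution by replacing each value with the mean of
          # the values at the same rank across samples.
          import numpy as np

          def quantile_normalize(matrix):
              """matrix: genes x samples array of expression values."""
              order = np.argsort(matrix, axis=0)         # rank order within each sample
              sorted_vals = np.sort(matrix, axis=0)
              mean_per_rank = sorted_vals.mean(axis=1)   # reference distribution
              normalized = np.empty_like(matrix, dtype=float)
              for col in range(matrix.shape[1]):
                  normalized[order[:, col], col] = mean_per_rank
              return normalized

          counts = np.array([[5.0, 4.0, 3.0],
                             [2.0, 1.0, 4.0],
                             [3.0, 4.0, 6.0],
                             [4.0, 2.0, 8.0]])
          print(quantile_normalize(counts))   # every column now has the same set of values
          ```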

          I think you are right about the similarities of the approach I described with SAGE/digital gene counting. Tag based approaches seem nice because you don't have to worry about differences in transcript length or fragmentation biases. Of course, you can't do some of the cooler stuff like transcript level expression or alternative splicing detection. I guess I am surprised I don't see tag-based approaches used more often for gene-level expression since it might simplify the analysis.

          Thanks again. I will definitely explore the links you sent.

          BAMseek

          • #6
            my assumptions

            Hello:
            I am assuming that you are talking about using the proposed approach to look at a defined state of regulated expression in a well-characterized cell system, not a random sample from an arbitrary species.

            A situation of protein secretion, for example, or induction of hemoglobin synthesis comes to mind but there are many examples of differential isozyme expression in past literature as well.

            From previous studies you would be able to tell roughly how much message to expect for a target gene, as well as a lot about what else is not being expressed, so anything you find outside those expectations would be novel and interesting if it were real.

            Tune the biological system to support your methodology exploration.

            • #7
              Case (1): For longer reads, it is more important to perform local alignment, instead of the glocal alignment most mappers currently do. Local alignment does not have the problems you are describing.
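
              To make the difference concrete, here is a toy Smith-Waterman local alignment (simplistic scoring, purely illustrative): because the alignment may start and end anywhere, a read tail that runs off an exon boundary or a breakpoint is simply clipped instead of being counted as mismatches.

              ```python
              # Minimal Smith-Waterman local alignment score (toy scoring: match +2,
              # mismatch -1, gap -2). The zero floor lets the alignment start and end
              # anywhere, so a non-matching read tail is effectively soft-clipped.
              def smith_waterman(read, ref, match=2, mismatch=-1, gap=-2):
                  rows, cols = len(read) + 1, len(ref) + 1
                  score = [[0] * cols for _ in range(rows)]
                  best = 0
                  for i in range(1, rows):
                      for j in range(1, cols):
                          diag = score[i - 1][j - 1] + (match if read[i - 1] == ref[j - 1] else mismatch)
                          score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
                          best = max(best, score[i][j])
                  return best

              # The first 12 bases match the reference; the last 8 come from elsewhere
              # (say, the next exon) and do not drag the local score down.
              ref = "AAACCCGGGTTTAAACCC"
              read = "CCCGGGTTTAAAZZZZZZZZ"
              print(smith_waterman(read, ref))   # 24 = 12 matching bases * 2
              ```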

              Case (2): What matters more is the mean rather than the variance, and the statistical fluctuation due to variable transcript lengths can be corrected for.

              I think that while case (1) might be problematic to a limited extent, as we do not do local alignment often, case (2) is not a problem. Another argument against longer reads is the cost.

              By using probes, you may be dropping informative data. It is a valid method, but I guess it is less powerful than looking at the full data set (I cannot predict how much less). Also, my impression is that with sophisticated tools such as Cufflinks, measuring gene expression nowadays is not a particularly hard problem. It is not necessary to trade the information in the data for reduced computing time.

              • #8
                Hi Heng,

                Thanks a lot for your reply! I agree that as reads get longer, more consideration will need to be given to local re-alignment. I wonder how important it is to reconsider the defaults used for many of the short-read aligners out there as read lengths go beyond what they were originally intended for. Maybe a shorter portion of a long read could be used to quickly find potential hits, followed by a more exhaustive local alignment, similar to the BLAST approach. I know SHRiMP does a local Smith-Waterman, but at the expense of speed compared to the other aligners (at least that has been my experience).

                I still think it is an open question on how to best do differential expression of genes. Cuffdiff looks at FPKM values - I've seen some suggestions that dividing by transcript length and total reads may be too simple for normalizing data (since a majority of the expression could be caused by a minority of the genes, and changes in highly expressed genes could affect total reads). Cuffdiff will optionally use the Bullard normalization in the FPKM computation, but how would you know when to use that or not? Some approaches that work off of raw counts would tell you not to normalize at all, but then I don't think they address the transcript length bias issue. If an approach works off of just the raw reads, I would think they would either need to look at equal sized regions (which is why I thought about aligning to equal-sized "probes") or somehow account for the differences in transcript length.
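
                To spell out my concern with made-up numbers: because every gene's FPKM is divided by the same library total, a jump in one dominant gene lowers everyone else's FPKM even when their own counts do not change.

                ```python
                # FPKM = fragments * 1e9 / (transcript length in bp * total mapped fragments).
                # Made-up numbers: when the dominant gene doubles, the FPKM of geneX and
                # geneY roughly halves even though their own fragment counts are unchanged.
                def fpkm(counts, lengths):
                    total = sum(counts.values())
                    return {g: counts[g] * 1e9 / (lengths[g] * total) for g in counts}

                lengths = {"dominant": 2000, "geneX": 1000, "geneY": 3000}

                before = {"dominant": 900_000, "geneX": 5_000, "geneY": 15_000}
                after  = {"dominant": 1_800_000, "geneX": 5_000, "geneY": 15_000}

                print(fpkm(before, lengths))
                print(fpkm(after, lengths))
                ```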

                I admit I am still trying to wrap my head around all this, so please forgive me if I am mistaken.

                thanks!
                BAMseek

                • #9
                  As to local alignment, you may consider bwa-sw, probably with "-T20" and possibly with "-z5". I kind of think given >100bp RNA-seq reads, we should do local alignment more often, but I am not in this field, so do not really know if this is a good idea.

                  • #10
                    Quote from #9: As to local alignment, you may consider bwa-sw, probably with "-T20" and possibly with "-z5". I kind of think given >100bp RNA-seq reads, we should do local alignment more often, but I am not in this field, so do not really know if this is a good idea.
                    I will definitely try that out. Thanks!
