SEQanswers





Old 06-08-2011, 10:10 PM   #1
BAMseek
Senior Member
 
Location: St. Louis, MO, USA

Join Date: Apr 2011
Posts: 124
Question Can NGS learn something from its older brother Microarrays?

Greetings all,

I like next-gen sequencing as much as the next person, but I find it curious that it might suffer from some limitations not seen in microarrays. I have two cases in mind:

(1) Longer reads seem great, but they make aligning to reference genomes more difficult, due to factors such as more mismatches and reads crossing multiple exon-exon boundaries or structural variations.

And (2), as has been pointed out in some papers, statistical methods to detect differential expression have higher power with higher read counts, so there could be a bias toward detecting differential expression in longer genes and transcripts. Some genes might show up at the top of your differential list just because they are long.

Microarrays do not seem to have these problems and maybe that is because they are measuring intensity values of equal length probes. So my question is, would it make sense to detect expression levels from a set of probes, similar to how microarrays do things? Here is how I would imagine this would go . . .

- Generate a list of "pseudo-probes" of the same length, as many as you want (hey, no need to worry about actually manufacturing a custom array). Ideally, each probe would uniquely identify some genomic feature. For example, the probe might cross an exon-exon boundary specific to a particular transcript so any reads aligning to that probe would be evidence for that transcript.
- Align your reads to the probes. Should be pretty fast. And longer reads are now an advantage rather than a nuisance since they will have a greater chance of hitting one of your probes.
- Run your differential analysis on these probe counts (without having to worry about transcript length biases) and relate them back to the genomic features of interest.
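The steps above can be sketched in a few lines. This is a toy illustration of the pseudo-probe idea, not a real probe-design tool: all sequences and names are made up, and a real implementation would need to check each probe for uniqueness against the whole genome.

```python
# Sketch of the pseudo-probe pipeline: build fixed-length probes that
# straddle exon-exon junctions, then count reads by simple substring
# matching. Illustrative only; real alignment would allow mismatches.

PROBE_LEN = 50  # every probe has the same length by construction


def junction_probe(exon_a, exon_b, probe_len=PROBE_LEN):
    """Concatenate the tail of exon_a with the head of exon_b so the
    probe spans the junction; reads matching it are evidence for the
    transcript that joins these two exons."""
    half = probe_len // 2
    return exon_a[-half:] + exon_b[:probe_len - half]


def count_probe_hits(reads, probes):
    """Count reads containing each probe (or contained in the probe,
    for reads shorter than the probe length)."""
    counts = {name: 0 for name in probes}
    for read in reads:
        for name, probe in probes.items():
            if probe in read or read in probe:
                counts[name] += 1
    return counts
```

Because every probe has the same length, the downstream counts are directly comparable, which is the whole point of the proposal.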

Here are the advantages as I see them:
- Alignment would be quick - only aligning against a probe-set
- Alignment would not get harder as read length increases. For example, the number of allowed mismatches in the probes could remain constant even as your read length increases.
- Potentially eliminate the effects of gene length on statistical power since each probe will be of equal length
- "Cross-hybridization" could be explicitly measured as the sequence similarity (such as edit distance) of two probes.
- Would not have to worry about modeling binding affinities for each probe since we can explicitly read the sequence and determine if it is complementary to the probe instead of relying on the physical properties of binding between nucleotides.
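The "cross-hybridization" point could be made concrete with a plain edit-distance computation between probe sequences. This is just the standard Levenshtein dynamic program, offered as a sketch of how probe similarity might be scored, not a claim about how hybridization physically behaves:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming: the minimum number
    of substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```

Two probes within a small edit distance of each other would be flagged as likely to attract each other's reads, the in-silico analogue of cross-hybridization.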

Of course, you wouldn't be able to use this approach to find structural variation or novel isoforms. But if what you are interested in is differential expression of known transcripts (and maybe that is a good place to start), then why not make your alignment and analysis job easier?

Just some thoughts. I may be way off base but I wanted to pitch that idea, and all baseball analogies aside, I would be interested to hear others' comments and thoughts.

Thanks!
BAMseek
Old 06-09-2011, 07:42 AM   #2
Joann
Senior Member
 
Location: Woodbridge CT

Join Date: Oct 2008
Posts: 231
Default test

Hi,
This is a situation where you have your choice among previously well-characterized gene expression biology to see if your expectations pan out. Just do a good literature search to identify a particular cell-based system and the characteristics you would expect to reconfirm using the techniques you propose above. More time in the library = less time at the bench. Never let a few weeks at the bench save you from spending a few hours in the library.
Old 06-09-2011, 09:23 PM   #3
BAMseek
Senior Member
 
Location: St. Louis, MO, USA

Join Date: Apr 2011
Posts: 124
Default

Hi Joann,

Thanks for the suggestions. No doubt the best way to see if something works is to give it a try. I am definitely glad to hear suggestions from biologically-minded people like you, since my background is more in the computer sciences. I decided to just post the question to see if others thought the approach seemed reasonable before fully embarking down that path. Thanks!
Old 06-09-2011, 11:39 PM   #4
steven
Senior Member
 
Location: Southern France

Join Date: Aug 2009
Posts: 269
Default

As with microarrays, it may be risky to have the expression level of a specific transcript rely only on a handful of supposedly isoform-specific probes. The higher the number of reads/probes, the lower the impact of probe-specific biases (GC content, relative position within the transcript, etc.). I believe there is still too much unappreciated bias in RNA-seq experiments to let us define a reliable gold-standard probe set.

Plus, you have to assume that the whole transcriptome of your species of interest is entirely known and well established, which is almost systematically disproved by exploratory transcriptomic studies, even in model organisms. OK, I know this is not crucial in the context of measuring the expression of known genes, but I like to repeat that statement.

Sounds like an alternative normalization approach to me: do not consider all of the signal, but just the part you predefined as significant/specific. Like Digital Gene Expression or SAGE. Not sure a reduction of the information is the best strategy, or at least not for now. I prefer what Cufflinks and Scripture are doing. Just my 2 cents.
Old 06-10-2011, 06:22 AM   #5
BAMseek
Senior Member
 
Location: St. Louis, MO, USA

Join Date: Apr 2011
Posts: 124
Default

Hi steven,
Thanks for the very useful comments! I agree that looking at only a portion of a transcript may not give the best idea of what is going on, due to non-uniform coverage of reads across the transcript and differences in GC-content. From my experiences looking at visualizations of alternative splicing, the eye is usually drawn to exons or splice junctions that only appear in one of the isoforms to determine which isoform is being expressed. So I thought it might be nice to simplify things and measure those interesting locations directly. I guess my concern is that when the data is transformed by doing a dash of GC-correction here and a pinch of quantile normalization there, the model gets pretty complex and it is difficult to get useful statistics out of it.

I think you are right about the similarities of the approach I described with SAGE/digital gene counting. Tag based approaches seem nice because you don't have to worry about differences in transcript length or fragmentation biases. Of course, you can't do some of the cooler stuff like transcript level expression or alternative splicing detection. I guess I am surprised I don't see tag-based approaches used more often for gene-level expression since it might simplify the analysis.

Thanks again. I will definitely explore the links you sent.

BAMseek
Old 06-10-2011, 09:16 AM   #6
Joann
Senior Member
 
Location: Woodbridge CT

Join Date: Oct 2008
Posts: 231
Default my assumptions

Hello:
I am assuming that you are talking about using the proposed approach to look at a set state of regulated expression in a well characterized cell system, not a random piece of species.

A situation of protein secretion, for example, or induction of hemoglobin synthesis comes to mind but there are many examples of differential isozyme expression in past literature as well.

From previous studies you would be able to tell just about how much message to expect around a target gene's expression, as well as a lot about what else is not being expressed, so whatever you find outside expectations would be novel and interesting if it were real.

Tune the biological system for support of your methodology exploration.
Old 06-10-2011, 10:37 AM   #7
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Case (1): For longer reads, it is more important to perform local alignment instead of the glocal alignment most mappers currently do. Local alignment does not have the problems you are describing.

Case (2): What matters more is the mean rather than the variance, and the statistical fluctuation due to variable transcript lengths can be corrected.

I think that while case (1) might be problematic to a limited extent, since we do not do local alignment often, case (2) is not a problem. Another argument against longer reads is the cost.
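The length correction in case (2) can be seen with a toy model. Under a simple Poisson-style picture, the expected read count is proportional to transcript length at equal expression per copy, so dividing by length recovers comparable means. All numbers and the depth factor below are illustrative, not from any real experiment:

```python
# Toy illustration of the length effect: at equal expression, expected
# counts scale with transcript length, and a per-kilobase normalization
# equalizes the means (though longer genes still enjoy tighter relative
# fluctuations, hence more detection power).

def expected_count(expression, length_bp, depth_factor=0.1):
    """Expected reads for a transcript: expression level times length
    times an arbitrary sequencing-depth factor (units illustrative)."""
    return expression * length_bp * depth_factor


def per_kb(count, length_bp):
    """Length-normalized count (reads per kilobase)."""
    return count / (length_bp / 1000.0)


short = expected_count(expression=2.0, length_bp=1000)    # 200 reads
long_ = expected_count(expression=2.0, length_bp=10000)   # 2000 reads
```

The raw counts differ tenfold purely because of length, but `per_kb` brings both transcripts back to the same normalized value.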

By using probes, you may be dropping informative data. It is a valid method, but I guess it is less powerful than looking at the full data set (I cannot predict how much less). Also, my impression is that with sophisticated tools such as Cufflinks, measuring gene expression nowadays is not a particularly hard problem. It is not necessary to trade away information in the data for reduced computing time.
Old 06-13-2011, 05:52 PM   #8
BAMseek
Senior Member
 
Location: St. Louis, MO, USA

Join Date: Apr 2011
Posts: 124
Default

Hi Heng,

Thanks a lot for your reply! I agree that as reads get longer, more consideration will need to be given to local re-alignment. I wonder how important it is to reconsider the defaults used for many of the short-read aligners out there as read lengths go beyond what they were originally intended for. Maybe a shorter portion of a long read could be used to quickly find potential hits, followed by a more exhaustive local alignment, similar to the BLAST approach. I know SHRiMP does a local Smith-Waterman, but at the expense of speed compared to other aligners (at least that has been my experience).

I still think how best to do differential expression of genes is an open question. Cuffdiff looks at FPKM values, and I've seen some suggestions that dividing by transcript length and total reads may be too simple a normalization (since a majority of the expression could come from a minority of the genes, and changes in highly expressed genes could affect total reads). Cuffdiff will optionally use the Bullard upper-quartile normalization in the FPKM computation, but how would you know when to use it? Some approaches that work off raw counts would tell you not to normalize at all, but then I don't think they address the transcript-length bias. If an approach works off just the raw reads, I would think it would either need to look at equal-sized regions (which is why I thought about aligning to equal-sized "probes") or somehow account for differences in transcript length.
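The two normalizations being contrasted above can be written out directly. This is a sketch of the standard formulas, not Cuffdiff's exact implementation: FPKM divides by transcript length and total mapped fragments, while a Bullard-style upper-quartile scheme rescales by the 75th percentile of the nonzero gene counts instead of the total.

```python
def fpkm(count, length_bp, total_fragments):
    """Fragments Per Kilobase of transcript per Million mapped
    fragments: count / (length in kb * total fragments in millions)."""
    return count * 1e9 / (length_bp * total_fragments)


def upper_quartile_factor(counts):
    """Bullard-style scaling factor: the 75th percentile of the nonzero
    counts. Dividing by this, instead of by the library total, makes
    the normalization robust to a few very highly expressed genes
    dominating the total read count."""
    nonzero = sorted(c for c in counts if c > 0)
    idx = int(0.75 * (len(nonzero) - 1))
    return nonzero[idx]
```

The second function shows why the upper quartile helps with exactly the concern raised here: one gene jumping from high to astronomically high expression moves the library total a lot but barely moves the 75th percentile.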

I admit I am still trying to wrap my head around all this, so please forgive me if I am mistaken.

thanks!
BAMseek
Old 06-14-2011, 07:05 AM   #9
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

As to local alignment, you may consider bwa-sw, probably with "-T20" and possibly with "-z5". I kind of think given >100bp RNA-seq reads, we should do local alignment more often, but I am not in this field, so do not really know if this is a good idea.
Old 06-15-2011, 04:01 AM   #10
BAMseek
Senior Member
 
Location: St. Louis, MO, USA

Join Date: Apr 2011
Posts: 124
Default

Quote:
As to local alignment, you may consider bwa-sw, probably with "-T20" and possibly with "-z5". I kind of think given >100bp RNA-seq reads, we should do local alignment more often, but I am not in this field, so do not really know if this is a good idea.
I will definitely try that out. Thanks!