Greetings all,
I like next-gen sequencing as much as the next person, but I find it curious that it might suffer from some limitations not seen in microarrays. I have two cases in mind:
(1) Longer reads seem great, but they make alignment to a reference genome more difficult: a longer read accumulates more mismatches and is more likely to cross multiple exon-exon boundaries or structural variants.
And (2) as has been pointed out in some papers, statistical tests for differential expression gain power as read counts increase, and longer genes and transcripts produce more reads at the same expression level. That creates a length bias: some genes might show up at the top of your differential list just because they are long.
Microarrays do not seem to have these problems, and maybe that is because they measure intensity values from equal-length probes. So my question is: would it make sense to detect expression levels from a set of probes, similar to how microarrays do things? Here is how I imagine it would go . . .
- Generate a list of "pseudo-probes" of the same length, as many as you want (hey, no need to worry about actually manufacturing a custom array). Ideally, each probe would uniquely identify some genomic feature. For example, a probe might cross an exon-exon boundary specific to a particular transcript, so any read aligning to that probe would be evidence for that transcript (see the sketch after this list).
- Align your reads to the probes. This should be pretty fast, and longer reads become an advantage rather than a nuisance, since they have a greater chance of covering one of your probes.
- Run your differential analysis on these probe counts (without having to worry about transcript-length bias) and relate them back to the genomic features of interest.
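
To make the first two steps concrete, here is a minimal sketch in Python of what probe building and read counting might look like. Everything in it is hypothetical: the probe length, the function names (make_junction_probe, count_probe_hits), and the toy exon sequences are just illustrations of the idea; a real implementation would index the probes and allow mismatches rather than scan for exact substrings.

    # Minimal sketch of the pseudo-probe idea. All names, lengths,
    # and sequences are illustrative assumptions, not a real tool.

    PROBE_LEN = 50  # fixed probe length (hypothetical choice)

    def make_junction_probe(left_exon_seq, right_exon_seq, probe_len=PROBE_LEN):
        """Build a probe straddling an exon-exon junction: half from the
        end of the upstream exon, half from the start of the downstream exon."""
        half = probe_len // 2
        return left_exon_seq[-half:] + right_exon_seq[:probe_len - half]

    def count_probe_hits(reads, probes):
        """Count, for each probe, how many reads contain it exactly.
        (A real version would allow mismatches and index the probe set.)"""
        counts = {name: 0 for name in probes}
        for read in reads:
            for name, probe in probes.items():
                if probe in read:
                    counts[name] += 1
        return counts

    # Toy example: two exons of one hypothetical transcript.
    exon1 = "ACGTACGTACGTACGTACGTACGTACGT"
    exon2 = "TTGACCTTGACCTTGACCTTGACCTTGA"
    probes = {"txA_junction_1_2": make_junction_probe(exon1, exon2, probe_len=20)}

    reads = [
        "GGG" + probes["txA_junction_1_2"] + "CCC",  # read spanning the junction
        "ACGTACGTACGTACGTACGT",                      # read entirely within exon1
    ]
    print(count_probe_hits(reads, probes))  # {'txA_junction_1_2': 1}

Only the junction-spanning read counts as evidence for the transcript, which is exactly the behavior you would want from a transcript-specific probe.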
Here are the advantages as I see them:
- Alignment would be quick, since you are only aligning against a fixed probe set
- Alignment would not get harder as read length increases. For example, the number of mismatches allowed within a probe could stay constant even as your reads get longer.
- Could eliminate the effect of gene length on statistical power, since every probe is the same length
- "Cross-hybridization" could be explicitly measured as the sequence similarity (such as edit distance) of two probes.
- Would not have to worry about modeling binding affinities for each probe, since we can read the sequence directly and check whether it matches the probe, instead of relying on the physical properties of binding between nucleotides.
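
Here is a small sketch of that cross-hybridization score: plain Levenshtein edit distance between two hypothetical probes. This is just one reasonable choice; a real probe designer might use alignment scores or shared k-mer counts instead, but the idea is the same: the smaller the distance, the more likely two probes will attract each other's reads.

    def edit_distance(a, b):
        """Classic dynamic-programming Levenshtein distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution / match
            prev = curr
        return prev[-1]

    # Two probes differing at a single base are at high risk of
    # "cross-hybridizing", i.e., soaking up each other's reads.
    p1 = "GTACGTACGTTTGACCTTGA"
    p2 = "GTACGTACGATTGACCTTGA"
    print(edit_distance(p1, p2))  # 1 -> very similar; flag this probe pair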
Of course, you wouldn't be able to use this approach to find structural variation or novel isoforms. But if what you are interested in is differential expression of known transcripts (and maybe that is a good place to start), then why not make your alignment and analysis job easier?
Just some thoughts. I may be way off base, but I wanted to pitch the idea, and, all baseball analogies aside, I would be interested to hear others' comments and thoughts.
Thanks!
BAMseek