For programs such as DESeq that use raw read counts per gene, the longer reads from 454 give a lower read count per gene than the Illumina platform, even though each read carries more information. This will become even more pronounced when 454 releases its 800 bp chemistry in July this year, at which point a single mapped read will be equivalent to 6-10 Illumina reads.
I have been thinking about either counting the bases in each mapped read or splitting reads into smaller k-mers of 30-100 bp each and using the resulting count instead of the raw read count. Programs such as DESeq have reduced power when read counts are low, so counting a 400 bp read as 4-8 reads would give the analysis more power.
The reads could be split either before mapping (in which case not all k-mers from a read may map) or after mapping (which may artificially inflate the count if the read was only partially mapped).
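To make the two options concrete, here is a rough Python sketch of what I have in mind. The function names and the fragment size of 50 bp are just placeholders for illustration; none of this is part of DESeq or any mapper. For the post-mapping option, the sketch counts only the aligned bases rather than the full read length, which is one possible way to avoid inflating the count for partially mapped reads.

```python
def split_read(seq, k=50):
    """Pre-mapping option: cut a read into consecutive, non-overlapping
    fragments of k bp. A trailing piece shorter than k is dropped."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def pseudo_count(aligned_bases, k=50):
    """Post-mapping option: number of whole k-bp units covered by the
    aligned part of a read (not by the full read length)."""
    return aligned_bases // k


# A fully mapped 400 bp read contributes 8 pseudo-reads at k = 50,
# while a 400 bp read with only 220 bp aligned contributes 4.
fragments = split_read("A" * 400, k=50)
full = pseudo_count(400, k=50)
partial = pseudo_count(220, k=50)
```

Either way, the per-gene value passed to the count-based analysis would be the sum of these pseudo-reads rather than the number of original reads.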
Can anyone see any problems with this approach?
I couldn't find any previous threads using a few searches, but if this has been discussed before please feel free to link it.