I am wondering if anyone else has seen the following problem. I am using RNAseq to identify viral sequences among transcripts from an infected plant. Assembly of a single chip of Ion Torrent data was taking far too long (> 1 week, but that's a separate issue), so I decided to use BLAST to identify reads matching short pieces of reference sequence from the virus in question. The matching reads were then assembled with Newbler 2.8. Three contigs were produced. The largest was 678 bp long, contained roughly 80% of the reads going into the assembly (3,700 of 4,500) and was reported as high quality (most bases were Phred 64). The problem is that none of the resulting contigs match the viral sequences used to identify the constituent reads. Nor do any of the constituent reads match the assembly!
Most of the reads had very good matches to the reference (< 3 mismatches). Assembling the same reads with phrap (using reads extracted with sffinfo) produced contigs that have lower reported quality values but DO match the viral references. Thus Newbler is outputting assemblies which it reports as high quality but which are, in fact, complete garbage.
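For anyone who wants to run the same sanity check on their own data, here is a rough sketch of one way to test whether the input reads actually support a contig, without relying on the assembler's own quality values. It is not what I ran above (I used BLAST); it simply counts the reads that share at least one exact k-mer with the contig on either strand. The function names, the default k and the threshold are all made up for illustration, and exact k-mer matching will undercount on Ion Torrent data because of homopolymer errors, so treat a low fraction as a warning sign rather than proof.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def kmers(seq, k):
    """Return the set of all k-mers in seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fraction_reads_supporting(contig, reads, k=21, min_shared=1):
    """Fraction of reads sharing at least min_shared exact k-mers with the contig.

    Both strands of the contig are considered. k and min_shared are
    illustrative defaults, not tuned values.
    """
    if not reads:
        return 0.0
    contig_kmers = kmers(contig, k) | kmers(reverse_complement(contig), k)
    supported = sum(
        1 for read in reads if len(kmers(read, k) & contig_kmers) >= min_shared
    )
    return supported / len(reads)


if __name__ == "__main__":
    # Toy sequences, invented purely to show the call pattern.
    contig = "ACGTTGCATGCCGATAGCTAGGCTAACGTTAGC"
    reads = [
        contig[0:20],                       # forward-strand read
        reverse_complement(contig[10:30]),  # reverse-strand read
        "TTTTTTTTTTTTTTTTTTTT",             # unrelated read
    ]
    print(fraction_reads_supporting(contig, reads, k=8))
```

A contig that really was built from the input reads should give a fraction close to the proportion of reads the assembler says it used. A value near zero for a contig that supposedly contains most of the reads would reproduce the mismatch described above.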