Hi all,
I have a relatively simple question, followed by some background details.
If a 454 sequencing adapter contains sequencing errors (or actual indels), is the standard practice to throw the read away completely as unreliable/low quality? Or is it to identify the adapter as best as possible and chop it off?
I have some 454 data that I am using for a de novo assembly. These data are used for error-correcting PacBio reads (note: possibly more on that in other threads later) as well as going into the assembly itself, which I'm performing in MIRA and fine-tuning in phrap/swat/crossmatch/consed with Sanger reads added. I used sff_extract (I believe the version supplied with MIRA as a third-party tool, rather than the sff_extract from seq_crumbs/biopython) to generate a fastq file and a traceinfo.xml file. This masks the adapter (as well as low-quality reads), but I think it only masks the adapter if it matches 100%. Additionally, the masking is only recorded in the traceinfo.xml file, which is used by MIRA but appears to be ignored by the error-correcting software pacBioToCA. So, for pacBioToCA, I need to trim (rather than mask) the adapter sequence, and I don't know of a tool that uses the traceinfo file for this purpose (as far as I know, there is no sff_extract option that trims rather than masks).
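One way to turn the masking into actual trimming is to apply the clip points from traceinfo.xml directly to the reads. The following is only a minimal Python sketch, not an existing tool. It assumes the file uses the NCBI TraceInfo element names `trace_name`, `clip_quality_left`, `clip_vector_left`, `clip_quality_right` and `clip_vector_right`, with 1-based inclusive coordinates; those names and the coordinate convention should be checked against the real file. The function names `read_clips` and `apply_clip` are made up for illustration.

```python
import xml.etree.ElementTree as ET


def read_clips(xml_text):
    """Map read name -> (left, right) clip points (assumed 1-based, inclusive).

    right is None when the file gives no right clip for that read.
    """
    clips = {}
    root = ET.fromstring(xml_text)
    for trace in root.iter("trace"):
        name = trace.findtext("trace_name")
        lefts, rights = [], []
        # Element names below are an assumption about the traceinfo.xml layout.
        for tag, bucket in (("clip_quality_left", lefts),
                            ("clip_vector_left", lefts),
                            ("clip_quality_right", rights),
                            ("clip_vector_right", rights)):
            value = trace.findtext(tag)
            if value and value.strip():
                bucket.append(int(value))
        # Keep the most conservative window: latest start, earliest end.
        left = max(lefts) if lefts else 1
        right = min(rights) if rights else None
        clips[name.strip()] = (left, right)
    return clips


def apply_clip(seq, qual, left, right):
    """Return the sequence and quality string restricted to the clip window."""
    end = len(seq) if right is None else right
    return seq[left - 1:end], qual[left - 1:end]
```

Each FASTQ record would then be looked up by name in the dictionary returned by `read_clips` and rewritten with `apply_clip` before being passed to pacBioToCA.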
For now, I'm reverse-complementing my sequences with the FASTX-Toolkit, then trimming the adapter off with scythe. However, of the 225,000 reads, 1400 don't contain a sequence close enough for scythe to recognize it as one of two adapter variations. I've found an additional 3 slight variations, which account for 700 of those 1400.
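To illustrate what "close enough" could mean for the reads that escape trimming, here is a rough Python sketch of approximate adapter matching that tolerates mismatches and indels. It is not how scythe works (scythe uses its own statistical model); it is a plain edit-distance search (Sellers' algorithm) that reports where the best adapter match starts so the read can be cut there. The adapter string in the usage lines and the `max_errors` threshold are placeholders, not the real 454 adapter or a recommended setting.

```python
def find_adapter(read, adapter, max_errors=2):
    """Return (start, end, errors) of the best approximate adapter match
    in read (0-based start, exclusive end), or None if no match has
    at most max_errors edits (mismatches, insertions, deletions)."""
    m = len(adapter)
    # Column for the empty read prefix: adapter[:i] costs i edits.
    prev_cost = list(range(m + 1))
    prev_start = [0] * (m + 1)
    best = None
    for j, base in enumerate(read, start=1):
        cost = [0] * (m + 1)     # row 0 stays 0: a match may start anywhere
        start = [j] * (m + 1)    # an empty match ending here starts at j
        for i in range(1, m + 1):
            sub = prev_cost[i - 1] + (adapter[i - 1] != base)
            extra_read_base = prev_cost[i] + 1
            missing_read_base = cost[i - 1] + 1
            c = min(sub, extra_read_base, missing_read_base)
            cost[i] = c
            if c == sub:
                start[i] = prev_start[i - 1]
            elif c == extra_read_base:
                start[i] = prev_start[i]
            else:
                start[i] = start[i - 1]
        if cost[m] <= max_errors and (best is None or cost[m] < best[2]):
            best = (start[m], j, cost[m])
        prev_cost, prev_start = cost, start
    return best


def trim_adapter(read, qual, adapter, max_errors=2):
    """Cut the read (and its qualities) at the start of the adapter match."""
    hit = find_adapter(read, adapter, max_errors)
    if hit is None:
        return read, qual
    return read[:hit[0]], qual[:hit[0]]


# Placeholder adapter for illustration only.
ADAPTER = "CTGAGACTGCC"
print(find_adapter("AAAACCCCGGGG" + "CTGATACTGCC", ADAPTER))
```

Reads for which `find_adapter` returns None at a reasonable threshold are the ones where the discard-or-keep question really applies.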
During the assembly, MIRA complained about "megahubs", which (by my estimation) may be due to either residual adapter tags or the presence of a 16-fold transposable element.
So, do I throw away the 1400 sequences entirely (or rather 5700 if requiring an exact match to the primary adapter sequence)? Or do I leave the adapters on the 700 I can't identify with my five linkers? Do I manually chop 11 bases (or so) off the sequences that don't have a readily identifiable adapter? Is there an easy way of passing my identifications on to the traceinfo.xml, or of regenerating it using alternate adapter tags? sff_extract has an option to input a sequence for paired-end linkers, but there doesn't appear to be an option for alternate adapter sequences.