SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Quality-, adapter- and RRBS-trimming with Trim Galore! fkrueger Bioinformatics 136 01-18-2019 02:24 AM
trim adapter from Illumina Genome Analyzer IIe miRNA reads NicoBxl Bioinformatics 5 01-02-2014 06:31 AM
Trim adapter using CLCBio azleen Bioinformatics 1 04-02-2013 06:37 AM
PCR-involved 454 pyrosequencing does not meet standards of qPCR linkingchen 454 Pyrosequencing 4 03-14-2013 05:45 AM
trim 3' adapter sequence for mRNA-Seq? slny Bioinformatics 14 06-14-2011 07:15 AM

Reply
 
Thread Tools
Old 04-18-2013, 06:16 PM   #1
pag
Member
 
Location: CA, USA

Join Date: May 2012
Posts: 72
Default trim 454 adapter - standards?

Hi all,

Relatively simple question, but then I'll go into some details after the main question.

If a 454 sequence adapter has sequencing errors (or actual indels) is the standard to throw away the sequence completely as unreliable/low quality? Or is it to identify the adapter as best as possible and chop that off?

I have some 454 data that I am using for a de novo assembly. These data are used for error-correcting PacBio reads (note: possibly more on that in other threads later) as well as going into the assembly itself, which I'm performing in MIRA and fine tuning in phrap/swat/crossmatch/consed with Sanger reads added. In any case, I used sff_extract (I believe the version supplied within MIRA as a 3rd party tool, rather than the sff_extract from seq_crumbs/biopython) to generate a fastq and traceinfo.xml. In any case, this masks the adapter (as well as low-quality reads), but I think it only masks if the adapter matches 100%. Additionally, this masking is just noted in the traceinfo.xml file, which, although used by MIRA appears to be ignored by the error-correcting software pacBioToCA. So, for purposes of the pacBioToCA, I need to trim (rather than mask) the adapter sequence and I don't know of a tool that uses the traceinfo file for this purpose (there's no sff_extract option that I know that trims rather than masks).

In any case, I'm reverse-comp'ing my sequence with the fastx toolkit, then trimming the adapter off with scythe. However, of the 225,000 reads, 1400 don't have the sequence close enough for scythe to recognize it with one of two adapter variations. I've found an additional 3 slight variations which account for 700 of the remaining 1400.

During the MIRA assembly process, it complained about "megahubs" which may (by my estimation) be due to either residual adapter tags or the presence of a 16-fold transposable element.

So, do I throw away the 1400 sequences entirely (or rather 5700 if requiring an exact match to the primary adapter sequence)? or do I leave the adapters on the 700 I can't identify with my five linkers? Do I perform a manual chop of 11 bases (or so) of sequences that don't have a readily-identifiable adapter? Is there an easy way of passing on my identifications to the traceinfo.xml or of re-generating it using alternate adapter tags? sff_extract gives an option to input a sequence for paired-end linkers, but there doesn't appear to be an option for alternate adapter sequences.
pag is offline   Reply With Quote
Old 04-19-2013, 03:00 PM   #2
pag
Member
 
Location: CA, USA

Join Date: May 2012
Posts: 72
Default follow-up

The version of sff_extract shipped as a 3rd party tool with MIRA 3.9 (developmental version) has a -c (cut) option. This appears to cut off both the 5' adapter as well as low-quality sequence at the 3' end.

There is no note as to the methodology of adapter identification: does it look through your sequence to find the most common start? Does it chop off a fixed number of nucleotides from the 5' end without checking if it has already been processed? Is there a way to supply a custom adapter sequence? As without the -c flag, there were a variable number of bases chopped, it appears that this seems unlikely.

Scythe appears to be a BIT more transparent with methodology, but at least sff_extract cleans up the xml file.
pag is offline   Reply With Quote
Old 12-19-2014, 12:31 PM   #3
martin2
Member
 
Location: Prague, Czech Republic

Join Date: Nov 2010
Posts: 40
Default

I just came across this unanswered thread ... my 2c below.

Quote:
Originally Posted by pag View Post
If a 454 sequence adapter has sequencing errors (or actual indels) is the standard to throw away the sequence completely as unreliable/low quality? Or is it to identify the adapter as best as possible and chop that off?
First of all, not every read has an adapter, imagine a long sample insert with an adapter present way too far on the right side, which was never reached by sequencer during the fixed amount of sequencing flows. Such read sequences just don't have an adapter and increase the number of those seemingly lacking it. So, if your software does not find an adapter, in some cases it really had no chance to find one. The proportion between reads bearing an adapter and not depends how lab technicians purified the and size-selected the sample.

Second, it doesn't seem anybody applied such strict criteria to discard reads if there was a sequencing error in adapter. The top-most errors are CAFIE errors, btw. In other words, most adapters contain an error. See some of the charts on my website (link below).

I prefer to identify the adapter and chop it off while keeping low-qual sequence.

Quote:
Originally Posted by pag View Post
I have some 454 data that I am using for a de novo assembly. These data are used for error-correcting PacBio reads (note: possibly more on that in other threads later) as well as going into the assembly itself, which I'm performing in MIRA and fine tuning in phrap/swat/crossmatch/consed with Sanger reads added. In any case, I used sff_extract (I believe the version supplied within MIRA as a 3rd party tool, rather than the sff_extract from seq_crumbs/biopython) to generate a fastq and traceinfo.xml. In any case, this masks the adapter (as well as low-quality reads), but I think it only masks if the adapter matches 100%.
Yes, it uses SSAHA2 ... for linker detection only I think. Anyway, in my hands SSAHA* was not sensitive enough for adapter detection. And, there were more types of adapters Roche used throughout the years and "Mira's definitive guide" manual isn't their complete listing, at all.

Quote:
Originally Posted by pag View Post
Additionally, this masking is just noted in the traceinfo.xml file, which, although used by MIRA appears to be ignored by the error-correcting software pacBioToCA. So, for purposes of the pacBioToCA, I need to trim (rather than mask) the adapter sequence and I don't know of a tool that uses the traceinfo file for this purpose (there's no sff_extract option that I know that trims rather than masks).
You can tell mira just to respect uppercase letters in FASTA+QUAL file, you just delete the traceinfo.xml and from the FASTA+QUAL files you drop the lowercase letters/quals using some parser.


Quote:
Originally Posted by pag View Post
In any case, I'm reverse-comp'ing my sequence with the fastx toolkit, then trimming the adapter off with scythe. However, of the 225,000 reads, 1400 don't have the sequence close enough for scythe to recognize it with one of two adapter variations. I've found an additional 3 slight variations which account for 700 of the remaining 1400.
Doh, there are many types of errors in each type of the Roche adapters, and their occurence differs with each dataset. Maybe the top-most 10 or 20 occur relatively stably but still, their incidence is shuffled. The "minor-ones" ... appear or don't in one or another dataset.

Quote:
Originally Posted by pag View Post
During the MIRA assembly process, it complained about "megahubs" which may (by my estimation) be due to either residual adapter tags or the presence of a 16-fold transposable element.

So, do I throw away the 1400 sequences entirely (or rather 5700 if requiring an exact match to the primary adapter sequence)?
Either way, you have to look for more screwed incarnations of adapters in the data. Discarding your 1400 or even 5700 reads will not help.

Quote:
Originally Posted by pag View Post
or do I leave the adapters on the 700 I can't identify with my five linkers?
I wouldn't, convert them into lowercase letters before feeding Mira with them (while not providing the traceinfo file). Note: I hope you made a typo and meant an adapter, not linker. Or do you mean MID sequence (hence the 11nt mentioned further)?

Quote:
Originally Posted by pag View Post
Do I perform a manual chop of 11 bases (or so) of sequences that don't have a readily-identifiable adapter?
You are messing up something here. 11nt is the size of Rapid-Library-specific MIDs. But no, nobody chops blindly last 11nt for rcRLMID or even 19nt, 25nt, 30nt, 32nt, 33nt or 44nt off 454-based raw reads would you speak of adapters. It is inefficient, the adapters are typically inflated by sequencing errors, and mainly followed on their right side by some seemingly correct sequence (possibly due to well-to-well crosstalk). So, technically they are not always at the very right end.

However, I do chop using my software blindly 11nt ahead of an adapter to trim away rcRLMID+adapter. It is an optional feature, sometimes the data is a mixture of MID-tagged and untagged reads.

Quote:
Originally Posted by pag View Post
Is there an easy way of passing on my identifications to the traceinfo.xml or of re-generating it using alternate adapter tags? sff_extract gives an option to input a sequence for paired-end linkers, but there doesn't appear to be an option for alternate adapter sequences.
I developed my own tool for that, because of the poor sensitivity, wrong queries, no possibility to filter and tune matched candidates, no easy evaluation of matched/mis-matched sequyences, etc. I just coded something fundamentally different.

Quote:
Originally Posted by pag View Post
The version of sff_extract shipped as a 3rd party tool with MIRA 3.9 (developmental version) has a -c (cut) option. This appears to cut off both the 5' adapter as well as low-quality sequence at the 3' end.

There is no note as to the methodology of adapter identification: does it look through your sequence to find the most common start?
Common start? It uses SSAHA2 to align the query to a read, and if it does align then it treats the aligned region as a linker. Look into the sff_extract code, it is not so long.

Quote:
Originally Posted by pag View Post
Does it chop off a fixed number of nucleotides from the 5' end without checking if it has already been processed? Is there a way to supply a custom adapter sequence? As without the -c flag, there were a variable number of bases chopped, it appears that this seems unlikely.

Scythe appears to be a BIT more transparent with methodology, but at least sff_extract cleans up the xml file.
Last time I used sff_extract I found some off-by-one error in it, and reported it to authors. Other than that I do not remember much from the code. Looking into it now briefly tells me that you may be happy with:

--min_left_clip if the left clip coming from the SFF is smaller than this value
, override it, default=0

The -c option (--clip) ensures that the "lowercase" sequence is dropped from output, but this just takes over the knowledge Roche pipeline put into the SFF file.

And, unless I have overlooked the sequence, sff_extract does not look for any adapters, only for linker sequence which you have to include in an external file. That is used to split the paired reads.


A small advert of my work: some figures at http://www.bioinformatics.cz show incidence of sequencing errors in adapters but are also mixed with incidence of other artefacts in them. Doing the work right took me a few years of researching, coding and testing. You had good questions and ideas but ... detecting a short sequence masked by sequencing errors ask for thorough tuning of sensitivity while watchinh specificity. It is a lot of work and testing. Maybe you happily become my customer instead? ;-)
My tool can look even for custom MID sequences, not only adapters. And can trim away blindly for missed rcRLMIDs. Roche tools do not trim away rcRLMIDs at all, seems it was always anticipated they will be ignored by amplicon mapper (they just won't align to template) and they never provided a tool to mask them for de novo assemblies. Well, I spent an insane amount of time on 454 datasets fetched from SRA. Next time I tell you how many assembled genomes/transcriptomes are poisoned with one or another type of artefact, adapter, MID tag, and which labs were contaminated, swapped samples, protocols, .... Oh dear.

Martin
martin2 is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:07 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO