Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • trim 454 adapter - standards?

    Hi all,

    Relatively simple question, but then I'll go into some details after the main question.

    If a 454 sequence adapter has sequencing errors (or actual indels) is the standard to throw away the sequence completely as unreliable/low quality? Or is it to identify the adapter as best as possible and chop that off?

    I have some 454 data that I am using for a de novo assembly. These data are used for error-correcting PacBio reads (note: possibly more on that in other threads later) as well as going into the assembly itself, which I'm performing in MIRA and fine tuning in phrap/swat/crossmatch/consed with Sanger reads added. In any case, I used sff_extract (I believe the version supplied within MIRA as a 3rd party tool, rather than the sff_extract from seq_crumbs/biopython) to generate a fastq and traceinfo.xml. In any case, this masks the adapter (as well as low-quality reads), but I think it only masks if the adapter matches 100%. Additionally, this masking is just noted in the traceinfo.xml file, which, although used by MIRA appears to be ignored by the error-correcting software pacBioToCA. So, for purposes of the pacBioToCA, I need to trim (rather than mask) the adapter sequence and I don't know of a tool that uses the traceinfo file for this purpose (there's no sff_extract option that I know that trims rather than masks).

    In any case, I'm reverse-comp'ing my sequence with the fastx toolkit, then trimming the adapter off with scythe. However, of the 225,000 reads, 1400 don't have the sequence close enough for scythe to recognize it with one of two adapter variations. I've found an additional 3 slight variations which account for 700 of the remaining 1400.

    During the MIRA assembly process, it complained about "megahubs" which may (by my estimation) be due to either residual adapter tags or the presence of a 16-fold transposable element.

    So, do I throw away the 1400 sequences entirely (or rather 5700 if requiring an exact match to the primary adapter sequence)? or do I leave the adapters on the 700 I can't identify with my five linkers? Do I perform a manual chop of 11 bases (or so) of sequences that don't have a readily-identifiable adapter? Is there an easy way of passing on my identifications to the traceinfo.xml or of re-generating it using alternate adapter tags? sff_extract gives an option to input a sequence for paired-end linkers, but there doesn't appear to be an option for alternate adapter sequences.

  • #2
    follow-up

    The version of sff_extract shipped as a 3rd party tool with MIRA 3.9 (developmental version) has a -c (cut) option. This appears to cut off both the 5' adapter as well as low-quality sequence at the 3' end.

    There is no note as to the methodology of adapter identification: does it look through your sequence to find the most common start? Does it chop off a fixed number of nucleotides from the 5' end without checking if it has already been processed? Is there a way to supply a custom adapter sequence? As without the -c flag, there were a variable number of bases chopped, it appears that this seems unlikely.

    Scythe appears to be a BIT more transparent with methodology, but at least sff_extract cleans up the xml file.

    Comment


    • #3
      I just came across this unanswered thread ... my 2c below.

      Originally posted by pag View Post
      If a 454 sequence adapter has sequencing errors (or actual indels) is the standard to throw away the sequence completely as unreliable/low quality? Or is it to identify the adapter as best as possible and chop that off?
      First of all, not every read has an adapter, imagine a long sample insert with an adapter present way too far on the right side, which was never reached by sequencer during the fixed amount of sequencing flows. Such read sequences just don't have an adapter and increase the number of those seemingly lacking it. So, if your software does not find an adapter, in some cases it really had no chance to find one. The proportion between reads bearing an adapter and not depends how lab technicians purified the and size-selected the sample.

      Second, it doesn't seem anybody applied such strict criteria to discard reads if there was a sequencing error in adapter. The top-most errors are CAFIE errors, btw. In other words, most adapters contain an error. See some of the charts on my website (link below).

      I prefer to identify the adapter and chop it off while keeping low-qual sequence.

      Originally posted by pag View Post
      I have some 454 data that I am using for a de novo assembly. These data are used for error-correcting PacBio reads (note: possibly more on that in other threads later) as well as going into the assembly itself, which I'm performing in MIRA and fine tuning in phrap/swat/crossmatch/consed with Sanger reads added. In any case, I used sff_extract (I believe the version supplied within MIRA as a 3rd party tool, rather than the sff_extract from seq_crumbs/biopython) to generate a fastq and traceinfo.xml. In any case, this masks the adapter (as well as low-quality reads), but I think it only masks if the adapter matches 100%.
      Yes, it uses SSAHA2 ... for linker detection only I think. Anyway, in my hands SSAHA* was not sensitive enough for adapter detection. And, there were more types of adapters Roche used throughout the years and "Mira's definitive guide" manual isn't their complete listing, at all.

      Originally posted by pag View Post
      Additionally, this masking is just noted in the traceinfo.xml file, which, although used by MIRA appears to be ignored by the error-correcting software pacBioToCA. So, for purposes of the pacBioToCA, I need to trim (rather than mask) the adapter sequence and I don't know of a tool that uses the traceinfo file for this purpose (there's no sff_extract option that I know that trims rather than masks).
      You can tell mira just to respect uppercase letters in FASTA+QUAL file, you just delete the traceinfo.xml and from the FASTA+QUAL files you drop the lowercase letters/quals using some parser.


      Originally posted by pag View Post
      In any case, I'm reverse-comp'ing my sequence with the fastx toolkit, then trimming the adapter off with scythe. However, of the 225,000 reads, 1400 don't have the sequence close enough for scythe to recognize it with one of two adapter variations. I've found an additional 3 slight variations which account for 700 of the remaining 1400.
      Doh, there are many types of errors in each type of the Roche adapters, and their occurence differs with each dataset. Maybe the top-most 10 or 20 occur relatively stably but still, their incidence is shuffled. The "minor-ones" ... appear or don't in one or another dataset.

      Originally posted by pag View Post
      During the MIRA assembly process, it complained about "megahubs" which may (by my estimation) be due to either residual adapter tags or the presence of a 16-fold transposable element.

      So, do I throw away the 1400 sequences entirely (or rather 5700 if requiring an exact match to the primary adapter sequence)?
      Either way, you have to look for more screwed incarnations of adapters in the data. Discarding your 1400 or even 5700 reads will not help.

      Originally posted by pag View Post
      or do I leave the adapters on the 700 I can't identify with my five linkers?
      I wouldn't, convert them into lowercase letters before feeding Mira with them (while not providing the traceinfo file). Note: I hope you made a typo and meant an adapter, not linker. Or do you mean MID sequence (hence the 11nt mentioned further)?

      Originally posted by pag View Post
      Do I perform a manual chop of 11 bases (or so) of sequences that don't have a readily-identifiable adapter?
      You are messing up something here. 11nt is the size of Rapid-Library-specific MIDs. But no, nobody chops blindly last 11nt for rcRLMID or even 19nt, 25nt, 30nt, 32nt, 33nt or 44nt off 454-based raw reads would you speak of adapters. It is inefficient, the adapters are typically inflated by sequencing errors, and mainly followed on their right side by some seemingly correct sequence (possibly due to well-to-well crosstalk). So, technically they are not always at the very right end.

      However, I do chop using my software blindly 11nt ahead of an adapter to trim away rcRLMID+adapter. It is an optional feature, sometimes the data is a mixture of MID-tagged and untagged reads.

      Originally posted by pag View Post
      Is there an easy way of passing on my identifications to the traceinfo.xml or of re-generating it using alternate adapter tags? sff_extract gives an option to input a sequence for paired-end linkers, but there doesn't appear to be an option for alternate adapter sequences.
      I developed my own tool for that, because of the poor sensitivity, wrong queries, no possibility to filter and tune matched candidates, no easy evaluation of matched/mis-matched sequyences, etc. I just coded something fundamentally different.

      Originally posted by pag View Post
      The version of sff_extract shipped as a 3rd party tool with MIRA 3.9 (developmental version) has a -c (cut) option. This appears to cut off both the 5' adapter as well as low-quality sequence at the 3' end.

      There is no note as to the methodology of adapter identification: does it look through your sequence to find the most common start?
      Common start? It uses SSAHA2 to align the query to a read, and if it does align then it treats the aligned region as a linker. Look into the sff_extract code, it is not so long.

      Originally posted by pag View Post
      Does it chop off a fixed number of nucleotides from the 5' end without checking if it has already been processed? Is there a way to supply a custom adapter sequence? As without the -c flag, there were a variable number of bases chopped, it appears that this seems unlikely.

      Scythe appears to be a BIT more transparent with methodology, but at least sff_extract cleans up the xml file.
      Last time I used sff_extract I found some off-by-one error in it, and reported it to authors. Other than that I do not remember much from the code. Looking into it now briefly tells me that you may be happy with:

      --min_left_clip if the left clip coming from the SFF is smaller than this value
      , override it, default=0

      The -c option (--clip) ensures that the "lowercase" sequence is dropped from output, but this just takes over the knowledge Roche pipeline put into the SFF file.

      And, unless I have overlooked the sequence, sff_extract does not look for any adapters, only for linker sequence which you have to include in an external file. That is used to split the paired reads.


      A small advert of my work: some figures at http://www.bioinformatics.cz show incidence of sequencing errors in adapters but are also mixed with incidence of other artefacts in them. Doing the work right took me a few years of researching, coding and testing. You had good questions and ideas but ... detecting a short sequence masked by sequencing errors ask for thorough tuning of sensitivity while watchinh specificity. It is a lot of work and testing. Maybe you happily become my customer instead? ;-)
      My tool can look even for custom MID sequences, not only adapters. And can trim away blindly for missed rcRLMIDs. Roche tools do not trim away rcRLMIDs at all, seems it was always anticipated they will be ignored by amplicon mapper (they just won't align to template) and they never provided a tool to mask them for de novo assemblies. Well, I spent an insane amount of time on 454 datasets fetched from SRA. Next time I tell you how many assembled genomes/transcriptomes are poisoned with one or another type of artefact, adapter, MID tag, and which labs were contaminated, swapped samples, protocols, .... Oh dear.

      Martin

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Current Approaches to Protein Sequencing
        by seqadmin


        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
        04-04-2024, 04:25 PM
      • seqadmin
        Strategies for Sequencing Challenging Samples
        by seqadmin


        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
        03-22-2024, 06:39 AM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, 04-11-2024, 12:08 PM
      0 responses
      23 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 10:19 PM
      0 responses
      24 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 09:21 AM
      0 responses
      21 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-04-2024, 09:00 AM
      0 responses
      52 views
      0 likes
      Last Post seqadmin  
      Working...
      X