  • help: how to clean paired-end Illumina reads before assembly

    Hello everyone,

    I have some challenges for the group, and any help or suggestions are welcome:

    I ran several genomes using the 8 kb paired-end protocol, at one genome per lane. The bioinformatics group at my facility has little experience with this and is struggling to help with my project, so here are the problems.

    A) The runs seem to be contaminated by chimeric fragments from the sequencing adapters used in making the paired-end data. Is there any software or script out there that can remove sequences matching the adapters while (and this is the key part) allowing a certain percentage of mismatches to the adapter sequence? This is to account for chimeric multi-primer sequences.
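
    To make the "certain percentage of mismatch" idea in (A) concrete, here is a minimal Python sketch, not an existing tool. It slides the adapter along the read and accepts a window when the fraction of mismatched bases is at or below a threshold. The adapter string, the function names and the default thresholds are all placeholders, and only substitutions are handled (no insertions or deletions).

```python
# Placeholder adapter: NOT the real linker from the protocol.
ADAPTER = "GATTACAGGCTC"

def hamming(a, b):
    """Count mismatches over the overlapping part of a and b."""
    return sum(x != y for x, y in zip(a, b))

def find_adapter(read, adapter=ADAPTER, max_mismatch_frac=0.1, min_overlap=8):
    """Return the 0-based start of the first adapter-like window, or -1.

    Windows near the 3' end may be shorter than the adapter; they are
    compared against the adapter prefix, down to min_overlap bases.
    """
    for start in range(0, len(read) - min_overlap + 1):
        window = read[start:start + len(adapter)]
        allowed = int(len(window) * max_mismatch_frac)
        if hamming(window, adapter) <= allowed:
            return start
    return -1

exact = find_adapter("AAAAAA" + "GATTACAGGCTC" + "TTTT")
one_mismatch = find_adapter("AAAAAA" + "GATTACAGGGTC" + "TTTT")
partial_3prime = find_adapter("ACGTACGTAC" + "GATTACAG")
absent = find_adapter("ACGTACGTACGTACGT")
```

    With the default 10% threshold, a full 12-base window tolerates one mismatch, while a short partial match at the 3' end must be exact.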

    B) The next problem is that the paired-end data also uses a central adapter, and the true paired-end data will be the pairs where the reads start at both ends of the genomic fragment, far from the central adapter (see the PDF protocol for more detail: http://www.illumina.com/applications...equencing.ilmn). However, because of the random shearing steps required, the technology cannot guarantee that the central adapter sits exactly in the center, so the 42 bp adapter and the genomic sequence can come in all possible combinations, as follows:
    1- sequence read + adapter read (this is the easy one, where a 3' trimming tool can do the job)
    2- adapter read + sequence read (this would need a 5' trimming tool). This can be tricky when the adapter part is as small as 2 or 3 bases, because those bases will later appear in the assembly. More importantly, a read that starts with adapter followed by the actual sequence is not a true 8 kb paired end. It is actually a paired end of the 500-base fragment placed in the sequencing reaction. This data should be trimmed and moved into a 500 b fragment file (along with its pair) and used that way, or just used as a single read.

    3- adapter in the center: sequence read + adapter read + sequence read. This case should be handled by a 3' end trimmer, but the trimmer should be able to recognize the adapter in the center of the read and not only at the end of the read, as trimmers are usually coded to do.

    4- the removal tool should be able to take action on pairs: e.g. if it kicks out one read of a pair as a chimeric primer, it should also throw away the second one (no chimeras allowed). The trimming tool should trim both reads of a pair together and place the trimmed pair as true 8 kb or 500 b; if one of the reads is eliminated because what is left is too short, the other read should go to a single-reads file.

    After all the sorting, filtering, etc., the reads should be organized into different files: true clipped paired ends of 8 kb, 500 b paired ends, and single reads, all after removing the chimeric/artifact reads coming from primer dimerization. This is a complex case, and my question is about your recommendations on which tool or set of tools would let me do all these steps, so I can use my ginormous amount of reads that have been kidnapped by all these issues.
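
    The sorting described in cases 1 to 4 could be sketched roughly as below. This is only an illustration under several assumptions: the linker is a placeholder string, it is found by exact match (a mismatch-tolerant search would be needed in practice, and 2 to 3 base linker remnants are not caught), the minimum length of 20 is arbitrary, and chimeric-primer detection is left out entirely. The bin names are made up for the example.

```python
LINKER = "GATTACAGGCTC"  # placeholder, NOT the real 42 bp linker
MIN_LEN = 20             # arbitrary minimum length after trimming

def trim_read(seq, linker=LINKER):
    """Return (trimmed_seq, kind); kind is 'mate' or 'short_insert'."""
    pos = seq.find(linker)
    if pos == -1:
        return seq, "mate"                          # no linker seen
    if pos == 0:
        return seq[len(linker):], "short_insert"    # case 2: linker first
    return seq[:pos], "mate"                        # cases 1 and 3: keep 5' part

def route_pair(r1, r2, min_len=MIN_LEN):
    """Decide which output bin a read pair belongs to after trimming."""
    t1, k1 = trim_read(r1)
    t2, k2 = trim_read(r2)
    keep1, keep2 = len(t1) >= min_len, len(t2) >= min_len
    if keep1 and keep2:
        bin_name = "mate_8kb" if k1 == k2 == "mate" else "paired_500bp"
        return bin_name, (t1, t2)
    if keep1:
        return "single", (t1,)
    if keep2:
        return "single", (t2,)
    return "discard", ()

g1, g2 = "A" * 30, "C" * 30
clean = route_pair(g1, g2)
linker_at_3prime = route_pair(g1 + LINKER, g2)
linker_at_5prime = route_pair(LINKER + g1, g2)
linker_in_middle = route_pair(g1 + LINKER + "TTTT", g2)
too_short = route_pair("ACGT" + LINKER + g1, g2)
```

    Each returned bin name would map to one output file, so a pair is always written together or its surviving read goes to the single-reads file.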

    any advice is welcome

    Hinsby

  • #2
    Hinsby,

    You appear to be confusing sequencing platforms and paired-end (or mate-paired) protocols here.

    The Illumina paired-end protocol is meant to generate two reads, one from each end of a contiguous fragment of dsDNA. The reads point towards each other (in their 5'->3' directions) and are separated by 200-600 bp, depending on the size of the DNA fragment.

    The Illumina mate-pair protocol is meant to generate two reads which are separated by 2-5 kbp. This protocol includes a circularization step and subsequent fragmentation of the circle. The standard protocol does not use any linker DNA in the circularization. The two reads will be separated by 2-5 kbp and will point away from each other.

    The Roche/454 paired end protocol is meant to produce two reads which are separated by 3, 8 or 20 kbp depending on the size of your original shearing of the genomic DNA. This protocol also uses a circularization step but includes a 42 bp linker at the point of circularization. The two reads will be separated by 3, 8 or 20 kbp and will point in the same direction.

    You state that the sequence data was generated using the Illumina platform but that it has a 42 bp linker. The presence of the 42 bp linker would indicate the data was generated using 454. You need to clarify with your sequencing center what platform was used to generate the sequence before we can advise you on how to process/interpret your data.



    • #3
      I was reading some previous posts and yes, I have mislabeled my data: they are Illumina mate pairs at 8 kb distance. Indeed, the standard Illumina protocol does not use a 42 bp central linker, which would avoid the problem of having to remove this sequence. But, in the belief of our sequencing facility manager, not having a central linker also does not let you distinguish true mate pairs (sequenced from the extremes of your 8 kb fragments) from reads hitting the central part of the fragment (the joint where the DNA fragment was circularized), and hitting it will create intragenomic chimeric reads. Thus he changed the protocol and added the extra linker. The linker idea sounds pretty much like 454 because it was adapted from that technique. So the data was generated using a modified protocol with an extra central linker, which in the long run should help to differentiate true mate pairs from paired ends in Illumina. However, it also created a challenge for actually processing the data (trimming and separation) before assembly.

      I am inexperienced with this technology so any help is highly appreciated.

      Hinsby



      • #4
        In any case, the ShortRead package in R will solve your trimming problems.
        You'll need to know/learn R though.
        Here there are very useful examples on how to do the trimming and much more:


        Also, Google the "vcountPattern" function; it seems well suited to your case.



        • #5
          Originally posted by hinsby
          ...in the belief of our sequencing facility manager, not having a central linker also does not let you recognize true mate pairs (sequencing from the extremes of your 8kb fragments) from reads hitting the central part of the fragment (the joint where the DNA fragment was circularized) and that hitting it will create intragenomic chimeric reads. Thus he changed the protocol and added the extra linker. Now the linker idea sounds pretty much like 454 because it was adapted from that technique.
          Well then I'm afraid your sequencing facility manager left you with a hot mess. The Illumina protocol recognizes the possibility that a read could cross the circular junction point but if you follow it as recommended the frequency should be very low. Here is what the Illumina mate-pair guide says:

          When sequencing a mate pair library, Illumina recommends a read length no longer than 36 bases. A longer read length elevates error rates, because longer reads are more likely to cross over the junction of the two joined ends of a size-selected fragment. The Illumina analysis pipeline discards these junction reads, since they do not align to the reference sequence.

          To minimize junction reads, the mate pair library uses a template size range of 350–650 bp. This is larger than a typical paired-end library template of 300–400 bp. Increasing the size range of the library in the mate pair protocol minimizes the number of sequence reads that pass through a junction.
          Did you perform long reads with this library? The mate-pair protocol (as opposed to the paired-end protocol) is meant to provide scaffolding information, not sequence coverage.

          You could try the fuzznuc program (http://embossgui.sourceforge.net/dem...l/fuzznuc.html) in the EMBOSS Suite (you would need to install all of EMBOSS). This won't trim the reads, just identify the location of the linker in your reads. You would then need to parse the output and trim or split the reads yourself.
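
          Once you have the hit coordinates, the "trim or split the reads yourself" step could look something like this minimal Python sketch. It assumes you have already parsed a start and end position for the linker out of the report, and that those positions are 1-based and inclusive; check that against your actual output. The function name is made up for the example.

```python
def split_at_hit(seq, qual, start, end):
    """Split a read around a linker hit.

    start and end are assumed to be 1-based, inclusive coordinates of the
    linker within the read. Returns ((left_seq, left_qual),
    (right_seq, right_qual)) with the linker itself removed.
    """
    left = (seq[:start - 1], qual[:start - 1])
    right = (seq[end:], qual[end:])
    return left, right

# Linker occupies bases 5 to 8 of a 12-base toy read.
left, right = split_at_hit("AAAAGGGGTTTT", "IIIIJJJJKKKK", 5, 8)
```

          Quality strings are sliced with the same coordinates so that sequence and quality stay in step.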



          • #6
            Yep, it sounds like it is a hot mess indeed.

            The reads are on average 80 bp, so they are long reads.

            OK, before placing the sequencing order I was not aware of the protocol's recommendation to use shorter reads to reduce the chance of reading into the joint (center). However, I trusted the judgment of our sequencing manager; his intention was to maximize the information by using long reads, to use the adapter to flag true mate pairs, and possibly to obtain a de novo assembly using a full lane for each 2.6 Mb genome. The idea seems a good one to me, except that the center was not bioinformatically ready to deal with the sorting and cleaning of the sequences before assembly, and now I have that task and I am new to bioinformatics.

            I will try fuzznuc; it sounds like it could help, but I have 3 samples with something on the order of 15 million reads each, which makes this task computationally long and memory demanding. My Mac can barely handle the big files. Thanks for the help, of course, and any other ideas or suggestions are welcome any time.
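
            On the memory side, one option is to stream the file one read at a time instead of loading it whole. A minimal Python sketch follows, assuming the reads are in plain 4-line FASTQ records (no wrapped sequence lines); the function names and the file name in the comment are hypothetical.

```python
import io
from itertools import islice

def read_fastq(handle):
    """Yield (name, seq, qual) tuples one record at a time.

    Assumes plain 4-line FASTQ records (no wrapped sequence lines).
    """
    while True:
        block = list(islice(handle, 4))
        if not block:
            return  # clean end of file
        if len(block) < 4 or not block[0].startswith("@"):
            raise ValueError("malformed or truncated FASTQ record")
        name, seq, _plus, qual = (line.rstrip("\r\n") for line in block)
        yield name[1:], seq, qual

def count_long_reads(handle, min_len):
    """Count reads of at least min_len bases without storing them."""
    return sum(1 for _name, seq, _qual in read_fastq(handle) if len(seq) >= min_len)

# Tiny in-memory demo; with real data, pass open("lane1.fastq") instead
# (the file name is hypothetical).
demo = io.StringIO("@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nGG\n+\nJJ\n")
n_long = count_long_reads(demo, min_len=4)
```

            Only one record is held in memory at any moment, so the 15 million reads per sample cost time but very little RAM.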
