  • 454: Unmapped contigs after Reference Assembly

    Dear all,

    We have sequenced human BAC clones using 454 sequencing technology. During assembly into a consensus sequence (CONS), using two reference sequences in parallel, many reads were not incorporated into the resulting consensus.

    Afterwards, I did a de novo assembly (using no reference sequence) of these unmapped reads, and I am currently analysing the resulting contigs.

    I found two different scenarios: (A) the resulting contigs correspond to the cloning vector or to E. coli DNA (traces of bacterial DNA not eliminated during the maxiprep); (B) some other contigs do map to our human target region.

    (A) This applies to most of the contigs, including the longest ones and those with the deepest read coverage.

    (B) When these contigs are mapped to the reference sequences, some of them behave like paired-end tags (PET), but with orientations or distances between the aligned segments that differ from those expected for PET (3 kb in our case). I do not believe they correspond to structural variation between my template and these references.

    It is worth mentioning that i) most of the contigs in scenario (B) are around 200 – 500 bp and none exceeds 1300 bp; ii) whilst the coverage in CONS is around 80-fold, the coverage of most of these contigs is between 2- and 3-fold, and few of them exceed 10-fold.

    I would appreciate it if anyone who has observed these kinds of reads / contigs in their 454 analysis could let me know.

    I am also wondering how common this type of read / contig is, and why it occurs. Does anyone know?

    Thank you in advance for your help.

    With kindest regards

    Alex

  • #2
    Alex:

    I just finished this type of analysis with a 454 Titanium run on E. coli. The difference is that I started with a de novo assembly of the reads and then mapped the contigs to the E. coli genome, instead of using the Mapper and then assembling the remaining reads as you did. (Although I did also run the Mapper as a separate trial.) The statistics from the de novo assembly:

    There are 651 "large" (>= 500 bp) contigs.

    Of these, a whopping 545 do not match E. coli W3110. However, none of these non-matching contigs is very long: they range from 500 to 2988 bp. By comparison, the 106 matching contigs tend to be long and range from 531 bp to 222,307 bp.
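The kind of length summary quoted above (number of "large" contigs and their length range) can be reproduced from any assembler's FASTA output. The following is only a minimal sketch, not the procedure used in this post: the function names and the toy sequences are hypothetical, and the 500 bp cutoff is simply the ">= 500 bp" threshold mentioned above.

```python
from io import StringIO


def read_fasta_lengths(handle):
    """Return {contig_name: sequence_length} for a FASTA-formatted handle."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the contig name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def summarize_large_contigs(lengths, min_len=500):
    """Count contigs of at least min_len bp and report their length range."""
    large = [n for n in lengths.values() if n >= min_len]
    if not large:
        return {"count": 0, "min": None, "max": None}
    return {"count": len(large), "min": min(large), "max": max(large)}


# Toy input (hypothetical sequences); a real run would open the contig file.
toy_fasta = StringIO(">contig1 demo\nACGT\nACGT\n>contig2\nACG\n")
toy_lengths = read_fasta_lengths(toy_fasta)
toy_summary = summarize_large_contigs(toy_lengths, min_len=4)
print(toy_summary)
```

Running the same summary separately on the matching and non-matching contigs would give the two length ranges that are compared above.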

    So it is obvious that the non-matching contigs are not very good. Nevertheless, I was curious as to what the non-matching contigs do match.

    Of the 545 contigs:

    36 do not significantly match anything in GenBank.

    137 match many entries in GenBank.

    348 match Bacillus licheniformis genomes.

    3 match a B. licheniformis plasmid.

    9 match P. fluorescens.

    2 match K. pneumoniae.

    The remaining 10 I did not bother to characterize, since they did not hit the same GenBank entries.

    ---------------------------------------

    So what conclusions can be, tentatively, drawn?

    A) We did not have wholesale contamination; otherwise the contigs not matching E. coli would have been long.

    B) Perhaps E. coli is picking up strands of DNA from its environment?

    C) Perhaps strands of DNA from the environment are getting into our experiment, due to poor sterile technique in the laboratory or to DNA stuck on new or reused equipment?

    I suspect that NextGen sequencers will uncover a lot of this low-level contamination. We are dealing with so many reads that it seems likely to me that some will arise from external sources.

    As to your particular case, you mentioned that in your case (B) you were able to map the contigs back to your human reference sequence, but that the contigs looked strange. It is possible that you are finding traces of human contamination: either the cells being sequenced had trace rogue DNA in them, or trace DNA 'fell in' to the prep during handling. It is an idea.

    I am looking forward to analyzing our next titanium run.



    • #3
      Originally posted by westerman (post #2 above)

      What program/software did you use to obtain those statistics?
      thanks



      • #4
        Originally posted by Chuckytah:
        What program/software did you use to obtain those statistics?
        thanks
        Hmm, you are making me think about a project done over 2 years ago. That is forever in NGS time! I cannot remember exactly, but I probably used BLAST to get the statistics. E. coli is small enough that blasting the contigs against it would not be onerous.
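For anyone wanting to produce a similar breakdown today, here is a rough sketch of tallying contigs by their best BLAST hit. It is not the original procedure (which is only remembered as "probably BLAST"). It assumes BLAST was run with tabular output (`-outfmt 6`, whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function names, the subject labels in the toy rows, and the e-value cutoff are all hypothetical choices.

```python
from collections import Counter


def best_hits(tabular_lines, max_evalue=1e-10):
    """Map each query contig to (subject, bitscore) of its highest-scoring hit.

    Hits with an e-value above max_evalue (an arbitrary cutoff) are ignored.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def tally_by_subject(contig_names, best):
    """Count contigs per best-hit subject; contigs without a hit are pooled."""
    counts = Counter()
    for name in contig_names:
        label = best[name][0] if name in best else "no significant hit"
        counts[label] += 1
    return counts


# Toy rows with made-up subject names and scores.
toy_rows = [
    "c1\tEcoli_W3110\t99.0\t500\t5\t0\t1\t500\t1000\t1499\t1e-150\t900",
    "c1\tBlich\t85.0\t200\t30\t0\t1\t200\t10\t209\t1e-40\t200",
    "c2\tBlich\t98.0\t600\t12\t0\t1\t600\t50\t649\t1e-170\t1050",
    "c3\tEcoli_W3110\t80.0\t50\t10\t0\t1\t50\t5\t54\t0.5\t30",
]
toy_best = best_hits(toy_rows)
toy_counts = tally_by_subject(["c1", "c2", "c3"], toy_best)
print(dict(toy_counts))
```

Grouping subject identifiers into organisms (for example, several accessions of the same species) would need an extra lookup step that is not shown here.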



        • #5
          Originally posted by westerman:
          Hmm, you are making me think about a project done over 2 years ago. That is forever in NGS time! I cannot remember exactly, but I probably used BLAST to get the statistics. E. coli is small enough that blasting the contigs against it would not be onerous.
          Sorry, I didn't see the dates, lol.
          Thanks anyway.



          • #6
            Originally posted by Alex Clop:
            It is worth mentioning that i) most of the contigs in scenario (B) are around 200 – 500 bp and none exceeds 1300 bp; ii) whilst the coverage in CONS is around 80-fold, the coverage of most of these contigs is between 2- and 3-fold, and few of them exceed 10-fold.

            I would appreciate it if anyone who has observed these kinds of reads / contigs in their 454 analysis could let me know.
            Alex
            The process of ligating adapters to the DNA fragments also produces chimeric sequences where two DNA fragments ligate together. The ratio of primers to DNA is designed to limit this but it does happen. More often than not it will be repetitive DNA that ligates. If these chimeric sequences then get the correct primers on each end they will amplify in the subsequent PCR steps producing more copies. That's why the sequences you describe have low sequence coverage and behave like paired end tags - they are an artefact of the ligation process.
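One way to pick out candidate chimeric contigs of the kind described above is to look for contigs whose alignment to the reference is split into segments that are not colinear. The sketch below is only an illustration under stated assumptions: the `Segment` record, the function names, and the 50 bp `max_ref_gap` tolerance are hypothetical, reference coordinates are assumed to be stored with `r_start <= r_end` and the strand kept separately, and a flag by itself cannot tell a ligation artefact from genuine structural variation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    contig: str
    q_start: int   # start on the contig
    q_end: int     # end on the contig
    ref: str       # reference sequence name
    r_start: int   # start on the reference (r_start <= r_end assumed)
    r_end: int     # end on the reference
    strand: str    # "+" or "-"


def split_reason(a, b, max_ref_gap):
    """Return why two consecutive segments are not colinear, or None."""
    if a.ref != b.ref:
        return "different reference"
    if a.strand != b.strand:
        return "opposite strands"
    # Distance on the reference between the two segments (negative if overlapping).
    gap = max(a.r_start, b.r_start) - min(a.r_end, b.r_end)
    if gap > max_ref_gap:
        return "distant reference positions"
    return None


def flag_split_contigs(segments, max_ref_gap=50):
    """Return {contig: reason} for contigs with non-colinear aligned segments."""
    by_contig = defaultdict(list)
    for seg in segments:
        by_contig[seg.contig].append(seg)
    flagged = {}
    for contig, segs in by_contig.items():
        ordered = sorted(segs, key=lambda s: s.q_start)
        for a, b in zip(ordered, ordered[1:]):
            reason = split_reason(a, b, max_ref_gap)
            if reason is not None:
                flagged[contig] = reason
                break
    return flagged


# Toy alignments with made-up coordinates.
toy_segments = [
    Segment("ctgA", 1, 200, "ref1", 100, 299, "+"),
    Segment("ctgA", 201, 400, "ref1", 300, 499, "+"),
    Segment("ctgB", 1, 150, "ref1", 1000, 1149, "+"),
    Segment("ctgB", 151, 300, "ref1", 4200, 4349, "-"),
    Segment("ctgC", 1, 150, "ref1", 1000, 1149, "+"),
    Segment("ctgC", 151, 300, "ref1", 9000, 9149, "+"),
]
toy_flags = flag_split_contigs(toy_segments)
print(toy_flags)
```

Flagged contigs that also have low coverage, as in the original post, would fit the ligation explanation; contigs with high coverage and consistent breakpoints would deserve a closer look before being dismissed.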
