  • Identifying contamination

    I've got an interesting problem and wondered if anyone else had any thoughts about how I can approach this.

    I've got some Illumina data from a run which should have contained human sequence but appears to have been contaminated with some other sequence of unknown origin. We're pretty sure the samples haven't been mixed up since some of the affected lanes were barcoded and the barcodes are present. The problem now is to try to identify the source of the contamination.

    The sequences we produced are very diverse, with little or no duplication of reads, so this isn't just primers or plasmid DNA.

    So far I've ruled out:

    Human
    Mouse
    Rat
    Any other vertebrate species the lab concerned work on
    E. coli

    ..and now I'm stuck!

    If you had 30million+ reads of unknown origin (or origins) how would you try to find where they'd come from?

  • #2
    MEGAN?



    (or any other metagenomics analysis pipeline suitable for Illumina reads)

    Comment


    • #3
      Originally posted by flxlex View Post
      MEGAN?
      Very interesting! I'd not seen that before and I can think of uses for it! However it does require a separate blast step before you can do any of that analysis and I don't really have the resources to blast this number of sequences. I may try something along those lines with a smaller random selection of sequences though.

      As an aside, I've been thinking about putting together a database of potentially contaminating sequences which you could map a next-gen dataset against as a QC measure. It would include all of the primers used for library prep, E. coli sequence, various families of repeats and other stuff we regularly see turning up in our libraries. Has anyone tried this before?
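
A minimal sketch of that QC idea, assuming exact k-mer matching instead of a real read mapper. The function names, the choice of k and the example adapter sequence are illustrative assumptions, not an existing tool:

```python
from collections import Counter

def revcomp(seq):
    """Reverse complement; characters other than ACGT are left unchanged."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def kmers(seq, k):
    """All k-mers of seq, in order of position."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(contaminants, k=20):
    """Map every k-mer (both strands) to the first contaminant it was seen in.

    `contaminants` is a dict of name -> sequence (hypothetical input format).
    """
    index = {}
    for name, seq in contaminants.items():
        for km in kmers(seq, k) + kmers(revcomp(seq), k):
            index.setdefault(km, name)
    return index

def screen(reads, index, k=20):
    """Count reads by the first contaminant sharing a k-mer with them."""
    counts = Counter()
    for read in reads:
        hit = next((index[km] for km in kmers(read, k) if km in index), None)
        counts[hit if hit is not None else "no_hit"] += 1
    return counts

# Usage with a made-up adapter-like sequence and a short k for illustration.
contaminants = {"adapter": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"}
index = build_index(contaminants, k=8)
print(screen(["TTTTAGATCGGAAGAGCGGGG", "TTTTTTTTTTTTTTTT"], index, k=8))
```

Exact k-mer lookup will not tolerate sequencing errors inside the matching k-mer, so for real data a proper aligner against the same contaminant set is probably the better choice; the sketch only shows the shape of the report.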

      Comment


      • #4
        Align then blast

        There is usually quite a bit of overlap between sequences. In your case you do not have a reference genome, and therefore no genes to which to align the reads. However, may I propose another way of using the coverage depth to recover a gene. Align the sequence reads to themselves, requiring some arbitrarily high identity threshold, say 80%. For genes that are highly expressed there will be enough coverage depth to recover the exon, giving you longer sequences to use in a blast search for your organism (provided it is a single organism).

        Let us know how it goes. Good luck!

        The alignment might look something like this
        __________________
        _____________________ ___________________
        _____________________
        _______________________
        ____________________ __________________

        providing a sequence that is quite a bit longer.
        _______________________________________________
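
The picture above can be sketched in a few lines of Python. This is a toy greedy merge that assumes exact suffix/prefix overlaps (not the ~80% identity alignment suggested in the post) and compares all pairs of reads, so it is only practical for a handful of sequences. The function names and `min_len` parameter are illustrative assumptions:

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_merge(reads, min_len=10):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return seqs  # nothing left to merge
        merged = seqs[best_i] + seqs[best_j][best_n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (best_i, best_j)] + [merged]

# Three overlapping "reads" cut from one made-up sequence.
source = "ATGGCGTACGTTAGCCTAGGATCCGATTACGGCTAAGCTTGCA"
reads = [source[0:20], source[10:30], source[20:]]
print(greedy_merge(reads, min_len=8))
```

A real assembler handles sequencing errors, repeats and reverse-complement reads, none of which this sketch attempts.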
        Last edited by severin; 09-25-2009, 04:39 AM.

        Comment


        • #5
          Hi,
          are you simply looking at the short sequences and trying to work out where they come from? I think a good approach would be to try a de novo assembly of the short reads and then blast the output at NCBI, or against your own database of suspected sources of contamination.

          The advantage of performing a de novo assembly is that you can blast longer sequences, and you will probably discard a lot of reads that contain only errors.

          Comment


          • #6
            @francesco.vezzi

            Exactly.

            Comment


            • #7
              Originally posted by simonandrews View Post
              Very interesting! I'd not seen that before and I can think of uses for it! However it does require a separate blast step before you can do any of that analysis and I don't really have the resources to blast this number of sequences.
              You don't have to BLAST the entire set to get a good picture of the source of the contamination. Select a random set of ~300,000 (1% of your total). That should provide enough information.
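
One way to draw such a random subset without loading all the reads into memory is reservoir sampling. The sketch below assumes a plain 4-line-per-record FASTQ file; the function names and the file name in the usage note are hypothetical:

```python
import random

def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return  # end of file (or a blank line)
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        yield header, seq, qual

def reservoir_sample(records, n, seed=None):
    """Return n records chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < n:
            sample.append(rec)          # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < n:
                sample[j] = rec         # replace with probability n / (i + 1)
    return sample
```

Usage would look something like `with open("lane1.fastq") as fh: subset = reservoir_sample(read_fastq(fh), 300_000)`, after which the subset can be written out as FASTA for BLAST. A much smaller `n` works the same way if compute is limited.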

              Comment


              • #8
                Originally posted by francesco.vezzi View Post
                Hi,
                are you simply looking at the short sequences and trying to work out where they come from? I think a good approach would be to try a de novo assembly of the short reads and then blast the output at NCBI, or against your own database of suspected sources of contamination.

                The advantage of performing a de novo assembly is that you can blast longer sequences, and you will probably discard a lot of reads that contain only errors.
                This is exactly the approach (CLCbio de novo assembly combined with blast and MEGAN) that we use to characterise metagenomics datasets. Contamination is easy to identify when you use MEGAN to visualise the blast results - despite us working on plant pathogens/environmental samples, we always seem to get some good quality human contaminating sequences...

                Comment


                • #9
                  I ended up doing a de-novo assembly with velvet (which was much easier and quicker than I thought it would be). I got several contigs of over 1kb in length. Blasting these gave a few high identity (though not identical) hits to a bacterial genome so I guess that something similar to that was the main contaminant. Interestingly I've still got a couple of contigs of 5+kb which don't appear anywhere in EMBL so the mystery isn't completely solved - but things are a lot clearer than they were.

                  Thanks for the suggestions.

                  Comment


                  • #10
                    Originally posted by kmcarr View Post
                    You don't have to BLAST the entire set to get a good picture of the source of the contamination. Select a random set of ~300,000 (1% of your total). That should provide enough information.
                    I think you overestimate the compute power I currently have available to me. 300,000 blasts is not something I generally do just before I go home on a Friday.

                    Comment


                    • #11
                      Has anyone used MEGAN before? Are there any tutorials?

                      Comment


                      • #12
                        Megan Blast

                        We Blast 100K 50bp reads against the NCBI nucleotide database (14M sequences) on a Cray XT6m supercomputer in 24-hour runs, followed by MEGAN analysis for metagenomic data. The 100K sample represents 2.5% of the total read population (4M reads). The taxonomic distribution shown in MEGAN generally agrees with the taxa expected by the PIs. We plan on making 168-hour runs (1 week) to sample 17.5% of the total read population. We want to see if the taxon distribution changes substantially. I think it's an open question whether Blasting a very small subset of reads yields an accurate estimate of the taxa represented in a metagenomic sample. Another question is whether you would miss reads from rare species in the sample.
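
The rare-species question can be given a rough back-of-the-envelope answer. The sketch below assumes reads are sampled independently and that every sampled read from a taxon would actually be identified by Blast, which is optimistic. Under those assumptions, the chance of seeing at least one read from a taxon that makes up a fraction f of the reads, in a sample of n reads, is 1 - (1 - f)^n. The function names are illustrative:

```python
import math

def detection_probability(fraction, n_sampled):
    """P(at least one sampled read comes from a taxon making up `fraction` of all reads)."""
    return 1.0 - (1.0 - fraction) ** n_sampled

def reads_needed(fraction, confidence=0.95):
    """Smallest sample size giving at least `confidence` chance of one or more hits."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fraction))

# A taxon at 1 read in 100,000, with the 100K sample described above.
print(detection_probability(1e-5, 100_000))
# Sample size needed to see a taxon at 0.1% of reads with 95% confidence.
print(reads_needed(0.001))
```

This only addresses detecting a taxon at all; estimating its abundance accurately needs many more reads than a single hit.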

                        R

                        Comment
