  • bbduk with a large reference database

    Hi,
    I would like to check for contaminants using both phiX and the human genome. My data is metagenomic, and I want to remove any read that maps to either phiX or the human genome.

    So far, BBDuk handles the phiX part with ref=phiX.fa.
    For human contamination, however, I would like to use the non-redundant nucleotide database. It is split into small pieces, and I usually access them through BLAST using the reference nt.nal file.

    Is that also feasible with BBDuk?

  • #2
    I don't completely understand what you mean by "i would like to use the non redundant nucleotide database" to remove contamination from human samples. It may still be easier to do what you have been doing (separate human reads from other stuff).

    You should be able to use BBSplit or Seal, both of which can accept a folder of references. Whether BBSplit can handle a folder the size of "nr" would need to be tested.

    • #3
      Sorry for the confusion. I was mixing this up with large BLAST databases (.nal file). BBDuk does its own indexing, so there is no way to use BLAST index databases.

      Which human database do people most frequently use to discard human contamination reads from metagenomes? I thought of using the nt database (nucleotide sequence database, with entries from all traditional divisions of GenBank, EMBL, and DDBJ, excluding bulk divisions (gss, sts, pat, est, htg)).

      • #4
        Originally posted by danova
        Sorry for the confusion. I was mixing this up with large BLAST databases (.nal file). BBDuk does its own indexing, so there is no way to use BLAST index databases.

        Which human database do people most frequently use to discard human contamination reads from metagenomes? I thought of using the nt database (nucleotide sequence database, with entries from all traditional divisions of GenBank, EMBL, and DDBJ, excluding bulk divisions (gss, sts, pat, est, htg)).

        Correct, for the first question/comment.

        You can just use the human genome sequence (a multi-FASTA file with all chromosomes concatenated in a single file, from UCSC/Ensembl/NCBI/iGenomes) with BBDuk (or BBSplit). BBSplit may be better, since you can bin all sequences that align to human in one file and capture the rest of the data in a second output file.
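A minimal sketch of what that BBSplit call could look like, written as a small Python helper that only builds the command line and does not run it. The file names are hypothetical placeholders; `in=`, `ref=`, `basename=` and `outu=` are BBSplit parameters, but check `bbsplit.sh` without arguments for the options in your installed version.

```python
import shlex


def bbsplit_command(reads, human_ref, human_out_pattern, clean_out):
    """Build (but do not run) a BBSplit command that bins human reads."""
    return [
        "bbsplit.sh",
        f"in={reads}",
        f"ref={human_ref}",
        f"basename={human_out_pattern}",  # % is replaced by the reference name
        f"outu={clean_out}",              # reads that match no reference
    ]


# Hypothetical file names, for illustration only.
cmd = bbsplit_command("reads.fq.gz", "human.fa", "binned_%.fq.gz", "nonhuman.fq.gz")
print(shlex.join(cmd))
```

To actually run it, the list can be passed to `subprocess.run(cmd, check=True)` once BBTools is on the PATH.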

        • #5
          Great, I'll work on that, combining it with BBSplit.
          Thanks.

          • #6
            After using BBDuk for PhiX removal, the protocol JGI uses for human removal is this, with BBMap and a masked human reference. Using BBSplit is strictly better if you know your intended organism's genome. But JGI rarely knows that, which is why we are sequencing it.

            You can download the masked human reference from the link provided. It constitutes around 98% of the human genome. That means some reads will intentionally slip through, in regions that are highly conserved down to early eukaryotes, or those with very low complexity. But, the point is to remove virtually all human contamination with no risk of false positives. If you absolutely need to remove ALL human contamination and don't know the organism's genome, you should use the unmasked reference, and you probably will get some false positive removals.
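As a rough sketch of the two-step order described here (BBDuk for PhiX first, then BBMap against the masked human reference), the helper below only assembles the two command lines. It is not the JGI protocol itself: the file names, the index directory, and the values `k=31`, `hdist=1` and `minid=0.95` are assumptions for illustration, and the linked protocol is the place to get the actual settings.

```python
import shlex


def decontamination_commands(reads, phix_ref, human_index_dir):
    """Return the two command lines (as argument lists) without running them."""
    no_phix = "no_phix.fq.gz"  # hypothetical intermediate file name
    bbduk = [
        "bbduk.sh",
        f"in={reads}",
        f"out={no_phix}",           # reads that do not match PhiX
        "outm=phix.fq.gz",          # reads that match PhiX
        f"ref={phix_ref}",
        "k=31",                     # assumed k-mer length
        "hdist=1",                  # assumed: allow one mismatch per k-mer
    ]
    bbmap = [
        "bbmap.sh",
        f"in={no_phix}",
        "outu=clean.fq.gz",         # unmapped reads, i.e. the ones to keep
        "outm=human.fq.gz",         # reads mapped to the masked human reference
        f"path={human_index_dir}",  # directory holding a prebuilt BBMap index
        "minid=0.95",               # assumed approximate identity threshold
    ]
    return bbduk, bbmap


for cmd in decontamination_commands("reads.fq.gz", "phiX.fa", "hg19_masked_index"):
    print(shlex.join(cmd))
```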

            For assembly of a new organism, I think it is best to remove human contaminants using the above very safe procedure, then assemble, then BLAST the assembly and remove anything long (say, >400bp) that hits human with >98% identity, and hits nothing else other than other primates (typically chimp, gorilla, and orangutan).
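The post-assembly filter could be sketched like this. It assumes the BLAST hits have already been parsed into tuples and annotated with an organism label, which is a hypothetical preprocessing step, and it reads ">400bp" as the alignment length of the human hit. The thresholds are the ones suggested above.

```python
from collections import defaultdict

# Organisms whose hits do not protect a contig from removal.
PRIMATES = {"human", "chimp", "gorilla", "orangutan"}


def contigs_to_remove(hits, min_length=400, min_identity=98.0):
    """Flag contigs that look like human contamination.

    hits: iterable of (contig, organism, percent_identity, alignment_length).
    A contig is flagged when it has a long, high-identity human hit and
    every one of its hits is to a primate.
    """
    by_contig = defaultdict(list)
    for contig, organism, identity, length in hits:
        by_contig[contig].append((organism, identity, length))

    flagged = []
    for contig, contig_hits in by_contig.items():
        strong_human = any(
            org == "human" and ident > min_identity and length > min_length
            for org, ident, length in contig_hits
        )
        only_primates = all(org in PRIMATES for org, _, _ in contig_hits)
        if strong_human and only_primates:
            flagged.append(contig)
    return sorted(flagged)


# Made-up hits, for illustration only.
example_hits = [
    ("contig1", "human", 99.5, 1200),
    ("contig1", "chimp", 98.7, 1150),
    ("contig2", "human", 99.0, 900),
    ("contig2", "E. coli", 97.0, 300),  # non-primate hit, so contig2 is kept
    ("contig3", "human", 99.9, 150),    # too short, so contig3 is kept
]
print(contigs_to_remove(example_hits))  # ['contig1']
```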

            Also, note that I do not recommend using nt/nr in any primary decontamination procedure for which you know the possible contaminants (like determining which reads are, specifically, human) - they are incomplete, poorly-curated, and the process becomes extremely slow because they are huge. Rather, using the references (or masked versions of the references) will give you a better signal-to-noise ratio. nt/nr are much better for diagnosing which things may be present than actually removing them.

            Since you're doing metagenomics, using an unmasked human genome is probably fine, since humans and bacteria are very dissimilar. But unless you are studying a human-related microbiome, you might consider removing common human-associated microbes such as E. coli and Salmonella. They seem to be anywhere humans are. Masking things like ribosomes is probably prudent if you do this. There are also some others, like Delftia and Pseudomonas, that seem to be common sequencing contaminants and cause problems with metagenome analysis, as they show up everywhere, even when human-related DNA is not present, and even in single-cell experiments of other species. Anyway, something to consider.

            • #7
              Thanks Brian,

              Thanks for the masked version of hg19. Do you also have a masked version of hg38?

              Just another quick question: have you published BBMap, or how should I cite your software?

              • #8
                You can use bbmask.sh from BBMap to create a masked version of hg38.
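A minimal sketch of such a call, again only building the command line. The file names and the `entropy=0.7` value are assumptions, and this low-complexity masking alone will not necessarily reproduce the masked hg19 reference mentioned earlier in the thread. Run `bbmask.sh` without arguments to see the flags your version supports.

```python
import shlex


def bbmask_command(genome_fasta, masked_fasta, entropy=0.7):
    """Build (but do not run) a BBMask command for low-complexity masking."""
    return [
        "bbmask.sh",
        f"in={genome_fasta}",
        f"out={masked_fasta}",
        f"entropy={entropy}",  # assumed cutoff for low-entropy masking
    ]


# Hypothetical file names, for illustration only.
mask_cmd = bbmask_command("hg38.fa", "hg38_masked.fa")
print(shlex.join(mask_cmd))
```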

                BBMap has not been published yet. In the past @Brian has asked people to cite the project's SourceForge (http://sourceforge.net/projects/bbmap/) website in publications.

                • #9
                  I would not worry about HG19 versus HG38 for the purposes of contaminant removal. They mainly differ in their coordinates, not contents.
