  • #46
    For some other programs in BBMap suite there has been a "race condition" which produces a nasty (though according to @Brian a harmless) error related to threads.

    At least in this case the result appears to be identical. In the last table you had given us the results all looked different.

    Not sure why you still have this in the output, if you had deleted the ref folder:
    NOTE: Ignoring reference file because it already appears to have been processed.
    NOTE: If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
    You can set nodisk=t to prevent the index from being written to disk (memory will always be used).
    Last edited by GenoMax; 10-05-2017, 04:43 AM.



    • #47
      Hi Brian,

      I am wondering for BBsplit (with ambiguous2=split) if it is expected behavior that if I provide 2 references:

      Ref 1 has 1 sequence that is a perfect match to PE sequencing reads
      Ref 2 has many unique sequences that are at least 1 MM different from sequencing reads

      For BBsplit to think the reads are ambiguously mapped? And for this to be the case for >90% of read pairs but not for every read pair?

      Is there a way to force BBsplit to select Ref1 (perfect) over the options in Ref2?

      I am trying to remove a nearly-identical putative contaminant sequence from my data...

      (FYI I have ambiguous set to best, and I've also tried setting both ambiguous and ambiguous2 to best, but almost all of the reads are considered ambiguous in that scenario too)

      Thx! Kate
      Last edited by sk8bro; 10-11-2017, 01:08 PM.



      • #48
        Originally posted by GenoMax
        You can set nodisk=t to prevent the index from being written to disk (memory will always be used).
        Actually, "nodisk" does not work with BBSplit... sorry! I'll clarify that in the documentation. It's not like it's impossible to make it work, but it would be pretty complicated; one of those things that I would do if I could clone myself.

        Hi Brian,

        I am wondering for BBsplit (with ambiguous2=split) if it is expected behavior that if I provide 2 references:

        Ref 1 has 1 sequence that is a perfect match to PE sequencing reads
        Ref 2 has many unique sequences that are at least 1 MM different from sequencing reads

        For BBsplit to think the reads are ambiguously mapped? And for this to be the case for >90% of read pairs but not for every read pair?

        Is there a way to force BBsplit to select Ref1 (perfect) over the options in Ref2?

        I am trying to remove a nearly-identical putative contaminant sequence from my data...

        (FYI I have ambiguous set to best, and I've also tried setting both ambiguous and ambiguous2 to best, but almost all of the reads are considered ambiguous in that scenario too)

        Thx! Kate
        Hi Kate,

        You can certainly set "perfectmode" to only allow mappings that are 100% identity to the reference. For example, you could run once in perfectmode, and then run the remaining unmapped reads normally. But generally, yes, this is intended behavior. BBSplit is intended to do things like separate mouse reads from human reads when a mouse is used as the vector for some human DNA study. Also, it is designed to separate reads belonging to various bacteria in a metagenome, and chloroplast/plant reads in a plant. In any of these cases, if there is a single bp mismatch, you can't unambiguously assign a read to one reference or the other, since it could be a sequencing error or (more importantly) an actual variation.

        You might find Seal (seal.sh) useful. It has functionality similar to BBSplit, but is alignment free (meaning it is way faster, but uses more memory). It allows you to specify cutoffs, so you can for example send all of the perfectly-matching reads to one file, regardless of whether they almost match another genome. It decides which file to assign a read to based on the number of kmers (by default, 31-mers) matching that reference. When dealing with bacteria, I always prefer Seal over BBSplit because it is so much faster and easier to use, and bacteria have tiny genomes that are under high evolutionary pressure to avoid low-complexity sequence. Euks have huge genomes with a lot of low-complexity regions so BBSplit is better in that case, since it is more precise.
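To make the kmer-counting idea concrete, here is a minimal Python sketch of assigning a read to whichever reference shares the most kmers with it. This is not Seal's actual code: `kmers`, `assign_read`, the `min_hits` cutoff, and the "ambiguous" label are hypothetical names for illustration, and the sketch ignores reverse complements and everything else Seal does for speed.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def assign_read(read, ref_kmer_sets, k=31, min_hits=1):
    """Pick the reference sharing the most k-mers with the read.

    ref_kmer_sets maps a reference name to its precomputed k-mer set.
    Returns the winning name, "ambiguous" on a tie, or None when no
    reference reaches min_hits shared k-mers.
    """
    read_kmers = kmers(read, k)
    hits = {name: len(read_kmers & ref) for name, ref in ref_kmer_sets.items()}
    best = max(hits.values(), default=0)
    if best < min_hits:
        return None
    winners = [name for name, count in hits.items() if count == best]
    return winners[0] if len(winners) == 1 else "ambiguous"
```

With k=31, a single mismatch knocks out every kmer that overlaps it, so a read that matches one reference perfectly will normally share more kmers with that reference than with a near-identical one.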

        If you want to remove a particular contaminant from your data, I suggest trying BBDuk. BBDuk does not allow kmers longer than 31. However, it emulates longer kmers. For example, if you set "k=90", then it will consider reads as matching (by default, they get discarded) if there is a 90bp stretch in which all 31-mers match 31-mers in the reference. In practice this is very similar to matching 90-mers.

        So, for example:

        bbduk.sh in=reads.fq out=filtered.fq ref=contaminant.fa k=90 mm=f

        That will remove all reads containing a stretch of at least 90bp that shares all of its kmers with your reference (which should be the unwanted sequence). The exact values depend on the length of the sequence in question. Note that read length is very important here; if a read only overlaps 80bp of the 90bp sequence in question, it would not be removed.
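The long-kmer emulation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not BBDuk's implementation: `kmers` and `has_emulated_match` are hypothetical names, and the sketch assumes exact matching on the forward strand only. A 90bp window contains 90 - 31 + 1 = 60 overlapping 31-mers, so the check amounts to finding 60 consecutive 31-mers that are all present in the reference.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def has_emulated_match(read, ref_kmers, k_long=90, k_short=31):
    """True if some k_long-bp stretch of read has all of its k_short-mers in ref_kmers."""
    needed = k_long - k_short + 1  # consecutive short k-mers that tile one long window
    run = 0
    for i in range(len(read) - k_short + 1):
        if read[i:i + k_short] in ref_kmers:
            run += 1
            if run >= needed:
                return True
        else:
            run = 0  # a missing k-mer breaks the stretch
    return False
```

A read overlapping only 80bp of the reference yields at most 50 consecutive matching 31-mers, which is why it would be kept rather than removed.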



        • #49
          Thanks Brian, I will look into the options you suggested to handle this use case!

          Kate



          • #50
            Biological question about contaminants DB

            Dear Brian,

            I have created a de novo transcriptome of a non-model organism. After BLASTing the contigs I found some homology with possible contaminants. My question is whether I should use DNA genomes (masked or unmasked) or the CDS gene predictions from the Ensembl database.

            In the first case, do you recommend the unmasked or the masked version of the genomes (Ensembl), or should I use BBMask?
            And in the second case, could BBSplit handle CDS mapping?

            I am a bit lost, I think that contigs from de novo transcriptome have to be mapped against CDS sequences.

            With thanks,
            Xavier



            • #51
              While Brian will have more detailed insight, it should be OK to use the unmasked genome with BBSplit. Any mapping issues you may have with short reads should be more or less the same with the genome or just the CDS sequences.



              • #52
                Originally posted by GenoMax
                While Brian will have more detailed insight, it should be OK to use the unmasked genome with BBSplit. Any mapping issues you may have with short reads should be more or less the same with the genome or just the CDS sequences.
                Thanks GenoMax,

                I see from another thread that masked genomes are used in order to prevent false positives when removing contamination.

                At the same time, in that thread you suggest simply using BBSplit, which should be enough.

                Originally posted by GenoMax
                You can create them yourself using bbmask.sh. Not sure if you would need to if you are just looking to remove reads mapping to mito and chloroplast.

                I assume you have seen BBsplit, which can be used for this purpose.
                With thanks,
                Xavier



                • #53
                  @Brian is doing something very specific to remove human contamination in JGI's non-human samples.

                  Re-reading your original question it would be good to know if your contaminants are "close" relatives or can be considered a distant species. Success or failure of bbsplit is going to largely depend on that. No tool is going to be able to separate reads from a very closely related species based on sequence alone.
                  Last edited by GenoMax; 10-12-2017, 07:38 AM.



                  • #54
                    Hi Brian,

                    So I've considered your 3 suggestions and Seal seems the path of least resistance. To separate sequences with 1 SNP of importance, in presence of other sequencing error SNPs the references need to be considered simultaneously or else which SNP is the important one gets lost.

                    So... I took the approach of using BBsplit (to ensure the .95 minid/idfilter that I want), and then fed everything that mapped into Seal (default except k=8, ambiguous=toss).

                    There are a few read pairs that are coming out as "Unmapped" in the Seal step, but they aligned with BBsplit. I looked at one and saw it has 1 SNP in each of F and R ~150 bp reads. Which Processing parameter do I need to change in order to "Map" this read pair?

                    Thx, Kate



                    • #55
                      Consensus seq from bbsplit

                      Hi Brian,

                      Do you have any thoughts on how to generate a consensus sequence from the read-pairs that I've used BBsplit to align and separate from each other?

                      I tried out a set of samtools mpileup/bcftools commands but it is reference-based and each of my reference files has many sequences in it so I'd have to additionally pick one reference or generate a consensus reference and then re-align to it, which seems like it could introduce reference-based bias.

                      I also tried out a set of ClustalW2/ANDES commands that is reference-free but it is too memory-intensive doing the MSA for the millions of reads that I have

                      I was thinking to go down the Mothur rabbit hole next because it looks like there is hope of combining tools to compress both unique sequences and sequences contained in others, and then keep track such that the consensus reflects the true base frequencies

                      Anyways, I am just wondering if any BBmap tools are suited to this task, or if you've used something I'm not thinking of (I took a quick look at clumpify but the results were >1 sequence outputted I believe)

                      Thx! Kate



                      • #56
                        @Kate: You should take a look at tadpole.sh from the BBMap suite, which is a k-mer based assembler. Someone recently used it to assemble the axolotl genome, so whatever you are working with may be feasible to do. See post #64 to get started.



                        • #57
                          Is your tool suitable for microbiome data where the reference database is a large number of bacterial 16S sequences?
                          I am looking for a tool that will take my fastq reads, align them to a given bacterial database, and then report the start and stop locations of where the reads aligned.

                          thanks!
                          Jen



                          • #58
                            BBsplit should work here. Make sure you clean fasta headers from your genome sequences (remove spaces in headers etc, make sure they are unique). You have the option of handling multi-mapping data in various ways (discard, assign to all genomes etc) with BBSplit. So consider those carefully.
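For the header cleanup, something along these lines would do. It is a minimal Python sketch and not part of BBTools: `clean_fasta_headers` is a hypothetical helper, and replacing whitespace with underscores and appending `_2`, `_3`, ... to duplicates are my own choices for illustration.

```python
def clean_fasta_headers(lines):
    """Yield FASTA lines with whitespace removed from headers and duplicate names made unique."""
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith(">"):
            yield line  # sequence lines pass through unchanged
            continue
        fields = line[1:].split()
        base = "_".join(fields) if fields else "seq"  # "seq" is a placeholder for empty headers
        name, n = base, 1
        while name in seen:  # keep bumping the suffix until the name is unused
            n += 1
            name = f"{base}_{n}"
        seen.add(name)
        yield ">" + name
```

To use it on a file, iterate over the open input handle and write each yielded line plus a newline to the output file.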

                            While you could make a BAM file(s) directly from BBSplit, you should split the data into separate fastq's first and then re-align to the respective genomes using "bbmap.sh". This avoids having a large number of @SQ header lines in BAM files which can cause problems with some tools (e.g. IGV).



                            • #59
                              Hi Brian,

                              Thanks for developing great tools for the community!
                              We are using BBSplit to separate insect transcripts from those of its symbiont in a ribodepleted transcriptome. However, the reference for the insect is a cDNA transcriptome and for the symbiont it's a genome. Do we need to map sequentially to bin reads for each species, given that the references are different, or can we include both references, a transcriptome and a genome, in the same command?

                              Thanks
                              Priya



                              • #60
                                @Priya: You should be able to include both sequences in the same command. There is always a debate about how to handle the multi-mapping (to both species) reads. First take a look to see how big that number is. If it is not large then you should be able to move forward.
