Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Hi Shini,

    I will certainly do this if I get a chance. Right now I'm working on 3 new pieces of software, which have priority... but I have more time for it than I did in January, so hopefully I'll be able to get to it sometime soon.

    -Brian

    Comment


    • Hi Brian,

      This sounds really great!

      Just FYI, I ran another large clustering job in less then 5 h, which required >1 month (!) using alternative tools, with similar output. As I said, it is only that dedupe does not provide the containment information... .

      Best wishes,
      Shini

      Comment


      • hey, Brian, I am not able to enable the multiple hits feature with bbmap. I actually want to study the indels. I constructed a ref with 3 templates, one original and two with ins and del for 2 and 3 bp, respectively. however, the output indicates that read can only have one hit (the original), even I set maxindel=10 maxsites=10.

        my command:

        bbmap.sh ref=../input/demo.genome.fa in=../input/query.fastq outm=test.sam threads=1 maxindel=10 maxsites=10
        what's the way out?

        ref seq:
        >chr1_ins
        atcggatCCAAAATCTAGCCGGTCCTCTCACAATAGCAAATTGGGCTTGCAAAATggCCCAAATGCTCTCTCCACATCTTTTCTAGCCGCCGCATGAGCATTGTGGAAGTttgac(insertion gg from >chr1)
        >chr1
        atcggatCCAAAATCTAGCCGGTCCTCTCACAATAGCAAATTGGGCTTGCAAAATCCCAAATGCTCTCTCCACATCTTTTCTAGCCGCCGCATGAGCATTGTGGAAGTttgac
        >chr2_del
        atcggatCCAAAATCTAGCCGGTCCTCTCACAATAGCAAATTGGGCTTGCAAAATAAATGCTCTCTCCACATCTTTTCTAGCCGCCGCATGAGCATTGTGGAAGTttgac(deletion ccc from >chr1)


        input:
        @HWI-ST822:1690U49ACXX:8:1101:2952:2236 1:N:0:CAGATC
        CCAAAATCTAGCCGGTCCTCTCACAATAGCAAATTGGGCTTGCAAAATCCCAAATGCTCTCTCCACATCTTTTCTAGCCGCCGCATGAGCATTGTGGAAGT
        +
        @C@FFDFFHHFFDIJDDEHHGDEHIIIBC>FEIADGAHHGHADHEDGIIHGDEGGEGGGGGIGIIEHEECD@CD>?CCED@@;B;9<>:@@CC:@C84:>>

        outputonly one hit)
        HWI-ST822:1690U49ACXX:8:1101:2952:2236 1:N:0:CAGATC 0 chr1 8 38 101= * 0 0 CCAAAATCTAGCCGGTCCTCTCACAATAGCAAATTGGGCTTGCAAAATCCCAAATGCTCTCTCCACATCTTTTCTAGC
        CGCCGCATGAGCATTGTGGAAGT @C@FFDFFHHFFDIJDDEHHGDEHIIIBC>FEIADGAHHGHADHEDGIIHGDEGGEGGGGGIGIIEHEECD@CD>?CCED@@;B;9<>:@@CC:@C84:>> NM:i:0 AM:i:38
        Last edited by yaanlpc; 10-04-2016, 03:20 PM.

        Comment


        • coverage stats

          Hi,

          I would like to calculate the average coverage of some genomes. I used:

          Code:
          bbmap.sh in1=reads_R1.fq in2=reads_R2.fq ref=scaffolds.fasta nodisk covstats=stats.txt covhist=histogram.txt
          I get nice output files with coverage etc. for each scaffold, but the average for the whole genome is shown only in the terminal. How could I output it to a file, and preferably in a way that I can get tens of genomes coverages summarized in the same file?

          Thanks!

          Comment


          • Originally posted by sebl View Post
            Hi,

            I would like to calculate the average coverage of some genomes. I used:

            Code:
            bbmap.sh in1=reads_R1.fq in2=reads_R2.fq ref=scaffolds.fasta nodisk covstats=stats.txt covhist=histogram.txt
            I get nice output files with coverage etc. for each scaffold, but the average for the whole genome is shown only in the terminal. How could I output it to a file, and preferably in a way that I can get tens of genomes coverages summarized in the same file?

            Thanks!
            Capture the standard out/error streams to a file (would depend on your shell but examples for bash are here) and you will have that information in a file.

            Comment


            • Originally posted by Brian Bushnell View Post
              You can filter proper pairs only with the "po" flag; unpaired reads will then be classified as unmapped. You can use the 'outm' stream to just get mapped reads. You can also filter with samtools after the fact.

              From these, you could filter only the pairs where at least one read is uniquely aligned (mapq>3 or some higher threshold for greater confidence), but this can't be done in BBMap currently; you could, however, do this with a custom script or with samtools.
              Hi Brian,

              I'm also struggling with the selection of unique alignments. I had previously used the option ambiguous=toss, but for some pair-end reads I only get the forward to be considered aligned. I saw your suggestions of removing the "ambiguos=toss" flag and later filter the alignment based on their mapq value. Nevertheless, I am curious on how to select a correct threshold for my filtering. I am using local=t alignment, is the maximum mapq score different when you use local as compared to global alignment? why do you suggest 4 as a good threshold?

              Thanks in advance !!!

              Carol

              Comment


              • Hi Brian,
                Is there a way to set a desired coverage for randomreads.sh rather than the total number of reads?

                On that note - I'm trying to simulate a metagenome with bbmap tools. Knowing BBMap there's probably a way to do that (especially since JGI has probably needed it too). So can BBmap simulate a metagenome (in terms of abundance profile) from a bunch of reference genomes?

                Comment


                • Originally posted by darthsequencer View Post
                  Hi Brian,
                  Is there a way to set a desired coverage for randomreads.sh rather than the total number of reads?

                  On that note - I'm trying to simulate a metagenome with bbmap tools. Knowing BBMap there's probably a way to do that (especially since JGI has probably needed it too). So can BBmap simulate a metagenome (in terms of abundance profile) from a bunch of reference genomes?
                  Unfortunately, no to both. And you are correct, simulating metagenomes is a need we have at the JGI so I probably should have developed one by now!

                  I'll look into making a wrapper for the purpose; it shouldn't take very long.

                  Comment


                  • Originally posted by Carol_B View Post
                    Hi Brian,

                    I'm also struggling with the selection of unique alignments. I had previously used the option ambiguous=toss, but for some pair-end reads I only get the forward to be considered aligned. I saw your suggestions of removing the "ambiguos=toss" flag and later filter the alignment based on their mapq value. Nevertheless, I am curious on how to select a correct threshold for my filtering. I am using local=t alignment, is the maximum mapq score different when you use local as compared to global alignment? why do you suggest 4 as a good threshold?

                    Thanks in advance !!!

                    Carol
                    Hi Carol,

                    The definition of the mapq is the Phred-scaled (decibel) probability that the mapping is incorrect; q=10 means a 10% chance of being incorrect, q=20 means 1%, q=30 means 0.1%, etc. q=3 indicates 50% - so q=3 or below indicate that the read is at least 50% likely to have come from somewhere else, while q=4 or above means the mapping is probably correct. Therefore, reads that map equally well to multiple locations should be assigned a mapping score of 3 or less, and a mapping score of 4 or more indicates the read cannot map anywhere else as well as the indicated position.

                    The local alignments will give different scores than global alignments because mismatches toward the ends of the reads are ignored. In practice, though, there will not be much difference in most cases; mainly just near the ends of contigs, or when dealing with reads that are chimeric/low-quality/not-adapter-trimmed, etc.

                    Comment


                    • Originally posted by Brian Bushnell View Post
                      Unfortunately, no to both. And you are correct, simulating metagenomes is a need we have at the JGI so I probably should have developed one by now!

                      I'll look into making a wrapper for the purpose; it shouldn't take very long.
                      Starting with a bunch of bacterial genomes in a file and using randomreads.sh should produce something acceptable, correct? I have used it in the past with a defined set of genomes to simulate a mixed dataset.

                      Comment


                      • You can simulate a mixed dataset just fine that way, but unfortunately metagenomes tend to have an exponential distribution of coverage. So to really simulate a metagenome, you need to mimic that, which would mean (for now) running randomreads on each genome individually with an exponential-random coverage target, then combining all the outputs together.

                        I did write a program "mutate.sh" that I use in simulating metagenomes, which takes a genome fasta and produces a mutant version of it at a specified error rate; I use it to do things like verify that error-correction is not erasing strain differences in metagenomes.

                        Anyway, I think it will be pretty easy to modify randomreads for a metagenome mode, so I'll try to do that today. And a coverage (rather than #reads) threshold is pretty easy as well.

                        Comment


                        • That would be awesome Brian! Most of the tools out there are bit wonky, slow, or have fallen to the wayside.

                          More of a nuts and bolts question but how do you handle fragmented reference genomes in a simulated metagenome? In the sense that one reference but is in multiple fasta files but will get assigned different coverages because most programs think each file is a different genome.

                          Comment


                          • I just uploaded BBMap v36.59 which adds "coverage" and "metagenome" flags.

                            coverage=X will automatically set "reads" to a level that will give X average coverage (decimal point is allowed).

                            metagenome will assign each scaffold a random exponential variable, which decides the probability that a read be generated from that scaffold. So, if you concatenate together 20 bacterial genomes, you can run randomreads and get a metagenomic-like distribution. It could also be used for RNA-seq when using a transcriptome reference.

                            The coverage is decided on a per-reference-sequence level, so if a bacterial assembly has more than one contig, you may want to glue them together first with fuse.sh before concatenating them with the other references.

                            Comment


                            • Great - Thanks a bunch!

                              Comment


                              • qtrim and maq in BBMap

                                Hi Brian,

                                I'm using BBMap to get reads that do not align to a genome. I want the output quality trimmed, and use
                                Code:
                                bbmap.sh in=raw_reads.fastq.gz outu=stdout.sam qin=auto qout=33 qtrim=lr maq=30 mintrimlength=30
                                The BBDuk docs state that the minimum average quality is checked after clipping. I assume this is the same in BBMap. However with this command I have output such as this:
                                Code:
                                1600:7:1:1413:2208      653     *       0       0       *       *       0       0       AAATGAAGAGAAAGATTACAGGGAAGTAAC  ##############################
                                The read should clearly be removed, I would think? Or does that work only for the mapped reads?
                                Also, is there an option to drop all reads that are clipped to below a certain length?

                                Btw, the option "minaveragequality" is not recognized, only the short version "maq". Not sure if that's intentional.

                                Cheers,

                                Till

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM
                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 06:37 PM
                                0 responses
                                11 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 06:07 PM
                                0 responses
                                10 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2024, 10:03 AM
                                0 responses
                                51 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-21-2024, 07:32 AM
                                0 responses
                                68 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X