Seqanswers Leaderboard Ad

**Brian Bushnell** · 09-26-2016, 09:30 AM

Hi Shini,

I will certainly do this if I get a chance. Right now I'm working on 3 new pieces of software, which have priority... but I have more time for it than I did in January, so hopefully I'll be able to get to it sometime soon.

-Brian

**Shini Sunagawa** · 09-27-2016, 04:13 AM

Hi Brian,

This sounds really great!

Just FYI, I ran another large clustering job in less then 5 h, which required >1 month (!) using alternative tools, with similar output. As I said, it is only that dedupe does not provide the containment information... .

Best wishes,
Shini

**yaanlpc** · 10-04-2016, 03:12 PM

hey, Brian, I am not able to enable the multiple hits feature with bbmap. I actually want to study the indels. I constructed a ref with 3 templates, one original and two with ins and del for 2 and 3 bp, respectively. however, the output indicates that read can only have one hit (the original), even I set maxindel=10 maxsites=10.

my command:

bbmap.sh ref=../input/demo.genome.fa in=../input/query.fastq outm=test.sam threads=1 maxindel=10 maxsites=10

what's the way out?

ref seq:
>chr1_ins
atcggatCCAAAATCTAGCCGGTCCTCTCACAATAGCAAATTGGGCTTGCAAAATggCCCAAATGCTCTCTCCACATCTTTTCTAGCCGCCGCATGAGCATTGTGGAAGTttgac(insertion gg from >chr1)
>chr1
atcggatCCAAAATCTAGCCGGTCCTCTCACAATAGCAAATTGGGCTTGCAAAATCCCAAATGCTCTCTCCACATCTTTTCTAGCCGCCGCATGAGCATTGTGGAAGTttgac
>chr2_del
atcggatCCAAAATCTAGCCGGTCCTCTCACAATAGCAAATTGGGCTTGCAAAATAAATGCTCTCTCCACATCTTTTCTAGCCGCCGCATGAGCATTGTGGAAGTttgac(deletion ccc from >chr1)

input:
@HWI-ST822:169

0U49ACXX:8:1101:2952:2236 1:N:0:CAGATC
CCAAAATCTAGCCGGTCCTCTCACAATAGCAAATTGGGCTTGCAAAATCCCAAATGCTCTCTCCACATCTTTTCTAGCCGCCGCATGAGCATTGTGGAAGT
+
@C@FFDFFHHFFDIJDDEHHGDEHIIIBC>FEIADGAHHGHADHEDGIIHGDEGGEGGGGGIGIIEHEECD@CD>?CCED@@;B;9<>:@@CC:@C84:>>

output

only one hit)
HWI-ST822:169

0U49ACXX:8:1101:2952:2236 1:N:0:CAGATC 0 chr1 8 38 101= * 0 0 CCAAAATCTAGCCGGTCCTCTCACAATAGCAAATTGGGCTTGCAAAATCCCAAATGCTCTCTCCACATCTTTTCTAGC
CGCCGCATGAGCATTGTGGAAGT @C@FFDFFHHFFDIJDDEHHGDEHIIIBC>FEIADGAHHGHADHEDGIIHGDEGGEGGGGGIGIIEHEECD@CD>?CCED@@;B;9<>:@@CC:@C84:>> NM:i:0 AM:i:38

**sebl** · 10-09-2016, 05:19 AM

coverage stats

Hi,

I would like to calculate the average coverage of some genomes. I used:

Code:

bbmap.sh in1=reads_R1.fq in2=reads_R2.fq ref=scaffolds.fasta nodisk covstats=stats.txt covhist=histogram.txt

I get nice output files with coverage etc. for each scaffold, but the average for the whole genome is shown only in the terminal. How could I output it to a file, and preferably in a way that I can get tens of genomes coverages summarized in the same file?

Thanks!

**GenoMax** · 10-10-2016, 03:50 AM

Originally posted by sebl View Post

Hi,

I would like to calculate the average coverage of some genomes. I used:

Code:

bbmap.sh in1=reads_R1.fq in2=reads_R2.fq ref=scaffolds.fasta nodisk covstats=stats.txt covhist=histogram.txt

I get nice output files with coverage etc. for each scaffold, but the average for the whole genome is shown only in the terminal. How could I output it to a file, and preferably in a way that I can get tens of genomes coverages summarized in the same file?

Thanks!

Capture the standard out/error streams to a file (would depend on your shell but examples for bash are here) and you will have that information in a file.

**Carol_B** · 10-15-2016, 03:33 AM

Originally posted by Brian Bushnell View Post

You can filter proper pairs only with the "po" flag; unpaired reads will then be classified as unmapped. You can use the 'outm' stream to just get mapped reads. You can also filter with samtools after the fact.

From these, you could filter only the pairs where at least one read is uniquely aligned (mapq>3 or some higher threshold for greater confidence), but this can't be done in BBMap currently; you could, however, do this with a custom script or with samtools.

Hi Brian,

I'm also struggling with the selection of unique alignments. I had previously used the option ambiguous=toss, but for some pair-end reads I only get the forward to be considered aligned. I saw your suggestions of removing the "ambiguos=toss" flag and later filter the alignment based on their mapq value. Nevertheless, I am curious on how to select a correct threshold for my filtering. I am using local=t alignment, is the maximum mapq score different when you use local as compared to global alignment? why do you suggest 4 as a good threshold?

Thanks in advance !!!

Carol

**darthsequencer** · 11-07-2016, 07:54 PM

Hi Brian,
Is there a way to set a desired coverage for randomreads.sh rather than the total number of reads?

On that note - I'm trying to simulate a metagenome with bbmap tools. Knowing BBMap there's probably a way to do that (especially since JGI has probably needed it too). So can BBmap simulate a metagenome (in terms of abundance profile) from a bunch of reference genomes?

**Brian Bushnell** · 11-08-2016, 10:19 AM

Originally posted by darthsequencer View Post

Hi Brian,
Is there a way to set a desired coverage for randomreads.sh rather than the total number of reads?

On that note - I'm trying to simulate a metagenome with bbmap tools. Knowing BBMap there's probably a way to do that (especially since JGI has probably needed it too). So can BBmap simulate a metagenome (in terms of abundance profile) from a bunch of reference genomes?

Unfortunately, no to both. And you are correct, simulating metagenomes is a need we have at the JGI so I probably should have developed one by now!

I'll look into making a wrapper for the purpose; it shouldn't take very long.

**Brian Bushnell** · 11-08-2016, 10:25 AM

Originally posted by Carol_B View Post

Hi Brian,

I'm also struggling with the selection of unique alignments. I had previously used the option ambiguous=toss, but for some pair-end reads I only get the forward to be considered aligned. I saw your suggestions of removing the "ambiguos=toss" flag and later filter the alignment based on their mapq value. Nevertheless, I am curious on how to select a correct threshold for my filtering. I am using local=t alignment, is the maximum mapq score different when you use local as compared to global alignment? why do you suggest 4 as a good threshold?

Thanks in advance !!!

Carol

Hi Carol,

The definition of the mapq is the Phred-scaled (decibel) probability that the mapping is incorrect; q=10 means a 10% chance of being incorrect, q=20 means 1%, q=30 means 0.1%, etc. q=3 indicates 50% - so q=3 or below indicate that the read is at least 50% likely to have come from somewhere else, while q=4 or above means the mapping is probably correct. Therefore, reads that map equally well to multiple locations should be assigned a mapping score of 3 or less, and a mapping score of 4 or more indicates the read cannot map anywhere else as well as the indicated position.

The local alignments will give different scores than global alignments because mismatches toward the ends of the reads are ignored. In practice, though, there will not be much difference in most cases; mainly just near the ends of contigs, or when dealing with reads that are chimeric/low-quality/not-adapter-trimmed, etc.

**GenoMax** · 11-08-2016, 10:36 AM

Originally posted by Brian Bushnell View Post

Unfortunately, no to both. And you are correct, simulating metagenomes is a need we have at the JGI so I probably should have developed one by now!

I'll look into making a wrapper for the purpose; it shouldn't take very long.

Starting with a bunch of bacterial genomes in a file and using randomreads.sh should produce something acceptable, correct? I have used it in the past with a defined set of genomes to simulate a mixed dataset.

**Brian Bushnell** · 11-08-2016, 10:43 AM

You can simulate a mixed dataset just fine that way, but unfortunately metagenomes tend to have an exponential distribution of coverage. So to really simulate a metagenome, you need to mimic that, which would mean (for now) running randomreads on each genome individually with an exponential-random coverage target, then combining all the outputs together.

I did write a program "mutate.sh" that I use in simulating metagenomes, which takes a genome fasta and produces a mutant version of it at a specified error rate; I use it to do things like verify that error-correction is not erasing strain differences in metagenomes.

Anyway, I think it will be pretty easy to modify randomreads for a metagenome mode, so I'll try to do that today. And a coverage (rather than #reads) threshold is pretty easy as well.

**darthsequencer** · 11-08-2016, 12:03 PM

That would be awesome Brian! Most of the tools out there are bit wonky, slow, or have fallen to the wayside.

More of a nuts and bolts question but how do you handle fragmented reference genomes in a simulated metagenome? In the sense that one reference but is in multiple fasta files but will get assigned different coverages because most programs think each file is a different genome.

**Brian Bushnell** · 11-08-2016, 03:19 PM

I just uploaded BBMap v36.59 which adds "coverage" and "metagenome" flags.

coverage=X will automatically set "reads" to a level that will give X average coverage (decimal point is allowed).

metagenome will assign each scaffold a random exponential variable, which decides the probability that a read be generated from that scaffold. So, if you concatenate together 20 bacterial genomes, you can run randomreads and get a metagenomic-like distribution. It could also be used for RNA-seq when using a transcriptome reference.

The coverage is decided on a per-reference-sequence level, so if a bacterial assembly has more than one contig, you may want to glue them together first with fuse.sh before concatenating them with the other references.

**darthsequencer** · 11-08-2016, 03:22 PM

Great - Thanks a bunch!

**ShellfishGene** · 11-18-2016, 03:02 AM

qtrim and maq in BBMap

Hi Brian,

I'm using BBMap to get reads that do not align to a genome. I want the output quality trimmed, and use

Code:

bbmap.sh in=raw_reads.fastq.gz outu=stdout.sam qin=auto qout=33 qtrim=lr maq=30 mintrimlength=30

The BBDuk docs state that the minimum average quality is checked after clipping. I assume this is the same in BBMap. However with this command I have output such as this:

Code:

1600:7:1:1413:2208      653     *       0       0       *       *       0       0       AAATGAAGAGAAAGATTACAGGGAAGTAAC  ##############################

The read should clearly be removed, I would think? Or does that work only for the mapped reads?
Also, is there an option to drop all reads that are clipped to below a certain length?

Btw, the option "minaveragequality" is not recognized, only the short version "maq". Not sure if that's intentional.

Cheers,

Till

Topics	Statistics	Last Post
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, Yesterday, 05:31 AM	0 responses 10 views 0 likes	Last Post by seqadmin Yesterday, 05:31 AM
Small Blood Stem Cell Subset Linked to Immune System Aging by seqadmin Started by seqadmin, 10-24-2024, 06:58 AM	0 responses 20 views 0 likes	Last Post by seqadmin 10-24-2024, 06:58 AM
New AI Model Designs Synthetic DNA Switches for Targeted Gene Expression in Specific Cell Types by seqadmin Started by seqadmin, 10-23-2024, 08:43 AM	0 responses 50 views 0 likes	Last Post by seqadmin 10-23-2024, 08:43 AM
Microbes in Urban Spaces Adapt to Disinfectants and Scarce Resources by seqadmin Started by seqadmin, 10-17-2024, 07:29 AM	0 responses 58 views 0 likes	Last Post by seqadmin 10-17-2024, 07:29 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News