Old 10-23-2017, 07:15 PM   #181
GenoMax

Can you check the R2 file by the same method?
Old 10-23-2017, 09:29 PM   #182
bio_d

Hi,

There seems to be nothing wrong with the second read either:

Code:
zcat mate_pair_2.fq.gz | sed -n '2051361,2051364p'

@HWI-D00294:282:CAB4VANXX:7:1101:13812:60135 2:N:0:GCCAAT
TTGAAGCAGCAGTTCAAAAACATTGTCTCAGTCTGTCTTAATTTGGTATAATCCCCTGAATCTATTAAACCAAGACCAGCTGTCTGACATTTTTCACTATTTTCTTTTCTCCGCTTGTTCTTTTC
+
@[email protected]@[email protected]>[email protected]@[email protected]@>FG>FG>DEG1E>[email protected]>[email protected]>GGFC=D>[email protected]>F>[email protected]>[email protected]

Old 11-02-2017, 09:32 AM   #183
santiagorevale

Quote:
Originally Posted by santiagorevale
Hi Brian,

I'm using "filterbyname.sh" script from bbmap v37.60 (using Java 1.8.0_102) to extract some reads from a FastQ file given a list of IDs.

The FastQ file has 196 million reads, of which I want to keep 85 million. The uncompressed FastQ file is 14G, the compressed one only 1.4G, and the IDs file is 3.1G.

When running the script with 24G of RAM it dies with an OutOfMemoryError. Isn't that excessive memory use for just filtering a FastQ file? Also, there is no "threads" option among the script arguments, yet the script uses all available cores. Is there any way to limit both memory and thread usage?

Here is the error:

Code:
java -ea -Xmx24G -cp /software/bbmap-37.60/current/ driver.FilterReadsByName -Xmx24G include=t in=Sample1.I1.fastq.gz out=filtered.Sample1.I1.fastq.gz names=reads.ids
Executing driver.FilterReadsByName [-Xmx24G, include=t, in=Sample1.I1.fastq.gz, out=filtered.Sample1.I1.fastq.gz, names=reads.ids]

Code:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.String.toLowerCase(String.java:2647)
at java.lang.String.toLowerCase(String.java:2670)
at driver.FilterReadsByName.<init>(FilterReadsByName.java:145)
at driver.FilterReadsByName.main(FilterReadsByName.java:40)

Thank you very much in advance.

Best regards,
Santiago
Hi there,

Any hints on the question above?

Thanks!
Old 11-06-2017, 08:08 PM   #184
TomHarrop

Hi Brian & others,

I tried to run bbnorm with a kmer size of 99, but it crashed with the following error:

Code:
Exception in thread "Thread-371" Exception in thread "Thread-357" Exception in thread "Thread-368" Exception in thread "Thread-377" Exception in thread "Thread-367" Exception in thread "Thread-362" Exception in thread "Thread-380" Exception in thread "Thread-363" Exception in thread "Thread-365" Exception in thread "Thread-364" Exception in thread "Thread-366" Exception in thread "Thread-358" Exception in thread "Thread-361" Exception in thread "Thread-360" Exception in thread "Thread-381" Exception in thread "Thread-387" Exception in thread "Thread-372" Exception in thread "Thread-399" java.lang.AssertionError: this function not tested with k>31
    at jgi.KmerNormalize.correctErrors(KmerNormalize.java:2124)
    at jgi.KmerNormalize.access$19(KmerNormalize.java:2121)
    at jgi.KmerNormalize$ProcessThread.normalizeInThread(KmerNormalize.java:3043)
    at jgi.KmerNormalize$ProcessThread.run(KmerNormalize.java:2806)
I'm wondering if I used a bad combination of parameters. Here's the call:

Code:
java -ea -Xmx132160m -Xms132160m -cp PATH/TO/bbmap/current/ jgi.KmerNormalize bits=32 in=output/trim_decon/reads.fastq.gz threads=50 out=output/k_99/norm/normalised.fastq.gz zl=9 hist=output/k_99/norm/hist_before.txt histout=output/k_99/norm/hist_after.txt target=50 min=5 prefilter ecc k=99 peaks=output/k_99/norm/peaks.txt
Otherwise, are larger kmer sizes supported in bbnorm at all, or would you recommend estimating the target coverage for k = 99 based on the coverage at k = 31?
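In the meantime, since the assertion points at the error-correction pass (correctErrors), my plan is to correct errors separately at the default k and then normalize at k=99 without ecc. An untested sketch (assuming ecc.sh is the standalone wrapper for BBNorm's correction mode):

Code:
# error-correct first at the default k (the k>31 assertion is in the ecc path)
ecc.sh in=output/trim_decon/reads.fastq.gz out=output/trim_decon/reads_ecc.fastq.gz
# then normalize at k=99 with error correction switched off
bbnorm.sh in=output/trim_decon/reads_ecc.fastq.gz out=output/k_99/norm/normalised.fastq.gz k=99 target=50 min=5 prefilter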

I've posted the log on pastebin: https://pastebin.com/jPkKagFs

Thanks again for the bbmap suite!

Tom
Old 11-07-2017, 02:54 AM   #185
boulund

Hi, just want to make sure I'm not missing anything here: randomreads.sh cannot produce metagenomes according to a specific abundance profile, right? I can only find information about it drawing a random number from an exponential distribution for each reference sequence, thus producing a simulated metagenome from a set of reference sequences. That is of course awesome, but right now I would like to produce a simulated metagenome with a very specific composition.
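The workaround I'm considering is to run randomreads.sh once per genome, with a read count proportional to the target abundance, and then concatenate the results. An untested sketch (abundances.tsv is a hypothetical two-column file of mine: fasta path, read count):

Code:
# simulate each genome at its own depth, then merge everything
while IFS=$'\t' read -r genome nreads; do
    randomreads.sh ow=t ref="$genome" reads="$nreads" len=150 paired out="sim_$(basename "$genome").fq.gz"
done < abundances.tsv
# concatenated gzip streams are still valid gzip
cat sim_*.fq.gz > metagenome.fq.gz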
Old 11-07-2017, 05:01 AM   #186
GenoMax

Quote:
Originally Posted by santiagorevale
Hi there,

Any hints on the question above?

Thanks!
Perhaps. But if you have more memory available, why not allocate more and see if that helps? Unless you are being charged for every megabyte you use.
Old 11-07-2017, 06:00 AM   #187
santiagorevale

Quote:
Originally Posted by GenoMax
Perhaps. But if you have more memory available, why not allocate more and see if that helps? Unless you are being charged for every megabyte you use.
Hi GenoMax,

Because I'm running this on a cluster, getting more memory means requesting more cores (slots), and jobs that request more cores take longer to be scheduled. Also, I was running this command alongside other commands requiring the same number of cores.

However, isn't it odd for the script to require much more memory than the combined size of the uncompressed FastQ and IDs files?

Thanks!
Old 11-07-2017, 07:27 AM   #188
GenoMax

Quote:
Originally Posted by santiagorevale
Hi GenoMax,

Because I'm running this on a cluster, getting more memory means requesting more cores (slots), and jobs that request more cores take longer to be scheduled. Also, I was running this command alongside other commands requiring the same number of cores.

However, isn't it odd for the script to require much more memory than the combined size of the uncompressed FastQ and IDs files?

Thanks!
While that is an odd restriction, it is what it is when one is using shared compute resources.

Just for kicks, have you tried running this on a local desktop with a decent amount of RAM (16G)? As you say, just keeping fastq headers in memory should not take a large amount of RAM.
Old 11-07-2017, 09:59 AM   #189
santiagorevale

Quote:
Originally Posted by GenoMax
While that is an odd restriction, it is what it is when one is using shared compute resources.

Just for kicks, have you tried running this on a local desktop with a decent amount of RAM (16G)? As you say, just keeping fastq headers in memory should not take a large amount of RAM.
I tried running it on a computer with 8 GB of RAM, and on cluster nodes with -Xmx limits of 18 GB and 24 GB (the nodes have between 96 and 128 GB in total).

I wasn't saying before that keeping headers in memory takes lots of RAM. I was trying to say that I couldn't understand why it ran out of memory with 24 GB: even if the program loaded both files (FastQ and IDs) entirely into memory (I don't know how it works internally), that would add up to 17.1 GB, so even in that scenario it should not have run out of memory.

I ran the command on 232 sets of files with -Xmx18G, with the following results:
- Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded, 39 times

Code:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.LinkedHashMap.newNode(LinkedHashMap.java:256)
        at java.util.HashMap.putVal(HashMap.java:641)
        at java.util.HashMap.put(HashMap.java:611)
        at java.util.HashSet.add(HashSet.java:219)
        at shared.Tools.addNames(Tools.java:456)
        at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
        at driver.FilterReadsByName.main(FilterReadsByName.java:40)
- Exception in thread "main" java.lang.OutOfMemoryError: Java heap space, 8 times

Code:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.lang.StringCoding.decode(StringCoding.java:187)
        at java.lang.StringCoding.decode(StringCoding.java:254)
        at java.lang.String.<init>(String.java:546)
        at java.lang.String.<init>(String.java:566)
        at shared.Tools.addNames(Tools.java:456)
        at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
        at driver.FilterReadsByName.main(FilterReadsByName.java:40)
- java.lang.OutOfMemoryError: GC overhead limit exceeded, 5 times

Code:
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOfRange(Arrays.java:3520)
        at stream.KillSwitch.copyOfRange(KillSwitch.java:300)
        at fileIO.ByteFile1.nextLine(ByteFile1.java:164)
        at shared.Tools.addNames(Tools.java:454)
        at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
        at driver.FilterReadsByName.main(FilterReadsByName.java:40)

This program ran out of memory.
Try increasing the -Xmx flag and using tool-specific memory-related parameters.
I couldn't identify a particular reason for each of the three different errors, but what I can tell is that the trigger is related to the number of reads kept: all of the processes that failed were trying to retain at least 56,881,244 paired-end reads, while the largest run that succeeded retained 50,519,102.

One thing I realise could be contributing to the crashes is that there is no way to limit the number of threads it uses, so it always uses all the available cores in the machine. Even if you launch it with the option "threads=1" (which is currently not documented as an option for "filterbyname"), you get the message "Set threads to 1" but it still uses all of them.

I don't want you to make this a priority because I managed to work around it, but I think it is something worth checking. Also, I think being able to limit threads should be a must for every command, because in most scenarios these tools are run on shared servers/clusters.
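For anyone else hitting this, one untested way to keep the memory bounded is to split the IDs file into chunks and filter in batches (a sketch; the chunk size is arbitrary, and concatenated gzip streams are still valid gzip):

Code:
# each run only has to hold one chunk of names in memory
split -l 10000000 reads.ids ids_chunk_
for chunk in ids_chunk_*; do
    filterbyname.sh -Xmx8g include=t in=Sample1.I1.fastq.gz out="filtered.$chunk.fastq.gz" names="$chunk"
done
cat filtered.ids_chunk_*.fastq.gz > filtered.Sample1.I1.fastq.gz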

Thanks for your help!
Old 12-04-2017, 03:49 AM   #190
vzinche
randomreads.sh for huge data

Hello!

I am trying to simulate reads from many genomes using the metagenome mode of randomreads.
The problem is that the more genomes I use, the worse the simulated reads seem to be. When I use 100 genomes, around 99% of the reads can be mapped back (using bbmap) to the original genomes, but with 1000 genomes only around 30-40% of the generated reads map. Is there a reasonable explanation for this?
Old 12-04-2017, 04:32 AM   #191
GenoMax

Quote:
Originally Posted by vzinche
Hello!

I am trying to simulate reads from many genomes using the metagenome mode of randomreads.
The problem is that the more genomes I use, the worse the simulated reads seem to be. When I use 100 genomes, around 99% of the reads can be mapped back (using bbmap) to the original genomes, but with 1000 genomes only around 30-40% of the generated reads map. Is there a reasonable explanation for this?
How similar are those 1000 genomes, and what multi-mapping parameters are you using with BBMap? As the number of similar genomes increases, the number of reads that multi-map goes up as well. You could use "ambig=all" to allow reads to map to every location/genome, which will likely raise the % of aligned reads, but you lose specificity at that point. Another thing you could do is generate longer reads, which will increase mapping specificity.
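For example (file names hypothetical):

Code:
bbmap.sh ref=all_1000_genomes.fa in=simulated_reads.fq.gz out=mapped.sam ambig=all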

Can you say what the reason behind this exercise is, and what exact parameters you used for the randomreads.sh and bbmap.sh runs?
Old 12-04-2017, 05:37 AM   #192
vzinche

Sorry, I didn't describe the problem well enough in the previous message.

The mapping isn't the main goal, nor the main problem.
I need to simulate a huge metagenomic dataset (1000 genomes) for further usage, but I need to carefully keep track of the positions of the reads on the genomes.
The dataset was simulated with the following parameters: len=250 paired coverage=5 metagenome=t simplenames snprate=0.02
When I manually compared the genome sequence between the positions stated in the read header against the actual read sequence, for most reads the two were too different (a blast alignment of the sequences showed no similarity), though some matched perfectly. I checked only +strand reads for simplicity.
That's why I had the idea of running BBMap to estimate the number of reads that can't even be mapped back to the original genomes. I ran it with all default parameters and it could map only around 35% of the reads.
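To illustrate, this is roughly how I checked each read (a sketch; the contig name and coordinates are placeholders that I read out of one header, and my parsing of the simplenames layout may be off):

Code:
# pull out the interval the read header claims the read came from
samtools faidx genomes.fa "contig_17:523401-523650" > expected.fa
# compare the extracted interval against the read itself
blastn -query read.fa -subject expected.fa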

But when I redid the same thing with 100 genomes (randomly sampled from those 1000), I couldn't find any of these 'messed up' reads and could map more than 99%.
As the number of genomes increases, the percentage of mapped reads decreases.

The genomes are not very closely related, and changing the number of genomes used didn't really affect their overall similarity.

Thus, my main concern is not the mapping itself, but the source of these 'messed up' reads.
Old 12-04-2017, 06:20 AM   #193
GenoMax

@Brian will likely have to weigh in on this (especially the mismatch between the positions stated in the read headers and the actual read sequences), but be aware that he has been behind on support of late.

A few things to check that I can think of:

1. If you are only going to check the + strands then perhaps you should have used the samestrand=t option when generating the reads (see the sketch after this list).
2. The default value for BBMap is ambig=best. Can you try mapping with ambig=all to see if that improves the alignments?
3. Do you know why the remaining reads are not mapping (are they chimeras)?
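For point 1, that would be something like this (untested; the other parameters are copied from your earlier post):

Code:
randomreads.sh ref=genomes.fa metagenome=t paired len=250 coverage=5 snprate=0.02 simplenames samestrand=t out=simulated.fq.gz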
Old 12-04-2017, 06:29 AM   #194
vzinche

I will try that, thank you.

And regarding the third question, that is actually the problem: I have no idea where these reads come from. I tried searching for them, or parts of them, in the original genomes, but with no success. They could be chimeras made up of short sequences, but I can't say for sure.

My first thought was that it could be some kind of memory problem, since it gets worse as the size of the input file grows, but that's just a guess.
Old 12-04-2017, 07:01 AM   #195
GenoMax

Have you looked through the logs and such to see if there is any indication of issues? There is always the possibility that @Brian has not checked an extreme usage case like this for randomreads.sh, and this may be a genuine bug, which would clearly be a road-block.

Since you have said that 100 genomes seem to work fine, you could do 10 runs of 100 genomes each and then merge the data. Just a thought; a rough sketch is below.
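A sketch of that idea (untested; genome_list.txt is a hypothetical file with one fasta path per line):

Code:
split -l 100 genome_list.txt batch_
for b in batch_*; do
    xargs cat < "$b" > "$b.fa"   # combine the 100 genomes of this batch
    randomreads.sh ow=t ref="$b.fa" metagenome=t paired len=250 coverage=5 out="sim_$b.fq.gz"
done
# merge the per-batch reads
cat sim_batch_*.fq.gz > sim_all.fq.gz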
Old 12-11-2017, 09:21 PM   #196
mcmc
BBSplit ambig=toss

Hi Brian et al.,
When I run BBSplit with ambig=toss, the ambiguous reads are not written to unmapped.fq, but when I run BBMap they are. Is this the expected behavior? I'd like to be able to retrieve the ambiguous reads from BBSplit (both within and between the two references).
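For reference, I'm invoking it roughly like this (paths hypothetical; if I read the docs correctly, BBSplit also has a separate ambiguous2= setting for reads that are ambiguous between references rather than within one, which may be relevant here):

Code:
bbsplit.sh ref=refA.fa,refB.fa in=reads.fq basename=out_%.fq ambiguous=toss ambiguous2=toss outu=unmapped.fq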
Thanks,
MC
Old 12-13-2017, 09:29 AM   #197
mcmc
summarizing mapped reads by orf

Is there a way to use BBTools to summarize reads mapped to a genome (with BBMap/BBSplit, in a sam file) by ORF? I see that pileup.sh will take a Prodigal-output fasta file with ORF info, but I've got a genome downloaded from RefSeq with all the NCBI files (gff, cds fasta, gb). Can BBTools parse one of these to summarize my sam file by ORF?

While I could map to the orfs.fna instead, I'm interested in the intergenic regions too, e.g. for ORF/RNA discovery.

Thanks,
MCMC
Old 12-13-2017, 10:27 AM   #198
GenoMax

Quote:
Originally Posted by mcmc
Is there a way to use BBTools to summarize reads mapped to a genome (with BBMap/BBSplit, in a sam file) by ORF? I see that pileup.sh will take a Prodigal-output fasta file with ORF info, but I've got a genome downloaded from RefSeq with all the NCBI files (gff, cds fasta, gb). Can BBTools parse one of these to summarize my sam file by ORF?

While I could map to the orfs.fna instead, I'm interested in the intergenic regions too, e.g. for ORF/RNA discovery.

Thanks,
MCMC
BBTools currently has no counting utilities. They may be on the wish list, since many have asked Brian for them. For now, your best bet is to use featureCounts.
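For example, with the RefSeq GFF (a sketch; featureCounts is part of the Subread package, and the -t/-g attribute names vary between annotations, so you may need to adjust them):

Code:
featureCounts -a genome.gff -t CDS -g ID -o counts_by_orf.txt mapped.sam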
Old 12-13-2017, 01:36 PM   #199
mcmc

Quote:
Originally Posted by GenoMax
BBTools currently has no counting utilities. They may be on the wish list, since many have asked Brian for them. For now, your best bet is to use featureCounts.
Thanks! I'm surprised there's something BBTools doesn't do.