SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   Yes .. BBMap can do that! (http://seqanswers.com/forums/showthread.php?t=58221)

GenoMax 04-12-2016 03:20 AM

Quote:

Originally Posted by susanklein (Post 192170)
Hi,

Can you suggest how to get raw reads mapped per gene from BBMap?

Thanks,

S.

Based on your other posts you appear to have discovered htseq-count and featureCounts. The BBMap suite does not have a read-counting program (though @Brian has one on the wishlist of things to add to BBMap).
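
If you end up using featureCounts, a minimal invocation would look something like this (an untested sketch; the annotation and bam file names are placeholders):

Code:

featureCounts -a genes.gtf -o counts.txt aligned.bam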

JeanLove 05-28-2016 02:08 AM

Hi! Is there a way to extract specific sequences from a fasta file based on their positions in the file, using BBMap tools? I know the numbers of their name lines and want to extract the whole sequences to a file.

GenoMax 05-28-2016 04:45 AM

Quote:

Originally Posted by JeanLove (Post 195154)
Hi! Is there a way to extract specific sequences from a fasta file based on their positions in the file, using BBMap tools? I know the numbers of their name lines and want to extract the whole sequences to a file.

It is not clear if you know their identifiers (or just their positions). If you have the identifiers then you can do the following:

Quote:

By default, "filterbyname" discards reads with names in your name list, and keeps the rest. To include them and discard the others, do this:

Code:

$ filterbyname.sh in=003.fastq out=filter003.fq names=names003.txt include=t

Even though this example is for fastq files, I am reasonably certain that it would work for fasta. Make sure your file names end in .fa to facilitate that.

In case BBMap does not work, you can try the faSomeRecords utility from Jim Kent @UCSC, which is described here: http://seqanswers.com/forums/showthread.php?t=64004
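
For reference, the basic usage of faSomeRecords is (if I recall correctly; file names are placeholders):

Code:

faSomeRecords input.fa names003.txt output.fa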

JeanLove 05-28-2016 07:26 AM

Quote:

Originally Posted by GenoMax (Post 195158)
It is not clear if you know their identifiers (or just their positions). If you have the identifiers then you can do the following:

I do not know the identifiers... :(


Quote:

Originally Posted by GenoMax (Post 195158)
In case BBMap does not work you can try faSomeRecords utility from Jim Kent @UCSC which is described here: http://seqanswers.com/forums/showthread.php?t=64004

Thanks for your help, I'll look the faSomeRecords utility up! :)

Brian Bushnell 05-28-2016 07:38 AM

If you know the sequence numbers, you can use getreads.sh:

Quote:

Usage: getreads.sh in=<file> id=<number,number,number...> out=<file>

The first read (or pair) has ID 0, the second read (or pair) has ID 1, etc.

Parameters:
in=<file> Specify the input file, or stdin.
out=<file> Specify the output file, or stdout.
id= Comma delimited list of numbers or ranges, in any order.
For example: id=5,93,17-31,8,0,12-13

JeanLove 05-29-2016 12:59 AM

Quote:

Originally Posted by Brian Bushnell (Post 195161)
If you know the sequence numbers, you can use getreads.sh:

I have tried it, but I get a Java error:

Exception in thread "main" java.lang.RuntimeException: GetReads terminated in an error state; the output may be corrupt.
at jgi.GetReads.<init>(GetReads.java:321)
at jgi.GetReads.main(GetReads.java:37)

What could be the cause? The input file is a normal .fasta file with 22,000 reads from an Illumina MiSeq.

Brian Bushnell 05-29-2016 04:01 AM

Can you please post the exact command you used, and the entire screen output?

JeanLove 05-29-2016 05:06 AM

Quote:

Originally Posted by Brian Bushnell (Post 195172)
Can you please post the exact command you used, and the entire screen output?

getreads.sh in=af_svi_0.05Perc_22633.fasta out=bla.fasta id=6272,19594

java -ea -Xmx200m -cp /common/WORK/bbmap/current/ jgi.GetReads in=af_svi_0.05Perc_22633.fasta out=bla.fasta id=6272,19594
Executing jgi.GetReads [in=af_svi_0.05Perc_22633.fasta, out=bla.fasta, id=6272,19594]

Input is unpaired
java.lang.AssertionError: Attempting to read from a closed file. Current header: MISEQ:70:000000000-A4FTJ:1:2111:9262:20597 1:N:0:
at stream.FastaReadInputStream.nextBases(FastaReadInputStream.java:357)
at stream.FastaReadInputStream.generateRead(FastaReadInputStream.java:237)
at stream.FastaReadInputStream.fillList(FastaReadInputStream.java:210)
at stream.FastaReadInputStream.nextList(FastaReadInputStream.java:121)
at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
Time: 0.167 seconds.
Reads Processed: 19595 117.09k reads/sec
Bases Processed: 4621k 27.62m bases/sec
Exception in thread "main" java.lang.RuntimeException: GetReads terminated in an error state; the output may be corrupt.
at jgi.GetReads.<init>(GetReads.java:321)
at jgi.GetReads.main(GetReads.java:37)

I get the output file and it seems normal, but I don't know why I get this error.

Canadian_philosophy 06-03-2016 12:26 PM

BBMap
 
1) I am interested in understanding how BBMap simultaneously aligns reads to multiple genomes. Is this different from using bwa to align to multiple genomes sequentially?

2) I used BBSplit to split my fastq into hg19 reads and mouse reads. The output was only in sam format (although I tried to name the output files as _1.fastq.gz and _2.fastq.gz). Is it possible to directly get fastq outputs after aligning to multiple genomes?
with something like this :
out_hg19_1=SL118119_t3_hg19_gatkorder_R1.fastq, out_hg19_2=SL118119_t3_hg19_gatkorder_R2.fastq, out_mm10_1=SL118119_t3_mm10_gatkorder_R1.fastq, out_mm10_2=SL118119_t3_mm10_gatkorder_R2.fastq

3) The sam output header for both the .mm10.sam and .hg19.sam had both mouse and human chromosomes. I had to manually edit this for samtools to do post processing. Is there a way out of this?

Brian Bushnell 06-03-2016 01:05 PM

1) The difference is only in highly conserved regions. If you first map to mouse, then map the unmapped reads to human, human reads from regions with high identity to mouse will already have been removed, because they mapped to the mouse. When mapping to both genomes at once, the reads will go to the organism they actually came from because they (should) match that one better.

2) Can you give your entire command line? I just tried it and verified that I got fastq output when I used this command:

bbsplit.sh in=reads.fq basename=o_%.fq

3) Sorry, that's kind of inevitable. Actually I do not recommend using sam output from BBSplit, only fastq output. Sam output will have the ambiguously-mapped reads listed as mapped to only one organism if you are in ambig=all mode. For example, if a read maps to human and mouse equally well, it will be output in both sam files, but in each case it will be listed as mapped to the same location (which is either human or mouse).

Canadian_philosophy 06-03-2016 01:20 PM

Hi Brian,

Thanks for responding.

1) That is what I assumed; I wanted to hear it from someone else here :)

2) These are things I am trying with bbsplit:

Trial 1 :
Code:


java -Djava.library.path=/data/davelab/dr153/softwares/bbmap/jni/ -ea -Xmx56G -cp /data/davelab/dr153/softwares/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6
in1=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R1_001.fastq.gz, in2=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R2_001.fastq.gz, out_hg19=SL118119_t4_hg19_gatkorder_R1.fastq, out_hg19=SL118119_t4_hg19_gatkorder_R2.fastq, out_mm10=SL118119_t4_mm10_gatkorder_R1.fastq, out_mm10=SL118119_t4_mm10_gatkorder_R2.fastq, ref=/data/davelab/bix/resources/genomes/hg19/ucsc.hg19.fasta,/data/davelab/bix/resources/genomes/mm10/mm10_gatkorder.fa, ambiguous2=toss

Trial 2:
Code:

java -Djava.library.path=/data/davelab/dr153/softwares/bbmap/jni/ -ea -Xmx56G -cp /data/davelab/dr153/softwares/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 in1=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R1_001.fastq.gz, in2=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R2_001.fastq.gz, out_hg19_1=SL118119_t3_hg19_gatkorder_R1.fastq, out_hg19_2=SL118119_t3_hg19_gatkorder_R2.fastq, out_mm10_1=SL118119_t3_mm10_gatkorder_R1.fastq, out_mm10_2=SL118119_t3_mm10_gatkorder_R2.fastq, ref=/data/davelab/bix/resources/genomes/hg19/ucsc.hg19.fasta,/data/davelab/bix/resources/genomes/mm10/mm10_gatkorder.fa, ambiguous2=toss
3) Thanks for clarifying the bam issue. I see that fastq is the recommended output.

Thanks,
Deepthi

Brian Bushnell 06-03-2016 01:24 PM

Oh, looks like the problem is the commas. There should not be any commas in the command line. BBTools is interpreting "SL118119_S5_L001_R2_001.fastq.gz," as a filename, and it does not know what format "fastq.gz," is so it defaults to sam. I'll change it to default to fastq for that program.

Canadian_philosophy 06-03-2016 01:33 PM

Hmmm... I pasted that from the stderr/out file of the program, so maybe that is how you formatted it there? This is how my command looks:
Code:

java -Djava.library.path=/data/davelab/dr153/softwares/bbmap/jni/ -ea -Xmx56G -cp /data/davelab/dr153/softwares/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 in1=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R1_001.fastq.gz in2=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R2_001.fastq.gz interleaved=True basename=o5%_#.fq ref=/data/davelab/bix/resources/genomes/hg19/ucsc.hg19.fasta,/data/davelab/bix/resources/genomes/mm10/mm10_gatkorder.fa refstats=refstats.txt ambiguous2=toss
The only comma is in between ref files. Does this look okay to you?

Also, thanks for fixing the default input.

Brian Bushnell 06-03-2016 01:40 PM

Hmmm, that command looks fine. Just to clarify, is that the specific command that produced sam output?

Canadian_philosophy 06-06-2016 06:23 AM

Hi Brian,
This command below worked for me. It looks like naming the output files doesn't work (e.g. out1_hg19=, out2_hg19=, out1_mm10=, etc.).
If I use the basename parameter, the fastqs look fine!

java -Djava.library.path=/data/davelab/dr153/softwares/bbmap/jni/ -ea -Xmx56G -cp /data/davelab/dr153/softwares/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 in1=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R1_001.fastq.gz in2=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R2_001.fastq.gz interleaved=True basename=o5%_#.fq ref=/data/davelab/bix/resources/genomes/hg19/ucsc.hg19.fasta,/data/davelab/bix/resources/genomes/mm10/mm10_gatkorder.fa refstats=refstats.txt ambiguous2=toss

moistplus 07-22-2016 05:36 AM

Is it possible to compute the insert size (and make a histogram of the distribution) from a .bam file?

GenoMax 07-22-2016 06:02 AM

Quote:

Originally Posted by moistplus (Post 197067)
Is it possible to compute the insert size (and make a histogram of the distribution) from a .bam file?

I think so (use unsorted bam). If reads are spliced you will get some odd results. @Brian may have a suggestion later on how to handle that.

Make sure samtools is available in your $PATH.

Code:

reformat.sh in=your_file.bam ihist=hist.txt

Brian Bushnell 07-22-2016 08:59 AM

As long as the file is unsorted or name-sorted, that command will work. Spliced reads shouldn't cause a problem, unless the splice junction is in the unsequenced area between the reads, of course.
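
If your file is already coordinate-sorted, you can name-sort it first with samtools and then run reformat.sh on the result; an untested sketch (file names are placeholders):

Code:

samtools sort -n -o namesorted.bam sorted.bam
reformat.sh in=namesorted.bam ihist=ihist.txt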

JVGen 09-07-2016 10:15 AM

Quote:

Originally Posted by Brian Bushnell (Post 197074)
As long as the file is unsorted or name-sorted, that command will work. Spliced reads shouldn't cause a problem, unless the splice junction is in the unsequenced area between the reads, of course.

Hi Brian and Geno, I didn't want to start a new thread, so I figured I'd quote and post to get your attention.

I started using BBDeDupe and normalization in Geneious on merged paired reads. These reads are not yet assembled, but will be assembled de novo after DeDupe/Norm. I'm doing it because I have about 50,000 reads that cover a 10kb PCR amplicon, and I think the depth is causing assembly issues.

I wasn't able to find much information on DeDupe. Does it remove all 100% matching reads that are encompassed in another read, thereby reducing my coverage to 1? If so, is there a way to retain more coverage, say 50-100, instead?

I'm getting a few areas with low coverage, less than 10, and I generally require a minimum coverage of 15 to call a base in the consensus sequence; otherwise I call a gap. I generally have >500-1000 coverage in all places, but occasionally I get some places that have <10 coverage. I don't think this is amplicon bias; rather, I think it is contaminating sequence. I use a core service for the Illumina sequencing, but nearly everyone who submits samples is working on HIV, hence my need for a coverage threshold.

Thoughts on this?

Thanks,
Jake

Brian Bushnell 09-07-2016 10:56 AM

Hi Jake,

By default, Dedupe will remove all but one copy of reads that are fully contained within other reads. This may not be a good idea prior to assembly of merged reads (since many will be contained by others due to the variable insert size); just using normalization alone with a target of 100x may work better. You can set Dedupe to only remove reads that are identical to others, though, by adding the flag "ac=f".

If you suspect your data is contaminated, I suggest you try to decontaminate it prior to normalization/assembly. If you are working on non-HIV and it is contaminated with HIV, that's fairly straightforward; you can use BBSplit or Seal (if you have both references) or just BBMap (if you only have the HIV reference but nothing for your amplicon) to separate the reads by mapping. Alternately, if you are confident that all of the low-coverage stuff is contamination, you can assemble the low-coverage portion of your data using Tadpole, which allows coverage thresholds (e.g. "mincr=1 maxcr=15" will make it assemble only those contigs with coverage between 1x and 15x). Then use the resulting assembly to filter all of the reads via mapping, keeping only the unmapped reads.
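
In concrete terms, those suggestions would look something like this (untested sketches; file names are placeholders):

Code:

dedupe.sh in=merged.fq out=deduped.fq ac=f
bbnorm.sh in=merged.fq out=normalized.fq target=100
tadpole.sh in=merged.fq out=lowcov_contigs.fa mincr=1 maxcr=15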

Incidentally, Tadpole tends to do a good job assembling very-high-coverage short sequences.

JVGen 09-07-2016 11:31 AM

Quote:

Originally Posted by Brian Bushnell (Post 198593)
Hi Jake,

By default, Dedupe will remove all but one copy of reads that are fully contained within other reads. This may not be a good idea prior to assembly of merged reads (since many will be contained by others due to the variable insert size); just using normalization alone with a target of 100x may work better. You can set Dedupe to only remove reads that are identical to others, though, by adding the flag "ac=f".

If you suspect your data is contaminated, I suggest you try to decontaminate it prior to normalization/assembly. If you are working on non-HIV and it is contaminated with HIV, that's fairly straightforward; you can use BBSplit or Seal (if you have both references) or just BBMap (if you only have the HIV reference but nothing for your amplicon) to separate the reads by mapping. Alternately, if you are confident that all of the low-coverage stuff is contamination, you can assemble the low-coverage portion of your data using Tadpole, which allows coverage thresholds (e.g. "mincr=1 maxcr=15" will make it assemble only those contigs with coverage between 1x and 15x). Then use the resulting assembly to filter all of the reads via mapping, keeping only the unmapped reads.

Incidentally, Tadpole tends to do a good job assembling very-high-coverage short sequences.

Thanks Brian,

I took the de novo approach to try to eliminate contaminating sequences. We're all working on HIV, and it's highly mutated: it's hard to remove contaminants by DNA sequence alone. However, since I know the approximate size of my PCR amplicons (it's different in each case because of DNA deletions), I know what size my contig should be after de novo assembly. By using stringent assembly parameters that require ~30 bp overlap with maybe 1 mismatch, I can force contaminants to be assembled into a separate contig. I can then extract consensus sequences from each of these contigs, and filter them so that only contigs greater than ~150 bp are used in a subsequent map-to-reference, which should remove any minor contaminants that were present. I can then extract the consensus from this alignment, which should represent the original PCR amplicon - any contaminants that might have made it into my contig should be lost, as they are outnumbered by the hundreds.

Does this workflow make sense for my application? I'm working on a desktop Mac, so I have limited options. I've been told that SPAdes might be a better assembler for me, but I think I'd need to purchase a server. With the little coding experience I have, I'm a bit nervous to invest the money, lest we never get the thing to work.

Are you available for hire as a consultant?

Thanks,
Jake

Brian Bushnell 09-07-2016 01:03 PM

Quote:

Originally Posted by JVGen (Post 198595)
Thanks Brian,

I took the de novo approach to try to eliminate contaminating sequences. We're all working on HIV, and it's highly mutated: it's hard to remove contaminants by DNA sequence alone. However, since I know the approximate size of my PCR amplicons (it's different in each case because of DNA deletions), I know what size my contig should be after de novo assembly. By using stringent assembly parameters that require ~30 bp overlap with maybe 1 mismatch, I can force contaminants to be assembled into a separate contig. I can then extract consensus sequences from each of these contigs, and filter them so that only contigs greater than ~150 bp are used in a subsequent map-to-reference, which should remove any minor contaminants that were present. I can then extract the consensus from this alignment, which should represent the original PCR amplicon - any contaminants that might have made it into my contig should be lost, as they are outnumbered by the hundreds.

Does this workflow make sense for my application? I'm working on a desktop Mac, so I have limited options. I've been told that SPAdes might be a better assembler for me, but I think I'd need to purchase a server. With the little coding experience I have, I'm a bit nervous to invest the money, lest we never get the thing to work.

Are you available for hire as a consultant?

Thanks,
Jake

Hi Jake,

It might be helpful in this case if you could generate a kmer frequency histogram to see whether the contaminant and non-contaminant sequence is easily separable by depth alone. If so, there are a couple of easy ways to remove it. You can generate a kmer-frequency histogram with kmercountexact or BBNorm; just attach the text file to this thread. Normally I look at it in a log-log plot.
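
For example, either of these should produce the histogram (file names are placeholders):

Code:

kmercountexact.sh in=reads.fq khist=khist.txt
bbnorm.sh in=reads.fq hist=khist.txt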

What assembler are you currently using, by the way? I've had poor results with SPAdes on viruses, and better results with Tadpole. But this was raw viral sequence, and amplicon sequence may give different results.

As for consultancy, I've sent you a pm.

-Brian

moistplus 11-07-2016 05:47 AM

I've run pileup for contig coverage estimation after assembly.

The output is :

ID Avg_fold Length Ref_GC Base_Coverage Read_GC

1) The coverage is the Avg_fold, right?

2) If yes, I have some negative or 0 coverage... How is that possible?

Brian Bushnell 11-07-2016 08:35 AM

Quote:

Originally Posted by moistplus (Post 200764)
I've run pileup for contig coverage estimation after assembly.

The output is :

ID Avg_fold Length Ref_GC Base_Coverage Read_GC

1) The coverage is the Avg_fold, right?

Yes, that's correct.

Quote:

2) If yes, I have some negative or 0 coverage... How is that possible?
Zero coverage is certainly possible, since mapping and assembly results can differ. Negative coverage should be impossible. Can you send me the output file, and post your exact command line?

Thanks,
Brian

moistplus 11-08-2016 01:34 AM

Quote:

Originally Posted by Brian Bushnell (Post 200766)
Yes, that's correct.



Zero coverage is certainly possible, since mapping and assembly results can differ. Negative coverage should be impossible. Can you send me the output file, and post your exact command line?

Thanks,
Brian

1) So I have PE reads from 2 different libraries:
PE 350 (insert size)
PE 550

I merged them with a simple
Code:

cat
so at the end I have F reads (pair1) and R reads (pair2).

2) Mapping with bwa:

Code:

bwa mem contig.fasta pair1 pair2 | samtools view -bS - > pair12.bam
3) Sort the bam file

Code:

samtools sort pair12.bam -o pair12.sorted.bam
4) Pileup

Code:

pileup.sh in=pair12.sorted.bam  out=stats.txt hist=histogram.txt

One more detail: I am using BBMap v32.15.

The stat file:

http://www.filedropper.com/stats

Brian Bushnell 11-08-2016 08:58 AM

Indeed, I see 3 entries with negative coverage, which should not happen. However, v32.15 was released about 2.5 years ago and there have been thousands of changes since then, including hundreds of bug fixes. Could you try the latest version and see if that fixes the problem?

moistplus 11-09-2016 01:46 AM

Quote:

Originally Posted by Brian Bushnell (Post 200828)
Indeed, I see 3 entries with negative coverage, which should not happen. However, v32.15 was released about 2.5 years ago and there have been thousands of changes since then, including hundreds of bug fixes. Could you try the latest version and see if that fixes the problem?

With the new version, I don't have negative coverage anymore!

Another question:
Using this command :

Quote:

reformat.sh in1=reads1.fq in2=reads2.fq out1=sampled1.fq out2=sampled2.fq samplerate=0.1
in1 and in2 are forward reads and reverse reads, right? So it keeps the mated pairs at the end?

GenoMax 11-09-2016 01:56 PM

Quote:

Originally Posted by moistplus (Post 200853)
With the new version, I don't have negative coverage anymore!

Another question:
Using this command :



in1 and in2 are forward reads and reverse reads, right? So it keeps the mated pairs at the end?

It should do that.

moistplus 11-16-2016 01:35 AM

Can you confirm that the read sampling is random?

Code:

reformat.sh in1=reads1.fq in2=reads2.fq out1=sampled1.fq out2=sampled2.fq samplerate=0.1

Brian Bushnell 11-16-2016 09:01 AM

Yes, it's random.

Dario1984 11-29-2016 09:00 PM

But can it output chimerically mapped reads to a separate output file, like STAR can? There's no mention of chimeric reads in the user guide, so I'm not sure if it's even suitable for that case.

Brian Bushnell 11-30-2016 08:14 AM

No, BBMap does not output chimerically-mapped reads to a separate file, though it can be used to separate properly-paired reads from improper (possibly chimeric) pairs.
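
For example, a sketch of that separation using the "po" (pairedonly) flag, which sends reads that are not properly paired to outu (untested; file names are placeholders):

Code:

bbmap.sh in=reads.fq ref=ref.fa po=t outm=proper.sam outu=improper.sam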

GenoMax 11-30-2016 08:25 AM

Quote:

Originally Posted by Brian Bushnell (Post 201525)
No, BBMap does not output chimerically-mapped reads to a separate file, though it can be used to separate properly-paired reads from improper (possibly chimeric) pairs.

Is that functionality on the "feature request" list?

Brian Bushnell 11-30-2016 09:14 AM

Quote:

Originally Posted by GenoMax (Post 201526)
Is that functionality on the "feature request" list?

Oh, very well, I'll add it to the list :)

JVGen 12-05-2016 02:59 PM

Quote:

Originally Posted by Brian Bushnell (Post 201529)
Oh, very well, I'll add it to the list :)

Hi Brian,

I am trying to use BBMap to align 150 bp paired-end reads to a 10 kb reference. The reference is an HIV genome, and my sequencing input is PCR-amplified HIV proviruses (meaning I get lots of coverage).

I use BBDuk to adapter- and quality-trim my reads. I then used BBNorm to normalize coverage to ~150x. Then I used BBMap to map the reads to the reference.

Deletions in HIV proviruses are common, and I noticed that BBMap seems to get hung up near the deletions. For instance, if a read spans the deletion, BBMap doesn't seem to insert a gap into the read so that it aligns on the other side of the deletion. I've attached some pictures: the first is the full alignment, the second the 5' side of the deletion, the third the 3' side. Note that the "CTGAGGGGACAGAT" sequence is present in the reads on the 5' side of the deletion, and should extend to the 3' side. In this case, most of these reads were trimmed so that the consensus would reflect the correct deletion, but I do worry that this might not always be the case.

Are there any settings to adjust this to allow the reads to span the deletion?

Thanks,
Jake

http://i1248.photobucket.com/albums/...pskrcmlojb.png

http://i1248.photobucket.com/albums/...psfaglicy1.png

http://i1248.photobucket.com/albums/...psvthnnqgk.png

GenoMax 12-05-2016 03:57 PM

@JVGen: Are you using default alignment settings for bbmap? You may need to adjust maxindel (which defaults to 16000) and intronlen settings.

JVGen 12-05-2016 05:14 PM

Quote:

Originally Posted by GenoMax (Post 201705)
@JVGen: Are you using default alignment settings for bbmap? You may need to adjust maxindel (which defaults to 16000) and intronlen settings.

Hi GenoMax,

The entirety of my reference is only 9000 bp, so I think the default maxindel size is appropriate. What does intronlen do?

Thanks!
JV

Brian Bushnell 12-06-2016 03:06 AM

The default maxindel should be fine in this case. If there are reads spanning the deletion, they will be mapped spanning the deletion. I'm not familiar with the viewer you are using... perhaps you could try IGV?

"intronlen=10" will, for example, replace "D" (deletion) symbols in cigar strings with "N" (skipped) symbols, for deletions of at least 10bp in length. Some programs and viewers prefer N over D for whatever reason. I consider them equivalent. But, it's possible the viewer you are using does not properly display reads with "D" symbols in the cigar string, so using IGV or remapping with the "intronlen=10" flag might be helpful. Or, if you send me the sam file and reference I can look at it.

Quote:

most of these reads were trimmed so that the consensus would reflect the correct deletion
I'm not sure what you mean by that - can you clarify? Also, can you give the exact command you used for BBMap? If you use the "local" flag, long deletions might get erased.

I've honestly never heard a concern before that BBMap was unwilling to map reads spanning long deletions - only the opposite. In your last picture, it looks to me like all of the reads are mapped with a long deletion extending off the screen to the left; or am I misinterpreting it?

JVGen 12-06-2016 04:13 AM

Quote:

Originally Posted by Brian Bushnell (Post 201709)
The default maxindel should be fine in this case. If there are reads spanning the deletion, they will be mapped spanning the deletion. I'm not familiar with the viewer you are using... perhaps you could try IGV?

Hi Brian, that's Geneious I'm viewing it in. I downloaded and tried using IGV, but I get an error when trying to load the SAM file. I shared the SAM file with you on Google Drive. I included the trimmed, normalized, unassembled reads and the reference as a separate FASTA as well, just in case.

Quote:

Originally Posted by Brian Bushnell (Post 201709)
"intronlen=10" will, for example, replace "D" (deletion) symbols in cigar strings with "N" (skipped) symbols, for deletions of at least 10bp in length. Some programs and viewers prefer N over D for whatever reason. I consider them equivalent. But, it's possible the viewer you are using does not properly display reads with "D" symbols in the cigar string, so using IGV or remapping with the "intronlen=10" flag might be helpful. Or, if you send me the sam file and reference I can look at it.

This could be, but Geneious displays regions of no coverage as gaps, so I don't know what the program is doing behind the scenes (whether it's logging D's or N's). I will try repeating with this flag and see if it changes the outcome.

Quote:

Originally Posted by Brian Bushnell (Post 201709)
I'm not sure what you mean by that - can you clarify? Also, can you give the exact command you used for BBMap? If you use the "local" flag, long deletions might get erased.

I'm running it in Geneious with the following parameters (see picture). I'm going to try ticking "discard trimmed regions", though I think it is unnecessary, because I don't think BBDuk keeps trimmed information (which is how I trimmed the reads). "Dissolve contigs" is redundant - no contigs have yet been assembled; a quirk of the program.
http://i1248.photobucket.com/albums/...pswr40jfxx.png


Quote:

Originally Posted by Brian Bushnell (Post 201709)
I've honestly never heard a concern before that BBMap was unwilling to map reads spanning long deletions - only the opposite. In your last picture, it looks to me like all of the reads are mapped with a long deletion extending off the screen to the left; or am I misinterpreting it?

There is a ~3.5kb deletion on the 3' end of the HIV genome. The HIV reference sequence is depicted in the faded yellow box. Reads assembled to the reference are depicted below as black rectangles (which, when I zoom in, show their sequence). A coverage map is shown above the reference in red. A consensus sequence for the assembled reads is provided above the coverage map. Within the consensus, black represents a mismatch to the reference (this many mismatches is not uncommon for HIV, as it's a retrovirus and reverse transcription introduces many mutations).

Thanks for any help!

JV

Brian Bushnell 12-06-2016 12:21 PM

Quote:

Originally Posted by JVGen (Post 201727)
Hi Brian, that's Geneious I'm viewing it in. I downloaded and tried using IGV, but I get an error when trying to load the SAM file.

IGV needs a sorted, indexed bam file. It won't accept sam.
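
For example (standard samtools commands; file names are placeholders):

Code:

samtools sort -o aligned.sorted.bam aligned.sam
samtools index aligned.sorted.bam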

Quote:

I shared the SAM file with you on google drive. I included the trimmed, normalized, unassembled reads and the reference as a separate FASTA as well, just in case.
Please send me the links, and I'll look at them.

Quote:

I'm running it in Geneious with the following parameters (see picture). I'm going to try ticking "discard trimmed regions", though I think it is unnecessary, because I don't think BBDuk keeps trimmed information (which is how I trimmed the reads). "Dissolve contigs" is redundant - no contigs have yet been assembled; a quirk of the program.
Hmmm, I'm not really sure what Geneious is doing behind the scenes here with regard to trimming, but it doesn't look like it would have any kind of effect that would suppress long deletions.

Dinab 04-04-2017 05:58 AM

Quality distribution graph
 
A bit late, but I hope you can help me:
I want to create an average quality distribution graph of multiple sequences from a fastq file. Is it possible with reformat?
Also, if I use reformat.sh with maq=XX, does it simply calculate the average quality based on base quality and check the error rate (2% max)?

Thank you!

Brian Bushnell 04-04-2017 04:09 PM

Quote:

Originally Posted by Dinab (Post 205922)
A bit late, but I hope you can help me:
I want to create an average quality distribution graph of multiple sequences from a fastq file. Is it possible with reformat?

Yes, the histogram section of the help covers that. Depending on how you want to plot it, you can use qhist, qchist, aqhist, or bqhist. The most generic is aqhist, which plots the number of reads per average quality score.
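
For example (a minimal sketch; file names are placeholders):

Code:

reformat.sh in=reads.fq aqhist=aqhist.txt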

Quote:

Also, if I use reformat.sh with maq=XX, does it simply calculate the average quality based on base quality and check the error rate (2% max)?
"maq" transforms the qualities into probability space, calculates the average error rate, and then transforms that back into the phred scale. So, a 2% maximum error rate would be approximately maq=17.

Dinab 04-09-2017 05:11 AM

Perfect, Thank you!

Dinab 04-09-2017 11:26 PM

Another question
 
Hi Brian,
Your previous response worked perfectly, thank you again.
I now have additional complexity to add: is it possible to get a plot of the mean quality of each sequence as a function of its length?

Brian Bushnell 04-10-2017 08:51 AM

Quote:

Originally Posted by Dinab (Post 206103)
I now have additional complexity to add: is it possible to get a plot of the mean quality of each sequence as a function of its length?

No, I've never implemented something like that... the closest you could get with the existing package is to use a loop for each length with something like...

Code:

reformat.sh in=reads.fq minlen=147 maxlen=147 out=stdout.fq int=f | reformat.sh in=stdin.fq aqhist=aqhist_147.txt
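
Wrapped in a shell loop over every length of interest, that might look like this (untested sketch; adjust the length range and file names):

Code:

for LEN in $(seq 1 150); do
    reformat.sh in=reads.fq minlen=$LEN maxlen=$LEN out=stdout.fq int=f | \
        reformat.sh in=stdin.fq aqhist=aqhist_$LEN.txt
done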

GenoMax 04-10-2017 04:33 PM

Quote:

Originally Posted by Dinab (Post 206103)
is it possible to get a plot of the mean quality of each sequence as a function of its length?

FastQC gives you an average per sequence quality score.

blsfoxfox 05-07-2017 10:25 PM

fastq format error
 
Hi Brian,

I found there is some format error in my fastq. By running reformat.sh in=in.fastq out=out.fastq, I get the following error:

Input is being processed as unpaired
java.lang.AssertionError:
Error in in.fastq, line 4241767, with these 4 lines:
>m150430_235943_42146_c100804572550000001823173810081565_s1_p0/108988/0_3288 RQ=0.868
CTCTTTCCGGTAACACGCGCCTAGATACTAACAACCTACCTTATACTAATCAGCATGCCTAATACCAACA
ACCTACTTTATACCTAATAACAATGCCTTGATAACTAACAACATACCTTAATACCTAATACACGCTTAAT
ACCTAACAACAATACCTTAATACCTAACAACATACTTTAATACCTAACAACCTACCTTAATAACTTAATA

at stream.FASTQ.quadToRead(FASTQ.java:779)
at stream.FASTQ.toReadList(FASTQ.java:710)
at stream.FastqReadInputStream.fillBuffer(FastqReadInputStream.java:110)
at stream.FastqReadInputStream.nextList(FastqReadInputStream.java:95)
at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
Input: 1060400 reads 9378503777 bases
Output: 1060400 reads (100.00%) 9378503777 bases (100.00%)

Time: 75.193 seconds.
Reads Processed: 1060k 14.10k reads/sec
Bases Processed: 9378m 124.73m bases/sec
Exception in thread "main" java.lang.RuntimeException: ReformatReads terminated in an error state; the output may be corrupt.
at jgi.ReformatReads.process(ReformatReads.java:1130)
at jgi.ReformatReads.main(ReformatReads.java:45)

Could you please provide any suggestions on which steps I should follow to find the exact error in that read?

Thanks!

Brian Bushnell 05-07-2017 10:33 PM

The problem here is that you are using a fasta sequence named "in.fastq". BBTools is sensitive to filenames, and *.fastq will be processed as a fastq file. If the file has no extension it will usually look at the contents to try to figure out what it is, but when there is a known extension, it assumes it is correct.

So, just rename the input file to "in.fasta" or add the flag "extin=.fasta" to override the filename. Although for most uses of PacBio data I do recommend that you go back and get the original fastq file and use that, because the quality scores are often useful.
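
For example:

Code:

reformat.sh in=in.fastq extin=.fasta out=out.fasta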

blsfoxfox 05-07-2017 10:42 PM

Hi Brian,

Thanks for your quick response!

Actually, I am doing that on purpose. I found there is a format error in my fastq file while using other software, but I don't know where it is. By using reformat.sh with in=in.fastq out=out.fastq, BBMap will report where that error is. In this case, the read >m150430_235943_42146_c100804572550000001823173810081565_s1_p0/108988/0_3288 RQ=0.868 causes a corruption of the output. However, I still don't know what's wrong, and the only solution I can think of now is using the filterbyname tool to exclude that read from my fastq.
Thus, I would really appreciate it if you could provide any suggestions on how I could identify and fix that format error.

Thanks again for your time,

blsfoxfox 05-08-2017 06:39 AM

Quote:

Originally Posted by Brian Bushnell (Post 207119)
The problem here is that you are using a fasta sequence named "in.fastq". BBTools is sensitive to filenames, and *.fastq will be processed as a fastq file. If the file has no extension it will usually look at the contents to try to figure out what it is, but when there is a known extension, it assumes it is correct.

So, just rename the input file to "in.fasta" or add the flag "extin=.fasta" to override the filename. Although for most uses of PacBio data I do recommend that you go back and get the original fastq file and use that, because the quality scores are often useful.

Hi Brian,

Sorry, I should have double-checked my input.fastq. After grepping several lines after that read, I found some fasta sequences in that fastq file. As they are raw sequences, I don't have a 'clean' backup. So my problem now is to find a tool which could extract the fasta sequences from the fastq.

Thanks,

GenoMax 05-08-2017 06:47 AM

Code:

reformat.sh in=seq.fq out=seq.fa
will convert fastq sequences to fasta. If your problem is malformed fastq records then you need to find and delete those records manually.
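
If the file really is a mix of intact 4-line fastq records and fasta records, an untested awk sketch along these lines might separate them (it assumes every record boundary is a line starting with '@' for fastq or '>' for fasta):

Code:

awk '{
    if ($0 ~ /^@/) {                   # fastq header: copy this record (4 lines)
        infa = 0
        print > "part.fastq"
        for (i = 0; i < 3; i++) { getline; print > "part.fastq" }
    } else if ($0 ~ /^>/) {            # fasta header: start a fasta block
        infa = 1
        print > "part.fasta"
    } else if (infa) {                 # fasta sequence line
        print > "part.fasta"
    } else {
        print "Unexpected line " NR > "/dev/stderr"
    }
}' mixed.fastq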

blsfoxfox 05-08-2017 06:58 AM

Quote:

Originally Posted by GenoMax (Post 207134)
Code:

reformat.sh in=seq.fq out=seq.fa
will convert fastq sequences to fasta. If your problem is malformed fastq records then you need to find and delete those records manually.

Hi,

Thanks for your response!

My problem is that my file is a mix of fasta and fastq, and I don't know how to identify those fasta sequences and extract them.

GenoMax 05-08-2017 07:02 AM

How did that happen?

Unless you have a defined reason I would suggest not trusting that file.

skatrinli 05-30-2017 01:29 AM

Hey Brian,

Thanks for this great tool, but I could not get it working properly. I installed the latest version, 37.25, but when I try to test the installation with the command
$ (installation directory)/stats.sh in=(installation directory)/resources/phix174_ill.ref.fa.gz
I get this error:
Exception in thread "main" java.lang.RuntimeException: Unknown parameter Downloads/bbmap/resources/phix174_ill.ref.fa.gz
at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
at jgi.AssemblyStats2.main(AssemblyStats2.java:39)

What did I do wrong?

GenoMax 05-30-2017 04:26 AM

There is no "installation" needed for BBMap. Just uncompress and run (as long as you have Java available for your OS). Are you using Java 1.7 or 1.8?
If I recall correctly, @Brian no longer tests against Java 1.6 (which may be what you are using).

I get the following when I run stats.sh.

Code:

$ stats.sh in=/path_to/bbmap/resources/phix174_ill.ref.fa.gz
A      C      G      T      N      IUPAC  Other  GC      GC_stdev
0.2399  0.2144  0.2326  0.3130  0.0000  0.0000  0.0000  0.4471  0.0000

Main genome scaffold total:            1
Main genome contig total:              1
Main genome scaffold sequence total:    0.005 MB
Main genome contig sequence total:      0.005 MB        0.000% gap
Main genome scaffold N/L50:            1/5.386 KB
Main genome contig N/L50:              1/5.386 KB
Main genome scaffold N/L90:            1/5.386 KB
Main genome contig N/L90:              1/5.386 KB
Max scaffold length:                    5.386 KB
Max contig length:                      5.386 KB
Number of scaffolds > 50 KB:            0
% main genome in scaffolds > 50 KB:    0.00%


Minimum        Number          Number          Total          Total          Scaffold
Scaffold        of              of              Scaffold        Contig          Contig 
Length          Scaffolds      Contigs        Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                      1              1          5,386          5,386  100.00%
  5 KB                      1              1          5,386          5,386  100.00%


Brian Bushnell 05-30-2017 09:02 AM

Quote:

Originally Posted by skatrinli (Post 207878)
Hey Brian,

Thanks for this great tool, but I could not get it working properly. I installed the latest version, 37.25, but when I try to test the installation with the command
$ (installation directory)/stats.sh in=(installation directory)/resources/phix174_ill.ref.fa.gz
I get this error:
Exception in thread "main" java.lang.RuntimeException: Unknown parameter Downloads/bbmap/resources/phix174_ill.ref.fa.gz
at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
at jgi.AssemblyStats2.main(AssemblyStats2.java:39)

What did I do wrong?

You can't have spaces in the filenames without specific countermeasures like quotes. For example:

Code:

stats.sh in=foo bar.fa
Exception in thread "main" java.lang.RuntimeException: Unknown parameter bar.fa
        at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
        at jgi.AssemblyStats2.main(AssemblyStats2.java:39)

That doesn't work...
Code:

stats.sh in="foo bar.fa"
Exception in thread "main" java.lang.RuntimeException: Unknown parameter bar.fa
        at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
        at jgi.AssemblyStats2.main(AssemblyStats2.java:39)

That doesn't work either.

Code:

stats.sh in="foo\ bar.fa"
A      C      G      T      N      IUPAC  Other  GC      GC_stdev
NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    0.0000

Main genome scaffold total:            0
Main genome contig total:              0
Main genome scaffold sequence total:    0.000 MB
Main genome contig sequence total:      0.000 MB        NaN% gap
Main genome scaffold N/L50:            0/0
Main genome contig N/L50:              0/0
Main genome scaffold N/L90:            0/0
Main genome contig N/L90:              0/0
Max scaffold length:                    0
Max contig length:                      0
Number of scaffolds > 50 KB:            0
% main genome in scaffolds > 50 KB:    0.00%


Minimum        Number          Number          Total          Total          Scaffold
Scaffold        of              of              Scaffold        Contig          Contig
Length          Scaffolds      Contigs        Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------

That does work (though I ran it on an empty file).

The exact way to deal with spaces is system-specific. In the normal Windows shell you can just use quotes; in Linux bash it looks like you need quotes and an escape character (backslash); in Windows under a Linux emulator I'm not entirely sure. The easiest thing to do is to put files in a path that does not have any spaces (so, not in My Documents, but in C:\data\ or something like that.)

TomHarrop 05-31-2017 06:12 PM

Hi Brian,

I'm using reformat.sh to play around with some Nanopore reads. Is there any way to get the histograms (e.g. mhist, qhist, bhist) to track longer reads, like the `max` parameter in readlength.sh?

Thanks,

Tom

Brian Bushnell 06-01-2017 04:13 PM

Quote:

Originally Posted by TomHarrop (Post 207984)
Hi Brian,

I'm using reformat.sh to play around with some Nanopore reads. Is there any way to get the histograms (e.g. mhist, qhist, bhist) to track longer reads, like the `max` parameter in readlength.sh?

Thanks,

Tom

Sorry, the max lengths are currently constants. I'll add support for changing them. Generally I didn't find them all that useful for variable-length reads since I kind of designed them to find position-related anomalies, but it's fairly easy to change.

jazz710 08-12-2017 12:29 AM

Re: reformat.sh

Is there now, or could there be in the future, a way to specify multiple sets of paired read inputs (i.e. different libraries) which could be randomly sampled and output to a single FASTQ file?

E.g. in=FASTQ_A_R1,FASTQ_B_R1,FASTQ_C_R1 in2=FASTQ_A_R2,FASTQ_B_R2,FASTQ_C_R2 out=Subset_R1.fastq out2=Subset_R2.fastq

I can do it via a prior cat command (A+B+C -> reformat), but with large files cat can be an I/O issue.

Best and Thanks,
Bob

SNPsaurus 08-12-2017 07:47 AM

Quote:

Originally Posted by jazz710 (Post 210090)
Re: reformat.sh

Is there now, or could there be in the future, a way to specify multiple sets of paired read inputs (i.e. different libraries) which could be randomly sampled and output to a single FASTQ file?

E.g. in=FASTQ_A_R1,FASTQ_B_R1,FASTQ_C_R1 in2=FASTQ_A_R2,FASTQ_B_R2,FASTQ_C_R2 out=Subset_R1.fastq out2=Subset_R2.fastq

I can do it via a prior cat command (A+B+C -> reformat), but with large files cat can be an I/O issue.

Best and Thanks,
Bob

I take a subset from each file and then cat the subsets together. That would be faster than combining the inputs with cat, at least. If the subset size is a significant fraction of most of the files it would still be slow, but if you are just collecting a small fraction of reads it is fast.

Brian Bushnell 08-14-2017 09:58 AM

Quote:

Originally Posted by SNPsaurus (Post 210112)
I take a subset from each file and then cat the subsets together. That would be faster than combining the inputs with cat, at least.

True, this works well. Alternatively, for interleaved files, you can do this:

Code:

cat a.fq b.fq c.fq | reformat.sh in=stdin.fq out=sampled.fq interleaved samplerate=0.1
...which avoids writing temp files. Won't work for twin paired files, though, unless you do something tricky with named pipes.
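
For twin files, an untested named-pipe sketch would be something like:

Code:

mkfifo r1.fq r2.fq
cat a_R1.fq b_R1.fq > r1.fq &
cat a_R2.fq b_R2.fq > r2.fq &
reformat.sh in1=r1.fq in2=r2.fq out1=sub_R1.fq out2=sub_R2.fq samplerate=0.1
rm r1.fq r2.fq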

To avoid wasting disk space and bandwidth, I normally keep all fastq files gzipped at all times. Using pigz for parallel compression, or compression level 2 (zl=2 flag), will eliminate much of the speed penalty from dealing with compressed files; and if you are I/O limited, compressed files tend to speed things up.

I don't have any plans at present to add multiple input file support (or wildcard support) to Reformat, but I'll put it on my list and consider it. It's something I've occasionally wanted also.

TomHarrop 09-11-2017 02:21 PM

Can BBmap remove reads containing homopolymers?
 
Hi BBmap-ers,

Say I want to remove reads with more than 5 consecutive identical nucleotides. Is there a homopolymer/polyX filtering option in BBDuk or reformat.sh?

Thanks!

Tom

Brian Bushnell 09-11-2017 02:24 PM

There is no homopolymer filter, per se, but you can accomplish that like this:

Quote:

bbduk.sh in=reads.fq out=clean.fq literal=AAAAAA,CCCCCC,GGGGGG,TTTTTT k=6 mm=f

TomHarrop 09-11-2017 02:35 PM

Great, thanks!

gokhulkrishnakilaru 09-23-2017 01:56 PM

Hi Brian-

1. We are currently using bedtools BAM-to-BED and then extending reads for ChIP visualization. Does the BBMap suite have an option to extend reads in a BAM based on the fragment length estimate from MACS, or automatically from the BAM?

2. Is there an option/tool to remove ChIP duplicates based on the read ID in the BAM instead of the coordinates?

Brian Bushnell 09-23-2017 07:02 PM

1) No, it does not. Tadpole can extend reads based on kmer counts, but it does not make use of mapping information.

2) Clumpify can remove optical duplicates based on a combination of sequence (which indicates whether they are duplicates) and position from the read ID (which indicates whether they are very close). However, it cannot handle bam files, only fasta and fastq.
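
For fastq input, that would look something like this (untested; the duplicate-distance threshold is platform-dependent):

Code:

clumpify.sh in=reads.fq out=deduped.fq dedupe optical dupedist=40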

-Brian

dho 09-26-2017 05:26 PM

Comprehensive reporting of mapped reads
 
Hi Brian,

I am trying to map 100bp Illumina sequences to a collection of very similar, 80bp reference sequences (multiple alleles of a single exon) using bbmap.

When mapping to a single reference sequence from the collection in isolation using settings:

'ambiguous=all',
'maxsites=1000000',
'vslow',
'subfilter=0'

An expected number of sequences (in this case, ~270) map and fully cover the reference sequence.

When using the same parameters and mapping to a collection of sequences containing the same reference sequence, only ~120 reads map.

Are there other parameters I need to set to map my reads exhaustively against all reference sequences?

Thanks,

dave

Brian Bushnell 09-27-2017 01:30 PM

Hi Dave,

BBMap has some heuristics that may make it non-ideal for the situation with a large number of near-identical sequences, particularly when the reads don't map glocally well to any of them (because the reads are longer than the reference sequences) and the reference sequences are tiny. You might also try the flag "excludefraction=0" to see if this changes anything. To clarify, there are perfect alignments (with zero mismatches) that are getting missed, correct?

Many of the heuristics related to ignoring extremely common, uninformative reference kmers are disabled in bbmapskimmer.sh, which is designed specifically for a high degree of multimapping. The syntax is the same as BBMap, so please give that a try and let me know if it works better. You'll need to additionally add the flag "minscaf=1" or really short scaffolds get discarded, so something like:

Code:

bbmapskimmer.sh in=reads.fq ref=ref.fa maxsites=1000000 vslow ambig=all maxindel=0 subfilter=0 excludefraction=0 out=mapped.sam minscaf=1
Please let me know whether that changes the situation.

-Brian

dho 09-27-2017 05:37 PM

Hi Brian,

No, unfortunately it didn't help. I think you understand the question correctly:

When I map 100bp reads against a single allele of a gene with four exons using:

Code:

maxsites=1000000 vslow ambig=all maxindel=0 subfilter=0 excludefraction=0 out=mapped.sam minscaf=1 covstats=stdout | grep 'IPD0001613'
I get 100% read support for all four exons.

Code:

Mafa-DPA1*07:02|IPD0001613_2_MHC-II-DPA        288.0902        244        0.5000        100.0000        244        421        388        0.4788        297        35.43
Mafa-DPA1*07:02|IPD0001613_3_MHC-II-DPA        235.8719        281        0.5658        100.0000        281        408        343        0.5554        247        40.92
Mafa-DPA1*07:02|IPD0001613_4_MHC-II-DPA        218.1935        155        0.6323        100.0000        155        227        220        0.5913        228        27.77
Mafa-DPA1*07:02|IPD0001613_1_MHC-II-DPA        199.3457        81        0.5679        100.0000        81        147        111        0.5713        200        35.48

When I use the same parameters but include the allele in a larger reference sequence that contains many other alleles, the number of mapped reads decreases for all four exons, and exon 1 is no longer fully covered:

Code:

Mafa-DPA1*07:02|IPD0001613_2_MHC-II-DPA        242.6475        244        0.5000        100.0000        244        343        327        0.4746        233        43.23
Mafa-DPA1*07:02|IPD0001613_3_MHC-II-DPA        58.2847        281        0.5658        100.0000        281        81        93        0.5509        63        19.81
Mafa-DPA1*07:02|IPD0001613_4_MHC-II-DPA        194.1226        155        0.6323        100.0000        155        201        181        0.5935        204        33.17
Mafa-DPA1*07:02|IPD0001613_1_MHC-II-DPA        38.3333        81        0.5679        97.5309        79        34        17        0.6038        51        17.89

Any other thoughts?

Thanks,

dave

gokhulkrishnakilaru 09-28-2017 07:42 AM

Quote:

Originally Posted by Brian Bushnell (Post 211221)
1) No, it does not. Tadpole can extend reads based on kmer counts, but it does not make use of mapping information.

2) Clumpify can remove optical duplicates based on a combination of sequence (which indicates whether they are duplicates) and position from the read ID (which indicates whether they are very close). However, it cannot handle bam files, only fasta and fastq.

-Brian

Thanks Brian.

How about BED?

dho 10-01-2017 10:36 AM

Hi Brian,

I figured out the problem and a workaround. The problem is that, even with your recommended settings, reads that extended beyond the ends of both short and long reference sequences would only be mapped to the longer reference sequences.

The workaround is to avoid having longer and shorter reference sequences as mapping targets at the same time. I subdivided my reference sequences into multiple reference files that each contain sequences of the same size, and then mapped against each of these individually. I can then merge the output from all of these mappings.

Thanks for your help this week and I hope my solution helps others who encounter the same issue!

dave

Brian Bushnell 10-03-2017 02:41 PM

Quote:

Originally Posted by dho (Post 211425)
Hi Brian,

I figured out the problem and a workaround. The problem is that, even with your recommended settings, reads that extended beyond the ends of both short and long reference sequences would only be mapped to the longer reference sequences.

The workaround is to avoid having longer and shorter reference sequences as mapping targets at the same time. I subdivided my reference sequences into multiple reference files that each contain sequences of the same size, and then mapped against each of these individually. I can then merge the output from all of these mappings.

Thanks for your help this week and I hope my solution helps others who encounter the same issue!

dave

Hi Dave,

Thanks for the followup. I apologize for not getting back to you in a timely fashion; I'm pretty swamped currently!

dho 10-04-2017 04:33 PM

Hi Brian,

No problem! Thanks for all that you do.

For posterity, the solution I proposed above worked for the most part, but I encountered a few cases where it didn't. Since then, I've decided to be fully exhaustive and map against each reference sequence individually. This is computationally intensive but gives the most robust results. Even with these settings, though, mapping reads to very small reference sequences (<50bp) doesn't seem to work as consistently as I'd like, and I haven't figured out the source of this inconsistency.

--dave

gokhulkrishnakilaru 10-10-2017 08:21 AM

Adding Unique Identifier To Paired End Reads. Editing FASTQ Read ID based on randomer
 
Hi Brian-

We have a paired end rnaseq data.

For every sequence in read2, we want to extract the first 6 nucleotides and append them to the read ID in read2, and also to the corresponding read ID in read1.

Please see example below.

Code:

cat read1
@NB501293:231:HV3CTBGX2:1:11101:2280:1047 1:N:0:CAGATC
CCGCANGTTGCAGAGCGGGTGGGAGCCNCTNCGGGCGCGGCACTGNAGCCCTGANACTGAACCCCGAACCCGAGCC
+
AAAAA#EEEEEEE/EEEEEEEEE6E//#<E#/EEE/E/EAEAEEE#/EA/AEEE#/EAAAEE/EEEAEEE/EE//6
@NB501293:231:HV3CTBGX2:1:11101:9866:1047 1:N:0:CAGATC
CTCAANGGGAGAGACCTTAGATGATACNCANGATGACAGTAGGTANAGGGAACTTATAGAGCCACCTCCATCAGGA
+
AAAAA#EEEEEEEEEAEEEEEEEEEEE#EE#EEEEEEEEEEEEEE#EEEEEEEEEEEEEEEEEAEEEEEEEEEEEE
@NB501293:231:HV3CTBGX2:1:11101:24301:1048 1:N:0:CAGATC
ACGGANCTCTGGCTGTTGTATGGAAAGNTANGCTGTAACACGCACNGACAGAAGAGAGCCATTTTCTCCCTGAACT
+
AA/AA#EEEEEAEAEAEEEEE/EEEEE#E<#6EEAEEAAAE///E#EEAEE//A/</EE</EE//E6E6EEEEEE6
@NB501293:231:HV3CTBGX2:1:11101:16754:1048 1:N:0:CAGATC
CCTGGNAGCCGCCGCAAGCGCCGGACCNCANGCACTCCCAGGCGCGCGCGCTTCTTCTGCAAAAAGTTGAGGGCTC
+
AAAAA#EEAEEEEEE6EEEEEEEEAE/#EE#EE/EE//E/EEAEEEEEEEEAEEEEEEEE/EEEEEEEEEEEEE/E


cat read2
@NB501293:231:HV3CTBGX2:1:11101:2280:1047 2:N:0:CAGATC
GCTGGGCGAGTAGCTTCTGGATCCTGGCCTCCTGAGCCTGTGGCCCGGGCTAGGCTCGGGGCTCGGGTTCGGGGTT
+
AAAAAEEEEEEE/AEEAEEEEEEEE/AEEEAAAE/EEEEEAEAEEEE6EEEAEEEEEEEEEEE/EAEAAE6EEE<A
@NB501293:231:HV3CTBGX2:1:11101:9866:1047 2:N:0:CAGATC
TGACTGTGGGGTGGCAACCCCATTCCTCACTTGATGTCCTGTCTTCCTGATGGAGGTGGCTCTATAAGTTCCCTCT
+
AAAAAEEEEEEEEEE/AEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE
@NB501293:231:HV3CTBGX2:1:11101:24301:1048 2:N:0:CAGATC
TCGCATCATCCTACACAGCACGGACGCTAGATGACAGGACGTGCCATGACAGTCTAAGTTCAGGGAGAAAATGGCT
+
/AAAAEE//EEEEEEEEEEAEAE/EEEEEE/EA/EEA/AEEEEEEE/E/EEA/EEEEE/EEEEEEEEAEEEEA/EE
@NB501293:231:HV3CTBGX2:1:11101:16754:1048 2:N:0:CAGATC
TGACCAGCCATTGGCTGGTGGGAGTAGTGATGTCACCCATATGACACCCTGATAACGAGTTGAGAGAGAGCCCTCA
+
AA/AAEEEEEAEEEEEEEEEAEEEEEEEEAEEEEEA/EEEEE</</EAEEEEE<EEEAEEEEEEEEAA/EE<6/EE





Desired Output


Code:

cat read1
@GCTGGG_NB501293:231:HV3CTBGX2:1:11101:2280:1047 1:N:0:CAGATC
CCGCANGTTGCAGAGCGGGTGGGAGCCNCTNCGGGCGCGGCACTGNAGCCCTGANACTGAACCCCGAACCCGAGCC
+
AAAAA#EEEEEEE/EEEEEEEEE6E//#<E#/EEE/E/EAEAEEE#/EA/AEEE#/EAAAEE/EEEAEEE/EE//6
@TGACTG_NB501293:231:HV3CTBGX2:1:11101:9866:1047 1:N:0:CAGATC
CTCAANGGGAGAGACCTTAGATGATACNCANGATGACAGTAGGTANAGGGAACTTATAGAGCCACCTCCATCAGGA
+
AAAAA#EEEEEEEEEAEEEEEEEEEEE#EE#EEEEEEEEEEEEEE#EEEEEEEEEEEEEEEEEAEEEEEEEEEEEE
@TCGCAT_NB501293:231:HV3CTBGX2:1:11101:24301:1048 1:N:0:CAGATC
ACGGANCTCTGGCTGTTGTATGGAAAGNTANGCTGTAACACGCACNGACAGAAGAGAGCCATTTTCTCCCTGAACT
+
AA/AA#EEEEEAEAEAEEEEE/EEEEE#E<#6EEAEEAAAE///E#EEAEE//A/</EE</EE//E6E6EEEEEE6
@TGACCA_NB501293:231:HV3CTBGX2:1:11101:16754:1048 1:N:0:CAGATC
CCTGGNAGCCGCCGCAAGCGCCGGACCNCANGCACTCCCAGGCGCGCGCGCTTCTTCTGCAAAAAGTTGAGGGCTC
+
AAAAA#EEAEEEEEE6EEEEEEEEAE/#EE#EE/EE//E/EEAEEEEEEEEAEEEEEEEE/EEEEEEEEEEEEE/E


cat read2
@GCTGGG_NB501293:231:HV3CTBGX2:1:11101:2280:1047 2:N:0:CAGATC
GCTGGGCGAGTAGCTTCTGGATCCTGGCCTCCTGAGCCTGTGGCCCGGGCTAGGCTCGGGGCTCGGGTTCGGGGTT
+
AAAAAEEEEEEE/AEEAEEEEEEEE/AEEEAAAE/EEEEEAEAEEEE6EEEAEEEEEEEEEEE/EAEAAE6EEE<A
@TGACTG_NB501293:231:HV3CTBGX2:1:11101:9866:1047 2:N:0:CAGATC
TGACTGTGGGGTGGCAACCCCATTCCTCACTTGATGTCCTGTCTTCCTGATGGAGGTGGCTCTATAAGTTCCCTCT
+
AAAAAEEEEEEEEEE/AEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE
@TCGCAT_NB501293:231:HV3CTBGX2:1:11101:24301:1048 2:N:0:CAGATC
TCGCATCATCCTACACAGCACGGACGCTAGATGACAGGACGTGCCATGACAGTCTAAGTTCAGGGAGAAAATGGCT
+
/AAAAEE//EEEEEEEEEEAEAE/EEEEEE/EA/EEA/AEEEEEEE/E/EEA/EEEEE/EEEEEEEEAEEEEA/EE
@TGACCA_NB501293:231:HV3CTBGX2:1:11101:16754:1048 2:N:0:CAGATC
TGACCAGCCATTGGCTGGTGGGAGTAGTGATGTCACCCATATGACACCCTGATAACGAGTTGAGAGAGAGCCCTCA
+
AA/AAEEEEEAEEEEEEEEEAEEEEEEEEAEEEEEA/EEEEE</</EAEEEEE<EEEAEEEEEEEEAA/EE<6/EE

Do you know if the BB suite can achieve this? I tried the FASTX-Toolkit, but it prints the full sequence instead of a partial one.
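
For reference, this can be sketched outside BBTools with paste/awk, assuming plain uncompressed 4-line FASTQ and illustrative output names:

Code:

paste -d '\t' read1 read2 | awk -F '\t' '
    NR % 4 == 1 { h1 = $1; h2 = $2; next }            # buffer both headers
    NR % 4 == 2 { tag = substr($2, 1, 6)              # first 6 nt of read 2
                  sub(/^@/, "@" tag "_", h1)
                  sub(/^@/, "@" tag "_", h2)
                  print h1 > "read1.tagged"; print $1 > "read1.tagged"
                  print h2 > "read2.tagged"; print $2 > "read2.tagged"
                  next }
    { print $1 > "read1.tagged"; print $2 > "read2.tagged" }'   # "+" and quality lines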

santiagorevale 10-23-2017 05:43 AM

Hi Brian,

I'm using "filterbyname.sh" script from bbmap v37.60 (using Java 1.8.0_102) to extract some reads from a FastQ file given a list of IDs.

The current FastQ file has 196 Mi reads and I want to keep 85 Mi. Uncompressed FastQ file size is 14G while compressed is only 1.4G. IDs file is 3.1G.

When running the script using 24G of RAM it dies with OutOfMemoryError. Isn't it an excessive use of memory for just filtering a FastQ file? Also, among the script arguments the is no "threads" option, however the script is using all available cores. Any way of limiting both memory as well as threads usage?

Here is the error:

java -ea -Xmx24G -cp /software/bbmap-37.60/current/ driver.FilterReadsByName -Xmx24G include=t in=Sample1.I1.fastq.gz out=filtered.Sample1.I1.fastq.gz names=reads.ids
Executing driver.FilterReadsByName [-Xmx24G, include=t, in=Sample1.I1.fastq.gz, out=filtered.Sample1.I1.fastq.gz, names=reads.ids]

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.String.toLowerCase(String.java:2647)
at java.lang.String.toLowerCase(String.java:2670)
at driver.FilterReadsByName.<init>(FilterReadsByName.java:145)
at driver.FilterReadsByName.main(FilterReadsByName.java:40)

Thank you very much in advance.

Best regards,
Santiago
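
One hedged workaround, since the in-memory name set is what fills the heap: split the ID list and filter in several passes (the chunk size is illustrative; concatenated gzip streams are themselves valid gzip):

Code:

split -l 10000000 reads.ids ids.chunk.
for chunk in ids.chunk.*; do
    filterbyname.sh -Xmx8g include=t in=Sample1.I1.fastq.gz \
        out="filtered.${chunk}.fastq.gz" names="$chunk"
done
cat filtered.ids.chunk.*.fastq.gz > filtered.Sample1.I1.fastq.gz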

bio_d 10-23-2017 11:36 AM

problem using bbmap
 
Hi Brian,

I am attempting a de novo assembly of a reptilian genome (comparable in size to the human genome) and have the following Illumina libraries: paired-end, mate-pair, and long mate-pair, with estimated insert sizes of 200 bp, 5.2 kb, and 10 kb respectively.

After quality and adapter trimming (with Trim Galore, QC'd in FastQC), I used bbmap.sh to align the reads to a reference (an assembly of contigs from CLC Workbench). However, I am unable to generate a BAM file (using samtools) from the SAM file produced by bbmap. I used the "rcomp=t" and "rcs=f" flags for both the mate-pair and long mate-pair libraries. I had earlier tried with the "rcomp=t" flag only, but that didn't help either.

Please help.

Best,
D

Submit script snippet:

Code:

srun ./bbmap/bbmap.sh rcomp=t rcs=f in1=mate_pair_1.fq.gz in2=mate_pair_2.fq.gz outu=mate_pair_u.sam outm=mate_pair_m.sam
srun ./samtools-1.6/samtools view -b -o mate_pair_m.bam mate_pair_m.sam
srun ./samtools-1.6/samtools view -b -o mate_pair_u.bam mate_pair_u.sam

Error snippet:

Code:

[E::sam_parse1] SEQ and QUAL are of different length
[W::sam_read1] Parse error at line 2051364
[main_samview] truncated file.
[E::sam_parse1] SEQ and QUAL are of different length
[W::sam_read1] Parse error at line 2051364
[main_samview] truncated file.
srun: error: cluster01: task 0: Exited with exit code 1
srun: error: cluster01: task 1: Exited with exit code 1

GenoMax 10-23-2017 01:03 PM

Can you check what the lines in question look like? Use "zcat mate_pair_1.fq.gz | sed -n '2051364,2051366p'" to extract those lines.

bio_d 10-23-2017 01:17 PM

Hi,

I got this with the command you suggested.

BBBCCGC>DE1>FC1FCCFFGEEEBB1FGGDDCD111CF@EG1FBFFFDC>C><F>CC1:FF:11119:1111B1111E@>GC>G1=:11:=<CFD<1F:0=:F@0BFCG>>0FFB>F000;00F
@HWI-D00294:282:CAB4VANXX:7:1101:13889:60146 1:N:0:GCCAAT
CCAATGGGGAATGGCGAAAGCACTGCTCAGCATTTCTGGCTCTGCCTGAGGCTGGAATGCAGAAAACCCTGCAGTAGAGGGGGATCTTCTCTTTGGGGTGCTCCTCGTGCCTCCCCCTTACTGC

Best,
D

GenoMax 10-23-2017 03:19 PM

Compare the sequence and quality lengths of the record that the first line belongs to; let's try "zcat mate_pair_1.fq.gz | sed -n '2051361,2051364p'" instead.

If the lengths of the sequence and quality lines differ then you have a malformed FASTQ record. You could delete that record from both R1/R2 files as a workaround. Was this data trimmed? Do you have the original files?
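
Rather than spot-checking with sed, you can scan the whole file for such records with a quick awk pass (a sketch; it reports any record whose sequence and quality lengths differ):

Code:

zcat mate_pair_1.fq.gz | awk '
    NR % 4 == 2 { s = length($0) }                     # sequence length
    NR % 4 == 0 && length($0) != s {                   # quality length mismatch
        print "record " NR/4 ": seq=" s " qual=" length($0) }'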

bio_d 10-23-2017 04:08 PM

Hi,

zcat mate_pair_1.fq.gz | sed -n '2051361,2051364p'


TACACATCTAGATGTGTATAAGAGACAGGTAATGGGATTGCCAGGTTCCCCCTCACTTGTAGTTTTGGATTTGGATTTATTATTCTTAATGTATGTATGTAGCACCATAGCTATGTGTGCTCAGG
+
BBBCCGC>DE1>FC1FCCFFGEEEBB1FGGDDCD111CF@EG1FBFFFDC>C><F>CC1:FF:11119:1111B1111E@>GC>G1=:11:=<CFD<1F:0=:F@0BFCG>>0FFB>F000;00F

I checked the lengths of the sequence and quality lines; both are 125. Yes, the FASTQ files were trimmed with Trim Galore and quality checked with FastQC. I do have the raw Illumina data as well.

Is it because trimming was done by the above-mentioned tools rather than bbduk.sh? I can't really understand what is going wrong.

Best,
D

GenoMax 10-23-2017 06:15 PM

Can you check the R2 file by the same method?

bio_d 10-23-2017 08:29 PM

Hi,

There seems to be nothing wrong with the second read either:

zcat mate_pair_2.fq.gz | sed -n '2051361,2051364p'

@HWI-D00294:282:CAB4VANXX:7:1101:13812:60135 2:N:0:GCCAAT
TTGAAGCAGCAGTTCAAAAACATTGTCTCAGTCTGTCTTAATTTGGTATAATCCCCTGAATCTATTAAACCAAGACCAGCTGTCTGACATTTTTCACTATTTTCTTTTCTCCGCTTGTTCTTTTC
+
@BBB@FG1EBGG@D1FFGFGGGGGEEGGGFCDGGE=@>DF@F@1@@>FG>FG>DEG1E>1@FDGGGCGECBGEGBE>1@>GGFC=D>FGE@FFGGGG00E>F>DCGGGGGGGGGDF=@>C0B@FG

santiagorevale 11-02-2017 08:32 AM

Quote:

Originally Posted by santiagorevale (Post 212066)
Hi Brian,

I'm using "filterbyname.sh" script from bbmap v37.60 (using Java 1.8.0_102) to extract some reads from a FastQ file given a list of IDs.

The current FastQ file has 196 Mi reads and I want to keep 85 Mi. Uncompressed FastQ file size is 14G while compressed is only 1.4G. IDs file is 3.1G.

When running the script using 24G of RAM it dies with OutOfMemoryError. Isn't it an excessive use of memory for just filtering a FastQ file? Also, among the script arguments the is no "threads" option, however the script is using all available cores. Any way of limiting both memory as well as threads usage?

Here is the error:

java -ea -Xmx24G -cp /software/bbmap-37.60/current/ driver.FilterReadsByName -Xmx24G include=t in=Sample1.I1.fastq.gz out=filtered.Sample1.I1.fastq.gz names=reads.ids
Executing driver.FilterReadsByName [-Xmx24G, include=t, in=Sample1.I1.fastq.gz, out=filtered.Sample1.I1.fastq.gz, names=reads.ids]

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.String.toLowerCase(String.java:2647)
at java.lang.String.toLowerCase(String.java:2670)
at driver.FilterReadsByName.<init>(FilterReadsByName.java:145)
at driver.FilterReadsByName.main(FilterReadsByName.java:40)

Thank you very much in advance.

Best regards,
Santiago

Hi there,

Any hint on what I've previously asked?

Thanks!

TomHarrop 11-06-2017 07:08 PM

Hi Brian & others,

I tried to run bbnorm with a kmer size of 99, but it crashed with the following error:

Code:

Exception in thread "Thread-371" Exception in thread "Thread-357" Exception in thread "Thread-368" Exception in thread "Thread-377" Exception in thread "Thread-367" Exception in thread "Thread-362" Exception in thread "Thread-380" Exception in thread "Thread-363" Exception in thread "Thread-365" Exception in thread "Thread-364" Exception in thread "Thread-366" Exception in thread "Thread-358" Exception in thread "Thread-361" Exception in thread "Thread-360" Exception in thread "Thread-381" Exception in thread "Thread-387" Exception in thread "Thread-372" Exception in thread "Thread-399" java.lang.AssertionError: this function not tested with k>31
    at jgi.KmerNormalize.correctErrors(KmerNormalize.java:2124)
    at jgi.KmerNormalize.access$19(KmerNormalize.java:2121)
    at jgi.KmerNormalize$ProcessThread.normalizeInThread(KmerNormalize.java:3043)
    at jgi.KmerNormalize$ProcessThread.run(KmerNormalize.java:2806)

I'm wondering if I used a bad combination of parameters. Here's the call:

Code:

java -ea -Xmx132160m -Xms132160m -cp PATH/TO/bbmap/current/ jgi.KmerNormalize bits=32 in=output/trim_decon/reads.fastq.gz threads=50 out=output/k_99/norm/normalised.fastq.gz zl=9 hist=output/k_99/norm/hist_before.txt histout=output/k_99/norm/hist_after.txt target=50 min=5 prefilter ecc k=99 peaks=output/k_99/norm/peaks.txt

Otherwise, are larger kmer sizes supported in bbnorm, or would you recommend estimating the target coverage for k=99 from the coverage at k=31?

I've posted the log on pastebin: https://pastebin.com/jPkKagFs

Thanks again for the bbmap suite!

Tom
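
Until large-k support is confirmed, one hedged workaround is to keep normalization at k=31 (the tested range, per the assertion) and drop ecc; the call below simply restates Tom's command with those two changes, with the output paths adjusted to match:

Code:

bbnorm.sh in=output/trim_decon/reads.fastq.gz out=output/k_31/norm/normalised.fastq.gz \
    k=31 target=50 min=5 prefilter zl=9 threads=50 \
    hist=output/k_31/norm/hist_before.txt histout=output/k_31/norm/hist_after.txt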

boulund 11-07-2017 01:54 AM

Hi, just want to make sure I'm not missing anything here: randomreads.sh cannot produce metagenomes according to a specific profile, right? I can only find information about it drawing a random (exponentially distributed) coverage for each reference sequence, which produces a simulated metagenome from a set of references. That is of course awesome, but right now I would like to produce a simulated metagenome with a very specific composition.
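
In the meantime, one hedged way to get an exact composition is to simulate each reference separately with a fixed read count and concatenate the results (the profile file format here is an assumption, not a randomreads feature):

Code:

# profile.tsv: <reference fasta> <number of reads>, one genome per line
while read ref nreads; do
    randomreads.sh ref="$ref" reads="$nreads" len=150 paired \
        out="sim_$(basename "$ref" .fa).fq.gz"
done < profile.tsv
cat sim_*.fq.gz > metagenome.fq.gz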

GenoMax 11-07-2017 04:01 AM

Quote:

Originally Posted by santiagorevale (Post 212345)
Hi there,

Any hint on what I've previously asked?

Thanks!

Perhaps. But if you have more memory why not allocate more and see if that helps. Unless you are being charged by every megabyte you use ;)

santiagorevale 11-07-2017 05:00 AM

Quote:

Originally Posted by GenoMax (Post 212495)
Perhaps. But if you have more memory why not allocate more and see if that helps. Unless you are being charged by every megabyte you use ;)

Hi GenoMax,

Because I'm running this on a cluster, getting more memory means requesting more cores (slots), and jobs requesting more cores take longer to get scheduled. I was also running this command alongside other commands requiring the same number of cores.

Still, isn't it odd for the script to require much more memory than the combined size of the uncompressed FastQ and IDs files?

Thanks!

GenoMax 11-07-2017 06:27 AM

Quote:

Originally Posted by santiagorevale (Post 212499)
Hi GenoMax,

Because I'm running this on a cluster, getting more memory means requesting more cores (slots), and jobs requesting more cores take longer to get scheduled. I was also running this command alongside other commands requiring the same number of cores.

Still, isn't it odd for the script to require much more memory than the combined size of the uncompressed FastQ and IDs files?

Thanks!

While that is an odd restriction, it is what it is when one is using shared compute resources.

Just for kicks, have you tried running this on a local desktop with a decent amount of RAM (16G)? Just keeping FASTQ headers in memory should not take a large amount of RAM, as you speculate.
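
For scale, a back-of-envelope heap estimate, assuming ~40-character IDs stored as UTF-16 Java Strings with roughly 50 bytes of per-entry object and hash-set overhead (both figures are assumptions, not measurements):

Code:

$ awk 'BEGIN { print 85e6 * (40 * 2 + 50) / 1e9 " GB" }'
11.05 GB

So before any GC headroom, 85M retained IDs could plausibly occupy on the order of 11 GB of heap.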

santiagorevale 11-07-2017 08:59 AM

Quote:

Originally Posted by GenoMax (Post 212502)
While that is an odd restriction, it is what it is when one is using shared compute resources.

Just for kicks, have you tried running this on a local desktop with a decent amount of RAM (16G)? Just keeping FASTQ headers in memory should not take a large amount of RAM, as you speculate.

I tried running it on a computer with 8 GB of RAM and on cluster nodes with -Xmx limits of 18G and 24G (the nodes have between 96 and 128 GB).

I wasn't saying earlier that keeping headers in memory takes lots of RAM. I meant that I couldn't understand why it ran out of memory with 24G: even if the program were to load both files (FastQ and IDs) entirely into memory (I don't know how it works internally), that would add up to 17.1G, so even in that scenario it should not have run out of memory.

I ran the command on 232 sets of files with -Xmx18G, with the following results:
- Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded, 39 times

Code:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.LinkedHashMap.newNode(LinkedHashMap.java:256)
        at java.util.HashMap.putVal(HashMap.java:641)
        at java.util.HashMap.put(HashMap.java:611)
        at java.util.HashSet.add(HashSet.java:219)
        at shared.Tools.addNames(Tools.java:456)
        at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
        at driver.FilterReadsByName.main(FilterReadsByName.java:40)

- Exception in thread "main" java.lang.OutOfMemoryError: Java heap space, 8 times

Code:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.lang.StringCoding.decode(StringCoding.java:187)
        at java.lang.StringCoding.decode(StringCoding.java:254)
        at java.lang.String.<init>(String.java:546)
        at java.lang.String.<init>(String.java:566)
        at shared.Tools.addNames(Tools.java:456)
        at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
        at driver.FilterReadsByName.main(FilterReadsByName.java:40)

- java.lang.OutOfMemoryError: GC overhead limit exceeded, 5 times

Code:

java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOfRange(Arrays.java:3520)
        at stream.KillSwitch.copyOfRange(KillSwitch.java:300)
        at fileIO.ByteFile1.nextLine(ByteFile1.java:164)
        at shared.Tools.addNames(Tools.java:454)
        at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
        at driver.FilterReadsByName.main(FilterReadsByName.java:40)

This program ran out of memory.
Try increasing the -Xmx flag and using tool-specific memory-related parameters.

I couldn't identify a particular reason for each of the three different errors, but what I can tell is that failure is driven by the number of reads kept: all of the processes that failed were trying to retain at least 56,881,244 paired-end reads, while the largest run that succeeded retained 50,519,102.

One thing I realise could be contributing to the crashes is that there is no way to limit the number of threads, so it always uses all available cores. Even if you launch it with "threads=1" (not currently documented as an option for "filterbyname"), you get the message "Set threads to 1" but it still uses all of them.

I don't want you to make this a priority, because I managed to work around it, but I think it is worth checking. I also think being able to limit threads should be a must for every command, since in most scenarios they will run on shared servers/clusters.

Thanks for your help!

vzinche 12-04-2017 02:49 AM

randomreads.sh for huge data
 
Hello!

I am trying to simulate reads from many genomes using the metagenome mode of randomreads.
The problem is that the more genomes I use, the worse the reads get. Say, when I use 100 genomes, around 99% of the reads can be mapped back (using bbmap) to the original genomes, but with 1000 genomes I can map only around 30-40% of the generated reads. Is there some reasonable explanation for this?

GenoMax 12-04-2017 03:32 AM

Quote:

Originally Posted by vzinche (Post 213094)
Hello!

I am trying to simulate reads from many genomes using the metagenome mode of randomreads.
The problem is that the more genomes I use, the worse the reads get. Say, when I use 100 genomes, around 99% of the reads can be mapped back (using bbmap) to the original genomes, but with 1000 genomes I can map only around 30-40% of the generated reads. Is there some reasonable explanation for this?

How similar are those 1000 genomes? What multi-mapping parameters are you using with BBMap? As the number of similar genomes increases, the number of reads that multi-map will go up as well. You could use "ambig=all" to allow reads to map to every location/genome, which will likely push the % of aligned reads up, but you lose specificity at that point. Another thing you could do is generate longer reads, which will increase mapping specificity.

Can you say what is the reason behind this exercise and what exact parameters you used for the randomreads.sh and bbmap.sh runs?

vzinche 12-04-2017 04:37 AM

Sorry, I didn't describe the problem well enough in the previous message.

The mapping isn't the main goal or the main problem. I need to simulate a huge metagenomic dataset (1000 genomes) for further use, but I need to keep careful track of the positions of the reads on the genomes. The dataset was simulated with the following parameters: len=250 paired coverage=5 metagenome=t simplenames snprate=0.02

When I manually compared the genome sequence between the positions stated in a read's header with the actual read sequence, for most reads the two were too different (a BLAST alignment of the sequences showed no similarity), though for some they matched perfectly. I checked only +strand reads, for simplicity. That's why I had the idea of running BBMap to estimate the number of reads that can't even be mapped back to the original genomes; with all default parameters it could map only around 35% of the reads.

But when I redid all of this with 100 genomes (randomly sampled from the 1000), I couldn't find these 'messed up' reads and could map more than 99%. As the number of genomes increases, the percentage of mapped reads decreases.

The genomes are not very closely related, and changing the number of genomes used didn't really affect their similarity.

So my main concern is not the mapping itself, but the source of these 'messed up' reads.

GenoMax 12-04-2017 05:20 AM

@Brian will likely have to weigh in on this (especially on the mismatch between the positions stated in the read headers and the actual read sequences), but be aware that he has been behind on support of late.

A few things to check that I can think of:

1. If you are only going to check the + strands then perhaps you should have used the samestrand=t option when generating the reads.
2. Default value for BBMap is ambig=best. Can you try mapping with ambig=all to see if that improves alignments?
3. Do you know why the remaining reads are not mapping (are they chimeras)?

vzinche 12-04-2017 05:29 AM

I will try that, thank you.

And regarding the third question, that is actually the problem: I have no idea where these reads come from. I tried searching for them, or parts of them, in the original genomes, with no success. They could be chimeras made up of short sequences, but I can't say for sure.

My first thought was that it could be some memory problem, since it gets worse as the size of the initial file increases, but that's just a guess.

GenoMax 12-04-2017 06:01 AM

Have you looked through the logs to see if there is any indication of issues? There is always the possibility that @Brian has not checked an extreme usage case like this for randomreads.sh, and this may be a genuine bug that is a real road-block.

Since you have said that 100 genomes seem to work fine, you could do 10 runs of 100 genomes each and then merge the data (sketched below). A thought.
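
A sketch of that batching idea, reusing the randomreads parameters quoted earlier in the thread (pre-splitting the 1000 genomes into batch_1.fa ... batch_10.fa is assumed):

Code:

for i in $(seq 1 10); do
    randomreads.sh ref="batch_${i}.fa" len=250 paired coverage=5 \
        metagenome=t simplenames snprate=0.02 out="sim_batch_${i}.fq.gz"
done
cat sim_batch_*.fq.gz > simulated_metagenome.fq.gz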

mcmc 12-11-2017 08:21 PM

BBSplit ambig=toss
 
Hi Brian et al.,
When I run BBSplit with ambig=toss, the ambiguous reads are not written to unmapped.fq, but when I run BBMap they are. Is this the expected behavior? I'd like to be able to retrieve the ambiguous reads from BBSplit (both within and between the two references).
Thanks,
MC

mcmc 12-13-2017 08:29 AM

summarizing mapped reads by orf
 
Is there a way to use BBTools to summarize reads mapped to a genome (with BBMap/BBSplit, in a SAM file) by ORF? I see that pileup.sh will take a Prodigal-output FASTA file with ORF info, but I've got a genome downloaded from RefSeq with all the NCBI files (GFF, CDS FASTA, GenBank). Can BBTools parse one of these to summarize my SAM file by ORF?

While I could map to the ORF FASTA instead, I'm interested in the intergenic regions too, e.g. for ORF/RNA discovery.

Thanks,
MCMC

GenoMax 12-13-2017 09:27 AM

Quote:

Originally Posted by mcmc (Post 213342)
Is there a way to use BBTools to summarize reads mapped to a genome (with BBMap/BBSplit, in a SAM file) by ORF? I see that pileup.sh will take a Prodigal-output FASTA file with ORF info, but I've got a genome downloaded from RefSeq with all the NCBI files (GFF, CDS FASTA, GenBank). Can BBTools parse one of these to summarize my SAM file by ORF?

While I could map to the ORF FASTA instead, I'm interested in the intergenic regions too, e.g. for ORF/RNA discovery.

Thanks,
MCMC

BBTools currently has no counting utility. One may be on the wish list, since many have asked Brian. For now, your best bet is to use featureCounts.
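
A minimal featureCounts invocation for the RefSeq case above (featureCounts ships with the Subread package; the -t/-g values depend on your GFF's feature and attribute names, so treat these as a starting point):

Code:

featureCounts -a genome.gff -t CDS -g ID -o counts_by_orf.txt mapped.bam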

mcmc 12-13-2017 12:36 PM

Quote:

Originally Posted by GenoMax (Post 213344)
BBTools currently has no count utilities. They may be on the wish list since many have asked Brian. For now, your best bet is to use featureCounts.

Thanks! I'm surprised there's something BBTools doesn't do :)

phylloxera 01-10-2018 12:31 PM

Ultimately, I'd like to do variant calling on a combined PacBio/Illumina whole viral genome dataset. I am working with BBMap right now as it has the intuitive minid flag, which seems desirable. As a first step, I'm trying to optimize my mapping as much as possible on one of the samples that is most divergent from the reference.

Here is my working command:

Code:

bbmap/mapPacBio.sh in=200335185_usedreads.fastq.gz ref=200303013.fa maxindel=40 minid=0.4 vslow k=8 out=200335185.sam overwrite=t bamscript=bs.sh; sh bs.sh

It maximizes the number of reads mapped (4148/4164) and minimizes the number of ambiguously mapped reads (1).

Given k=8 and minid=0.4, the Geneious mapper maps all 4164 reads for maxindel ranging from 20 to 500. If it is in the cards, I'd like to map the remaining stragglers, but I don't know what other BBMap flags to try. Also, I'm curious why BBMap is so much more sensitive to the value of maxindel. Here are selected BBMap results:
Code:

maxindel    reads mapped    ambiguous
20          4145            2
40          4148            1
60          4137            1
100         4130            3
200         4125            4
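
For reference, the sweep above can be reproduced directly from the working command, varying only maxindel:

Code:

for mi in 20 40 60 100 200; do
    bbmap/mapPacBio.sh in=200335185_usedreads.fastq.gz ref=200303013.fa \
        maxindel="$mi" minid=0.4 vslow k=8 overwrite=t \
        out="200335185_maxindel${mi}.sam"
done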

