SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bowtie, an ultrafast, memory-efficient, open source short read aligner Ben Langmead Bioinformatics 513 05-14-2015 03:29 PM
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 10:37 AM
Miso's open source joyce kang Bioinformatics 1 01-25-2012 07:25 AM
Targeted resequencing - open source stanford_genome_tech Genomic Resequencing 3 09-27-2011 04:27 PM
EKOPath 4 going open source dnusol Bioinformatics 0 06-15-2011 02:10 AM

Reply
 
Thread Tools
Old 06-04-2014, 09:57 AM   #61
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Salvatore,

That error is because there are unequal numbers of reads in the read1 and read2 files (I'll add a better error message that explains it). This happens when a read-trimming tools throws away one read in a pair because it's too short, but not the other. This also ruins the pairing order, so mapping paired reads will no longer work correctly. BBDuk and Trimmomatic both keep paired reads together correctly.

This is a known problem with fastx; you basically should not use it with paired reads. I think cutadapt is able to handle pairs correctly when you use the "-p" flag, but I'm not sure.

You can re-pair the broken files with my repair.sh script, in the latest release of BBTools (32.27), like this:
repair.sh in1=r1.fq in2=r2.fq out1=fixed1.fq out2=fixed2.fq outsingle=singletons.fq
Brian Bushnell is offline   Reply With Quote
Old 06-04-2014, 10:23 AM   #62
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by WhatsOEver View Post
Sorry, but I don't get it completely.

What is your definition of an "unambiguously mapped read" if it has secondary alignments? If secondary alignments are printed as a result of a more liberal threshold, the read is no longer unambiguous under the used threshold, is it? This should be independent of how bad your sec alignments are, the chosen threshold shouldn't change the terminology, should it?
So in my case I have ~16K reads which are "unambiguous" according to your definition but have seconday alignments?
I wrote a function that determines whether or not I classify a read as ambiguous. It takes into account various factors:
1) how high the best score is
2) the difference there is between the best score and the second best score
3) how many additional sites are within a certain range of the best score
It's not a closed-form equation, and it's completely subjective (that's what I meant by subjective, not the SAM spec); I optimized it empirically by iteratively tweaking it to minimize false-positives and maximize true-positives. And that alone decides whether or not I classify a read as "ambiguous".

When "ambig=all" is chosen, all sites with a score of at least 95% of the best score are printed, whether or not the best site is considered ambiguous. This is a bit different. I think I'll change it to make this number a parameter, and perhaps only print secondary sites for reads that are considered ambiguous, which would make more sense.

Quote:
But being aware of that also helps, thanks again Brian for a fast and helpful response. I will now stop my work on mapper-evaluation for my projects and focus on the next steps - so I won't bother you with additional things in the next time
Feel free to post whatever questions you have; they may be shared by other people, and may indicate that I could be doing something better or more clearly.
Brian Bushnell is offline   Reply With Quote
Old 06-04-2014, 07:01 PM   #63
punto_c
Member
 
Location: Japan, Tokyo

Join Date: Oct 2013
Posts: 11
Default thank you

Quote:
Originally Posted by Brian Bushnell View Post
Salvatore,

That error is because there are unequal numbers of reads in the read1 and read2 files (I'll add a better error message that explains it). This happens when a read-trimming tools throws away one read in a pair because it's too short, but not the other. This also ruins the pairing order, so mapping paired reads will no longer work correctly. BBDuk and Trimmomatic both keep paired reads together correctly.

This is a known problem with fastx; you basically should not use it with paired reads. I think cutadapt is able to handle pairs correctly when you use the "-p" flag, but I'm not sure.

You can re-pair the broken files with my repair.sh script, in the latest release of BBTools (32.27), like this:
repair.sh in1=r1.fq in2=r2.fq out1=fixed1.fq out2=fixed2.fq outsingle=singletons.fq
Thank you a lot Brian
Your explanation was as always very simple and clear
I could trimm in PE mode using Cutadapt and then was able to map the reads
Nothing to do for Fastx

thanks again

Salvatore
punto_c is offline   Reply With Quote
Old 06-25-2014, 06:17 PM   #64
coryfunk
Junior Member
 
Location: Seattle, WA

Join Date: Mar 2014
Posts: 8
Default

Brian,

I'm wondering if you can point or provide me with some recommended parameters for running shRNAseq datasets on BBMap.

My initial efforts using BBMap using the default parameters did yield a sam file, but when I converted it to a sorted bam files and attempted to use it on IGV, it said "does not contain any sequence names which match the current genome."

I'm pretty sure I'm overlooking something. I did create the indices with the miRBase mature.fa file. That seemed to work fine, but noticed in the output that said:

No index available; generating from reference genome:

Any help would be appreciated.
coryfunk is offline   Reply With Quote
Old 06-25-2014, 11:09 PM   #65
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

coryfunk,

IGV is very sensitive to the names of scaffolds; they must be identical in the fasta used for mapping and the fasta used by IGV. When I encounter that error, "does not contain any sequence names which match the current genome", it's usually because I'm using the wrong reference. so if, for example, you are mapping to a fasta of RNAs and trying to display the results in IGV against the full genome, that won't work.

The methodology would be something like this:

index:
bbmap.sh ref=mature.fa

map:
bbmap.sh in=reads.fq out=mapped.sam bs=script.sh

translate to bam:
sh script.sh
(that creates and runs a shellscript that creates a sorted indexed bam file from the sam output, if samtools is installed, though of course you can do it manually too)

Now, run IGV and choose "load genome" and point it to the same fasta file that you used for indexing, then load the bam file. If it still does not work, look at the header of the sam file, and verify that the names of the scaffolds in IGV are the same as the names in the sam header. Or, actually, that's what you should do first.

As for specifics about good settings to use for shRNAseq, I have no experience with it. Normally for RNAseq I suggest setting the "maxindel" flag to an appropriate value for the organism's expected intron length (say, 200000 for vertebrates or 8000 for fungi and plants with short introns). But if shRNAseq involves only short, unspliced RNAs, then you can just go with the defaults.

-Brian
Brian Bushnell is offline   Reply With Quote
Old 07-09-2014, 04:41 AM   #66
HGV
Junior Member
 
Location: Bremen, Germany

Join Date: Nov 2011
Posts: 4
Default rel. 33.04 fails

Hi Brian
I constantly update bbmap and now ran into problems when using the newest release 33.04 - it performs the mapping, but then fails during the Results statistics reporting, when 32.23 reports perfectly fine for exactly the same command:

Quote:
------------------ Results ------------------

Genome: 1
Key Length: 13
Max Indel: 10
Minimum Score Ratio: 0.65
Mapping Mode: normal
Reads Used: 7525278 (1888844778 bases)

Mapping: 96,483 seconds.
Reads/sec: 38997,80
kBases/sec: 19576,90


Pairing data: pct reads num reads pct bases num bases

mated pairs: 0,0231% 869 0,0231% 436238
bad pairs: 0,0037% 139 0,0037% 69778
insert size avg: 675,02
Exception in thread "main" java.lang.NullPointerException
at align2.ReadStats.mergeAll(ReadStats.java:140)
at align2.AbstractMapper.printOutput(AbstractMapper.java:1473)
at align2.BBMap.testSpeed(BBMap.java:432)
at align2.BBMap.main(BBMap.java:31)
What is going on here?
Cheers
Harald

Last edited by HGV; 07-09-2014 at 04:47 AM. Reason: forgot to mention last working release
HGV is offline   Reply With Quote
Old 07-09-2014, 09:26 AM   #67
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Harald,

Thanks for finding that! I added a new histogram, "qahist" (quality accuracy histogram), which correlates the claimed quality of bases with their actual quality as measured by error rate. But if you use the "qhist" flag and not the "qahist" flag, that crash occurred because an array was not getting initialized. It's fixed now; I'll upload the fixed version later today.

-Brian
Brian Bushnell is offline   Reply With Quote
Old 07-09-2014, 04:37 PM   #68
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I've uploaded the fixed version (33.08) to sourceforge.
Brian Bushnell is offline   Reply With Quote
Old 07-30-2014, 07:48 AM   #69
easoncm
Junior Member
 
Location: South Carolina

Join Date: Jul 2014
Posts: 1
Default Repair.sh

Hello,

I got the same error saying my files had a different number of reads and tried the suggestion of running repair.sh per this thread's reply and the documentation within the program. However, when I run repair.sh, it tells me to there are different number of reads and to run repair.sh.

I don't understand why it is telling me to run the program I am trying to run.

Any ideas?
Attached Images
File Type: jpg repair.sh problem.JPG (94.4 KB, 4 views)
easoncm is offline   Reply With Quote
Old 07-30-2014, 09:46 AM   #70
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by easoncm View Post
Hello,

I got the same error saying my files had a different number of reads and tried the suggestion of running repair.sh per this thread's reply and the documentation within the program. However, when I run repair.sh, it tells me to there are different number of reads and to run repair.sh.

I don't understand why it is telling me to run the program I am trying to run.

Any ideas?
Heh, that's kind of a funny error message. I wrote repair.sh with the assumption people would use it on interleaved files... I will fix that bug. But for now, you should be able to run it like this:

cat file1.fq file2.fq | repair.sh in=stdin.fq out1=fixed1.fq out2=fixed2.fq outsingle=singletons.fq overwrite=t
Brian Bushnell is offline   Reply With Quote
Old 08-12-2014, 06:11 AM   #71
andrej-gnip
Junior Member
 
Location: Copenhagen

Join Date: Aug 2014
Posts: 4
Default

Hi Brian,

I am trying to create a tool for identification of viral genetic material in a sample. It's supposed to work for RNA viruses, so RNA is reversed transcribed and the resulting DNA is sequenced. I end up with fastq files, where majority of reads comes from the host, but a significant minority comes from virus. I want to map the reads to a database of known RNA viral sequences, and I'm considering using BBMap for this purpose. I need a mapper which would be able to do local as well as end-to-end alignment... I tried bowtie2 at first, but the local version didn't work properly and I didn't manage to find anyone who'd know what's wrong, so I want to try another mapper now. The files with reads can be quite large. Do you think BBMap is a good tool for this type of purpose? Also, I looked through mapping parameters in the README file, but couldn't find seed length. Is there any reason for that?

Thank you very much.
andrej-gnip is offline   Reply With Quote
Old 08-12-2014, 10:45 AM   #72
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Andrej,

BBMap should work well for this purpose, as it can handle very low identity alignments which are typical of cross-species mapping. You can adjust the seed length with the "k" parameter. The default is 13, and I don't recommend going below about 7 as it becomes exponentially slower with a shorter seed length (speed is inversely proportional to 4^k). For high sensitivity, you might try something like this:

bbmap.sh in=reads.fq out=mapped.sam ref=virus.fa k=11 minratio=0.1 maxindel=200 slow

...or drop k even lower. But you still may not get any hits to your viral database unless it contains a relative of the virus in question.

If you have an assembly of the host genome, you can also remove reads that map to it like this, to reduce the volume of data you're dealing with:

bbmap.sh in=reads.fq outu=unmapped.fq ref=host.fa

-Brian
Brian Bushnell is offline   Reply With Quote
Old 08-21-2014, 08:07 AM   #73
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215
Default

Hi Brian,

I have some questions about the optional histogram tables:

1) mhist: It displays per base matches/sequencing errors, doesn't it?
What is included in the "Other" column?
How are multiple insertions counted? Assume the following:
Code:
12345--6789
ACTAG--CATC
ACTAGGGCATC
Will it count for this read 2 insertions at position 5?
I have different read length from 30bp - 1.2kbp. Is the error rate somehow adjusted to read length?

2) qhist: Assuming that Read_linear is the average base quality, what is Read_log?

3) ihist: This one is always empty?!

4) scafStats: Ambiguous reads are counted multiple times, aren't they? This means the overall ambiguousMB is in fact much smaller, isn't it?

Thanks!
WhatsOEver is offline   Reply With Quote
Old 08-21-2014, 10:07 AM   #74
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by WhatsOEver View Post
Hi Brian,

I have some questions about the optional histogram tables:

1) mhist: It displays per base matches/sequencing errors, doesn't it?
Yes.
Quote:
What is included in the "Other" column?
That's for any other cigar string symbols, which in practice means soft-clipping. If you don't use the 'local' flag, only reads that go off the end of scaffolds will be clipped.
Quote:
How are multiple insertions counted? Assume the following:
Code:
12345--6789
ACTAG--CATC
ACTAGGGCATC
Will it count for this read 2 insertions at position 5?
All counts are relative to the read position. But what I call an insertion is extra bases in the query that are not present in the reference, so I would classify that as a deletion, assuming that the top line (with '--') is the read and the bottom line is the reference. That deletion would be counted exactly once, regardless of the length, at position 5. An insertion of length L will be counted L times, once at each read base it covers. So, for example:

Code:
pos   123456789
read  ACTAGGGCATC
ref   ACTAG--CATC
That would be classified as an insertion and counted once at position 6 and once at position 7.
Quote:
I have different read length from 30bp - 1.2kbp. Is the error rate somehow adjusted to read length?
Yes, the sum of all columns should always be 1 for any line; it represents the fraction of cigar symbols at that position, so if there is 1 read of length 100 and 2 reads of length 200, then an error at position 50 would be represented as a 0.33 substitution rate, while an error at position 150 would be represented as a 0.5 substitution rate. As a result the graph gets more noisy going toward the right, with variable-length reads. Be sure you are using PacBio mode (mapPacBio8k.sh) for long reads; bbmap.sh has a read-length limit of 600bp, and reads longer than that will be broken into pieces.
Quote:
2) qhist: Assuming that Read_linear is the average base quality, what is Read_log?
The linear average sums the quality values, and divides by the number of quality values. The log average converts the phred quality score into a probability of error, sums the probabilities of error, divides that by the number of quality values, then transforms the average probability of error back into a phred score. As a result, the linear average (though commonly used) is kind of meaningless, while the logarithmic average can actually be used to calculate the expected error rate at that position. For example, if you had 200 reads and the log average at position 5 was 20, then you would expect there to be 2 total errors at position 5. However, if the linear average is 20, that doesn't actually tell you anything useful, but I include it anyway since that is what most programs plot.
Quote:
3) ihist: This one is always empty?!
"ihist" will contain the insert size distribution of properly-paired reads, if the mapping is done paired. It will be empty if no reads were proper pairs, or if reads were mapped single-ended. By default, BBMap expects reads in the Illumina fragment library orientation where reads are on opposite strands, pointing inward. You can override this with the "rcs=f" flag (requirecorrectstrand=false) which will allow any orientation, or with the "rcompmate" flag if both reads are on the same strand. Also, only reads that map to the same scaffold can be considered proper pairs.

You can, alternately, use the 'lhist' flag to just plot the read lengths, regardless of mapping.
Quote:
4) scafStats: Ambiguous reads are counted multiple times, aren't they? This means the overall ambiguousMB is in fact much smaller, isn't it?
Yes, unambiguously-mapped reads (and megabases) are counted once and ambiguously-mapped reads are counted at least twice.
Quote:
Thanks!
You're welcome!
Brian Bushnell is offline   Reply With Quote
Old 09-17-2014, 02:16 PM   #75
HGV
Junior Member
 
Location: Bremen, Germany

Join Date: Nov 2011
Posts: 4
Default pairlen= broken in 33.41

Hi Brian

I was using pairlen=1200 to limit insert sizes for paired end mapping and with the newest version 33.41 bbmap reports unkown option, while in 33.40b it used to work. What has happened?

Cheers
Harald
HGV is offline   Reply With Quote
Old 09-17-2014, 02:53 PM   #76
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Harald,

Thanks for noting that! I accidentally deleted the line that parsed that command. I've uploaded the fixed version, 33.42.


On another topic, I'd like to announce that a second developer, Jon Rood, has begun porting certain aspects of BBTools over to C using JNI calls. The currently ported classes are BBMerge, BBMap, and Dedupe. This is optional (you can enable it at runtime with the 'usejni' flag), and the output is identical, but there is a substantial speedup:
BBMap: +30%
BBMerge: +60%
Dedupe: up to +200% (when allowing an edit distance)

If you are interested in a free speed increase, instructions for compiling the C code for OS X or Linux are in /bbmap/jni/README.txt
Brian Bushnell is offline   Reply With Quote
Old 09-22-2014, 09:00 AM   #77
kbseah
Junior Member
 
Location: Germany

Join Date: Aug 2013
Posts: 6
Default

Hello,

I hope this is the right place to post questions related to BBMap, but since the last reply wasn't too long ago....

I've been using BBMap to map paired-end Fastq reads where the headers have been renamed for downstream analysis ("_1" and "_2" have been added to header names for forward and reverse reads respectively). When I look at the SAM file from the mapping, the forward and reverse reads that map have nonzero POS field, but the PNEXT fields are always zero. Is this caused by my editing the read names? Bowtie2 doesn't have the same problem, and assembling with SPAdes and IDBA-UD worked normally with the edited read names.

Example of the SAM entries for a read pair:
Quote:
HWI-ST863:279:H03F7ADXX:1:1101:7656:2184_1 16 NODE_207_length_5463_cov_10917.5_ID_413 125 44 13=1X137= * 0 0 [...] [...] NM:i:1 AM:i:44
HWI-ST863:279:H03F7ADXX:1:1101:7656:2184_2 0 NODE_207_length_5463_cov_10917.5_ID_413 5275 45 151= * 0 0 [...] [...] NM:i:0 AM:i:45
Thanks a lot in advance for your help.
Brandon
kbseah is offline   Reply With Quote
Old 09-22-2014, 10:38 AM   #78
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Brandon,

Those were not recognized as paired. BBMap recognizes only the normal Illumina naming schemes:

"* /1"
and
"* 1:"

If those reads are interleaved in a single file, use the "int=t" flag which will force BBMap to recognized them as being interleaved.
Brian Bushnell is offline   Reply With Quote
Old 09-22-2014, 11:47 AM   #79
HGV
Junior Member
 
Location: Bremen, Germany

Join Date: Nov 2011
Posts: 4
Default bbmap hitstats - unambiguous Hits

Hi Brian
I was looking at the hitstats files and I realized that the %unambiguousReads
can add up to more than 100%, and similarly the unambiguousReads count can be higher than the total input reads... How can that be?

Cheers
Harald
HGV is offline   Reply With Quote
Old 09-22-2014, 12:26 PM   #80
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Harald,

Thanks for noticing that. It works correctly for single-ended reads, but it appears that improper pairs (where one read maps to one scaffold, and the other maps to a different scaffold) are double-incrementing the counts on both scaffolds. I'll fix that in the next release.

-Brian
Brian Bushnell is offline   Reply With Quote
Reply

Tags
bbmap, metagenomics, rna-seq aligners, short read alignment

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:08 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO