SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
BBMap (aligner for DNA/RNAseq) is now open-source and available for download. Brian Bushnell Bioinformatics 556 09-07-2017 03:42 PM
BBMap for BitSeq dietmar13 Bioinformatics 1 04-30-2015 08:40 AM
Please help my BBMap cannot remove Illumina adapter TofuKaj Bioinformatics 4 04-28-2015 08:53 AM
BBMap Error Phage Hunter Bioinformatics 5 01-14-2015 04:34 AM
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 09:37 AM

Reply
 
Thread Tools
Old 09-07-2016, 11:31 AM   #121
JVGen
Member
 
Location: East Coast

Join Date: Jul 2016
Posts: 37
Default

Quote:
Originally Posted by Brian Bushnell View Post
Hi Jake,

By default, Dedupe will remove all but one copy of reads that are fully contained within other reads. This may not be a good idea prior to assembly of merged reads (since many will be contained by others due to the variable insert size); just using normalization alone with a target of 100x may work better. You can set Dedupe to only remove reads that are identical to others, though, by adding the flag "ac=f".

If you suspect your data is contaminated, I suggest you try to decontaminate it prior to normalization/assembly. If you are working on non-HIV and it is contaminated with HIV, that's fairly straightforward; you can use BBSplit or Seal (if you have both references) or just BBMap (if you only have the HIV reference but nothing for your amplicon) to separate the reads by mapping. Alternately, if you are confident that all of the low-coverage stuff is contamination, you can assemble the low-coverage portion of your data using Tadpole, which allows coverage thresholds (e.g. "mincr=1 maxcr=15" will make it assemble only those contigs with coverage between 1x and 15x). Then use the resulting assembly to filter all of the reads via mapping, keeping only the unmapped reads.

Incidentally, Tadpole tends to do a good job assembling very-high-coverage short sequences.
Thanks Brian,

I took the de novo approach to try to eliminate contaminating sequences. We're all working on HIV, and it's highly mutated: it's hard to remove contaminants by DNA sequence alone. However, since I know the approximate size of my PCR amplicons (it's different in each case because of DNA deletions), I know what size my contig should be after de novo assembly. By using stringent assembly parameters, that require ~30 bp overlap with maybe 1 mismatch, I can force contaminants to be assembled into a separate contig. I can then extract consensus sequences from each of these contigs, and filter them so that only contigs greater than ~150 bp are used in a subsequent map to reference, which should remove any minor contaminants that were present. I can then extract the consensus from this alignment, which should represent the original PCR amplicon - any contaminants that might have made it into my contig should be lost as they are outnumber by the 100s.

Does this workflow make sense for my application? I'm working on a desktop Mac, so I have limited options. I've been told that Spades might be a better assembler for me, but I think I'd need to purchase a server. With the little coding experience I have, I'm a bit nervous to invest the money, lest we never get the thing to work.

Are you available for hire as a consultant?

Thanks,
Jake

Last edited by JVGen; 09-07-2016 at 11:34 AM.
JVGen is offline   Reply With Quote
Old 09-07-2016, 01:03 PM   #122
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,669
Default

Quote:
Originally Posted by JVGen View Post
Thanks Brian,

I took the de novo approach to try to eliminate contaminating sequences. We're all working on HIV, and it's highly mutated: it's hard to remove contaminants by DNA sequence alone. However, since I know the approximate size of my PCR amplicons (it's different in each case because of DNA deletions), I know what size my contig should be after de novo assembly. By using stringent assembly parameters, that require ~30 bp overlap with maybe 1 mismatch, I can force contaminants to be assembled into a separate contig. I can then extract consensus sequences from each of these contigs, and filter them so that only contigs greater than ~150 bp are used in a subsequent map to reference, which should remove any minor contaminants that were present. I can then extract the consensus from this alignment, which should represent the original PCR amplicon - any contaminants that might have made it into my contig should be lost as they are outnumber by the 100s.

Does this workflow make sense for my application? I'm working on a desktop Mac, so I have limited options. I've been told that Spades might be a better assembler for me, but I think I'd need to purchase a server. With the little coding experience I have, I'm a bit nervous to invest the money, lest we never get the thing to work.

Are you available for hire as a consultant?

Thanks,
Jake
Hi Jake,

It might be helpful in this case if you could generate a kmer frequency histogram to see whether the contaminant and non-contaminant sequence is easily separable by depth alone. If so, there are a couple of easy ways to remove it. You can generate a kmer-frequency histogram with kmercountexact or BBNorm; just attach the text file to this thread. Normally I look at it in a log-log plot.

What assembler are you currently using, by the way? I've had poor results with Spades on viruses, and better results with Tadpole. But this was raw viral sequence and amplicon sequence may give different results.

As for consultancy, I've sent you a pm.

-Brian
Brian Bushnell is online now   Reply With Quote
Old 11-07-2016, 05:47 AM   #123
moistplus
Member
 
Location: Germany

Join Date: Feb 2016
Posts: 40
Default

I've ran pileup for contig coverage estimation after assembly.

The output is :

ID Avg_fold Length Ref_GC Base_Coverage Read_GC

1) The coverage is the Avg_fold right ?

2) If yes, I have some negative or 0 coverage ... How can it be possible ?

Last edited by moistplus; 11-07-2016 at 06:05 AM.
moistplus is offline   Reply With Quote
Old 11-07-2016, 08:35 AM   #124
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,669
Default

Quote:
Originally Posted by moistplus View Post
I've ran pileup for contig coverage estimation after assembly.

The output is :

ID Avg_fold Length Ref_GC Base_Coverage Read_GC

1) The coverage is the Avg_fold right ?
Yes, that's correct.

Quote:
2) If yes, I have some negative or 0 coverage ... How can it be possible ?
Zero coverage is certainly possible, since mapping and assembly results can differ. Negative coverage should be impossible. Can you send me the output file, and post your exact command line?

Thanks,
Brian
Brian Bushnell is online now   Reply With Quote
Old 11-08-2016, 01:34 AM   #125
moistplus
Member
 
Location: Germany

Join Date: Feb 2016
Posts: 40
Default

Quote:
Originally Posted by Brian Bushnell View Post
Yes, that's correct.



Zero coverage is certainly possible, since mapping and assembly results can differ. Negative coverage should be impossible. Can you send me the output file, and post your exact command line?

Thanks,
Brian
1) So I have PE from 2 differents library:
PE 350 (insert size)
PE 550

I merged them with a simple
HTML Code:
cat
so I have at the end F reads (pair1) and R reads (pair2).

2) Mapping with bwa:

HTML Code:
bwa mem contig.fasta pair1 pair2 | samtools view -bS - > pair12.bam
3) Sort the bam file

HTML Code:
samtools sort pair12.bam -o pair12.sorted.bam
4) Pileup

HTML Code:
pileup.sh in=pair12.sorted.bam  out=stats.txt hist=histogram.txt

One more precision, I use the BBmap v32.15

The stat file:

http://www.filedropper.com/stats

Last edited by moistplus; 11-08-2016 at 01:37 AM.
moistplus is offline   Reply With Quote
Old 11-08-2016, 08:58 AM   #126
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,669
Default

Indeed, I see 3 entries with negative coverage, which should not happen. However, v32.15 was released about 2.5 years ago and there have been thousands of changes since then, including hundreds of bug fixes. Could you try the latest version and see if that fixes the problem?
Brian Bushnell is online now   Reply With Quote
Old 11-09-2016, 01:46 AM   #127
moistplus
Member
 
Location: Germany

Join Date: Feb 2016
Posts: 40
Default

Quote:
Originally Posted by Brian Bushnell View Post
Indeed, I see 3 entries with negative coverage, which should not happen. However, v32.15 was released about 2.5 years ago and there have been thousands of changes since then, including hundreds of bug fixes. Could you try the latest version and see if that fixes the problem?
With the new version, I don't have anymore negative coverage!

Another question:
Using this command :

Quote:
reformat.sh in1=reads1.fq in2=reads2.fq out1=sampled1.fq out2=sampled2.fq samplerate=0.1
in1 and in2 are Forward reads and Reverse reads right ? So it keeps the mated pair at the end ?

Last edited by moistplus; 11-09-2016 at 02:02 AM.
moistplus is offline   Reply With Quote
Old 11-09-2016, 01:56 PM   #128
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,439
Default

Quote:
Originally Posted by moistplus View Post
With the new version, I don't have anymore negative coverage!

Another question:
Using this command :



in1 and in2 are Forward reads and Reverse reads right ? So it keeps the mated pair at the end ?
It should do that.
GenoMax is offline   Reply With Quote
Old 11-16-2016, 01:35 AM   #129
moistplus
Member
 
Location: Germany

Join Date: Feb 2016
Posts: 40
Default

Can you confirm that sampling reads is random ?

Code:
reformat.sh in1=reads1.fq in2=reads2.fq out1=sampled1.fq out2=sampled2.fq samplerate=0.1
moistplus is offline   Reply With Quote
Old 11-16-2016, 09:01 AM   #130
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,669
Default

Yes, it's random.
Brian Bushnell is online now   Reply With Quote
Old 11-29-2016, 09:00 PM   #131
Dario1984
Senior Member
 
Location: Sydney, Australia

Join Date: Jun 2011
Posts: 159
Default

But can it output chimerically mapped reads to a separate output file, like STAR can? There's no mention of chimeric reads in the user guide, so I'm not sure if it's even suitable for that case.
Dario1984 is offline   Reply With Quote
Old 11-30-2016, 08:14 AM   #132
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,669
Default

No, BBMap does not output chimerically-mapped reads to a separate file, though it can be used to separate properly-paired reads from improper (possibly chimeric) pairs.
Brian Bushnell is online now   Reply With Quote
Old 11-30-2016, 08:25 AM   #133
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,439
Default

Quote:
Originally Posted by Brian Bushnell View Post
No, BBMap does not output chimerically-mapped reads to a separate file, though it can be used to separate properly-paired reads from improper (possibly chimeric) pairs.
Is that functionality on "feature request" list?
GenoMax is offline   Reply With Quote
Old 11-30-2016, 09:14 AM   #134
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,669
Default

Quote:
Originally Posted by GenoMax View Post
Is that functionality on "feature request" list?
Oh, very well, I'll add it to the list
Brian Bushnell is online now   Reply With Quote
Old 12-05-2016, 02:59 PM   #135
JVGen
Member
 
Location: East Coast

Join Date: Jul 2016
Posts: 37
Default

Quote:
Originally Posted by Brian Bushnell View Post
Oh, very well, I'll add it to the list
Hi Brian,

I am trying to use BBMap to align 150 bp paired end reads to a 10 kb reference. The reference is an HIV genome, and my sequencing input is PCR amplified HIV proviruses (means I get lots of coverage).

I use BBDuk to adapter and quality trim my reads. I then used BBNorm to normalize coverage to ~150. Then I used BBMap to map the reads to the reference.

Deletions in HIV proviruses are common, and I noticed that BBMap seems to get hung up near the deletions. For instance, if a read spans the deletion, BBMap doesn't seem to insert a gap into the read so that it aligns on the other size of the deletion. I've attached some pictures. First pic is the full alignment, second 5' of the deletion, third 3' of the deletion. Note that the "CTGAGGGGACAGAT" sequence is present on the reads on the 5' side of the deletion, and should extend to the 3' side. In this case, most of these reads were trimmed so that the consensus would reflect the correct deletion, but I do worry that this might not always be the case.

Are there any settings to adjust this to allow the reads to span the deletion?

Thanks,
Jake





JVGen is offline   Reply With Quote
Old 12-05-2016, 03:57 PM   #136
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,439
Default

@JVGen: Are you using default alignment settings for bbmap? You may need to adjust maxindel (which defaults to 16000) and intronlen settings.
GenoMax is offline   Reply With Quote
Old 12-05-2016, 05:14 PM   #137
JVGen
Member
 
Location: East Coast

Join Date: Jul 2016
Posts: 37
Default

Quote:
Originally Posted by GenoMax View Post
@JVGen: Are you using default alignment settings for bbmap? You may need to adjust maxindel (which defaults to 16000) and intronlen settings.
Hi GenoMax,

The entirety of my reference is only 9000 bp, so I think the default maxindel size is appropriate. What does intronlen do?

Thanks!
JV
JVGen is offline   Reply With Quote
Old 12-06-2016, 03:06 AM   #138
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,669
Default

The default maxindel should be fine in this case. If there are reads spanning the deletion, they will be mapped spanning the deletion. I'm not familiar with the viewer you are using... perhaps you could try IGV?

"intronlen=10" will, for example, replace "D" (deletion) symbols in cigar strings with "N" (skipped) symbols, for deletions of at least 10bp in length. Some programs and viewers prefer N over D for whatever reason. I consider them equivalent. But, it's possible the viewer you are using does not properly display reads with "D" symbols in the cigar string, so using IGV or remapping with the "intronlen=10" flag might be helpful. Or, if you send me the sam file and reference I can look at it.

Quote:
most of these reads were trimmed so that the consensus would reflect the correct deletion
I'm not sure what you mean by that - can you clarify? Also, can you give the exact command you used for BBMap? If you use the "local" flag, long deletions might get erased.

I've honestly never heard a concern before that BBMap was unwilling to map reads spanning long deletions - only the opposite. In your last picture, it looks to me like all of the reads are mapped with a long deletion extending off the screen to the left; or am I misinterpreting it?
Brian Bushnell is online now   Reply With Quote
Old 12-06-2016, 04:13 AM   #139
JVGen
Member
 
Location: East Coast

Join Date: Jul 2016
Posts: 37
Default

Quote:
Originally Posted by Brian Bushnell View Post
The default maxindel should be fine in this case. If there are reads spanning the deletion, they will be mapped spanning the deletion. I'm not familiar with the viewer you are using... perhaps you could try IGV?
Hi Brian, that's Geneious I'm viewing in. I download and tried using IGV, but I get an error when trying to load up the SAM file. I shared the SAM file with you on google drive. I included the trimmed, normalized, unassembled reads and the reference as a separate FASTA as well, just in case.

Quote:
Originally Posted by Brian Bushnell View Post
"intronlen=10" will, for example, replace "D" (deletion) symbols in cigar strings with "N" (skipped) symbols, for deletions of at least 10bp in length. Some programs and viewers prefer N over D for whatever reason. I consider them equivalent. But, it's possible the viewer you are using does not properly display reads with "D" symbols in the cigar string, so using IGV or remapping with the "intronlen=10" flag might be helpful. Or, if you send me the sam file and reference I can look at it.
This could be, but they replace no coverage with gaps, so I don't know what the program is doing behind the scenes (if it's logging D's or N's). I will try repeating with this flag and see if it changes the outcome.

Quote:
Originally Posted by Brian Bushnell View Post
I'm not sure what you mean by that - can you clarify? Also, can you give the exact command you used for BBMap? If you use the "local" flag, long deletions might get erased.
I'm running it in Geneious with the following parameters (in picture). I'm going to try ticking the "discard trimmed regions", though I think it is unnecessary, because I don't think BBDuk keeps trimmed information (which is how I trimmed the reads). Dissolve contigs is redundant - no contigs have yet been assembled. Quirk of the program.



Quote:
Originally Posted by Brian Bushnell View Post
I've honestly never heard a concern before that BBMap was unwilling to map reads spanning long deletions - only the opposite. In your last picture, it looks to me like all of the reads are mapped with a long deletion extending off the screen to the left; or am I misinterpreting it?
There is a ~3.5kb deletion on the 3' end of the HIV genome. The HIV reference sequence is depicted in the faded yellow box. Reads assembled to the reference are depicted below as black rectangles (which, when I zoom in, show their sequence). A coverage map is shown above the reference in red. A consensus sequence for the assembled reads is provided above the coverage map. Within the consensus, black represents a mismatch to the reference (this many mismatches is not uncommon for HIV, as it's a retrovirus and reverse transcription introduces many mutations).

Thanks for any help!

JV
JVGen is offline   Reply With Quote
Old 12-06-2016, 12:21 PM   #140
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,669
Default

Quote:
Originally Posted by JVGen View Post
Hi Brian, that's Geneious I'm viewing in. I download and tried using IGV, but I get an error when trying to load up the SAM file.
IGV needs a sorted, indexed bam file. It won't accept sam.

Quote:
I shared the SAM file with you on google drive. I included the trimmed, normalized, unassembled reads and the reference as a separate FASTA as well, just in case.
Please send me the links, and I'll look at them.

Quote:
I'm running it in Geneious with the following parameters (in picture). I'm going to try ticking the "discard trimmed regions", though I think it is unnecessary, because I don't think BBDuk keeps trimmed information (which is how I trimmed the reads). Dissolve contigs is redundant - no contigs have yet been assembled. Quirk of the program.
Hmmm, I'm not really sure what Geneous is doing behind the scenes here with regards to trimming, but it doesn't look like it would have any kind of effect that would suppress long deletions.
Brian Bushnell is online now   Reply With Quote
Reply

Tags
bbmap

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:11 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO