SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
BBMap (aligner for DNA/RNAseq) is now open-source and available for download. Brian Bushnell Bioinformatics 573 10-10-2017 02:33 AM
BBMap for BitSeq dietmar13 Bioinformatics 1 04-30-2015 08:40 AM
Please help my BBMap cannot remove Illumina adapter TofuKaj Bioinformatics 4 04-28-2015 08:53 AM
BBMap Error Phage Hunter Bioinformatics 5 01-14-2015 04:34 AM
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 09:37 AM

Reply
 
Thread Tools
Old 08-14-2017, 09:58 AM   #161
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by SNPsaurus View Post
I take a subset from each file and then cat the subsets together. That would be faster than combining the inputs with cat, at least.
True, this works well. Alternatively, for interleaved files, you can do this:

Code:
cat a.fq b.fq c.fq | reformat.sh in=stdin.fq out=sampled.fq interleaved samplerate=0.1
...which avoids writing temp files. Won't work for twin paired files, though, unless you do something tricky with named pipes.

To avoid wasting disk space and bandwidth, I normally keep all fastq files gzipped at all times. Using pigz for parallel compression, or compression level 2 (zl=2 flag), will eliminate much of the speed penalty from dealing with compressed files; and if you are I/O limited, compressed files tend to speed things up.

I don't have any plans at present to add multiple input file support (or wildcard support) to Reformat, but I'll put it on my list and consider it. It's something I've occasionally wanted also.
Brian Bushnell is offline   Reply With Quote
Old 09-11-2017, 02:21 PM   #162
TomHarrop
Member
 
Location: New Zealand

Join Date: Jul 2014
Posts: 19
Default Can BBmap remove reads containing homopolymers?

Hi BBmap-ers,

Say I want to remove reads with more than 5 consecutive identical nucleotides. Is there a homopolymer/ polyX filtering option with BBDuk or reformat.sh?

Thanks!

Tom
TomHarrop is offline   Reply With Quote
Old 09-11-2017, 02:24 PM   #163
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

There is no homopolymer filter, per se, but you can accomplish that like this:

Quote:
bbduk.sh in=reads.fq out=clean.fq literal=AAAAAA,CCCCCC,GGGGGG,TTTTTT k=6 mm=f
Brian Bushnell is offline   Reply With Quote
Old 09-11-2017, 02:35 PM   #164
TomHarrop
Member
 
Location: New Zealand

Join Date: Jul 2014
Posts: 19
Default

Great, thanks!
TomHarrop is offline   Reply With Quote
Old 09-23-2017, 01:56 PM   #165
gokhulkrishnakilaru
Member
 
Location: Bethesda, Maryland

Join Date: Jul 2011
Posts: 39
Default

Hi Brian-

1. We are currently using bedtools BAM to BED and then extending reads for chip visualization. Does BBMAP suite has an option to extend reads in BAM based on the fragment length estimate from MACS or automatically from BAM?

2. Is there an option/tool to remove chip duplicates based on the read ID in BAM instead of co-ordinates?
gokhulkrishnakilaru is offline   Reply With Quote
Old 09-23-2017, 07:02 PM   #166
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

1) No, it does not. Tadpole can extend reads based on kmer-counts but it does not make use of mapping information.

2) Clumpify can remove optical duplicate based on a combination of sequence (which indicates whether they are duplicate) and position from the read ID (which indicates if they are very close). However, it cannot handle bam files, only fasta and fastq.

-Brian
Brian Bushnell is offline   Reply With Quote
Old 09-26-2017, 05:26 PM   #167
dho
Junior Member
 
Location: USA

Join Date: Sep 2017
Posts: 4
Default Comprehensive reporting of mapped reads

Hi Brian,

I am trying to map 100bp Illumina sequences to a collection of very similar, 80bp reference sequences (multiple alleles of a single exon) using bbmap.

When mapping to a single reference sequence from the collection in isolation using settings:

'ambiguous=all',
'maxsites=1000000',
'vslow',
'subfilter=0'

An expected number of sequences (in this case, ~270) map and fully cover the reference sequence.

When using the same parameters and mapping to a collection of sequences containing the same reference sequence only ~120 reads map.

Is there another parameter(s) I need to be setting to map my reads exhaustively against all reference sequences?

Thanks,

dave
dho is offline   Reply With Quote
Old 09-27-2017, 01:30 PM   #168
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hi Dave,

BBMap has some heuristics that may make it non-ideal for the situation with a large number of near-identical sequences, particularly when the reads don't map glocally well to any of them (because the reads are longer than the reference sequences) and the reference sequences are tiny. You might also try the flag "excludefraction=0" to see if this changes anything. To clarify, there are perfect alignments (with zero mismatches) that are getting missed, correct?

Many of the heuristics related to ignoring extremely common, uninformative reference kmers are disabled in bbmapskimmer.sh, which is designed specifically for a high degree of multimapping. The syntax is the same as BBMap, so please give that a try and let me know if it works better. You'll need to additionally add the flag "minscaf=1" or really short scaffolds get discarded, so something like:

Code:
bbmapskimmer.sh in=reads.fq ref=ref.fa maxsites=1000000 vslow ambig=all maxindel=0 subfilter=0 excludefraction=0 out=mapped.sam minscaf=1
Please let me know whether that changes the situation.

-Brian
Brian Bushnell is offline   Reply With Quote
Old 09-27-2017, 05:37 PM   #169
dho
Junior Member
 
Location: USA

Join Date: Sep 2017
Posts: 4
Default

Hi Brian,

No, unfortunately it didn't help. I think you understand the question correctly:

When I map 100bp reads against a single allele of a gene with four exons using:

Code:
maxsites=1000000 vslow ambig=all maxindel=0 subfilter=0 excludefraction=0 out=mapped.sam minscaf=1 covstats=stdout | grep 'IPD0001613'
I get 100% read support for all four exons.

Code:
Mafa-DPA1*07:02|IPD0001613_2_MHC-II-DPA	288.0902	244	0.5000	100.0000	244	421	388	0.4788	297	35.43
Mafa-DPA1*07:02|IPD0001613_3_MHC-II-DPA	235.8719	281	0.5658	100.0000	281	408	343	0.5554	247	40.92
Mafa-DPA1*07:02|IPD0001613_4_MHC-II-DPA	218.1935	155	0.6323	100.0000	155	227	220	0.5913	228	27.77
Mafa-DPA1*07:02|IPD0001613_1_MHC-II-DPA	199.3457	81	0.5679	100.0000	81	147	111	0.5713	200	35.48
When I use the same parameters but include the allele in a larger reference sequence that contains many other alleles, the number of mapped reads decreases for all four exons and no longer fully covers exon 1:

Code:
Mafa-DPA1*07:02|IPD0001613_2_MHC-II-DPA	242.6475	244	0.5000	100.0000	244	343	327	0.4746	233	43.23
Mafa-DPA1*07:02|IPD0001613_3_MHC-II-DPA	58.2847	281	0.5658	100.0000	281	81	93	0.5509	63	19.81
Mafa-DPA1*07:02|IPD0001613_4_MHC-II-DPA	194.1226	155	0.6323	100.0000	155	201	181	0.5935	204	33.17
Mafa-DPA1*07:02|IPD0001613_1_MHC-II-DPA	38.3333	81	0.5679	97.5309	79	34	17	0.6038	51	17.89
Any other thoughts?

Thanks,

dave
dho is offline   Reply With Quote
Old 09-28-2017, 07:42 AM   #170
gokhulkrishnakilaru
Member
 
Location: Bethesda, Maryland

Join Date: Jul 2011
Posts: 39
Default

Quote:
Originally Posted by Brian Bushnell View Post
1) No, it does not. Tadpole can extend reads based on kmer-counts but it does not make use of mapping information.

2) Clumpify can remove optical duplicate based on a combination of sequence (which indicates whether they are duplicate) and position from the read ID (which indicates if they are very close). However, it cannot handle bam files, only fasta and fastq.

-Brian
Thanks Brian.

How about BED?
gokhulkrishnakilaru is offline   Reply With Quote
Old 10-01-2017, 10:36 AM   #171
dho
Junior Member
 
Location: USA

Join Date: Sep 2017
Posts: 4
Default

Hi Brian,

I figured out the problem and a workaround.The problem is that even with your recommended settings reads that extended beyond the ends of both short and long reference sequences would only be mapped to the longer reference sequences.

The workaround is to avoid having longer and shorter reference sequences as mapping targets at the same time. I subdivided by reference sequences into multiple reference sequences that each contain sequences of the same size and then map against each of these individually. I can then merge the output from all of these mappings.

Thanks for your help this week and I hope my solution helps others who encounter the same issue!

dave
dho is offline   Reply With Quote
Old 10-03-2017, 02:41 PM   #172
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by dho View Post
Hi Brian,

I figured out the problem and a workaround.The problem is that even with your recommended settings reads that extended beyond the ends of both short and long reference sequences would only be mapped to the longer reference sequences.

The workaround is to avoid having longer and shorter reference sequences as mapping targets at the same time. I subdivided by reference sequences into multiple reference sequences that each contain sequences of the same size and then map against each of these individually. I can then merge the output from all of these mappings.

Thanks for your help this week and I hope my solution helps others who encounter the same issue!

dave
Hi Dave,

Thanks for the followup. I apologize for not getting back to you in a timely fashion, I'm pretty swamped currently!
Brian Bushnell is offline   Reply With Quote
Old 10-04-2017, 04:33 PM   #173
dho
Junior Member
 
Location: USA

Join Date: Sep 2017
Posts: 4
Default

Hi Brian,

No problem! Thanks for all that you do.

For posterity, the solution I proposed above worked for the most part, but I encountered a few cases where it didn't. Since then, I've decided to be fully exhaustive and map against each reference sequence individually. This is computationally intensive but gives the most robust results. Even with these settings, though, reads mapping to very small reference sequences (<50bp) doesn't seem to work as consistently as I'd like, though I haven't figured out the source of this inconsistency.

--dave
dho is offline   Reply With Quote
Old 10-10-2017, 08:21 AM   #174
gokhulkrishnakilaru
Member
 
Location: Bethesda, Maryland

Join Date: Jul 2011
Posts: 39
Default Adding Unique Identifier To Paired End Reads. Editing FASTQ Read ID based on randomer

Hi Brian-

We have a paired end rnaseq data.

For every sequence in read2, we want to extract the first 6nucleotides and append them to the read id in read2 and also to the respective read id in read1.

Please see example below.

Code:
cat read1
@NB501293:231:HV3CTBGX2:1:11101:2280:1047 1:N:0:CAGATC
CCGCANGTTGCAGAGCGGGTGGGAGCCNCTNCGGGCGCGGCACTGNAGCCCTGANACTGAACCCCGAACCCGAGCC
+
AAAAA#EEEEEEE/EEEEEEEEE6E//#<E#/EEE/E/EAEAEEE#/EA/AEEE#/EAAAEE/EEEAEEE/EE//6
@NB501293:231:HV3CTBGX2:1:11101:9866:1047 1:N:0:CAGATC
CTCAANGGGAGAGACCTTAGATGATACNCANGATGACAGTAGGTANAGGGAACTTATAGAGCCACCTCCATCAGGA
+
AAAAA#EEEEEEEEEAEEEEEEEEEEE#EE#EEEEEEEEEEEEEE#EEEEEEEEEEEEEEEEEAEEEEEEEEEEEE
@NB501293:231:HV3CTBGX2:1:11101:24301:1048 1:N:0:CAGATC
ACGGANCTCTGGCTGTTGTATGGAAAGNTANGCTGTAACACGCACNGACAGAAGAGAGCCATTTTCTCCCTGAACT
+
AA/AA#EEEEEAEAEAEEEEE/EEEEE#E<#6EEAEEAAAE///E#EEAEE//A/</EE</EE//E6E6EEEEEE6
@NB501293:231:HV3CTBGX2:1:11101:16754:1048 1:N:0:CAGATC
CCTGGNAGCCGCCGCAAGCGCCGGACCNCANGCACTCCCAGGCGCGCGCGCTTCTTCTGCAAAAAGTTGAGGGCTC
+
AAAAA#EEAEEEEEE6EEEEEEEEAE/#EE#EE/EE//E/EEAEEEEEEEEAEEEEEEEE/EEEEEEEEEEEEE/E


cat read2
@NB501293:231:HV3CTBGX2:1:11101:2280:1047 2:N:0:CAGATC
GCTGGGCGAGTAGCTTCTGGATCCTGGCCTCCTGAGCCTGTGGCCCGGGCTAGGCTCGGGGCTCGGGTTCGGGGTT
+
AAAAAEEEEEEE/AEEAEEEEEEEE/AEEEAAAE/EEEEEAEAEEEE6EEEAEEEEEEEEEEE/EAEAAE6EEE<A
@NB501293:231:HV3CTBGX2:1:11101:9866:1047 2:N:0:CAGATC
TGACTGTGGGGTGGCAACCCCATTCCTCACTTGATGTCCTGTCTTCCTGATGGAGGTGGCTCTATAAGTTCCCTCT
+
AAAAAEEEEEEEEEE/AEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE
@NB501293:231:HV3CTBGX2:1:11101:24301:1048 2:N:0:CAGATC
TCGCATCATCCTACACAGCACGGACGCTAGATGACAGGACGTGCCATGACAGTCTAAGTTCAGGGAGAAAATGGCT
+
/AAAAEE//EEEEEEEEEEAEAE/EEEEEE/EA/EEA/AEEEEEEE/E/EEA/EEEEE/EEEEEEEEAEEEEA/EE
@NB501293:231:HV3CTBGX2:1:11101:16754:1048 2:N:0:CAGATC
TGACCAGCCATTGGCTGGTGGGAGTAGTGATGTCACCCATATGACACCCTGATAACGAGTTGAGAGAGAGCCCTCA
+
AA/AAEEEEEAEEEEEEEEEAEEEEEEEEAEEEEEA/EEEEE</</EAEEEEE<EEEAEEEEEEEEAA/EE<6/EE




Desired Output


Code:
cat read1
@GCTGGG_NB501293:231:HV3CTBGX2:1:11101:2280:1047 1:N:0:CAGATC
CCGCANGTTGCAGAGCGGGTGGGAGCCNCTNCGGGCGCGGCACTGNAGCCCTGANACTGAACCCCGAACCCGAGCC
+
AAAAA#EEEEEEE/EEEEEEEEE6E//#<E#/EEE/E/EAEAEEE#/EA/AEEE#/EAAAEE/EEEAEEE/EE//6
@TGACTG_NB501293:231:HV3CTBGX2:1:11101:9866:1047 1:N:0:CAGATC
CTCAANGGGAGAGACCTTAGATGATACNCANGATGACAGTAGGTANAGGGAACTTATAGAGCCACCTCCATCAGGA
+
AAAAA#EEEEEEEEEAEEEEEEEEEEE#EE#EEEEEEEEEEEEEE#EEEEEEEEEEEEEEEEEAEEEEEEEEEEEE
@TCGCAT_NB501293:231:HV3CTBGX2:1:11101:24301:1048 1:N:0:CAGATC
ACGGANCTCTGGCTGTTGTATGGAAAGNTANGCTGTAACACGCACNGACAGAAGAGAGCCATTTTCTCCCTGAACT
+
AA/AA#EEEEEAEAEAEEEEE/EEEEE#E<#6EEAEEAAAE///E#EEAEE//A/</EE</EE//E6E6EEEEEE6
@TGACCA_NB501293:231:HV3CTBGX2:1:11101:16754:1048 1:N:0:CAGATC
CCTGGNAGCCGCCGCAAGCGCCGGACCNCANGCACTCCCAGGCGCGCGCGCTTCTTCTGCAAAAAGTTGAGGGCTC
+
AAAAA#EEAEEEEEE6EEEEEEEEAE/#EE#EE/EE//E/EEAEEEEEEEEAEEEEEEEE/EEEEEEEEEEEEE/E


cat read2
@GCTGGG_NB501293:231:HV3CTBGX2:1:11101:2280:1047 2:N:0:CAGATC
GCTGGGCGAGTAGCTTCTGGATCCTGGCCTCCTGAGCCTGTGGCCCGGGCTAGGCTCGGGGCTCGGGTTCGGGGTT
+
AAAAAEEEEEEE/AEEAEEEEEEEE/AEEEAAAE/EEEEEAEAEEEE6EEEAEEEEEEEEEEE/EAEAAE6EEE<A
@TGACTG_NB501293:231:HV3CTBGX2:1:11101:9866:1047 2:N:0:CAGATC
TGACTGTGGGGTGGCAACCCCATTCCTCACTTGATGTCCTGTCTTCCTGATGGAGGTGGCTCTATAAGTTCCCTCT
+
AAAAAEEEEEEEEEE/AEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE
@TCGCAT_NB501293:231:HV3CTBGX2:1:11101:24301:1048 2:N:0:CAGATC
TCGCATCATCCTACACAGCACGGACGCTAGATGACAGGACGTGCCATGACAGTCTAAGTTCAGGGAGAAAATGGCT
+
/AAAAEE//EEEEEEEEEEAEAE/EEEEEE/EA/EEA/AEEEEEEE/E/EEA/EEEEE/EEEEEEEEAEEEEA/EE
@TGACCA_NB501293:231:HV3CTBGX2:1:11101:16754:1048 2:N:0:CAGATC
TGACCAGCCATTGGCTGGTGGGAGTAGTGATGTCACCCATATGACACCCTGATAACGAGTTGAGAGAGAGCCCTCA
+
AA/AAEEEEEAEEEEEEEEEAEEEEEEEEAEEEEEA/EEEEE</</EAEEEEE<EEEAEEEEEEEEAA/EE<6/EE
Do you know if the BB suite can achieve this? I tried fastx toolkit but it prints the full sequence instead of a partial sequence.
gokhulkrishnakilaru is offline   Reply With Quote
Reply

Tags
bbmap

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:12 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO