SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bowtie, an ultrafast, memory-efficient, open source short read aligner Ben Langmead Bioinformatics 513 05-14-2015 03:29 PM
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 10:37 AM
Miso's open source joyce kang Bioinformatics 1 01-25-2012 07:25 AM
Targeted resequencing - open source stanford_genome_tech Genomic Resequencing 3 09-27-2011 04:27 PM
EKOPath 4 going open source dnusol Bioinformatics 0 06-15-2011 02:10 AM

Reply
 
Thread Tools
Old 02-03-2015, 09:39 AM   #181
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I just verified that "outb" and "blacklist" still work, but their usage may be a bit unclear. I don't use them anymore as they have been superseded by BBSplit which allows any number of reference-specific output streams. I think in your case you may not have built the BAC fasta into the index.

How to use the blacklist to remove e.coli contaminants from human reads:

Concatenate the references:
cat ecoli.fa human.fa > both.fa

index the concatenated file:
bbmap.sh ref=both.fa

map to the combined index and specify a blacklist:
bbmap.sh in=reads.fq outm=good.fq outb=bad.fq blacklist=ecoli.fa

The "blacklist" flag does not index anything; it only parses the headers in the fasta file, whose contents must have already been indexed. If a read can map to two places equally well, by default, BBMap will assign it to the first genomic location (unless you alter the "ambig" flag). That is why the blacklisted files must be first when concatenating files together.

The way you would do this with BBSplit is simpler:

bbsplit.sh ref=ecoli.fa,human.fa
bbsplit.sh in=reads.fq pattern=out_%.fq


That would create two output files, out_ecoli.fq and out_human.fq (or .sam if you want). You can determine the behavior of reads that map to both references equally well in bbsplit with the "ambiguous2" flag; default behavior is to go to the output stream for the first reference. The other options are "ambig2=toss" (discard anything that maps to multiple references equally well), "ambig2=all" (output it to the stream for each reference it maps to equally well), and "ambig2=split" (write the multi-mapping reads to separate files).
Brian Bushnell is offline   Reply With Quote
Old 02-03-2015, 11:44 AM   #182
Thias
Member
 
Location: Münster, Germany

Join Date: Mar 2013
Posts: 42
Default

Quote:
Originally Posted by Brian Bushnell View Post
I think in your case you may not have built the BAC fasta into the index.
*Sigh*. You know your cardboard homies ("Pappenheimer" ) too well ...

I was indeed not rebuilding the index of the reference genome for that purpose, but just added the blacklist parameter.

Thanks a lot for the fast response, the detailed description of the procedure and the inherent logic.

Quote:
Originally Posted by Brian Bushnell View Post
The way you would do this with BBSplit is simpler:

bbsplit.sh ref=ecoli.fa,human.fa
bbsplit.sh in=reads.fq pattern=out_%.fq
Great! This possibility will also aid to solve another question in a different context. Thanks for pointing this out ...
Thias is offline   Reply With Quote
Old 02-23-2015, 08:21 PM   #183
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by Thias View Post
*Sigh*. You know your cardboard homies ("Pappenheimer" ) too well ...
That was an interesting reference; I enjoyed reading it! In fact, it spurred me to go see if Mount&Blade II was near completion, but alas, it still seems to be in permanent development.

Since I develop BBMap at a facility where people use it in unexpected ways, I'm used to getting error reports caused by both bugs and misunderstandings. Both are valuable! I don't like difficult-to-use tools, but I can only make BBMap more user-friendly with feedback, so please don't feel shy to give feedback whether or not there's a bug.

In other news - BBMap and Pileup now have an "rpkm" flag. This will directly report RPKM and FPKM counts (as well as coverage depth and raw count) for each reference sequence, in the same format as Seal. Essentially, it's for RNA-seq quantification, and I primarily made it to make it easier to compare BBMap with Seal (since the RPKM/FPKM output only makes sense with a transcriptome reference, not a genome reference) - but it's there if anyone wants it.
Brian Bushnell is offline   Reply With Quote
Old 02-26-2015, 08:14 AM   #184
rkanwar
Junior Member
 
Location: minnesota

Join Date: Feb 2011
Posts: 8
Default Strange behavior for some of the aligned PE reads

Hello Brian,

I am using BBMap to align DNASeq PE data [100x2 bp reads]. The alignment works fine. But I am seeing strange alignments for a few pairs. For examples consider the SAM records for the following PE read:

Code:
R0266248:292:C5C0TACXX:4:1101:6482:13924 1:N:0:CTGAAGCTTAATCTTA	153	1	16877330	44	101M	=	16877330	0	GAAAGCCCCAGTGCCATCAGACAGCCACACCTCATCCTCATCAGTGACACTATGAGGTGAAGACCCCTCCAGGGTGTCAAGAGCTCTCAGCTTCCAGGGTC	4DDDBDDDDDDDDDDEDDDDBDDDDCAACFFFFHHHHHHIGIJIFJIHGIIIJJJJJJJJIHIIIJJIJJJJJJJIIJJJIJJJJJJJHHHHHFFFFFCCC	RG:Z:AS_N+AS_N_1+FCC5C0TACXX+L4+ICTGAAGCT-TAATCTTA	AM:i:0	NM:i:0
Code:
R0266248:292:C5C0TACXX:4:1101:6482:13924 1:N:0:CTGAAGCTTAATCTTA	101	1	16877330	0	*	=	16877330	0	TCCTCAGCCAGCTTCCCTTCTAGTCAATTTTTTGGGAGCTGGGCCTCCAGCTGGGATACTCTCTGGATGAGACTCTTCAGGTCCTTTTTGGCCTGAAGTCC	CCCFFFFFHHHHHJJJJJJJJJIJJJGJJJJJJIJJJJJJJJJJJJJJJJJJJJJGHJJHHHHHHFFFFFEEEEECDDDDDCDDDDDDC?@CCDDDDDDDD	RG:Z:AS_N+AS_N_1+FCC5C0TACXX+L4+ICTGAAGCT-TAATCTTA
The first end aligns properly. The other end does not align and has an alignment score of 0, which I think means that there were multiple alignments for this end. When I BLAT these two sequences then this is what I get for the reads:

Code:
END1
browser details YourSeq          101     1   101   101 100.0%     1   +   16877330  16877430    101
browser details YourSeq           99     1   101   101  99.1%     1   -   17183501  17183601    101
browser details YourSeq           97     1   101   101  98.1%     1   +   17008345  17008445    101
browser details YourSeq           95     1   101   101  97.1%     1   +   21738088  21738188    101
browser details YourSeq           91     1   101   101  95.1%     1   +  144879169 144879269    101
Code:
END2
browser details YourSeq          101     1   101   101 100.0%     1   -   17183611  17183711    101
browser details YourSeq          101     1   101   101 100.0%     1   +   16877220  16877320    101
browser details YourSeq           93     1   101   101  96.1%     1   +  144879059 144879159    101
browser details YourSeq           92     1   101   101  97.0%     1   +   21737977  21738078    102
browser details YourSeq           74    26   101   101  98.7%     1   +   17008260  17008335     76
browser details YourSeq           25     9    34   101 100.0%    10   +   69768146  69768175     30
BBMap aligns the first end properly with a start position of 16877330. The other end has 2 perfect hits, one of those hits [16877220] is in the neighborhood of of the first end. This neighboring hit should have been reported as the alignment for this end as it best explained the whole fragment [my insert sizes are around 250bp]. Here is the command I am using. I want to get only those alignment where the PE reads aligns uniquely on the reference genome. Can you please help me to set the right flags to get these reads align properly ? Thanks!

Code:
bbmap_cmd="${bbmap} -Xms10G -Xmx31G \
           path=${bbmap_ref} \
           in=${bbduk_fq1} \
           in2=${bbduk_fq2} \
           out=stdout \
           ambiguous=toss \
           rgid=${ID} \
           rgsm=${SM} \
           rglb=${LIB} \
           rgpl=${PL} \
           pairlen=1000 \
           threads=12 \
           slow=t \
           local=t \
           reads=10000 \
           samversion=1.3"
regards,
Rahul
rkanwar is offline   Reply With Quote
Old 02-26-2015, 10:13 AM   #185
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hi Rahul,

In this case, the problem is "ambiguous=toss". If you omit that flag, the second read will be mapped. You can always filter ambiguous mappings later by excluding anything with a mapq of under 4.

However, this situation is a bit strange. your BLAT alignments show both reads hitting the same strand. Did you reverse-complement it prior to BLAT, or something? By default, BBMap requires read 1 and read 2 to be on opposite strands, so if they really are both mapping to the same strand, they won't be considered a proper pair. You can override that with "rcs=f" (requirecorrectstrand=false), but in this case, it's not clear to me that they should be paired; rather, it seems like that might be a chimeric molecule or some kind of structural rearrangement in the genome. Or a misassembly.
Brian Bushnell is offline   Reply With Quote
Old 02-26-2015, 10:34 AM   #186
rkanwar
Junior Member
 
Location: minnesota

Join Date: Feb 2011
Posts: 8
Default

Hello Brian,

Thank you for your reply! I was using the read sequence from the SAM record for BLAT, which might explain the same strand.
Is there some easy way to filter out properly mapped PE reads [which have the right orientation and are within the fragment length distribution] which are uniquely aligned ?

Rahul
rkanwar is offline   Reply With Quote
Old 02-26-2015, 11:34 AM   #187
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

You can filter proper pairs only with the "po" flag; unpaired reads will then be classified as unmapped. You can use the 'outm' stream to just get mapped reads. You can also filter with samtools after the fact.

From these, you could filter only the pairs where at least one read is uniquely aligned (mapq>3 or some higher threshold for greater confidence), but this can't be done in BBMap currently; you could, however, do this with a custom script or with samtools.
Brian Bushnell is offline   Reply With Quote
Old 02-26-2015, 02:29 PM   #188
rkanwar
Junior Member
 
Location: minnesota

Join Date: Feb 2011
Posts: 8
Default

Hello Brian,

I removed the flag "ambiguous=toss" as you had suggested. I was able to get the correct alignments for the previous PE read and many other. But there are still a few PE reads which are not aligning properly, even though I can find the correct alignment for them using BLAT. I used the following command for aligning the data:

Code:
bbmap_cmd="${bbmap} -Xms10G -Xmx31G \
           path=${bbmap_ref} \
           in=${bbduk_fq1} \
           in2=${bbduk_fq2} \
           out=stdout \
           rgid=${ID} \
           rgsm=${SM} \
           rglb=${LIB} \
           rgpl=${PL} \
           pairlen=1000 \
           threads=12 \
           slow=t \
           local=t \
           maxsites2=2000 \
           reads=10000 \
           samversion=1.3"
One example of PE read that still behaves odd is the following. I can BLAT both the sequences and can see that there is a unique alignment which explains this read fragment. The % of such reads is very small [0.06%]. I am using grhc37 fasta file that comes with the GATK bundle as the reference. I am using v34.48 of BBMap.

Code:
R0266248:292:C5C0TACXX:4:1101:11723:6183 1:N:0:CTGAAGCTTAATCTTA	101	1	40799223	0	*	=	40799223	0	GAATTATAATCCCCATAATCTCCACCTGTCAAGGGTGGAACCAGATGTAGGTTATTTCCCTTGATGTCTACAATATCACCTTTCTTATAGATTCGCATATA	<<==<=<=>>==>=<===>>>===>=>=<=;>=>=<=>==>======<======>>=====>>=<==>>==<>====<====<>===<====<========	MC:Z:101M	BD:Z:MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMLMMMMMMMMMMMMMMMMMMMMMMMMMLMMMMMMMMMMMMMMMMMM	PG:Z:MarkDuplicates	RG:Z:AS_N+AS_N_1+FCC5C0TACXX+L4+ICTGAAGCT-TAATCTTA	BI:Z:OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
R0266248:292:C5C0TACXX:4:1101:11723:6183 1:N:0:CTGAAGCTTAATCTTA	153	1	40799223	44	101M	=	40799223	0	TTTGTCATTTTGGCGAATTACTGGAAGGTGGCGTTTCTGGCCGTAAGGCTGATATTCTTCATTTTTAAACCCAATATAACCATCCTTTCCTGTAGACTCTT	========<<3===3===<333===<=====<====><==<========<======>>=>=>==><<=<=>>>==<=>=>=========><====<>====	BD:Z:LMMMMMMLLMMMMMMMMMMMMMMMMMMMMMMMMLMMMMMMMMMMMMMMMMMMMMMMMMMMMLLLMMLMMMMMMMMMMMMMMMMMMLMMMMMMMMMMMMMMM	PG:Z:MarkDuplicates	RG:Z:AS_N+AS_N_1+FCC5C0TACXX+L4+ICTGAAGCT-TAATCTTA	BI:Z:OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO	AM:i:0	NM:i:0
Rahul
rkanwar is offline   Reply With Quote
Old 02-26-2015, 03:07 PM   #189
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

BBMap is a global aligner. If you use the local flag, it will trim global alignments to become local alignments, but it will not find terrible global alignments in the first place. When run at high sensitivity, this is the output:

Code:
@PG	ID:BBMap	PN:BBMap	VN:34.59	CL:java -Djava.library.path=/usr/common/jgi/utilities/bbtools/prod-v34.59/lib -ea -Xmx31g align2.BBMapPacBio build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambiguous=best minscaf=100 startpad=10000 stoppad=10000 midpad=6000 -Xmx31g t=1 k=11 in=pair3.fq out=pb.sam path=/global/projectb/sandbox/gaag/bbtools/hg191 reads=4 int=t ignorequality minratio=0.1 maxindel=40 vslow ambig=all customtag idtag ow reads=1
1 	99	chr1	40799086	4	3=3X2=3I4=40I46=	=	40799223	238	GAATTATAATCCCCATAATCTCCACCTGTCAAGGGTGGAACCAGATGTAGGTTATTTCCCTTGATGTCTACAATATCACCTTTCTTATAGATTCGCATATA	<<==<=<=>>==>=<===>>>===>=>=<=;>=>=<=>==>======<======>>=====>>=<==>>==<>====<====<>===<====<========	NM:i:46	AM:i:4	NH:i:2	YI:f:54.46	X1:Z:$1,0,40807085,40807142,0,00,32,3075,4058,6517,6517,,mmmSSSmmIIImmmmIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIImmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm$6,1,154470997,154471095,0,00,33,3754,4697,0,4697,,mSSSmmIIIImmImmSmmmIIIIIIIIIIIIIIImmmSmmSmmmmDDDDDDDDDDDDDDDDDDmmmmmmmmSmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm	X2:Z:m3S3mmI3m4I40m46	X3:i:4058	X4:i:101	X5:Z:0	X6:i:2054
1 	369	chr5	154462998	5	1=3X2=4I2=1I2=1X3=15I3=1X2=1X4=18D8=1X47=	chr1	40799223	0	*	*	NM:i:45	AM:i:5	NH:i:2	YI:f:62.18	X1:Z:$1,0,40807085,40807142,0,00,32,3075,4058,6517,6517,,mmmSSSmmIIImmmmIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIImmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm$6,1,154470997,154471095,0,00,33,3754,4697,0,4697,,mSSSmmIIIImmImmSmmmIIIIIIIIIIIIIIImmmSmmSmmmmDDDDDDDDDDDDDDDDDDmmmmmmmmSmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm	X2:Z:mS3mmI4mmImmSm3I15m3SmmSm4D18m8Sm47	X3:i:4697	X4:i:101	X5:Z:0	X6:i:34823
1 	147	chr1	40799223	44	101=	=	40799086	-238	TTTGTCATTTTGGCGAATTACTGGAAGGTGGCGTTTCTGGCCGTAAGGCTGATATTCTTCATTTTTAAACCCAATATAACCATCCTTTCCTGTAGACTCTT	========<<3===3===<333===<=====<====><==<========<======>>=>=>==><<=<=>>>==<=>=>=========><====<>====	NM:i:0	AM:i:40	NH:i:1	YI:f:100	X1:Z:$1,1,40807222,40807322,0,11,81,10090,10090,11079,11079,,mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm	X2:Z:m101	X3:i:10090	X4:i:101	X5:Z:0	X6:i:6159
The two alignments for the first read are at 62% and 54% identity, both of which are below what BBMap looks for at default settings.
Brian Bushnell is offline   Reply With Quote
Old 02-26-2015, 03:22 PM   #190
rkanwar
Junior Member
 
Location: minnesota

Join Date: Feb 2011
Posts: 8
Default

Thanks, that helps a lot !
rkanwar is offline   Reply With Quote
Old 04-16-2015, 08:04 AM   #191
holmrenser
Junior Member
 
Location: Netherlands

Join Date: Jul 2013
Posts: 9
Default

Hi Brian,

I am trying to extract reads from a bbmap sam-file using picard SamToFastq and get an error about unpaired reads. Whether I tell BBmap to output paired only or not, picard SamToFastq claims there are reads flagged as paired that are missing a pair.

BBmap command:
bbmap.sh ref=chloroplast.reference.fasta in=filter.temp.filtered.R1.fastq in2=filter.temp.filtered.R2.fastq threads=10 outm=folder/prefix.sam covstats=folder/prefix.covstats.txt basecov=folder/prefix.basecov.txt -Xmx10G fast=t keepnames=t machineout=t pairedonly=t

picard SamToFastq command:
picard.jar SamToFastq INPUT=folder/prefix.sam F=folder/prefix.R1.fastq F2=folder/prefix.R2.fastq FU=folder/prefix.unpaired.fastq

Error:
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Found 29230 unpaired mates

When I run it without the pairedonly=t option I get the following error:
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Found 61616 unpaired mates

Any suggestion?

Thanks in advance
holmrenser is offline   Reply With Quote
Old 04-16-2015, 09:14 AM   #192
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

That's odd. Your command looks fine. Unfortunately I will be gone until Monday and won't be able to look into this until then, but would you be able to email me the fastqs and reference? Or, ideally, a small subset of the reads that triggers the same problem.

Thanks,
Brian
Brian Bushnell is offline   Reply With Quote
Old 04-19-2015, 04:59 PM   #193
susanklein
Senior Member
 
Location: oceania

Join Date: Feb 2014
Posts: 115
Default

Hi Brian,

is there any way to use bbmap to map very long reads from e.g. nanopore (minION).

Also I read on another thread that dedup.sh can be used to merge assemblies. Can it produce a consensus sequence?

Thanks very much.

S
susanklein is offline   Reply With Quote
Old 04-19-2015, 05:31 PM   #194
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,989
Default

BBMap can map reads of up to 6kb: http://seqanswers.com/forums/showpos...43&postcount=2
GenoMax is offline   Reply With Quote
Old 04-20-2015, 02:10 PM   #195
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Dedupe will merge assemblies, but it will not produce consensus sequences or join overlapping reads; it only removes sequences that are fully contained within other sequences (allowing the specified number of mismatches or edits).

BBMap works pretty well for mapping Nanopore reads, but as GenoMax stated, it has a length cap of 6kbp. Reads longer than this will be broken into 6kbp pieces and mapped independently. The command I have been using for mapping Nanopore reads:

mapPacBio.sh -Xmx20g k=7 in=reads.fastq ref=reference.fa maxlen=1000 minlen=200 idtag ow int=f qin=33 out=mapped1.sam minratio=0.15 ignorequality slow ordered maxindel1=40 maxindel2=400

The "maxlen" flag shreds them to a max length of 1000; you can set that up to 6000. But I found 1000 gave a higher mapping rate.
Brian Bushnell is offline   Reply With Quote
Old 05-05-2015, 05:11 AM   #196
hubery_Bio
Junior Member
 
Location: china

Join Date: Oct 2013
Posts: 5
Default

Hi, Brian,
I used BBmap to evaluate the Insert Size of a PE dataset download from SRA database, and found that the peak of the histogram is 89 bp. Actually, the PE reads were 96-96 bp. It seems weird, since I saw it elsewhere that you defined the Insert Size as "PE reads + unsequenced part". In addition, the minimum Size was begin from 1 bp!
Although the dataset has some problems in the quality, it is amazing to got this result.
Could you help me? Sorry for my poor English...
hubery_Bio is offline   Reply With Quote
Old 05-05-2015, 10:58 AM   #197
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

It sounds like you don't have any "unsequenced part" in this library. If the insert size is shorter than read length, that means that the the library was fragmented into pieces that were too small, and the sequencer reads off the end of the genomic sequence into the adapter sequence. Before using those reads, you should do adapter trimming with (for example) BBDuk.

As for the minimum size being 1bp, that is probably a read pair being mis-mapped. You can get a second opinion on the insert size distribution with BBMerge, which calculates it by looking for overlaps instead of by mapping, but I expect the result to be the same in this case. BBMap is more accurate for insert size calculations with long inserts (which don't overlap), while BBMerge is more accurate for insert sizes shorter than read length because the adapter sequence interferes with mapping.
Brian Bushnell is offline   Reply With Quote
Old 05-10-2015, 04:19 PM   #198
blsfoxfox
Member
 
Location: auburn

Join Date: Jan 2013
Posts: 12
Default Trimming of Nextera mate pair reads

Hi Brian,

I am using BBmap to trim the Nextera mate pair reads, could you please provide a workflow with parameters to show us how to do this with splitnextera.sh and bbduk.sh?

Thanks in advance.
blsfoxfox is offline   Reply With Quote
Old 05-10-2015, 05:38 PM   #199
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Sure. For simplicity, I will assume the reads are interleaved in a single file initially (you can transform between interleaved and dual-file reads with reformat.sh if you want). First thing is to trim the adapters with BBDuk:

Code:
bbduk.sh in=reads.fq out=trimmed.fq ktrim=r k=23 mink=12 hdist=1 tbo tpe ref=nextera_LMP_adapter.fa.gz
nextera_LMP_adapter.fa.gz is in /bbmap/resources/.

Next, if you want to remove contaminants/spikeins such as PhiX, you can do so now with a filtering pass of BBDuk, but that's optional.

Finally, split the reads into LMP pairs, short pairs, and singletons:

Code:
splitnextera.sh in=trimmed.fq out=longmatepairs.fq outf=fragmentpairs.fq outu=unknownpairs.fq outs=singletons.fq mask=t
At this point you could do quality-trimming if you want to on the individual sub-libraries; it's important not to do so sooner as it interferes with the ability to detect junctions. If all you care about is the LMP data you can just keep longmatepairs.fq and throw away the others, but theoretically, you might be able to get enough fragment pairs and singletons to assemble the contigs from those, then use the LMP for scaffolding, and thus get a complete single-library assembly. That's what Illumina is encouraging, at any rate.

The majority of the reads in the "unknown" bin were LMP, not short-fragment reads, in the one library I tested (and also according to Illumina's data), but I would not really consider it safe to use for scaffolding OR contig-building, and would probably throw it away unless you don't have enough data.
Brian Bushnell is offline   Reply With Quote
Old 05-11-2015, 09:37 AM   #200
blsfoxfox
Member
 
Location: auburn

Join Date: Jan 2013
Posts: 12
Default

Quote:
Originally Posted by Brian Bushnell View Post
Sure. For simplicity, I will assume the reads are interleaved in a single file initially (you can transform between interleaved and dual-file reads with reformat.sh if you want). First thing is to trim the adapters with BBDuk:

Code:
bbduk.sh in=reads.fq out=trimmed.fq ktrim=r k=23 mink=12 hdist=1 tbo tpe ref=nextera_LMP_adapter.fa.gz
nextera_LMP_adapter.fa.gz is in /bbmap/resources/.

Next, if you want to remove contaminants/spikeins such as PhiX, you can do so now with a filtering pass of BBDuk, but that's optional.

Finally, split the reads into LMP pairs, short pairs, and singletons:

Code:
splitnextera.sh in=trimmed.fq out=longmatepairs.fq outf=fragmentpairs.fq outu=unknownpairs.fq outs=singletons.fq mask=t
At this point you could do quality-trimming if you want to on the individual sub-libraries; it's important not to do so sooner as it interferes with the ability to detect junctions. If all you care about is the LMP data you can just keep longmatepairs.fq and throw away the others, but theoretically, you might be able to get enough fragment pairs and singletons to assemble the contigs from those, then use the LMP for scaffolding, and thus get a complete single-library assembly. That's what Illumina is encouraging, at any rate.

The majority of the reads in the "unknown" bin were LMP, not short-fragment reads, in the one library I tested (and also according to Illumina's data), but I would not really consider it safe to use for scaffolding OR contig-building, and would probably throw it away unless you don't have enough data.
Hi Brian,

Thank you for your patient instruction!
I have three more question about the split step:
1)After the split, what are the orientation of longmatepairs.fq and fragmentpairs.fq? I guess RF and FR respectively?
2)What will be the insert size of those fragmentpairs.fq, 300bp?Does the original library size matter?
3)I have tried another software called "Nxtrim" with similar function to seperate those Nextera Mate pairs into different groups, which gives me about 40% fragment pairs, do you think the number is reasonable? If you have used it, could you please tell me the difference between splitnextera.sh and Nxtrim for doing the same job?

Thanks for your time in advance, I know that is too much to answer.
blsfoxfox is offline   Reply With Quote
Reply

Tags
bbmap, metagenomics, rna-seq aligners, short read alignment

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:58 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO