Seqanswers Leaderboard Ad

**GenoMax** · 10-30-2015, 10:50 AM

How many reads were removed by bbduk contamination filtering?

Yes for the other question.

**cmccabe** · 10-30-2015, 10:57 AM

Thank you

.

The new index is in a ref folder was created in the bbmap directory. Though the info doesn't look right

Code:

 bbmap.sh ref=/home/cmccabe/Desktop/bed/hg19.fa

Code:

#Chromosome sizes
#Generated on	Fri Oct 30 13:47:30 CDT 2015
#Version	5
#chrom	scaffolds	contigs	length	defined	undefined	startPad	stopPad
1	3	58	492475166	463501710	28973456	8000	8000
2	2	15	389185007	382458811	6726196	8000	8000
3	3	31	511177591	500443989	10733602	8000	8000
4	3	63	423120801	394347091	28773710	8000	8000
5	4	24	491386730	445490327	45896403	8000	8000
6	7	54	522452802	463453103	58999699	8000	8000
7	71	213	307452973	247615431	59837542	8000	8000

Code:

cmccabe@DTV-A5211QLM:~/Desktop/NGS$ bbmap.sh ref=/home/cmccabe/Desktop/bed/hg19.fa
java -Djava.library.path=/home/cmccabe/Desktop/NGS/bbmap-35.59/jni/ -ea -Xmx44294m -cp /home/cmccabe/Desktop/NGS/bbmap-35.59/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 ref=/home/cmccabe/Desktop/bed/hg19.fa
Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, ref=/home/cmccabe/Desktop/bed/hg19.fa]

BBMap version 35.59
Retaining first best site only for ambiguous mappings.
No output file.
Writing reference.
Executing dna.FastaToChromArrays2 [/home/cmccabe/Desktop/bed/hg19.fa, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]

Set genScaffoldInfo=true
Writing chunk 1
Writing chunk 2
Writing chunk 3
Writing chunk 4
Writing chunk 5
Writing chunk 6
Writing chunk 7
Set genome to 1

Loaded Reference:	0.024 seconds.
Loading index for chunk 1-7, build 1
No index available; generating from reference genome: /home/cmccabe/Desktop/NGS/ref/index/1/chr1-3_index_k13_c2_b1.block
No index available; generating from reference genome: /home/cmccabe/Desktop/NGS/ref/index/1/chr4-7_index_k13_c2_b1.block
Indexing threads started for block 0-3
Indexing threads started for block 4-7
Indexing threads finished for block 0-3
Indexing threads finished for block 4-7
Generated Index:	171.787 seconds.
Finished Writing:	31.349 seconds.
No reads to process; quitting.

Contamination filtering results:

Code:

#File	/home/cmccabe/IDP/adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R1.fastq	/home/cmccabe/IDP/adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R2.fastq
#Total	35677024
#Matched	59	0.00017%
#Name	Reads	ReadsPct
PhiX_read2_adapter	59	0.00017%

**Brian Bushnell** · 10-30-2015, 11:40 AM

That's odd - if you do adapter-trimming first, there should be no reads in the contamination-filtering phase that match "PhiX_read2_adapter", since it would have already been trimmed. Can you give the command lines you used for each of those phases?

**cmccabe** · 10-30-2015, 12:11 PM

Here are the commands bbbmap-35.59

Adapter Removal:

Code:

bbduk.sh -Xmx1g in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo

Quality Trimming:

Code:

bbduk.sh -Xmx1g in=read1.fq in=read2.fq out1=clean1.fq out2=clean2.fq qtrim=rl trimq=10

Thank you

.

**Brian Bushnell** · 10-30-2015, 12:12 PM

That looks fine. And did you do a contamination removal step after that?

**cmccabe** · 10-30-2015, 12:16 PM

Yes I did:

Code:

 
bbduk.sh -Xmx1g in1=r1.fq in2=r2.fq out1=unmatched1.fq out2=unmatched2.fq outm1=matched1.fq outm2=matched2.fq ref=phix_adapters.fa k=31 hdist=1 stats=stats.txt

Thank you

.

**Brian Bushnell** · 10-30-2015, 12:23 PM

Ah, I see. You should do contaminant filtering against the entire phiX reference, rather than just the adapters:

Code:

bbduk.sh -Xmx1g in1=r1.fq in2=r2.fq out1=unmatched1.fq out2=unmatched2.fq outm1=matched1.fq outm2=matched2.fq ref=[B]phix174_ill.ref.fa.gz[/B] k=31 hdist=1 stats=stats.txt

I still don't understand how there are PhiX adapters in the data after trimming, though. They should have all been trimmed. Your trimming command was correct, and adapters.fa already contains the PhiX adapters (last two entries in the file). Can you post the console output from adapter-trimming?

Also, don't worry about the "#Chromosome sizes" in BBMap's index; a long time ago that made sense, but now multiple chromosomes are packed together in blocks but I never changed the terminology.

**cmccabe** · 10-30-2015, 12:27 PM

Thank you.

Code:

 bbmap.sh in1=reads1.fq in2=reads2.fq out=mapped.sam ref=ecoli.fa nodisk

So for the ref= is that the path to the ref folder that was created? Thank you

.

**Brian Bushnell** · 10-30-2015, 12:37 PM

If you type "bbmap.sh ref=x.fa", it will create the "ref" index in the current directory.

If you then run "bbmap.sh in=reads.fq", it will try to do mapping by looking for "ref" in the current directory. If you want it to look elsewhere, you can set path, for example "path=/foo/" would look for "/foo/ref".

The "nodisk" flag does not need "path", as the index does not get written anywhere. So "bbmap.sh in=reads.fq ref=ecoli.fa nodisk" does not need a path flag - it will build the index in memory and then do mapping. The only reason to build an index on disk is because it is faster to load a built index than to build one from the reference fasta each time.

**cmccabe** · 10-30-2015, 01:28 PM

I will give bbmap a try tomorroe.

Here are the results using the phiX reference:

Code:

BBDuk version 35.59
Initial:
Memory: max=1029m, free=1005m, used=24m

Added 487396 kmers; time: 	0.258 seconds.
Memory: max=1029m, free=956m, used=73m

Input is being processed as paired
Started output streams:	0.023 seconds.
Processing time:   		73.602 seconds.

Input:                  	35677024 reads 		4964594710 bases.
Contaminants:           	32 reads (0.00%) 	4814 bases (0.00%)
Result:                 	35676992 reads (100.00%) 	4964589896 bases (100.00%)

Time:   			73.894 seconds.
Reads Processed:      35677k 	482.81k reads/sec
Bases Processed:       4964m 	67.19m bases/sec

stats.txt

Code:

#File	/home/cmccabe/IDP/adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R1.fastq	/home/cmccabe/IDP/adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R2.fastq
#Total	35677024
#Matched	18	0.00005%
#Name	Reads	ReadsPct
gi|9626372|ref|NC_001422.1| Coliphage phiX174, complete genome	18	0.00005%

Thank you

.

**GenoMax** · 10-30-2015, 01:35 PM

32 reads that are similar to phiX must be by chance. Start mapping against the real genome.

**cmccabe** · 10-31-2015, 06:59 AM

Not being familiar with Illumina dat I have no idea if this is a good mapping, I will convert to bam and the use FASTQC as well. Thank you

.

I created an index reference using the command below:

Code:

 ndex: bbmap.sh ref=/home/cmccabe/Desktop/bed/hg19.fa

and that created a ref folder in the bbmap directory. I gave the path to the .fa as that was needed. Is this correct? Thank you

.

Code:

[B]cmccabe@DTV-A5211QLM:~/Desktop/NGS$ bbmap.sh in1=/home/cmccabe/IDP/unmatched_adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R1.fastq in2=/home/cmccabe/IDP/unmatched_adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R2.fastq out=/home/cmccabe/IDP/mapped_unmatched_adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R1_and_R2.sam ref=/home/cmccabe/Desktop/bed/hg19.fa nodisk[/B]
java -Djava.library.path=/home/cmccabe/Desktop/NGS/bbmap-35.59/jni/ -ea -Xmx44276m -cp /home/cmccabe/Desktop/NGS/bbmap-35.59/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 in1=/home/cmccabe/IDP/unmatched_adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R1.fastq in2=/home/cmccabe/IDP/unmatched_adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R2.fastq out=/home/cmccabe/IDP/mapped_unmatched_adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R1_and_R2.sam ref=/home/cmccabe/Desktop/bed/hg19.fa nodisk
Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, in1=/home/cmccabe/IDP/unmatched_adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R1.fastq, in2=/home/cmccabe/IDP/unmatched_adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R2.fastq, out=/home/cmccabe/IDP/mapped_unmatched_adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R1_and_R2.sam, ref=/home/cmccabe/Desktop/bed/hg19.fa, nodisk]

BBMap version 35.59
Retaining first best site only for ambiguous mappings.
Executing dna.FastaToChromArrays2 [/home/cmccabe/Desktop/bed/hg19.fa, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=false, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=true]

Set genScaffoldInfo=true
Set genome to 1

Loaded Reference:	0.005 seconds.
Loading index for chunk 1-7, build 1
Indexing threads started for block 0-3
Indexing threads started for block 4-7
Indexing threads finished for block 0-3
Indexing threads finished for block 4-7
Generated Index:	170.887 seconds.
Analyzed Index:   	6.891 seconds.
Started output stream:	0.160 seconds.
Cleared Memory:    	2.506 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 16 mapping threads.
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

   ------------------   Results   ------------------   

Genome:                	1
Key Length:            	13
Max Indel:             	16000
Minimum Score Ratio:  	0.56
Mapping Mode:         	normal
Reads Used:           	35676992	(4964589896 bases)

Mapping:          	6441.547 seconds.
Reads/sec:       	5538.57
kBases/sec:      	770.71


Pairing data:   	pct reads	num reads 	pct bases	   num bases

mated pairs:     	 98.2186% 	 17520728 	 98.1389% 	  4872191776
bad pairs:       	  0.5597% 	    99849 	  0.5917% 	    29375968
insert size avg: 	  196.85


Read 1 data:      	pct reads	num reads 	pct bases	   num bases

mapped:          	 99.4100% 	 17733242 	 99.3670% 	  2466930784
unambiguous:     	 94.4463% 	 16847807 	 94.4747% 	  2345474517
ambiguous:       	  4.9636% 	   885435 	  4.8922% 	   121456267
low-Q discards:  	  0.0000% 	        0 	  0.0000% 	           0

perfect best site:	 79.8468% 	 14243472 	 79.0980% 	  1963723803
semiperfect site:	 79.8570% 	 14245292 	 79.1077% 	  1963964765
rescued:         	  0.5136% 	    91626

Match Rate:      	      NA 	       NA 	 96.4489% 	  2454439608
Error Rate:      	 19.6675% 	  3487693 	  3.5507% 	    90358451
Sub Rate:        	 19.2300% 	  3410108 	  0.4755% 	    12100683
Del Rate:        	  0.8707% 	   154402 	  3.0603% 	    77878570
Ins Rate:        	  0.4772% 	    84616 	  0.0149% 	      379198
N Rate:          	  0.0155% 	     2748 	  0.0004% 	       11295


Read 2 data:      	pct reads	num reads 	pct bases	   num bases

mapped:          	 99.1143% 	 17680507 	 99.0502% 	  2458368636
unambiguous:     	 94.1704% 	 16798575 	 94.1845% 	  2337606274
ambiguous:       	  4.9440% 	   881932 	  4.8656% 	   120762362
low-Q discards:  	  0.0000% 	        1 	  0.0000% 	          19

perfect best site:	 77.7319% 	 13866207 	 76.8600% 	  1907622427
semiperfect site:	 77.7424% 	 13868077 	 76.8699% 	  1907865846
rescued:         	  0.5123% 	    91390

Match Rate:      	      NA 	       NA 	 96.4399% 	  2443690977
Error Rate:      	 21.5427% 	  3808851 	  3.5591% 	    90184753
Sub Rate:        	 21.1207% 	  3734247 	  0.5632% 	    14271472
Del Rate:        	  0.8812% 	   155792 	  2.9809% 	    75532820
Ins Rate:        	  0.4759% 	    84144 	  0.0150% 	      380461
N Rate:          	  0.0418% 	     7390 	  0.0010% 	       25726

**Brian Bushnell** · 10-31-2015, 09:43 AM

Those are very good results. Over 99% of your reads mapped, and of those, almost 80% matched the reference perfectly. Also, over 98% of the reads mapped in properly paired configurations. That's about as good a result as you can hope for on a diploid organism.

I will note, however, that the average insert size of "196.85" is lower than ideal for 2x150bp reads. When making the library, you should target more than twice the read length (so, over 300bp in this case) so that you don't waste sequence double-covering the same location with the same molecule. Although, for exon-capture the constraints may be a little different since exons are short. I assume this is exon-capture data? At any rate, when your reads mostly overlap (as in this case), you can generally get better mapping results by correcting them first like this:

bbmerge.sh in1=read1.fq in2=read2.fq out1=corrected1.fq out2=corrected2.fq ecco vstrict mix=t merge=f

Also note that as long as samtools is installed and in your path, BBMap will output bam files if you name the output with a .bam extension, such as "out=mapped.bam".

**cmccabe** · 11-02-2015, 06:48 AM

Yes, this is exon capture data and I do have samtools installed and added to path. Thanks for the tips, I appreciate them. In terms of variant calling what do you recommend for using the aligned data with? Thank you

.

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Yesterday, 08:47 AM	0 responses 12 views 0 likes	Last Post by seqadmin Yesterday, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 59 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

bbmap aligner

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News