View Single Post
Old 04-30-2020, 02:46 AM   #1
DPalachan
Junior Member
 
Location: Netherlands

Join Date: Apr 2020
Posts: 4
Default BBmap generation and mapping of artificial paired-end fastq

Hello everyone.
I have discovered BBmap and have just started using it. I want to run a benchmark on some variant callers and I was thinking of making a set of paired-end Illumina fastq files where I know exactly the sample composition per position (e.g. position 100, 80% A (REF), 20% T (MUT)) and then I can run the file through the pipelines and assess how they perform.

So, I generate my set of files with:
Code:
./randomreads.sh ref=reference.fa out1=Sample1_R1.fastq out2=Sample1_R2.fastq coverage=10000 paired=t mininsert=1 maxinsert=12 snprate=0.1 insrate=0.1 delrate=0.1 minq=12 -Xmx10g
The command works just fine and I get a set of fastq files. I can see (and also read somewhere in the documentation) that the read headers actually contain the info of the changes compared to the reference. So, I can write a small python script to parse them and gather that info, but my first question was if there is any tool in the software suite that can do this for me. I thought that maybe reformat.sh could do this, but the bhist option of that seems to output statistics based on read position for the R1 and R2 files.

Then my other question is about BBmap.sh. I tried to map the generated fastq files to my reference, using the command:

Code:
/bbmap.sh ref=reference.fa local=t in1=Sample1_R1.fastq in2=Sample1_R2.fastq covstats=Covstats.cov
but got a very low average coverage per position (225, but expected ~10K based on input) and the mapping percentage is very low at 2.25%.
The variant rates are very high,
HTML Code:
Read 1 data:      	pct reads	num reads 	pct bases	   num bases

mapped:          	  2.2502% 	    22429 	  2.2502% 	     3364350
unambiguous:     	  2.2502% 	    22429 	  2.2502% 	     3364350
ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
low-Q discards:  	  0.0000% 	        0 	  0.0000% 	           0

perfect best site:	  0.2408% 	     2400 	  0.2408% 	      360000
semiperfect site:	  0.2408% 	     2400 	  0.2408% 	      360000
rescued:         	  0.4520% 	     4505

Match Rate:      	      NA 	       NA 	 95.1746% 	     3202707
Error Rate:      	 86.5130% 	    19404 	  4.1666% 	      140208
Sub Rate:        	 60.4441% 	    13557 	  0.8741% 	       29415
Del Rate:        	  0.1338% 	       30 	  0.0219% 	         736
Ins Rate:        	 65.3707% 	    14662 	  3.2706% 	      110057
N Rate:          	 14.2583% 	     3198 	  0.6589% 	       22171


Read 2 data:      	pct reads	num reads 	pct bases	   num bases

mapped:          	  2.2421% 	    22348 	  2.2421% 	     3352200
unambiguous:     	  2.2421% 	    22348 	  2.2421% 	     3352200
ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
low-Q discards:  	  0.2039% 	     2032 	  0.2039% 	      304800

perfect best site:	  0.4693% 	     4678 	  0.4693% 	      701700
semiperfect site:	  0.4693% 	     4678 	  0.4693% 	      701700
rescued:         	  0.3604% 	     3592

Match Rate:      	      NA 	       NA 	 96.8958% 	     3248505
Error Rate:      	 76.8301% 	    17170 	  2.7130% 	       90955
Sub Rate:        	 60.3454% 	    13486 	  0.8803% 	       29513
Del Rate:        	  0.1253% 	       28 	  0.0112% 	         376
Ins Rate:        	 39.6680% 	     8865 	  1.8215% 	       61066
N Rate:          	 10.8198% 	     2418 	  0.3912% 	       13116
but that seems contradicting based on my input. So, my second question is what am I missing here? I was aiming for a file that has a few events (substitutions, indels), but the end result seems quite extreme.

I also ran a second test where I lowered the rates drastically
Code:
snprate=0.1 insrate=0.01 delrate=0.01
, but that yielded an even poorer mapping percentage of 0.2%.

I appreciate any pointers, thanks for reading!
DPalachan is offline   Reply With Quote