Hello everyone.
I have discovered BBmap and have just started using it. I want to run a benchmark on some variant callers and I was thinking of making a set of paired-end Illumina fastq files where I know exactly the sample composition per position (e.g. position 100, 80% A (REF), 20% T (MUT)) and then I can run the file through the pipelines and assess how they perform.
So, I generate my set of files with:
The command works just fine and I get a set of fastq files. I can see (and also read somewhere in the documentation) that the read headers actually contain the info of the changes compared to the reference. So, I can write a small python script to parse them and gather that info, but my first question was if there is any tool in the software suite that can do this for me. I thought that maybe reformat.sh could do this, but the bhist option of that seems to output statistics based on read position for the R1 and R2 files.
Then my other question is about BBmap.sh. I tried to map the generated fastq files to my reference, using the command:
but got a very low average coverage per position (225, but expected ~10K based on input) and the mapping percentage is very low at 2.25%.
The variant rates are very high,
but that seems contradicting based on my input. So, my second question is what am I missing here? I was aiming for a file that has a few events (substitutions, indels), but the end result seems quite extreme.
I also ran a second test where I lowered the rates drastically
, but that yielded an even poorer mapping percentage of 0.2%.
I appreciate any pointers, thanks for reading!
I have discovered BBmap and have just started using it. I want to run a benchmark on some variant callers and I was thinking of making a set of paired-end Illumina fastq files where I know exactly the sample composition per position (e.g. position 100, 80% A (REF), 20% T (MUT)) and then I can run the file through the pipelines and assess how they perform.
So, I generate my set of files with:
Code:
./randomreads.sh ref=reference.fa out1=Sample1_R1.fastq out2=Sample1_R2.fastq coverage=10000 paired=t mininsert=1 maxinsert=12 snprate=0.1 insrate=0.1 delrate=0.1 minq=12 -Xmx10g
Then my other question is about BBmap.sh. I tried to map the generated fastq files to my reference, using the command:
Code:
/bbmap.sh ref=reference.fa local=t in1=Sample1_R1.fastq in2=Sample1_R2.fastq covstats=Covstats.cov
The variant rates are very high,
HTML Code:
Read 1 data: pct reads num reads pct bases num bases mapped: 2.2502% 22429 2.2502% 3364350 unambiguous: 2.2502% 22429 2.2502% 3364350 ambiguous: 0.0000% 0 0.0000% 0 low-Q discards: 0.0000% 0 0.0000% 0 perfect best site: 0.2408% 2400 0.2408% 360000 semiperfect site: 0.2408% 2400 0.2408% 360000 rescued: 0.4520% 4505 Match Rate: NA NA 95.1746% 3202707 Error Rate: 86.5130% 19404 4.1666% 140208 Sub Rate: 60.4441% 13557 0.8741% 29415 Del Rate: 0.1338% 30 0.0219% 736 Ins Rate: 65.3707% 14662 3.2706% 110057 N Rate: 14.2583% 3198 0.6589% 22171 Read 2 data: pct reads num reads pct bases num bases mapped: 2.2421% 22348 2.2421% 3352200 unambiguous: 2.2421% 22348 2.2421% 3352200 ambiguous: 0.0000% 0 0.0000% 0 low-Q discards: 0.2039% 2032 0.2039% 304800 perfect best site: 0.4693% 4678 0.4693% 701700 semiperfect site: 0.4693% 4678 0.4693% 701700 rescued: 0.3604% 3592 Match Rate: NA NA 96.8958% 3248505 Error Rate: 76.8301% 17170 2.7130% 90955 Sub Rate: 60.3454% 13486 0.8803% 29513 Del Rate: 0.1253% 28 0.0112% 376 Ins Rate: 39.6680% 8865 1.8215% 61066 N Rate: 10.8198% 2418 0.3912% 13116
I also ran a second test where I lowered the rates drastically
Code:
snprate=0.1 insrate=0.01 delrate=0.01
I appreciate any pointers, thanks for reading!
Comment