Old 04-30-2020, 02:46 AM   #1
DPalachan
Junior Member
 
Location: Netherlands

Join Date: Apr 2020
Posts: 4
BBmap generation and mapping of artificial paired-end fastq

Hello everyone.
I have just discovered BBMap and started using it. I want to benchmark some variant callers, and I was thinking of making a set of paired-end Illumina fastq files where I know the exact sample composition at every position (e.g. position 100: 80% A (REF), 20% T (MUT)), so I can run the files through the pipelines and assess how they perform.

So, I generate my set of files with:
Code:
./randomreads.sh ref=reference.fa out1=Sample1_R1.fastq out2=Sample1_R2.fastq coverage=10000 paired=t mininsert=1 maxinsert=12 snprate=0.1 insrate=0.1 delrate=0.1 minq=12 -Xmx10g
The command works just fine and I get a pair of fastq files. I can see (and also read in the documentation) that the read headers contain the information about the changes relative to the reference. I could write a small Python script to parse them and gather that information, but my first question is whether any tool in the suite already does this for me. I thought that maybe reformat.sh could, but its bhist option seems to output statistics by read position for the R1 and R2 files.
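For what it's worth, this is roughly the kind of script I had in mind (my own sketch, not a BBTools utility; I still need to check how the annotations in the read names are actually laid out, so parse_mutations() is only a stub and the underscore split is just a guess):
Code:
#!/usr/bin/env python3
# Rough sketch: peek at the read names that randomreads.sh wrote so I can
# work out how the mutation info is encoded before writing the real parser.
import sys
from itertools import islice

def fastq_headers(path):
    """Yield the name line of each 4-line FASTQ record, without the leading '@'."""
    with open(path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 0:
                yield line.rstrip("\n").lstrip("@")

def parse_mutations(header):
    """Stub: pull the embedded variant annotations out of a read name.
    The underscore split is only a guess; replace it once the real token
    layout is known from the headers/documentation."""
    return header.split("_")

# Usage: python peek_headers.py Sample1_R1.fastq
for header in islice(fastq_headers(sys.argv[1]), 5):
    print(parse_mutations(header))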

Then my other question is about bbmap.sh. I tried to map the generated fastq files to my reference with the command:

Code:
./bbmap.sh ref=reference.fa local=t in1=Sample1_R1.fastq in2=Sample1_R2.fastq covstats=Covstats.cov
but got a very low average coverage per position (225, where I expected ~10,000 based on the input), and the mapping percentage is very low at 2.25%. The reported variant rates are also very high:
HTML Code:
Read 1 data:      	pct reads	num reads 	pct bases	   num bases

mapped:          	  2.2502% 	    22429 	  2.2502% 	     3364350
unambiguous:     	  2.2502% 	    22429 	  2.2502% 	     3364350
ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
low-Q discards:  	  0.0000% 	        0 	  0.0000% 	           0

perfect best site:	  0.2408% 	     2400 	  0.2408% 	      360000
semiperfect site:	  0.2408% 	     2400 	  0.2408% 	      360000
rescued:         	  0.4520% 	     4505

Match Rate:      	      NA 	       NA 	 95.1746% 	     3202707
Error Rate:      	 86.5130% 	    19404 	  4.1666% 	      140208
Sub Rate:        	 60.4441% 	    13557 	  0.8741% 	       29415
Del Rate:        	  0.1338% 	       30 	  0.0219% 	         736
Ins Rate:        	 65.3707% 	    14662 	  3.2706% 	      110057
N Rate:          	 14.2583% 	     3198 	  0.6589% 	       22171


Read 2 data:      	pct reads	num reads 	pct bases	   num bases

mapped:          	  2.2421% 	    22348 	  2.2421% 	     3352200
unambiguous:     	  2.2421% 	    22348 	  2.2421% 	     3352200
ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
low-Q discards:  	  0.2039% 	     2032 	  0.2039% 	      304800

perfect best site:	  0.4693% 	     4678 	  0.4693% 	      701700
semiperfect site:	  0.4693% 	     4678 	  0.4693% 	      701700
rescued:         	  0.3604% 	     3592

Match Rate:      	      NA 	       NA 	 96.8958% 	     3248505
Error Rate:      	 76.8301% 	    17170 	  2.7130% 	       90955
Sub Rate:        	 60.3454% 	    13486 	  0.8803% 	       29513
Del Rate:        	  0.1253% 	       28 	  0.0112% 	         376
Ins Rate:        	 39.6680% 	     8865 	  1.8215% 	       61066
N Rate:          	 10.8198% 	     2418 	  0.3912% 	       13116
but that seems contradictory given my input. So, my second question is: what am I missing here? I was aiming for files with only a few events (substitutions, indels), but the end result seems quite extreme.

I also ran a second test where I lowered the insertion and deletion rates drastically,
Code:
snprate=0.1 insrate=0.01 delrate=0.01
but that yielded an even lower mapping percentage of 0.2%.

I appreciate any pointers, thanks for reading!
Old 04-30-2020, 03:02 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,075

It may be best to use "mutate.sh" (see its in-line help) to introduce the mutations after you generate the reads with randomreads.sh. You will get a VCF file of the changes.

I think using "local=t" is causing the alignment issues. It is meant for local alignments, i.e. cases where you expect errors at the ends of reads.

Old 04-30-2020, 04:30 AM   #3
DPalachan
Junior Member
 
Location: Netherlands

Join Date: Apr 2020
Posts: 4

Quote:
Originally Posted by GenoMax
It may be best to use "mutate.sh" (see its in-line help) to introduce the mutations after you generate the reads with randomreads.sh. You will get a VCF file of the changes.

I think using "local=t" is causing the alignment issues. It is meant for local alignments, i.e. cases where you expect errors at the ends of reads.
Thank you for your ultra-fast reply, GenoMax! I have a question about your comment: did you mean that I should use "mutate.sh" before generating the reads? I actually tried that approach:
1) First mutate my original sequence:
Code:
./mutate.sh in=reference.fa out=reference_Mut.fa vcf=Variants.vcf subrate=0.01 indelrate=0.005 maxindel=3 overwrite=t
The VCF file indeed looks great!
2) Then generate reads based on the new sequence, without any additional flags (only min quality):
Code:
./randomreads.sh ref=reference_Mut.fa out1=Sample1_R1.fastq out2=Sample1_R2.fastq coverage=10000 paired=t minq=12 -Xmx10g
3) Map the reads to the mutated reference (excluding the "local" option):
Code:
./bbmap.sh ref=reference_Mut.fa in1=Sample1_R1.fastq in2=Sample1_R2.fastq covstats=Covstats.cov
This indeed yields 99.905% mapping, which makes sense; I assume the small shortfall comes from some quality filter.
4) Map the reads to the original reference:
Code:
./bbmap.sh ref=reference.fa in1=Sample1_R1.fastq in2=Sample1_R2.fastq covstats=Covstats.cov
This yielded 99.875% mapping, which also makes sense.

My follow-up question, then, is: can I safely assume that the fastq files I generate with the method above carry the variants at a fraction of 100%?

If that assumption is correct, what would be the best way to regulate the variation rate? I'm thinking along these lines:
1) Generate perfect reads from the un-mutated reference.
2) Generate perfect reads from the mutated reference.
3) Mix the two read sets in various proportions to get the desired effect and write a new pair of files (a rough sketch of what I mean is below).
Is that logic correct, or is there already a way to do this?
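For step 3, this is the kind of mixing script I have in mind (my own untested sketch; the file names and the 20% mutant fraction are placeholders from my example, and it assumes both read sets have the same number of pairs, in the same order, in each file):
Code:
#!/usr/bin/env python3
# Sketch: build a mixed paired-end set where each read pair is drawn from the
# mutated-reference reads with probability MUT_FRACTION, otherwise from the
# original-reference reads. R1 and R2 are kept in sync pair by pair.
import random

MUT_FRACTION = 0.2   # e.g. aiming for ~20% MUT at every variant position

def fastq_records(path):
    """Yield one 4-line FASTQ record at a time."""
    with open(path) as fh:
        while True:
            rec = [fh.readline() for _ in range(4)]
            if not rec[0]:
                return
            yield rec

def mix(ref_r1, ref_r2, mut_r1, mut_r2, out_r1, out_r2, mut_fraction, seed=1):
    rng = random.Random(seed)
    with open(out_r1, "w") as o1, open(out_r2, "w") as o2:
        for ref_pair, mut_pair in zip(zip(fastq_records(ref_r1), fastq_records(ref_r2)),
                                      zip(fastq_records(mut_r1), fastq_records(mut_r2))):
            r1, r2 = mut_pair if rng.random() < mut_fraction else ref_pair
            o1.writelines(r1)
            o2.writelines(r2)

mix("Ref_R1.fastq", "Ref_R2.fastq",
    "Mut_R1.fastq", "Mut_R2.fastq",
    "Mix_R1.fastq", "Mix_R2.fastq", MUT_FRACTION)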

Thank you in advance, I really appreciate it!
Old 04-30-2020, 05:54 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,075

Perhaps I am missing a subtle point, but since you can control the mutation types and rates with mutate.sh, would it not be better to go with just that data (#2 in your list above)? If you mix reads from the mutated and un-mutated references, you can't be sure that you will maintain the fraction of mutations at the same level in the new pool.
Old 04-30-2020, 06:26 AM   #5
DPalachan
Junior Member
 
Location: Netherlands

Join Date: Apr 2020
Posts: 4

Hi GenoMax! Yes, you are right that the fraction of mutations may change in the new pool, but based on the mixing percentages (in my scenario) I will be able to control the mutation level per position. So, say at position 100, I can go from 100% MUT down to 20% MUT by mixing my original-reference fastq and my mutated-reference fastq at 80/20 percent of reads respectively.
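Just to make my arithmetic explicit (a back-of-the-envelope check; the numbers are the ones from my example above):
Code:
# Each read from the mutated set carries the variant and each read from the
# original set does not, so the expected mutant allele fraction at a covered
# position equals the mixing fraction of mutated reads.
import random

mut_read_fraction = 0.2          # 80/20 original/mutated mix from my example
expected_af = mut_read_fraction * 1.0 + (1 - mut_read_fraction) * 0.0
print(expected_af)               # 0.2

# At finite depth the observed fraction scatters around this value
# (which I think is your concern):
depth = 10_000
rng = random.Random(0)
observed = sum(rng.random() < mut_read_fraction for _ in range(depth)) / depth
print(observed)                  # roughly 0.2, but not exactly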
Old 04-30-2020, 06:31 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,075

You could try it out and see if it works (I have a hunch it won't be perfect).

If that does not work acceptably, then I suggest you run mutate.sh multiple times with new values and create new datasets.
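Something along these lines should do it (an untested sketch; the rates and file names are placeholders, and the flags are the same ones you already used above):
Code:
#!/usr/bin/env python3
# Sketch: generate several mutated references (each with its own truth VCF)
# by calling mutate.sh with a range of substitution rates.
import subprocess

for subrate in (0.001, 0.005, 0.01, 0.05):
    subprocess.run(
        ["./mutate.sh",
         "in=reference.fa",
         f"out=reference_Mut_sub{subrate}.fa",
         f"vcf=Variants_sub{subrate}.vcf",
         f"subrate={subrate}",
         "indelrate=0.005",
         "maxindel=3",
         "overwrite=t"],
        check=True)
    # Reads can then be simulated from each mutated reference with
    # randomreads.sh, exactly as in the earlier commands.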
Old 04-30-2020, 06:50 AM   #7
DPalachan
Junior Member
 
Location: Netherlands

Join Date: Apr 2020
Posts: 4

Thank you, I'll try it and report back!