  • BBmap generation and mapping of artificial paired-end fastq

    Hello everyone.
    I have discovered BBmap and have just started using it. I want to run a benchmark on some variant callers, and I was thinking of making a set of paired-end Illumina fastq files where I know exactly the sample composition per position (e.g. at position 100, 80% A (REF) and 20% T (MUT)), so that I can run the files through the pipelines and assess how they perform.

    So, I generate my set of files with:
    Code:
    ./randomreads.sh ref=reference.fa out1=Sample1_R1.fastq out2=Sample1_R2.fastq coverage=10000 paired=t mininsert=1 maxinsert=12 snprate=0.1 insrate=0.1 delrate=0.1 minq=12 -Xmx10g
    The command works just fine and I get a set of fastq files. I can see (and also read somewhere in the documentation) that the read headers actually contain the information about the changes relative to the reference. So, I could write a small Python script to parse them and gather that info, but my first question is whether there is any tool in the software suite that can do this for me. I thought that maybe reformat.sh could, but its bhist option seems to output statistics by read position for the R1 and R2 files.
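
    In case it helps to show what I mean, below is a rough sketch of the parsing script I have in mind. The underscore-delimited header layout assumed in parse_fields() is only my guess (I have not confirmed it against the randomreads.sh documentation), so I would adapt that function after eyeballing a few real headers.
    Code:
    #!/usr/bin/env python3
    """Rough sketch: tally the fields found in randomreads.sh read names.
    The header layout is NOT assumed to be known; parse_fields() is a placeholder."""
    import sys
    from collections import Counter

    def read_names(fastq_path):
        """Yield each read name (header line without the leading '@')."""
        with open(fastq_path) as handle:
            for line_number, line in enumerate(handle):
                if line_number % 4 == 0:          # a FASTQ record is 4 lines
                    yield line.rstrip("\n").lstrip("@")

    def parse_fields(name):
        """Placeholder parser -- assumes underscore-separated fields."""
        return name.split("_")

    def main(fastq_path, preview=5):
        counts = Counter()
        for i, name in enumerate(read_names(fastq_path)):
            if i < preview:
                print("header:", name, file=sys.stderr)   # peek at the real format first
            counts.update(parse_fields(name))
        for field, n in counts.most_common(30):
            print(f"{n}\t{field}")

    if __name__ == "__main__":
        main(sys.argv[1])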

    Then my other question is about bbmap.sh. I tried to map the generated fastq files to my reference, using the command:

    Code:
    ./bbmap.sh ref=reference.fa local=t in1=Sample1_R1.fastq in2=Sample1_R2.fastq covstats=Covstats.cov
    but I got a very low average coverage per position (225, versus the ~10K expected from the input), and the mapping percentage is very low at 2.25%.
    The variant rates are also very high:
    HTML Code:
    Read 1 data:      	pct reads	num reads 	pct bases	   num bases
    
    mapped:          	  2.2502% 	    22429 	  2.2502% 	     3364350
    unambiguous:     	  2.2502% 	    22429 	  2.2502% 	     3364350
    ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
    low-Q discards:  	  0.0000% 	        0 	  0.0000% 	           0
    
    perfect best site:	  0.2408% 	     2400 	  0.2408% 	      360000
    semiperfect site:	  0.2408% 	     2400 	  0.2408% 	      360000
    rescued:         	  0.4520% 	     4505
    
    Match Rate:      	      NA 	       NA 	 95.1746% 	     3202707
    Error Rate:      	 86.5130% 	    19404 	  4.1666% 	      140208
    Sub Rate:        	 60.4441% 	    13557 	  0.8741% 	       29415
    Del Rate:        	  0.1338% 	       30 	  0.0219% 	         736
    Ins Rate:        	 65.3707% 	    14662 	  3.2706% 	      110057
    N Rate:          	 14.2583% 	     3198 	  0.6589% 	       22171
    
    
    Read 2 data:      	pct reads	num reads 	pct bases	   num bases
    
    mapped:          	  2.2421% 	    22348 	  2.2421% 	     3352200
    unambiguous:     	  2.2421% 	    22348 	  2.2421% 	     3352200
    ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
    low-Q discards:  	  0.2039% 	     2032 	  0.2039% 	      304800
    
    perfect best site:	  0.4693% 	     4678 	  0.4693% 	      701700
    semiperfect site:	  0.4693% 	     4678 	  0.4693% 	      701700
    rescued:         	  0.3604% 	     3592
    
    Match Rate:      	      NA 	       NA 	 96.8958% 	     3248505
    Error Rate:      	 76.8301% 	    17170 	  2.7130% 	       90955
    Sub Rate:        	 60.3454% 	    13486 	  0.8803% 	       29513
    Del Rate:        	  0.1253% 	       28 	  0.0112% 	         376
    Ins Rate:        	 39.6680% 	     8865 	  1.8215% 	       61066
    N Rate:          	 10.8198% 	     2418 	  0.3912% 	       13116
    but that seems contradictory given my input. So, my second question is: what am I missing here? I was aiming for files with a few events (substitutions, indels), but the end result seems quite extreme.

    I also ran a second test where I drastically lowered the indel rates:
    Code:
    snprate=0.1 insrate=0.01 delrate=0.01
    but that yielded an even poorer mapping percentage of 0.2%.

    I appreciate any pointers, thanks for reading!

  • #2
    It may be best to use "mutate.sh" (look at its in-line help) to introduce the mutations after you generate the reads with randomreads.sh. You will get a VCF file of the changes.

    I think using "local=t" is causing the alignment issues. It is meant for local alignments, when you expect errors at the ends of reads.
    Last edited by GenoMax; 04-30-2020, 05:49 AM.



    • #3
      Originally posted by GenoMax
      It may be best to use "mutate.sh" (look at its in-line help) to introduce the mutations after you generate the reads with randomreads.sh. You will get a VCF file of the changes.

      I think using "local=t" is causing the alignment issues. It is meant for local alignments, when you expect errors at the ends of reads.
      Thank you for your ultra-fast reply, GenoMax! I have a question about your comment: did you mean to use "mutate.sh" before generating the reads? I actually tried that approach:
      1) First mutate my original sequence:
      Code:
      ./mutate.sh in=reference.fa out=reference_Mut.fa vcf=Variants.vcf subrate=0.01 indelrate=0.005 maxindel=3 overwrite=t
      The VCF file indeed looks super!
      2) Then generate reads based on the new sequence, without any additional flags (only min quality):
      Code:
      ./randomreads.sh ref=reference_Mut.fa out1=Sample1_R1.fastq out2=Sample1_R2.fastq coverage=10000 paired=t minq=12 -Xmx10g
      3) Map the reads to the mutated reference (excluding the "local" option):
      Code:
      ./bbmap.sh ref=reference_Mut.fa in1=Sample1_R1.fastq in2=Sample1_R2.fastq covstats=Covstats.cov
      This indeed yields a 99.905% mapping rate, which makes sense; I assume the small shortfall is due to some quality filter.
      4) Map the reads to the original reference:
      Code:
      ./bbmap.sh ref=reference.fa in1=Sample1_R1.fastq in2=Sample1_R2.fastq covstats=Covstats.cov
      This yielded a 99.875% mapping rate, which also makes sense.

      My follow-up question, then, is: can I safely assume that the fastq files I'm generating with the method above contain the variants at a rate of 100%?

      If that assumption is correct, what would be the best way to regulate the variation rate? I'm thinking along the lines of:
      1) Generate perfect reads from the un-mutated reference.
      2) Generate perfect reads from the mutated sequence.
      3) Mix the two file sets in various proportions to get the desired effect and write a new set of files (a rough sketch of this step is below).
      Is that logic correct, or is there already a way to do this?
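
      As a sketch of step 3, something like the following is what I have in mind; the file names are placeholders and I am assuming both read sets were simulated to the same depth over the same coordinates. (I believe reformat.sh also has a samplerate= option that could handle the subsampling step, but I have not tried it for this.) Pairs are kept in sync by making one random decision per R1/R2 pair.
      Code:
      #!/usr/bin/env python3
      """Sketch only: mix two paired-end read sets at a chosen fraction.
      File names are placeholders; start from empty output files (append mode)."""
      import random

      def mix_pairs(in1, in2, out1, out2, keep_fraction, seed):
          rng = random.Random(seed)
          with open(in1) as r1, open(in2) as r2, \
               open(out1, "a") as o1, open(out2, "a") as o2:
              while True:
                  rec1 = [r1.readline() for _ in range(4)]   # one FASTQ record = 4 lines
                  rec2 = [r2.readline() for _ in range(4)]
                  if not rec1[0]:                            # end of R1 file
                      break
                  if rng.random() < keep_fraction:           # same decision for R1 and R2
                      o1.writelines(rec1)
                      o2.writelines(rec2)

      if __name__ == "__main__":
          # e.g. ~80% of pairs from the mutated-reference reads, ~20% from the originals,
          # giving roughly 80% MUT at a site that is 100% MUT in the mutated set
          mix_pairs("Mut_R1.fastq",  "Mut_R2.fastq",  "Mix_R1.fastq", "Mix_R2.fastq", 0.8, seed=1)
          mix_pairs("Orig_R1.fastq", "Orig_R2.fastq", "Mix_R1.fastq", "Mix_R2.fastq", 0.2, seed=2)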

      Thank you in advance, I really appreciate it!



      • #4
        Perhaps I am missing a subtle point, but since you can control the mutation types/rates with mutate.sh, would it not be better to go with just that data (#2 in the list above)? If you mix reads from the mutated and un-mutated references, can you be sure that you will maintain the fraction of mutations at the same level in the new pool?
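
        One way to put numbers on that concern (purely illustrative; the depth, fraction and site count below are made up) is to simulate how much the per-site MUT fraction wobbles around its target value after random mixing:
        Code:
        #!/usr/bin/env python3
        """Illustration: after random mixing, the observed MUT fraction at any one
        position fluctuates binomially around the target fraction."""
        import random
        import statistics

        def simulated_site_fractions(depth, mut_fraction, n_sites, seed=42):
            rng = random.Random(seed)
            fractions = []
            for _ in range(n_sites):
                mut_reads = sum(rng.random() < mut_fraction for _ in range(depth))
                fractions.append(mut_reads / depth)
            return fractions

        fracs = simulated_site_fractions(depth=10000, mut_fraction=0.8, n_sites=200)
        print("mean", round(statistics.mean(fracs), 4),
              "stdev", round(statistics.stdev(fracs), 4),
              "min", min(fracs), "max", max(fracs))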



        • #5
          Hi GenoMax! Yes, you are right, the fraction of mutations may change in the new pool, but, based on the mixing percentages in my scenario, I'll be able to control the level of mutation per position. So, say at position 100, I could go from 100% MUT to 80% MUT by mixing the mutated-reference fastq and the original-reference fastq at 80/20 percent of reads, respectively.



          • #6
            You could try it out and see if it works (I have a hunch it won't be perfect).

            If that does not work acceptably, then I suggest you run mutate.sh multiple times with new values and create new datasets.



            • #7
              Thank you, I'll try it and report back!
