Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • bbmap aligner

    So after using bbduk (adapter removal, quality trimming, and contamination filtering) I am going to use bbmap on the unmatched fastq1 and unmatched fastq2. I am assuming the matched file are those reads that were identical to the phix_adapters and therefore I do not want? Is this correct? Sorry, I am new to Illumina data. Also, should I use

    Code:
    bbmap.sh ref=hg19.fa
    to re-index before using bbmap?

    Thank you .
    Last edited by cmccabe; 10-30-2015, 10:45 AM. Reason: added question

  • #2
    How many reads were removed by bbduk contamination filtering?

    Yes for the other question.

    Comment


    • #3
      Thank you .

      The new index is in a ref folder was created in the bbmap directory. Though the info doesn't look right

      Code:
       bbmap.sh ref=/home/cmccabe/Desktop/bed/hg19.fa
      Code:
      #Chromosome sizes
      #Generated on	Fri Oct 30 13:47:30 CDT 2015
      #Version	5
      #chrom	scaffolds	contigs	length	defined	undefined	startPad	stopPad
      1	3	58	492475166	463501710	28973456	8000	8000
      2	2	15	389185007	382458811	6726196	8000	8000
      3	3	31	511177591	500443989	10733602	8000	8000
      4	3	63	423120801	394347091	28773710	8000	8000
      5	4	24	491386730	445490327	45896403	8000	8000
      6	7	54	522452802	463453103	58999699	8000	8000
      7	71	213	307452973	247615431	59837542	8000	8000
      Code:
      cmccabe@DTV-A5211QLM:~/Desktop/NGS$ bbmap.sh ref=/home/cmccabe/Desktop/bed/hg19.fa
      java -Djava.library.path=/home/cmccabe/Desktop/NGS/bbmap-35.59/jni/ -ea -Xmx44294m -cp /home/cmccabe/Desktop/NGS/bbmap-35.59/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 ref=/home/cmccabe/Desktop/bed/hg19.fa
      Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, ref=/home/cmccabe/Desktop/bed/hg19.fa]
      
      BBMap version 35.59
      Retaining first best site only for ambiguous mappings.
      No output file.
      Writing reference.
      Executing dna.FastaToChromArrays2 [/home/cmccabe/Desktop/bed/hg19.fa, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]
      
      Set genScaffoldInfo=true
      Writing chunk 1
      Writing chunk 2
      Writing chunk 3
      Writing chunk 4
      Writing chunk 5
      Writing chunk 6
      Writing chunk 7
      Set genome to 1
      
      Loaded Reference:	0.024 seconds.
      Loading index for chunk 1-7, build 1
      No index available; generating from reference genome: /home/cmccabe/Desktop/NGS/ref/index/1/chr1-3_index_k13_c2_b1.block
      No index available; generating from reference genome: /home/cmccabe/Desktop/NGS/ref/index/1/chr4-7_index_k13_c2_b1.block
      Indexing threads started for block 0-3
      Indexing threads started for block 4-7
      Indexing threads finished for block 0-3
      Indexing threads finished for block 4-7
      Generated Index:	171.787 seconds.
      Finished Writing:	31.349 seconds.
      No reads to process; quitting.
      Contamination filtering results:
      Code:
      #File	/home/cmccabe/IDP/adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R1.fastq	/home/cmccabe/IDP/adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R2.fastq
      #Total	35677024
      #Matched	59	0.00017%
      #Name	Reads	ReadsPct
      PhiX_read2_adapter	59	0.00017%
      Last edited by cmccabe; 10-30-2015, 11:40 AM. Reason: fixed format

      Comment


      • #4
        That's odd - if you do adapter-trimming first, there should be no reads in the contamination-filtering phase that match "PhiX_read2_adapter", since it would have already been trimmed. Can you give the command lines you used for each of those phases?

        Comment


        • #5
          Here are the commands bbbmap-35.59

          Adapter Removal:
          Code:
          bbduk.sh -Xmx1g in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
          Quality Trimming:
          Code:
          bbduk.sh -Xmx1g in=read1.fq in=read2.fq out1=clean1.fq out2=clean2.fq qtrim=rl trimq=10
          Thank you .

          Comment


          • #6
            That looks fine. And did you do a contamination removal step after that?

            Comment


            • #7
              Yes I did:

              Code:
               
              bbduk.sh -Xmx1g in1=r1.fq in2=r2.fq out1=unmatched1.fq out2=unmatched2.fq outm1=matched1.fq outm2=matched2.fq ref=phix_adapters.fa k=31 hdist=1 stats=stats.txt
              Thank you .

              Comment


              • #8
                Ah, I see. You should do contaminant filtering against the entire phiX reference, rather than just the adapters:

                Code:
                bbduk.sh -Xmx1g in1=r1.fq in2=r2.fq out1=unmatched1.fq out2=unmatched2.fq outm1=matched1.fq outm2=matched2.fq ref=[B]phix174_ill.ref.fa.gz[/B] k=31 hdist=1 stats=stats.txt
                I still don't understand how there are PhiX adapters in the data after trimming, though. They should have all been trimmed. Your trimming command was correct, and adapters.fa already contains the PhiX adapters (last two entries in the file). Can you post the console output from adapter-trimming?

                Also, don't worry about the "#Chromosome sizes" in BBMap's index; a long time ago that made sense, but now multiple chromosomes are packed together in blocks but I never changed the terminology.

                Comment


                • #9
                  Thank you.

                  Code:
                   bbmap.sh in1=reads1.fq in2=reads2.fq out=mapped.sam ref=ecoli.fa nodisk
                  So for the ref= is that the path to the ref folder that was created? Thank you .

                  Comment


                  • #10
                    If you type "bbmap.sh ref=x.fa", it will create the "ref" index in the current directory.

                    If you then run "bbmap.sh in=reads.fq", it will try to do mapping by looking for "ref" in the current directory. If you want it to look elsewhere, you can set path, for example "path=/foo/" would look for "/foo/ref".

                    The "nodisk" flag does not need "path", as the index does not get written anywhere. So "bbmap.sh in=reads.fq ref=ecoli.fa nodisk" does not need a path flag - it will build the index in memory and then do mapping. The only reason to build an index on disk is because it is faster to load a built index than to build one from the reference fasta each time.

                    Comment


                    • #11
                      I will give bbmap a try tomorroe.

                      Here are the results using the phiX reference:

                      Code:
                      BBDuk version 35.59
                      Initial:
                      Memory: max=1029m, free=1005m, used=24m
                      
                      Added 487396 kmers; time: 	0.258 seconds.
                      Memory: max=1029m, free=956m, used=73m
                      
                      Input is being processed as paired
                      Started output streams:	0.023 seconds.
                      Processing time:   		73.602 seconds.
                      
                      Input:                  	35677024 reads 		4964594710 bases.
                      Contaminants:           	32 reads (0.00%) 	4814 bases (0.00%)
                      Result:                 	35676992 reads (100.00%) 	4964589896 bases (100.00%)
                      
                      Time:   			73.894 seconds.
                      Reads Processed:      35677k 	482.81k reads/sec
                      Bases Processed:       4964m 	67.19m bases/sec
                      stats.txt
                      Code:
                      #File	/home/cmccabe/IDP/adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R1.fastq	/home/cmccabe/IDP/adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R2.fastq
                      #Total	35677024
                      #Matched	18	0.00005%
                      #Name	Reads	ReadsPct
                      gi|9626372|ref|NC_001422.1| Coliphage phiX174, complete genome	18	0.00005%
                      Thank you .

                      Comment


                      • #12
                        32 reads that are similar to phiX must be by chance. Start mapping against the real genome.

                        Comment


                        • #13
                          Not being familiar with Illumina dat I have no idea if this is a good mapping, I will convert to bam and the use FASTQC as well. Thank you .

                          I created an index reference using the command below:
                          Code:
                           ndex: bbmap.sh ref=/home/cmccabe/Desktop/bed/hg19.fa
                          and that created a ref folder in the bbmap directory. I gave the path to the .fa as that was needed. Is this correct? Thank you .


                          Code:
                          [B]cmccabe@DTV-A5211QLM:~/Desktop/NGS$ bbmap.sh in1=/home/cmccabe/IDP/unmatched_adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R1.fastq in2=/home/cmccabe/IDP/unmatched_adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R2.fastq out=/home/cmccabe/IDP/mapped_unmatched_adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R1_and_R2.sam ref=/home/cmccabe/Desktop/bed/hg19.fa nodisk[/B]
                          java -Djava.library.path=/home/cmccabe/Desktop/NGS/bbmap-35.59/jni/ -ea -Xmx44276m -cp /home/cmccabe/Desktop/NGS/bbmap-35.59/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 in1=/home/cmccabe/IDP/unmatched_adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R1.fastq in2=/home/cmccabe/IDP/unmatched_adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R2.fastq out=/home/cmccabe/IDP/mapped_unmatched_adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R1_and_R2.sam ref=/home/cmccabe/Desktop/bed/hg19.fa nodisk
                          Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, in1=/home/cmccabe/IDP/unmatched_adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R1.fastq, in2=/home/cmccabe/IDP/unmatched_adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R2.fastq, out=/home/cmccabe/IDP/mapped_unmatched_adapter_removed_quality_trimmed_IDP-N2-IDT_S5_R1_and_R2.sam, ref=/home/cmccabe/Desktop/bed/hg19.fa, nodisk]
                          
                          BBMap version 35.59
                          Retaining first best site only for ambiguous mappings.
                          Executing dna.FastaToChromArrays2 [/home/cmccabe/Desktop/bed/hg19.fa, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=false, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=true]
                          
                          Set genScaffoldInfo=true
                          Set genome to 1
                          
                          Loaded Reference:	0.005 seconds.
                          Loading index for chunk 1-7, build 1
                          Indexing threads started for block 0-3
                          Indexing threads started for block 4-7
                          Indexing threads finished for block 0-3
                          Indexing threads finished for block 4-7
                          Generated Index:	170.887 seconds.
                          Analyzed Index:   	6.891 seconds.
                          Started output stream:	0.160 seconds.
                          Cleared Memory:    	2.506 seconds.
                          Processing reads in paired-ended mode.
                          Started read stream.
                          Started 16 mapping threads.
                          Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
                          
                             ------------------   Results   ------------------   
                          
                          Genome:                	1
                          Key Length:            	13
                          Max Indel:             	16000
                          Minimum Score Ratio:  	0.56
                          Mapping Mode:         	normal
                          Reads Used:           	35676992	(4964589896 bases)
                          
                          Mapping:          	6441.547 seconds.
                          Reads/sec:       	5538.57
                          kBases/sec:      	770.71
                          
                          
                          Pairing data:   	pct reads	num reads 	pct bases	   num bases
                          
                          mated pairs:     	 98.2186% 	 17520728 	 98.1389% 	  4872191776
                          bad pairs:       	  0.5597% 	    99849 	  0.5917% 	    29375968
                          insert size avg: 	  196.85
                          
                          
                          Read 1 data:      	pct reads	num reads 	pct bases	   num bases
                          
                          mapped:          	 99.4100% 	 17733242 	 99.3670% 	  2466930784
                          unambiguous:     	 94.4463% 	 16847807 	 94.4747% 	  2345474517
                          ambiguous:       	  4.9636% 	   885435 	  4.8922% 	   121456267
                          low-Q discards:  	  0.0000% 	        0 	  0.0000% 	           0
                          
                          perfect best site:	 79.8468% 	 14243472 	 79.0980% 	  1963723803
                          semiperfect site:	 79.8570% 	 14245292 	 79.1077% 	  1963964765
                          rescued:         	  0.5136% 	    91626
                          
                          Match Rate:      	      NA 	       NA 	 96.4489% 	  2454439608
                          Error Rate:      	 19.6675% 	  3487693 	  3.5507% 	    90358451
                          Sub Rate:        	 19.2300% 	  3410108 	  0.4755% 	    12100683
                          Del Rate:        	  0.8707% 	   154402 	  3.0603% 	    77878570
                          Ins Rate:        	  0.4772% 	    84616 	  0.0149% 	      379198
                          N Rate:          	  0.0155% 	     2748 	  0.0004% 	       11295
                          
                          
                          Read 2 data:      	pct reads	num reads 	pct bases	   num bases
                          
                          mapped:          	 99.1143% 	 17680507 	 99.0502% 	  2458368636
                          unambiguous:     	 94.1704% 	 16798575 	 94.1845% 	  2337606274
                          ambiguous:       	  4.9440% 	   881932 	  4.8656% 	   120762362
                          low-Q discards:  	  0.0000% 	        1 	  0.0000% 	          19
                          
                          perfect best site:	 77.7319% 	 13866207 	 76.8600% 	  1907622427
                          semiperfect site:	 77.7424% 	 13868077 	 76.8699% 	  1907865846
                          rescued:         	  0.5123% 	    91390
                          
                          Match Rate:      	      NA 	       NA 	 96.4399% 	  2443690977
                          Error Rate:      	 21.5427% 	  3808851 	  3.5591% 	    90184753
                          Sub Rate:        	 21.1207% 	  3734247 	  0.5632% 	    14271472
                          Del Rate:        	  0.8812% 	   155792 	  2.9809% 	    75532820
                          Ins Rate:        	  0.4759% 	    84144 	  0.0150% 	      380461
                          N Rate:          	  0.0418% 	     7390 	  0.0010% 	       25726

                          Comment


                          • #14
                            Those are very good results. Over 99% of your reads mapped, and of those, almost 80% matched the reference perfectly. Also, over 98% of the reads mapped in properly paired configurations. That's about as good a result as you can hope for on a diploid organism.

                            I will note, however, that the average insert size of "196.85" is lower than ideal for 2x150bp reads. When making the library, you should target more than twice the read length (so, over 300bp in this case) so that you don't waste sequence double-covering the same location with the same molecule. Although, for exon-capture the constraints may be a little different since exons are short. I assume this is exon-capture data? At any rate, when your reads mostly overlap (as in this case), you can generally get better mapping results by correcting them first like this:

                            bbmerge.sh in1=read1.fq in2=read2.fq out1=corrected1.fq out2=corrected2.fq ecco vstrict mix=t merge=f

                            Also note that as long as samtools is installed and in your path, BBMap will output bam files if you name the output with a .bam extension, such as "out=mapped.bam".
                            Last edited by Brian Bushnell; 10-31-2015, 09:55 AM.

                            Comment


                            • #15
                              Yes, this is exon capture data and I do have samtools installed and added to path. Thanks for the tips, I appreciate them. In terms of variant calling what do you recommend for using the aligned data with? Thank you .

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin




                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                04-22-2024, 07:01 AM
                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 08:47 AM
                              0 responses
                              12 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              60 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              59 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              54 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X