Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging paired ends fastq files with BBMerge

    This thread comes from http://seqanswers.com/forums/showthread.php?t=6820 but I prefer to start a new one becase there it was a little off topic.

    I have two samples, from 2 different patients with the same diagnosed illness, procesed with the same protocol. Each sample is composed by a pair of files that I must merge. With one sample I get some merging but with the other I get almost nothing.

    Let me show you results:

    HIGH MERGING case

    java -Djava.library.path=/home/carlos/BBMap/bbmap/jni/ -ea -Xmx1000m -cp /home/carlos/BBMap/bbmap/current/ jgi.BBMerge in1=high_merge_R1.fq in2=high_merge_R2.fq out=merged.fastq outu1=unmerged1.fastq outu2=unmerged2.fastq
    Executing jgi.BBMerge [in1=high_merge_R1.fq, in2=high_merge_R2.fq, out=merged.fastq, outu1=unmerged1.fastq, outu2=unmerged2.fastq]

    BBMerge version 8.8
    Writing mergable reads merged.
    Started output threads.
    Total time: 14,914 seconds.

    Pairs: 141970
    Joined: 83109 58,540%
    Ambiguous: 50033 35,242%
    No Solution: 8828 6,218%
    Too Short: 0 0,000%

    Avg Insert: 156,0
    Standard Deviation: 18,5
    Mode: 149

    Insert range: 63 - 291
    90th percentile: 197
    75th percentile: 152
    50th percentile: 150
    25th percentile: 148
    10th percentile: 145

    LOW MERGING case

    java -Djava.library.path=/home/carlos/BBMap/bbmap/jni/ -ea -Xmx1000m -cp /home/carlos/BBMap/bbmap/current/ jgi.BBMerge in1=low_merge_R1.fq in2=low_merge_R2.fq out=merged.fastq outu1=unmerged1.fastq outu2=unmerged2.fastq
    Executing jgi.BBMerge [in1=low_merge_R1.fq, in2=low_merge_R2.fq, out=merged.fastq, outu1=unmerged1.fastq, outu2=unmerged2.fastq]

    BBMerge version 8.8
    Writing mergable reads merged.
    Started output threads.
    Total time: 49,182 seconds.

    Pairs: 548979
    Joined: 140 0,026%
    Ambiguous: 548807 99,969%
    No Solution: 32 0,006%
    Too Short: 0 0,000%

    Avg Insert: 122,8
    Standard Deviation: 64,3
    Mode: 126

    Insert range: 36 - 592
    90th percentile: 193
    75th percentile: 156
    50th percentile: 126
    25th percentile: 63
    10th percentile: 49
    Last edited by bi_maniac; 11-02-2015, 11:32 AM.

  • #2
    Was the fragment selection/experiment done in such as way that you expect the two reads to overlap?

    Comment


    • #3
      histogram files
      Attached Files

      Comment


      • #4
        Originally posted by GenoMax View Post
        Was the fragment selection/experiment done in such as way that you expect the two reads to overlap?
        YES

        By the way, it was done with an amplicon library and expected insert size is 300-370

        Comment


        • #5
          I don't know the full story here but is the result telling you something useful? Purely speculating: e.g. there is an insertion in "low merge" case such that the reads no longer overlap?

          If you map the reads to the reference in both cases what do you see?

          Comment


          • #6
            Relevant post from previous thread:

            "
            Currently I have only been able to perform this task with BBmerge, mergePairs.py and a custom software developed by myself.

            I consistently get low percentages of merging: about 20% when limiting quality of overlap above 90%.

            Is this normal? I assumed that merging would be much above 50%.
            "

            Comment


            • #7
              Originally posted by GenoMax View Post
              I don't know the full story here but is the result telling you something useful? Purely speculating: e.g. there is an insertion in "low merge" case such that the reads no longer overlap?

              If you map the reads to the reference in both cases what do you see?
              Excuse my ignorance. Which command you advice me to use in order to map to a reference? I assume there is a possibility to do it with BBMap suite.

              In this case I know it is chromosome 14 and I downloaded full hg19.

              Comment


              • #8
                You clearly have a weird bimodal distribution with sharp peaks at ~150 and ~200. But if you expect an insert size of 300-370bp, you should not plan on merging because they won't overlap! But please note that "insert size" means different things to different people. I use it to mean "The length of the genomic portion that is sequenced, including the unsequenced middle, if any". But some people also include adapters and primers so they get bigger numbers that may be more relevant to size selection, but are less relevant to downstream analysis.

                I suspect your true distribution is at least trimodal (since even in the "high" case a lot don't overlap), and the highest mode is at least 300bp and therefore won't show up. Perhaps there are homologous locations in the organism that the primers are binding to - one is ~150bp long, one is ~200bp long, and the last one is the one that's actually of interest, if you expect it to be over 300bp?

                You might want to clarify with the people that made the libraries or designed the primers exactly what they mean by 300-370bp, and why they expect them to overlap. For example, a diagram showing the position of the genomic sequence, primers, and so forth, clearly labelled with how long each one is.

                Comment


                • #9
                  Originally posted by bi_maniac View Post
                  Excuse my ignorance. Which command you advice me to use in order to map to a reference? I assume there is a possibility to do it with BBMap suite.

                  In this case I know it is chromosome 14 and I downloaded full hg19.
                  Yes, that's the best option at this point.

                  bbmap.sh in1=r1.fq in2=r2.fq ref=hg19.fa ihist=ihist_mapping.txt reads=1m

                  Doing with only chr14 is a possibility but I'd recommend the whole thing.

                  Comment


                  • #10
                    @bi_maniac: Take the sequence of the primers and do a quick search over at UCSC: http://genome.ucsc.edu/cgi-bin/hgPcr?db=hg38 to see if you pull up more than one unique product. That may account for the multiple products @Brian thinks you have in your sample.

                    Comment


                    • #11
                      OK give me a while, please!

                      Comment


                      • #12
                        Change the genome build accordingly if the primers were designed using hg19. Default genome build at UCSC is now GRCh38 for the tool above.

                        Comment


                        • #13
                          Execution of BBMap with high_merge case and with chr14.fa as reference (my current computer has only 4GB of memory and was unable to process whole genome).

                          Code:
                          java -Djava.library.path=/home/carlos/BBMap/bbmap/jni/ -ea -Xmx3g -cp /home/carlos/BBMap/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 -Xms2g -Xmx3g in1=high_merge_R1.fq in2=high_merge_R2.fq ref=/home/carlos/HG19/chr14.fa ihist=ihist_mapping.txt
                          Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, -Xms2g, -Xmx3g, in1=high_merge_R1.fq, in2=high_merge_R2.fq, ref=/home/carlos/HG19/chr14.fa, ihist=ihist_mapping.txt]
                          
                          BBMap version 35.43
                          Set insert size histogram output to ihist_mapping.txt
                          Retaining first best site only for ambiguous mappings.
                          No output file.
                          NOTE:	Deleting contents of ref/genome/1 because reference is specified and overwrite=true
                          Writing reference.
                          Executing dna.FastaToChromArrays2 [/home/carlos/HG19/chr14.fa, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]
                          
                          Set genScaffoldInfo=true
                          Writing chunk 1
                          Set genome to 1
                          
                          Loaded Reference:	0,332 seconds.
                          Loading index for chunk 1-1, build 1
                          No index available; generating from reference genome: /home/carlos/DATA/IG-EMR_080915/prba/ref/index/1/chr1_index_k13_c4_b1.block
                          Indexing threads started for block 0-1
                          Indexing threads finished for block 0-1
                          Generated Index:	68,469 seconds.
                          Analyzed Index:   	11,051 seconds.
                          Cleared Memory:    	0,703 seconds.
                          Processing reads in paired-ended mode.
                          Started read stream.
                          Started 2 mapping threads.
                          Detecting finished threads: 0, 1
                          
                             ------------------   Results   ------------------   
                          
                          Genome:                	1
                          Key Length:            	13
                          Max Indel:             	16000
                          Minimum Score Ratio:  	0.56
                          Mapping Mode:         	normal
                          Reads Used:           	283940	(60393910 bases)
                          
                          Mapping:          	920,090 seconds.
                          Reads/sec:       	308,60
                          kBases/sec:      	65,64
                          
                          
                          Pairing data:   	pct reads	num reads 	pct bases	   num bases
                          
                          mated pairs:     	  0,0035% 	        5 	  0,0035% 	        2084
                          bad pairs:       	  0,0000% 	        0 	  0,0000% 	           0
                          insert size avg: 	  292,20
                          insert 25th %:   	  147,00
                          insert median:   	  167,00
                          insert 75th %:   	  229,00
                          insert std dev:  	  122,63
                          insert mode:     	  167
                          
                          
                          Read 1 data:      	pct reads	num reads 	pct bases	   num bases
                          
                          mapped:          	  0,0035% 	        5 	  0,0035% 	        1042
                          unambiguous:     	  0,0028% 	        4 	  0,0030% 	         895
                          ambiguous:       	  0,0007% 	        1 	  0,0005% 	         147
                          low-Q discards:  	  0,0183% 	       26 	  0,0031% 	         910
                          
                          perfect best site:	  0,0000% 	        0 	  0,0000% 	           0
                          semiperfect site:	  0,0000% 	        0 	  0,0000% 	           0
                          rescued:         	  0,0007% 	        1
                          
                          Match Rate:      	      NA 	       NA 	  3,8966% 	         927
                          Error Rate:      	100,0000% 	        5 	 96,1034% 	       22863
                          Sub Rate:        	100,0000% 	        5 	  0,4834% 	         115
                          Del Rate:        	 20,0000% 	        1 	 95,6200% 	       22748
                          Ins Rate:        	  0,0000% 	        0 	  0,0000% 	           0
                          N Rate:          	  0,0000% 	        0 	  0,0000% 	           0
                          
                          
                          Read 2 data:      	pct reads	num reads 	pct bases	   num bases
                          
                          mapped:          	  0,0035% 	        5 	  0,0037% 	        1156
                          unambiguous:     	  0,0035% 	        5 	  0,0037% 	        1156
                          ambiguous:       	  0,0000% 	        0 	  0,0000% 	           0
                          low-Q discards:  	  0,0190% 	       27 	  0,0039% 	        1211
                          
                          perfect best site:	  0,0000% 	        0 	  0,0000% 	           0
                          semiperfect site:	  0,0000% 	        0 	  0,0000% 	           0
                          rescued:         	  0,0000% 	        0
                          
                          Match Rate:      	      NA 	       NA 	  3,6493% 	         988
                          Error Rate:      	100,0000% 	        5 	 96,3507% 	       26086
                          Sub Rate:        	100,0000% 	        5 	  0,4026% 	         109
                          Del Rate:        	 60,0000% 	        3 	 95,7302% 	       25918
                          Ins Rate:        	 40,0000% 	        2 	  0,2179% 	          59
                          N Rate:          	  0,0000% 	        0 	  0,0000% 	           0
                          
                          Total time:     	1006,450 seconds.
                          
                          [B]ihist_high_mapping.txt[/B]
                          
                          Mean	253,600
                          #Median	167
                          #Mode	167
                          #STDev	122,631
                          #PercentOfPairs	0,004
                          #InsertSize	Count
                          147	1
                          167	1
                          229	1
                          236	1
                          489	1
                          Last edited by GenoMax; 11-02-2015, 02:03 PM. Reason: Added [CODE] tags to improve display

                          Comment


                          • #14
                            This isn't looking good as you have probably figured out by now. The reads you have do not seem to map to the intended target on chr 14.

                            Comment


                            • #15
                              It looks like you may not have what you think you have. I recommend BLASTing some of those reads against nt to see if you get hits. They may not be human, or at least not chromosome 14.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              24 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              19 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              50 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X