Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • BBmap on Ion data

    Hello,
    I tried to run BBmap to call variants on four BAM files (2 parents, 2 children) generated from exome sequencing on an Ion Proton machine (TMAP software).

    The bbmap command was:
    Code:
    /opt/bbmap/callvariants.sh in=IonXpress_009_rawlib.bam,IonXpress_010_rawlib.bam,IonXpress_011_rawlib.bam,IonXpress_012_rawlib.bam ref=./hg19/hg19.fasta out=vars.vcf multisample ploidy=2 prefilter 1> err.txt 2> out.txt &
    I got an error as following (I'm including also the last lines of output):

    Code:
    Finished second pass.
    Writing output.
    Merging [(IonXpress_009_rawlib, individual_IonXpress_009_rawlib.vcf.gz), (IonXpress_010_rawlib, individual_IonXpress_010_rawlib.vcf.gz), (IonXpress_011_rawlib, individual_IonXpress_011_rawlib.vcf.gz), (IonXpress_012_rawlib, individual_IonXpress_012_rawlib.vcf.gz)]
    Exception in thread "main" java.lang.NumberFormatException: For input string: "162,590"
    	at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
    	at sun.misc.FloatingDecimal.parseFloat(FloatingDecimal.java:122)
    	at java.lang.Float.parseFloat(Float.java:451)
    	at var2.MergeSamples.processHeader(MergeSamples.java:233)
    	at var2.MergeSamples.processRow(MergeSamples.java:177)
    	at var2.MergeSamples.mergeFiles(MergeSamples.java:151)
    	at var2.MergeSamples.mergeSamples(MergeSamples.java:117)
    	at var2.CallVariants2.process(CallVariants2.java:380)
    	at var2.CallVariants2.main(CallVariants2.java:52)
    	at var2.CallVariants.main(CallVariants.java:48)
    I unzipped the vcf files and used grep to find "162,590". I got a match at line 9 of the vcf for one individual:
    Code:
    ##readLengthAvg=162,590
    The other individuals also had similar formatting:
    (IonXpress_010)
    Code:
    ##readLengthAvg=160,055
    (IonXpress_011)
    Code:
    ##readLengthAvg=167,107
    (IonXpress_012)
    Code:
    ##readLengthAvg=169,228

    This is the full header for barcode 009:
    Code:
    ##fileformat=VCFv4.2
    ##BBMapVersion=36.92
    ##ploidy=2
    ##rarity=1,00000
    ##minallelefraction=0,10000
    ##reads=43874924
    ##pairedReads=0
    ##properlyPairedReads=0
    ##readLengthAvg=162,590
    ##properPairRate=0,00000
    ##totalQualityAvg=23,525
    ##mapqAvg=77,583
    ##reference=./hg19/hg19.fasta
    ##contig=<ID=chr1,length=249250621>
    ##contig=<ID=chr2,length=243199373>
    ##contig=<ID=chr3,length=198022430>
    ##contig=<ID=chr4,length=191154276>
    ##contig=<ID=chr5,length=180915260>
    ##contig=<ID=chr6,length=171115067>
    ##contig=<ID=chr7,length=159138663>
    ##contig=<ID=chr8,length=146364022>
    ##contig=<ID=chr9,length=141213431>
    ##contig=<ID=chr10,length=135534747>
    ##contig=<ID=chr11,length=135006516>
    ##contig=<ID=chr12,length=133851895>
    ##contig=<ID=chr13,length=115169878>
    ##contig=<ID=chr14,length=107349540>
    ##contig=<ID=chr15,length=102531392>
    ##contig=<ID=chr16,length=90354753>
    ##contig=<ID=chr17,length=81195210>
    ##contig=<ID=chr18,length=78077248>
    ##contig=<ID=chr19,length=59128983>
    ##contig=<ID=chr20,length=63025520>
    ##contig=<ID=chr21,length=48129895>
    ##contig=<ID=chr22,length=51304566>
    ##contig=<ID=chrX,length=155270560>
    ##contig=<ID=chrY,length=59373566>
    ##contig=<ID=chrM,length=16569>
    ##FORMAT=<ID=FAIL,Description="Fail">
    ##FORMAT=<ID=PASS,Description="Pass">
    ##INFO=<ID=SN,Number=1,Type=Integer,Description="Scaffold Number">
    ##INFO=<ID=STA,Number=1,Type=Integer,Description="Start">
    ##INFO=<ID=STO,Number=1,Type=Integer,Description="Stop">
    ##INFO=<ID=TYP,Number=1,Type=String,Description="Type">
    ##INFO=<ID=R1P,Number=1,Type=Integer,Description="Read1 Plus Count">
    ##INFO=<ID=R1M,Number=1,Type=Integer,Description="Read1 Minus Count">
    ##INFO=<ID=R2P,Number=1,Type=Integer,Description="Read2 Plus Count">
    ##INFO=<ID=R2M,Number=1,Type=Integer,Description="Read2 Minus Count">
    ##INFO=<ID=PPC,Number=1,Type=Integer,Description="Paired Count">
    ##INFO=<ID=LS,Number=1,Type=Integer,Description="Length Sum">
    ##INFO=<ID=MQS,Number=1,Type=Integer,Description="MAPQ Sum">
    ##INFO=<ID=MQM,Number=1,Type=Integer,Description="MAPQ Max">
    ##INFO=<ID=BQS,Number=1,Type=Integer,Description="Base Quality Sum">
    ##INFO=<ID=BQM,Number=1,Type=Integer,Description="Base Quality Max">
    ##INFO=<ID=EDS,Number=1,Type=Integer,Description="End Distance Sum">
    ##INFO=<ID=EDM,Number=1,Type=Integer,Description="End Distance Max">
    ##INFO=<ID=IDS,Number=1,Type=Integer,Description="Identity Sum">
    ##INFO=<ID=IDM,Number=1,Type=Integer,Description="Identity Max">
    ##INFO=<ID=COV,Number=1,Type=Integer,Description="Coverage">
    ##INFO=<ID=MCOV,Number=1,Type=Integer,Description="Minus Coverage">
    ##INFO=<ID=CED,Number=1,Type=Integer,Description="Contig End Distance">
    ##INFO=<ID=HMP,Number=1,Type=Integer,Description="Homopolymer Count">
    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
    ##INFO=<ID=AF,Number=1,Type=Float,Description="Allele Fraction">
    ##INFO=<ID=RAF,Number=1,Type=Float,Description="Revised Allele Fraction">
    ##INFO=<ID=DP4,Number=4,Type=Integer,Description="Ref+, Ref-, Alt+, Alt-">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
    ##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Allele Depth">
    ##FORMAT=<ID=AF,Number=1,Type=Float,Description="Allele Fraction">
    ##FORMAT=<ID=SC,Number=1,Type=Float,Description="Score">
    ##FORMAT=<ID=PF,Number=1,Type=String,Description="Pass Filter">
    #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	IonXpress_009_rawlib
    Anything I should have done differently to not cause the error?
    Thank you!

    Comment


    • Oh, that's really strange. It's due to your locale settings, I guess. In the US, decimals are printed like "##readLengthAvg=162.590" rather than "##readLengthAvg=162,590". For some reason Java is using commas as the delimiter when printing the floating point numbers, but not when trying to read floating point numbers, which is extremely irritating.

      This will take a little research to fix (other than you changing your computer's locale to US or something like that); I'll get back to you.

      Comment


      • @r.rosati: See if this helps. https://docs.oracle.com/cd/E23824_01...033/glmha.html

        Comment


        • Originally posted by GenoMax View Post
          Thank you... Indeed I ran it on a ssh session from a computer that had pt_BR as LC_NUMERIC.

          I've read Zhou's comment on indels and I thought I'd first run the callvariants script on a test exome from CEPH 1347-02 DNA.
          Should I realign the reads (I'm reading it's advised for non-BBmap aligned reads?) i.e.
          Code:
          callvariants.sh in=IonXpress_014_rawlib.bam ref=./hg19/hg19.fasta out=CEPH_BBmap_variants_TMAP_BAM_realigned.vcf ploidy=2 realign threads=16 1> err.txt 2> out.txt &

          Comment


          • Yes, I do recommend "realign" for non-BBMap-aligned reads.

            Comment


            • BBSplit / BBMap error

              Hello,
              I'm using BBSplit to remove mouse reads in human samples. My command is the following :
              bbsplit.sh ambiguous=best ambiguous2=toss path=$8 maxindel=900000 refstats=stat_alignment_$samplename.txt threads=4 qtrim=f untrim=f in1=$8/trim_R1_$samplename.fq in2=$8/trim_R2_$samplename.fq ref=$2/$1.fa,$2/Grcm38.fa basename=decontamin_$samplename.%_#.fq outu1=unmapped1_$samplename.fq outu2=unmapped2_$samplename.fq

              Which give me the following message where the parameters used are not what I specified :

              java -Djava.library.path=/home/audrey/bbmap/jni/ -ea -Xmx48113m -cp /home/audrey/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 ambiguous=best ambiguous2=toss path=. maxindel=900000 refstats=stat_alignment_L6.txt threads=4 qtrim=f untrim=f in1=./trim_R1_L6.fq in2=./trim_R2_L6.fq ref=/home/audrey/reference_genomes/GATK/hg38.fa,/home/audrey/reference_genomes/GATK/Grcm38.fa basename=decontamin_L6.%_#.fq outu1=unmapped1_L6.fq outu2=unmapped2_L6.fq
              Executing align2.BBSplitter [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, ambiguous=best, ambiguous2=toss, path=., maxindel=900000, refstats=stat_alignment_L6.txt, threads=4, qtrim=f, untrim=f, in1=./trim_R1_L6.fq, in2=./trim_R2_L6.fq, ref=/home/audrey/reference_genomes/GATK/hg38.fa,/home/audrey/reference_genomes/GATK/Grcm38.fa, basename=decontamin_L6.%_#.fq, outu1=unmapped1_L6.fq, outu2=unmapped2_L6.fq]

              Converted arguments to [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, ambiguous=best, ambiguous2=toss, path=., maxindel=900000, refstats=stat_alignment_L6.txt, threads=4, qtrim=f, untrim=f, in1=./trim_R1_L6.fq, in2=./trim_R2_L6.fq, basename=decontamin_L6.%_#.fq, outu1=unmapped1_L6.fq, outu2=unmapped2_L6.fq, ref_hg38=/home/audrey/reference_genomes/GATK/hg38.fa, ref_Grcm38=/home/audrey/reference_genomes/GATK/Grcm38.fa]
              Creating merged reference file ref/genome/1/merged_ref_6706777893850.fa.gz
              Ref merge time: 112.084 seconds.
              Executing align2.BBMap [ow=t, fastareadlen=500, minhits=1, minratio=0.56, maxindel=20, qtrim=rl, untrim=t, trimq=6, ambiguous=best, ambiguous2=toss, maxindel=900000, refstats=stat_alignment_L6.txt, threads=4, qtrim=f, untrim=f, in1=./trim_R1_L6.fq, in2=./trim_R2_L6.fq, outu1=unmapped1_L6.fq, outu2=unmapped2_L6.fq, ref=ref/genome/1/merged_ref_6706777893850.fa.gz, out_hg38=decontamin_L6.hg38_#.fq, out_Grcm38=decontamin_L6.Grcm38_#.fq]

              BBMap version 36.92
              Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.560
              Reference set statistics will be written to stat_alignment_L6.txt
              Set threads to 4
              Retaining first best site only for ambiguous mappings.
              NOTE: Deleting contents of ref/genome/1 because reference is specified and overwrite=true
              Writing reference.
              Executing dna.FastaToChromArrays2 [ref/genome/1/merged_ref_6706777893850.fa.gz, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]

              Set genScaffoldInfo=true
              Writing chunk 1
              Writing chunk 2
              Writing chunk 3
              Writing chunk 4
              Writing chunk 5
              Writing chunk 6
              Writing chunk 7
              Writing chunk 8
              Writing chunk 9
              Writing chunk 10
              Writing chunk 11
              Writing chunk 12
              Writing chunk 13
              Set genome to 1

              Loaded Reference: 0.005 seconds.
              Loading index for chunk 1-13, build 1
              No index available; generating from reference genome: /media/audrey/26a8cc89-53d7-4f9b-984c-f8d2c0d519fd/Ievgenia/PDX/Exome/test/ref/index/1/chr1-3_index_k13_c2_b1.block
              Indexing threads started for block 0-3
              Indexing threads finished for block 0-3
              No index available; generating from reference genome: /media/audrey/26a8cc89-53d7-4f9b-984c-f8d2c0d519fd/Ievgenia/PDX/Exome/test/ref/index/1/chr4-7_index_k13_c2_b1.block
              Indexing threads started for block 4-7
              Indexing threads finished for block 4-7
              No index available; generating from reference genome: /media/audrey/26a8cc89-53d7-4f9b-984c-f8d2c0d519fd/Ievgenia/PDX/Exome/test/ref/index/1/chr8-11_index_k13_c2_b1.block
              Indexing threads started for block 8-11
              Indexing threads finished for block 8-11
              No index available; generating from reference genome: /media/audrey/26a8cc89-53d7-4f9b-984c-f8d2c0d519fd/Ievgenia/PDX/Exome/test/ref/index/1/chr12-13_index_k13_c2_b1.block
              Indexing threads started for block 12-13
              Indexing threads finished for block 12-13
              Generated Index: 571.883 seconds.
              Analyzed Index: 10.807 seconds.
              Reads that map to multiple references will be considered unmapped.
              Started output stream: 6.057 seconds.
              Creating ref-set statistics table: 0.012 seconds.
              Cleared Memory: 4.121 seconds.
              Processing reads in paired-ended mode.
              Started read stream.
              Started 4 mapping threads.


              Could you please explain me how to set and force the parameters?
              Many thanks in advance.

              Comment


              • Which exact parameters are you referring to?

                Comment


                • Thank you GenoMax for your answer.
                  I let minratio at the default value. So it should be 0.65 as said in bbsplit help. here it's 0.56.
                  And for the parameters : maxindel, qtrim and untrim, they are both set as the default then as I asked in the same command line. Will it be problematic?
                  Many thanks.

                  Comment


                  • I see. @Brian hopefully will be along later today with an official answer but my guess is that it probably has to do with ambigous2=toss option that you have selected.

                    Comment


                    • And for the parameters : maxindel, qtrim and untrim, they are both set as the default then as I asked in the same command line. Will it be problematic?
                      Later parameters override earlier parameters, so that's not a problem.

                      As for 0.65 versus 0.56, that looks like either a typo, or else I changed the default at some point but forgot to update the help info; I'll fix that. Thanks! In the meantime you can always manually add "minratio=0.65" to override it.

                      Edit: To clarify, 0.56 is the intended behavior, and the description is in error.
                      Last edited by Brian Bushnell; 02-13-2017, 10:18 AM.

                      Comment


                      • Thank you very much GenoMax and Brian for your helpful answers.

                        Comment


                        • Converting a coverage file into a csv format for mmgenome

                          Hello there,

                          Is there a way I could convert the mapping files that bbmap produced into csv coverage file that can be used for the mmgenome software (link: http://madsalbertsen.github.io/mmgen...ad_data.html)?

                          Thanks!

                          Comment


                          • BBMap can directly output coverage using the "covstats" or "basecov" flags, or it can be generated from sam files with pileup.sh. Can you post a few lines from one of these mmgenome csv files so I can see what the format looks like?

                            Comment


                            • Dear Brian,

                              Thanks for your reply.

                              This is from one of the mapping files (.csv) for mmgenome.

                              tanshiming@S620100019205:~/Downloads$ head C13.11.14.csv
                              "Name","Consensus length","Total read count","Single reads","Reads in pairs","Average coverage","Reference sequence","Reference length","Reference common name","Reference Latin name"
                              "1 mapping","5621","103","19","84","1.443","1","8264","",""
                              "2 mapping","370","6","2","4","0.625","2","1027","",""
                              "3 mapping","1665","198","52","146","13.536","3","1665","",""
                              "4 mapping","94","1","1","0","0.01","4","9056","",""
                              "5 mapping","3042","95","19","76","3.205","5","3343","",""
                              "6 mapping","901","8","2","6","9.663E-3","6","98207","",""
                              "7 mapping","5788","147","11","136","2.612","7","6480","",""
                              "8 mapping","14602","381","47","334","2.782","8","15790","",""
                              "9 mapping","1403","1056","282","774","85.135","9","1403","",""
                              tanshiming@S620100019205:~/Downloads$

                              Comment


                              • Hello,

                                I am new to this forum so I apologize if I posted in the wrong place.

                                Currently I am trying to assemble a single cell genome from some Illumina Hiseq 2x250 bp reads.
                                One of my first steps is to do normalization of the data. Previously I was using bbnorm.sh because it's fast and very good. However, with this data I cannot use bbnorm.sh on our server due to memory issues. (I have around 100 million PE reads).
                                So I tried to use bbnorm on a cluster where I allocated 16 cores and 700GB of RAM.

                                I run the software using this command line:
                                bbnorm.sh in1=st1c_R1.fastq.gz in2=st1c_R2.fastq.gz out1=st1c_R1_norm.fastq.gz out2=st1c_R2_norm.fastq.gz threads=12 target=60 mindepth=2

                                The software runs fine, but the process will be killed by the scheduler because of high CPU usage.

                                =>> PBS: job killed: ncpus 59.52 exceeded limit 16 (sum)
                                I tried to increase the number of threads requested to 30, lower the number of threads in the command line to 10, but still I have the same problem.

                                The cluster has installed BBMap version 36.92. It looks like for some reason the number of threads which I define is not taken into account, even if during the run it tells me that the number of threads was set to 12.

                                Settings:
                                threads: 12
                                k: 31
                                deterministic: true
                                toss error reads: false
                                passes: 1
                                bits per cell: 16
                                cells: 145.37B
                                hashes: 3
                                base min quality: 5
                                kmer min prob: 0.5

                                target depth: 240
                                min depth: 2
                                max depth: 300
                                min good kmers: 15
                                depth percentile: 64.8
                                ignore dupe kmers: true
                                fix spikes: false
                                Any suggestions? I really cannot ask for the entire node just for me.
                                I understand that due to other components like pigz, it can use more than 12 cores, but here it looks like uses 60 cores, or all available cores.

                                Thank you,
                                Sebastian

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM
                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 06:37 PM
                                0 responses
                                10 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 06:07 PM
                                0 responses
                                9 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2024, 10:03 AM
                                0 responses
                                49 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-21-2024, 07:32 AM
                                0 responses
                                67 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X