Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Originally posted by skatrinli View Post
    Hey Brian,

    Thanks for this great tool but I could not properly download it. I installed the latest version 37.25 but when i try to test the installation with the command
    $ (installation directory)/stats.sh in=(installation directory)/resources/phix174_ill.ref.fa.gz
    it comes this error:
    Exception in thread "main" java.lang.RuntimeException: Unknown parameter Downloads/bbmap/resources/phix174_ill.ref.fa.gz
    at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
    at jgi.AssemblyStats2.main(AssemblyStats2.java:39)

    What did I do wrong?
    You can't have spaces in the filenames without specific countermeasures like quotes. For example:

    Code:
    stats.sh in=foo bar.fa
    Exception in thread "main" java.lang.RuntimeException: Unknown parameter bar.fa
            at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
            at jgi.AssemblyStats2.main(AssemblyStats2.java:39)
    That doesn't work...
    Code:
    stats.sh in="foo bar.fa"
    Exception in thread "main" java.lang.RuntimeException: Unknown parameter bar.fa
            at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
            at jgi.AssemblyStats2.main(AssemblyStats2.java:39)
    That doesn't work either.

    Code:
    stats.sh in="foo\ bar.fa"
    A       C       G       T       N       IUPAC   Other   GC      GC_stdev
    NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     0.0000
    
    Main genome scaffold total:             0
    Main genome contig total:               0
    Main genome scaffold sequence total:    0.000 MB
    Main genome contig sequence total:      0.000 MB        NaN% gap
    Main genome scaffold N/L50:             0/0
    Main genome contig N/L50:               0/0
    Main genome scaffold N/L90:             0/0
    Main genome contig N/L90:               0/0
    Max scaffold length:                    0
    Max contig length:                      0
    Number of scaffolds > 50 KB:            0
    % main genome in scaffolds > 50 KB:     0.00%
    
    
    Minimum         Number          Number          Total           Total           Scaffold
    Scaffold        of              of              Scaffold        Contig          Contig
    Length          Scaffolds       Contigs         Length          Length          Coverage
    --------        --------------  --------------  --------------  --------------  --------
    That does work (though I ran it on an empty file).

    The exact way to deal with spaces is system-specific. In the normal Windows shell you can just use quotes; in Linux bash it looks like you need quotes and an escape character (backslash); in Windows under a Linux emulator I'm not entirely sure. The easiest thing to do is to put files in a path that does not have any spaces (so, not in My Documents, but in C:\data\ or something like that.)

    Comment


    • Hi Brian,

      I'm using reformat.sh to play around with some Nanopore reads. Is there any way to get the histograms (e.g. mhist, qhist, bhist) to track longer reads, like the `max` parameter in readlength.sh?

      Thanks,

      Tom

      Comment


      • Originally posted by TomHarrop View Post
        Hi Brian,

        I'm using reformat.sh to play around with some Nanopore reads. Is there any way to get the histograms (e.g. mhist, qhist, bhist) to track longer reads, like the `max` parameter in readlength.sh?

        Thanks,

        Tom
        Sorry, the max lengths are currently constants. I'll add support for changing them. Generally I didn't find them all that useful for variable-length reads since I kind of designed them to find position-related anomalies, but it's fairly easy to change.

        Comment


        • Re: reformat.sh

          Is there now, or could there be in the future, be a way to specify multiple sets of paired read inputs (ie. different libraries) which could be randomly sampled and output to a single FASTQ file?

          Ie) in=FASTQ_A_R1,FASTQ_B_R1,FASTQ_C_R1 in2=FASTQ_A_R2,FASTQ_B_R2,FASTQ_C_R2 out=Subset_R1.fastq out2=Subset_R2.fastq

          Can do it via a previous cat command (A+B+C -> reformat) but with large files cat can be an I/O issue.

          Best and Thanks,
          Bob

          Comment


          • Originally posted by jazz710 View Post
            Re: reformat.sh

            Is there now, or could there be in the future, be a way to specify multiple sets of paired read inputs (ie. different libraries) which could be randomly sampled and output to a single FASTQ file?

            Ie) in=FASTQ_A_R1,FASTQ_B_R1,FASTQ_C_R1 in2=FASTQ_A_R2,FASTQ_B_R2,FASTQ_C_R2 out=Subset_R1.fastq out2=Subset_R2.fastq

            Can do it via a previous cat command (A+B+C -> reformat) but with large files cat can be an I/O issue.

            Best and Thanks,
            Bob
            I take a subset from each file and then cat the subsets together. That would be faster than combining the inputs with cat, at least. If the subset size is a significant fraction of most of the files it would still be slow, but if you are just collecting a small fraction of reads it is fast.
            Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

            Comment


            • Originally posted by SNPsaurus View Post
              I take a subset from each file and then cat the subsets together. That would be faster than combining the inputs with cat, at least.
              True, this works well. Alternatively, for interleaved files, you can do this:

              Code:
              cat a.fq b.fq c.fq | reformat.sh in=stdin.fq out=sampled.fq interleaved samplerate=0.1
              ...which avoids writing temp files. Won't work for twin paired files, though, unless you do something tricky with named pipes.

              To avoid wasting disk space and bandwidth, I normally keep all fastq files gzipped at all times. Using pigz for parallel compression, or compression level 2 (zl=2 flag), will eliminate much of the speed penalty from dealing with compressed files; and if you are I/O limited, compressed files tend to speed things up.

              I don't have any plans at present to add multiple input file support (or wildcard support) to Reformat, but I'll put it on my list and consider it. It's something I've occasionally wanted also.

              Comment


              • Can BBmap remove reads containing homopolymers?

                Hi BBmap-ers,

                Say I want to remove reads with more than 5 consecutive identical nucleotides. Is there a homopolymer/ polyX filtering option with BBDuk or reformat.sh?

                Thanks!

                Tom

                Comment


                • There is no homopolymer filter, per se, but you can accomplish that like this:

                  bbduk.sh in=reads.fq out=clean.fq literal=AAAAAA,CCCCCC,GGGGGG,TTTTTT k=6 mm=f

                  Comment


                  • Great, thanks!

                    Comment


                    • Hi Brian-

                      1. We are currently using bedtools BAM to BED and then extending reads for chip visualization. Does BBMAP suite has an option to extend reads in BAM based on the fragment length estimate from MACS or automatically from BAM?

                      2. Is there an option/tool to remove chip duplicates based on the read ID in BAM instead of co-ordinates?

                      Comment


                      • 1) No, it does not. Tadpole can extend reads based on kmer-counts but it does not make use of mapping information.

                        2) Clumpify can remove optical duplicate based on a combination of sequence (which indicates whether they are duplicate) and position from the read ID (which indicates if they are very close). However, it cannot handle bam files, only fasta and fastq.

                        -Brian

                        Comment


                        • Comprehensive reporting of mapped reads

                          Hi Brian,

                          I am trying to map 100bp Illumina sequences to a collection of very similar, 80bp reference sequences (multiple alleles of a single exon) using bbmap.

                          When mapping to a single reference sequence from the collection in isolation using settings:

                          'ambiguous=all',
                          'maxsites=1000000',
                          'vslow',
                          'subfilter=0'

                          An expected number of sequences (in this case, ~270) map and fully cover the reference sequence.

                          When using the same parameters and mapping to a collection of sequences containing the same reference sequence only ~120 reads map.

                          Is there another parameter(s) I need to be setting to map my reads exhaustively against all reference sequences?

                          Thanks,

                          dave

                          Comment


                          • Hi Dave,

                            BBMap has some heuristics that may make it non-ideal for the situation with a large number of near-identical sequences, particularly when the reads don't map glocally well to any of them (because the reads are longer than the reference sequences) and the reference sequences are tiny. You might also try the flag "excludefraction=0" to see if this changes anything. To clarify, there are perfect alignments (with zero mismatches) that are getting missed, correct?

                            Many of the heuristics related to ignoring extremely common, uninformative reference kmers are disabled in bbmapskimmer.sh, which is designed specifically for a high degree of multimapping. The syntax is the same as BBMap, so please give that a try and let me know if it works better. You'll need to additionally add the flag "minscaf=1" or really short scaffolds get discarded, so something like:

                            Code:
                            bbmapskimmer.sh in=reads.fq ref=ref.fa maxsites=1000000 vslow ambig=all maxindel=0 subfilter=0 excludefraction=0 out=mapped.sam minscaf=1
                            Please let me know whether that changes the situation.

                            -Brian

                            Comment


                            • Hi Brian,

                              No, unfortunately it didn't help. I think you understand the question correctly:

                              When I map 100bp reads against a single allele of a gene with four exons using:

                              Code:
                              maxsites=1000000 vslow ambig=all maxindel=0 subfilter=0 excludefraction=0 out=mapped.sam minscaf=1 covstats=stdout | grep 'IPD0001613'
                              I get 100% read support for all four exons.

                              Code:
                              Mafa-DPA1*07:02|IPD0001613_2_MHC-II-DPA	288.0902	244	0.5000	100.0000	244	421	388	0.4788	297	35.43
                              Mafa-DPA1*07:02|IPD0001613_3_MHC-II-DPA	235.8719	281	0.5658	100.0000	281	408	343	0.5554	247	40.92
                              Mafa-DPA1*07:02|IPD0001613_4_MHC-II-DPA	218.1935	155	0.6323	100.0000	155	227	220	0.5913	228	27.77
                              Mafa-DPA1*07:02|IPD0001613_1_MHC-II-DPA	199.3457	81	0.5679	100.0000	81	147	111	0.5713	200	35.48
                              When I use the same parameters but include the allele in a larger reference sequence that contains many other alleles, the number of mapped reads decreases for all four exons and no longer fully covers exon 1:

                              Code:
                              Mafa-DPA1*07:02|IPD0001613_2_MHC-II-DPA	242.6475	244	0.5000	100.0000	244	343	327	0.4746	233	43.23
                              Mafa-DPA1*07:02|IPD0001613_3_MHC-II-DPA	58.2847	281	0.5658	100.0000	281	81	93	0.5509	63	19.81
                              Mafa-DPA1*07:02|IPD0001613_4_MHC-II-DPA	194.1226	155	0.6323	100.0000	155	201	181	0.5935	204	33.17
                              Mafa-DPA1*07:02|IPD0001613_1_MHC-II-DPA	38.3333	81	0.5679	97.5309	79	34	17	0.6038	51	17.89
                              Any other thoughts?

                              Thanks,

                              dave

                              Comment


                              • Originally posted by Brian Bushnell View Post
                                1) No, it does not. Tadpole can extend reads based on kmer-counts but it does not make use of mapping information.

                                2) Clumpify can remove optical duplicate based on a combination of sequence (which indicates whether they are duplicate) and position from the read ID (which indicates if they are very close). However, it cannot handle bam files, only fasta and fastq.

                                -Brian
                                Thanks Brian.

                                How about BED?

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Advancing Precision Medicine for Rare Diseases in Children
                                  by seqadmin




                                  Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                                  12-16-2024, 07:57 AM
                                • seqadmin
                                  Recent Advances in Sequencing Technologies
                                  by seqadmin



                                  Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                                  Long-Read Sequencing
                                  Long-read sequencing has seen remarkable advancements,...
                                  12-02-2024, 01:49 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 12-17-2024, 10:28 AM
                                0 responses
                                26 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 12-13-2024, 08:24 AM
                                0 responses
                                42 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 12-12-2024, 07:41 AM
                                0 responses
                                28 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 12-11-2024, 07:45 AM
                                0 responses
                                42 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X