Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Originally posted by skatrinli View Post
    Hey Brian,

    Thanks for this great tool but I could not properly download it. I installed the latest version 37.25 but when i try to test the installation with the command
    $ (installation directory)/stats.sh in=(installation directory)/resources/phix174_ill.ref.fa.gz
    it comes this error:
    Exception in thread "main" java.lang.RuntimeException: Unknown parameter Downloads/bbmap/resources/phix174_ill.ref.fa.gz
    at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
    at jgi.AssemblyStats2.main(AssemblyStats2.java:39)

    What did I do wrong?
    You can't have spaces in the filenames without specific countermeasures like quotes. For example:

    Code:
    stats.sh in=foo bar.fa
    Exception in thread "main" java.lang.RuntimeException: Unknown parameter bar.fa
            at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
            at jgi.AssemblyStats2.main(AssemblyStats2.java:39)
    That doesn't work...
    Code:
    stats.sh in="foo bar.fa"
    Exception in thread "main" java.lang.RuntimeException: Unknown parameter bar.fa
            at jgi.AssemblyStats2.<init>(AssemblyStats2.java:166)
            at jgi.AssemblyStats2.main(AssemblyStats2.java:39)
    That doesn't work either.

    Code:
    stats.sh in="foo\ bar.fa"
    A       C       G       T       N       IUPAC   Other   GC      GC_stdev
    NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     0.0000
    
    Main genome scaffold total:             0
    Main genome contig total:               0
    Main genome scaffold sequence total:    0.000 MB
    Main genome contig sequence total:      0.000 MB        NaN% gap
    Main genome scaffold N/L50:             0/0
    Main genome contig N/L50:               0/0
    Main genome scaffold N/L90:             0/0
    Main genome contig N/L90:               0/0
    Max scaffold length:                    0
    Max contig length:                      0
    Number of scaffolds > 50 KB:            0
    % main genome in scaffolds > 50 KB:     0.00%
    
    
    Minimum         Number          Number          Total           Total           Scaffold
    Scaffold        of              of              Scaffold        Contig          Contig
    Length          Scaffolds       Contigs         Length          Length          Coverage
    --------        --------------  --------------  --------------  --------------  --------
    That does work (though I ran it on an empty file).

    The exact way to deal with spaces is system-specific. In the normal Windows shell you can just use quotes; in Linux bash it looks like you need quotes and an escape character (backslash); in Windows under a Linux emulator I'm not entirely sure. The easiest thing to do is to put files in a path that does not have any spaces (so, not in My Documents, but in C:\data\ or something like that.)

    Comment


    • Hi Brian,

      I'm using reformat.sh to play around with some Nanopore reads. Is there any way to get the histograms (e.g. mhist, qhist, bhist) to track longer reads, like the `max` parameter in readlength.sh?

      Thanks,

      Tom

      Comment


      • Originally posted by TomHarrop View Post
        Hi Brian,

        I'm using reformat.sh to play around with some Nanopore reads. Is there any way to get the histograms (e.g. mhist, qhist, bhist) to track longer reads, like the `max` parameter in readlength.sh?

        Thanks,

        Tom
        Sorry, the max lengths are currently constants. I'll add support for changing them. Generally I didn't find them all that useful for variable-length reads since I kind of designed them to find position-related anomalies, but it's fairly easy to change.

        Comment


        • Re: reformat.sh

          Is there now, or could there be in the future, be a way to specify multiple sets of paired read inputs (ie. different libraries) which could be randomly sampled and output to a single FASTQ file?

          Ie) in=FASTQ_A_R1,FASTQ_B_R1,FASTQ_C_R1 in2=FASTQ_A_R2,FASTQ_B_R2,FASTQ_C_R2 out=Subset_R1.fastq out2=Subset_R2.fastq

          Can do it via a previous cat command (A+B+C -> reformat) but with large files cat can be an I/O issue.

          Best and Thanks,
          Bob

          Comment


          • Originally posted by jazz710 View Post
            Re: reformat.sh

            Is there now, or could there be in the future, be a way to specify multiple sets of paired read inputs (ie. different libraries) which could be randomly sampled and output to a single FASTQ file?

            Ie) in=FASTQ_A_R1,FASTQ_B_R1,FASTQ_C_R1 in2=FASTQ_A_R2,FASTQ_B_R2,FASTQ_C_R2 out=Subset_R1.fastq out2=Subset_R2.fastq

            Can do it via a previous cat command (A+B+C -> reformat) but with large files cat can be an I/O issue.

            Best and Thanks,
            Bob
            I take a subset from each file and then cat the subsets together. That would be faster than combining the inputs with cat, at least. If the subset size is a significant fraction of most of the files it would still be slow, but if you are just collecting a small fraction of reads it is fast.
            Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com

            Comment


            • Originally posted by SNPsaurus View Post
              I take a subset from each file and then cat the subsets together. That would be faster than combining the inputs with cat, at least.
              True, this works well. Alternatively, for interleaved files, you can do this:

              Code:
              cat a.fq b.fq c.fq | reformat.sh in=stdin.fq out=sampled.fq interleaved samplerate=0.1
              ...which avoids writing temp files. Won't work for twin paired files, though, unless you do something tricky with named pipes.

              To avoid wasting disk space and bandwidth, I normally keep all fastq files gzipped at all times. Using pigz for parallel compression, or compression level 2 (zl=2 flag), will eliminate much of the speed penalty from dealing with compressed files; and if you are I/O limited, compressed files tend to speed things up.

              I don't have any plans at present to add multiple input file support (or wildcard support) to Reformat, but I'll put it on my list and consider it. It's something I've occasionally wanted also.

              Comment


              • Can BBmap remove reads containing homopolymers?

                Hi BBmap-ers,

                Say I want to remove reads with more than 5 consecutive identical nucleotides. Is there a homopolymer/ polyX filtering option with BBDuk or reformat.sh?

                Thanks!

                Tom

                Comment


                • There is no homopolymer filter, per se, but you can accomplish that like this:

                  bbduk.sh in=reads.fq out=clean.fq literal=AAAAAA,CCCCCC,GGGGGG,TTTTTT k=6 mm=f

                  Comment


                  • Great, thanks!

                    Comment


                    • Hi Brian-

                      1. We are currently using bedtools BAM to BED and then extending reads for chip visualization. Does BBMAP suite has an option to extend reads in BAM based on the fragment length estimate from MACS or automatically from BAM?

                      2. Is there an option/tool to remove chip duplicates based on the read ID in BAM instead of co-ordinates?

                      Comment


                      • 1) No, it does not. Tadpole can extend reads based on kmer-counts but it does not make use of mapping information.

                        2) Clumpify can remove optical duplicate based on a combination of sequence (which indicates whether they are duplicate) and position from the read ID (which indicates if they are very close). However, it cannot handle bam files, only fasta and fastq.

                        -Brian

                        Comment


                        • Comprehensive reporting of mapped reads

                          Hi Brian,

                          I am trying to map 100bp Illumina sequences to a collection of very similar, 80bp reference sequences (multiple alleles of a single exon) using bbmap.

                          When mapping to a single reference sequence from the collection in isolation using settings:

                          'ambiguous=all',
                          'maxsites=1000000',
                          'vslow',
                          'subfilter=0'

                          An expected number of sequences (in this case, ~270) map and fully cover the reference sequence.

                          When using the same parameters and mapping to a collection of sequences containing the same reference sequence only ~120 reads map.

                          Is there another parameter(s) I need to be setting to map my reads exhaustively against all reference sequences?

                          Thanks,

                          dave

                          Comment


                          • Hi Dave,

                            BBMap has some heuristics that may make it non-ideal for the situation with a large number of near-identical sequences, particularly when the reads don't map glocally well to any of them (because the reads are longer than the reference sequences) and the reference sequences are tiny. You might also try the flag "excludefraction=0" to see if this changes anything. To clarify, there are perfect alignments (with zero mismatches) that are getting missed, correct?

                            Many of the heuristics related to ignoring extremely common, uninformative reference kmers are disabled in bbmapskimmer.sh, which is designed specifically for a high degree of multimapping. The syntax is the same as BBMap, so please give that a try and let me know if it works better. You'll need to additionally add the flag "minscaf=1" or really short scaffolds get discarded, so something like:

                            Code:
                            bbmapskimmer.sh in=reads.fq ref=ref.fa maxsites=1000000 vslow ambig=all maxindel=0 subfilter=0 excludefraction=0 out=mapped.sam minscaf=1
                            Please let me know whether that changes the situation.

                            -Brian

                            Comment


                            • Hi Brian,

                              No, unfortunately it didn't help. I think you understand the question correctly:

                              When I map 100bp reads against a single allele of a gene with four exons using:

                              Code:
                              maxsites=1000000 vslow ambig=all maxindel=0 subfilter=0 excludefraction=0 out=mapped.sam minscaf=1 covstats=stdout | grep 'IPD0001613'
                              I get 100% read support for all four exons.

                              Code:
                              Mafa-DPA1*07:02|IPD0001613_2_MHC-II-DPA	288.0902	244	0.5000	100.0000	244	421	388	0.4788	297	35.43
                              Mafa-DPA1*07:02|IPD0001613_3_MHC-II-DPA	235.8719	281	0.5658	100.0000	281	408	343	0.5554	247	40.92
                              Mafa-DPA1*07:02|IPD0001613_4_MHC-II-DPA	218.1935	155	0.6323	100.0000	155	227	220	0.5913	228	27.77
                              Mafa-DPA1*07:02|IPD0001613_1_MHC-II-DPA	199.3457	81	0.5679	100.0000	81	147	111	0.5713	200	35.48
                              When I use the same parameters but include the allele in a larger reference sequence that contains many other alleles, the number of mapped reads decreases for all four exons and no longer fully covers exon 1:

                              Code:
                              Mafa-DPA1*07:02|IPD0001613_2_MHC-II-DPA	242.6475	244	0.5000	100.0000	244	343	327	0.4746	233	43.23
                              Mafa-DPA1*07:02|IPD0001613_3_MHC-II-DPA	58.2847	281	0.5658	100.0000	281	81	93	0.5509	63	19.81
                              Mafa-DPA1*07:02|IPD0001613_4_MHC-II-DPA	194.1226	155	0.6323	100.0000	155	201	181	0.5935	204	33.17
                              Mafa-DPA1*07:02|IPD0001613_1_MHC-II-DPA	38.3333	81	0.5679	97.5309	79	34	17	0.6038	51	17.89
                              Any other thoughts?

                              Thanks,

                              dave

                              Comment


                              • Originally posted by Brian Bushnell View Post
                                1) No, it does not. Tadpole can extend reads based on kmer-counts but it does not make use of mapping information.

                                2) Clumpify can remove optical duplicate based on a combination of sequence (which indicates whether they are duplicate) and position from the read ID (which indicates if they are very close). However, it cannot handle bam files, only fasta and fastq.

                                -Brian
                                Thanks Brian.

                                How about BED?

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM
                                • seqadmin
                                  The Impact of AI in Genomic Medicine
                                  by seqadmin



                                  Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                  02-26-2024, 02:07 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 03-14-2024, 06:13 AM
                                0 responses
                                33 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-08-2024, 08:03 AM
                                0 responses
                                72 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-07-2024, 08:13 AM
                                0 responses
                                81 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-06-2024, 09:51 AM
                                0 responses
                                68 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X