Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • It is a strange analysis and with all those caveats going to be of limited real world use for eukaryotic genomes.

    Comment


    • Thank you all for your replies. I will be trying some of the things you have all suggested.
      As for running samtools depth with BBMap's outputs, I get 0 coverage and Standard Deviation values. I first convert BBMap's sam output to bam, and then proceed to sort it using samtools. It is then that I try using samtools depth.

      Edit:
      I tried using one read to verify if there was a problem with the random selection of sites for ambiguously-mapped paired reads. However, the results were very similar, leading me to think that I should be focusing on extracting the longest isoform per gene and seeing how that looks like. However, I am unable to produce coverage and STDV directly from BBMap, as it tells me that "hist=" is an unknown command. Thank you again Brian.
      Last edited by JamesJoyce; 10-06-2015, 06:39 AM.

      Comment


      • Update: Results still have high standard deviations, even after only keeping the longest isoforms of the sequences found in the reference transcript.

        I really like all the features BBMap holds, and as such I would like to keep using BBMap for a foreseeable future. Is there any way I could use BBMap to do this sort of analysis (determining copy number of certain genes within an individual organism), but instead by using RNA-Seq read sequences? Thanks again for your time guys.

        Comment


        • You can't determine copy number of genes from RNA-seq data, as the expression is not really determined by the copy-number (and is highly variable anyway). The best way is probably to map WGS DNA reads to the genome and then use bedtools to extract coverage over coding regions, and look for regions with different-than-expected coverage. This will remove a lot of the bias caused by mapping DNA reads to a transcriptome. Also, note that exome-capture will not yield useful quantitative results, either, due to extreme bias, without extensive calibration for a given kit.

          Alternately, it would probably be a lot cheaper to run a SNP chip, or qPCR, or a probe array, or something like that, rather than doing sequencing. I can't remember which one is best for this but at least one is designed to do copy-count quantification without sequencing, across a genome.

          Comment


          • out of memory error

            even with a very small ref and input file, I got the out of memory error. Would you check what is happen?


            bbmap.sh in=~/Documents/align/test.fa out=~/Documents/align/outputu.sam

            java -Djava.library.path=/home/local/bbmap/jni/ -ea -Xmx700m -cp /home/local/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 in=/home/Documents/align/HLAtest.fa out=/home/Documents/align/outputu.sam
            Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, in=/home/Documents/align/test.fa, out=/home/Documents/align/outputu.sam]


            BBMap version 35.51
            Retaining first best site only for ambiguous mappings.
            Set genome to 1

            Loaded Reference: 0.023 seconds.
            Loading index for chunk 1-1, build 1
            Generated Index: 0.817 seconds.
            Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
            at align2.BBIndex.analyzeIndex(BBIndex.java:106)
            at align2.BBMap.loadIndex(BBMap.java:410)
            at align2.BBMap.main(BBMap.java:32)

            Comment


            • BBMap needs a fairly large amount of memory dependent on kmer size, regardless of how small the reference is. You can estimate it like this:

              Code:
              C:\temp\ecoli>java jgi.AssemblyStats2 in=ecoli_K12.fa k=13
              A       C       G       T       N       IUPAC   Other   GC      GC_stdev
              0.2462  0.2542  0.2537  0.2459  0.0000  0.0000  0.0000  0.5079  0.0000
              
              Main genome scaffold total:             1
              Main genome contig total:               1
              Main genome scaffold sequence total:    4.640 MB
              [I](stuff)...[/I]
              
              [B]BBMap minimum memory estimate at k=13:     [COLOR="DarkRed"]-Xmx900m[/COLOR]     (at least 1000 MB physical RAM)[/B]
              So, it looks like K=13 with a tiny genome like E.coli still needs 900MB! Let's try K=12:

              Code:
              C:\temp\ecoli>java jgi.AssemblyStats2 in=ecoli_K12.fa k=12
              A       C       G       T       N       IUPAC   Other   GC      GC_stdev
              0.2462  0.2542  0.2537  0.2459  0.0000  0.0000  0.0000  0.5079  0.0000
              
              Main genome scaffold total:             1
              Main genome contig total:               1
              Main genome scaffold sequence total:    4.640 MB
              [I](stuff)...[/I]
              
              [B]BBMap minimum memory estimate at k=12:     [COLOR="DarkRed"]-Xmx480m[/COLOR]     (at least 540 MB physical RAM)[/B]
              That's more reasonable; K=12 will work with -Xmx480m. Note that you must specify K both when building the index AND mapping.

              Comment


              • Thank you very much. After specify K in both command, it works.

                Comment


                • mapPacBio.sh - weighting soft clipped mappings higher than mappings with substitution

                  Hi Brian,
                  I was wondering if there was a way to adjust the mapping score weights such as to favor a mapping that perfectly maps to a reference with soft clipped ends over a mapping with a number of substitutions. Our problem is that we have allele databases of variable length sequences that closely resemble one another (in this case Major Histocompatibility complex class II - macaques) where our pacbio reads end up extending well past the reference sequence, but map perfectly to the reference. We want to include gapped mappings but not mappings with substitutions for the purpose of counting reads to individual ref sequences taking into account pacbio's tendency to introduce random indels.

                  We have been using bbmap - with the pacbio option and the subfilter=0 option as packaged within Geneious 9.2. This has the result of discarding perfect mappings to allele references where there is a higher scoring hit to a longer reference sequence with substitutions that subsequently gets filtered out post-mapping with the subfilter=0 option.

                  This looks to give the correct result if we use a full length allele reference file, however many of the alleles we need to count for genotyping don't have a full-length ref sequence identified as of yet.

                  Comment


                  • Unfortunately, I'm not sure how this could be accomplished without changing the scoring matrix, which is currently hard-coded, or else doing custom post-processing the output.

                    It's an odd request, though, as PacBio reads have substitution errors as well as indels. If not for the indels, the "semiperfectmode" might accomplish what you want - it requires perfect mappings, aside from the parts that go off the ends of contigs.

                    Perhaps the best bet would be to align with "bbmapskimmer.sh", which tries to report all alignments above a certain score cutoff, rather than looking for the single best alignment like normal BBMap (and it is designed for PacBio reads). Then, subfilter=0 would get rid of sites with substitutions, but shorter sub-free alignments would be retained.

                    Comment


                    • Hi Brian,

                      I am using bbmap.sh to evaluate coverage on a collection of viral genomes (viral all.fna downloaded from NCBI). The command I am using is as follow:

                      bbmap.sh in=XXX.fastq ref=XXX/all.fna nodisk covstats=xxx.txt

                      I got results in the XXX.txt file but I am not sure if I understand it correctly. For instance, for one of the gi| entry, I got the following information:

                      #ID Avg_fold Length Ref-GC Covered_percent Covered_bases Plus_reads Minus_reads Read_GC

                      gi|XXXXXX 0.0926 195859 0.6661 9.2561 18129 1 1 0.9444

                      How come with one Plus_reads and one minus_reads you can get covered_base=18129? How does the Covered_percent calculated?

                      Your kind help is highly appreciated.

                      James

                      Comment


                      • Hi James,

                        BBMap and Pileup consider bases covered by a deletion event as covered. So presumably one or both of your reads mapped with a long deletion ("D" symbol in the cigar string). Please let me know if that's not the case.

                        Covered_percent is simply 100*Covered_bases/Length.

                        Comment


                        • Hi Brian,

                          Thanks a lot for your quick response!

                          Perhaps you can give me some advice: I have a metagenomic sequencing fast file. I want to know how many reads are mapped to each of the viral genomes and what % of each viral genome is covered at what fold. What is the best way to do that?

                          Thanks in advance!

                          James

                          Comment


                          • It looks like you are doing it correctly. Bases covered by deletion events are treated as covered for "percent covered", but "fold coverage" is calculated by the total number of mapped bases divided by the length, so that is not affected by long deletions. So, only the percent covered column is affected by long deletions. You simply ban long deletions with a flag like "strictmaxindel=10" if you don't want those reads.

                            Alternately, produce a sam file, and then run it through a different program that treats bases covered by deletion events as not covered. I'm not sure which tool would do that, but probably one exists. Maybe samtools or bedtools?

                            Comment


                            • Thanks Brian!

                              What if I use BBSplit? How many genomes can BBsplit handle? What I ultimately want to do is to figure out how many reads mapped to each genome and separate those reads by genome so that we can evaluate them later if necessary. For instance, we generate a fastq file with MiSeq, we bbsplit on say 1000 viral genomes, and see which virus has more hits. We then somehow report what percentage of the genome is covered with reads and on average at what fold coverage.

                              Your kind input is highly appreciated.

                              James

                              Comment


                              • BBSplit is ideal for purpose of getting one file per organism, and there's no upper limit to the number of genomes it can handle. Note that BBSplit needs to have each reference in its own file (e.g. ref=virus1.fa,virus2.fa, etc), and also it should not be used to produce sam files, just fasta/fastq. There's also Seal, which can function similarly to BBSplit, and produces either one output file per reference file or one per reference sequence, so all your organisms can be in the same reference file if you want. It also has no upper limit on the number of reference sequences.

                                I'll add a flag to tell BBMap/Pileup to ignore deletions when calculating coverage. It does seem fairly useful, so I'll probably do that soon.

                                ...OK, I added it. I'll upload the new version next week, probably on Monday.
                                Last edited by Brian Bushnell; 10-30-2015, 05:11 PM.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                24 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                25 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                22 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                52 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X