Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Originally posted by sk8bro View Post
    You wrote: "But it should be "unmapped+mapped1+mapped2=used"

    I just wanted to confirm that it is okay that the numbers also don't fit this formula.

    611706+374+844 = 612924 != 613590
    Yes, I know It only works for single-ended reads.

    Then, if I wanted to report the mapping rate, is this the correct formula?

    mapped = used-unmapped/used
    Depends on what you want to report, but yes, the number of reads that get sent to outm are "used-unmapped". The number of reads that actually mapped (meaning they match the reference) is mapped1+mapped2, which is slightly different (for paired reads).

    Is there a way to export to outm only if both pairs map? Default I think is to export if either pair maps?
    Yes, you can use the "pairedonly" flag, which will a pair mapped only if both reads map, and in the correct orientation (same contig, opposite strands, within the maximum insert size).

    Comment


    • #17
      That all sounds great

      Have you ever compared to SNAP in terms of human read removal?

      Comment


      • #18
        SNAP is faster than BBMap, but much less sensitive. I would never use it for most mapping tasks, but for contamination removal, it might be acceptable because you want fairly low sensitivity anyway, to avoid false positives. A more fair comparison in that case might be BBDuk vs SNAP, as they both use long kmers, and would probably have similar speed and sensitivity.

        But no, I have not specifically looked at SNAP for contamination removal.

        Comment


        • #19
          Thank you for the explanation and your extensive suite of tools.

          Comment


          • #20
            Hi Brian, I was wondering about this... If bam/bai is what I want at the end of the day, do you still recommend fastq output and then conversion?

            "BBSplit is recommended for fastq and fasta output, not for sam/bam output"

            It seems that sam output with this option would give me what I'm after without any extra work:

            bs=<file> Write a shell script to 'file' that will turn the sam output into a sorted, indexed bam file.

            Comment


            • #21
              BBSplit can output sam files, and my original goal was to have it do so, for convenience. And it works fine for reads that only map to one of the references. But unfortunately, with "ambig2=all", reads that map to multiple references will be printed to each sam file with the same mapping information. In other words, if a read mapped to both e.coli and human, it would be printed in both the e.coli and human outputs (which is correct), but in both cases it would be reported as mapping to the same organism (e.g. in both files it would be listed as mapping to e.coli). Which is probably not what you want.

              With "ambiguous2=toss" or "best" this is not a problem, but with "all" or "split" it is. And it turns out this is rather tricky to fix, so I haven't allocated time to do it. Since my use-cases (and those of JGI in general) use BBSplit to produce fastq output anyway, it seems low-priority. Obviously, with fastq, this is not an issue because fastq output does not retain the mapping information.

              So, the easiest way to ensure that people get correct results is to only use BBSplit for fasta/fastq output. Then if you want sam/bam, map the individual output files to their individual genome.

              Sorry for the inconvenience!

              Comment


              • #22
                Hi Brian, a few more questions.

                Can BBsplit output the unmapped reads? It didn't look in the description like it could, so I have been doing a parallel BBmap to the concatenation of each of BBsplit's reference fastas to get the unmapped output. However, even with what I think are the same parameters specified to each program, I get slightly different results when I count up how many reads map in each case. Here is a comparison, and you can see that a few samples are a few read pairs off.

                BBmap BBsplit
                72 72
                46 46
                408 408
                37800 37800
                281614 281614
                6854 6854
                114028 114034
                40788 40788
                6508 6508
                193476 193480

                Here are the 2 commands with options

                bbsplit.sh
                k=8
                in=BBmap/Sample_1P.fastq
                in2=BBmap/Sample_2P.fastq
                ref="$amplicon_dir"
                ambiguous2=split
                basename=BBsplit/Sample_%_#P.fastq
                refstats=Amplicon_counts_Sample.txt
                minid=0.95
                idfilter=0.95
                vslow
                maxindel=100
                nzo=f
                pairedonly

                bbmap.sh
                k=8
                minid=0.95
                idfilter=0.95
                maxindel=100
                vslow
                minhits=1
                ambiguous=best
                path="$agg_file"
                pairedonly
                build=1
                printunmappedcount
                -Xmx23g
                in=BBmap/Sample_#P.fastq
                outm=BBmap/Sample.mapped.sam
                outu=BBmap/Sample.unmapped.sam

                Also, in investigating the unmapped reads, it is not clear to me why some read pairs are unmapped. While I realize I have required 95% identity for both reads, I find many pairs that satisfy this in the unmapped output.

                Thanks!
                Last edited by sk8bro; 12-18-2015, 12:38 PM.

                Comment


                • #23
                  BBSplit supports the "outu" flag, so yes, it can capture unmapped reads. The number of reads mapping can vary slightly when mapping paired reads multithreaded, particularly with the "pairedonly" flag or a high minid/idfilter, from run to run; this is because the pair distance is calculated dynamically on a per-thread basis, and how close a pair distance is to the average influences the mapping score, which determines whether or not the reads are mapped (or paired). When multithreaded, which reads are processed by which thread is nondeterministic, so the exact value of the average pair distance will be slightly different. I assume this is causing the discrepancy.

                  You have set both a 95% identity cutoff and "pairedonly", meaning both reads must be at least 95% identity and they must be in a properly-paired orientation. Perhaps you can map the unmapped reads with BBMap using default settings, and add the flag "idtag" which will print the calculated percent identity in the sam output? I would be interested in seeing if there are indeed pairs, in a proper orientation, with both at least 95% identity as calculated by BBMap.

                  Comment


                  • #24
                    I am trying that now.. but I changed BBsplit command by reducing indel size to 20, removing vslow (what does that do anyway?), and minid is default (idfilter is strong than minid right?).

                    Still running, but I notice slightly higher mapping rates with these changes... can you explain?

                    Thank you for your help.

                    Comment


                    • #25
                      vslow increases sensitivity, by not discarding very common kmers. This is not very important with default K=13, but becomes more important with short kmers, since there are fewer of them.

                      minid is loose, and mainly affects speed. "minid=90", for example, does not ensure that there will be no hits with under 90% identity; rather, it ensures that any hit as low as 90% identity, and possibly lower, will be reported. Idfilter is strict; "idfilter=X" ensures that any alignment with ID under X will be terminated.

                      Comment


                      • #26
                        Okay, so I should probably keep vslow in there, since I use 8mers. Here are some examples of default BBmap mapping of my unmapped by BBsplit reads:

                        HWI-D00294:202:C7FK3ANXX:6:2213:12108:58098 1:N:0:AAGAGGCACTCTCTAT 97 XXX, complete genome 743 1X117= XXX, complete genome 274 0 XXX [email protected]>F0C=C.FG=FGEEEB.@@BBBGBGG NM:i:1 AM:i:43 YI:f:99.15

                        HWI-D00294:202:C7FK3ANXX:6:2213:12108:58098 1:N:0:AAGAGGCACTCTCTAT 145 XXX, complete genome 274 3 107=1X7= XXX, complete genome 7 0 XXX 8CDGGG<BBGGGEGEGGBGD>=GGBDGGGGGCGDEBDEGDFGD>GFEGGGGGGGDDGGDGC1GDGGFFGGEGGGGGAGGGGDG<BGEEGG@FGEGFDF1GGFC>?;GGBA>@A XT:A:R NM:i:1 AM:i:3 YI:f:99.13

                        HWI-D00294:202:C7FK3ANXX:6:2212:12674:82633 1:N:0:AAGAGGCACTCTCTAT 99 XXX, complete genome 140 6=1X35=1X59=1X21= = 264 388 XXX BBB@@<FGGGGGEGGDGGGGGGGGGGGFFFGGGGGGEG1FGGGGGGGGGGGGGGGGGGGGGGBG;:EFGGFGGGGGF>EFGGGGDFFCCCCFGGGGGG8A@FCGGCGGGBGGGG=@GGGGGGGG NM:i:3 AM:i:40 YI:f:97.58

                        HWI-D00294:202:C7FK3ANXX:6:2212:12674:82633 1:N:0:AAGAGGCACTCTCTAT 147 XXX, complete genome 264 3 53=1X33=1X22=1X14= = 1 -388 XXX DGG:GE:.<GCDGGGGGDDDBGGGGDEGGGGGCGGFBGGGGGGGGGFGGGDGGGGCGGGGGGGGGGGGGGEGGGGFGGGFF:EGGGB/GGGGGGGGGGGFEFF1FFDGGGGGGGGGGGGGBBBBB XT:A:R NM:i:3AM:i:3 YI:f:97.60


                        The second pair, seems proper so I don't understand how it is unmapped. The first pair, I think the problem may be that my reference is actually a concatenation of sub-references (fasta file with multiple > entries). If Read1 aligns best to sub-ref1, while Read2 aligns best to sub-ref2, is the insert size too big and therefore not proper? Is there a way to get around this behavior to keep the best read alignment within acceptable insert size?

                        Thanks,

                        Also, those changes to BBsplit gave me ~1.5M more reads mapping. Still seems that I didn't change much that should have mattered... 127,208,572 vs 125,675,984

                        Comment


                        • #27
                          sooo. Ok, I tried increasing both pairlen and rescuedist to 200,000. That gave another small increase in mapped read count. I realllly hope I can keep with BBsplit, but it seems I either need to change my reference or find a way to get the following:

                          I do want both pairs to map, and to different strand,, and each with 95% identity.. but the following are all fine given my reference of similar concatenated sub-reference sequences. At worst case, I have 603 similar sub-references for a reference, so I really would rather not split them out and generate outputs for each.

                          ---> <---
                          <--- --->
                          ---> //huge insert// <---
                          <--- //huge insert// --->

                          Should I get rid of the 'pairedonly' and output as SAM and filter on SAM flags? I know I would need to lose the ambiguous2=split for SAM output.

                          Thanks for letting me know what you think to try!

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin


                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM
                          • seqadmin
                            Strategies for Sequencing Challenging Samples
                            by seqadmin


                            Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                            03-22-2024, 06:39 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          22 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          24 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 09:21 AM
                          0 responses
                          19 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-04-2024, 09:00 AM
                          0 responses
                          50 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X