SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bowtie, an ultrafast, memory-efficient, open source short read aligner Ben Langmead Bioinformatics 513 05-14-2015 02:29 PM
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 09:37 AM
Miso's open source joyce kang Bioinformatics 1 01-25-2012 06:25 AM
Targeted resequencing - open source stanford_genome_tech Genomic Resequencing 3 09-27-2011 03:27 PM
EKOPath 4 going open source dnusol Bioinformatics 0 06-15-2011 01:10 AM

Reply
 
Thread Tools
Old 06-04-2015, 05:24 PM   #221
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 324
Default

Hi Brian,

what are the applications/assembly operations for which tadpole.sh is designed? In part it replaces BBmerge?

Thanks!
luc is offline   Reply With Quote
Old 06-05-2015, 09:47 AM   #222
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hi Luc,

Tadpole does not replace BBMerge, though I plan to further integrate them in the future - mainly, so that BBMerge can first attempt to merge, then if unsuccessful extend the reads using Tadpole, then attempt to merge again, and if still unsuccessful, undo all the changes.

The main reason I made "Tadpole" was because I need to quantify the insert size of libraries, to determine whether they are acceptable; for example, if a project needs a 2x150bp library with a 500bp insert size, for an unknown organism... how do you determine whether it passes? BBMerge only works when the reads are largely overlapping. So, I wrote Tadpole to extend the right end of non-overlapping reads so that they will overlap and can be merged. So far, it works really well on 2x150bp single-cell data with a 350bp insert size, but I have not tested it further than that.

Tadpole is a complete (and very fast) assembler; you can run "tadpole.sh in=reads.fq out=contigs.fa" and it will give you a conservative assembly with a low error rate. The main drawback is that currently the max kmer length is 31, and as such the continuity is poor. I'm evaluating it for use in making a quick assembly for mapping reads to recalibrate their quality, prior to feeding them to a more sophisticated assembler; for recalibration, a low error rate and low misassembly rate is more important than continuity.

For extending reads, the command would be:
tadpole.sh in=reads.fq extend=reads.fq oute=extended.fq mode=extend extendleft=100 extendright=100

That will extend the reads by up to 100bp in each direction, stopping early if a branch is hit. For extending paired reads so that they overlap, only “extendright” is needed, so “extendleft” should be set to zero.
Brian Bushnell is offline   Reply With Quote
Old 06-08-2015, 12:02 PM   #223
lankage
Member
 
Location: Madison, WI

Join Date: Oct 2014
Posts: 20
Default

Hi Brian, got another one for you. This time im working with Illumina miseq reads and mapping them to a HIV reference sequence for the purpose of determining the presence or absence of a variety of HIV drug resistance associated mutations. One step in my workflow requires the use of the mapping information to determine which mutation sites are spanned by a mapped read. I have a list of base coordinates based on the HIV reference sequence. I noticed that in some instances with a large insert relative to the reference, that the mapping start coordinate does not begin with the first base of the read sequence. In the example provided, the start coordinate of 3804 is actually 29 bases in from the start of the read. This is following a 27 base insert per the cigar string. I would like to be able to reliably determine the complete interval with which a read overlaps the reference sequence so I can know which mutation sites are covered. Can i tweak the parameters to get the start coordinate to reflect the start of the read sequence in a non-soft clipped mapping?

Example SAM record:
Code:
M00196:47:000000000-A1CEL:1:1103:11833:21618 1:N:0:1	16	NC_001802DRannotations_(modified)	3804	19	2=27I2=1X5=1X23=1X27=1X82=	*	0	0	ATCACTAGCCATTGCTCTCCAATTACTGTGAGCATGAAAAATATCACAGTAATTGGAGAGCTATGGCTAGTGATTTTAACCTGCCACCTATAGTAGCAAAAGAAATAGTAGCCAGCTGTGATAAATGTCAGCTAAAAGGAGAAGCCATGCATGGACAAGTAGACTGTAGTCC	????ABBBDDDDDEEDGGGGGGIIIIIIHHIHHHHIHIIIIIIIHIIIIHHHHHIIIIIHHHHHHIHIIIIIHHIIHHGHIIIHFFHIIIHGGGIHHHIIHHGFIHIIIIHIHHHFHGHHIHHHHHHFFHHHHHHIHHHHHHHHIIHHFIHIHHFFFFFFFDDDDDDDDBBB	NM:i:31	AM:i:19

Read with expanded cigar string:
ATCACTAGCCATTGCTCTCCAATTACTGTGAGCATGAAAAATATCACAGTAATTGGAGAGCTATGGCTAGTGATTTTAACCTGCCACCTATAGTAGCAAAAGAAATAGTAGCCAGCTGTGATAAATGTCAGCTAAAAGGAGAAGCCATGCATGGACAAGTAGACTGTAGTCC
MMIIIIIIIIIIIIIIIIIIIIIIIIIIIMMXMMMMMXMMMMMMMMMMMMMMMMMMMMMMMXMMMMMMMMMMMMMMMMMMMMMMMMMMMXMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
HTML Code:
[PRE]
>>NC_001802DRannotations_(modified)                       (9181 nt)
 initn: 652 init1: 652 opt: 684  Z-score: 836.6  bits: 167.0 E(1): 8.7e-45
banded Smith-Waterman score: 684; 97.2% identity (97.2% similar) in 144 nt overlap (29-172:3805-3948)

                 10        20        30        40        50        
offend   ATCACTAGCCATTGCTCTCCAATTACTGTGAGCATGAAAAATATCACAGTAATTGGAG
                                     ::: ::::: ::::::::::::::::::::
NC_001 AUUUUUAGAUGGAAUAGAUAAGGCCCAAGAUGAACAUGAGAAAUAUCACAGUAAUUGGAG
         3780      3790      3800      3810      3820      3830    

       60        70        80        90       100       110        
offend AGCTATGGCTAGTGATTTTAACCTGCCACCTATAGTAGCAAAAGAAATAGTAGCCAGCTG
       ::: ::::::::::::::::::::::::::: ::::::::::::::::::::::::::::
NC_001 AGCAAUGGCUAGUGAUUUUAACCUGCCACCUGUAGUAGCAAAAGAAAUAGUAGCCAGCUG
         3840      3850      3860      3870      3880      3890    

      120       130       140       150       160       170        
offend TGATAAATGTCAGCTAAAAGGAGAAGCCATGCATGGACAAGTAGACTGTAGTCC      
       ::::::::::::::::::::::::::::::::::::::::::::::::::::::      
NC_001 UGAUAAAUGUCAGCUAAAAGGAGAAGCCAUGCAUGGACAAGUAGACUGUAGUCCAGGAAU
         3900      3910      3920      3930      3940      3950    

NC_001 AUGGCAACUAGAUUGUACACAUUUAGAAGGAAAAGUUAUCCUGGUAGCAGUUCAUGUAGC
         3960      3970      3980      3990      4000      4010[/PRE]

Last edited by Brian Bushnell; 06-08-2015 at 02:21 PM.
lankage is offline   Reply With Quote
Old 06-08-2015, 02:29 PM   #224
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

You can change gap penalties and get a different alignment by adding the flag "msa=MultiStateAligner9Flat". That will generally reduce the rate of long indels by flattening the affine transforms. The resulting cigar string is "2=1D1X1=1X3=2X1=2X2=3I4=1I1X1=3I3=1X5=1X23=1X27=1X82=", which may not be exactly what you are looking for, but is much closer.

Alternatively, depending on what you are doing with the data, you can enable soft-clipping with BBMap's "local" flag, then calculate coverage using pileup.sh with the "includesoftclip" flag. This will calculate coverage including soft-clipped sequences.

Please let me know if neither of those is adequate...
Brian Bushnell is offline   Reply With Quote
Old 06-30-2015, 08:42 AM   #225
lankage
Member
 
Location: Madison, WI

Join Date: Oct 2014
Posts: 20
Default

Hi Brian,
Came across an instance today where I was attempting to parse bbmap SAM output for cigar string information to observe the mapping of some HIV read sequences against a reference. I was expecting for a cigar string to be present for each record when using the outm=output.sam parameter. In one of my mapping records, I observed an asterisk instead. Am I wrong to assume outm= is intended to include only those mapped reads? I can filter out these reads, but I'd like to make sure my mapping parameters make sense.

From the SAM output:
@HD VN:1.4 SO:unsorted
@SQ SN:NC_001802DRannotations_(modified) LN:9181
@PG ID:BBMap PN:BBMap VN:34.92 CL:java -Djava.library.path=/home/dnanexus/bbmap/jni/ -ea -Xmx10g align2.BBMap build=1 overwrite=true fastareadlen=500 in=reads_file ref=ref_file outm=sam_output minid=.8 strictmaxindel=10 k=8 subfilter=15 -Xmx10g

The offending mapping record:
M01472:214:000000000-AG0YC:1:2108:10755:20410 1:N:0:78 0 NC_001802DRannotations_(modified) 2154 4 * * 0 0 AAGTTAAACAATGGCCATTGACAGAAGAAAAAATAAAAGCATTAGTCATAGTAATATGGGGAAAGACTCCTAAATTTAAATTACCCATACAAAAGGAAACATGGGAAGCATGGTGGACAGAGTATTGGC CCCCCGGGGGGGGGGGGGGGGGGGGGGFDCFGECGGGF<AFEGGFEFGGGGFGGGGGGGGGGGGGGGGGFGFFGGFGEFGGAGFGEAFF<,FGGGGGGGGGGFGGFDGGGFECFGGGGGGGGGGCCCCC AM:i:4

This only occurred after parsing ~8 million mapping records out of a total 50 million.
lankage is offline   Reply With Quote
Old 06-30-2015, 09:33 AM   #226
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

That should not happen... outm should only print mapped reads, except when processing paired data, in which case it will print unmapped reads if the mate is mapped. But according to the flag you are processing single-ended data and that read was mapped, so it should have a cigar string.

It's possible that this is a bug that I've fixed since v34.92, but I have no record of that in my changelog. So I'd like to try to replicate it to see if it's a current bug. I already have the read, but can you send me or direct me to the specific reference you are using?

Thanks,
Brian
Brian Bushnell is offline   Reply With Quote
Old 06-30-2015, 09:38 AM   #227
lankage
Member
 
Location: Madison, WI

Join Date: Oct 2014
Posts: 20
Default

Code:
>NC_001802DRannotations_(modified)
GGUCUCUCUGGUUAGACCAGAUCUGAGCCUGGGAGCUCUCUGGCUAACUAGGGAACCCACUGCUUAAGCCUCAAUAAAGCUUGCCUUGAGUGCUUCAAGUAGUGUGUGCCCGUCUGUUGUGUGACUCUGGUAACUAGAGAUCCCUCAGACCCUUUUAGUCAGUGUGGAAAAUCUCUAGCAGUGGCGCCCGAACAGGGACCUGAAAGCGAAAGGGAAACCAGAGGAGCUCUCUCGACGCAGGACUCGGCUUGCUGAAGCGCGCACGGCAAGAGGCGAGGGGCGGCGACUGGUGAGUACGCCAAAAAUUUUGACUAGCGGAGGCUAGAAGGAGAGAGAUGGGUGCGAGAGCGUCAGUAUUAAGCGGGGGAGAAUUAGAUCGAUGGGAAAAAAUUCGGUUAAGGCCAGGGGGAAAGAAAAAAUAUAAAUUAAAACAUAUAGUAUGGGCAAGCAGGGAGCUAGAACGAUUCGCAGUUAAUCCUGGCCUGUUAGAAACAUCAGAAGGCUGUAGACAAAUACUGGGACAGCUACAACCAUCCCUUCAGACAGGAUCAGAAGAACUUAGAUCAUUAUAUAAUACAGUAGCAACCCUCUAUUGUGUGCAUCAAAGGAUAGAGAUAAAAGACACCAAGGAAGCUUUAGACAAGAUAGAGGAAGAGCAAAACAAAAGUAAGAAAAAAGCACAGCAAGCAGCAGCUGACACAGGACACAGCAAUCAGGUCAGCCAAAAUUACCCUAUAGUGCAGAACAUCCAGGGGCAAAUGGUACAUCAGGCCAUAUCACCUAGAACUUUAAAUGCAUGGGUAAAAGUAGUAGAAGAGAAGGCUUUCAGCCCAGAAGUGAUACCCAUGUUUUCAGCAUUAUCAGAAGGAGCCACCCCACAAGAUUUAAACACCAUGCUAAACACAGUGGGGGGACAUCAAGCAGCCAUGCAAAUGUUAAAAGAGACCAUCAAUGAGGAAGCUGCAGAAUGGGAUAGAGUGCAUCCAGUGCAUGCAGGGCCUAUUGCACCAGGCCAGAUGAGAGAACCAAGGGGAAGUGACAUAGCAGGAACUACUAGUACCCUUCAGGAACAAAUAGGAUGGAUGACAAAUAAUCCACCUAUCCCAGUAGGAGAAAUUUAUAAAAGAUGGAUAAUCCUGGGAUUAAAUAAAAUAGUAAGAAUGUAUAGCCCUACCAGCAUUCUGGACAUAAGACAAGGACCAAAGGAACCCUUUAGAGACUAUGUAGACCGGUUCUAUAAAACUCUAAGAGCCGAGCAAGCUUCACAGGAGGUAAAAAAUUGGAUGACAGAAACCUUGUUGGUCCAAAAUGCGAACCCAGAUUGUAAGACUAUUUUAAAAGCAUUGGGACCAGCGGCUACACUAGAAGAAAUGAUGACAGCAUGUCAGGGAGUAGGAGGACCCGGCCAUAAGGCAAGAGUUUUGGCUGAAGCAAUGAGCCAAGUAACAAAUUCAGCUACCAUAAUGAUGCAGAGAGGCAAUUUUAGGAACCAAAGAAAGAUUGUUAAGUGUUUCAAUUGUGGCAAAGAAGGGCACACAGCCAGAAAUUGCAGGGCCCCUAGGAAAAAGGGCUGUUGGAAAUGUGGAAAGGAAGGACACCAAAUGAAAGAUUGUACUGAGAGACAGGCUAAUUUUUUAGGGAAGAUCUGGCCUUCCUACAAGGGAAGGCCAGGGAAUUUUCUUCAGAGCAGACCAGAGCCAACAGCCCCACCAGAAGAGAGCUUCAGGUCUGGGGUAGAGACAACAACUCCCCCUCAGAAGCAGGAGCCGAUAGACAAGGAACUGUAUCCUUUAACUUCCCUCAGGUCACUCUUUGGCAACGACCCCUCGUCACAAUAAAGAUAGGGGGGCAACUAAAGGAAGCUCUAUUAGAUACAGGAGCAGAUGAUACAGUAUUAGAAGAAAUGAGUUUGCCAGGAAGAUGGAAACCAAAAAUGAUAGGGGGAAUUGGAGGUUUUAUCAAAGUAAGACAGUAUGAUCAGAUACUCAUAGAAAUCUGUGGACAUAAAGCUAUAGGUACAGUAUUAGUAGGACCUACACCUGUCAACAUAAUUGGAAGAAAUCUGUUGACUCAGAUUGGUUGCACUUUAAAUUUUCCCAUUAGCCCUAUUGAGACUGUACCAGUAAAAUUAAAGCCAGGAAUGGAUGGCCCAAAAGUUAAACAAUGGCCAUUGACAGAAGAAAAAAUAAAAGCAUUAGUAGAAAUUUGUACAGAGAUGGAAAAGGAAGGGAAAAUUUCAAAAAUUGGGCCUGAAAAUCCAUACAAUACUCCAGUAUUUGCCAUAAAGAAAAAAGACAGUACUAAAUGGAGAAAAUUAGUAGAUUUCAGAGAACUUAAUAAGAGAACUCAAGACUUCUGGGAAGUUCAAUUAGGAAUACCACAUCCCGCAGGGUUAAAAAAGAAAAAAUCAGUAACAGUACUGGAUGUGGGUGAUGCAUAUUUUUCAGUUCCCUUAGAUGAAGACUUCAGGAAGUAUACUGCAUUUACCAUACCUAGUAUAAACAAUGAGACACCAGGGAUUAGAUAUCAGUACAAUGUGCUUCCACAGGGAUGGAAAGGAUCACCAGCAAUAUUCCAAAGUAGCAUGACAAAAAUCUUAGAGCCUUUUAGAAAACAAAAUCCAGACAUAGUUAUCUAUCAAUACAUGGAUGAUUUGUAUGUAGGAUCUGACUUAGAAAUAGGGCAGCAUAGAACAAAAAUAGAGGAGCUGAGACAACAUCUGUUGAGGUGGGGACUUACCACACCAGACAAAAAACAUCAGAAAGAACCUCCAUUCCUUUGGAUGGGUUAUGAACUCCAUCCUGAUAAAUGGACAGUACAGCCUAUAGUGCUGCCAGAAAAAGACAGCUGGACUGUCAAUGACAUACAGAAGUUAGUGGGGAAAUUGAAUUGGGCAAGUCAGAUUUACCCAGGGAUUAAAGUAAGGCAAUUAUGUAAACUCCUUAGAGGAACCAAAGCACUAACAGAAGUAAUACCACUAACAGAAGAAGCAGAGCUAGAACUGGCAGAAAACAGAGAGAUUCUAAAAGAACCAGUACAUGGAGUGUAUUAUGACCCAUCAAAAGACUUAAUAGCAGAAAUACAGAAGCAGGGGCAAGGCCAAUGGACAUAUCAAAUUUAUCAAGAGCCAUUUAAAAAUCUGAAAACAGGAAAAUAUGCAAGAAUGAGGGGUGCCCACACUAAUGAUGUAAAACAAUUAACAGAGGCAGUGCAAAAAAUAACCACAGAAAGCAUAGUAAUAUGGGGAAAGACUCCUAAAUUUAAACUGCCCAUACAAAAGGAAACAUGGGAAACAUGGUGGACAGAGUAUUGGCAAGCCACCUGGAUUCCUGAGUGGGAGUUUGUUAAUACCCCUCCCUUAGUGAAAUUAUGGUACCAGUUAGAGAAAGAACCCAUAGUAGGAGCAGAAACCUUCUAUGUAGAUGGGGCAGCUAACAGGGAGACUAAAUUAGGAAAAGCAGGAUAUGUUACUAAUAGAGGAAGACAAAAAGUUGUCACCCUAACUGACACAACAAAUCAGAAGACUGAGUUACAAGCAAUUUAUCUAGCUUUGCAGGAUUCGGGAUUAGAAGUAAACAUAGUAACAGACUCACAAUAUGCAUUAGGAAUCAUUCAAGCACAACCAGAUCAAAGUGAAUCAGAGUUAGUCAAUCAAAUAAUAGAGCAGUUAAUAAAAAAGGAAAAGGUCUAUCUGGCAUGGGUACCAGCACACAAAGGAAUUGGAGGAAAUGAACAAGUAGAUAAAUUAGUCAGUGCUGGAAUCAGGAAAGUACUAUUUUUAGAUGGAAUAGAUAAGGCCCAAGAUGAACAUGAGAAAUAUCACAGUAAUUGGAGAGCAAUGGCUAGUGAUUUUAACCUGCCACCUGUAGUAGCAAAAGAAAUAGUAGCCAGCUGUGAUAAAUGUCAGCUAAAAGGAGAAGCCAUGCAUGGACAAGUAGACUGUAGUCCAGGAAUAUGGCAACUAGAUUGUACACAUUUAGAAGGAAAAGUUAUCCUGGUAGCAGUUCAUGUAGCCAGUGGAUAUAUAGAAGCAGAAGUUAUUCCAGCAGAAACAGGGCAGGAAACAGCAUAUUUUCUUUUAAAAUUAGCAGGAAGAUGGCCAGUAAAAACAAUACAUACUGACAAUGGCAGCAAUUUCACCGGUGCUACGGUUAGGGCCGCCUGUUGGUGGGCGGGAAUCAAGCAGGAAUUUGGAAUUCCCUACAAUCCCCAAAGUCAAGGAGUAGUAGAAUCUAUGAAUAAAGAAUUAAAGAAAAUUAUAGGACAGGUAAGAGAUCAGGCUGAACAUCUUAAGACAGCAGUACAAAUGGCAGUAUUCAUCCACAAUUUUAAAAGAAAAGGGGGGAUUGGGGGGUACAGUGCAGGGGAAAGAAUAGUAGACAUAAUAGCAACAGACAUACAAACUAAAGAAUUACAAAAACAAAUUACAAAAAUUCAAAAUUUUCGGGUUUAUUACAGGGACAGCAGAAAUCCACUUUGGAAAGGACCAGCAAAGCUCCUCUGGAAAGGUGAAGGGGCAGUAGUAAUACAAGAUAAUAGUGACAUAAAAGUAGUGCCAAGAAGAAAAGCAAAGAUCAUUAGGGAUUAUGGAAAACAGAUGGCAGGUGAUGAUUGUGUGGCAAGUAGACAGGAUGAGGAUUAGAACAUGGAAAAGUUUAGUAAAACACCAUAUGUAUGUUUCAGGGAAAGCUAGGGGAUGGUUUUAUAGACAUCACUAUGAAAGCCCUCAUCCAAGAAUAAGUUCAGAAGUACACAUCCCACUAGGGGAUGCUAGAUUGGUAAUAACAACAUAUUGGGGUCUGCAUACAGGAGAAAGAGACUGGCAUUUGGGUCAGGGAGUCUCCAUAGAAUGGAGGAAAAAGAGAUAUAGCACACAAGUAGACCCUGAACUAGCAGACCAACUAAUUCAUCUGUAUUACUUUGACUGUUUUUCAGACUCUGCUAUAAGAAAGGCCUUAUUAGGACACAUAGUUAGCCCUAGGUGUGAAUAUCAAGCAGGACAUAACAAGGUAGGAUCUCUACAAUACUUGGCACUAGCAGCAUUAAUAACACCAAAAAAGAUAAAGCCACCUUUGCCUAGUGUUACGAAACUGACAGAGGAUAGAUGGAACAAGCCCCAGAAGACCAAGGGCCACAGAGGGAGCCACACAAUGAAUGGACACUAGAGCUUUUAGAGGAGCUUAAGAAUGAAGCUGUUAGACAUUUUCCUAGGAUUUGGCUCCAUGGCUUAGGGCAACAUAUCUAUGAAACUUAUGGGGAUACUUGGGCAGGAGUGGAAGCCAUAAUAAGAAUUCUGCAACAACUGCUGUUUAUCCAUUUUCAGAAUUGGGUGUCGACAUAGCAGAAUAGGCGUUACUCGACAGAGGAGAGCAAGAAAUGGAGCCAGUAGAUCCUAGACUAGAGCCCUGGAAGCAUCCAGGAAGUCAGCCUAAAACUGCUUGUACCAAUUGCUAUUGUAAAAAGUGUUGCUUUCAUUGCCAAGUUUGUUUCAUAACAAAAGCCUUAGGCAUCUCCUAUGGCAGGAAGAAGCGGAGACAGCGACGAAGAGCUCAUCAGAACAGUCAGACUCAUCAAGCUUCUCUAUCAAAGCAGUAAGUAGUACAUGUAAUGCAACCUAUACCAAUAGUAGCAAUAGUAGCAUUAGUAGUAGCAAUAAUAAUAGCAAUAGUUGUGUGGUCCAUAGUAAUCAUAGAAUAUAGGAAAAUAUUAAGACAAAGAAAAAUAGACAGGUUAAUUGAUAGACUAAUAGAAAGAGCAGAAGACAGUGGCAAUGAGAGUGAAGGAGAAAUAUCAGCACUUGUGGAGAUGGGGGUGGAGAUGGGGCACCAUGCUCCUUGGGAUGUUGAUGAUCUGUAGUGCUACAGAAAAAUUGUGGGUCACAGUCUAUUAUGGGGUACCUGUGUGGAAGGAAGCAACCACCACUCUAUUUUGUGCAUCAGAUGCUAAAGCAUAUGAUACAGAGGUACAUAAUGUUUGGGCCACACAUGCCUGUGUACCCACAGACCCCAACCCACAAGAAGUAGUAUUGGUAAAUGUGACAGAAAAUUUUAACAUGUGGAAAAAUGACAUGGUAGAACAGAUGCAUGAGGAUAUAAUCAGUUUAUGGGAUCAAAGCCUAAAGCCAUGUGUAAAAUUAACCCCACUCUGUGUUAGUUUAAAGUGCACUGAUUUGAAGAAUGAUACUAAUACCAAUAGUAGUAGCGGGAGAAUGAUAAUGGAGAAAGGAGAGAUAAAAAACUGCUCUUUCAAUAUCAGCACAAGCAUAAGAGGUAAGGUGCAGAAAGAAUAUGCAUUUUUUUAUAAACUUGAUAUAAUACCAAUAGAUAAUGAUACUACCAGCUAUAAGUUGACAAGUUGUAACACCUCAGUCAUUACACAGGCCUGUCCAAAGGUAUCCUUUGAGCCAAUUCCCAUACAUUAUUGUGCCCCGGCUGGUUUUGCGAUUCUAAAAUGUAAUAAUAAGACGUUCAAUGGAACAGGACCAUGUACAAAUGUCAGCACAGUACAAUGUACACAUGGAAUUAGGCCAGUAGUAUCAACUCAACUGCUGUUAAAUGGCAGUCUAGCAGAAGAAGAGGUAGUAAUUAGAUCUGUCAAUUUCACGGACAAUGCUAAAACCAUAAUAGUACAGCUGAACACAUCUGUAGAAAUUAAUUGUACAAGACCCAACAACAAUACAAGAAAAAGAAUCCGUAUCCAGAGAGGACCAGGGAGAGCAUUUGUUACAAUAGGAAAAAUAGGAAAUAUGAGACAAGCACAUUGUAACAUUAGUAGAGCAAAAUGGAAUAACACUUUAAAACAGAUAGCUAGCAAAUUAAGAGAACAAUUUGGAAAUAAUAAAACAAUAAUCUUUAAGCAAUCCUCAGGAGGGGACCCAGAAAUUGUAACGCACAGUUUUAAUUGUGGAGGGGAAUUUUUCUACUGUAAUUCAACACAACUGUUUAAUAGUACUUGGUUUAAUAGUACUUGGAGUACUGAAGGGUCAAAUAACACUGAAGGAAGUGACACAAUCACCCUCCCAUGCAGAAUAAAACAAAUUAUAAACAUGUGGCAGAAAGUAGGAAAAGCAAUGUAUGCCCCUCCCAUCAGUGGACAAAUUAGAUGUUCAUCAAAUAUUACAGGGCUGCUAUUAACAAGAGAUGGUGGUAAUAGCAACAAUGAGUCCGAGAUCUUCAGACCUGGAGGAGGAGAUAUGAGGGACAAUUGGAGAAGUGAAUUAUAUAAAUAUAAAGUAGUAAAAAUUGAACCAUUAGGAGUAGCACCCACCAAGGCAAAGAGAAGAGUGGUGCAGAGAGAAAAAAGAGCAGUGGGAAUAGGAGCUUUGUUCCUUGGGUUCUUGGGAGCAGCAGGAAGCACUAUGGGCGCAGCCUCAAUGACGCUGACGGUACAGGCCAGACAAUUAUUGUCUGGUAUAGUGCAGCAGCAGAACAAUUUGCUGAGGGCUAUUGAGGCGCAACAGCAUCUGUUGCAACUCACAGUCUGGGGCAUCAAGCAGCUCCAGGCAAGAAUCCUGGCUGUGGAAAGAUACCUAAAGGAUCAACAGCUCCUGGGGAUUUGGGGUUGCUCUGGAAAACUCAUUUGCACCACUGCUGUGCCUUGGAAUGCUAGUUGGAGUAAUAAAUCUCUGGAACAGAUUUGGAAUCACACGACCUGGAUGGAGUGGGACAGAGAAAUUAACAAUUACACAAGCUUAAUACACUCCUUAAUUGAAGAAUCGCAAAACCAGCAAGAAAAGAAUGAACAAGAAUUAUUGGAAUUAGAUAAAUGGGCAAGUUUGUGGAAUUGGUUUAACAUAACAAAUUGGCUGUGGUAUAUAAAAUUAUUCAUAAUGAUAGUAGGAGGCUUGGUAGGUUUAAGAAUAGUUUUUGCUGUACUUUCUAUAGUGAAUAGAGUUAGGCAGGGAUAUUCACCAUUAUCGUUUCAGACCCACCUCCCAACCCCGAGGGGACCCGACAGGCCCGAAGGAAUAGAAGAAGAAGGUGGAGAGAGAGACAGAGACAGAUCCAUUCGAUUAGUGAACGGAUCCUUGGCACUUAUCUGGGACGAUCUGCGGAGCCUGUGCCUCUUCAGCUACCACCGCUUGAGAGACUUACUCUUGAUUGUAACGAGGAUUGUGGAACUUCUGGGACGCAGGGGGUGGGAAGCCCUCAAAUAUUGGUGGAAUCUCCUACAGUAUUGGAGUCAGGAACUAAAGAAUAGUGCUGUUAGCUUGCUCAAUGCCACAGCCAUAGCAGUAGCUGAGGGGACAGAUAGGGUUAUAGAAGUAGUACAAGGAGCUUGUAGAGCUAUUCGCCACAUACCUAGAAGAAUAAGACAGGGCUUGGAAAGGAUUUUGCUAUAAGAUGGGUGGCAAGUGGUCAAAAAGUAGUGUGAUUGGAUGGCCUACUGUAAGGGAAAGAAUGAGACGAGCUGAGCCAGCAGCAGAUAGGGUGGGAGCAGCAUCUCGAGACCUGGAAAAACAUGGAGCAAUCACAAGUAGCAAUACAGCAGCUACCAAUGCUGCUUGUGCCUGGCUAGAAGCACAAGAGGAGGAGGAGGUGGGUUUUCCAGUCACACCUCAGGUACCUUUAAGACCAAUGACUUACAAGGCAGCUGUAGAUCUUAGCCACUUUUUAAAAGAAAAGGGGGGACUGGAAGGGCUAAUUCACUCCCAAAGAAGACAAGAUAUCCUUGAUCUGUGGAUCUACCACACACAAGGCUACUUCCCUGAUUAGCAGAACUACACACCAGGGCCAGGGGUCAGAUAUCCACUGACCUUUGGAUGGUGCUACAAGCUAGUACCAGUUGAGCCAGAUAAGAUAGAAGAGGCCAAUAAAGGAGAGAACACCAGCUUGUUACACCCUGUGAGCCUGCAUGGGAUGGAUGACCCGGAGAGAGAAGUGUUAGAGUGGAGGUUUGACAGCCGCCUAGCAUUUCAUCACGUGGCCCGAGAGCUGCAUCCGGAGUACUUCAAGAACUGCUGACAUCGAGCUUGCUACAAGGGACUUUCCGCUGGGGACUUUCCAGGGAGGCGUGGCCUGGGCGGGACUGGGGAGUGGCGAGCCCUCAGAUCCUGCAUAUAAGCAGCUGCUUUUUGCCUGUACUGGGUCUCUCUGGUUAGACCAGAUCUGAGCCUGGGAGCUCUCUGGCUAACUAGGGAACCCACUGCUUAAGCCUCAAUAAAGCUUGCCUUGAGUGCUUC
lankage is offline   Reply With Quote
Old 06-30-2015, 09:38 AM   #228
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,703
Default

Looks like this (T's changed to U's): http://www.ncbi.nlm.nih.gov/nuccore/9629357/
GenoMax is offline   Reply With Quote
Old 06-30-2015, 09:51 AM   #229
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

OK, that looks like a bug that is related to the "subfilter" setting. Without subfilter, it gets mapped here:

Code:
M01472:214:000000000-AG0YC:1:2108:10755:20410 1:N:0:78  0       NC_001802DRannotations_(modified)   3186        15      1=1X2=1X7=3X2=2X1=1X3=3X9=3X2=2X1=2X34=1X1=1X24=1X21=
With subfilter, that site gets killed because it has too many substitutions, so it chooses a different site that presumably has a worse alignment (more indels) but fewer than 15 substitutions, so it's still valid. However, for some reason the cigar string does not get generated for the new site. I'll find out why and fix that very soon. There's nothing wrong with your command.

-Brian
Brian Bushnell is offline   Reply With Quote
Old 07-06-2015, 05:08 PM   #230
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

This bug has now been fixed as of 35.14, which I just uploaded. Filtering was essentially not being applied when the best site was filtered and inferior sites passed the minid yet did not have cigar strings (a very rare scenario). But, it works fine now.

-Brian

P.S. Without filtering, just with default parameters, you get this alignment:

Code:
M01472:214:000000000-AG0YC:1:2108:10755:20410 1:N:0:78  0       NC_001802DRannotations_(modified)   2154        35      46=1032D34=1X1=1X24=1X21=
...which in my view is better than any alternative, despite the long deletion.
Brian Bushnell is offline   Reply With Quote
Old 07-20-2015, 04:01 AM   #231
WhatsOEver
Senior Member
 
Location: Germany

Join Date: Apr 2012
Posts: 215
Default

Hi Brian,
I found a small bug in the v1.4 cigar strings: Whenever there is a "N" in the read sequence, one gets a "M" for this position instead of a "=".

Quote:
Originally Posted by Example
Read:
GATATCGGTTTCATCCTCGCCTTAGCATGATTTATCCTACACTCCAACTCATGAGACCCCANAACAAATAGCCCTTCTAAACGCTAATCCAAGCCTCACCCCCACTACTAGGCCTCCTCCTAGCAGCAGCAGGCAAATCAGCCCAATTGGCTCCACCCCTGACTCCCCTCAG
aligns with the following cigar:
56=1I4=1M36=1I49=1X1=2D22=
WhatsOEver is offline   Reply With Quote
Old 07-20-2015, 07:43 AM   #232
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

That's actually intentional, and happens when there's an N in the read or the reference. The SAM spec does not specify what to do in these situations. So, the way I see it, A aligning to A is a match (=), a aligning to C is a substitution (X), and A aligning to N - or even N aligning to N - might be a match, and might be a substitution. Since it's not an insertion or deletion, it's length-neutral, so I call it an alignment match.

Are these causing problems for some downstream application? If so, I can add a flag that changes the behavior.
Brian Bushnell is offline   Reply With Quote
Old 07-20-2015, 02:48 PM   #233
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Hello Brian-

I'm curious if there is an output setting that I'm missing for PE alignments that would maintain a strict grouping of the mates in the output. I'd like to try using bbmap.sh as an aligner to feed RSEM (and maybe eXpress) which was designed around bowtie alignments. RSEM is VERY picky and there is almost no way to get alignments to work with their program without running them through some sort of reformatting script. One thing, however, that would be useful is to have the alignment output be able to keep mates in two-row pairs no matter what. So no matter what if line X is the left mate (mapped or unmapped) then line X+1 is the right mate mapped or unmapped. This way every pair of lines go together even through this will result in one mate or the other's alignment being reported more than once. RSEM flips out if it doesn't find the alignments like this.

TL;DR; An option so that every pair of lines in the output SAM represent the first and second mates of a fragment even if this causes redundant reporting of individual mate alignments.

Another question - I wanted to exclude singletons and improper pairs (most specifically pairs with mates mapped to different transcripts) so I enabled pairedonly AND killbadpairs. Each option taken separately works however killbadpairs seems to happen AFTER pairedonly, internally, because killbadpairs creates singletons in the output. Is it possible to have pairedonly apply to the result of killbadpairs as well?

thanks!
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 07-20-2015, 05:19 PM   #234
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Sort of continuing on the subject I posted above. Again I'm looking at paired-end alignments to a transcriptome where I'm allowing multiple alignments per pair. I'm using both ambiguous=all and secondary=t to allow some additional sub-optimal alignments to be reported. I need pairs to be reported as neighbors in the output so when there are 2 alignments for the first mate but only 1 for the second I need that to either be two lines or four lines depending on how those alignments were paired. So in parsing the output of bbmap today I found something that's throwing me off that appears to be a bug.

I've been parsing out reads in cases where there are more than one alignment of either the left or right mate so that I can see how bbmap reports those alignments. The following is one such set (first mate separated from second mate alignments):

Code:
#== left ==#
SRR1016945.2666	83	NM_146193	1295	42	100M	=	1199	-196	GAACCTATAGAGATCCTGCCCAATGTCTGTTACACAGCGTGCGCAACACTCAAGGGCCCAGATTCTCACTATGGCACAAAAGGACTGAAGAAAGTAGTGT	DCCDEEDEECCCDDDBDDDDDDEEDEEFFFEFFFHHIIIIIIIJIGGFJIJJJJJIJJJJIJJJJJJIJJJIJJJHJJJJJJJIJJJHHHHHFFFFFCCC	NM:i:1	AM:i:42	NH:i:1
SRR1016945.2666	353	ENSMUST00000121441	1072	40	100M	NM_146193	1295	0	CAGATCATTGAATATGAGAAAAAACAAACCTTGGGACAGAATGATACTGGATCTAGTTGTGATGGGACTGCTAACACATTCAGGGTCATGTTCAAAGAAC	CCCFFFFFHHHGHJJJJJIJJJJJJJJJJJJJJJJJJIIGIJJJJJIJJJJIIGIJIJJHIJJJJJIHHHHHFFFFFDEEDEEDDCCDDDEEEDCCDDDD	NM:i:2	AM:i:40	NH:i:3
SRR1016945.2666	353	ENSMUST00000178360	1054	40	100M	NM_146193	1295	0	CAGATCATTGAATATGAGAAAAAACAAACCTTGGGACAGAATGATACTGGATCTAGTTGTGATGGGACTGCTAACACATTCAGGGTCATGTTCAAAGAAC	CCCFFFFFHHHGHJJJJJIJJJJJJJJJJJJJJJJJJIIGIJJJJJIJJJJIIGIJIJJHIJJJJJIHHHHHFFFFFDEEDEEDDCCDDDEEEDCCDDDD	NM:i:2	AM:i:40	NH:i:3

#== right ==#
SRR1016945.2666	163	NM_146193	1199	42	100M	=	1295	196	CAGATCATTGAATATGAGAAAAAACAAACCTTGGGACAGAATGATACTGGATCTAGTTGTGATGGGACTGCTAACACATTCAGGGTCATGTTCAAAGAAC	CCCFFFFFHHHGHJJJJJIJJJJJJJJJJJJJJJJJJIIGIJJJJJIJJJJIIGIJIJJHIJJJJJIHHHHHFFFFFDEEDEEDDCCDDDEEEDCCDDDD	NM:i:1	AM:i:42	NH:i:3
The first three alignments all have 0x40 set indicating they are the first mate. The last one has 0x80 set indicating it is the second mate. The first+second pair with flags 83 and 163 are properly paired and confirmed with bowtie2. So what are the other two first mate alignments? Those aren't actually first mates...and the so-called RNEXT and PNEXT fields indicate they are mated to the first first-mate alignment. Also the sequences and qualities match up with those of the second-mate read. So those should actually have the 0x80 flag set since they are NOT first-mate reads at all.

This appears to be happening randomly because sometimes I'll see multiple first-mate mappings and a single second-mate mapping but all of the 0x40 and 0x80 flags are set correctly.

Another interesting case:
Code:
#== left ==#
SRR1016945.1906	83	NM_011653	367	44	100M	=	266	-201	AGCAGCTCATCACAGGCAAGGAGGATGCTGCCAATAACTATGCTCGTGGCCACTACACCATTGGCAAGGAGATCATTGACCTTGTCCTGGACAGGATTCG	[email protected]JJJIJJJJIIGHFIGJJJJJJJJHHHHHFFFFFCCC	NM:i:0	AM:i:44	NH:i:3
SRR1016945.1906	337	NM_009448	800	42	100M	NM_011653	266	0	AGCAGCTCATCACAGGCAAGGAGGATGCTGCCAATAACTATGCTCGTGGCCACTACACCATTGGCAAGGAGATCATTGACCTTGTCCTGGACAGGATTCG	[email protected]JJJIJJJJIIGHFIGJJJJJJJJHHHHHFFFFFCCC	NM:i:1	AM:i:42	NH:i:3
SRR1016945.1906	337	ENSMUST00000122404	269	42	100M	NM_011653	266	0	AGCAGCTCATCACAGGCAAGGAGGATGCTGCCAATAACTATGCTCGTGGCCACTACACCATTGGCAAGGAGATCATTGACCTTGTCCTGGACAGGATTCG	[email protected]JJJIJJJJIIGHFIGJJJJJJJJHHHHHFFFFFCCC	NM:i:1	AM:i:42	NH:i:3
SRR1016945.1906	353	NM_009448	699	42	100M	NM_011653	367	0	AGGAGCTGGCAAGCATGTGCCCCGGGCAGTGTTCGTAGACCTGGAACCCACGGTCATCGATGAAGTTCGCACCGGCACCTACCGCCAGCTCTTCCACCCT	[email protected]>AHGHIGIBGIGHDHGIBFHGJJJJJIJFI6D;[email protected]@BBBDBDCCDDDDD<<A	NM:i:1	AM:i:42	NH:i:2
#== right ==#
SRR1016945.1906	163	NM_011653	266	44	100M	=	367	201	AGGAGCTGGCAAGCATGTGCCCCGGGCAGTGTTCGTAGACCTGGAACCCACGGTCATCGATGAAGTTCGCACCGGCACCTACCGCCAGCTCTTCCACCCT	[email protected]>AHGHIGIBGIGHDHGIBFHGJJJJJIJFI6D;[email protected]@BBBDBDCCDDDDD<<A	NM:i:0	AM:i:44	NH:i:2
In the set of 0x40 flagged alignments only the fourth one should have the 0x80 flag. I didn't catch this until I looked at alignments for this pair from bowtie2 but the second first-mate and the fourth first-mate (the two mapped to NM_009448) should actually be mated to one another (i.e. bowtie reports that exact pair as a properly paired alignment with insert size 201 which is the same as the primary here to NM_011653). The alignment at NM_011653 is perfect (no mismatches) however the alignment at NM_009448 has 1 mismatch per mate (from bowtie2). So that should fall under the category of secondary alignments in bbmap. Also, curiously, bowtie2 has reported 6 different pairings for this fragment with between 0 and 2 mismatches. I allowed secondary mappings to be within 0.95 of the primary score so maybe I need to lower that in order to see alignments with > 1 mismatch when the best has 0 mismatches however there are two additional sequences that bowtie2 aligned this fragment to with only a single mismatch and they aren't reported in bbmap's output so it seems it may be missing some valid mappings.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 07-20-2015, 05:41 PM   #235
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Woah, I replied to your first post and it never registered. To summarize:

I was unable to reproduce an interaction of the "pairedonly" and "killbadpairs" flag generating singletons, so if you can send me a read pair and reference for which that occurs, it would be helpful. In my testing, when pairedonly=true, I get identical results with killbadpairs true or false. But, I don't use killbadpairs anymore - when I want to produce only properly paired mapped reads, I set pairedonly and use the outm stream, e.g.

bbmap.sh in=reads.fq outm=mapped_paired.fq pairedonly

For RSEM - the default behaviour of BBMap is to produce R1, R2, R1, R2, etc. in sam files, so it should be fine. It will do this with the flags "ambig=best" (default), "ambig=random", and "ambig=toss". The only time it will not do that is with "ambig=all", in which it will print all R1 alignments, then all R2 alignments. Enforcing strict interleaving when printing all alignments would probably result in output that is either incorrect or violates the sam spec, which makes me wonder where it ever occurs... so, I'm a bit reluctant to add the option, but I already have other flags like "keepnames" that (as noted) violate the sam spec but are still useful, so I'm willing to add it if useful. Does RSEM require all mapping locations, rather than a random location, for multimapping reads?

As for your second post - that seems much more concerning (to me). I'll examine it in detail tomorrow, and thanks as always for your thorough bug reports. I'm still working on the Seal qhdist issue, by the way; I've just been preoccupied with upgrading some programs to support unlimited max kmer lengths, which is now mostly finished.
Brian Bushnell is offline   Reply With Quote
Old 07-20-2015, 10:59 PM   #236
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

I'll have to doublecheck that pairedonly/killbadpairs thing I thought I was seeing. I like the pairedonly strategy and I can deal with mates mapped to different references in other ways.

Yes, tools like RSEM and eXpress need to have as many alignments as possible per read within reason. In fact the default alignment for RSEM is with bowtie (1) allowing 2 mismatches in the seed and then up to the entire rest of the read to be mismatches and all alignments that fall under that threshold. Multiple best alignments is also super useful for the program because they it can see which reads are totes ambiguous (i.e. the ones that map exactly the same to mulitple isoforms - those that hit an exon that's shared among multiple isoforms). So that information is necessary but it's also necessary to have what most aligners consider suboptimal alignments because those programs don't necessarily look at each alignment's quality but the overall solution to the abundance problem. The need to see which target references a read can map to within some reason. I don't like to allow as much wiggle room as the default RSEM bowtie setting - I typically allow up to 10% error. In some cases it would be entirely reasonable to only accept the best alignment...but I would still need ambig=all to satisfy the requirements for RSEM/eXpress. Obviously I could just use bowtie2 which works great but it is slooowwwwwww. STAR works pretty good and is a great substitute for bowtie2. I want to see if I can find the right settings for BBMap too and see how it ranks. The variable between programs tends to be the number of alignments found. bowtie2 finds many more PE alignments than bowtie1, for example. STAR comes close to bowtie2 while bwa falls short of them both. Basically whichever aligner combines finding the 'most' mappings with speed is my favorite for this job. I haven't specifically looked to see how well BBMap stacks up but I'll do that. The kind of test I'm talking about is to generate transcriptome based simulated reads and then map them with different aligners. I just check, per read/pe-fragment, if the aligner successfully reported the original correct alignment along with, if any, additional alignments. Again bowtie2 was the winner between bowtie1, STAR, bwa (aln and mem), and I think subread. Unfortunately it is slow in the alignment mode where it reports more than a single alignment (-k or -a mode).

FYI the default behavior for STAR, bowtie and bowtie2 when allowing multiple mappings per pair is to report them in R1, R2, R1, R2 strict formatting. So if bowtie2 finds 12 locations for a single fragment it will print out 24 lines in R1,R2 pairs. All of the secondary locations get the 0x100 flag set. Basically that just makes parsing and interpreting them really easy since, as you know, if you can count on a certain formatting it means you can write less code. Obviously the RSEM program could do this but they chose to put the burden of formatting on the user - probably to keep the number of decisions going on behind the scenes to a minimum. I was parsing the BBMap alignments by watching the read name and piling up the 0x40 mates in one array with 0x80 mates in a second array and then afterwards I ran all possible combinations of them through a function that validates whether or not they are a valid pair. It would make their program a little more powerful, in my opinion, if they could just incorporate something like that but alas...I guess that's what keeps Perl in business.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 07-21-2015, 02:49 PM   #237
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Quote:
Originally Posted by Brian Bushnell View Post
As for your second post - that seems much more concerning (to me). I'll examine it in detail tomorrow, and thanks as always for your thorough bug reports. I'm still working on the Seal qhdist issue, by the way; I've just been preoccupied with upgrading some programs to support unlimited max kmer lengths, which is now mostly finished.
By the way this was when I had ambig=all and aligning PE reads. The odd behavior didn't seem to care whether or not I had 'secondary=t' set.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 09-11-2015, 10:21 AM   #238
rkanwar
Junior Member
 
Location: minnesota

Join Date: Feb 2011
Posts: 8
Default Aligning reads after bbmerge

Hello,

I have PE reads, many of which overlap. I used bbmerge to merge the overlapping reads. In the end I have 3 fastq files:

1. Merged single end reads
2. End 1 of no-merged reads
3. End 2 of no-merged reads

How do I specify these 3 fastq files to bbmap [what should I give for in and in2] ? Thanks.

bye,
Rahul

Last edited by rkanwar; 09-11-2015 at 10:42 AM.
rkanwar is offline   Reply With Quote
Old 09-11-2015, 12:28 PM   #239
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hi Rahul,

For a paired end dataset and a single-end dataset, you need to run mapping twice. However, you can do it in one command with BBWrap like this:

Code:
bbwrap.sh ref=reference.fa in=read1.fq,merged.fq in2=read2.fq,null out=mapped.sam append=t
This would be equivalent to:
Code:
bbmap.sh ref=reference.fa
bbmap.sh in=read1.fq in2=read2.fq out=mapped_pairs.sam
bbmap.sh in=merged.fq out=mapped_merged.sam
...except that BBWrap would only produce a single sam output file.
Brian Bushnell is offline   Reply With Quote
Old 09-20-2015, 11:28 AM   #240
DarthMapper
Junior Member
 
Location: PR

Join Date: Sep 2015
Posts: 4
Default

Greetings community! I am only a first year undergrad student, so please bare with my ignorance.

I have encountered some difficulties trying to use BBMap to align WGS pair end reads belonging to a Sheep to it's pertinent rna.fa file - I'm only interested in the organism's full mRNA sequences, but I do not know where else to get them.

The problem I'm facing surfaces after verifying Average Coverage and Standard Deviation calculations. While the coverage results are as expected, the standard deviation values are well over 600!

I will post the command line and the path to the reference data I am using below, to ease out the process of troubleshooting:

Reference Data: found under NCBI's ftp site, under the following path: genomes/Ovis_aries/RNA/rna.fa.gz
Here is a link that leads directly to the directory containing the reference file I am using: ftp://ftp.ncbi.nlm.nih.gov/genomes/Ovis_aries/RNA/

*If anyone else would kindly suggest a better way of obtaining only the full mRNA nucleotide sequences of the organism, please be kind enough to enlighten my poor soul*

Reads: I am using paired end fastq WGS reads that were used to build the organism's genome assembly. They have already undergone proper quality testing.

After I create my build using BBMap and my reference file (rna.fa), I proceed to align my reads to the reference build. Here is what my command line looks like:

bbmap.sh in1=read_1.fastq in2=read_2.fastq out=MappedFile.sam build=1 minratio=0.56

(I am trying to compare ambiguous values later on, so I need the min ratio. Making the "minratio=<>" parameter closer to 1 does lower the Standard Deviation, but it is still unusually high)

Cool, up to there everything runs smoothly. I take a look at my alignment results and proceed to calculate coverage and standard deviation values using pileup.sh. Again, here is my command line:

pileup.sh in=MappedFile.sam out=stats.txt hist=histogram.txt

After looking at the results, they look something like this:
Set USE_COVERAGE_ARRAYS to true
Set USE_BITSETS to false
Note: Coverage capped at 65535

Average Coverage: 37.84
Standard Deviation: 707.47
Percent scaffolds with any coverage: 80.76
Percent of reference bases covered: 35.87

I'd be incredibly thankful if anyone would shed some light into this little hindrance. What am I messing up?
DarthMapper is offline   Reply With Quote
Reply

Tags
bbmap, metagenomics, rna-seq aligners, short read alignment

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:34 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO