SEQanswers

#441 sebastian_1 (03-15-2017, 12:57 PM)

Hello,

I am new to this forum, so I apologize if I have posted in the wrong place.

I am currently trying to assemble a single-cell genome from Illumina HiSeq 2x250 bp reads.
One of my first steps is to normalize the data. I have used bbnorm.sh before because it is fast and works very well. However, with this dataset (around 100 million paired-end reads) I cannot run bbnorm.sh on our server due to memory issues, so I tried it on a cluster where I allocated 16 cores and 700 GB of RAM.

I run the software with this command line:
bbnorm.sh in1=st1c_R1.fastq.gz in2=st1c_R2.fastq.gz out1=st1c_R1_norm.fastq.gz out2=st1c_R2_norm.fastq.gz threads=12 target=60 mindepth=2

The software runs fine, but the process gets killed by the scheduler for high CPU usage:

Quote:
=>> PBS: job killed: ncpus 59.52 exceeded limit 16 (sum)
I tried increasing the number of requested cores to 30 and lowering the thread count on the command line to 10, but I still have the same problem.

The cluster has BBMap version 36.92 installed. For some reason the number of threads I define seems not to be taken into account, even though the run reports that the number of threads was set to 12.

Quote:
Settings:
threads: 12
k: 31
deterministic: true
toss error reads: false
passes: 1
bits per cell: 16
cells: 145.37B
hashes: 3
base min quality: 5
kmer min prob: 0.5

target depth: 240
min depth: 2
max depth: 300
min good kmers: 15
depth percentile: 64.8
ignore dupe kmers: true
fix spikes: false
Any suggestions? I really cannot ask for an entire node just for myself.
I understand that, due to other components like pigz, it can use more than 12 cores, but here it looks like it uses 60 cores, or all available cores.

Thank you,
Sebastian
#442 Brian Bushnell (03-15-2017, 01:38 PM)

Hi Sebastian,

You can reduce the memory needs of BBNorm by adding the flags "prefilter" and "bits=16", which might allow it to run on a lower-memory computer, depending on the complexity of the input data.
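Applied to the original command from the post above, that would look something like this (a sketch; the file names are carried over from Sebastian's command):

```shell
# Sketch: the original bbnorm.sh invocation with the two
# memory-reducing flags added (prefilter counts low-depth kmers in a
# smaller preliminary structure; bits=16 sets 16-bit counter cells).
bbnorm.sh in1=st1c_R1.fastq.gz in2=st1c_R2.fastq.gz \
    out1=st1c_R1_norm.fastq.gz out2=st1c_R2_norm.fastq.gz \
    threads=12 target=60 mindepth=2 prefilter=t bits=16
```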

But as for reducing the CPU utilization... that's tricky. You can add "pigz=f unpigz=f" to disable pigz, so you will only be left with java threads. Java thread scheduling is nondeterministic due to garbage collection and various other background stuff, and I'm not sure if it's possible to ensure the process will be capped at a certain number of total threads without binding the process to specific cores. But you can try manually specifying the number of GC threads:

-XX:ParallelGCThreads=2

This command may or may not work, depending on your java version. You'd have to skip (or modify) the shellscript so that the actual java command would look something like this:

java -ea -Xmx200g -Xms200g -XX:ParallelGCThreads=2 -cp /path/to/bbmap/current/ jgi.KmerNormalize in1=st1c_R1.fastq.gz in2=st1c_R2.fastq.gz out1=st1c_R1_norm.fastq.gz out2=st1c_R2_norm.fastq.gz threads=12 target=60 mindepth=2
#443 sebastian_1 (03-16-2017, 02:00 AM)

Hi Brian,

Thank you for your quick reply. I tried running BBNorm with prefilter=t and bits=16 on our server, but I still run out of memory: it failed on a server with 128 GB of RAM, and also on one with 256 GB. However, I noticed that the high CPU usage (more than defined in threads=x) occurs only at the beginning, for a short time; afterwards Java limits itself to the requested number of cores.
Luckily, I found that the PBS system computes CPU usage by dividing the consumed CPU hours by the walltime. The workaround was to submit the job as an interactive (STDIN) job and stay idle for around 30 minutes, which creates a walltime "buffer"; even though the run starts with much higher CPU usage, the buffer is enough to compensate, so the job does not get killed by the scheduler.

However now I get an error with pigz:

Quote:
Table creation time: 1667.979 seconds.
Started output threads.
pigz: write error code 122
pigz: abort: write error on <stdout>
Exception in thread "Thread-53" java.lang.RuntimeException: java.io.IOException: Stream closed
at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:31)
Caused by: java.io.IOException: Stream closed
at java.lang.ProcessBuilder$NullOutputStream.write(ProcessBuilder.java:433)
at java.io.OutputStream.write(OutputStream.java:116)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at stream.ReadStreamByteWriter.writeFastq(ReadStreamByteWriter.java:451)
at stream.ReadStreamByteWriter.processJobs(ReadStreamByteWriter.java:96)
at stream.ReadStreamByteWriter.run2(ReadStreamByteWriter.java:41)
at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:27)
Exception in thread "Thread-64" java.lang.RuntimeException: Writing to a terminated thread.
at stream.ConcurrentGenericReadOutputStream.write(ConcurrentGenericReadOutputStream.java:207)
at stream.ConcurrentGenericReadOutputStream.addOrdered(ConcurrentGenericReadOutputStream.java:193)
at stream.ConcurrentGenericReadOutputStream.add(ConcurrentGenericReadOutputStream.java:98)
at jgi.KmerNormalize$ProcessThread.normalizeInThread(KmerNormalize.java:3129)
at jgi.KmerNormalize$ProcessThread.run(KmerNormalize.java:2801)
Output buffer became full; key 95560 waiting on 95304.
Edit: I set $TMPDIR on the cluster to my current working directory and got rid of the pigz error. :-)
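For anyone hitting the same pigz write error (often a quota or space problem in the default temporary area), the workaround amounts to repointing the temporary directory before launching the job; the exact target path is site-specific:

```shell
# Workaround sketch: point temporary files at the job's working directory
# (any writable location with enough space would do) before starting BBNorm.
export TMPDIR="$PWD"
```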

#444 GenoMax (03-16-2017, 03:26 AM)

While you wait for @Brian to respond (he is on California Daylight Time), you could disable pigz by adding "pigz=f unpigz=f" and see if that allows the job to move forward.
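Concretely, appended to the original command from earlier in the thread, that would be something like (a sketch):

```shell
# Disable the external pigz compression/decompression subprocesses;
# Java then handles the gzip I/O itself (slower, but avoids the pigz writer).
bbnorm.sh in1=st1c_R1.fastq.gz in2=st1c_R2.fastq.gz \
    out1=st1c_R1_norm.fastq.gz out2=st1c_R2_norm.fastq.gz \
    threads=12 target=60 mindepth=2 pigz=f unpigz=f
```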
#445 minja (03-16-2017, 06:04 AM)

Hello Brian,

I'm trying to make use of bbmapskimmer.
Here is a simple fasta example (read.fasta):

Code:
>1
GAACATGATCCCCTTGTACTACTACATTATCACTAGCTTTATGTTTTCTA
>2
TAGAAAACATAAAGCTAGTGATAATGTAGTAGTACAAGGGGATCATGTTC
>3
GAACATGATCCCCTTGTACTATTACATTATCACTAGCTTTATGTTTTCTA
Reads 1 and 3 differ only by a single T/C letter; read 2 is the reverse complement of read 1.

I'm aligning to mm10 genome:

~/Distr/bbmap/bbmapskimmer.sh in=~/tmp/HiC_repeats/input/temp/read.fasta out=test_bbmap1.sam ref=~/tmp/HiC_repeats/fasta/mm10.fa ambiguous=all vslow k=8 minid=0.6 maxsites2=8000 maxsites=8000 local

Here is test_bbmap1.sam:
Code:
@HD    VN:1.4    SO:unsorted
@SQ    SN:chr10    LN:130694993
....
@SQ    SN:chrM    LN:16299
@SQ    SN:chrX    LN:171031299
@SQ    SN:chrY    LN:91744698
@PG    ID:BBMap    PN:BBMap    VN:37.02    CL:java -Djava.library.path=/mnt/storage/home/vsfishman/Distr/bbmap/jni/ -ea -Xmx98546m align2.BBMapPacBioSkimmer build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambig=all minscaf=100 startpad=10000 stoppad=10000 midpad=6000 in=/mnt/storage/home/vsfishman/tmp/HiC_repeats/input/temp/read.fasta out=test_bbmap1.sam ref=/mnt/storage/home/vsfishman/tmp/HiC_repeats/fasta/mm10.fa ambiguous=all vslow k=8 minid=0.6 maxsites2=8000 maxsites=8000 local
1    0    chr16    39205299    43    50=    *    0    0    GAACATGATCCCCTTGTACTACTACATTATCACTAGCTTTATGTTTTCTA    *    NM:i:0    AM:i:43    NH:i:2
1    272    chr10    38707112    39    28=1X21=    *    0    0    *    *    NM:i:1    AM:i:39    NH:i:2
2    16    chr16    39205299    43    50=    *    0    0    GAACATGATCCCCTTGTACTACTACATTATCACTAGCTTTATGTTTTCTA    *    NM:i:0    AM:i:43    NH:i:2
2    256    chr10    38707112    39    28=1X21=    *    0    0    *    *    NM:i:1    AM:i:39    NH:i:2
3    16    chr10    38707112    3    50=    *    0    0    TAGAAAACATAAAGCTAGTGATAATGTAATAGTACAAGGGGATCATGTTC    *    XT:A:R    NM:i:0    AM:i:3    NH:i:80
3    272    chr10    72154315    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    256    chr10    77806115    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    272    chr10    127797927    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    256    chr13    4007635    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    256    chr14    17647057    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    256    chr14    40019581    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    256    chr15    50069753    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    272    chr15    72113945    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    256    chr15    75259455    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    256    chr17    69579150    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    256    chr19    60830365    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    256    chr1    95272134    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
3    256    chr1    102465771    3    50=    *    0    0    *    *    NM:i:0    AM:i:3    NH:i:80
....
(I trimmed some header lines and some of read 3's alignments.)

Just to note: without the "local" option I get the same results.

As you can see, there are many perfect alignments for read 3, but only two alignments each for reads 1 and 2. As I understand it, with minid=0.6, shouldn't all the alignments reported for read 3 also be reported for reads 1 and 2?

Could you please help clarify this?
And one more question: why do reads 1 and 2 not have the "XT:A:" tag?

Best,
Minja

#446 catagui (03-27-2017, 01:00 PM)
BBMap for mapping a transcriptome to a genome

Hi Brian,
I would like to use BBMap to map a transcriptome to a genome.
My first question: in the results it says

Reads Used: 165317 (61187186 bases)

but the transcriptome I am using has 95,389 transcripts. Shouldn't the number of reads used match the number of transcripts?

Second, this transcriptome has an N50 of 363 and an N75 of 696, and the manual says to use BBMapPacBio for reads longer than 700 bp. However, when I use BBMapPacBio instead of BBMap I get a 28% mapping percentage, which is lower than with BBMap (38%). This transcriptome contains transcripts from other organisms, and the mapping percentage should be around 50%.

What else should I take into account when mapping a transcriptome to a genome, since it is different from mapping mRNA reads?

Thanks
#447 Brian Bushnell (03-27-2017, 01:35 PM)

@minja - I replied via email and still have not had a chance to look into that; sorry - I will when I have a chance.

@catagui - BBMap cannot handle query sequences over 600bp. When you feed it a fasta input file, by default, sequences longer than 500bp are broken into 500bp pieces and mapped individually (and not recombined afterward). Some of these pieces could be quite short (say, 15bp) and will usually map if you have a big enough reference. You can discard them with "minlen=40" or so. Or you could simply use MapPacBio for mapping as it can accommodate much longer reads. In that case, though, you should increase "maxindel" which has different defaults between BBMap and MapPacBio, and furthermore, a longer transcript could contain multiple long introns.

Try "mapPacBio.sh k=13 maxindel=200000 maxlen=60000 minlen=40 (other arguments)" and see if that improves its mapping rate. Ultimately, though, longer query sequences will always have lower alignment rates than shorter ones when using a global or glocal aligner, so shredding the sequences (which BBMap does) artificially increases the alignment rate.
#448 jweger1988 (04-01-2017, 06:30 AM)

Hello Brian,

I made some viruses that have a set of degenerate nucleotides in the genome to study bottlenecks; we're calling these barcodes. I'm wondering whether any of your tools can extract the specific region where the barcodes are and then list the distinct sequences present, and possibly how many times each occurs. I've been able to do pretty much everything I've needed with your tools, so I figured I would ask. Thanks for all of the work you've done.

James
#449 Brian Bushnell (04-04-2017, 04:13 PM)

Hi James,

Sorry for the slow reply. You could, in your case, do this:

Assume the barcode is from position 10 to 19 in the read (inclusive, zero-based). Then you would do:

reformat.sh in=reads.fq ftl=10 ftr=19 out=trimmed.fq
kmercountexact.sh in=trimmed.fq k=10 rcomp=f out=kmers.fa


"kmers.fa" will contain all the barcodes and their counts.
#450 jweger1988 (04-06-2017, 01:32 AM)

Quote:
Originally Posted by Brian Bushnell (post #449 above, quoted in full)


Awesome, Brian, thanks for the response. It seems like that is going to work. It's a little more complicated because there are actually three amplicons with barcodes in different regions in the same file, but if I can pull them out using one of your other tools, kmercountexact.sh should work just fine.

I thought I would also ask you about another problem I'm currently having. I have been trying to use BBMap to detect large deletions in viral sequencing data (defective genomes), but so far haven't had good luck. Is there a way to separate the reads that have "break points" (deletions) in the viral genome from the reads that map normally?

Thanks again for all of your work.

James
#451 Brian Bushnell (04-06-2017, 09:19 AM)

Normally, I would just map the reads with the flag "maxindel=100k" (or whatever is appropriate, given the length of the genome) and then call variants with "callvariants.sh" using the flag "rarity=0.01" to find low-frequency events. But alternatively, you can separate the reads with multiple BBMap passes, like this:

bbmap.sh in=reads.fq outm=mapped.fq maxindel=100k
bbmap.sh in=mapped.fq outm=normal.fq outu=longdeletion.fq maxindel=100k dellenfilter=20
bbmap.sh in=longdeletion.fq out=mapped_longdeletion.sam maxindel=100k
#452 jweger1988 (04-07-2017, 01:29 AM)

Quote:
Originally Posted by Brian Bushnell (post #451 above, quoted in full)
Brian, that's awesome; I didn't even know there was a variant caller in your suite. It worked well, and I was able to find a bunch of deletions by using the "clearfilters" flag in callvariants.sh. However, I'm sure a lot of them are probably not real. Is there a particular parameter you suggest looking at as a measure of the probability that a variant is real? I'm not sure what some of the fields in the callvariants.sh output actually mean.

Thanks again.
#453 Brian Bushnell (04-07-2017, 01:37 PM)

Normally, the variant quality is the best indicator of whether a variant is real. I think, in this case, it's likely that most of them are real - viruses mutate rapidly, so viral "isolates" often contain multiple different viral genomes (which is why they are hard to assemble). The possibilities are, essentially...

1) Sequencing error randomly matched somewhere distant in the genome, causing the false appearance of a deletion. This is extremely unlikely with small, non-redundant genomes, and would usually only be represented by a single read.
2) Chimeric reads formed from different genomic fragments fusing together during sequencing may look like long deletion events. These should universally be represented by a single read (or pair) with a PCR-free or deduplicated library.
3) Real deletion events. Hopefully these will be represented by multiple reads.

So, you may want to deduplicate your library (if it was PCR'd, otherwise don't) and then call variants using a threshold of, say, 3 reads (minreads=3 flag, placed after "clearfilters").
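Putting the pieces together, the workflow might look something like this (a sketch; file names are placeholders, clumpify.sh with the dedupe flag is one way to deduplicate reads, and the flag spellings should be checked against your BBTools version's help text):

```shell
# Hypothetical end-to-end sketch for a PCR'd library; skip the clumpify
# step for a PCR-free library. File names are placeholders.
clumpify.sh in=reads.fq out=deduped.fq dedupe
bbmap.sh ref=virus.fa in=deduped.fq out=mapped.sam maxindel=100k
callvariants.sh in=mapped.sam ref=virus.fa vcf=vars.vcf clearfilters rarity=0.01 minreads=3
```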
#454 catagui (04-10-2017, 09:43 AM)

Quote:
Originally Posted by Brian Bushnell (post #447 above, quoted in full)
Hi Brian, thanks for your reply.

I haven't been able to improve the mapping percentage. I tried your suggestion (bbmap/mapPacBio.sh k=13 maxindel=200000 maxlen=60000 minlen=40) and still get 13%.

Could it be that the mapping is too stringent, since I am dealing with different species within the same genus? Is there another flag, aside from "minid", that I could change to improve the mapping?

Thanks again
#455 Brian Bushnell (04-10-2017, 02:14 PM)

You can reduce kmer length and increase sensitivity - change "k=13" to "k=11" and add "slow". But ultimately, since BBMap is a global aligner, it does not like large structural rearrangements when aligning long queries, since they do not fit into the model of "match/substitution/insertion/deletion" reported in single alignments.
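Combining this with the earlier suggestion in the thread, a more sensitive invocation might look like the following (a sketch with placeholder file names):

```shell
# More-sensitive settings: shorter seed kmer (k=11) plus the "slow" preset;
# the remaining flags are carried over from the earlier suggestion.
mapPacBio.sh ref=genome.fa in=transcriptome.fa out=mapped.sam \
    k=11 slow maxindel=200000 maxlen=60000 minlen=40
```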
#456 Kristian (04-17-2017, 04:06 AM)
High error rate with BBMap?

Hi Brian,

while trying to call variants on human mitochondrial DNA, I noticed an unusually high error rate when mapping with BBMap (versions 37.10 and 36.92 tested) compared to bwa (0.7.12-r1039). Both were run with default parameters, and below is the output of samtools stats (after sorting and indexing the BAMs). IGV shows many more inverted pairs with the BBMap mapping, which came as a surprise.

I can provide details, but I would prefer to share them offline due to the sensitivity of some of the data. Any hints on how to track down the issue?

Many thanks!

BBmap
Code:
BBM: raw total sequences:       537764
BBM: filtered sequences:        0
BBM: sequences: 537764
BBM: is sorted: 1
BBM: 1st fragments:     268882
BBM: last fragments:    268882
BBM: reads mapped:      256366
BBM: reads mapped and paired:   230042  # paired-end technology bit set + both mates mapped
BBM: reads unmapped:    281398
BBM: reads properly paired:     228748  # proper-pair bit set
BBM: reads paired:      537764  # paired-end technology bit set
BBM: reads duplicated:  0       # PCR or optical duplicate bit set
BBM: reads MQ0: 0       # mapped and MQ=0
BBM: reads QC failed:   4282
BBM: non-primary alignments:    0
BBM: total length:      67353071        # ignores clipping
BBM: bases mapped:      32229667        # ignores clipping
BBM: bases mapped (cigar):      29307   # more accurate
BBM: bases trimmed:     0
BBM: bases duplicated:  0
BBM: mismatches:        1672477 # from NM fields
BBM: error rate:        5.706749e+01    # mismatches / bases mapped (cigar)
BBM: average length:    125
BBM: maximum length:    151
BBM: average quality:   30.4
BBM: insert size average:       255.7
BBM: insert size standard deviation:    117.0
BBM: inward oriented pairs:     60121
BBM: outward oriented pairs:    2977
BBM: pairs with other orientation:      48
BBM: pairs on different chromosomes:    0
BWA
Code:
BWA: raw total sequences:       537764
BWA: filtered sequences:        0
BWA: sequences: 537764
BWA: is sorted: 1
BWA: 1st fragments:     268882
BWA: last fragments:    268882
BWA: reads mapped:      266324
BWA: reads mapped and paired:   248064  # paired-end technology bit set + both mates mapped
BWA: reads unmapped:    271440
BWA: reads properly paired:     245908  # proper-pair bit set
BWA: reads paired:      537764  # paired-end technology bit set
BWA: reads duplicated:  0       # PCR or optical duplicate bit set
BWA: reads MQ0: 178     # mapped and MQ=0
BWA: reads QC failed:   0
BWA: non-primary alignments:    0
BWA: total length:      67353071        # ignores clipping
BWA: bases mapped:      33736838        # ignores clipping
BWA: bases mapped (cigar):      32160973        # more accurate
BWA: bases trimmed:     0
BWA: bases duplicated:  0
BWA: mismatches:        603462  # from NM fields
BWA: error rate:        1.876380e-02    # mismatches / bases mapped (cigar)
BWA: average length:    125
BWA: maximum length:    151
BWA: average quality:   30.4
BWA: insert size average:       384.2
BWA: insert size standard deviation:    1040.7
BWA: inward oriented pairs:     67383
BWA: outward oriented pairs:    3645
BWA: pairs with other orientation:      112
BWA: pairs on different chromosomes:    0
#457 Brian Bushnell (04-17-2017, 10:00 AM)

I noticed this line:
Quote:
BBM: bases mapped (cigar): 29307 # more accurate
I'm guessing you are not counting "=" and "X" symbols as mapped, just "M"... generally, I would classify "M", "=", "X", and "I" as mapped bases. What version of samtools are you using?

Also, I find it odd that the insert size average and stdev differ so much. Often the mode or median is more stable as it is less influenced by outliers.

By inverted pairs, do you mean both reads map to the same side of the reference? If not, I wonder if the fact that the reference is very small and circular could play some role in the differences.

Anyway, please feel free to email me both bam files and the reference, and I'll look at them. It's possible that the circularity, or the fact that bwa looks for local rather than global alignments, is the primary factor behind the difference.
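As a quick sanity check, the mapped-base count can be tallied from a CIGAR string directly, counting the M, =, X, and I operations as described above (a sketch using standard Unix tools; the example CIGAR is taken from the SAM output earlier in the thread):

```shell
# Sum the lengths of the CIGAR operations that should count as mapped
# bases (M, =, X, I). Example CIGAR from this thread: 28=1X21=.
cigar="28=1X21="
mapped=$(printf '%s\n' "$cigar" | grep -oE '[0-9]+[M=XI]' | tr -d 'M=XI' | awk '{s+=$1} END{print s}')
echo "$mapped"   # 50 = 28 + 1 + 21
```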
#458 Kristian (04-17-2017, 10:27 AM)

Quote:
Originally Posted by Brian Bushnell (post #457 above)
Thanks Brian, I sent you a PM with the details. I'm using samtools 1.3.1 (and the above is the output of samtools stats).
#459 Brian Bushnell (04-17-2017, 11:37 AM)

Hi Kristian,

Is this a Nextera long-mate-pair library? Those need special processing before they can be mapped. Or, can you give me any more information about the library construction and the trimming methodology? The library has an extremely high error rate (particularly in read 2), less than half of the reads map to the mito, and it appears that both adapters and transposase sequence are still present. Also, I'm measuring the median insert size as 159 (BBMap) or 133 (BBMerge), so there are a lot of pairs with insert size shorter than the sequenced read length; those might be displayed differently in IGV depending on whether the adapter portion was soft-clipped (which bwa does by default) or not (BBMap does not soft-clip by default).

I adapter-trimmed the reads and error-corrected them, but still under 50% map. I'm not really sure what's wrong with the library. But, I don't see anything unusual about the pairing orientations. I get 45.5670% properly paired with "rcs=f" (require correct strand = false) and 45.5481% with "rcs=t", so only 0.02% map in the wrong orientation.

#460 Kristian (04-18-2017, 05:16 AM)

Hi Brian,

Quote:
Originally Posted by Brian Bushnell View Post
Is this a Nextera long-mate pair library?
This is a Nextera XT prep, and the adapters/linkers were trimmed along with demultiplexing directly through the Illumina RTA pipeline. On my end, left and right trimming with bbduk2 had almost no effect, and splitnextera also found only a few hits (0.1%) to the adapter, so I'd guess the dataset is (for the most part) trimmed.

Tags: bbmap, metagenomics, rna-seq aligners, short read alignment

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:17 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO