SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bowtie, an ultrafast, memory-efficient, open source short read aligner Ben Langmead Bioinformatics 513 05-14-2015 03:29 PM
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 10:37 AM
Miso's open source joyce kang Bioinformatics 1 01-25-2012 07:25 AM
Targeted resequencing - open source stanford_genome_tech Genomic Resequencing 3 09-27-2011 04:27 PM
EKOPath 4 going open source dnusol Bioinformatics 0 06-15-2011 02:10 AM

Reply
 
Thread Tools
Old 10-03-2017, 08:20 AM   #561
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 310
Default bbduk.sh: "reads" parameter does not exit clean

Hi- Not sure this thread is the best place to post this...

It seems to me that the `reads` parameter makes bbduk exits with an error when the input is not fully processed.

These runs are both ok (input file has less then 1M reads):

Code:
bbduk.sh in=test.fq.gz out=test.out.fq reads=-1 overwrite=true
bbduk.sh in=test.fq.gz out=test.out.fq reads=1000000 overwrite=true
However reads=10 gives:

Code:
bbduk.sh in=test.fq.gz out=test.out.fq reads=10 overwrite=true

java -Djava.library.path=/home/db291g/applications/BBmap/bbmap/jni/ -ea -Xmx15023m -Xms15023m -cp /home/db291g/applications/BBmap/bbmap/current/ jgi.BBDukF in=test.fq.gz out=test.out.fq reads=10 overwrite=true
Executing jgi.BBDukF [in=test.fq.gz, out=test.out.fq, reads=10, overwrite=true]

BBDuk version 37.54
NOTE: No reference files specified, no trimming mode, no min avg quality, no histograms - read sequences will not be changed.
Initial:
Memory: max=15097m, free=14703m, used=394m

Input is being processed as paired
Started output streams:	0.082 seconds.
Processing time:   		0.017 seconds.

Input:                  	20 reads 		3011 bases.
Total Removed:          	0 reads (0.00%) 	0 bases (0.00%)
Result:                 	20 reads (100.00%) 	3011 bases (100.00%)

Time:   			0.114 seconds.
Reads Processed:          20 	0.18k reads/sec
Bases Processed:        3011 	0.03m bases/sec
Exception in thread "main" java.lang.RuntimeException: jgi.BBDukF terminated in an error state; the output may be corrupt.
	at jgi.BBDukF.process(BBDukF.java:913)
	at jgi.BBDukF.main(BBDukF.java:72)

echo $?
1
I guess the output is still ok but the non-zero exit code causes problems inside a pipeline.

Am I missing something here or should this be fixed?

Thanks

Dario
dariober is offline   Reply With Quote
Old 10-03-2017, 09:01 AM   #562
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,585
Default

I would like to add that reformat.sh also exhibits this "issue".

Code:
$ reformat.sh in=P1.fastq.gz out=stdout.fa reads=5

Input:                          5 reads                 1094 bases
Output:                         5 reads (100.00%)       1094 bases (100.00%)

Time:                           0.178 seconds.
Reads Processed:           5    0.03k reads/sec
Bases Processed:        1094    0.01m bases/sec
Exception in thread "main" java.lang.RuntimeException: ReformatReads terminated in an error state; the output may be corrupt.
        at jgi.ReformatReads.process(ReformatReads.java:1148)
        at jgi.ReformatReads.main(ReformatReads.java:45)
GenoMax is offline   Reply With Quote
Old 10-03-2017, 03:45 PM   #563
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hi Dario and Genomax,

I though I had gotten rid of that a while ago, but I guess not - I'll investigate and fix it. It's harmless and due to a race condition for a thread finishing after it was prematurely shut down because enough reads were generated, but it looks scary.
Brian Bushnell is offline   Reply With Quote
Old 10-04-2017, 05:39 AM   #564
jweger1988
Member
 
Location: Paris, France

Join Date: Apr 2017
Posts: 37
Default

Hi Brian,

I have a couple of questions about using callvariants.sh and mapPacbio.sh. I am using mapPacbio.sh to align reads from a Nanopore minion to a viral sequence and looking for deletions with callvariants.sh. My questions are

1. Do you have a suggested method for mapping minion reads. I used a suggestion you provided in 2014 but I don't know if you have since updated this. Here is what I'm running.

mapPacBio.sh k=8 in=reads.fq ref=ref.fa maxreadlen=1000 minlen=200 sam=1.4 idtag ow int=f qin=33 indelhist=indelhist1.txt out=mapped.sam minratio=0.15 ignorequality slow ordered maxindel=12000 outputunmapped=f bs=bs1.sh

2. I am then calling variants with the desire to look for deletions with callvariants.sh.

callvariants.sh in=mapped.sam rarity=0.0001 out=deletions.tsv vcf=deletions.vcf ref=ref.fa extended=t border=20 minscore=10

In the output can you tell me what the meaning of "coverage" or "read Depth" or "allele depth" are? What is the difference between these? Also, what does the allele fraction mean in this case? For a SNP it's self-explanatory but for a indel it seems like it's not as straight forward.

Any insight you can provide would be extremely useful. Thanks again for all you do.
jweger1988 is offline   Reply With Quote
Old 10-04-2017, 11:58 AM   #565
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

1) I have not looked at Nanopore reads in several years now so I don't have any further experience in mapping them. But from what I read and hear the quality is gradually creeping upward (and my initial exposure was to very low quality data) so some of the settings can probably be relaxed; e.g., increasing to k=10 and minratio=0.25.

2) "Read Depth" (DP field) is the depth of all reads covering (or spanning) the event, while "Allele Depth" (AD field) is the same but only for those reads exhibiting the variation.

Coverage is added to an array tracking per-reference-base coverage at each position when each read is processed (and deleted bases are considered covered by a read containing that deletion). Similarly, coverage is added to each variation when a read containing that variation is processed.
The reference coverage for a variant is calculated by summing the coverage per-base in the array over the length of the event. For substitutions this is straightforward. For deletions, this is the deleted reference bases. For insertions, this is the average coverage of the bases on either side of the insertion.
It is possible for the allele coverage to be slightly higher than the total coverage using this calculation method (for example, as a result of a sudden drop in coverage on one side of an insertion event) so DP reports max(ref coverage, allele coverage) while COV reports the raw calculated coverage (they will virtually always be the same, though). AF is calculated as DP/AD to prevent AF over 1.0.

3) "Allele Fraction" is "Allele Depth"/"Read Depth". As you say, this is straightforward for SNPs but less so for indels, particularly with short reads and long deletions. It's not entirely clear what AF should really mean for a deletion, because it describes a ratio of two classes of reads - those that cover a variant location and do not indicate the variant, and those that cover a variant location and do indicate the variant. But when deletions are longer than reads there is a third case; those that land in the middle of the deletion but do not touch either end. You could, for example, have 100x coverage on either end of a deletion event, with 50x indicating the deletion and 50x indicating no deletion, and calculate an AF of 0.5. But if you had 1000x coverage somewhere in the middle of the deletion event from reads not spanning the junction on either end, that's a 3rd category and really they should not contribute to the AF (since the deleted allele would never be observed in that case), but currently they do. I'll look into changing the way that long deletion AF and DP are calculated; thanks for making me think about it.
For insertions, there is "Alternate Allele Frequency" which is allele frequency adjusted to compensate for the fact that insertions will not be called on reads that do not span the full event. So if you have a 50bp insertion and 100bp reads, only 1/3rd of the reads with bases in the inserted region would actually span the full insertion and thus possibly yield an insertion call, so a 100% true allele frequency might be reported as 33%. "Alternate Allele Frequency" would adjust the 33% observed back up to 100% based on the relative average read length and insertion length.

Last edited by Brian Bushnell; 10-04-2017 at 02:23 PM.
Brian Bushnell is offline   Reply With Quote
Old 10-05-2017, 04:40 AM   #566
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 310
Default

Quote:
Originally Posted by Brian Bushnell View Post
Hi Dario and Genomax,

I though I had gotten rid of that a while ago, but I guess not - I'll investigate and fix it. It's harmless and due to a race condition for a thread finishing after it was prematurely shut down because enough reads were generated, but it looks scary.
Hi Brain, thanks for feedback. I thought it was harmless the problem is that I have a pipeline (snakemake) that will "refuse" to proceed on receiving this error.
dariober is offline   Reply With Quote
Old 10-05-2017, 04:48 AM   #567
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 310
Default

Sorry... another issue.

I have a reference genome where the header lines contain spaces, like this:

Code:
>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38
NNNNN...
After mapping with bbmap I get a sam header where the SN field contains the full fasta header:

Code:
cat -vet test.sam | head

@HD^IVN:1.4^ISO:unsorted$
@SQ^ISN:chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38^ILN:248956422$
the sam records contain only the first word, i.e. "chr1", "chr2" etc. This causes problems to downstream tools since "chr1 AC:CM000663.2 ..." is not a valid reference name.

I think the printing of the sequence name in the sam header should strip everything after the first blank space?
dariober is offline   Reply With Quote
Old 10-05-2017, 05:50 AM   #568
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,585
Default

You can choose to truncate the names at the first space during mapping using "trd=t".

Since you already have sam/bam files you can use reformat.sh to take care of the spaces using

Code:
trimrname=f             For sam/bam files, trim rname/rnext fields after the first space.
GenoMax is offline   Reply With Quote
Old 10-05-2017, 01:35 PM   #569
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

reformat.sh has an option "underscore" which will change whitespace in sequence headers into underscores, if the extra information is important. Alternatively, as Genomax says, you can use "trimrname". Generally I don't like to alter sequence headers by default because that could cause the introduction of non-unique headers, which would break all tools, rather than just some tools
Brian Bushnell is offline   Reply With Quote
Old 10-05-2017, 11:37 PM   #570
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 310
Default

Quote:
Originally Posted by Brian Bushnell View Post
I don't like to alter sequence headers by default because that could cause the introduction of non-unique headers
Hi Brain- Thanks for the comments (and for the tools of course!). I see your point but I still think the current default behaviour of bbmap is incorrect when sequence names have spaces because it creates an inconsistency between contig names in the sam header versus the names in the sam records.

Since spaces in names are not allowed by the sam format, I would expect bbmap either to silently trim at the first space (which is not nice...!) or fail straightaway with noting that if you really want to proceed you have to enable the tdr=t option. Just a thought...
dariober is offline   Reply With Quote
Old 10-06-2017, 05:04 AM   #571
Gopo
Member
 
Location: Louisiana

Join Date: Nov 2013
Posts: 15
Default

Hi Brian,
I really like callvariants.sh (simulated di-allelic SNPs and at both low (4x) and moderate (60x) sequencing coverage, filtering to keep only variants with quality score of >=Q27, callvariants.sh is more accurate than using GATK 3.7 with their recommended variant filtering). One thing though that I was wondering about, was whether it would be possible to have an option that sets variants that fail for some individuals (in multisample mode) to a missing genotype call "./." rather than whatever genotype is called by callvariants.sh
Best,
Gopo
Gopo is offline   Reply With Quote
Old 10-09-2017, 02:57 PM   #572
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by Gopo View Post
Hi Brian,
I really like callvariants.sh (simulated di-allelic SNPs and at both low (4x) and moderate (60x) sequencing coverage, filtering to keep only variants with quality score of >=Q27, callvariants.sh is more accurate than using GATK 3.7 with their recommended variant filtering). One thing though that I was wondering about, was whether it would be possible to have an option that sets variants that fail for some individuals (in multisample mode) to a missing genotype call "./." rather than whatever genotype is called by callvariants.sh
Best,
Gopo
Hi Gopo,

Yes, I will add that (as an option). Is that common practice in other variant-callers? Note that callvariants.sh does currently have a "PF" (pass filter) field per sample, but I want to make things as simple and compatible as possible.
Brian Bushnell is offline   Reply With Quote
Old 10-09-2017, 04:36 PM   #573
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,585
Default

@Brian: K-mer module in FastQC only tracks 2% of the data (1 in 50 sequences)
GenoMax is offline   Reply With Quote
Old 10-10-2017, 03:33 AM   #574
Gopo
Member
 
Location: Louisiana

Join Date: Nov 2013
Posts: 15
Default

Quote:
Originally Posted by Brian Bushnell View Post
Hi Gopo,

Yes, I will add that (as an option). Is that common practice in other variant-callers? Note that callvariants.sh does currently have a "PF" (pass filter) field per sample, but I want to make things as simple and compatible as possible.
Hi Brian,
GATK does output "./." for individuals where a genotype can't be reliably called at a particular SNP (can't remember if freebayes and/or ANGSD does this too). What is the flag for "PF" (pass filter) field per sample?

Is there a flag for filtering --min-alleles 2 --max-alleles 2? I ask because I normally do this with VCFtools, but with big VCF files, this takes a while.

Thank you,
Gopo
Gopo is offline   Reply With Quote
Old 10-25-2017, 10:21 AM   #575
bio_d
Junior Member
 
Location: USA

Join Date: Oct 2017
Posts: 5
Default aligning Illumina mate-pair library using BBMAP

Code:

bbmap.sh rcomp=t rcs=f in=read1.fq in2=read2.fq out=mapped.sam

Is this the correct way to align illumina mate pair(5.2Kb)/ long mate pair(10Kb) libraries.

No matter what combination of the flags rcs=t/f rcomp=t/f I use the standard error file shows that "Processing reads in paired-ended mode.". I can't understand how to use the flags correctly because in the UsageGuide it is suggested that one should use requirecorrectstrand=f (rcs=f) and rcomp=t for long mate pair. However, since I am getting insert size of 2611.48 when I am expecting something around 5200bp (predicted 5.2Kb Illumina mate pair library from sequencing center) it is most likely the flags are overridden by the default values which are rcs=t and rcomp=f (which is why I presume bbmap is processing reads in the paired-end mode).

Could you please help.

Best,
D

Last edited by bio_d; 10-25-2017 at 12:13 PM. Reason: Adding more information to the post
bio_d is offline   Reply With Quote
Old 11-03-2017, 08:38 AM   #576
santiagorevale
Member
 
Location: UK

Join Date: Dec 2016
Posts: 17
Default

Hi Brian,

Would it be possible to add to bbduk the capability of filtering also the Illumina Index file? Just to clarify what I mean, Illumina produces the following 3 output files for a typical pair-end single indexed sample:

Read 1 : Sample_S1_L001_R1_001.fastq.gz
Read 2 : Sample_S1_L001_R2_001.fastq.gz
Index 1 : Sample_S1_L001_I1_001.fastq.gz

When I'm filtering using bbduk, I specify in1=read1 in2=read2 out1=filtered.read1 out2=filtered.read2.

What I would like is to be able to add "in3=index1 out3=filtered.index1". I don't think any quality filtering should be applied to this file except only outputting only the sequences that passed the quality filtering for reads 1/2.

Currently, what I'm doing now is extracting the IDs of the filtered file and then filtering the index1 file by these IDs, which is very inefficient.

Many new protocols (mainly for single-cell analysis) are using these 3 Illumina files as input in their pipelines. So I think it could be a good addition.

Thank you very much in advance!

Cheers,
S
santiagorevale is offline   Reply With Quote
Old 11-03-2017, 08:50 AM   #577
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,585
Default

Can you give examples of programs you are referring to? I assume the index read here is not the standard Illumina index but some sort of UMI?

BTW: Generation of index read in a separate file is an atypical application which requires a separate option for the bcl2fastq pipeline.

Last edited by GenoMax; 11-03-2017 at 08:52 AM.
GenoMax is offline   Reply With Quote
Old 11-03-2017, 09:04 AM   #578
santiagorevale
Member
 
Location: UK

Join Date: Dec 2016
Posts: 17
Default

Hi Genomax,

I'm using 10Xs cellranger pipeline.

In this particular configuration, they include both "cell barcode" and "UMI" in Read 1 while the current RNA data is encoded in Read 2. The Index 1 file is used as part of the demultiplexing process because 10X uses 4 Illumina indexes per sample instead of just one.

One additional question: in this scenario, Read 1 is only 26 bp while Read 2 is 75 bp. I want to apply a basic filtering (only maxns) on Read 2, is there a way of doing that? I've tried skipr1=t but that only applies to kmer opperations.

Thanks again.
santiagorevale is offline   Reply With Quote
Old 11-03-2017, 09:13 AM   #579
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,585
Default

If you are using the cellranger pipeline then you should not need to do anything to the data outside the pipeline. Cellranger pipeline takes in raw Illumina flowcell data folder and does the entire analysis (including demultiplexing/alignments/counting etc).

Last edited by GenoMax; 11-03-2017 at 09:17 AM.
GenoMax is offline   Reply With Quote
Old 11-03-2017, 09:21 AM   #580
santiagorevale
Member
 
Location: UK

Join Date: Dec 2016
Posts: 17
Default

cellranger pipeline is splitted in two steps: "cellranger mkfastq" which produces the three above mentioned files (R1, R2, I1) and "cellranger count" which uses this 3 files as input.

The reason why I need to do the filtering first is that some platforms fail sometimes when calling reads from the TOP surface of the flowcell for Read 2, resulting in reads which will be fully or partly composed by Ns. Then, "cellranger count" will use all of the reads (including those that are 100% Ns) to do their calculations, ending up with aberrant metrics (like "mean reads per cell", which is determined as the raw number of reads divided by the number of called cells).
santiagorevale is offline   Reply With Quote
Reply

Tags
bbmap, metagenomics, rna-seq aligners, short read alignment

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:59 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO