SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bowtie, an ultrafast, memory-efficient, open source short read aligner Ben Langmead Bioinformatics 513 05-14-2015 02:29 PM
Introducing BBMap, a new short-read aligner for DNA and RNA Brian Bushnell Bioinformatics 24 07-07-2014 09:37 AM
Miso's open source joyce kang Bioinformatics 1 01-25-2012 06:25 AM
Targeted resequencing - open source stanford_genome_tech Genomic Resequencing 3 09-27-2011 03:27 PM
EKOPath 4 going open source dnusol Bioinformatics 0 06-15-2011 01:10 AM

Reply
 
Thread Tools
Old 08-24-2017, 09:23 AM   #541
jweger1988
Member
 
Location: Paris, France

Join Date: Apr 2017
Posts: 34
Default

Brian,

Thanks for the splitsam, that worked great!

As for the other question by the Ns I meant read name I think? Sorry my grasp of nomenclature with these things is not great.

BBmap
NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG

Bowtie
NS500697:37:HH5V7BGXY:4:13502:9896:14279
jweger1988 is offline   Reply With Quote
Old 08-24-2017, 09:31 AM   #542
jweger1988
Member
 
Location: Paris, France

Join Date: Apr 2017
Posts: 34
Default

As for the comparison for lofreq and callvariants, I can never seem to get callvariants to call what look like good snps for my samples. Maybe I am using it wrong in some way.

The sensitivity parameters are a little trick to play with in lofreq. But with callvariants and lofreq with standard settings at rarity=0.01, callvariants calls completely different snps as compared to lofreq. After manual inspection of the bam file it looks like lofreq is more accurate. Could definitely be some paremeter issues though. I'm open to any suggestions at changing these, as I'm a big fan of the bbmap suite.

Here is a link to the vcf files and the script I'm running if you are interested.


https://drive.google.com/drive/folde...1k?usp=sharing

Thanks as always.
jweger1988 is offline   Reply With Quote
Old 08-24-2017, 10:42 AM   #543
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

Quote:
Originally Posted by jweger1988 View Post
Brian,

Thanks for the splitsam, that worked great!

As for the other question by the Ns I meant read name I think? Sorry my grasp of nomenclature with these things is not great.

BBmap
NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG

Bowtie
NS500697:37:HH5V7BGXY:4:13502:9896:14279
Yep, BBMap uses the full read name whereas Bowtie truncates the read name at the first whitespace. As Genomax noted, you can add the flag "trd" (trimreaddescriptions) to BBMap to produce truncated names, you you can reprocess the already-mapped sam file with Reformat using the "trd" flag to do that also.

Thanks for the feedback on CallVariants! It's certainly odd that there seems to be nothing in common between the two results. Would it be possible for you to email me the bam and reference so I can examine it in more detail?

Last edited by Brian Bushnell; 08-24-2017 at 10:46 AM.
Brian Bushnell is offline   Reply With Quote
Old 08-24-2017, 04:11 PM   #544
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

Unfortunately, it looks like your data is very low quality and has inaccurate quality scores, things we used to see on our NextSeqs until it eventually got better after various chemistry and software revisions. I recommend:

1) Adapter-trimming (you have >5% of reads with adapters)
2) Mapping
3) Recalibrating the raw reads
4) Adapter-trimming and quality-trimming the recalibrated reads, with "ftm=5"
5) Mapping again
6) Calling variants

E.g.
Code:
bbduk.sh in=reads.fq.gz out=trimmed0.fq.gz ktrim=r k=23 mink=11 hdist=1 minlen=100 ref=adapters
bbmap.sh in=trimmed0.fq.gz ref=ref.fa out=mapped0.sam nodisk qahist=qahist.txt qhist=qhist.txt mhist=mhist.txt
rm trimmed0.fq.gz
calctruequality.sh in=mapped0.sam
rm mapped0.sam
bbduk.sh in=reads.fq.gz out=recal.fq.gz
bbduk.sh in=recal.fq.gz out=trimmed.fq.gz qtrim=rl trimq=14 ktrim=r k=23 mink=11 hdist=1 minlen=100 ref=adapters
bbmap.sh in=trimmed.fq ref=ref.fa out=mapped.sam nodisk bs=bs.sh  qahist=qahist_rc.txt qhist=qhist_rc.txt mhist=mhist_rc.txt; sh bs.sh

Result of recalibration:
Before:



After:



However, I am not sure you will get anything useful out of this data. We have found that NextSeq runs are just not trustworthy for calling variants at a frequency of below 5%. I looked at lofreq's highest-quality variant, with a Q-score of 1392, and it was seen in 2 reads, with a Q-score of 14 in each case, once within 3 bp of the read end (probably in adapter sequence because there are 2 mismatches and an insertion next to it there) and both reads had a lot of other mismatches. That is the the highest quality variation, and it has zero chance of being correct.

The reason you were not getting any variant calls with CallVariants except at the ends is because for some reason you have stranded data; all the reads only map to the minus strand. In that situation you need to set "usebias=f" and "minstrandratio=0" to turn off strand-bias filters. It still won't call much, though. 95 variants are called using this command (on the trimmed recalibrated reads):

Code:
callvariants.sh in=recal.sam.gz ref=ref.fa rarity=0.01 out=foo3.txt vcf=foo3.vcf minreads=2 minscore=10 usebias=f minstrandratio=0
But only 12 are above Q20 which is my normal cutoff (default is minscore=20, which I recommend), so I'd suggest perhaps looking at those. Maybe they're real. The insertion at 9225 looks plausible.
Attached Images
File Type: png before.png (21.5 KB, 30 views)
File Type: png after.png (24.8 KB, 30 views)
Brian Bushnell is offline   Reply With Quote
Old 08-25-2017, 12:28 AM   #545
jweger1988
Member
 
Location: Paris, France

Join Date: Apr 2017
Posts: 34
Default

Brian,

Wow, that's very surprising to me.

Quote:
things we used to see on our NextSeqs until it eventually got better after various chemistry and software revisions
Are you still currently using a nextseq? Do you recommend a MiSeq or a Hiseq instead? I assume by chemistry and software revisions you mean those put out by Illumina? I wonder if we don't have the latest software?

Any suggestion to improve this in the future?
jweger1988 is offline   Reply With Quote
Old 08-25-2017, 03:07 AM   #546
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,435
Default

A MiSeq/HiSeq generally generates better data so if you do have the option of using one of those then do switch. Especially for lowfreq SNP calls.
GenoMax is offline   Reply With Quote
Old 08-25-2017, 03:47 AM   #547
jweger1988
Member
 
Location: Paris, France

Join Date: Apr 2017
Posts: 34
Default

Quote:
Originally Posted by GenoMax View Post
A MiSeq/HiSeq generally generates better data so if you do have the option of using one of those then do switch. Especially for lowfreq SNP calls.
Thanks Genomax.

I'm just curious why Brian said the data was really low quality. We are using a Nextseq 500 and V2 chemistry. The Q30>30 was 80% and it was actually underclustered.

We only have access to Nextseq at the moment, is there anything we can do to improve this or is it just how it is with Nextseq?
jweger1988 is offline   Reply With Quote
Old 08-25-2017, 10:13 AM   #548
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

Quote:
Originally Posted by jweger1988 View Post
I'm just curious why Brian said the data was really low quality. We are using a Nextseq 500 and V2 chemistry. The Q30>30 was 80% and it was actually underclustered.
I would not use the "Q30" number for anything; I consider it marketing fluff. The only way to determine data quality is to actually map it and count the number of sequencing errors. You can see from the graph that when you do that, the average starts at Q11 (which is incredibly low and very unusual), goes up to a peak of just under 30, and drops to Q12 at the end, which is also very low. At no point does the average quality even reach Q30.

You can see this visually in IGV - the screen is always littered with random blips indicating indels or mismatches to the reference, which is very unusual for Illumina data (at least, HiSeq 2500 Illumina data). The screen should look very clean except where mismatches to the reference form an obvious column, indicating a real mutation.

Quote:
We only have access to Nextseq at the moment, is there anything we can do to improve this or is it just how it is with Nextseq?
You might want to make sure you are using the latest chemistry and software, and talk to your Illumina rep about your concerns with the quality. We are still using NextSeq routinely, and the quality is usually much better than that, but I don't know the reason.

I don't think the platform is appropriate for low-frequency variant-calling anyway, but on any platform, I'd recommend paired-end reads for that (preferably overlapping so they can be merged), and definitely not strand-specific protocols, which incur bias (particularly, certain genomic features, read in a certain direction, are prone to miscalls). How did you end up with only minus-strand reads, anyway?
Brian Bushnell is offline   Reply With Quote
Old 09-06-2017, 06:49 AM   #549
boulund
Member
 
Location: Sweden

Join Date: Jan 2017
Posts: 10
Default

Hi,

I have a question about statswrapper.sh. It says in the help text description that it supposedly runs stats.sh on multiple assemblies to produce one output line per file, but I only get a concatenation of regular stats.sh outputs, making it far less useful to me than I was hoping for. Is this the intended behavior or am I misunderstanding the help text formulation somehow?

I was hoping to get some kind of overviewable summary table of all files in a single file, with one output line per file.
boulund is offline   Reply With Quote
Old 09-06-2017, 07:54 AM   #550
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,435
Default

@boulund: Can you include the exact starswrapper.sh command you are running?
GenoMax is offline   Reply With Quote
Old 09-06-2017, 09:18 AM   #551
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

Hi Boulund,

Please add the flag "format=3" to get one line per file.
Brian Bushnell is offline   Reply With Quote
Old 09-06-2017, 09:05 PM   #552
boulund
Member
 
Location: Sweden

Join Date: Jan 2017
Posts: 10
Default

Oh, that was easy. Thanks a lot for your quick response, both of you. Sorry for not spotting that option!

Quote:
Originally Posted by GenoMax View Post
@boulund: Can you include the exact starswrapper.sh command you are running?

The command I used was simple:
Code:
statswrapper.sh in=assembly_1.fasta in=assembly_2.fasta
With the "format=3" option it produces exactly the output I was expecting.
boulund is offline   Reply With Quote
Old 09-07-2017, 07:58 AM   #553
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 291
Default

I am sorry I can't find the info - I am sure it is somewhere.

How does BBmap define insert size ( e.g of the insert size histograms)?
Distance between start point of the reads (forward and reverse) or between the ends of the reads?

#############
I just checked with a test file; as expected the insert size seems to be the distance between the start point of the reads.

Last edited by luc; 09-07-2017 at 08:52 AM.
luc is offline   Reply With Quote
Old 09-07-2017, 10:02 AM   #554
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

Hi Luc,

To make this very clear, it's the estimate of the size of sequenced genomic molecule. So, for example, if you have 2 reads like this, where R1 maps to the plus strand and R2 maps to the minus strand:

Code:
   111111111111111
-------------------------------------------------------------------
                                          22222222222222222
...then it would generally be the distance from the leftmost 1 to the rightmost 2. But if either read contains insertions, those are added to the insert size, and deletions are subtracted from the insert size. Otherwise, the insert sizes would be dramatically overestimated in spliced RNA-seq data.
Brian Bushnell is offline   Reply With Quote
Old 09-07-2017, 10:34 AM   #555
darthsequencer
Member
 
Location: california

Join Date: Feb 2012
Posts: 29
Default

Quote:
Originally Posted by Brian Bushnell View Post
"ambiguous=best" is a bit misleading, but it means the genomically first location with a maxmimum score will be used. "ambiguous=all" will report all locations within the ambiguity threshold of the first. This does not mean they need exactly the same score; it means that they are very close, so much so that none can be confidently determined to be the correct mapping location. Normally they're identical, but if for example one mapping had a single 1bp deletion and another mapping had two 1bp substitutions, the scores would be different, but would be close enough to be both reported. But if there was a third potential mapping with, say, 5 substitutions, that would be excluded. This can be controlled with the "secondarysitescoreratio" flag; if you set it to 1.0, only mappings with identical scores to the best score will be reported.
Hi - So if I set secondarysitescoreratio to 1.0 do I need to set secondary = T if ambiguous=all and increase the maxsites setting?

I guess what I'm really asking is what do I need to set to ensure that I'm getting all mapping locations of read with all the same top score.
darthsequencer is offline   Reply With Quote
Old 09-07-2017, 11:00 AM   #556
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

That's correct for single-ended reads, but I do not recommend setting "secondarysitescoreratio = 1.0" for paired reads because one pairing is currently slightly favored over others due to the difficulty of tracking all possible pairings. So, just set "secondary=t ambig=all maxsites=100". In other words, it's not currently possible to ensure you get exactly and only all reads that have the exact same (best) alignment score unless they are processed unpaired.
Brian Bushnell is offline   Reply With Quote
Old 09-07-2017, 03:42 PM   #557
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 291
Default

Hi Brian,

thanks for the details -- there is always a another twist to it.

Quote:
Originally Posted by Brian Bushnell View Post
Hi Luc,

To make this very clear, it's the estimate of the size of sequenced genomic molecule. So, for example, if you have 2 reads like this, where R1 maps to the plus strand and R2 maps to the minus strand:

Code:
   111111111111111
-------------------------------------------------------------------
                                          22222222222222222
...then it would generally be the distance from the leftmost 1 to the rightmost 2. But if either read contains insertions, those are added to the insert size, and deletions are subtracted from the insert size. Otherwise, the insert sizes would be dramatically overestimated in spliced RNA-seq data.
luc is offline   Reply With Quote
Reply

Tags
bbmap, metagenomics, rna-seq aligners, short read alignment

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:59 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO