SEQanswers

Old 07-08-2014, 12:03 PM   #1
arundurvasula
Member
 
Location: California

Join Date: Jun 2014
Posts: 16
Extend contigs with a reference sequence

Hi,

I'm trying to extend my contigs to cover entire viral genomes. When I align my reads (1.5M) to a viral genome (2.6 kb), I can get full coverage of it and export a consensus sequence. When I assemble these reads, I get 2 contigs that cover about half of the viral reference sequence. I can scaffold these contigs with scaffold_builder (http://edwards.sdsu.edu/scaffold_builder/), but I am missing two large sequences at the beginning and middle of the genome (brackets denote missing sequence):
Code:
reference: -----------------------------------
contigs:   [      ]--------------[    ]-------
I know that my reads cover this sequence, but I think that they didn't make it into any contigs so the scaffold program can't use them to fill in the gaps.

My plan is to do the following: combine the reads that map to the reference with the contigs and use the scaffold builder on the contigs and reads together to fill in the gaps. This pipeline is running now, but I wanted to see if anyone has any other ideas about what to do?
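A rough sketch of that plan in shell; file names are placeholders and the scaffold_builder invocation is an assumption (check the options for your version):

```shell
# Sketch: extract the reads that map to the reference, combine them with
# the assembled contigs, and scaffold everything together.
# ref.fa, reads.fq, and contigs.fa are placeholder names.
bwa index ref.fa
bwa aln ref.fa reads.fq > reads.sai
bwa samse ref.fa reads.sai reads.fq | samtools view -b -F 4 - > mapped.bam  # -F 4 drops unmapped reads
samtools fasta mapped.bam > mapped_reads.fa
cat contigs.fa mapped_reads.fa > combined.fa
python scaffold_builder.py -q combined.fa -r ref.fa -p scaffolds  # placeholder invocation
```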

Thanks!
Old 07-08-2014, 12:37 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

I would guess that the genome has repeats which do not assemble well. Scaffolding and gap-filling with paired reads depend on the gaps being small enough and the read insert sizes being long enough.

If you can't solve the problem computationally, you might need a new library with longer reads or longer insert sizes. But before doing more sequencing, I would suggest trying different assemblers - many perform better on specific scenarios.

Also, you might try a gap-filling algorithm. The only one I'm familiar with is PBJelly, which is designed for long reads (like PacBio), but there are many others.
Old 07-08-2014, 12:45 PM   #3
arundurvasula
Member
 
Location: California

Join Date: Jun 2014
Posts: 16

Hi Brian, thanks for the reply. The virus I'm trying to assemble is this one: http://www.ncbi.nlm.nih.gov/nuccore/...report=GenBank and it doesn't seem to have many repeats. Our reads are single-end, about 50 bp. I've used idba_ud and Velvet to assemble this genome and got slightly longer contigs from idba_ud; most other assemblers I've tried gave poorer results.
Old 07-08-2014, 01:03 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

50 bp single-end reads are not very good for assembly; I would suggest moving to longer reads, paired ends, and a longer kmer if you want a quality assembly. It's probably cheaper to resequence and assemble from longer reads than to spend a lot of time generating an inferior assembly from very short reads.
Old 07-09-2014, 04:52 PM   #5
Matt Kearse
Member
 
Location: New Zealand

Join Date: Mar 2014
Posts: 20

For most assemblies, unpaired 50 bp reads aren't that great. But in your case the genome is tiny and the longest repeat in it is only 12 bp, so 50 bp reads should be fine. I created around 2,000 simulated 50 bp reads from your genome with a 10% error rate, and Geneious assembles them fine into a single contig.

Maybe the problem is that you have too much coverage. Try using a small subset of your data of just a few thousand reads.

If that doesn't help, then maybe the error rate in your data is too high. In that case you could try Geneious since it works well with high error rates.
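A minimal sketch of the subsetting suggestion above, assuming single-end reads in a FASTQ file (reads.fq is a placeholder name; a tool such as seqtk sample does this more conveniently):

```shell
# Keep roughly N reads from a single-end FASTQ by sampling each 4-line
# record with probability N/total.
N=2000
TOTAL=$(( $(wc -l < reads.fq) / 4 ))
P=$(awk -v n="$N" -v t="$TOTAL" 'BEGIN { print n / t }')
awk -v p="$P" 'BEGIN { srand(42) } NR % 4 == 1 { keep = (rand() < p) } keep' reads.fq > subset.fq
```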
Old 07-10-2014, 12:20 PM   #6
arundurvasula
Member
 
Location: California

Join Date: Jun 2014
Posts: 16

Thank you for the in-depth help! I don't have Geneious, but if I get it I will try that. One thing I've been playing around with is generating a consensus sequence directly from my reads instead of assembling them. I used bwa and samtools to look at the distribution of my reads, and the reads definitely cover the whole genome. Then I did this:
Code:
samtools mpileup -uf $refseq $alignmentmappedsortedbam | bcftools view -cg - | vcfutils.pl vcf2fq > $consensus
and got a consensus fastq file.

My question is: is there any reason not to do this instead of de novo assembly? I realize I need a reference genome for this to work, but I can recover the full sequence this way and not with de novo assembly (I haven't tried using a subset of the reads yet).

I'm thinking of making a pipeline where I do de novo assembly of the reads -> blast contigs -> get reference sequences -> map reads to reference -> obtain consensus. Does that workflow make sense?
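As a sketch, that workflow might look like the following in shell; tool choices and file names are illustrative, and step 3 is left as a manual step (e.g. via NCBI download):

```shell
# 1. de novo assembly (reads.fa = input reads)
idba_ud -r reads.fa -o assembly
# 2. identify the contigs by BLAST against NCBI nt
blastn -query assembly/contig.fa -db nt -remote -outfmt 6 > hits.tsv
# 3. fetch the best-hit reference sequence(s) into ref.fa (manual step here)
# 4. map the reads to the reference
bwa index ref.fa
bwa aln ref.fa reads.fq > reads.sai
bwa samse ref.fa reads.sai reads.fq | samtools view -bS - > aln.bam
samtools sort aln.bam aln.sorted
# 5. call the consensus, as in the mpileup command above
samtools mpileup -uf ref.fa aln.sorted.bam | bcftools view -cg - | vcfutils.pl vcf2fq > consensus.fq
```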

The reason for doing this instead of resequencing with longer reads is that I already have a lot of 50 bp reads. Eventually (hopefully soon) we will move to longer paired-end reads, though.

Last edited by arundurvasula; 07-10-2014 at 12:22 PM.
Old 07-10-2014, 01:52 PM   #7
Matt Kearse
Member
 
Location: New Zealand

Join Date: Mar 2014
Posts: 20

Most likely the problem is the excessive level of coverage you've got. Most de novo assemblers (Geneious included) won't work well with that much coverage: the same random sequencing error ends up occurring so many times that the assembler treats it as a real variant. A few thousand reads are more than enough for your genome.

If your reference sequence is close enough to your sequenced organism, then mapping is fine. But with a genome as small as yours, with no significant repeats and an appropriate level of coverage, I'd have more confidence in the de novo assembly results than in the mapping results. If you apply both methods and get the same consensus sequence, that should give you even more confidence that you've got the correct result.
Old 07-10-2014, 02:53 PM   #8
arundurvasula
Member
 
Location: California

Join Date: Jun 2014
Posts: 16

Cool, I'm working on using fewer reads now. One problem is that my reads are not evenly distributed:

[coverage plot image not preserved]

The y axis is position along the genome and the x axis is the number of reads at that base position.

I'm examining the results of subsampling my reads here: https://github.com/arundurvasula/cov...depth-assembly

I think I need a way to sample the reads while still maintaining full coverage. If I drop down to 50x coverage, I get something like this:

[coverage plot image not preserved]

which is expected, but I don't think I will be able to get one contig from my reads.
Old 07-10-2014, 04:24 PM   #9
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

With highly irregular coverage distributions, you should probably look to normalization rather than subsampling. There's an example of how to do that here:

http://seqanswers.com/forums/showthread.php?t=44494
Old 07-11-2014, 10:22 AM   #10
arundurvasula
Member
 
Location: California

Join Date: Jun 2014
Posts: 16

That looks like exactly what I was looking for, thank you Brian!
Old 07-11-2014, 10:27 AM   #11
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

You're welcome; please let me know if it solves the problem.
Old 07-11-2014, 11:59 AM   #12
arundurvasula
Member
 
Location: California

Join Date: Jun 2014
Posts: 16

So I ran it on my data and looked at the coverage using the same method (left is subsampling, right is normalization):

10x:
[comparison plot not preserved]

50x:
[comparison plot not preserved]

So normalization is a lot better, but there's still highly variable coverage.

The script I used to run these analyses is here: https://github.com/arundurvasula/cov...rmalization.sh

Notably, I'm using
Code:
bash bbnorm.sh in=$reads out=../temp/$1/normalized_reads.fasta target=$2 -Xmx2g
to run bbnorm. What options can I play with to make the coverage more even?

Also, am I calculating coverage correctly? I saw that target refers to kmer depth and that there's a formula to convert kmer depth to read depth, so I did Dr = Dk * (50 / (50 - 31 + 1)) and calculated Dk for 5x, 10x, 20x, 30x, 40x, and 50x.
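Solving that conversion for the bbnorm target gives Dk = Dr * (50 - 31 + 1) / 50. A quick check of the targets (plain arithmetic, nothing BBNorm-specific; L and k are the read and kmer lengths from the posts above):

```shell
# Kmer-depth target Dk for a desired read depth Dr, with L=50, k=31:
# Dk = Dr * (L - k + 1) / L
for Dr in 5 10 20 30 40 50; do
  awk -v dr="$Dr" 'BEGIN { L = 50; k = 31; printf "Dr=%sx -> Dk=%g\n", dr, dr * (L - k + 1) / L }'
done
# e.g. a desired read depth of 50x corresponds to a kmer-depth target of 20
```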

Last edited by arundurvasula; 07-11-2014 at 12:02 PM.
Old 07-11-2014, 12:26 PM   #13
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Because your reads are so short, you can try reducing the kmer length to perhaps 21 with the "k=21" flag. The coverage profile after normalization is currently way too spiky, and too high: it should be almost perfectly flat and very close to the target level. So normalization does not seem to be working well, which probably indicates a high error rate or reads from slightly different genomes.

You can also try error correction with the "ecc" flag, and throwing away low-quality reads, since you have excess coverage. In summary:
Code:
bash reformat.sh in=reads.fq out=highQuality.fq maq=16
(that will throw away all reads with average quality below 16)
Code:
bash bbnorm.sh in=$reads out=../temp/$1/normalized_reads.fasta target=$2 -Xmx2g ecc k=21

And while this may bias the assembly, you could also try throwing away all reads that don't map to the reference within some edit distance, prior to normalization.