SEQanswers

Go Back   SEQanswers > Sequencing Technologies/Companies > Illumina/Solexa



Similar Threads
Thread Thread Starter Forum Replies Last Post
Genome Res De novo bacterial genome sequencing: millions of very short reads assembly b_seite Literature Watch 1 10-05-2017 12:26 AM
Targeted Genome Assembly for region poorly represented in reference genome? gumbos Bioinformatics 1 01-09-2012 05:01 PM
PubMed: High-throughput sequencing of complete human mtDNA genomes from the Philippin Newsbot! Literature Watch 0 12-15-2010 11:20 AM
Problems with de novo transcriptome assembly using 454 Titanium series ankitgupta.iitg Bioinformatics 1 10-30-2009 12:15 PM

Reply
 
Thread Tools
Old 06-10-2013, 02:53 PM   #1
leftisthominid
Member
 
Location: Florida

Join Date: Apr 2013
Posts: 13
Default mtDNA genome assembly problems

Hello, we recently paired sequenced a 94 mtDNA libraries on a Illumina HiSeq and I am now working on assembling the sequences, determining coverage, etc.

For the most part I have not done a lot of QC as FASTQC results had some quirks but my sequence coverage is high enough (and the problems weren't that bad) (http://www.bioinformatics.babraham.a...ojects/fastqc/) but I have fastq-mcf to trim some adapters on some sequences and I've used fastq-join to join joinable paired ends (both from http://code.google.com/p/ea-utils/). I cat'ed the joined and unjoined reads into single fastq files for each sample.

I then began using Mapping Iterative Assembler. It is designed for mitochondrial assembly (https://github.com/udo-stenzel/mappi...tive-assembler). I have my samples in maln assembly format, and I am beginning to convert them in to ACE format. I am visualizing them with Tablet (http://bioinf.scri.ac.uk/tablet/download.shtml).

I have noticed that lots of ****ty reads with large fake insertions that created gaps in my sequence (see attach screen grab). The consensus sequence still comes out normal and skips the reads with the fake insertions. I wanted to know, is this normal. The human mt-genome is about 16.5 kb, with all the gapped bases in my assemblies I have well over 40,000 characters. When I take out the -'s I have ~16.5k bases left.

I am trying to figure out if it has anything to do with either not doing QC beyond adapter trimming or if it has anything to do with the program I used to joined paired reads.

I am re-assembling some of my contamination controls that went through adapter trimming but without joining paired ends to see how that goes.

I am thinking about using a filtering software like fastq_quality_filter to cut out any reads with base qualities below 25 or so (http://hannonlab.cshl.edu/fastx_tool...y_filter_usage).

Any suggestions?

Thanks!

Thanks!
Attached Images
File Type: png snapshot.png (122.3 KB, 22 views)
leftisthominid is offline   Reply With Quote
Old 06-10-2013, 03:16 PM   #2
leftisthominid
Member
 
Location: Florida

Join Date: Apr 2013
Posts: 13
Default

Forgot to post above. The fastq's that went into Mapping Iterative Assembler generally had FASTQC results like this
[PASS] Basic Statistics
[PASS] Per base sequence quality
[PASS] Per sequence quality scores
[eith PASS or WARNING] Per base sequence content
[PASS] Per base GC content
[PASS] Per sequence GC content
[PASS] Per base N content
[WARNING] Sequence Length Distribution
[FAIL] Sequence Duplication Levels
[PASS] Overrepresented sequences
[WARNING] Kmer Content
leftisthominid is offline   Reply With Quote
Old 06-11-2013, 04:07 AM   #3
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

We didn't get anything like this for our mitochondrial sequencing, but the inputs were quite different. We did IonTorrent 316 from 2x PCR fragments, 400~2000x coverage, no clipping/trimming, bowtie2 mapping, samtools variant discovery, vcftools for generating sequence (modified to work properly with INDELs) -- the only insertions we observed in our Tablet output were true insertions.

FWIW, I noticed that SAMtools choked on finding insertions when provided with (by default) a coverage over about 400. Maybe you could try digital normalisation of your data.

Thinking about the sequence length issues, the paired-end joining seems a bit strange to me if you're going to be doing a paired-end mapping to the mitochondrial sequence. I can't imagine mapping tools will do well when provided with mixed single [joined] sequences and pairs.
gringer is offline   Reply With Quote
Old 06-11-2013, 08:31 AM   #4
Number6
Member
 
Location: NY

Join Date: Feb 2009
Posts: 20
Default

I'm no expert, but for what it's worth, our bioinformatics group often says that for assembly, too many reads can be a be a bad thing. I know they kick around the number that ~100 x coverage is around the sweet spot.
Number6 is offline   Reply With Quote
Old 06-11-2013, 10:11 AM   #5
leftisthominid
Member
 
Location: Florida

Join Date: Apr 2013
Posts: 13
Default

Quote:
Originally Posted by gringer View Post
We didn't get anything like this for our mitochondrial sequencing, but the inputs were quite different. We did IonTorrent 316 from 2x PCR fragments, 400~2000x coverage, no clipping/trimming, bowtie2 mapping, samtools variant discovery, vcftools for generating sequence (modified to work properly with INDELs) -- the only insertions we observed in our Tablet output were true insertions.

FWIW, I noticed that SAMtools choked on finding insertions when provided with (by default) a coverage over about 400. Maybe you could try digital normalisation of your data.

Thinking about the sequence length issues, the paired-end joining seems a bit strange to me if you're going to be doing a paired-end mapping to the mitochondrial sequence. I can't imagine mapping tools will do well when provided with mixed single [joined] sequences and pairs.
fastq-join outputs three files: a fastq with joined reads, a fastq from file 1 of the unjoinables, and a fastq from file 2 with the unjoinables. All three are effectively single end reads after the joining. The sequencing lab divided the samples by index and did most of the QC for us. I've seen the problem with our smallest "low" coverage sample (no adapter cleaning here, also smallest is in terms of bytes), with our second smallest "low" coverage sample (yes adapter cleaning here), and with our second largest "high" coverage sample (no adapter cleaning here). So regardless of which flow cell lane it came from or if I had to do adapter trimming the mess still emerges.

Right now the possible culprit I am looking at is fastq-join, but I am waiting on MIA to assemble.
leftisthominid is offline   Reply With Quote
Old 06-11-2013, 01:59 PM   #6
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

Quote:
Originally Posted by Number6 View Post
I'm no expert, but for what it's worth, our bioinformatics group often says that for assembly, too many reads can be a be a bad thing. I know they kick around the number that ~100 x coverage is around the sweet spot.
Yes, we realised this after discovering SAMtools' dislike for 1000x+ coverage.
gringer is offline   Reply With Quote
Old 06-13-2013, 10:41 AM   #7
leftisthominid
Member
 
Location: Florida

Join Date: Apr 2013
Posts: 13
Default

So, I ran unjoined sequences - problem remains
I used cutadapt to trim out short reads and <30 quality - problem remains
I used slightly stricter filter options in MIA (on non-cutadapt'd sequence) - problem remains

My consensus sequence seems sensible, but I don't know if there are errors in it from the misaligning sequence.

It is now seeming to me a lot of the misaligning sequence aren't horrible quality, they're just off-target reads.

Does anyone have any suggestions on how I can filter them out of my fastq (and then put what I have left into MIA)?
leftisthominid is offline   Reply With Quote
Old 06-14-2013, 10:08 AM   #8
leftisthominid
Member
 
Location: Florida

Join Date: Apr 2013
Posts: 13
Default

I used bowtie2 to filter out off-bait target and then loaded results got a little better, but not much.

Gringer, is that vcftools modification customized or is it something that was already made?
leftisthominid is offline   Reply With Quote
Old 01-08-2014, 09:25 PM   #9
SamH
Member
 
Location: Moscow, ID

Join Date: Sep 2010
Posts: 16
Default

I'm probably late to this party, but if you are still working on this problem, I've had very good luck doing the following:

1) screen duplicates, there are lots of tools, but I'm partial to (https://github.com/ibest/GRC_Scripts...plicates_PE.py) since I wrote part of it.

2) Second I clean the reads, usually pretty strictly using: https://bitbucket.org/izhbannikov/seqyclean

3) Third, I overlap the reads using Flash: https://wiki.gacrc.uga.edu/wiki/FLASH

4) And finally, I do the assembly with my program ARC, which is a hybrid mapping/de novo assembly approach developed for mitochondrial and similar types of assembly problems: https://github.com/ibest/ARC (make sure to "git checkout develop" since there are a lot of improvements which haven't been merged into the stable branch yet).

Using this approach I've been able to assemble chipmunk mitochondria from capture data, resolving most of 55 samples into a single 16k contig. I've also had good luck with ancient DNA (mammoth short, SE 454 reads), salamander, and exon capture.

I concur with the above, too high of coverage is not good. I use Newbler a lot and 20-60x seems to be the sweet spot there.
SamH is offline   Reply With Quote
Old 01-09-2014, 04:13 AM   #10
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

Quote:
Gringer, is that vcftools modification customized or is it something that was already made?
Sorry, missed that question. I did a custom modification, posting my modification as a patch on a bug page, but it doesn't seem like it's been picked up. I can attach that code if you want -- it also allows for incomplete VCF files (i.e. variant-only, and considerably smaller), assuming that anything that isn't in the VCF file is identical to the specified reference sequence.

Edit: or you could look at my previous post on this:

http://seqanswers.com/forums/showthr...628#post122628

Last edited by gringer; 01-09-2014 at 08:22 PM.
gringer is offline   Reply With Quote
Old 01-09-2014, 05:44 AM   #11
JackieBadger
Senior Member
 
Location: Halifax, Nova Scotia

Join Date: Mar 2009
Posts: 381
Default

Try this: http://nar.oxfordjournals.org/conten.../09/nar.gkt371

Mira also caps the total number of reads i believe
JackieBadger is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:33 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO