| Thread | Thread Starter | Forum | Replies | Last Post |
| --- | --- | --- | --- | --- |
| Genome Res De novo bacterial genome sequencing: millions of very short reads assembly | b_seite | Literature Watch | 1 | 10-05-2017 12:26 AM |
| Targeted Genome Assembly for region poorly represented in reference genome? | gumbos | Bioinformatics | 1 | 01-09-2012 05:01 PM |
| PubMed: High-throughput sequencing of complete human mtDNA genomes from the Philippin | Newsbot! | Literature Watch | 0 | 12-15-2010 11:20 AM |
| Problems with de novo transcriptome assembly using 454 Titanium series | ankitgupta.iitg | Bioinformatics | 1 | 10-30-2009 12:15 PM |
#1
Member
Location: Florida
Join Date: Apr 2013
Posts: 13
Hello, we recently paired-end sequenced 94 mtDNA libraries on an Illumina HiSeq, and I am now working on assembling the sequences, determining coverage, etc.

For the most part I have not done a lot of QC, as the FastQC results had some quirks but my sequence coverage is high enough and the problems weren't that bad (http://www.bioinformatics.babraham.a...ojects/fastqc/). I have, however, used fastq-mcf to trim adapters from some sequences and fastq-join to join the joinable paired ends (both from http://code.google.com/p/ea-utils/). I cat'ed the joined and unjoined reads into a single fastq file for each sample.

I then began using Mapping Iterative Assembler (MIA), which is designed for mitochondrial assembly (https://github.com/udo-stenzel/mappi...tive-assembler). I have my samples in maln assembly format and am beginning to convert them into ACE format, which I am visualizing with Tablet (http://bioinf.scri.ac.uk/tablet/download.shtml).

I have noticed lots of poor-quality reads with large spurious insertions that create gaps in my sequence (see attached screen grab). The consensus sequence still comes out normal and skips the reads with the spurious insertions. Is this normal? The human mt-genome is about 16.5 kb, but with all the gapped bases in my assemblies I have well over 40,000 characters. When I take out the -'s I have ~16.5 kb left.

I am trying to figure out whether this has anything to do with not doing QC beyond adapter trimming, or with the program I used to join paired reads. I am re-assembling some of my contamination controls that went through adapter trimming but without joining paired ends to see how that goes. I am also thinking about using filtering software like fastq_quality_filter to cut out any reads with base qualities below 25 or so (http://hannonlab.cshl.edu/fastx_tool...y_filter_usage).

Any suggestions? Thanks!
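For reference, here is a minimal sketch of the kind of preprocessing described above. The file names (adapters.fa, sample_R1.fastq, sample_R2.fastq) and the flag values are hypothetical placeholders, not the parameters actually used in this thread:

```bash
# Trim adapters with fastq-mcf (ea-utils); adapters.fa is a placeholder adapter list.
fastq-mcf adapters.fa sample_R1.fastq sample_R2.fastq \
    -o sample_R1.trim.fastq -o sample_R2.trim.fastq

# Join overlapping pairs with fastq-join; "%" expands to join/un1/un2.
fastq-join sample_R1.trim.fastq sample_R2.trim.fastq -o sample.%.fastq

# Pool joined and unjoined reads into a single fastq per sample for MIA.
cat sample.join.fastq sample.un1.fastq sample.un2.fastq > sample.pooled.fastq

# Optional FASTX-Toolkit quality filtering: keep reads where at least 80% of
# bases have quality >= 25; -Q33 assumes Sanger/Illumina 1.8+ quality encoding.
fastq_quality_filter -Q33 -q 25 -p 80 -i sample.pooled.fastq -o sample.filtered.fastq
```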
#2
Member
Location: Florida
Join Date: Apr 2013
Posts: 13
Forgot to post this above. The fastq files that went into Mapping Iterative Assembler generally had FastQC results like this:

[PASS] Basic Statistics
[PASS] Per base sequence quality
[PASS] Per sequence quality scores
[either PASS or WARNING] Per base sequence content
[PASS] Per base GC content
[PASS] Per sequence GC content
[PASS] Per base N content
[WARNING] Sequence Length Distribution
[FAIL] Sequence Duplication Levels
[PASS] Overrepresented sequences
[WARNING] Kmer Content
#3
David Eccles (gringer)
Location: Wellington, New Zealand
Join Date: May 2011
Posts: 838
We didn't get anything like this for our mitochondrial sequencing, but the inputs were quite different. We did IonTorrent 316 sequencing of 2x PCR fragments, 400-2000x coverage, no clipping/trimming, bowtie2 mapping, samtools variant discovery, and vcftools for generating the sequence (modified to work properly with INDELs) -- the only insertions we observed in our Tablet output were true insertions.

FWIW, I noticed that SAMtools (with its default settings) choked on finding insertions when given coverage over about 400. Maybe you could try digital normalisation of your data.

Thinking about the sequence length issues, the paired-end joining seems a bit strange to me if you're going to be doing a paired-end mapping to the mitochondrial sequence. I can't imagine mapping tools will do well when provided with a mix of single [joined] sequences and pairs.
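For readers who want to reproduce this kind of map-then-consensus workflow, a minimal sketch using the samtools 0.1.x-era commands is below. File names are placeholders, and the final step uses the stock vcfutils.pl consensus call rather than the INDEL-modified vcftools step mentioned above:

```bash
# Map reads to a mitochondrial reference with bowtie2 (placeholder file names).
bowtie2-build mt_ref.fasta mt_ref
bowtie2 -x mt_ref -1 sample_R1.fastq -2 sample_R2.fastq | \
    samtools view -bS - | \
    samtools sort - sample_sorted
samtools index sample_sorted.bam

# Call variants and emit a consensus FASTQ (classic samtools 0.1.x pipeline).
# -d raises the per-position depth cap, which matters at very high coverage.
samtools mpileup -uf mt_ref.fasta -d 1000 sample_sorted.bam | \
    bcftools view -cg - | \
    vcfutils.pl vcf2fq > sample_consensus.fastq
```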
#4
Member
Location: NY
Join Date: Feb 2009
Posts: 20
I'm no expert, but for what it's worth, our bioinformatics group often says that for assembly, too many reads can be a bad thing. The number they kick around is that ~100x coverage is around the sweet spot.
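If it helps, one hypothetical way to act on that is to downsample to a target coverage before assembly, e.g. with seqtk sample. The file names and read count below are purely illustrative (assuming a ~16.5 kb target and 2x100 bp reads):

```bash
# ~16,500 bp x 100x / 200 bp per pair ≈ 8,250 pairs for roughly 100x coverage.
# Using the same seed (-s) on both files keeps the mates in sync.
seqtk sample -s100 sample_R1.fastq 8250 > sub_R1.fastq
seqtk sample -s100 sample_R2.fastq 8250 > sub_R2.fastq
```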
#5
Member
Location: Florida
Join Date: Apr 2013
Posts: 13

Quote:
Right now the possible culprit I am looking at is fastq-join, but I am waiting on MIA to assemble. |
#6
David Eccles (gringer)
Location: Wellington, New Zealand
Join Date: May 2011
Posts: 838
Yes, we realised this after discovering SAMtools' dislike for 1000x+ coverage.
#7
Member
Location: Florida
Join Date: Apr 2013
Posts: 13
So, I have tried a few things:

- I ran unjoined sequences - the problem remains.
- I used cutadapt to remove short reads and bases with quality below 30 - the problem remains.
- I used slightly stricter filter options in MIA (on the non-cutadapt'd sequences) - the problem remains.

My consensus sequence seems sensible, but I don't know whether there are errors in it from the misaligning sequences. It now seems to me that a lot of the misaligning sequences aren't horrible quality; they are just off-target reads. Does anyone have suggestions on how I can filter them out of my fastq files (and then put what is left into MIA)?
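One possible way to do that filtering, sketched below with placeholder file names, is to map everything against the mtDNA reference, keep only the reads that map, and convert those back to FASTQ (samtools bam2fq needs a reasonably recent samtools; bedtools bamtofastq would be an alternative):

```bash
# Map the pooled reads against the mitochondrial reference (placeholder names).
bowtie2-build rCRS.fasta rCRS
bowtie2 -x rCRS -U sample.pooled.fastq -S sample_vs_mt.sam

# -F 4 drops unmapped (off-target) reads; convert the survivors back to FASTQ.
samtools view -bS -F 4 sample_vs_mt.sam > sample_ontarget.bam
samtools bam2fq sample_ontarget.bam > sample_ontarget.fastq
```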
#8
Member
Location: Florida
Join Date: Apr 2013
Posts: 13
I used bowtie2 to filter out off-bait-target reads and then loaded the results; they got a little better, but not much.

Gringer, is that vcftools modification something custom, or is it something that was already available?
#9
Member
Location: Moscow, ID
Join Date: Sep 2010
Posts: 16
I'm probably late to this party, but if you are still working on this problem, I've had very good luck doing the following:

1) Screen duplicates. There are lots of tools, but I'm partial to https://github.com/ibest/GRC_Scripts...plicates_PE.py since I wrote part of it.
2) Clean the reads, usually pretty strictly, using https://bitbucket.org/izhbannikov/seqyclean
3) Overlap the reads using FLASH: https://wiki.gacrc.uga.edu/wiki/FLASH
4) Do the assembly with my program ARC, a hybrid mapping/de novo assembly approach developed for mitochondrial and similar types of assembly problems: https://github.com/ibest/ARC (make sure to "git checkout develop" since there are a lot of improvements which haven't been merged into the stable branch yet).

Using this approach I've been able to assemble chipmunk mitochondria from capture data, resolving most of 55 samples into a single ~16 kb contig. I've also had good luck with ancient DNA (short, single-end 454 mammoth reads), salamander, and exon capture.

I concur with the above: too high a coverage is not good. I use Newbler a lot, and 20-60x seems to be the sweet spot there.
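As a concrete illustration of the overlap step (3), here is a minimal FLASH invocation with placeholder file names; the -m/-M values are illustrative and should be tuned to the actual read length and insert size:

```bash
# Overlap paired reads with FLASH (placeholder file names).
# -m: minimum overlap, -M: maximum overlap, -o: output prefix, -d: output directory.
flash -m 10 -M 150 -o sample -d flash_out sample_R1.fastq sample_R2.fastq

# FLASH writes sample.extendedFrags.fastq (merged pairs) plus
# sample.notCombined_1.fastq and sample.notCombined_2.fastq (unmerged pairs).
```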
#10
David Eccles (gringer)
Location: Wellington, New Zealand
Join Date: May 2011
Posts: 838

Quote:
Edit: or you could look at my previous post on this: http://seqanswers.com/forums/showthr...628#post122628

Last edited by gringer; 01-09-2014 at 08:22 PM.
#11
Senior Member
Location: Halifax, Nova Scotia
Join Date: Mar 2009
Posts: 381
Try this: http://nar.oxfordjournals.org/conten.../09/nar.gkt371

MIRA also caps the total number of reads, I believe.