SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Is a server with 512 GB RAM enough for denovo genome assembly? ngs_per Bioinformatics 15 03-26-2014 11:04 PM
cuffmerge assembly vs denovo assembly of RNAseq data skm Bioinformatics 0 10-16-2013 10:16 PM
Denovo assembly Thenna Bioinformatics 2 05-06-2013 07:09 AM
Denovo assembly problem huma Asif Illumina/Solexa 1 03-27-2013 10:20 PM
denovo assembly nagaraj Bioinformatics 5 07-11-2012 07:13 AM

Reply
 
Thread Tools
Old 05-05-2014, 02:51 AM   #1
bioman1
Member
 
Location: US

Join Date: May 2012
Posts: 80
Thumbs up Mitochondria genome-denovo assembly

Hi all,

I am trying to assemble plant mitochondria genome. The method I follow is to extract mitochondria reads from genomic reads (sequenced WGS approach using hiseq 2000, illumina paired-end reads)

1. I have downloaded all mitochondrial genomes of plants and indexed as reference genome using BWA
2. The raw paried-end reads were filtered (adapter & low quality reads filtered) which passed fastqc tool test. The fastqc passed filtered reads were interleaved using using perl script and used as single-end sequence. These single-end sequence were mapped to mitochondiral reference genome using BWA
3. Then mapped reads are extracted using samtools -F 4 option and got output in bam format
4. Using picard, bam format converted to fastq format
5.Before doing denovo assembly, I checked with fastqc, it failed in following
(i)FAIL-Per sequence GC content
(ii)FAIL-Sequence Duplication Levels
(iii)FAIL-Overrepresented sequences
(iv)FAIL-Kmer Content

My questions
(i) what I can I improve the reads before denovo assembly of mitochondrial reads?
(ii) Which better tool to assembly mitochondrial genome velvet or soapdenovo?. How much kmer size can be used?
bioman1 is offline   Reply With Quote
Old 05-05-2014, 07:51 AM   #2
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

I would try MITObim, which uses the MIRA assembler iteratively:
https://github.com/chrishah/MITObim
http://dx.doi.org/10.1093/nar/gkt371
maubp is offline   Reply With Quote
Old 05-05-2014, 10:46 AM   #3
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,172
Default

Quote:
Originally Posted by bioman1 View Post
5.Before doing denovo assembly, I checked with fastqc, it failed in following
(i)FAIL-Per sequence GC content
(ii)FAIL-Sequence Duplication Levels
(iii)FAIL-Overrepresented sequences
(iv)FAIL-Kmer Content

My questions
(i) what I can I improve the reads before denovo assembly of mitochondrial reads?
Don't get hung up on the FastQC report. You have created a very small subset of the total input reads, and a subset which likely represents an extreme fold coverage of the Mt genome. The FastQC results you are seeing are probably to be expected given the input.
kmcarr is offline   Reply With Quote
Old 05-06-2014, 02:55 PM   #4
Matt Kearse
Member
 
Location: New Zealand

Join Date: Mar 2014
Posts: 20
Default

It's always good to have more than one way to do something to add confidence that your results are correct.

So for a completely different approach to MITObim, why not just de novo assemble your reads? Since mitochondrial coverage levels are typically much higher than other genomic data, you only need to use a small fraction of a full data set to achieve sufficient mitochondrial coverage. This has worked well for me when I tried it.

For example, with no trimming or other quality control, I took an entire Illumina Genome Analyzer II chimpanzee shotgun sequencing dataset and de novo assembled 5% of the data (which works out at around 1.4 million pairs) with default settings in Geneious apart from turning the allow circular contigs setting on. It produced around 260,000 contigs, the largest of which was a 16,500 bp circular contig which agrees with the known chimpanzee mitochondrial genome apart from SNPs. The whole process only took a few minutes of my time and just over an hour of CPU time.

It may or may not work as well for plants and/or other assemblers but it's worth a go since it's such a simple solution.

Note - for anyone not familiar with Geneious, it is commercial software with a 2 week trial, so there's plenty of time to freely use it on your own data sets. Here are the exact steps I did with http://www.ncbi.nlm.nih.gov/sra/ERX012402

1) Drag and drop the 2 fastq files into Geneious
2) Select them both and go to 'Sequence -> Set Paired Reads' and set the insert size to 250
3) Select 'Align/Assemble -> De Novo Assemble'
4) Turned on "Use x% of the data" checkbox, setting x to 5.
5) In the advanced options turned on the circular contigs option
6) Clicked OK and a little over an hour later I had my circular mitochondrial sequence.
Matt Kearse is offline   Reply With Quote
Old 05-08-2014, 12:16 AM   #5
bioman1
Member
 
Location: US

Join Date: May 2012
Posts: 80
Default

@Matt kearse: Did you use 2 fastqfiles of only mitchondrial reads?
---------------------------------------------


@maubp
I find difficult in installing MITObim, can you please help me?

I am trying to use MITObim for mitchondrial genome assembly project. I have tried tutorial provided in MITObim:https://github.com/chrishah/MITObim

I have installed Mira 3.4.1.1 and MITObim 1.6. Mapping assembly with MIRA 3.4.1.1 was sucessful, however wrapper script MITObim 1.6 stops at iteration 1, it should iterate upto 8 according the the tutorial, it says error "mirabait seems to have failed". What will be the reason

mapping assembly with MIRA 3.4.1.1:
mira --project=initial-mapping-
testpool-to-Salpinus-mt --job=mapping,genome,accurate,solexa -MI:somrnl=0 -AS:nop=1 -SB:bft=fasta:bbq=30:bsn=Salpinus-mt-genome SOLEXA_SETTINGS -CO:msr=no -SB:dsn=testpool

for MITObim 1.6:
MITObim.pl -start 1 -end 10 -strain testpool -ref Salpinus_mt_genome -readpool initial-mapping-testpool-to-Salpinus-mt_in.solexa.fastq -maf initial-mapping-testpool-to-Salpinus-mt_assembly/initial-mapping-testpool-to-Salpinus-mt_d_results/initial-mapping-testpool-to-Salpinus-mt_out.maf &> log

below is the log file:

MITObim - mitochondrial baiting and iterative mapping
version 1.6
author: Christoph Hahn, (c) 2012-2013


Full command run: ../MITObim_1.6.pl -start 1 -end 10 -strain testpool -ref Salpinus_mt_genome -readpool initial-mapping-testpool-to-Salpinus-mt_in.solexa.fastq -maf initial-mapping-testpool-to-Salpinus-mt_assembly/initial-mapping-testpool-to-Salpinus-mt_d_results/initial-mapping-testpool-to-Salpinus-mt_out.maf

All paramters seem to make sense:
startiteration: 1
enditeration: 10
strainname: testpool
refname: Salpinus_mt_genome
readpool: /PATH/MITObim-master/testdata1/initial-mapping-testpool-to-Salpinus-mt_in.solexa.fastq
maf: /PATH/Downloads/MITObim-master/testdata1/initial-mapping-testpool-to-Salpinus-mt_assembly/initial-mapping-testpool-to-Salpinus-mt_d_results/initial-mapping-testpool-to-Salpinus-mt_out.maf
quick: 0
paired: 0
denovo: 0 (mapping=0, denovo=1)
noshow: 0
read trimming: 0 (off=0, on=1)
kmer baiting: 31
platform: SOLEXA
clean: 0 (off=0, on=1)
proofread: 0

Starting MITObim

==============
ITERATION 1
==============
May 8 10:20:53


recover backbone by running convert_project on maf file


Parsing special MIRA parameters: SOLEXA_SETTINGS -CO:fnicpst=yes

Ok.
Loading from maf, saving to: fasta fastaqual
First counting reads:
[0%] ....|.... [10%] ....|.... [20%] ....|.... [30%] ....|.... [40%] ....|.... [50%] ....|.... [60%] ....|.... [70%] ....|.... [80%] ....|.... [90%] ....|.... [100%]
Now loading and processing data:
[0%] ....|.... [10%] ....|.... [20%] ....|.... [30%] ....|.... [40%] ....|.... [50%] ....|.... [60%] ....|.... [70%] ....|.... [80%] ....|.... [90%] ....|.Localtime: Thu May 8 10:20:53 2014

Generated 1692 unique template ids for 2548 valid reads.
Localtime: Thu May 8 10:20:53 2014

Seeing strain 1: "testpool"
Seeing strain 2: "Salpinus-mt-genome"
Generated 2 unique strain ids for 2548 reads.
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
... [100%]

Data conversion process finished, no obvious errors encountered.
SC Read:: read name issorted (1) capacity 4294967295(4) size 1
SC Read:: scf path name issorted (1) capacity 255(1) size 1
SC Read:: exp path name issorted (1) capacity 255(1) size 1
SC Read:: machine type issorted (1) capacity 255(1) size 1
SC Read:: primer issorted (1) capacity 255(1) size 1
SC Read:: strain issorted (1) capacity 255(1) size 3
SC Read:: base caller issorted (1) capacity 255(1) size 1
SC Read:: dye issorted (1) capacity 255(1) size 1
SC Read:: process status issorted (1) capacity 255(1) size 1
SC Read:: clone vector name issorted (1) capacity 65535(2) size 1
SC Read:: sequencing vector name issorted (1) capacity 65535(2) size 1
SC asped issorted (1) capacity 4294967295(4) size 1


fishing readpool using mirabait (k = 31)



mirabait seems to have failed - see detailed output above
bioman1 is offline   Reply With Quote
Old 05-08-2014, 12:20 AM   #6
bioman1
Member
 
Location: US

Join Date: May 2012
Posts: 80
Default re

@Matt kearse: Did you use 2 fastqfiles of only mitchondrial reads?
---------------------------------------------


@maubp
I find difficult in installing MITObim, can you please help me?

I am trying to use MITObim for mitchondrial genome assembly project. I have tried tutorial provided in MITObim:https://github.com/chrishah/MITObim

I have installed Mira 3.4.1.1 and MITObim 1.6. Mapping assembly with MIRA 3.4.1.1 was sucessful, however wrapper script MITObim 1.6 stops at iteration 1, it should iterate upto 8 according the the tutorial, it says error "mirabait seems to have failed". What will be the reason

mapping assembly with MIRA 3.4.1.1:
mira --project=initial-mapping-
testpool-to-Salpinus-mt --job=mapping,genome,accurate,solexa -MI:somrnl=0 -AS:nop=1 -SB:bft=fasta:bbq=30:bsn=Salpinus-mt-genome SOLEXA_SETTINGS -CO:msr=no -SB:dsn=testpool

for MITObim 1.6:
MITObim.pl -start 1 -end 10 -strain testpool -ref Salpinus_mt_genome -readpool initial-mapping-testpool-to-Salpinus-mt_in.solexa.fastq -maf initial-mapping-testpool-to-Salpinus-mt_assembly/initial-mapping-testpool-to-Salpinus-mt_d_results/initial-mapping-testpool-to-Salpinus-mt_out.maf &> log

below is the log file:

MITObim - mitochondrial baiting and iterative mapping
version 1.6
author: Christoph Hahn, (c) 2012-2013


Full command run: ../MITObim_1.6.pl -start 1 -end 10 -strain testpool -ref Salpinus_mt_genome -readpool initial-mapping-testpool-to-Salpinus-mt_in.solexa.fastq -maf initial-mapping-testpool-to-Salpinus-mt_assembly/initial-mapping-testpool-to-Salpinus-mt_d_results/initial-mapping-testpool-to-Salpinus-mt_out.maf

All paramters seem to make sense:
startiteration: 1
enditeration: 10
strainname: testpool
refname: Salpinus_mt_genome
readpool: /PATH/MITObim-master/testdata1/initial-mapping-testpool-to-Salpinus-mt_in.solexa.fastq
maf: /PATH/Downloads/MITObim-master/testdata1/initial-mapping-testpool-to-Salpinus-mt_assembly/initial-mapping-testpool-to-Salpinus-mt_d_results/initial-mapping-testpool-to-Salpinus-mt_out.maf
quick: 0
paired: 0
denovo: 0 (mapping=0, denovo=1)
noshow: 0
read trimming: 0 (off=0, on=1)
kmer baiting: 31
platform: SOLEXA
clean: 0 (off=0, on=1)
proofread: 0

Starting MITObim

==============
ITERATION 1
==============
May 8 10:20:53


recover backbone by running convert_project on maf file


Parsing special MIRA parameters: SOLEXA_SETTINGS -CO:fnicpst=yes

Ok.
Loading from maf, saving to: fasta fastaqual
First counting reads:
[0%] ....|.... [10%] ....|.... [20%] ....|.... [30%] ....|.... [40%] ....|.... [50%] ....|.... [60%] ....|.... [70%] ....|.... [80%] ....|.... [90%] ....|.... [100%]
Now loading and processing data:
[0%] ....|.... [10%] ....|.... [20%] ....|.... [30%] ....|.... [40%] ....|.... [50%] ....|.... [60%] ....|.... [70%] ....|.... [80%] ....|.... [90%] ....|.Localtime: Thu May 8 10:20:53 2014

Generated 1692 unique template ids for 2548 valid reads.
Localtime: Thu May 8 10:20:53 2014

Seeing strain 1: "testpool"
Seeing strain 2: "Salpinus-mt-genome"
Generated 2 unique strain ids for 2548 reads.
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
makeIntelligentConsensus() complete calc
... [100%]

Data conversion process finished, no obvious errors encountered.
SC Read:: read name issorted (1) capacity 4294967295(4) size 1
SC Read:: scf path name issorted (1) capacity 255(1) size 1
SC Read:: exp path name issorted (1) capacity 255(1) size 1
SC Read:: machine type issorted (1) capacity 255(1) size 1
SC Read:: primer issorted (1) capacity 255(1) size 1
SC Read:: strain issorted (1) capacity 255(1) size 3
SC Read:: base caller issorted (1) capacity 255(1) size 1
SC Read:: dye issorted (1) capacity 255(1) size 1
SC Read:: process status issorted (1) capacity 255(1) size 1
SC Read:: clone vector name issorted (1) capacity 65535(2) size 1
SC Read:: sequencing vector name issorted (1) capacity 65535(2) size 1
SC asped issorted (1) capacity 4294967295(4) size 1


fishing readpool using mirabait (k = 31)



mirabait seems to have failed - see detailed output above
bioman1 is offline   Reply With Quote
Old 05-08-2014, 12:25 AM   #7
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by bioman1 View Post
@maubp
I find difficult in installing MITObim, can you please help me?
Double check the version of MIRA matches what the version of MITObim recommends. In particular MIRA 3 and 4 are very different, but the exact version may matter too. Also you could try the MIRA mailing list (assuming MITObim doesn't have its own forum/list).

Last edited by maubp; 10-03-2014 at 06:20 AM. Reason: typo
maubp is offline   Reply With Quote
Old 05-08-2014, 01:01 AM   #8
Matt Kearse
Member
 
Location: New Zealand

Join Date: Mar 2014
Posts: 20
Default

Quote:
@Matt kearse: Did you use 2 fastqfiles of only mitchondrial reads?
No, I just used a whole genome data set without any sort of filtering for mitochondria.
Matt Kearse is offline   Reply With Quote
Old 05-08-2014, 01:37 AM   #9
bioman1
Member
 
Location: US

Join Date: May 2012
Posts: 80
Default

@maubp
I have used stable version of MIRA 3.4.1.1 and MITObim 1.6 (double checked I am using correct version). The problem it can find mirabait executables while running MITObim 1.6. I tried export MIRA path in .bashrc and also tried with --MIRAPATH option.
---------------------
@Matt Kears: Well it seems easy method in getting mitochondrial as circular contigs. I don't understand how Geneious gets mitchondrial reads without mapping to any reference mitochondria reads or align the assembled contigs to reference mitochondria genome?. I think from your steps (i) Geneious first do denovo assemble the whole genomic reads (ii) then you use only 5% of them to form circular contigs, but how sure if it is from mitochondria? may be it from chloroplast reads (if plants). I am sorry for misinterpretation. Could you please explain a bit?.
bioman1 is offline   Reply With Quote
Old 05-08-2014, 01:55 AM   #10
Jeremy
Senior Member
 
Location: Pathum Thani, Thailand

Join Date: Nov 2009
Posts: 190
Default

With plant mitochondria you aren't necessarily expecting a single cyclic strand of DNA (for example: http://www.ncbi.nlm.nih.gov/pubmed/?...Mutation+Rates).
Jeremy is offline   Reply With Quote
Old 05-08-2014, 07:56 AM   #11
lkral
Member
 
Location: Carrollton, GA

Join Date: May 2011
Posts: 27
Default

I was able to use Geneious to assemble mitochondrial DNA form Illumina reads of fish DNA but in a different way than that suggested by Matt Kearse. I started with a small number of mitochondrial genes from a related species and then assembled to each of the genes with the Map to Reference function with an "Iterate up to X times" option. This iteratively extends the assembly of the contig from the initial gene outward. Eventually, the resulting contigs will overlap giving rise to the entire mitochondria DNA sequence. This may be similar to what MITObim does? It may be necessary to only use a fraction of the reads if the mtDNA sequences are highly enriched.
lkral is offline   Reply With Quote
Old 05-08-2014, 02:20 PM   #12
Matt Kearse
Member
 
Location: New Zealand

Join Date: Mar 2014
Posts: 20
Default

Quote:
Well it seems easy method in getting mitochondrial as circular contigs. I don't understand how Geneious gets mitchondrial reads without mapping to any reference mitochondria reads or align the assembled contigs to reference mitochondria genome?. I think from your steps (i) Geneious first do denovo assemble the whole genomic reads (ii) then you use only 5% of them to form circular contigs, but how sure if it is from mitochondria? may be it from chloroplast reads (if plants). I am sorry for misinterpretation. Could you please explain a bit?.
It varies greatly between different cells, but I think on average there are around 200 mitochondria per cell so you usually get a much higher level of mitochondrial coverage in a WGS data set compared to nuclear DNA. If you de novo assemble at an appropriate level of data, then there will be insufficient coverage to generate contigs of significant length from nuclear DNA so that will mostly end up as short contigs. Mitochondrial DNA (and for example chloroplasts in plants) will have higher levels of coverage and so are able to form larger or complete contigs.

Yes you're right - you may get chloroplast contigs too if you sequenced a region of the plant that has high chloroplast levels. In this case you'll need to do some analysis of the contigs to determine which are chloroplast or mitochondria.
Matt Kearse is offline   Reply With Quote
Old 07-17-2014, 02:24 PM   #13
francicco
Member
 
Location: Innsbruck

Join Date: Jul 2010
Posts: 28
Default

Dears, I found this thread interesting and I'm trying to use this procedure.
I have 140M RNAseq reads in paired-ends (140M+140M) too much. MIRA give me clear warning, too much coverage. So I'm splitting reads in blocks in order to get a better coverage.

Everything works, but as expected each read block gives a differente result and, because I'm using RNAseq data I have mainly sequences from coding genes. That means I have uncomplete contigs.

What I'd like to do, to use as much information as possible, is to assemble all these condings (one for each read block) and obtain a sort of consensus.

Do you have any advise how to do it? I was trying to use CAGOB (wgs-assembler), but since contigs have "x" (gaps) apparently it does not work!

Thank you so much for your help
Francesco
francicco is offline   Reply With Quote
Old 07-17-2014, 07:56 PM   #14
Jeremy
Senior Member
 
Location: Pathum Thani, Thailand

Join Date: Nov 2009
Posts: 190
Default

You need to use a transcriptome assembler (SOAPdenovo-Trans, Trinity, Trans-abyss etc.), of course a genome assembler will fail with RNA-seq data, almost all of the assumptions regarding the data are violated.
Jeremy is offline   Reply With Quote
Old 07-17-2014, 11:49 PM   #15
francicco
Member
 
Location: Innsbruck

Join Date: Jul 2010
Posts: 28
Default

Quote:
Originally Posted by Jeremy View Post
... of course a genome assembler will fail with RNA-seq data, almost all of the assumptions regarding the data are violated.
@Jeremy. I think that the protocol works pretty well, and the way MIRA and MITObim work is not contrary to situation in which you do not have a complete coverage on the molecule. But maybe @maubp can solve the issue, seems to be familiar with the method.

I'm expecting to have a fragmented reconstruction of the genome, with no coverage on polyadenylated transcripts or shorter than my insert and that is what I have. I just would like take advantage of my high amount of reads and determine the most accurate reconstruction using replicas.

F
francicco is offline   Reply With Quote
Old 07-18-2014, 12:38 AM   #16
Jeremy
Senior Member
 
Location: Pathum Thani, Thailand

Join Date: Nov 2009
Posts: 190
Default

How can you reconstruct the genome using transcriptome data?
Jeremy is offline   Reply With Quote
Old 07-18-2014, 12:57 AM   #17
francicco
Member
 
Location: Innsbruck

Join Date: Jul 2010
Posts: 28
Default

Quote:
Originally Posted by Jeremy View Post
How can you reconstruct the genome using transcriptome data?
Well, definitely not. But mtDNA is quite smaller than the nuclear genome; it's a single molecule while nuDNA has chromosomes; mtDNA genes do not splice or have different isoforms. I'd say is a totally different approach.

Still I do not see contraindications, what do you think is the misleading effect?

F
francicco is offline   Reply With Quote
Old 07-18-2014, 01:23 AM   #18
francicco
Member
 
Location: Innsbruck

Join Date: Jul 2010
Posts: 28
Default

Quote:
Originally Posted by francicco View Post
...to assemble all these condings (one for each read block) and obtain a sort of consensus.

Do you have any advise how to do it? I was trying to use CAGOB (wgs-assembler), but since contigs have "x" (gaps) apparently it does not work!
Anyway, any help about this issue?

F
francicco is offline   Reply With Quote
Old 07-18-2014, 01:28 AM   #19
Jeremy
Senior Member
 
Location: Pathum Thani, Thailand

Join Date: Nov 2009
Posts: 190
Default

If you are talking about a mammalian mitochondrial genome maybe. Plant mitochondrial genomes are very very different.
Jeremy is offline   Reply With Quote
Old 07-18-2014, 01:43 AM   #20
francicco
Member
 
Location: Innsbruck

Join Date: Jul 2010
Posts: 28
Default

Quote:
Originally Posted by Jeremy View Post
If you are talking about a mammalian mitochondrial genome maybe. Plant mitochondrial genomes are very very different.
I'm trying to do it on a Drosophila mtDNA.
francicco is offline   Reply With Quote
Reply

Tags
bioinformatics, genome, mitochondria, plastid

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:33 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO