Hi everybody!
I'm new to NGS data processing and to this great community.
I have two sets of single-end 454 reads from a Roche GS Junior. The source organism is a yeast with an expected genome size of 12 Mb. I need to assemble these reads and obtain a draft of the genome. The problem is that I am getting too many contigs.
I extracted the reads from the SFF files with the sff_extract script and got 211981 reads of 40-630 bp with an average length of ~500 bp. For quality control I trimmed the read ends using a quality threshold of 35 and discarded all reads shorter than 100 bp. That left 180585 reads of 100-469 bp (the sequence length distribution is pretty irregular).
Then I tried a de novo assembly with MIRA. As a first attempt, I ran:
mira --job=denovo,genome,accurate,454 --project=yeast 454_SETTINGS -LR:ft=fastq -LR:fqqo=33
and I got:
Avg. total coverage: 35.45
Large contigs:
============
Number of contigs: 17
Total consensus: 58823
Largest contig: 9561
N50 contig size: 6322
N90 contig size: 1575
N95 contig size: 1281
All contigs:
============
Number of contigs: 10941
Total consensus: 6832983
Largest contig: 9561
N50 contig size: 724
N90 contig size: 357
N95 contig size: 301
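For anyone comparing these numbers, this is how I understand the N50/N90 values: the contig length L such that contigs of length L or longer contain at least that fraction of the total consensus. Below is a minimal sketch under that assumption (I have not checked that MIRA computes it exactly this way). The function name and the example lengths are made up for illustration and are not taken from my assembly.

```python
def nx(contig_lengths, fraction=0.5):
    """Return the Nx contig size, e.g. fraction=0.5 gives N50, 0.9 gives N90."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until the requested share is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= total * fraction:
            return length
    return 0  # empty input


# Hypothetical contig lengths, only to show the call.
example = [10, 8, 6, 4, 2]
print("N50:", nx(example, 0.5))
print("N90:", nx(example, 0.9))
```

With many short contigs in the list, the value drops quickly, which matches the gap between the "large contigs" and "all contigs" figures.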
Since there were too many contigs, I changed several MIRA parameters to try to get better results. The general pattern seems to be: if I sacrifice coverage, I get more "total consensus", but still a lot of contigs.
Here is one of the better assemblies I got:
mira --job=denovo,genome,accurate,454 --project=yeast 454_SETTINGS -LR:ft=fastq -SKr=60 -AL:mrs=60 -AS:mrpc=10
Avg. total coverage: 18.45
Large contigs:
=============
Number of contigs: 655
Total consensus: 665610
Largest contig: 8794
N50 contig size: 1043
N90 contig size: 572
N95 contig size: 539
All contigs:
============
Number of contigs: 5699
Total consensus: 4835995
Largest contig: 8794
N50 contig size: 900
N90 contig size: 532
N95 contig size: 468
I still think there are too many contigs. I thought these two sets of reads (supposedly two sequencing runs) would be enough for a good draft, but now I doubt it. I hope you can help me.
Greetings and thanks in advance.