Hi there,
this is my first post on SEQanswers, although I have been following the community for a long time.
Over the last 6 months I have started working with metagenomics, using BGI's sequencing and analysis service for some samples (I am a biochemist by training, with good computer skills).
To learn about assembly strategies, I tried to reproduce BGI's output for my first sample, running SOAPdenovo with the k-mer values given in the final report to check which gives the best N50 and N90, the best lengths, etc.
So I prepared my SOAP_config_file with the illumina_clean_reads:
---------
#maximal read length
max_rd_len=90
[LIB]
#average insert size
avg_ins=170
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#use only first 90 bps of each read
rd_len_cutoff=90
#in which order the reads are used while scaffolding
rank=1
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
#a pair of fastq file, read 1 file should always be followed by read 2 file
q1=clean_read_1.fq
q2=clean_read_2.fq
---------
BGI (testing 21 < k-mer < 63) reported that the best k-mer for this sample was 33, with:
sequence n°: 1543
total length: 6169928
max length: 126110
min length: 500
N50: 15863
N90: 1181
Then I ran SOAPdenovo-63mer with 21 < k-mer < 63.
My statistics for the same sample with k-mer = 33, however, showed:
sequence n°: 188926
total length: 16591907
max length: 15246
min length: 34
N50: 74
N90: 42
Then I applied a filter to remove sequences shorter than 500 bp, and the result was:
sequence n°: 2529
total length: 3094882
max length: 15246
min length: 500
N50: 1384
N90: 584
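For anyone who wants to check numbers like these independently, here is a minimal sketch in Python of how such statistics can be computed from a FASTA file. It is not the script BGI or I used. The function names (`read_fasta_lengths`, `nx`, `summarize`) are made up for illustration, and the N50/N90 definition used (the length at which the sequences sorted from longest to shortest reach 50% or 90% of the total length) is an assumption, since the BGI report does not say how its values are computed.

```python
from typing import Dict, Iterable, List


def read_fasta_lengths(path: str) -> List[int]:
    """Return the length of every sequence in a FASTA file."""
    lengths: List[int] = []
    current = 0
    seen_header = False
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if seen_header:
                    lengths.append(current)
                seen_header = True
                current = 0
            elif seen_header:
                current += len(line)
    if seen_header:
        lengths.append(current)
    return lengths


def nx(lengths: Iterable[int], fraction: float) -> int:
    """Length at which the longest-first cumulative sum reaches
    `fraction` of the total length (0.5 for N50, 0.9 for N90)."""
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return 0  # empty input


def summarize(lengths: Iterable[int], min_len: int = 0) -> Dict[str, int]:
    """Summary statistics for sequences of at least `min_len` bases."""
    kept = [n for n in lengths if n >= min_len]
    return {
        "count": len(kept),
        "total": sum(kept),
        "max": max(kept, default=0),
        "min": min(kept, default=0),
        "N50": nx(kept, 0.5),
        "N90": nx(kept, 0.9),
    }
```

Calling `summarize(read_fasta_lengths(path), min_len=500)` on an assembly output file gives values comparable to the filtered table above, while `min_len=0` corresponds to the unfiltered one. Because N50 and N90 depend on the total length of the sequences that are kept, the length filter changes them noticeably.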
Does anybody know why these results are so different?
I tried running SOAPdenovo both step by step and with the single command, but the result does not change, and the same differences appear for the other k-mer values when compared with BGI's comparison of assembly results across k-mers.