You could also try using the Columbus module from Velvet.
L. Collado Torres, Ph.D. student in Biostatistics.
Comment
-
Originally posted by tonybolger:
We've noticed a tendency for CLC denovo to oversimplify complex repeat-filled areas, turning 'frayed ropes' into a single contig. These don't tend to be in coding regions, so you won't find them with genes or RNA.
We have been using CLCbio, though I worry that there is some mystery to its algorithm. CLC seems to return much better N50 and max contig lengths than SOAP; it is also faster and able to handle significantly more data.
We have an insect genome of ~200 Mb, for which we use one Illumina paired-end lane (~200 million reads at 100 bp) and one mate-pair lane (~50 million reads at 36 bp, with a library size of 3 kb). Approximately 75% of paired reads end up mapping. With a 200 bp minimum contig length we get an N50 of ~4 kb.
To compare with SOAP on our limited machine (44 GB RAM), we can only process ~50 million reads. That subset gives an N50 of 1 kb with CLC, while SOAP gives an N50 of 300 bp.
We then scaffold the CLC contigs using the mate-pair reads with SSPACE, which seems to do a very decent job (in the example above, post-SSPACE scaffolding gives N50 = 88 kb). Still, CLC users are rare (because it is proprietary), and the inability to control k-mer size makes me wary. So I'd appreciate any further light on the subject.
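For a rough sense of scale, the read counts quoted in this post can be turned into fold-coverage estimates. This is only back-of-the-envelope arithmetic: it assumes the genome is about 200 Mb and that "200 million reads" counts individual reads rather than pairs. The function name is made up for illustration.

```python
def fold_coverage(n_reads, read_len_bp, genome_size_bp):
    """Average sequencing depth: total sequenced bases / genome size."""
    return n_reads * read_len_bp / genome_size_bp


GENOME_BP = 200e6  # assumption: ~200 Mb insect genome

# Figures taken from the post; how reads are counted is an assumption.
paired_end = fold_coverage(200e6, 100, GENOME_BP)   # full paired-end lane
mate_pair = fold_coverage(50e6, 36, GENOME_BP)      # mate-pair lane
soap_subset = fold_coverage(50e6, 100, GENOME_BP)   # subset that fits in 44 GB

print(f"paired-end lane : {paired_end:.0f}x")
print(f"mate-pair lane  : {mate_pair:.0f}x")
print(f"SOAP subset     : {soap_subset:.0f}x")
```

Under those assumptions the 50-million-read subset is about a quarter of the full paired-end depth, which may be part of why both assemblers give shorter contigs on it.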
Comment
-
Originally posted by themwg:
Could you elaborate on what you mean by frayed ropes turning into single contigs?
Code:
A--->        ---->E
      C--->D
B--->        ---->F
where the 'correct' paths are A->C->D->E and B->C->D->F, with C->D being a repeat.
CLC appears to be overly aggressive for my taste: it collapses the A->C and B->C paths into a forced consensus, even in the presence of strong support for the different paths, and likewise D->E and D->F. Unfortunately, due to the lack of tuning options, this isn't easy to prevent. Check for Ns in the assembly; this might be an indicator.
Faced with this situation, other assemblers usually produce 5 contigs, whereas CLC will produce 1. This has already caused us to closely investigate differences in the number of family members among related genes (versus a related organism), which turned out to be merely 'merged' in the CLC assembly.
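To make the "5 contigs" outcome concrete, here is a minimal sketch of how a conservative assembler might cut a graph into maximal non-branching paths. It is a toy model, not the algorithm of CLC, SOAP, or any other real assembler: each node stands for a whole sequence segment (the two left arms, the shared repeat `R` for C->D, and the two right arms), and all names are illustrative.

```python
from collections import defaultdict


def unitigs(nodes, edges):
    """Return maximal non-branching paths of a directed graph.

    Two adjacent nodes u -> v are merged only when u has exactly one
    successor and v has exactly one predecessor. Isolated cycles are
    ignored in this sketch.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    def mergeable(u, v):
        return len(succ[u]) == 1 and len(pred[v]) == 1

    paths = []
    for n in nodes:
        # Skip nodes that will be absorbed into their predecessor's path.
        if len(pred[n]) == 1 and mergeable(pred[n][0], n):
            continue
        path = [n]
        # Extend while the next step is unambiguous in both directions.
        while len(succ[path[-1]]) == 1 and mergeable(path[-1], succ[path[-1]][0]):
            path.append(succ[path[-1]][0])
        paths.append(path)
    return paths


# Frayed rope: arms A and B enter the repeat R; arms E and F leave it.
frayed = unitigs(["A", "B", "R", "E", "F"],
                 [("A", "R"), ("B", "R"), ("R", "E"), ("R", "F")])
print(frayed)

# Without the second copy the same walk yields one contig.
print(unitigs(["A", "R", "E"], [("A", "R"), ("R", "E")]))
```

The repeat has two ways in and two ways out, so nothing can be merged across it without extra evidence such as read pairs; collapsing everything into one contig means choosing a consensus through both junctions.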
Originally posted by themwg:
We have been using CLCbio, though I worry that there is some mystery to its algorithm. CLC seems to return much better N50 and max contig lengths than SOAP; it is also faster and able to handle significantly more data.
Originally posted by themwg:
We have an insect genome of ~200 Mb, for which we use one Illumina paired-end lane (~200 million reads at 100 bp) and one mate-pair lane (~50 million reads at 36 bp, with a library size of 3 kb). Approximately 75% of paired reads end up mapping. With a 200 bp minimum contig length we get an N50 of ~4 kb.
To compare with SOAP on our limited machine (44 GB RAM), we can only process ~50 million reads. That subset gives an N50 of 1 kb with CLC, while SOAP gives an N50 of 300 bp.
For SOAP assemblies, I would strongly recommend pre-filtering the reads by quality; it considerably reduces the memory footprint. Both assemblers may well give better N50 with filtering. Still, I would expect CLC to beat SOAP by a factor of 5-10 in contig N50.
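As an illustration of what such pre-filtering could look like, here is a minimal sketch that keeps only reads whose mean base quality reaches a threshold. It assumes plain four-line FASTQ records and Phred+33 quality encoding (older Illumina pipelines used Phred+64, so adjust `offset` if needed); the threshold of 20 and the function names are arbitrary choices for the example, not a recommendation from this thread.

```python
import io


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (assumes Phred+offset ASCII)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_fastq(handle, min_mean_q=20.0, offset=33):
    """Yield (header, sequence, quality) for reads passing the mean-quality cut.

    Assumes strict 4-line FASTQ records (no wrapped sequence lines).
    """
    while True:
        header = handle.readline()
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if qual and mean_phred(qual, offset) >= min_mean_q:
            yield header.rstrip("\n"), seq, qual


# Tiny in-memory example: 'I' is Q40 and '#' is Q2 under Phred+33.
demo = io.StringIO("@good\nACGT\n+\nIIII\n@bad\nACGT\n+\n####\n")
kept = list(filter_fastq(demo))
print(kept)
```

For paired-end data, filtering each file independently would break the pairing, so both mates of a pair have to be kept or dropped together; dedicated trimming tools handle that for you.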
SOAP contig N50 is somewhat hampered by the fact that it doesn't use pairing information at all until the scaffolding stage. It is also broken in other interesting ways, but there doesn't seem to be a perfect beast for the job. You might also want to give the new CLC v4 beta a spin - it doesn't work on very big assemblies, but 200 million reads may be ok.
Originally posted by themwg:
We then scaffold the CLC contigs using the mate-pair reads with SSPACE, which seems to do a very decent job (in the example above, post-SSPACE scaffolding gives N50 = 88 kb).
Originally posted by themwg:
Still, CLC users are rare (because it is proprietary), and the inability to control k-mer size makes me wary. So I'd appreciate any further light on the subject.
Comment
-
Originally posted by seb567:
Yes, I think it is very clever to store genome variations as they are encountered.
Last edited by jiltysequence; 06-23-2011, 10:16 AM.
Comment
-
Running Abyss
Hi,
Several people who posted in this thread were able to run ABySS successfully. I am a novice and have some questions about running ABySS. Please answer:
Question 1:
I want to use ABySS for paired-read assembly, but my paired reads (forward and reverse) are in a single file. This is the file generated after quality trimming.
The file structure is
>001_forward
ATGC.......
>001_reverse
ATGC....
>002_forward
ATGC....
>002_reverse
ATGC....
How do I run ABySS on such a file? I need the command for this. Any suggestions?
Question 2:
I have several paired-end files for a single genome, e.g. the reads for Genome X are
001_R1.fastq 001_R2.fastq
002_R1.fastq 002_R2.fastq
003_R1.fastq 003_R2.fastq
Do I need to treat each pair as a separate library, or should it work fine if I specify
abyss-pe name=ecoli k=64 in='001_R1.fastq 001_R2.fastq 002_R1.fastq 002_R2.fastq'
instead?
Question 3:
Does ABySS have automated quality trimming built in, or is it necessary to use quality-trimmed reads? I read somewhere that it has a -q flag.
Thanks
Comment
-
#1) Run ABySS as SE (single end) or split your file into two parts. perl or awk would be my tools of choice for this.
#2) Set them up as individual libraries, e.g., lib="libA libB" libA="001_R1.fastq 001_R2.fastq" libB="002_R1.fastq 002_R2.fastq"
#3) I always do trimming pre-ABySS. What did you read about the '-q' flag? Did it mention trimming?
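For #1, the split can also be done in a few lines of Python instead of perl or awk. The sketch below assumes the file is strictly interleaved and uses the `_forward` / `_reverse` name endings shown in the question. It renames the mates to end in `/1` and `/2`, which, as far as I recall from the ABySS README, is how paired reads are expected to be named; please check the documentation for your version. Function and file names are made up for the example.

```python
def parse_fasta(lines):
    """Yield (name, sequence) tuples from FASTA-formatted lines."""
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_pairs(lines, fwd_suffix="_forward", rev_suffix="_reverse"):
    """Split strictly interleaved forward/reverse records into two lists."""
    records = list(parse_fasta(lines))
    if len(records) % 2:
        raise ValueError("odd number of records; file is not strictly interleaved")
    first, second = [], []
    for (fname, fseq), (rname, rseq) in zip(records[0::2], records[1::2]):
        if not (fname.endswith(fwd_suffix) and rname.endswith(rev_suffix)):
            raise ValueError(f"unexpected read names: {fname!r}, {rname!r}")
        fid, rid = fname[: -len(fwd_suffix)], rname[: -len(rev_suffix)]
        if fid != rid:
            raise ValueError(f"mates do not match: {fname!r}, {rname!r}")
        first.append((fid + "/1", fseq))
        second.append((rid + "/2", rseq))
    return first, second


def write_fasta(records, path):
    """Write (name, sequence) tuples as unwrapped FASTA."""
    with open(path, "w") as out:
        for name, seq in records:
            out.write(f">{name}\n{seq}\n")


example = [">001_forward", "ATGC", ">001_reverse", "GCAT",
           ">002_forward", "AAAA", ">002_reverse", "TTTT"]
r1, r2 = split_pairs(example)
print(r1)
print(r2)
```

On a real file you would pass an open file handle to `split_pairs` and then call `write_fasta(r1, "reads_1.fa")` and `write_fasta(r2, "reads_2.fa")`; the sketch holds all reads in memory, so a streaming version would be preferable for a full lane.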
Comment
-
Originally posted by eslondon:
I have been playing around with ABYSS, SOAPdenovo and CLC Bio for a genome project. To cut a very long story short, these are our experiences.
We started from a set of standard 200bp PE reads and a set of 5kb mate pair reads.
-ABYSS: with our limited 5kb reads, we never managed to get ABYSS to use them properly for scaffolding. The contig N50 was a bit poor, whatever we tried. It also took a fair while, and we never got it to parallelize.
-SOAPdenovo: very fast because using multiple threads is as simple as saying -p number of processors, and VERY good at scaffolding. The Contig N50 was not great, but better than ABYSS (around 600bp)
-CLC Bio: although it does not support scaffolding, it gave us by far the best N50 in terms of contigs (an N50 of 2.2Kb)
In the end we used CLCBio contigs with SOAPdenovo for scaffolding, which got us a nice N50 of 8kb.
Finally, we used the SOAPdenovo GapCloser to close gaps in the scaffolds produced, which removed about 25% of the Ns we had in the assembly!
All the QC on these assemblies (mapping known genes, mapping RNA-Seq reads, etc) pointed to the CLCBio + SOAPdenovo as being the best we had.
Now we are going to throw more data at it, hoping for a much better assembly
best regards
Elia
Comment
-
<-- Information for assembly Scaffold 'output.scafSeq'.(cut_off_length < 100bp) -->
Size_includeN 14238304
Size_withoutN 14238304
Scaffold_Num 69976
Mean_Size 203
Median_Size 154
Longest_Seq 5423
Shortest_Seq 100
Singleton_Num 69976
Average_length_of_break(N)_in_scaffold 0
Known_genome_size NaN
Total_scaffold_length_as_percentage_of_known_genome_size NaN
scaffolds>100 69864 99.84%
scaffolds>500 2964 4.24%
scaffolds>1K 324 0.46%
scaffolds>10K 0 0.00%
scaffolds>100K 0 0.00%
scaffolds>1M 0 0.00%
Nucleotide_A 3733290 26.22%
Nucleotide_C 3403704 23.91%
Nucleotide_G 3387000 23.79%
Nucleotide_T 3714310 26.09%
GapContent_N 0 0.00%
Non_ACGTN 0 0.00%
GC_Content 47.69% (G+C)/(A+C+G+T)
N10 611 1677
N20 420 4532
N30 315 8483
N40 250 13577
N50 206 19868
N60 174 27405
N70 151 36212
N80 134 46255
N90 120 57488
Can anyone explain what Size_includeN means, and why the Size_withoutN number is the same?
And what is the N50 value of this result?
Comment
-
Hi,
first of all: Which program gave you this output?
Originally posted by Aman Mahajan:
Size_includeN 14238304
Size_withoutN 14238304
Nucleotide_A 3733290 26.22%
Nucleotide_C 3403704 23.91%
Nucleotide_G 3387000 23.79%
Nucleotide_T 3714310 26.09%
GapContent_N 0 0.00%
Originally posted by Aman Mahajan:
N10 611 1677
N20 420 4532
N30 315 8483
N40 250 13577
N50 206 19868
N60 174 27405
N70 151 36212
N80 134 46255
N90 120 57488
Shouldn't this be in the documentation of the program you are using to generate this output?
Hope this helps.
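In case it helps with reading the quoted numbers, here is a small sketch of how such statistics are commonly computed. It is based on the usual definitions, not on the source code of whichever program produced the output above: the size "including N" is taken as the total length of all sequences, the size "without N" as the same total minus gap characters, and N50 as the length of the sequence at which the running total (longest first) reaches half of the total size. On that reading the two sizes would be identical whenever the N content is 0, as in the quoted `GapContent_N 0` line, and the `N50 206 19868` row would presumably mean an N50 of 206 bp reached after 19868 sequences.

```python
def assembly_stats(seqs):
    """Compute total size with/without N and N50 for a list of sequences."""
    if not seqs:
        raise ValueError("no sequences given")
    lengths = sorted((len(s) for s in seqs), reverse=True)
    size_with_n = sum(lengths)
    size_without_n = size_with_n - sum(s.upper().count("N") for s in seqs)

    # Walk from the longest sequence down until half the total is covered.
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= size_with_n / 2:
            return {
                "size_with_n": size_with_n,
                "size_without_n": size_without_n,
                "n50": length,
                "n50_count": count,
            }


# Toy assembly: lengths 10, 8, 5, 3, 2; one sequence contains two Ns.
toy = ["A" * 10, "ACGT" * 2, "ACNNT", "ACG", "AC"]
stats = assembly_stats(toy)
print(stats)
```

In the toy example the total is 28, so the running sum 10 + 8 = 18 passes the halfway mark of 14 at the second sequence, giving an N50 of 8.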
Comment