**usad** · 07-05-2010, 03:47 AM

Hi didymos,

I reckon you want to do de novo assembly? I don't have figures for a full human genome, but 500 Mbases to 1 Gbase should run with <100 GB of RAM with CLC. The recommended RAM size is currently somewhere around 70 GB, and we never went over 100 GB of RAM usage when assembling up to 1 Gbase (non-human genomes, though).

Best Wishes

**syambmed** · 03-24-2011, 11:54 PM

Hi Roald,

I am new to bioinformatics and have little bioinfo knowledge.

I have treated and control transcriptome data from a cat cell line, and I want to find differentially expressed genes between the two. I am using CLC Genomics Workbench 4.5.1 for the analysis. I have several questions that I hope you can help with. To find differential gene expression, I did a de novo assembly of the control reads and then mapped my treated reads back to the control assembly with RNA-seq. This way the de novo control assembly acts as the reference.

Is this the right way to do it? Or should I also de novo assemble the treated reads before mapping back to the de novo control assembly?

After that I created a box plot for quality control. The mean lines are at the same level.

So, should I normalize? If yes, the software provides three ways to normalize the data: scaling, quantile, and reads per million.

Which one should I choose? I read that reads per million is the suitable one for high-throughput RNA sequencing data.

Or should I use the expression value of a reference gene such as GAPDH or beta-actin for normalization? If yes, how do I do that in this software?

FYI, the cat has a 2x annotated genome and a 3x genome without annotation. I have already run the RNA-seq analysis for the control and treated reads against these genomes to find the genes.

The problem is that I don't know how to compare control and treated: when I try to compare them with RNA-seq, the software always ticks 'Prokaryote' instead of 'Eukaryote'. I read your CLC bio tutorial on RNA-seq, but I am still confused about this.

Help me..huhu

Thank you.

**amhein** · 03-28-2011, 03:42 AM

Hi syambmed,

Here are a few comments to your questions:

1: I think it would probably be preferable to include both control and cell line reads in your de novo assembly.

2: If the box plots look OK, there is probably no need to normalize your data. Also, there are two types of statistical tests available. The Gaussian-based tests assume a continuous expression measure (such as RPKM) and require replicates for each condition. The proportion-based tests compare counts (such as a read count, e.g. 'Total gene reads') and can work with or without replicates. Because the proportion-based tests compare proportions, they implicitly normalize samples, so you should not use them on normalized data.

3: The RNA-seq analysis in the Genomics Workbench allows you to work either (a) with an annotated genome or (b) with a list of reference sequences (e.g. ESTs). Option (b) is the one to use when you assemble against reference sequences that you found in your de novo analysis. Option (a) requires a genome sequence that has 'gene', and possibly 'mRNA', annotations. If only gene annotations are available, the reads will be assembled against the gene regions. If 'mRNA' annotations are also available, you can choose between the 'Eukaryote' and 'Prokaryote' modes. If you choose 'Eukaryote', reads will also be assembled against transcripts. If you do not have mRNA annotations on your reference sequence for the cat, the 'Prokaryote' option is the only one available. It still makes sense to use this option; it is our name for it that is a bit misleading, sorry.
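
To make the distinction in point 2 concrete, here is a minimal Python sketch of a continuous measure (RPKM) next to a proportion-based comparison of raw counts. The pooled two-proportion z-test is a textbook stand-in chosen for illustration; it is an assumption and not necessarily the test the Workbench implements. The function names and the example counts are hypothetical.

```python
import math


def rpkm(gene_reads, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    return gene_reads * 1e9 / (gene_length_bp * total_mapped_reads)


def two_proportion_z(count_a, total_a, count_b, total_b):
    """Pooled two-proportion z-test on raw read counts.

    Returns (z, two-sided p-value). Illustrative only: an assumed
    stand-in for a proportion-based test, not the Workbench's own code.
    """
    p_a = count_a / total_a
    p_b = count_b / total_b
    pooled = (count_a + count_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical gene of 2000 bp: 480 reads out of 48 million in the control,
# 1000 reads out of 50 million in the treated sample.
z, p = two_proportion_z(480, 48_000_000, 1000, 50_000_000)
print(f"z = {z:.2f}, two-sided p = {p:.3g}")
print(f"control RPKM = {rpkm(480, 2000, 48_000_000):.2f}")
```

Because the test divides each count by its own library total, the samples are put on a common scale inside the test itself, which is why it should be fed raw counts rather than already normalized values.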

Hope this helps. If you need further assistance please contact our Support people.

Anne-Mette (CLC developer)
On behalf of Roald

**Roald** · 03-28-2011, 11:02 AM

Thanks!!

Thanks for your help Anne-Mette!!

**syambmed** · 03-30-2011, 12:19 AM

Confirmation

Dear Anne-Mette,

Thank you for your reply. Your reply has shed some light in my tunnel..haha..

Can I confirm these with you.

1. So, is it preferable to run RNA-seq on 'de novo-ed control reads' vs 'de novo-ed treated reads' rather than on control (raw reads) vs treated (raw reads)? I tried 'raw control' vs 'raw treated' reads once, but my computer froze after 10 hours with only 1 percent progress (16 cores, 47 GB RAM, control = 48 million reads vs treated = 50 million reads).

2. I don't have any replicates..just 1 control and 1 treated sequenced. Thus, based on your suggestion, proportion-based tests are the most appropriate.

3. The 2x cat genome has mRNA annotations, so I don't have a problem with this.

I found that CLC GWB is very user friendly especially for a newbie like me. Keep up the good work.

**Irsan_Kooi** · 06-07-2011, 04:26 AM

Does anyone have an idea how long it takes to perform a single-end assembly with CLC Assembly Cell 3.2.2 on 24 Gbases of data using a quad-core machine with 16 GB of RAM?

P.S. I know what they claim on the company website; I would just like to hear about the experiences of an unbiased user...

**sklages** · 06-07-2011, 11:38 AM

There is probably no correct answer. It may depend on organism, type of library, type of sequence data, quality of sequence data, size of target (genome, transcriptome), type of processors, speed of I/O, etc. And, .. 16 GB of RAM is not too much ... :-)

Let us know when your assembly has finished and how the quality is ..

Sven

**NextGenSeq** · 06-07-2011, 12:00 PM

We don't have the Assembly Cell, but on a computer with 16 GB of RAM and 24 GB of data it would take about 6 hours. I've assembled 250 million reads from a HiSeq in ~16 hours. This is for reference assembly. However, de novo assembly takes about the same time.

**Abishai3911** · 06-30-2011, 02:41 PM

Hi,

I am basically a molecular biologist/biochemist and not a bioinformatician. However, I have been trying to use CLC Genomics Workbench to analyze my 454 data from PCR amplicons. I was able to import the .fna and .qual files into CLC. Now, when I use "Map reads to reference" under "Highthroughput sequencing" for my sequencing reads (121,000 sequences of 310 bases) with a 32 bp reference sequence, the number of matched sequences it reports is incorrect. For example, I get only 97 matches instead of the at least 10,000 that are expected. Also, when the reference sequence is shorter, for example 15 bp, it sometimes reports a match count of zero.

Can somebody help me with this? Am I doing the mapping correctly?

Thanks in advance.

JAG
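
One way to sanity-check the expected match count outside the Workbench is to count how many reads contain the short reference as an exact substring on either strand. The Python sketch below does this for FASTA-formatted lines such as a .fna file. It is only an assumed cross-check: it ignores mismatches and indels (so it gives a lower bound for 454 reads), it does not reproduce how the Workbench maps reads, and the function names and the `reads.fna` file name are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_reads_containing(lines, reference):
    """Count reads that contain `reference` exactly, on either strand.

    Returns (reads_with_hit, total_reads).
    """
    ref = reference.upper()
    ref_rc = reverse_complement(ref)
    hits = total = 0
    for _, seq in read_fasta(lines):
        total += 1
        seq = seq.upper()
        if ref in seq or ref_rc in seq:
            hits += 1
    return hits, total


# Toy input; for real data, pass an open file instead, e.g.
#   with open("reads.fna") as handle:   # hypothetical file name
#       print(count_reads_containing(handle, "GATTACA"))
toy = [">r1", "CCGATTACAGG", ">r2", "GGGGGGGG", ">r3", "AATGTAATCTT"]
print(count_reads_containing(toy, "GATTACA"))
```

If the exact-match count is far above what the mapping reports, the difference is more likely to come from the mapping settings than from the reads themselves.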

**shaohua.fan** · 07-19-2011, 07:04 AM

hi, CLC people,

I have a question about CLC Genomics Workbench: when will CLC add a scaffolding option to the genome assembly? Up to the latest version (4.7.2), CLC Genomics Workbench still does not support this, but it is important for genome assembly.

Thanx

**usad** · 07-19-2011, 07:08 AM

No idea,
I guess the easiest way to help yourself is to use SSPACE after you have got your contigs with CLC.

Cheers,
björn

**shaohua.fan** · 07-19-2011, 07:12 AM

But SSPACE does not support scaffolding with 454 reads.

**usad** · 07-19-2011, 08:02 AM

I didn't know you have 454 data. So what kind of data do you have?

If it is 99% Illumina and a bit of 454 for scaffolding:
I reckon SSPACE can also be beaten into submission by giving it fake reads. It works with SOAP at least; you could plainmail Tbolger if you wanted to give that a shot. Or better yet, switch to an assembler/scaffolder that takes all data into account. (I guess that was why you asked the question in the first place :-))

Cheers,
björn

**shaohua.fan** · 07-19-2011, 08:13 AM

I have tried trimming the long 454 reads (20K PE) to 36 and 72 bp and feeding them to SSPACE as Illumina reads, but the scaffolding quality didn't improve much.

The reason I put the question to the CLC people is that we bought CLC because it appeared to be an all-in-one package (de novo genome assembly with hybrid 454 and Illumina data). But the scaffolding function, which is essential for assembling a complicated genome, is not included. I guess CLC expects all its customers to buy CLC and then de novo assemble only viral or simple bacterial genomes?
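
The fixed-length trimming described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: it keeps the first `length` bases of each mate (which end was actually kept is not stated in this thread), it drops pairs where either mate is too short, and it does not handle the mate orientation or insert-size settings a scaffolder needs. The `trim_pairs` name and the toy reads are hypothetical.

```python
def trim_pairs(pairs, length):
    """Trim both mates of each pair to their first `length` bases.

    `pairs` is an iterable of (name, forward_seq, reverse_seq) tuples.
    Pairs in which either mate is shorter than `length` are skipped.
    """
    for name, fwd, rev in pairs:
        if len(fwd) >= length and len(rev) >= length:
            yield name, fwd[:length], rev[:length]


# Toy paired reads; real 454 paired-end reads would be far longer.
pairs = [
    ("pair1", "ACGTACGTAC", "TTGGCCAATT"),
    ("pair2", "ACG", "TTGGCCAATT"),  # forward mate too short, dropped
]
for name, fwd, rev in trim_pairs(pairs, 6):
    print(name, fwd, rev)
```

A fixed window like this takes no account of which part of the read is most informative, so it is the simplest possible baseline rather than a tuned approach.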

**usad** · 07-19-2011, 08:33 AM

Did you do random trimming, or did you trim them down to the region with the highest information gain (which is what we do)?

I think CLC had large genomes in mind. It is really good on RAM consumption and quite OK on thread usage and thus speed. Maybe CLC4 will bring some scaffolding capabilities?

Cheers,
björn
