SEQanswers

Old 04-02-2009, 12:48 PM   #21
RudyS
Member
 
Location: new york

Join Date: May 2008
Posts: 20
Default objective criterion for comparing de novo assemblies

"The main issue is that there is no true objective criterion for comparing de novo assemblies when no close references are available."

Torsten

For your bacterial genomes the majority of the DNA is coding for proteins (presumably) ... long open reading frames for proteins that "make sense" are a decent biological criterion ... assembly errors will produce stop codons at a relatively high rate ... indels lead to out-of-frame shifts more often than expected ... I have seen reports of people working on programs to incorporate this kind of CDS "spell-check" ... I do it with undergrads ...
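To make the "spell-check" idea concrete, here is a minimal sketch, not any published tool: it reports the longest stop-free run of codons in each of the six reading frames of a contig. The function names are made up for illustration, and the standard genetic code is assumed (many bacteria use NCBI table 11, which shares the same three stop codons).

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_orf_per_frame(seq):
    """Longest stop-free stretch, in codons, for each of the six frames."""
    seq = seq.upper()
    result = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            longest = current = 0
            for i in range(frame, len(s) - 2, 3):
                # Codons with ambiguous bases are treated as non-stop.
                if CODON_TABLE.get(s[i:i + 3], "X") == "*":
                    current = 0
                else:
                    current += 1
                    longest = max(longest, current)
            result[f"{strand}{frame + 1}"] = longest
    return result
```

A contig where no frame carries a long stop-free run across a region that should be coding is a candidate for an assembly error or an indel.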

Rudy
Old 06-10-2009, 05:06 AM   #22
arne.muller
Junior Member
 
Location: Europe

Join Date: Jun 2009
Posts: 3
Default CLC Workbench 3.5

Hello,

I'm new to NGS and this list. So this is my first posting ... ;-)

I am testing the CLC Genomics Workbench 3.5 for our molecular biologists (our main users). I like the user interface, and the assembly against a reference genome/transcriptome is fast (comparable with bowtie - not arguing about minutes ...) and "only" consumes about 2 GB of memory.

Still, the application is memory greedy - the assembler/mapper seems to be a stand-alone binary program (C/C++?) that's called by the workbench, whereas the rest is Java, which consumes lots of memory (~30 GB when loading 7 million Solexa reads in FASTQ format with the human RefSeq mRNAs as the reference).

I run the workbench on a 64 GB Linux machine, but our end users only have small Windows XP workstations. Even if I did the assemblies and mappings for them, the resulting contig file is too large to load on any Windows XP machine (limited to <4 GB of memory) for browsing. Anyway, there's probably a trick to split things up ... (maybe RTFM helps ;-).

We're doing RNA-Seq (qualitative), and the main reason our biologists are interested in the workbench is to query for their favorite gene in the assembly and see how many reads align where - to confirm the presence of transcripts and ultimately/hopefully work out tissue-specific isoforms. However, for the moment the search capabilities in the workbench are not yet as good as I'd like, e.g. the assembled contigs table does not allow searching for gene names even though the reference is RefSeq mRNA from GenBank with lots of annotation. I guess they're still improving this kind of functionality.

Does anybody have experience using their Genomics Server in combination with the workbench? It's supposed to let users run the workbench as a client and have the assembly and mapping computed on the server, but again loading the results into the client for browsing could still be a bottleneck.

Finally, what alternatives are there for browsing assembly/mapping results (when mapping to a reference genome) interactively and with some graphics, I mean for end users? I just read about MapView but haven't tested it yet.

regards,

Arne
Old 06-11-2009, 04:50 PM   #23
Lesley
Junior Member
 
Location: New Zealand

Join Date: Jan 2008
Posts: 8
Default

Hi all,
I am just beginning to evaluate CLC Genomics Workbench for use with Illumina output and I am finding it so 454-orientated that it is driving me crazy with irrelevant instructions. Does anyone have any clear instructions on sorting Illumina-based indexed sequencing data?

The other question is: can we do real mapping with CLC, or are we stuck with contig assembly (with or without reference)? I do a lot of work with small ncRNAs and cannot find any tools in the trial that are remotely useful. I also find the comparison of their assembler with MAQ and SOAP a laugh; this is comparing an assembler with mappers.
I am working in a 64-bit Ubuntu environment, and loading the data for one lane of paired-end reads was extremely slow.

So far I feel the reality is not living up to the hype, or maybe I am penalised for not working with human/mouse/rat resequencing data. Does anyone know if there are any tutorials on the NGS part of CLC bio that are relevant to indexed Illumina data or to miRNA data?
By the way, thanks for the Velvet comparison. So far that has been the best de novo assembler for our group.

Cheers,
Lesley
Old 06-12-2009, 03:11 AM   #24
Roald
Director at CLC bio
 
Location: Denmark

Join Date: Aug 2008
Posts: 26
Default Workbench issues

** Disclaimer: I work at CLC bio **

Hi Lesley,
I am sorry to hear that you have had some problems getting started with our workbench. I have added a few comments below that I hope are useful to you.

We strive to cater for data from all major platforms by e.g. having a dedicated short read assembler for Illumina/Helicos data and a dedicated color-space assembler for SOLiD data. But we can of course always get better at this, so I would be really grateful to learn which parts of the software you find too 454-orientated.

Regarding the indexed sequencing I would like to point you to our Multiplexing module - you can read more at http://www.clcbio.com/index.php?id=1...tiplexing.html and please let me know what you think, since this is a feature that we review quite often to keep pace with new sequencing protocols.

Regarding the mapping/assembly issue you raise and the comparison between CLC and other assemblers, I need a bit more info to give you a good answer. Could you tell me what your definition of mapping is, how this differs from reference assembly, and what your specific concern is with our algorithm comparisons?
Perhaps you would also be interested in reading some of our white papers on this issue at http://www.clcbio.com/index.php?id=1368. Please note that these algorithms are exactly the same as those implemented in the Workbench, even though the white papers pertain to the stand-alone command-line software.

Better support for quantification and discovery of small RNAs is definitely something that we are working on improving. As you may have noticed, we have a full expression analysis package that allows downstream analysis of expression data. As of now this takes input from analog expression arrays and digital RNA-seq data. As of the next release it will also accept data from digital tag-based expression analysis, and it is our plan to extend this with expression data from small RNA quantification experiments as well.

Regarding the data import we have increased the speed quite dramatically recently, so I hope you will give the latest version a spin - see more at http://www.clcbio.com/index.php?id=1297

We have a bunch of tutorials lying around at http://www.clcbio.com/index.php?id=649 but unfortunately we do not have any for multiplexing yet - I will pass that to our documentation guys.

Do not hesitate to get back if there is more we can do to help you.

Best regards

Roald Forsberg
Director of Scientific Software Solutions, CLC bio.


Old 06-15-2009, 02:38 AM   #25
Roald
Director at CLC bio
 
Location: Denmark

Join Date: Aug 2008
Posts: 26
Default CLC Genomics 3.5

Disclaimer: I work at CLC bio
Hi Arne,

I have added some comments to your post here that I hope may be of use:
You are right that the Java side of our software uses a lot of memory. In order to utilize the full potential of the hardware and get things done as fast as possible we allow the program to use as much memory as is safe. This is done by checking the hardware specifications during startup.
If you are using the .sh installer the vmoptions should automatically be set to around 75%. However, if you think this is too much you can change the memory settings from the vmoptions file in the installation directory (e.g. clcgenomicswb3.vmoptions).

We have an ongoing effort to optimize our algorithms and data structures such that the software will run smoothly on even moderately equipped hardware and will fit the use case of doing big jobs on a large machine and then delegating the inspection to e.g. laptops.
On my MacBook Pro laptop I can quite comfortably view very large contigs of all human chromosomes. However, when the reference sequence of the contig is heavily decorated with annotations the machine can get a bit slow and unresponsive. This is something that we will address over the next couple of months as part of a major restructuring of our annotation handling framework. Stay tuned for that.

Regarding the missing search functionalities for RNA-seq results, we actually offer some quite advanced but also quite well hidden options for filtering and searching the result table (as well as most other tables). Please have a look at http://www.clcbio.com/index.php?id=1...th_tables.html

I hope this helps, otherwise please get back here or try our support folks.

With best regards

Roald Forsberg
Director of Scientific Software Solutions, CLC bio


Old 06-15-2009, 03:07 PM   #26
Lesley
Junior Member
 
Location: New Zealand

Join Date: Jan 2008
Posts: 8
Default

Thanks Roald,
Thanks for your quick reply. I am still waiting for a reply officially through the trial manager.

The multiplexing instructions specify restriction sites and tags for each end. Under Solexa sequencing the tag is read at the end of the first read. What would be extremely useful would be some instructions or tutorials explaining how to sort tags from Solexa PE indexed reads, separate from those for 454 reads, which is what is listed. Another major issue is how errors are taken into account when determining which index is which. The sequences are designed so that you can still determine indexes even with 2 errors, but from the instructions it looks as if the CLC algorithm looks for perfect matches only. This is also 454-based and not appropriate for high-throughput sequencing. We need Illumina indexing instructions, not the current ones that are for 454.
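As a rough illustration of error-tolerant index assignment (a sketch only, not how CLC or the Illumina Pipeline does it), the observed index can be compared against each known index by Hamming distance and accepted only when exactly one index lies within the allowed number of mismatches. The function names and the example index sequences in the usage note are assumptions, and the index is assumed to have already been cut out of the read.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_index(observed, known_indexes, max_mismatches=2):
    """Return the single known index within max_mismatches of the observed
    sequence, or None when no index (or more than one) qualifies."""
    hits = [
        idx for idx in known_indexes
        if len(idx) == len(observed) and hamming(observed, idx) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None
```

For example, with the illustrative set `["ATCACG", "CGATGT", "TTAGGC"]`, an observed `"ATCAAA"` has two mismatches against the first index and at least five against the others, so it is assigned to `"ATCACG"`. Returning `None` on ambiguity matters because tolerating 2 errors is only safe if the index set was designed with enough distance between members.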

Now the definition of mapping - this is where you are NOT trying to assemble contigs. The aim is to take a sequence and map its position on a reference genome. For instance, you have trimmed a small RNA sequence to 22-25 nt (the size of a potential miRNA), then you find its possible positions on the genome. Since the target sequence is shorter than a single read, assembly is not required. For longer RNAs assembly is fine, but mapping will show these up just as well. MAQ and SOAP do this well. The key output in this instance is a table of coordinates mapping the sequence to the reference genome. We then convert the output to GFF and view it in GBrowse. Please note that small RNA work is not mRNA-seq. They are totally different things. I am very interested in being able to link the mapping of the small RNAs to folding and then evaluating those foldings using CLC bio. However, the reference assembly algorithm tries to assemble into contigs and completely screws up the data. At present I hate to say CLC Genomics Workbench is not suitable for small RNA Illumina work. (now there is a challenge to your guys :-)
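For readers wondering what the coordinate-table-to-GFF step looks like, here is a minimal sketch. It assumes each hit has already been parsed into a (read id, chromosome, start, end, strand) tuple with 1-based inclusive coordinates, and it writes GFF3; the function name, the `source` and `feature_type` defaults, and the attribute layout are illustrative choices, not the output of MAQ, SOAP or any CLC tool.

```python
def hits_to_gff3(hits, source="mapper", feature_type="match"):
    """Convert (read_id, chrom, start, end, strand) tuples to GFF3 text.

    Coordinates are assumed to be 1-based and inclusive with start <= end,
    which is what GFF3 expects.
    """
    lines = ["##gff-version 3"]
    for n, (read_id, chrom, start, end, strand) in enumerate(hits, 1):
        attributes = f"ID=hit{n};Name={read_id}"
        # Columns: seqid, source, type, start, end, score, strand, phase, attributes
        lines.append("\t".join(
            [chrom, source, feature_type, str(start), str(end), ".", strand, ".", attributes]
        ))
    return "\n".join(lines)
```

Calling `hits_to_gff3([("read1", "chr1", 100, 121, "+")])` yields a header line followed by one nine-column feature line.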
I suggest your development team take a complete newbie (with no 454 or Illumina or CLC experience), give them Illumina data and let them tell you what is wrong with your documentation.
I am willing to work directly with you on this if you like and trial any improvements that are made. We are trialling this until the end of August when we are running a workshop on NGS. We have a reputation of being honest and brutal when it comes to the performance of software. At the moment we are tending towards the brutal but it would be nice to lean the other way.
Cheers and thanks again,
Lesley
Old 06-16-2009, 03:40 AM   #27
Roald
Director at CLC bio
 
Location: Denmark

Join Date: Aug 2008
Posts: 26
Default To Lesley

Disclaimer: I work at CLC bio
Hi Lesley,

It is correct that there are no options to sort tagged/barcoded Illumina PE data in our current "Multiplexing by tag" functionality. We designed this module to be used with 454 data and to be flexible enough to accommodate "home brew" multiplexing as is performed by a number of our users.
The reason that we did not focus on the indexed Illumina data is that Illumina's Pipeline software should be able to sort the tagged reads and append the barcode to the sequence name, such that downstream analysis software, like ours, needs to address the naming conventions rather than the actual tag in the sequence. For this reason, we designed a "Multiplexing by name" module that allows the user to sort reads based on naming conventions - see http://www.clcbio.com/index.php?id=1...nces_name.html
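The idea of sorting on the name rather than on the sequence can be sketched as follows. This is only an illustration, not the module's implementation: it assumes a hypothetical naming convention in which the barcode follows a `#` in the read name and an optional `/1` or `/2` mate suffix comes after it.

```python
from collections import defaultdict


def split_by_name_tag(read_names, sep="#"):
    """Group read names by the tag that follows `sep` in the name.

    Names without the separator go into the 'unassigned' group.
    """
    groups = defaultdict(list)
    for name in read_names:
        _, found, rest = name.partition(sep)
        # Drop a trailing mate suffix such as "/1" from the tag.
        tag = rest.split("/")[0] if found else "unassigned"
        groups[tag].append(name)
    return dict(groups)
```

With names such as `"r1#ATCACG/1"` and `"r2#CGATGT/1"` this yields one group per barcode, so no sequence comparison is needed at this stage.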

However, if the Pipeline sorting does not work or is not optimal, we are of course grateful to know this so that we can elaborate on our current functionality such that Illumina PE data can also be sorted in our software, and we are grateful to get your input on this. Could you let me know what your reason is for not using the Pipeline software to filter the reads?

Regarding the mapping issue. We do not have any customized features for small RNAs but this is in our roadmap for this year. However, I think that our tools still should be applicable for a lot of small RNA related issues and hope that we can use your input to improve this.
Currently, the workflow in our software is such that when you perform mapping/reference assembly against a number of reference sequences, e.g. the chromosomes of a reference genome, the program will output a number of contigs which represent the global alignments of the reads against the references. Your first problem is then that you would like to have the result as a tab-delimited file of the local alignments of reads against the references. Our command-line assembly program suite (NGS Cell) actually already offers this option - http://www.clcbio.com/index.php?id=1...e_Program.html - and we plan to make this available in the workbench as well. It is really simple to do so, as all the information about the local alignment is also contained in the contig objects. Your reason for outputting the tab-delimited format is for viewing in GBrowse. However, until we have the tab-delimited export sorted, I would suggest that you view the results in the contig objects inside the genomics workbench, which we in all modesty believe is a pretty powerful contig viewer.

For a "full" analysis workflow, I would suggest that you try something like this:
  • reference assemble your small RNA reads against the reference to produce full reference contigs
  • run the ChIP-seq analysis on the contig table/contigs but disable the read shifting and read orientation filters - this is basically using the module as a peak detector for regions enriched in small RNAs
  • use the ChIP-seq peak table to navigate the putative small RNA sites
  • potentially, you can use the extract annotations function to extract all putative small RNA encoding regions to a sequence list that can then be exported to miRNA detection software or whatever is relevant to your problem
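The "peak detector" step in this workflow can be pictured with a very small sketch. It is not the ChIP-seq module's algorithm, just a plain depth threshold under stated assumptions: reads are given as 0-based half-open (start, end) intervals that lie within the reference, and `min_depth` is a cutoff the user picks.

```python
def coverage(reads, length):
    """Per-base read depth over a reference of the given length.

    Uses a difference array, so the cost is O(reads + length).
    """
    diff = [0] * (length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1
    depth, running = [], 0
    for d in diff[:length]:
        running += d
        depth.append(running)
    return depth


def enriched_regions(depth, min_depth):
    """Return half-open (start, end) runs where depth >= min_depth."""
    regions, start = [], None
    for pos, d in enumerate(depth):
        if d >= min_depth and start is None:
            start = pos
        elif d < min_depth and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions
```

The resulting intervals play the role of the peak table: each one is a candidate small RNA locus that can be inspected or extracted for downstream miRNA prediction.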

I would be happy to hear how you get along and also happy to give this a go myself if I can get the data. Your input is much appreciated and I hope that we can keep the dialog open - you are also welcome to contact me in person - and see if we can't get you leaning the other way.

Cheers

Roald
Old 06-18-2009, 03:31 PM   #28
Lesley
Junior Member
 
Location: New Zealand

Join Date: Jan 2008
Posts: 8
Default

Thanks again Roald,
We are going to try this workflow for our sequences and see how it goes.
The reason we are not using the pipeline for separation is that we had issues with version 1.3 and we have just received 1.4. We now have a script under 1.4 which will be used from now on but we will have to retro-fit for previously run data.
I tried separating on name and my system (with 8G RAM on 64 bit quad core) froze with one lane of data (3 indexes).
However, I am going to try again (after freeing up as much memory as possible) to see if it will work.
Cheers,
Lesley
Old 06-23-2009, 12:34 AM   #29
Roald
Director at CLC bio
 
Location: Denmark

Join Date: Aug 2008
Posts: 26
Default To Lesley

Thanks for the info Lesley.
Do you happen to have a sample of some tagged Illumina data that I can get?
I basically just need a description of the format, so just a few lines from a file would suffice.

Cheers

Roald

Disclaimer: I work at CLC bio
Old 07-08-2009, 01:36 PM   #30
The_Roads
Member
 
Location: USA

Join Date: May 2009
Posts: 36
Default

Hi,

Anyone else having problems viewing graphical output from CLCGWB?

We're working with high coverage assemblies (5-20K ave/10M reads) and it takes 10-30 min to create any type of graphical output and even longer ~20-30min to export csv files of any graphs. We're working with version 3.6 but have had the same problem with all previous versions. I assume this is in part due to the depth of coverage we have but I'd like to rule out any problem with our workstation/install.
Old 07-08-2009, 11:57 PM   #31
arne.muller
Junior Member
 
Location: Europe

Join Date: Jun 2009
Posts: 3
Default

Hello,

You mean exporting the current view into a graphics file (e.g. PNG)? I've had some relatively long response times during the export, but not as long as you report. I can imagine that the time needed for export is proportional to the number of elements in the current view. Maybe just take a screenshot instead (not nice, but often that's enough)?

Arne
Old 07-09-2009, 07:01 AM   #32
The_Roads
Member
 
Location: USA

Join Date: May 2009
Posts: 36
Default

No, sorry, I meant turning on alignment info like coverage maps, non-specific reads etc. It takes a very long time for CLC to generate the graph in the top frame. Likewise, once the graph is there it takes ages to export a CSV file of the graph. I'd like to know if anyone else has this problem or whether it might be something funky with my workstation (win6x 1x quad xeon 32Gb).

Thanks
Old 07-28-2009, 02:20 PM   #33
smprince18
Junior Member
 
Location: boston

Join Date: Jun 2008
Posts: 4
Default

disclaimer I work at CLC bio

The_Roads

I was wondering if you have tried to contact [email protected] yet? Or you can reach us at (617)-444-8765. It could be that your vmoptions were not adjusted correctly by the installer. If you go to the CLCGenomicsWB3 directory and show hidden files, you will see the .vmoptions file. Open it in Notepad; there will be a line that looks like -Xmx####m, where #### is the number of MB allocated to the application (for example, -Xmx1024m means you have 1 GB of RAM allocated to the application).
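If you want to check the setting without reading the file by eye, a small sketch like the one below pulls the maximum heap size out of the text of a vmoptions file. The function name is made up; the parsing follows the standard JVM convention that `-Xmx` takes a number with an optional k, m or g suffix and that a bare number is interpreted as bytes, which is why a real file would normally carry a suffix.

```python
import re

# Multipliers for the JVM size suffixes; no suffix means bytes.
UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def max_heap_bytes(vmoptions_text):
    """Return the -Xmx value in bytes, or None if no -Xmx line is present."""
    for line in vmoptions_text.splitlines():
        match = re.fullmatch(r"-Xmx(\d+)([kKmMgG]?)", line.strip())
        if match:
            return int(match.group(1)) * UNITS[match.group(2).lower()]
    return None
```

For example, `max_heap_bytes("-Xms256m\n-Xmx1024m\n")` gives 1073741824, i.e. 1 GB, which would be far too little for a large assembly on a 32 GB workstation.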

The second possibility is that you downloaded the 32-bit version of the application and not the 64-bit one. This would result in very slow response times, since a 32-bit application can only request 2 GB of RAM.

Again if you would like some help with this contact our support team or myself directly.

Shawn M Prince
Old 07-28-2009, 03:29 PM   #34
The_Roads
Member
 
Location: USA

Join Date: May 2009
Posts: 36
Default

Hi Shawn,

Thanks I'll be in touch.

We have the 64-bit version and we've already tweaked vmoptions. Everything relating to assembling, SNP detection etc. that requires 64-bit computing is working fine. It is just anything that alters the GUI or exports graphics/text that locks the workstations on large assemblies.

The_Roads
Old 07-31-2009, 08:33 AM   #35
polsum
Member
 
Location: Texas

Join Date: May 2009
Posts: 32
Default

Hi, I have been testing the trial version of CLC workbench and I encountered two issues.

1. I BLASTed a set of Illumina-generated short reads (17-33 nt after trimming the 3' adapters) against the mouse RefSeq database using the stand-alone BLAST program with stringent parameters and found, say, 2500 reads matching mouse mRNAs. When I aligned those same 2500 reads to the same RefSeq database using CLC reference assembly, only half of them aligned to the reference. I tried all the available options (changing the gap penalties, global alignment, scores etc.), but never got all the reads to align to the reference. I think there should be more options here.

2. I used the BLAST feature in CLC Workbench, and when I view the parsed BLAST output I don't see all the columns in the overview table. For example, I don't see a strand orientation column in the overview. I do see it in the individual BLAST mappings, but that is useless for me because I need to count the number of minus and plus mappings out of the total number of mappings. This is a serious limitation for me.
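As a workaround outside the workbench, the plus/minus counts can be taken from stand-alone BLAST tabular output. The sketch below assumes the standard 12-column tabular format (query id, subject id, % identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score) from a nucleotide search, where a minus-strand hit is reported with subject start greater than subject end. The function name is made up.

```python
from collections import Counter


def count_strands(tabular_lines):
    """Count plus- and minus-strand hits in 12-column BLAST tabular output."""
    counts = Counter()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        sstart, send = int(fields[8]), int(fields[9])
        counts["minus" if sstart > send else "plus"] += 1
    return counts
```

Passing an open file handle works directly, e.g. `count_strands(open("hits.tsv"))`, where `hits.tsv` is a placeholder file name.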
Old 09-09-2009, 08:26 PM   #36
The_Roads
Member
 
Location: USA

Join Date: May 2009
Posts: 36
Default

As an update for future readers: we received some excellent help from CLC, and it appears the delays we experienced were due to the way we assembled our contigs. Using conventional assembly parameters, graphics now render in seconds to minutes (CLC 3.6.5).
Old 09-10-2009, 04:14 AM   #37
smprince18
Junior Member
 
Location: boston

Join Date: Jun 2008
Posts: 4
Default

Disclaimer I work for CLC bio

Polsum,

Once you BLAST your sequences within CLC WB, you will be shown a graphic view of the results. If you look in the lower left-hand corner of the working area you will see a table view. Once this is open you will see a default group of columns. Please note that in the right-hand side panel you can toggle the columns you want to see on and off, so you will be able to look at the direction of the results. Also, if you are using this to reference map your reads, we show orientation graphically (red = reverse read, green = forward read). The count of forward and reverse reads can also be found in the contig report. Please let me know if this clears things up for you. If you would like, I can be contacted at the CLC Boston office at 617-444-8765.

Shawn

Old 09-10-2009, 04:18 AM   #38
smprince18
Junior Member
 
Location: boston

Join Date: Jun 2008
Posts: 4
Default

The Roads,

Glad to hear that you are enjoying your experience with CLC Genomics WB. Let me know if you need anything.

Shawn M Prince

Disclaimer I work at CLC bio
Old 12-10-2009, 07:10 AM   #39
johnny
Member
 
Location: Germany

Join Date: Dec 2009
Posts: 15
Default

Hey there,

one more newbie here
I have a probably easy to answer question but can't find the solution on my own....
How can I see the protein sequence corresponding to a nucleotide sequence? In more detail, after a reference assembly, I would like to click through the variations with the "Find Conflict" button and instantly see if the protein sequence is affected as well.

Thanks for your help!
Old 12-10-2009, 07:12 AM   #40
The_Roads
Member
 
Location: USA

Join Date: May 2009
Posts: 36
Default

Hi Johnny, I assume you are using CLCGWB. If so, the conflict table is not the place to look. You should run a SNP detection. If you have an annotated reference sequence, then the table will present you with all the amino acid changes.
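To show what "amino acid change" means for a single substitution, here is a minimal sketch. It is not how the workbench computes its table; it assumes you already have the coding sequence (CDS) on the forward strand, a 0-based position within it, and the standard genetic code. The function name is made up.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def amino_acid_change(cds, pos, alt):
    """Describe the effect of substituting base `alt` at 0-based `pos` in a CDS.

    Returns a string like 'K2N' (reference residue, 1-based codon number,
    new residue); a synonymous change looks like 'K2K'.
    """
    cds = cds.upper()
    offset = pos % 3
    codon_start = pos - offset
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt.upper() + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    return f"{ref_aa}{codon_start // 3 + 1}{alt_aa}"
```

For the toy CDS `ATGAAATAG`, changing the last base of the second codon to C turns AAA (Lys) into AAC (Asn), which the function reports as `K2N`, whereas changing it to G leaves the residue unchanged.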
All times are GMT -8.

