
  • #16
    Hi Torst,
    did you see any significant improvements in terms of speed in the 3.2 release from March? They stated that it was significantly faster (like 25%).
    Last edited by Stegger; 04-01-2009, 12:18 AM.



    • #17
      Stegger

      One thing different in the new release is that there is no longer a minimum contig length of 200 bases ... strangely enough I am now getting "contigs" in de novo assembly of 36 bases ... from 36 base Solexa reads! ... this seems to be a glitch ... my problem with CLC is related to connecting contigs that "by eye" have plenty of coverage at overlapping regions but CLC won't connect them ... the penalty adjustments don't seem to do anything significant ... mismatch penalty of 2 gives basically the same result as mismatch penalty of 1 for de novo assembly ... with Velvet there is a large difference in the contig size when you reduce the coverage_cutoff ... other problems, like accuracy, are introduced with reduced coverage_cutoff but at least it acts as one would expect ... with CLC, staring at some of the contig ends after blasting them on what for sure is where they come together, and then looking at the read coverage in the unjoined region, it is hard to understand what kept the assembler from joining them into a larger contig ... on the other hand, CLC does give you the graphic that neatly lines up all the reads so you have the opportunity of looking at them to try to understand how it made its decisions ...

      as Torst points out, the many helpful graphic utilities with CLC (presumably the reason it is slow?) make the experience more pleasant ...

      Rudy



      • #18
        Originally posted by Stegger
        did you see any significant improvements in terms of speed in the 3.2 release from March? They stated that it was significantly faster (like 25%).
        We haven't played with it yet. We just got the single-PC licences and are waiting for some new beefed-up desktops to arrive (16 GB RAM + Quad CPU). I'm busy with Velvet/Shrimp pipelines, but the biologists here will put it through its paces.

        The main issue is that there is no true objective criterion for comparing de novo assemblies when no close references are available.



        • #19
          Originally posted by Torst
          The main issue is that there is no true objective criterion for comparing de novo assemblies when no close references are available.
          That's true. I have the Genomics program and am really happy with it, but I am mostly a molecular biologist with an interest in bioinformatics, not an expert. So I like the options it provides in a somewhat familiar interface.



          • #20
            Originally posted by RudyS View Post
            Stuff
            Rudy
            Thanks Rudy, I will try and look a bit further into my assembly output!



            • #21
              objective criterion for comparing de novo assemblies

              The main issue is that there is no true objective criterion for comparing de novo assemblies when no close references are available.

              Torsten

              For your bacterial genomes the majority of the DNA is coding for proteins (presumably) ... long open reading frames for proteins that "make sense" is a decent biological criterion ... assembly errors will produce stop codons at a relatively high rate ... indels mostly lead to out-of-frame shifts more often than expected ... I have seen reports of people working on programs to incorporate this kind of CDS "spell-check" ... I do it with undergrads ...
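The CDS "spell-check" idea can be sketched roughly as follows: scan a contig in all six reading frames and report the longest stop-free stretch, on the reasoning that a contig from a gene-dense bacterial genome should contain long open frames, while assembly errors tend to introduce premature stops. This is only an illustrative sketch, not any published tool; the function names are hypothetical and the standard stop codons TAA, TAG and TGA are assumed.

```python
# Standard stop codons (assumption: standard/bacterial genetic code).
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf_per_frame(seq):
    """Longest stop-free run, in codons, for each of the six reading frames.

    Keys are (strand, offset) tuples, e.g. ("+", 0) or ("-", 2).
    """
    seq = seq.upper()
    result = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            longest = current = 0
            for i in range(offset, len(s) - 2, 3):
                if s[i:i + 3] in STOP_CODONS:
                    current = 0  # a stop codon ends the open stretch
                else:
                    current += 1
                    longest = max(longest, current)
            result[(strand, offset)] = longest
    return result


def best_open_fraction(seq):
    """Fraction of the contig covered by its single longest stop-free stretch."""
    if len(seq) < 3:
        return 0.0
    return 3 * max(longest_orf_per_frame(seq).values()) / len(seq)
```

Comparing `best_open_fraction` between two assemblies of the same reads gives a crude, reference-free signal; it says nothing about non-coding regions, and it does not require a start codon.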

              Rudy



              • #22
                CLC Workbench 3.5

                Hello,

                I'm new to NGS and this list. So this is my first posting ... ;-)

                I am testing the CLC Genomics Workbench 3.5 for our molecular biologists (our main users). I like the user interface, and the assembly against a reference genome/transcriptome is fast (comparable with Bowtie - not arguing about minutes ...) and "only" consumes about 2 GB of memory.

                Still the application is memory greedy - the assembler/mapper seems to be a stand-alone binary program (C/C++?) that's called by the workbench, whereas the rest is Java, which consumes lots of memory (~30 GB when loading 7 million Solexa reads in FASTQ format and the human RefSeq mRNAs as the reference).

                I run the workbench on a 64 GB Linux machine, but our end users only have small WinXP workstations. Even if I did the assemblies and mappings for them, the resulting contig file is too large to load on any WinXP machine (limited to <4 GB of memory) for browsing. Anyway, there's probably a trick to split things up ... (maybe RTFM helps ;-).

                We're doing RNA-Seq (qualitative), and the main reason our biologists are interested in the workbench is to query for their favorite gene in the assembly and see how many reads align where - to confirm the presence of transcripts and ultimately/hopefully work out tissue-specific isoforms. However, for the moment the search capabilities in the workbench are not yet as good as I'd like, e.g. the assembled contigs table does not allow searching by gene name even though the reference is RefSeq mRNA from GenBank with lots of annotation. I guess they're still improving this kind of functionality.

                Does anybody have experience using their Genomics Server in combination with the workbench? It's supposed to let users run the workbench as a client and have the assembly and mapping calculated on the server, but again loading the results into the client for browsing could still be a bottleneck.

                Finally, what alternatives are there for browsing assembly/mapping results (when mapping to a reference genome) interactively and with some graphics, I mean for end users? I just read about MapView but haven't tested it yet.

                regards,

                Arne



                • #23
                  Hi all,
                  I am just beginning to evaluate CLC Genomic Workbench for use with Illumina output and I am finding it so 454 orientated that it is driving me crazy with irrelevant instructions. Does anyone have any clear instruction on sorting Illumina based indexed sequencing?

                  The other question is - can we do real mapping with CLC or are we stuck with contig assembly (with or without reference)? I do a lot of work with small ncRNAs and cannot find any tools in the trial that are remotely useful. I also find the comparison of their assembler with Maq and SOAP a laugh: this is comparing an assembler with mappers.
                  I am working under an Ubuntu 64-bit environment and the data loading of one lane of paired-end reads was extremely slow.

                  So far I feel the reality is not living up to the hype, or maybe I am penalised for not working with human/mouse/rat resequencing data. Does anyone know if there are any tutorials on the NGS part of CLC bio that are relevant to indexed Illumina data or to data from miRNAs?
                  By the way, thanks for the Velvet comparison. So far that has been the best de novo assembler for our group.

                  Cheers,
                  Lesley



                  • #24
                    Workbench issues

                    ** Disclaimer: I work at CLC bio **

                    Hi Lesley,
                    I am sorry to hear that you have had some problems getting started with our workbench. I have added a few comments below that I hope are useful to you.

                    We strive to cater for data from all major platforms by e.g. having a dedicated short read assembler for Illumina/Helicos data and a dedicated color-space assembler for SOLiD data. But we can of course always get better at this, so I would be really grateful to learn which parts of the software you find too 454-orientated?

                    Regarding the indexed sequencing I would like to point you to our Multiplexing module - you can read more at http://www.clcbio.com/index.php?id=1...tiplexing.html and please let me know what you think since this is a feature that we review quite often to keep track with new sequencing protocols.

                    Regarding the mapping/assembly issue you raise and the comparison between CLC and other assemblers, I need a bit more info to give you a good answer. Could you tell me what your definition of mapping is, and how this differs from reference assembly and what your specific concern is with our algorithm comparisons?
                    Perhaps you would also be interested in reading some of our white papers on this issue at http://www.clcbio.com/index.php?id=1368 Please note that these algorithms are exactly the same as implemented in the Workbench even though the white papers pertain to the stand-alone command line software.

                    Better support for quantification and discovery of small RNAs is definitely something that we are working on improving. As you may have noticed, we have a full expression analysis package that allows downstream analysis of expression data. As of now this takes input from analog expression arrays and digital RNA-seq data. As of the next release it will also accept data from digital tag-based expression analysis, and it is our plan to extend this with expression data from small RNA quantification experiments as well.

                    Regarding the data import we have increased the speed quite dramatically recently, so I hope you will give the latest version a spin - see more at http://www.clcbio.com/index.php?id=1297

                    We have a bunch of tutorials lying around at http://www.clcbio.com/index.php?id=649 but unfortunately we do not have any for multiplexing yet - I will pass that to our documentation guys.

                    Do not hesitate to get back if there is more we can do to help you.

                    Best regards

                    Roald Forsberg
                    Director of Scientific Software Solutions, CLC bio.





                    • #25
                      CLC Genomics 3.5

                      Disclaimer: I work at CLC bio
                      Hi Arne,

                      I have added some comments to your post here that I hope may be of use:
                      You are right that the Java side of our software uses a lot of memory. In order to utilize the full potential of the hardware and get things done as fast as possible we allow the program to use as much memory as is safe. This is done by checking the hardware specifications during startup.
                      If you are using the .sh installer the vmoptions should automatically be set to around 75%. However, if you think this is too much you can change the memory settings from the vmoptions file in the installation directory (e.g. clcgenomicswb3.vmoptions).
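As a rough illustration of the 75% rule mentioned above, the sketch below computes a JVM maximum-heap line for a given amount of RAM. `-Xmx` is the standard JVM option for maximum heap size; that the CLC vmoptions file takes one such option per line is an assumption here, and `xmx_option` is a hypothetical helper, so check the file shipped with your installation before editing it.

```python
def xmx_option(total_ram_mb, fraction=0.75):
    """Build a JVM max-heap option using a fraction of total RAM (in MB).

    Hypothetical helper: it only formats the option string and does not
    read or write any vmoptions file.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return f"-Xmx{int(total_ram_mb * fraction)}m"


# Example: a 16 GB workstation at the default 75%, then a more modest 50%.
print(xmx_option(16 * 1024))
print(xmx_option(16 * 1024, fraction=0.5))
```

Lowering the fraction leaves more memory for the operating system and other programs, at the cost of less heap for the workbench.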

                      We have an ongoing effort to optimize our algorithms and data structures such that the software will run smoothly on even moderately equipped hardware and will fit the use case of doing big jobs on a large machine and then delegating the inspection to e.g. laptops.
                      On my MacBook Pro laptop I can quite comfortably view very large contigs of all human chromosomes. However, when the reference sequence of the contig is heavily decorated with annotations the machine can get a bit slow and unresponsive. This is something that we will address over the next couple of months as part of a major restructuring of our annotation handling framework. Stay tuned for that.

                      Regarding the missing search functionalities for RNA-seq results, we actually offer some quite advanced but also quite well hidden options for filtering and searching the result table (as well as most other tables). Please, have a look at http://www.clcbio.com/index.php?id=1...th_tables.html

                      I hope this helps, otherwise please get back here or try our support folks.

                      With best regards

                      Roald Forsberg
                      Director of Scientific Software Solutions, CLC bio





                      • #26
                        Thanks Roald,
                        Thanks for your quick reply. I am still waiting for a reply officially through the trial manager.

                        The multiplexing instructions specify restriction sites and tags for each end. Under Solexa sequencing the tag is read at the end of the first read. What would be extremely useful would be some instructions or tutorials explaining how to sort tags from Solexa PE indexed reads, separate from those for 454 reads, which is what is listed. Another major issue is how errors are taken into account when determining which index is which. The sequences are designed so that you can still determine indexes even with 2 errors, but from the instructions it looks as if the CLC algorithm looks for perfect matches only. This is also 454-based and not appropriate for high-throughput sequencing. We need Illumina indexing instructions, not the current ones that are for 454.
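To make the error-tolerance point concrete, here is a minimal sketch of index assignment that allows mismatches instead of requiring a perfect match. It is not CLC's or Illumina's algorithm; `assign_index` is a hypothetical helper, the index sequences in the example are made up, and only substitutions (Hamming distance) are considered, not indels.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_index(observed, known_indexes, max_mismatches=2):
    """Return the single known index closest to the observed tag.

    Returns None when no index is within max_mismatches, or when two
    indexes tie for the best match (ambiguous read).
    """
    scored = sorted((hamming(observed, idx), idx) for idx in known_indexes)
    if not scored or scored[0][0] > max_mismatches:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # two indexes are equally close: do not guess
    return scored[0][1]


# Made-up 6-nt indexes, for illustration only.
indexes = ["ACGTAC", "TTGCAA", "GGATCC"]
print(assign_index("ACGTTT", indexes))  # two mismatches from ACGTAC
print(assign_index("NNNNNN", indexes))  # no index within two mismatches
```

Tolerating two errors without misassignment only works if the index set was designed with enough pairwise distance; the tie check above rejects reads where that does not hold.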

                        Now the definition of mapping - this is where you are NOT trying to assemble contigs. This is where the aim is to take a sequence and map its position on a reference genome. For instance, you have trimmed a small RNA sequence to 22-25 nt (the size for a potential miRNA), then you find its possible positions on the genome. Since the target sequence is shorter than a single read, assembly is not required. For longer RNAs assembly is fine, but mapping will show these up just as well. Maq and SOAP do this well. The key output in this instance is a table of coordinates mapping the sequence to the reference genome. We then convert the output to GFF and view it in GBrowse. Please note that small RNA work is not mRNA-seq. They are totally different things. I am very interested in being able to link the mapping of the small RNAs to folding them and then evaluating those foldings using CLC bio. However, the reference assembly algorithm tries to assemble into contigs and completely screws up the data. At present I hate to say CLC Genomics Workbench is not suitable for small RNA Illumina work. (now there is a challenge to your guys :-)
                        I suggest your development team take a complete newbie (with no 454 or Illumina or CLC experience), give them Illumina data and let them tell you what is wrong with your documentation.
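The "table of coordinates, then GFF" output described above can be sketched as follows. This is a toy illustration only: it finds exact matches on both strands by plain string search, whereas real mappers such as Maq and SOAP tolerate mismatches and use indexes for speed. The function names, the `exact_match` source label and the `small_RNA` feature type are assumptions chosen for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_exact_hits(read, reference):
    """Return sorted (start, end, strand) hits, 1-based and inclusive."""
    hits = []
    for strand, query in (("+", read), ("-", revcomp(read))):
        pos = reference.find(query)
        while pos != -1:
            hits.append((pos + 1, pos + len(query), strand))
            pos = reference.find(query, pos + 1)  # allow overlapping hits
    return sorted(set(hits))


def to_gff(seqid, read_id, hits, source="exact_match", feature="small_RNA"):
    """Format hits as tab-delimited 9-column GFF lines."""
    return [
        "\t".join([seqid, source, feature, str(start), str(end),
                   ".", strand, ".", f"ID={read_id}.{n}"])
        for n, (start, end, strand) in enumerate(hits, 1)
    ]


# Toy reference and read, for illustration only.
reference = "GGAACGGGCGTTGG"
for line in to_gff("chr1", "read1", find_exact_hits("AACG", reference)):
    print(line)
```

Each read yields zero or more coordinate rows and no contig is ever built, which is the distinction from reference assembly being drawn here.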
                        I am willing to work directly with you on this if you like and trial any improvements that are made. We are trialling this until the end of August when we are running a workshop on NGS. We have a reputation of being honest and brutal when it comes to the performance of software. At the moment we are tending towards the brutal but it would be nice to lean the other way.
                        Cheers and thanks again,
                        Lesley



                        • #27
                          To Lesley

                          Disclaimer: I work at CLC bio
                          Hi Lesley,

                          It is correct that there are no options to sort tagged/barcoded Illumina PE data in our current "Multiplexing by tag" functionality. We designed this module to be used with 454 data and to be flexible enough to accommodate "home brew" multiplexing as is performed by a number of our users.
                          The reason that we did not focus on the indexed Illumina data is that Illumina's Pipeline software should be able to sort the tagged reads and append the barcode to the sequence name such that downstream analysis software, like ours, needs to address the naming conventions rather than the actual tag in the sequence. For this reason, we designed a "Multiplexing by name" module that allows the user to sort reads based on naming conventions - see http://www.clcbio.com/index.php?id=1...nces_name.html
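Sorting by name, as opposed to sorting by the tag in the sequence, can be sketched like this. The naming convention is an assumption made for the example (barcode after a `#` in the read name, optionally followed by `/1` or `/2` for the pair member); the actual Pipeline output format is not given in this thread, and both function names are hypothetical.

```python
from collections import defaultdict


def barcode_from_name(name):
    """Extract the barcode from a read name, or None if there is none.

    Assumed convention: '<read id>#<barcode>' with an optional '/1' or
    '/2' pair suffix after the barcode.
    """
    if "#" not in name:
        return None
    tag = name.rsplit("#", 1)[1]
    return tag.split("/", 1)[0] or None


def sort_by_name(read_names):
    """Group read names by barcode; reads without a barcode go under None."""
    bins = defaultdict(list)
    for name in read_names:
        bins[barcode_from_name(name)].append(name)
    return dict(bins)


# Made-up read names, for illustration only.
names = [
    "run1:6:73:941:1973#ATCACG/1",
    "run1:6:73:941:1974#CGATGT/1",
    "run1:6:73:941:1975#ATCACG/2",
    "noindex",
]
for barcode, members in sort_by_name(names).items():
    print(barcode, len(members))
```

Because only the names are parsed, this step is cheap; any error tolerance has to have been applied upstream, when the barcode was written into the name.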

                          However, if the Pipeline sorting does not work or is not optimal, we are of course grateful to know this so that we can extend our current functionality such that Illumina PE data can also be sorted in our software, and we are grateful to get your input on this. Could you let me know what your reason is for not using the Pipeline software to filter the reads?

                          Regarding the mapping issue. We do not have any customized features for small RNAs but this is in our roadmap for this year. However, I think that our tools still should be applicable for a lot of small RNA related issues and hope that we can use your input to improve this.
                          Currently, the workflow in our software is such that when you perform mapping/reference assembly against a number of reference sequences, e.g. the chromosomes of a reference genome, the program will output a number of contigs which represent the global alignments of the reads against the references. Your first problem is then that you would like to have the result as a tab-delimited file of the local alignments of reads against the references. Our command-line assembly program suite (NGS Cell) actually already offers this option - http://www.clcbio.com/index.php?id=1...e_Program.html and we have a plan to make this available in the workbench as well. It is really simple to do so, as all the information about the local alignment is also contained in the contig objects. Your reason for outputting the tab-delimited format is for viewing in GBrowse. However, until we have the tab-delimited export sorted, I would suggest that you view the results in the contig objects inside the genomics workbench, which we in all modesty believe is a pretty powerful contig viewer.

                          For a "full" analysis workflow, I would suggest that you try something like this:
                          • reference assemble your small RNA reads against the reference to produce full reference contigs
                          • run the ChIP-seq analysis on the contig table/contigs but disable the read shifting and read orientation filters - this is basically using the module as a peak detector for regions enriched in small RNAs
                          • use the ChIP-seq peak table to navigate the putative small RNA sites
                          • potentially, you can use the extract annotations function to extract all putative small RNA encoding regions to a sequence list that can then be exported to a miRNA detection software or whatever is relevant to your problem
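The "peak detector" step in the workflow above can be illustrated with a simple coverage threshold: pile up the mapped read intervals and report regions where depth reaches a minimum. This is a generic sketch and not the algorithm used by the ChIP-seq module; `coverage_peaks` and its parameters are hypothetical.

```python
def coverage_peaks(read_intervals, min_depth):
    """Find regions where read depth is at least min_depth.

    read_intervals: iterable of (start, end), 1-based and inclusive.
    Returns a list of (start, end, max_depth) tuples in genomic order.
    """
    if min_depth < 1:
        raise ValueError("min_depth must be at least 1")

    # Turn each read into a +1 event at its start and a -1 event just past its end.
    events = []
    for start, end in read_intervals:
        events.append((start, 1))
        events.append((end + 1, -1))
    events.sort()

    peaks = []
    depth = 0
    peak_start = None
    peak_max = 0
    i = 0
    while i < len(events):
        pos = events[i][0]
        # Apply every event at this position before evaluating the depth.
        while i < len(events) and events[i][0] == pos:
            depth += events[i][1]
            i += 1
        if depth >= min_depth:
            if peak_start is None:
                peak_start, peak_max = pos, depth
            else:
                peak_max = max(peak_max, depth)
        elif peak_start is not None:
            peaks.append((peak_start, pos - 1, peak_max))
            peak_start = None
    return peaks


# Made-up 22-nt read placements, for illustration only.
reads = [(10, 31), (12, 33), (11, 32), (100, 121)]
print(coverage_peaks(reads, min_depth=2))
```

The resulting regions play the role of the peak table: each one is a candidate small RNA site that can be inspected or extracted for downstream miRNA detection.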


                          I would be happy to hear how you get along and also happy to give this a go myself if I can get the data. Your input is much appreciated and I hope that we can keep the dialog open - you are also welcome to contact me in person - and see if we can't get you leaning the other way

                          Cheers

                          Roald



                          • #28
                            Thanks again Roald,
                            We are going to try this workflow for our sequences and see how it goes.
                            The reason we are not using the pipeline for separation is that we had issues with version 1.3 and we have just received 1.4. We now have a script under 1.4 which will be used from now on but we will have to retro-fit for previously run data.
                            I tried separating on name and my system (with 8G RAM on 64 bit quad core) froze with one lane of data (3 indexes).
                            However, I am going to try again (after freeing up as much memory as possible) to see if it will work.
                            Cheers,
                            Lesley



                            • #29
                              To Lesley

                              Thanks for the info Lesley.
                              Do you happen to have a sample of some tagged Illumina data that I can get?
                              I basically just need a description of the format, so just a few lines from a file would suffice.

                              Cheers

                              Roald

                              Disclaimer: I work at CLC bio



                              • #30
                                Hi,

                                Anyone else having problems viewing graphical output from CLCGWB?

                                We're working with high coverage assemblies (5-20K ave/10M reads) and it takes 10-30 min to create any type of graphical output and even longer ~20-30min to export csv files of any graphs. We're working with version 3.6 but have had the same problem with all previous versions. I assume this is in part due to the depth of coverage we have but I'd like to rule out any problem with our workstation/install.
