  • #16
    Thanks a lot for the useful input. As I mentioned, the main challenge is that I have to work without a reference sequence, which leaves only a few options for de novo discovery. BLAT is indeed an attractive option, but I have not tried it yet because I need a local install due to the data volumes. This dataset was generated after only a few MiSeq runs, so as it grows a cluster or cloud is unavoidable. Still, I need to design a smart heuristic de novo strategy, otherwise any capacity will eventually be saturated in some way. That is what I am trying to do on the local machine. De novo discovery has a dedicated forum, but I could not find much discussion there. Any ideas are appreciated and will be taken into consideration.

    Comment


    • #17
      I'm not sure I completely understand what you are actually trying to do.

      De novo discovery of what? (genes?) You're sequencing an organism with unknown genome sequence? RNA-seq? I can't see what kind of data it is you're blasting.

      Comment


      • #18
        I'm not sure I completely understand what you are actually trying to do.
        This dataset and an earlier, much smaller one were obtained from ancient DNA extracted from a bone about 1000 years old. I examined the smaller dataset using online tools, first with blastn, and it was obviously heavily contaminated with bacterial DNA. However, even at that point, many reads and assembled contigs did not match known sequences perfectly, and many did not match anything at all. I used online DeconSeq to filter out known bacterial, archaeal, viral, human and mouse sequences from the smaller dataset at a 94% identity threshold. Most of the remaining 80% did not match anything with blastn, even under very loose criteria. When I lowered the identity threshold to 70%, about 60% of the dataset still passed the filter.

        I have not used this strategy on the bigger dataset yet, as all online servers kick me out, so it is likely still quite contaminated with known sequences. What I eventually want to know is what those sequences that do not match anything are: whether they carry any biologically meaningful information (as far as I can judge) or are simply artefactual junk (again, as far as I can judge). Since I know from the analysis of the smaller dataset that blastn cannot give me any clue, I want to try blastx in the hope that it will be more sensitive in detecting biologically meaningful information. Does this make sense?
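The identity-threshold filtering described above can be sketched locally. This is only an illustration, not how DeconSeq works internally: it assumes the hits come from a local BLAST+ run written in tabular form (`-outfmt 6`, where column 1 is the query ID and column 3 the percent identity). The function name and the example read IDs are hypothetical.

```python
import csv
import io


def reads_without_hits(read_ids, blast_tsv, min_identity=94.0):
    """Return the read IDs that have no hit at or above min_identity percent.

    blast_tsv is any iterable of lines in BLAST+ tabular format (-outfmt 6).
    """
    matched = set()
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query_id, percent_identity = row[0], float(row[2])
        if percent_identity >= min_identity:
            matched.add(query_id)
    # Preserve the input order of the reads that survived the filter.
    return [read_id for read_id in read_ids if read_id not in matched]


# Made-up hits purely for illustration.
example_hits = io.StringIO(
    "read1\tbactA\t98.5\t150\t2\t0\t1\t150\t10\t159\t1e-70\t270\n"
    "read2\tbactB\t72.0\t140\t39\t0\t1\t140\t5\t144\t1e-10\t80\n"
)
print(reads_without_hits(["read1", "read2", "read3"], example_hits))
```

Lowering `min_identity` removes more reads, which mirrors the 94% versus 70% comparison in the post.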
        Last edited by yaximik; 01-04-2013, 06:20 AM.

        Comment


        • #19
          Originally posted by pallevillesen
          So have you simply tested the

          -num_threads <Integer, >=1>
          Number of threads (CPUs) to use in the BLAST search
          Default = `1'

          option as suggested? And how did it go?

          Anyway, for blasting 300k sequences I would find a cluster somewhere and parallelize it on as many nodes as possible.
          As a self-educated user negotiating steep learning curves, I am not sure I understand how to use multithreading. Does this apply to cores (the current box has 16) or CPUs (it has 2)? What happens if I specify 16 when the box actually has only 2 CPUs?

          I just found that the T610 cannot accept any GPUs, due to either inadequate PCI slots or the lack of the needed power connectors. The two options are either to use the vCORE GPU servers that remain in production, which can be connected to the T610, or to get an entirely separate GPU server. The first option is about to be discontinued, and the second is more expensive. Any advice on reliable GPU server vendors? Unfortunately, many of them are Windows-based, which is not a good option for bioinformatics.

          Comment


          • #20
            Originally posted by yaximik
            As a self-educated user negotiating steep learning curves, I am not sure I understand how to use multithreading. Does this apply to cores (the current box has 16) or CPUs (it has 2)? What happens if I specify 16 when the box actually has only 2 CPUs?

            I just found that the T610 cannot accept any GPUs, due to either inadequate PCI slots or the lack of the needed power connectors. The two options are either to use the vCORE GPU servers that remain in production, which can be connected to the T610, or to get an entirely separate GPU server. The first option is about to be discontinued, and the second is more expensive. Any advice on reliable GPU server vendors? Unfortunately, many of them are Windows-based, which is not a good option for bioinformatics.
            Multi-threading applies to cores on your CPU. That said it should be noted that you can't expect to get a 16-fold increase in speed for your blast searches even if you were to use all 16 cores. Other factors (mainly I/O limitations, disk/bus-bandwidth) are going to come into play quickly, which will limit how fast you can compute on the T610.
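To make the cores-versus-CPUs point concrete, here is a minimal sketch that asks the operating system how many logical cores it sees and caps `-num_threads` at that number. Note that `os.cpu_count()` reports logical cores, which can exceed the physical core count when hyper-threading is enabled. The file and database names are placeholders, and `blastx_command` is a hypothetical helper, not part of BLAST+.

```python
import os


def blastx_command(query, db, out, threads=None):
    """Build a blastx argument list with -num_threads capped at the core count."""
    logical_cores = os.cpu_count() or 1  # cpu_count() may return None
    if threads is None:
        threads = logical_cores
    # Keep the request between 1 and the number of logical cores.
    threads = max(1, min(threads, logical_cores))
    return [
        "blastx",
        "-query", query,
        "-db", db,
        "-out", out,
        "-num_threads", str(threads),
    ]


# Placeholder file and database names, for illustration only.
print(" ".join(blastx_command("reads.fasta", "nr", "hits.tsv", threads=16)))
```

On a box with two 8-core sockets this would pass `-num_threads 16`; on a smaller machine the value is reduced to what is actually available.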

            If this is a one-time project, then splitting the workload across a cluster (as others have suggested previously) may be the most efficient route to follow.

            If this is a personally owned machine then you could always look at upgrading components (motherboard/power supply) and pursue the GPU compute route.

            Comment


            • #21
              If this is a personally owned machine then you could always look at upgrading components (motherboard/power supply) and pursue the GPU compute route.
              Thanks GenoMax, I did not think about that route.

              Comment


              • #22
                retraction

                Originally posted by yaximik
                Thanks for the input; it is overwhelming how much I need to learn. Before I got the MiSeq, I was using online BLAST with its pretty interface. Once NCBI kicked me out because of the sharply increased data volume, I had to make a local install. Now it is time to learn the nuts and bolts of the CLI. Thanks for the advice, maupb and mjp; it is time to digest the manual.



                Wow, that looks very promising! Could you enlighten me a bit more, as this is yet another set of nuts and bolts I have to learn on the fly? GPU means graphics processing unit, correct? Perhaps I could find and install one, but how do you plug it in to work with blast+, presumably in place of the motherboard CPU?
                I thought I had figured it all out too, but on closer inspection it appears that if I run the same set through with multiple CPUs, there is not a significant speed-up. In fact, I found that 32 CPUs were faster than the GPU. I am looking into other ways of speeding up blast as well, or perhaps using a different program altogether.

                Comment


                • #23
                  A while ago I contacted NCBI about this and here was their reply:

                  ####################
                  Not all phases of the algorithm are multi-threaded, which often means that even with "-num_threads" set >1, only 1 cpu is used. You might try formatting the large fasta file as a database and run the gene sequence against that, but that may not work either. It often takes a large input file against a large database to invoke multi-threading; it is not implemented for the traceback or formatting phases.
                  #####################
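The reply above amounts to Amdahl's law: if only part of a run is multi-threaded, the serial part limits the overall speed-up. A small sketch follows; the parallel fractions are purely illustrative assumptions, not measurements of BLAST.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speed-up when only parallel_fraction of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Illustrative fractions only; the real value depends on query and database size.
for fraction in (0.5, 0.9, 0.99):
    print(f"{fraction:.2f} parallel, 16 threads: {amdahl_speedup(fraction, 16):.2f}x")
```

Even with 90% of the work threaded, 16 threads give at most a 6.4-fold speed-up under this model, before any I/O limits are considered.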

                  There are apparently other versions of blast that speed it up (e.g. mpiBlast, but this is probably overkill), so, as others have mentioned, it is probably fastest to split your input into batches, give each batch 1 or 2 cores, and run the batches simultaneously.
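Splitting the input as suggested above can be sketched as follows. This assumes a plain FASTA file in which every record starts with a `>` header line; `split_fasta` is a hypothetical helper that deals records out round-robin so the batches end up roughly equal in size.

```python
def split_fasta(lines, n_chunks):
    """Distribute FASTA records round-robin into n_chunks lists of lines."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks = [[] for _ in range(n_chunks)]
    record_index = -1
    for line in lines:
        if line.startswith(">"):
            record_index += 1  # a header line starts a new record
        if record_index < 0:
            continue  # ignore anything before the first header
        chunks[record_index % n_chunks].append(line)
    return chunks


# Tiny made-up input for illustration.
records = [">a\n", "ACGT\n", ">b\n", "GG\n", "CC\n", ">c\n", "TT\n"]
for number, chunk in enumerate(split_fasta(records, 2)):
    print(number, "".join(chunk).replace("\n", " "))
```

In practice `lines` would be an open file handle and each chunk would be written to its own file before being passed to a separate blast process.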

                  Comment


                  • #24
                    Try mpiBLAST.

                    Comment


                    • #25
                      Try mpiBLAST.
                      My understanding is that the open-source version is pretty old (v.1.6 was released in 2010) and is based on legacy blast (2.2.20), not blast+. I guess the only option for now is to use its commercial spin-off, AbokiaBLAST. Does anyone have any experience with it? Is it worth its cost?

                      Comment


                      • #26
                        How about using usearch? They compare usearch with blastx (sensitivity vs. speed). I have used usearch on short nucleotide sequences of 50 bp, searching 3 million against 3 million. You may consider this instead of blastx.

                        Comment


                        • #27
                          How about using usearch ,
                          Well, how much is licensing for 64-bit?

                          Comment


                          • #28
                            I am using the 32-bit version. The 64-bit version costs nearly $900.
                            Last edited by balaji; 02-03-2013, 12:59 PM. Reason: modifying the cost part

                            Comment


                            • #29
                              Maybe you can try
                              GPU blastn

                              Comment


                              • #30
                                Originally posted by Kennels
                                There are apparently other versions of blast that speed it up (e.g. mpiBlast, but this is probably overkill), so, as others have mentioned, it is probably fastest to split your input into batches, give each batch 1 or 2 cores, and run the batches simultaneously.
                                This. For example, running four 4-thread blasts in parallel is much faster than running a single 16-thread blast (though it needs a lot more RAM, and IO that is not a bottleneck). I've been meaning to experiment on this with GNU Parallel, although I usually run my blasts on a cluster via SGE.
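The "four 4-thread blasts" layout can also be sketched without GNU Parallel, using a Python thread pool to launch the processes. This is an alternative to the tool named above, not a description of it. The chunk file names and the `nr` database are placeholders, and `build_jobs` and `run_jobs` are hypothetical helpers. The example only prints the commands; the actual run is left commented out because it needs BLAST+ and a formatted database.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_jobs(chunk_paths, db, threads_per_job=4, program="blastx"):
    """Return one BLAST+ argument list per input chunk."""
    return [
        [
            program,
            "-query", path,
            "-db", db,
            "-out", path + ".hits.tsv",
            "-outfmt", "6",
            "-num_threads", str(threads_per_job),
        ]
        for path in chunk_paths
    ]


def run_jobs(jobs, max_parallel=4, runner=subprocess.run):
    """Run at most max_parallel jobs at a time; results keep the job order."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(runner, jobs))


if __name__ == "__main__":
    # Placeholder chunk names: four chunks x 4 threads = 16 cores in use.
    jobs = build_jobs([f"chunk_{i}.fasta" for i in range(4)], db="nr")
    for job in jobs:
        print(" ".join(job))
    # run_jobs(jobs)  # uncomment on a machine with BLAST+ installed
```

Each blast process loads the database separately, which is where the extra RAM and IO demand mentioned above comes from.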
                                Last edited by rhinoceros; 08-29-2013, 01:36 PM.
                                savetherhino.org

                                Comment
