
  • #16
    large number of contigs

    Originally posted by JonB View Post
    I have problems generating a new genome.
    My genome has more than 300,000 contigs and is ~300 MB in size.

    Feb 18 11:25:11 ..... Started STAR run
    Feb 18 11:25:11 ... Starting to generate Genome files
    /var/spool/slurmd/job1306271/slurm_script: line 9: 12129 Killed
    As @NGSfan pointed out, this is indeed a RAM problem. STAR bins the genome sequence so that each chromosome (contig) starts at a new bin, which creates an overhead of Nchromosomes*BinSize, where BinSize=2^genomeChrBinNbits. By default, --genomeChrBinNbits is 18, so BinSize=2^18 (~256 kB), and with 300,000 contigs you would need ~75 GB of RAM. That is most likely what killed your job.

    I suggest that you try a much smaller value of --genomeChrBinNbits 12. This would require just a few GB of RAM and should allow you to generate the genome files. I have not tried STAR with more than 50,000 contigs, and I suspect there might be significant slowdown in the mapping speed when the number of contigs is too big.
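
The arithmetic above can be checked with a short sketch. The function name is only illustrative; the formula is the Nchromosomes*BinSize overhead described in this post, treated as an upper bound on the padding.

```python
def bin_overhead_bytes(n_contigs: int, chr_bin_nbits: int = 18) -> int:
    """Padding overhead in bytes: Nchromosomes * 2^genomeChrBinNbits."""
    return n_contigs * (1 << chr_bin_nbits)


n_contigs = 300_000  # contig count from the original question
for nbits in (18, 12):
    overhead = bin_overhead_bytes(n_contigs, nbits)
    print(f"--genomeChrBinNbits {nbits}: ~{overhead / 1e9:.1f} GB of padding")
```

With 18 bits this comes to roughly 79 GB, in line with the ~75 GB quoted above; with 12 bits it drops to about 1.2 GB.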

    Comment


    • #17
      mixed datasets

      Originally posted by bob-loblaw View Post
      This looks great, I think I'll try it out! Have you tested STAR on mixed RNA datasets, i.e. RNA-Seq on a sample containing both human and bacterial/viral RNA? That's what I'm working with at the moment and TopHat just isn't cutting it: there are millions of reads that are identified as being human and bacterial (depending on which aligning step I run first).
      We used STAR in two mixed-species settings: human + viruses (~4,500 viruses from NCBI) and human + mouse, and in both cases it worked well as far as we could tell. You have to keep an eye on RAM, since STAR would need ~10*TotalGenomeSize bytes of RAM (~50GB for human+mouse), and if you have a large number of small chromosomes/scaffolds/contigs, you would need to reduce --genomeChrBinNbits as I explained in the previous post.

      For mixed-species mapping I strongly agree with other users who recommended mapping to a combined genome rather than a 2-step mapping. When a mapper aligns a read to a combined genome, it considers the various possible alignments and picks the best one. This reduces the false positives and negatives for unique calls to each species. In the 2-step method, on the other hand, you force alignments to one of the species in the 1st step, so your results will be strongly biased towards the 1st species.
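
As a rough sketch of preparing such a combined genome, the snippet below concatenates several FASTA sources into one file and tallies the total bases, so the ~10*TotalGenomeSize RAM rule of thumb can be applied. The species prefix on each sequence name and the toy records are assumptions made for illustration, not something STAR requires.

```python
import io
from typing import Iterable, TextIO


def merge_fastas(sources: Iterable[tuple[str, TextIO]], out: TextIO) -> int:
    """Copy all FASTA records to `out`, prefixing each sequence name with a
    species label, and return the total number of bases written."""
    total_bases = 0
    for label, handle in sources:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines between records
            if line.startswith(">"):
                out.write(f">{label}_{line[1:]}\n")
            else:
                out.write(line + "\n")
                total_bases += len(line)
    return total_bases


# Toy records standing in for real genome files (hypothetical names).
human = io.StringIO(">chr1\nACGTACGT\n")
virus = io.StringIO(">virusA\nGGCC\n")
combined = io.StringIO()

total = merge_fastas([("human", human), ("virus", virus)], combined)
ram_estimate_bytes = 10 * total  # rule of thumb quoted in this post
print(combined.getvalue(), end="")
print(f"total bases: {total}, estimated RAM: {ram_estimate_bytes} bytes")
```

Prefixing the names keeps the species of each alignment recoverable from the reference name alone after mapping to the combined genome.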

      Comment


      • #18
        As @NGSfan pointed out, this is indeed a RAM problem. STAR bins the genome sequence so that each chromosome (contig) starts at a new bin, which creates an overhead of Nchromosomes*BinSize, where BinSize=2^genomeChrBinNbits. By default, --genomeChrBinNbits is 18, so BinSize=2^18 (~256 kB), and with 300,000 contigs you would need ~75 GB of RAM. That is most likely what killed your job.

        I suggest that you try a much smaller value of --genomeChrBinNbits 12. This would require just a few GB of RAM and should allow you to generate the genome files. I have not tried STAR with more than 50,000 contigs, and I suspect there might be significant slowdown in the mapping speed when the number of contigs is too big.
        I've also had success in working around this problem by creating dummy scaffolds. I cat all the contigs into a single big fasta entry with a couple hundred 'X's separating each contig to (hopefully) prevent spurious alignments from spanning contigs. This made a huge difference in the amount of memory required.

        Comment


        • #19
          large number of contigs

          Originally posted by cram View Post
          I've also had success in working around this problem by creating dummy scaffolds. I cat all the contigs into a single big fasta entry with a couple hundred 'X's separating each contig to (hopefully) prevent spurious alignments from spanning contigs. This made a huge difference in the amount of memory required.
          This is a great idea. It will solve both the RAM problem and the slowdown problem I was talking about. However, you would need to post-process your alignments to extract alignments' coordinates within the real contigs from the coordinates within "super-contigs". Also, while Xs or Ns separating the contigs will prevent STAR from "extending" the alignment into adjacent contigs, it will not prevent splicing between contigs. So at the post-processing step you would also need to filter out alignments that span more than one real contig.
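
A minimal sketch of the super-contig workaround and the post-processing it needs, assuming a spacer of 200 Ns and 0-based half-open coordinates (both are choices made here, not taken from the thread). The helper names are hypothetical.

```python
from bisect import bisect_right

SPACER = "N" * 200  # assumed spacer length; the thread suggests a couple hundred


def build_supercontig(contigs):
    """contigs: list of (name, sequence).
    Returns (super_sequence, starts, names, lengths), where starts[i] is the
    offset of contig i within the super-contig."""
    parts, starts, names, lengths = [], [], [], []
    pos = 0
    for i, (name, seq) in enumerate(contigs):
        if i > 0:
            parts.append(SPACER)
            pos += len(SPACER)
        starts.append(pos)
        names.append(name)
        lengths.append(len(seq))
        parts.append(seq)
        pos += len(seq)
    return "".join(parts), starts, names, lengths


def to_contig_coords(start, end, starts, names, lengths):
    """Map the interval [start, end) on the super-contig back to
    (contig_name, start, end). Returns None when the interval touches a
    spacer or spans more than one real contig, so it can be filtered out."""
    i = bisect_right(starts, start) - 1
    if i < 0:
        return None
    offset = starts[i]
    if end - offset > lengths[i]:
        return None
    return names[i], start - offset, end - offset


seq, starts, names, lengths = build_supercontig([("c1", "ACGT"), ("c2", "GGGCCC")])
print(to_contig_coords(205, 208, starts, names, lengths))  # inside c2
print(to_contig_coords(2, 206, starts, names, lengths))    # spans c1 and c2
```

For a spliced alignment, passing the leftmost and rightmost aligned positions as `start` and `end` is enough to detect splicing between contigs, since any such alignment crosses a spacer.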

          Comment


          • #20
            Trouble generating a small genome suffix array index...

            I'm having trouble generating a small reference genome for running STAR. Below are the command, the error output and the input "genome" file. Your thoughts on how I can rectify this are welcome. At present, I am using BFAST and it works reasonably well, but I am looking for something a little easier for automatically generating seeds for parsing sequencing reads into bins. Thank you for your help and time.

            Regards,

            -Tom Blomquist

            University of Toledo, Ohio



            Command:

            ./STAR --runMode genomeGenerate --genomeDir ./ --genomeFastaFiles ./barcode.fa --runThreadN 8



            Error Output:

            Feb 19 09:39:34 ..... Started STAR run
            Feb 19 09:39:34 ... Starting to generate Genome files
            Feb 19 09:39:34 ... starting to sort Suffix Array. This may take a long time...
            Feb 19 09:39:34 ... sorting Suffix Array chunks and saving them to disk...
            Feb 19 09:39:34 ... loading chunks from disk, packing SA...
            Feb 19 09:39:34 ... writing Suffix Array to disk ...
            Feb 19 09:39:34 ... Finished generating suffix array
            Feb 19 09:39:34 ... starting to generate Suffix Array index...

            BUG: next index is smaller than previous, EXITING



            ./barcode.fa is the following -->

            >ATGC
            TTTTCATGCGATCAGGCGTCTGTCGTGCTC

            >CAGT
            TTTTCCAGTGATCAGGCGTCTGTCGTGCTC

            >GACA
            TTTTCGACAGATCAGGCGTCTGTCGTGCTC

            >TGTG
            TTTTCTGTGGATCAGGCGTCTGTCGTGCTC

            >ACAC
            TTTTCACACGATCAGGCGTCTGTCGTGCTC

            >TACG
            TTTTCTACGGATCAGGCGTCTGTCGTGCTC

            >GCTA
            TTTTCGCTAGATCAGGCGTCTGTCGTGCTC

            >CATA
            TTTTCCATAGATCAGGCGTCTGTCGTGCTC

            >TAGA
            TTTTCTAGAGATCAGGCGTCTGTCGTGCTC

            >GACT
            TTTTCGACTGATCAGGCGTCTGTCGTGCTC

            >CTGA
            TTTTCCTGAGATCAGGCGTCTGTCGTGCTC

            >TGCT
            TTTTCTGCTGATCAGGCGTCTGTCGTGCTC

            Comment


            • #21
              small reference

              Originally posted by thomasblomquist View Post
              I'm having trouble generating a small reference genome for running STAR. Below are the command, the error output and the input "genome" file. Your thoughts on how I can rectify this are welcome. At present, I am using BFAST and it works reasonably well, but I am looking for something a little easier for automatically generating seeds for parsing sequencing reads into bins. Thank you for your help and time.

              Regards,

              -Tom Blomquist

              University of Toledo, Ohio
              Tom,
              please use the following parameters:
              --genomeChrBinNbits 6 --genomeSAindexNbases 4
              With these I was able to run the genome generation step successfully for your small reference.
              It looks like you want to use STAR in a quite non-standard way.
              If you explain it a bit more I could suggest the parameters for the mapping step.
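
For other small references, the STAR manual gives rules of thumb for scaling both parameters: --genomeSAindexNbases to min(14, log2(GenomeLength)/2 - 1) and --genomeChrBinNbits to min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))). The sketch below simply evaluates those formulas; the function names are illustrative, and the read length of 40 is an assumption.

```python
import math


def suggest_sa_index_nbases(genome_length: int) -> float:
    """min(14, log2(GenomeLength)/2 - 1), per the manual's rule of thumb."""
    return min(14.0, math.log2(genome_length) / 2 - 1)


def suggest_chr_bin_nbits(genome_length: int, n_refs: int, read_length: int) -> float:
    """min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength)))."""
    return min(18.0, math.log2(max(genome_length / n_refs, read_length)))


# barcode.fa from the question: 12 sequences of 30 bases each.
genome_length = 12 * 30
print(f"--genomeSAindexNbases ~{suggest_sa_index_nbases(genome_length):.2f}")
print(f"--genomeChrBinNbits   ~{suggest_chr_bin_nbits(genome_length, 12, 40):.2f}")
```

For this reference the formulas give about 3.2 and 5.3, in the same ballpark as the values 4 and 6 suggested above, so they are best treated as a starting point rather than exact settings.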

              Comment


              • #22
                Originally posted by alexdobin View Post
                Tom,
                please use the following parameters:
                --genomeChrBinNbits 6 --genomeSAindexNbases 4
                With these I was able to run the genome generation step successfully for your small reference.
                It looks like you want to use STAR in a quite non-standard way.
                If you explain it a bit more I could suggest the parameters for the mapping step.
                Thank you. I do tend to go about things in a non-standard way. I am performing targeted resequencing of an RNA-seq library, also known as amplicon sequencing. It is also semi-quantitative. My end goal is to count the number of times a given sequence (e.g. for a gene target) or genetic variant (alleles with anywhere from 1 to 6 substituted bases) occurs. I want to be able to assign each sequence read the "name" of the consensus sequence it best matches.

                So, my reference genome is usually anywhere from 10-1000 short sequences of ~10-40 bases in length.

                Thank you for your time. I look forward to seeing how your software works in my studies.

                Regards,

                -Tom Blomquist

                Comment


                • #23
                  Originally posted by thomasblomquist View Post
                  Thank you. I do tend to go about things in a non-standard way. I am performing targeted resequencing of an RNA-seq library, also known as amplicon sequencing. It is also semi-quantitative. My end goal is to count the number of times a given sequence (e.g. for a gene target) or genetic variant (alleles with anywhere from 1 to 6 substituted bases) occurs. I want to be able to assign each sequence read the "name" of the consensus sequence it best matches.

                  So, my reference genome is usually anywhere from 10-1000 short sequences of ~10-40 bases in length.

                  Thank you for your time. I look forward to seeing how your software works in my studies.

                  Regards,

                  -Tom Blomquist
                  If I understand it correctly, you need to match short sub-sequences of your long reads to a set of reference short sequences. For starters I would suggest the following parameters:
                  --outFilterMismatchNmax N (N=number of mismatches you wish to tolerate)
                  --outFilterScoreMinOverLread 0
                  --outFilterMatchNminOverLread 0
                  --outFilterMatchNmin L (L=shortest of the reference sequences)

                  Please let me know whether it works for you.
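
To make the placeholders N and L concrete, here is a small sketch that assembles the suggested options into a STAR command line. The genome directory, the reads file name and the values N=2 and L=20 are hypothetical; the command is only printed, not run.

```python
import shlex


def star_short_ref_args(genome_dir, reads, max_mismatches, shortest_ref_len):
    """Build a STAR argument list using the filter options suggested above."""
    return [
        "STAR",
        "--genomeDir", genome_dir,
        "--readFilesIn", reads,
        "--outFilterMismatchNmax", str(max_mismatches),   # N
        "--outFilterScoreMinOverLread", "0",
        "--outFilterMatchNminOverLread", "0",
        "--outFilterMatchNmin", str(shortest_ref_len),    # L
    ]


# Hypothetical paths and values, for illustration only.
args = star_short_ref_args("./", "reads.fastq", max_mismatches=2, shortest_ref_len=20)
print(shlex.join(args))
```

Setting the two OverLread filters to 0 removes the requirement that a minimum fraction of the read aligns, which matters here because the reads are longer than the reference sequences.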

                  Comment


                  • #24
                    Originally posted by alexdobin View Post
                    If I understand it correctly, you need to match short sub-sequences of your long reads to a set of reference short sequences. For starters I would suggest the following parameters:
                    --outFilterMismatchNmax N (N=number of mismatches you wish to tolerate)
                    --outFilterScoreMinOverLread 0
                    --outFilterMatchNminOverLread 0
                    --outFilterMatchNmin L (L=shortest of the reference sequences)

                    Please let me know whether it works for you.
                    In essence, yes. Since I generally know where the alleles are, the sequence reads are trimmed to include the variant locus (or loci) and ~15 bases of sequencing read on either side. So, I'm matching ~40-base sequences to a reference library of "chromosomes" that are ~20-30 bases long.

                    Thanks again for your help. I will try the modifications later tonight. -Tom

                    Comment


                    • #25
                      Had the chance to test your recommended parameters for my specific needs. This is not my normal mode of operation, but... HOLY POOP!!! I cut a 2 hour BFAST match and sorting into bins down to 10 seconds!!! And the specificity and sensitivity is off the hook! When I do some more formal testing I will provide back metrics.

                      You are a computer jedi master! This will make analysis of clinical sequencing data a breeze. Very intuitive command lines.

                      -Tom Blomquist

                      Comment


                      • #26
                        Originally posted by alexdobin View Post
                        We used STAR in two mixed-species settings: human + viruses (~4,500 viruses from NCBI) and human + mouse, and in both cases it worked well as far as we could tell. You have to keep an eye on RAM, since STAR would need ~10*TotalGenomeSize bytes of RAM (~50GB for human+mouse), and if you have a large number of small chromosomes/scaffolds/contigs, you would need to reduce --genomeChrBinNbits as I explained in the previous post.

                        For mixed-species mapping I strongly agree with other users who recommended mapping to a combined genome rather than a 2-step mapping. When a mapper aligns a read to a combined genome, it considers the various possible alignments and picks the best one. This reduces the false positives and negatives for unique calls to each species. In the 2-step method, on the other hand, you force alignments to one of the species in the 1st step, so your results will be strongly biased towards the 1st species.
                        Interesting, thanks. Unfortunately, for mixed-species mapping (at least when one species has alternative splicing and the other does not), the way TopHat works is essentially the same as running Bowtie for the bacterial reads first (though I think there might be a way around it). There are also problems with the max size of a Bowtie index :/ Would the size be a concern for STAR? Or could it run a database of any size? (Hopefully RAM shouldn't be too much of an issue for me.)

                        Also, just to clarify, you don't know how STAR would perform with bacterial reads?
                        Last edited by bob-loblaw; 02-22-2013, 06:54 AM.

                        Comment


                        • #27
                          For loci that vary by one base substitution, what would be the best parameters to discriminate the two alleles?

                          Also, I want to tailor the output for single best match, and if two are equivalent, no matches reported.

                          Regards,

                          -Tom

                          Comment


                          • #28
                            Hey, does anyone know whether STAR needs the reference genome indexed in its own format? I know that for tophat2 the reference genome needs a bowtie2 index (*.bt2).

                            Thanks,
                            Nino

                            Comment


                            • #29
                              Originally posted by bob-loblaw View Post
                              Interesting, thanks. Unfortunately, for mixed-species mapping (at least when one species has alternative splicing and the other does not), the way TopHat works is essentially the same as running Bowtie for the bacterial reads first (though I think there might be a way around it). There are also problems with the max size of a Bowtie index :/ Would the size be a concern for STAR? Or could it run a database of any size? (Hopefully RAM shouldn't be too much of an issue for me.)

                              Also, just to clarify, you don't know how STAR would perform with bacterial reads?
                              Sorry, I did not check this thread for a while and just found your questions.

                              STAR will work with a database of any size; however, the required RAM scales linearly with database size, at ~10*DBsize bytes.

                              STAR works fine with bacterial reads; we have used it on long and small RNA-seq data for some bacteria.

                              Comment


                              • #30
                                Originally posted by thomasblomquist View Post
                                For loci that vary by one base substitution, what would be the best parameters to discriminate the two alleles?

                                Also, I want to tailor the output for single best match, and if two are equivalent, no matches reported.

                                Regards,

                                -Tom
                                Sorry, I did not check this thread for a while and just found your question.

                                With default parameters, if two alignments differ by one mismatch, only the best one would be reported. This is controlled by --outFilterMultimapScoreRange (1 by default), which defines the range of alignment scores that are reported as multi-mappers.

                                The max number of alignments that are allowed for output is controlled by --outFilterMultimapNmax. It's 10 by default, so any read with 10 or fewer alignments with scores >= BestScore-1 will be reported.
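
The interplay of the two parameters can be sketched as a simplified model (this is not STAR's actual code, and the scores are made-up numbers): alignments scoring within `score_range` of the best are kept, and the read is dropped entirely if more than `multimap_nmax` of them remain.

```python
def reported_alignments(scores, score_range=1, multimap_nmax=10):
    """Return the alignment scores that would be reported for one read.

    score_range   models --outFilterMultimapScoreRange
    multimap_nmax models --outFilterMultimapNmax
    """
    if not scores:
        return []
    best = max(scores)
    candidates = [s for s in scores if s >= best - score_range]
    if len(candidates) > multimap_nmax:
        return []  # too many multi-mappers: nothing is reported
    return sorted(candidates, reverse=True)


# One extra mismatch lowers the score by 2 (match +1, mismatch -1),
# so the second alignment falls outside the default range of 1.
print(reported_alignments([38, 36]))
# Two equally good alleles with the maximum set to 1: read is dropped.
print(reported_alignments([38, 38], multimap_nmax=1))
```

Under this model, setting --outFilterMultimapNmax 1 would give the behaviour asked for in post #27: a single best match is reported, and reads with two equivalent matches are not.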

                                Comment
