  • saturate the junctions!!

    Hi,
Would it be wise to generate the genome with a GTF file and re-generate it later with the sample-specific SJ.out.tab?

    I have more than 100 samples for variant analysis. Another strategy that I think could be useful is to pool 20-30 samples together (mapping is extremely fast; it was over within 40 minutes), so that the discovery of novel junctions gets saturated, and then use this final SJ.out.tab to re-generate the genome and perform sample-wise alignment. This reported ~200,000 total junctions, which is quite close to the number of exons in the human genome (~180,000).

    Cheers,
    Amol



    • Originally posted by amolkolte View Post
      Hi Amol,

      you are talking about what we now call a "2-pass" approach, where junctions detected in the 1st pass are used as annotations in the 2nd pass. This is generally the most accurate way to map RNA-seq data. As you pointed out, there are two options:

      1) Using the 1st pass in a sample-specific way.
      This can be done with the --twopass1readsN <Nreads> option, where Nreads is the number of reads to map in the 1st pass. I recommend using a very big number (or -1), which will map all the reads in the 1st pass (in any case, all the reads are re-mapped in the 2nd pass). At the moment this approach cannot be used with annotations.

      2) You can collect all the junctions from the 1st mapping pass over multiple samples, re-generate the genome with the full collection of junctions, and re-map all reads against the new genome. This has to be done "manually" and is a bit more cumbersome, but in the end it will give you the best sensitivity. Some filtering of the 1st-pass novel junctions may be required, e.g. removing junctions on the mitochondrial genome. Some users have also reported a speed decrease in the 2nd pass with millions of junctions. You will also see an increase in multi-mappers, since reads will be able to map to multiple junctions, especially in cases of short splicing overhangs.
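      A minimal sketch of this manual workflow as shell functions (sample names, file paths, read length, and thread counts here are hypothetical; the chrM/MT filter is just one example of the junction filtering mentioned above):

      ```shell
      #!/bin/sh
      # Sketch of the manual multi-sample 2-pass workflow.
      # Hypothetical names: sample1..sample3 FASTQs, ./GenomeDir index, genome.fa.

      SAMPLES="sample1 sample2 sample3"

      first_pass() {        # map every sample against the original index
        for S in $SAMPLES; do
          STAR --genomeDir ./GenomeDir --readFilesIn "$S".fq \
               --outFileNamePrefix "$S"_p1_ --runThreadN 12
        done
      }

      collect_junctions() { # merge SJ.out.tab files, drop mitochondrial junctions
        # column 1 of SJ.out.tab is the chromosome name
        cat "$@" | awk '$1 != "chrM" && $1 != "MT"' | sort -u
      }

      regenerate_genome() { # rebuild the index with the collected junctions
        STAR --runMode genomeGenerate --genomeDir ./GenomeDir_2pass \
             --genomeFastaFiles genome.fa \
             --sjdbFileChrStartEnd SJ.all.tab --sjdbOverhang 99 --runThreadN 12
      }

      second_pass() {       # re-map every sample against the new index
        for S in $SAMPLES; do
          STAR --genomeDir ./GenomeDir_2pass --readFilesIn "$S".fq \
               --outFileNamePrefix "$S"_p2_ --runThreadN 12
        done
      }

      # Usage:
      #   first_pass
      #   collect_junctions ./*_p1_SJ.out.tab > SJ.all.tab
      #   regenerate_genome
      #   second_pass
      ```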

      Cheers
      Alex



      • I am trying to use the new (to me) --twopass1readsN option in STAR, but am finding it very slow (to the point that I am not sure whether STAR is crashing or not).

        Using some small test data (1M PE reads) and a very small --twopass1readsN parameter (1000), it takes about one minute to insert the 1st-pass junctions:

        STAR --genomeDir /iGenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/StarIndex --readFilesIn R1.fastq.gz R2.fastq.gz --runThreadN 12 --genomeLoad NoSharedMemory --outFilterMultimapNmax 1 --outSAMtype BAM SortedByCoordinate --twopass1readsN 1000 --sjdbOverhang 99
        Jan 27 12:56:01 ..... Started STAR run
        Jan 27 12:56:13 ..... Started 1st pass mapping
        Jan 27 12:56:15 ..... Finished 1st pass mapping
        Jan 27 12:57:25 ..... Finished inserting 1st pass junctions into genome
        Jan 27 12:57:25 ..... Started mapping
        Jan 27 12:57:41 ..... Started sorting BAM

        Running on a real dataset (~40M PE reads, with --twopass1readsN set to -1) took >18h before I killed the job, assuming it had crashed. It was using 100% CPU that entire time, however, so it may well have simply been doing the right thing, just very slowly.

        Is this timing normal for this option? Manually doing the 1st pass, genome regeneration, and the 2nd pass would be much quicker, I think. At this speed I won't be able to use the automatic 2-pass mode in my normal pipeline.



        • Originally posted by ruggedtextile View Post
          Hi,

          18 hours is definitely too long; the 2nd pass should normally take just a bit longer than the 1st pass. Could you please send me the Log.out file of the failed run?

          Cheers
          Alex



          • I have narrowed the problem down to using a genome index without pre-existing junction annotations. If I use an otherwise identical index with junctions, it runs fine (it takes a couple of minutes at most to insert the 1st-pass junctions).

            I.e:

            Code:
            [15:49 guttea@Maul MODY] > STAR --genomeDir /mnt/nas_omics/shared_data/iGenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/StarAnnotatedIndex --readFilesIn data/fastq/D0-1_R1.fastq.gz data/fastq/D0-1_R2.fastq.gz --runThreadN 12 --genomeLoad NoSharedMemory --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --twopass1readsN -1 --sjdbOverhang 149
            Jan 30 15:58:43 ..... Started STAR run
            Jan 30 15:59:32 ..... Started 1st pass mapping
            Jan 30 16:00:35 ..... Finished 1st pass mapping
            Jan 30 16:02:23 ..... Finished inserting 1st pass junctions into genome
            Jan 30 16:02:25 ..... Started mapping
            Jan 30 16:03:43 ..... Started sorting BAM
            Jan 30 16:04:04 ..... Finished successfully
            [16:04 guttea@Maul MODY] > STAR --genomeDir /mnt/nas_omics/shared_data/iGenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/StarIndex --readFilesIn data/fastq/D0-1_R1.fastq.gz data/fastq/D0-1_R2.fastq.gz --runThreadN 12 --genomeLoad NoSharedMemory --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --twopass1readsN -1 --sjdbOverhang 149 --outFileNamePrefix NoExisitingAnnotation
            Jan 30 16:06:00 ..... Started STAR run
            Jan 30 16:06:15 ..... Started 1st pass mapping
            Jan 30 16:07:19 ..... Finished 1st pass mapping
            This was copied at 16:15, i.e. it had already taken ~8 minutes, and last time it hung here for 18 hours using 100% CPU. I've attached the two Log.out files. You can see in the NoExistingAnnotation Log.out that we never get confirmation of:

            Code:
            Jan 30 16:01:18   Finished sorting SA indicesL nInd=50917350
            Hope that helps.
            Attached Files



            • memory problems while running STAR

              Hi, I'm trying to run genome generation (--runMode genomeGenerate), but every time I run it I get this error:

              terminate called after throwing an instance of 'std::bad_alloc'
              Aborted (core dumped)

              I tried these options as well:
              --runThreadN 12 --limitGenomeGenerateRAM 40

              but I still get the same error.

              I'm working with a genome of 6 Gb. Could that be the problem?

              I really appreciate any help,

              V



              • Hi,

                for a 6 Gb genome, you would need ~60 GB of RAM, and you would need to specify --limitGenomeGenerateRAM 60000000000 (i.e. the number of bytes).
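                For other genome sizes, the same number can be computed with a one-liner (~10 bytes of RAM per genome base is the rule of thumb; the 6 Gb figure below is just this thread's example):

                ```shell
                # ~10 bytes of RAM per genome base (rule of thumb).
                GENOME_BASES=6000000000                 # assumed 6 Gb genome
                RAM_BYTES=$((GENOME_BASES * 10))
                echo "--limitGenomeGenerateRAM $RAM_BYTES"
                # prints: --limitGenomeGenerateRAM 60000000000
                ```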

                Cheers
                Alex

                Originally posted by vromanr_2015 View Post



                • Hey,

                  I just switched from BWA to STAR for my RNA-seq analysis yesterday. I got stuck with this:

                  Mar 27 08:14:36 ..... Started STAR run
                  Mar 27 08:24:33 ..... Started mapping
                  Killed: 9

                  Then STAR stopped.
                  I have 20 samples of 50-bp SE reads (~40M reads/sample), and the analysis was done on my poor Mac mini (Core i7 / 16 GB RAM).
                  I used the Ensembl mouse genome (Mus_musculus.GRCm38.dna.primary_assembly.fa.gz and
                  Mus_musculus.GRCm38.79.gtf.gz) as the reference.

                  The command I used to run the mapping:
                  > STAR --genomeDir ./GenomeDir --readFilesIn ./BGI_RNAseq_data_2015/01.fq,./BGI_RNAseq_data_2015/02.fq,./BGI_RNAseq_data_2015/03.fq,./BGI_RNAseq_data_2015/04.fq,./BGI_RNAseq_data_2015/05.fq,./BGI_RNAseq_data_2015/06.fq,./BGI_RNAseq_data_2015/07.fq,./BGI_RNAseq_data_2015/08.fq,./BGI_RNAseq_data_2015/09.fq,./BGI_RNAseq_data_2015/10.fq,./BGI_RNAseq_data_2015/11.fq,./BGI_RNAseq_data_2015/12.fq,./BGI_RNAseq_data_2015/13.fq,./BGI_RNAseq_data_2015/14.fq,./BGI_RNAseq_data_2015/15.fq,./BGI_RNAseq_data_2015/16.fq,./BGI_RNAseq_data_2015/17.fq,./BGI_RNAseq_data_2015/18.fq,./BGI_RNAseq_data_2015/19.fq,./BGI_RNAseq_data_2015/20.fq --runThreadN 8

                  I knew that STAR demands much more computing power (and memory) than BWA, but I am not sure whether my command is correct. I had no issues with BWA on the same Mac mini, though.
                  The log files are attached.
                  Attached Files



                  • Originally posted by neokao View Post
                    Hi,
                    to fit the mouse genome into 16 GB of RAM, you need to use the following options at the genome generation step:
                    --genomeSAsparseD 2 --genomeSAindexNbases 13
                    Hopefully this will solve the problem.
                    Cheers
                    Alex



                    • Hey, Alex:

                      Just tried re-indexing the genome.
                      Code:
                      STAR --runMode genomeGenerate --genomeDir ./GenomeDir --genomeFastaFiles ./Mus_musculus.GRCm38.dna.primary_assembly.fa --runThreadN 8 --sjdbGTFfile ./Mus_musculus.GRCm38.79.gtf --sjdbOverhang 48 --genomeSAsparseD 2 --genomeSAindexNbases 13

                      It completed, and then I ran the same command to map the files.

                      Still got the same error:
                      Mar 27 13:16:44 ..... Started STAR run
                      Mar 27 13:18:05 ..... Started mapping
                      Killed: 9

                      Any more suggestions?
                      Thanks.



                      • Somehow I could not send a private message successfully; the new Log.out file is at this link: https://app.box.com/s/0nqn3jknqq1ta66jl1evme8c8ldunq9k

                        I guess the issue started when it was creating the 8th thread. So I changed to --runThreadN 1, and it has seemed OK so far for 5 hours already (still mapping). Is it normal to take more than 5 hours for 20 files like those described above? Thanks for your suggestions.



                        • Originally posted by neokao View Post
                          You can check Log.progress.out to see how many reads have been mapped. Since you are using one thread, I would expect a speed of ~100M reads per hour, so it could take ~8 hours for ~800M reads.
                          You can try --runThreadN 7 or 6, or you can reduce the per-thread IO buffer size with --limitIObufferSize 100000000.

                          Cheers
                          Alex



                          • So it went through successfully, with a command like this:
                            > STAR --genomeDir ./GenomeDir/ --readFilesIn ./BGI_RNAseq_data_2015/01.fq,./BGI_RNAseq_data_2015/02.fq,./BGI_RNAseq_data_2015/03.fq,./BGI_RNAseq_data_2015/04.fq,./BGI_RNAseq_data_2015/05.fq,./BGI_RNAseq_data_2015/06.fq,./BGI_RNAseq_data_2015/07.fq,./BGI_RNAseq_data_2015/08.fq,./BGI_RNAseq_data_2015/09.fq,./BGI_RNAseq_data_2015/10.fq,./BGI_RNAseq_data_2015/11.fq,./BGI_RNAseq_data_2015/12.fq,./BGI_RNAseq_data_2015/13.fq,./BGI_RNAseq_data_2015/14.fq,./BGI_RNAseq_data_2015/15.fq,./BGI_RNAseq_data_2015/16.fq,./BGI_RNAseq_data_2015/17.fq,./BGI_RNAseq_data_2015/18.fq,./BGI_RNAseq_data_2015/19.fq,./BGI_RNAseq_data_2015/20.fq --runThreadN 1
                            it took ~20 hours at a speed of ~37M reads per hour (the Mac mini was in use for other stuff at the time).

                            Somehow --runThreadN 2, --runThreadN 4, and --runThreadN 6 all gave the same error.

                            Anyway, the output is one ~200 GB Aligned.out.sam file. I thought I would get 20 individual SAM files for featureCounts. Is there any way to split it? Or do I need to redo the STAR mapping with each input .fq file individually? Thanks.
                            Last edited by neokao; 03-28-2015, 03:46 PM.



                            • Originally posted by neokao View Post
                              If the read names in each of the .fastq files have distinct prefixes, you could split the resulting SAM into separate files.
                              Otherwise, you would have to map each of the files separately in a loop. Something like --readFilesIn XX.fq --outFileNamePrefix XX will let you store separate output files in one directory.
                              It's strange that multi-threading does not work. Is it possible that the other stuff takes up a significant chunk of RAM?
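                              If the prefixes are there, the split could look like this sketch (the prefix-before-the-first-dot convention is an assumption; check your actual read names first):

                              ```shell
                              # Split a merged SAM into per-sample files, keyed on the read-name
                              # prefix before the first "." (hypothetical naming convention).
                              # Header lines (@...) are copied into every output file.
                              split_sam() {
                                  awk '/^@/ { hdr = hdr $0 ORS; next }
                                       {
                                         split($1, a, ".")
                                         out = a[1] ".sam"
                                         if (!(out in seen)) { seen[out] = 1; printf "%s", hdr > out }
                                         print > out
                                       }' "$1"
                              }
                              ```

                              Usage: split_sam Aligned.out.sam. Alternatively, the per-file loop would be something like for FQ in *.fq; do STAR --genomeDir ./GenomeDir --readFilesIn "$FQ" --outFileNamePrefix "${FQ%.fq}". --runThreadN 1; done, which gives one set of output files per sample.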

                              Cheers
                              Alex



                              • Thanks, Alex.

                                I don't know what's causing the multi-threading problem on my Mac mini.
                                I got the same error even after I rebooted the computer and started the mapping fresh.

                                Anyway, I went ahead and started mapping the files one by one with --runThreadN 1.
                                It is weird that even with --runThreadN 1, in a command like this:

                                > STAR --genomeDir ./GenomeDir/ --readFilesIn ./BGI_RNAseq_data_2015/01.fq --runThreadN 1

                                it sometimes worked and sometimes did not.

                                For the same .fq file, it could give the Killed: 9 error, and when I reran the exact same command, it went through successfully. Very strange.

                                My .fq files do have distinct prefixes, and they are numbered with two digits as described before: 01.fq, 02.fq, etc. Could you shed more light on --readFilesIn XX.fq --outFileNamePrefix XX? Thanks.

