  • Tophat2 long_spanning_reads error -- cannot open xx.bam for reading

    This is what I am running:
    TopHat run (v2.0.9), Bowtie version: 2.1.0.0, Samtools version: 0.1.19.0

    This is what I entered:
    Code:
    tophat2 -p 24 -G /Users/mascano/Sequence_Analyses/Reference/Homo_sapiens_NCBI_build37.2/Homo_sapiens/NCBI/build37.2/Annotation/Genes/genes.gtf -o PMA_0hr_TotalRNA /Users/mascano/Sequence_Analyses/Reference/Homo_sapiens_NCBI_build37.2/Homo_sapiens/NCBI/build37.2/Sequence/Bowtie2index/genome /Users/mascano/Sequence_Analyses/DATA/THP1_timecourse/Act_1_ATCACG_L002_R1_001.fastq
    And finally, here is my tophat.log
    Code:
    [2013-08-20 12:28:14] Beginning TopHat run (v2.0.9)
    -----------------------------------------------
    [2013-08-20 12:28:14] Checking for Bowtie
    		  Bowtie version:	 2.1.0.0
    [2013-08-20 12:28:14] Checking for Samtools
    		Samtools version:	 0.1.19.0
    [2013-08-20 12:28:14] Checking for Bowtie index files (genome)..
    [2013-08-20 12:28:14] Checking for reference FASTA file
    [2013-08-20 12:28:14] Generating SAM header for /Users/mascano/Sequence_Analyses/Reference/Homo_sapiens_NCBI_build37.2/Homo_sapiens/NCBI/build37.2/Sequence/Bowtie2index/genome
    	format:		 fastq
    	quality scale:	 phred33 (default)
    [2013-08-20 12:28:47] Reading known junctions from GTF file
    [2013-08-20 12:28:51] Preparing reads
    	 left reads: min. length=101, max. length=101, 26861915 kept reads (78856 discarded)
    [2013-08-20 12:38:22] Building transcriptome data files..
    [2013-08-20 12:39:34] Building Bowtie index from genes.fa
    [2013-08-20 12:52:18] Mapping left_kept_reads to transcriptome genes with Bowtie2 
    [2013-08-20 13:03:00] Resuming TopHat pipeline with unmapped reads
    [2013-08-20 13:03:00] Mapping left_kept_reads.m2g_um to genome genome with Bowtie2 
    [2013-08-20 13:36:56] Mapping left_kept_reads.m2g_um_seg1 to genome genome with Bowtie2 (1/4)
    [2013-08-20 13:43:24] Mapping left_kept_reads.m2g_um_seg2 to genome genome with Bowtie2 (2/4)
    [2013-08-20 13:51:47] Mapping left_kept_reads.m2g_um_seg3 to genome genome with Bowtie2 (3/4)
    [2013-08-20 13:59:15] Mapping left_kept_reads.m2g_um_seg4 to genome genome with Bowtie2 (4/4)
    [2013-08-20 14:10:43] Searching for junctions via segment mapping
    [2013-08-20 14:15:54] Retrieving sequences for splices
    [2013-08-20 14:18:21] Indexing splices
    [2013-08-20 14:19:02] Mapping left_kept_reads.m2g_um_seg1 to genome segment_juncs with Bowtie2 (1/4)
    [2013-08-20 14:20:37] Mapping left_kept_reads.m2g_um_seg2 to genome segment_juncs with Bowtie2 (2/4)
    [2013-08-20 14:22:42] Mapping left_kept_reads.m2g_um_seg3 to genome segment_juncs with Bowtie2 (3/4)
    [2013-08-20 14:24:24] Mapping left_kept_reads.m2g_um_seg4 to genome segment_juncs with Bowtie2 (4/4)
    [2013-08-20 14:26:36] Joining segment hits
    	[FAILED]
    Error running 'long_spanning_reads':Error: cannot open PMA_0hr_TotalRNA/tmp/left_kept_reads.m2g_um.bam for reading
    The output directory is created, as are the subdirectories. The tmp directory contains plenty of files, including "left_kept_reads.m2g_um.bam".
    That file is ~1 GB (and its permissions are me: read and write, staff: read only, everyone: read only).

    Help is appreciated

  • #2
    A bit of an update

    Running 16 threads, instead of 24, allowed tophat to complete the run:
    Code:
    tophat2 -p 16 -G /Users/mascano/Sequence_Analyses/Reference/Homo_sapiens_NCBI_build37.2/Homo_sapiens/NCBI/build37.2/Annotation/Genes/genes.gtf -o PMA_0hr_TotalRNA /Users/mascano/Sequence_Analyses/Reference/Homo_sapiens_NCBI_build37.2/Homo_sapiens/NCBI/build37.2/Sequence/Bowtie2index/genome /Users/mascano/Sequence_Analyses/DATA/THP1_timecourse/Act_1_ATCACG_L002_R1_001.fastq

    I have a 2 x 2.4 GHz 6-core Xeon, so that's 12 physical cores plus 12 virtual with hyperthreading, which theoretically means I can assign '-p 24'.

    My guess is that memory usage is the problem, but that's not entirely clear. I have 64 GB RAM (which is the maximum allowed until OS X Mavericks comes out).

    Any advice on how to assign 24 threads without TopHat2 failing? Would calling the '-mm' argument work?
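For anyone wanting to check the core counts discussed above rather than work them out from the spec sheet, here is a minimal sketch. It assumes OS X's `sysctl` keys `hw.logicalcpu` and `hw.physicalcpu`, with a `getconf` fallback on other systems. The `suggest_threads` helper and its default reserve of 2 are hypothetical and only illustrate leaving some headroom below the logical CPU count; they are not a TopHat recommendation.

```shell
#!/usr/bin/env bash
# Print the number of logical CPUs (hardware threads) visible to the OS.
# On OS X this reads sysctl hw.logicalcpu; elsewhere it falls back to getconf.
logical_cpus() {
    if [ "$(uname -s)" = "Darwin" ]; then
        sysctl -n hw.logicalcpu
    else
        getconf _NPROCESSORS_ONLN
    fi
}

# Print the number of physical cores where this sketch knows how to ask
# (OS X only); on other systems it simply reports the logical count.
physical_cpus() {
    if [ "$(uname -s)" = "Darwin" ]; then
        sysctl -n hw.physicalcpu
    else
        logical_cpus
    fi
}

# Hypothetical helper: suggest a -p value that leaves a few logical CPUs free.
# The default reserve of 2 is an arbitrary assumption, not a measured value.
suggest_threads() {
    local reserve=${1:-2}
    local n
    n=$(logical_cpus)
    if [ "$n" -gt "$reserve" ]; then
        echo $(( n - reserve ))
    else
        echo 1
    fi
}

echo "logical=$(logical_cpus) physical=$(physical_cpus) suggested -p $(suggest_threads)"
```

On a machine like the one described (12 physical cores with hyperthreading), this would be expected to report 24 logical CPUs, but the right reserve is something to find by experiment.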



    • #3
      Why is using all 24 threads so important?

      Have you considered the possibility that the storage subsystem on this machine is the bottleneck? Look in Activity Monitor to see whether you are maxing out the throughput.

      So rather than having 24 cores in some sort of iowait round-robin state, it may be better to start with a smaller number of cores and experiment to find the optimal performance balance.



      • #4
        Thank you for the reply and suggestion. I had not considered that HDD I/O may be the bottleneck; I imagine an SSD may improve it. However, looking at the disk activity in a successful run (using 16 threads), I haven't seen it peak anywhere near 6 Gb/sec (or 768 MB/sec), which should be the bandwidth of my HDD (using the ICH10 bridge).

        I doubt it will skyrocket to that throughput ceiling with all 24 threads, no?



        • #5
          There is theoretical throughput and there is real-life performance. Since the TopHat suite is developed on Mac, it should be optimized for OS X.

          If you are interested you could look at specific application level stats by following the suggestions in this post: http://blog.yerkanian.com/2011/10/17...io-on-macos-x/

          To see CPU-level performance for various processes, run the following in a terminal window (adjust parameters as needed by looking at the man entry for top).

          Code:
          $ top -n10 -u
          Last edited by GenoMax; 08-21-2013, 10:03 AM.



          • #6
            Under -p 16 conditions:

            Memory usage peaked at 4 GB for the long_spanning_reads process, but I/O for the HDD did not exceed 20 MB/sec (read or write) total. I used
            Code:
            sudo iotop -C 5
            as well as viewing memory usage and disk activity via Activity Monitor.

            I was monitoring during these log events, which is where the run would fail with -p 24:
            Code:
            [2013-08-21 14:59:08] Mapping left_kept_reads.m2g_um_seg4 to genome segment_juncs with Bowtie2 (4/4)
            [2013-08-21 15:01:19] Joining segment hits
            [2013-08-21 15:05:52] Reporting output tracks
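To put the observed 20 MB/sec next to the 6 Gb/sec ceiling quoted earlier in the thread, here is a small arithmetic sketch. The helper names `gbit_to_mbyte` and `percent_of` are made up for illustration. Note that the interface rate is only an upper bound: a spinning disk's sustained rate is typically well below it, so a low percentage here does not by itself rule out the disk as a bottleneck.

```shell
#!/usr/bin/env bash
# Convert a link rate in gigabits/sec to megabytes/sec.
# Uses 1 Gb = 1024 Mb to match the 768 MB/sec figure quoted in the thread;
# with decimal units (1 Gb = 1000 Mb) the result would be 750 MB/sec.
gbit_to_mbyte() {
    awk -v g="$1" 'BEGIN { printf "%d\n", g * 1024 / 8 }'
}

# Percentage of a ceiling that an observed rate represents.
percent_of() {
    awk -v obs="$1" -v max="$2" 'BEGIN { printf "%.1f\n", 100 * obs / max }'
}

ceiling=$(gbit_to_mbyte 6)
echo "interface ceiling: ${ceiling} MB/sec"
echo "observed 20 MB/sec is $(percent_of 20 "$ceiling")% of that"
```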



            • #7
              So we know that 16 threads work but 24 do not. OS X may need some cores to keep essential parts of the OS running. The next thing to try would be to step up from 16 towards 24 and see at what point the process fails.
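The stepping-up experiment suggested here could be scripted roughly as follows. This is only a sketch: `first_failing_threads` and `run_tophat` are hypothetical helper names, and `GTF`, `INDEX` and `FASTQ` are placeholder variables you would set to your own paths. Each step is a full TopHat run, so with runs of around two hours the whole sweep would take a long time.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command at increasing thread counts and print the
# first count at which it exits non-zero (prints nothing if every run succeeds).
# Usage: first_failing_threads START END COMMAND [ARGS...]
# The thread count is passed as the last argument to COMMAND.
first_failing_threads() {
    local start=$1 end=$2
    shift 2
    local p
    for (( p = start; p <= end; p++ )); do
        if ! "$@" "$p" >/dev/null 2>&1; then
            echo "$p"
            return 0
        fi
    done
}

# Hypothetical wrapper around the tophat2 call from this thread. GTF, INDEX
# and FASTQ are placeholders; each thread count gets its own output directory.
run_tophat() {
    tophat2 -p "$1" -G "$GTF" -o "tophat_p$1" "$INDEX" "$FASTQ"
}

# Example (not executed here):
#   first_failing_threads 16 24 run_tophat
```

Writing each run to its own output directory keeps the tophat.log of the failing run around for inspection.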



              • #8
                So I can move to 22 cores, either as a single run or as parallel runs that total up to 22. That said, I'm not being shy about using the computer simultaneously for other applications (MS Office, Chrome, Mail, etc.), so, at least for my configuration, 22 threads is more than satisfactory.



                • #9
                  Since you did the experiment...

                  How much time (if any) is saved by going from 16 to 22 cores for the same job? A rough estimate is fine if you did not time the runs.



                  • #10
                    Not using the exact same fastq, but one of similar size (~30 million reads):
                    I went from ~2 hrs at 16 cores, to 1.5 hrs at 22 cores, to 2.5 hrs at 11 cores. However, when I run two separate shell processes at 11 cores each, one of them consistently takes around 2.5 hrs and the other around 3.5 hrs. I think I'll just have to experiment to find the best balance between the number of parallel processes and the number of cores per process.
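As a rough way to compare the timings reported above, the sketch below works out core-hours per run and the wall-clock time for two samples. The helper names `core_hours` and `sequential_hours` are made up for illustration, and the inputs are the approximate figures from this post, measured on similar but not identical fastq files, so treat the output as indicative only.

```shell
#!/usr/bin/env bash
# Core-hours used by one run: cores * wall-clock hours.
core_hours() {
    awk -v c="$1" -v h="$2" 'BEGIN { printf "%.1f\n", c * h }'
}

# Wall-clock hours to process n samples one after another.
sequential_hours() {
    awk -v h="$1" -v n="$2" 'BEGIN { printf "%.1f\n", h * n }'
}

# Approximate timings reported in this thread (hours per sample).
echo "16 cores: $(core_hours 16 2) core-hours"
echo "22 cores: $(core_hours 22 1.5) core-hours"
echo "11 cores: $(core_hours 11 2.5) core-hours"

# Two samples: back to back at 22 cores vs. side by side at 11 cores each
# (the side-by-side pair finishes when the slower run, ~3.5 hrs, finishes).
echo "2 samples, sequential at 22 cores: $(sequential_hours 1.5 2) hrs"
echo "2 samples, parallel at 11 cores each: 3.5 hrs"
```

On these numbers, two back-to-back 22-core runs would finish in about 3 hrs versus about 3.5 hrs for the side-by-side pair, though a couple of runs is too small a sample to settle the question.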
