  • Bowtie "Error: could not open" fa file

    Hi, all. I am new to informatics.

    I have successfully indexed the human reference sequences (chr1.fa, chr2.fa, etc.) separately using Bowtie 0.11.3. I then combined the 24 fa files into one large file, chr.fa (2.9G), and tried to index it, but I got this error: "Error: could not open + [my fa file path]"

    The path is definitely correct, and I have changed the mode of chr.fa to 777. I even combined chrX.fa and chrY.fa into test.fa, and that file could be indexed! What is wrong?

    The reference sequence file is 2.9G. I am using Fedora 11 with 4G of memory. Is it a memory problem? Or are there any pre-built indexes provided by Bowtie?

    By the way, I used bowtie 0.9 to index the human reference genome (24 chromosomes separately). The original size is 2.9G, the indexed size is 3.0G, and the memory footprint is 6.79G! Why is that? The number is far from the value claimed on the official site. Did I do anything wrong? I used bowtie's default settings.

    Thanks a lot.
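
A minimal sketch of the concatenation step described in this post, in case it helps to rule out a malformed combined file. The function name `concatenate_fasta` and the file names are hypothetical; the only thing it does beyond a plain byte copy is add a newline when an input file does not end with one, so that the next ">" header starts on its own line.

```python
def concatenate_fasta(inputs, output, chunk_size=1 << 20):
    """Copy each input FASTA file into `output`, in order.

    Returns the number of bytes written. A newline is appended after any
    input that does not already end with one, so headers never get glued
    onto the last sequence line of the previous file.
    """
    total = 0
    with open(output, "wb") as out:
        for path in inputs:
            last = b"\n"  # an empty input needs no extra newline
            with open(path, "rb") as src:
                while True:
                    chunk = src.read(chunk_size)
                    if not chunk:
                        break
                    out.write(chunk)
                    total += len(chunk)
                    last = chunk[-1:]
            if last != b"\n":
                out.write(b"\n")
                total += 1
    return total


# Hypothetical usage: concatenate_fasta(["chrX.fa", "chrY.fa"], "test.fa")
```

Copying in chunks keeps memory use flat regardless of how large the chromosome files are.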

  • #2
    My hunch, if you are running 32-bit Linux, is that you are hitting the 2GB file size limit. You could test whether this is the case by trying a smaller test file to see if Bowtie opens it, e.g. head -10000 mybigfile > mysmallfile. See this page for more detail (http://linuxmafia.com/faq/VALinux-kb...ize-limit.html). You might want to consider recompiling Bowtie with large file support; see http://www.suse.de/~aj/linux_lfs.html.
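
To make the 2GB hunch concrete: a program built without large file support typically uses a signed 32-bit file offset, which tops out at 2^31 - 1 bytes. The sketch below just compares a file size against that threshold. The names `exceeds_32bit_offset` and `check_file` are hypothetical helpers, not part of Bowtie, and a size above the limit only shows the file could trigger the problem, not that it is the cause.

```python
import os

# Largest offset a signed 32-bit integer can hold: 2,147,483,647 bytes.
MAX_SIGNED_32BIT_OFFSET = 2**31 - 1


def exceeds_32bit_offset(size_bytes):
    """True if a file of this size cannot be addressed with a signed 32-bit offset."""
    return size_bytes > MAX_SIGNED_32BIT_OFFSET


def check_file(path):
    """Return (size in bytes, whether it exceeds the 32-bit offset limit)."""
    size = os.path.getsize(path)
    return size, exceeds_32bit_offset(size)


# Hypothetical usage: size, too_big = check_file("chr.fa")
```

A 2.9G file is over the threshold, while a combined chrX.fa plus chrY.fa would be well under it, which is at least consistent with test.fa indexing fine.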



    • #3
      Thanks very much for your reply.

      But I am using 64-bit Fedora 11. Does it have the same limit?



      • #4
        64-bit Linux shouldn't have the 2GB file limit, so I am not sure what causes your problem, but pre-built indexes of hg18 and hg19 (with headers chr1, chr2 etc) can be readily downloaded from the Bowtie main page, which seems to be what you want anyway. BTW, when indexing the human reference genome on my 64-bit Linux machine with bowtie 0.11.3, the memory footprint is ~5GB. Hope this helps somewhat,

        -- Leo
        Last edited by HTS; 11-21-2009, 08:29 AM.



        • #5
          Thanks, dear Leo.

          This is the first time I have tracked a memory footprint; could you tell me whether my approach is sound? When I run "./bowtie ...", I get its process ID from "top", say 1000. Then I use "cat /proc/1000/status" to print the memory figures. I reported 6.7G, which is actually the sum of all the peak values. If we consider only "VmHWM", it is maybe ~4G. I am still not very clear about what "VmHWM" is...

          And one more question, which I hope is not too stupid. What is "hg18"? Is it "human genome 18"? And what is the difference between "hg18" and "hg19"? There are also NCBI 36.3, mm9... What do they stand for?
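
On the /proc approach: the kernel names the field `VmHWM` ("high water mark"), and it is the peak resident set size, i.e. the most physical RAM the process has held at once. `VmPeak` is the peak virtual memory size. They describe different things, so adding the Vm* values together double-counts. Here is a rough sketch of pulling those fields out of the status text; the helper names are hypothetical and the sample numbers are made up for illustration, not measured from bowtie.

```python
def parse_vm_fields(status_text):
    """Return {field name: size in kB} for the Vm* lines of /proc/<pid>/status."""
    fields = {}
    for line in status_text.splitlines():
        if not line.startswith("Vm"):
            continue
        name, _, rest = line.partition(":")
        parts = rest.split()
        # Expected shape: "VmHWM:   4200000 kB"
        if len(parts) >= 2 and parts[0].isdigit() and parts[1] == "kB":
            fields[name] = int(parts[0])
    return fields


def read_vm_fields(pid):
    """Read and parse the status file of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vm_fields(handle.read())


# Illustrative text only; these values are invented.
sample = (
    "Name:\tbowtie-build\n"
    "VmPeak:\t 7120000 kB\n"
    "VmSize:\t 7000000 kB\n"
    "VmHWM:\t 4200000 kB\n"
    "VmRSS:\t 4100000 kB\n"
)
fields = parse_vm_fields(sample)
print("peak resident (VmHWM), kB:", fields["VmHWM"])
print("peak virtual (VmPeak), kB:", fields["VmPeak"])
```

For a "how much RAM does this need" question, `VmHWM` is usually the figure to quote, read as late in the run as possible since it only ever grows.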

          Thank you again.



          • #6
            You are welcome! I simply used top to take a rough look at the memory used by bowtie-build, but you could use more sophisticated tools such as memtime or tstime if you want. If memory is really an issue, you could use the -p option to trade speed for memory, but I would recommend adding more RAM if possible. hg18, hg19 and mm9 are genome build names given by UCSC, while NCBI 36.3, NCBI 37.1, etc. are names used by NCBI. For example, hg18 has exactly the same sequences as NCBI 36.3 and hg19 has exactly the same sequences as NCBI 37.1; only the header information is slightly different. UCSC uses headers such as chr1 and chr2 to be compatible with the UCSC genome browser, while NCBI uses more obscure headers such as ">gi|224589800|ref|NC_000001.10| Homo sapiens chromosome 1, GRCh37 primary reference assembly".
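
To illustrate the header difference, here is a small sketch that derives a UCSC-style name from an NCBI-style header like the one quoted above. It assumes the header contains the words "chromosome <number or X or Y>"; the function name `ucsc_style_name` is hypothetical, and other sequences (mitochondrial, unplaced contigs) are deliberately not handled and come back as None.

```python
import re

# Matches "chromosome 1", "chromosome 22", "chromosome X", "chromosome Y".
_CHROMOSOME = re.compile(r"\bchromosome\s+([0-9]+|X|Y)\b")


def ucsc_style_name(ncbi_header):
    """Return e.g. 'chr1' for an NCBI-style header, or None if no match."""
    match = _CHROMOSOME.search(ncbi_header)
    if match is None:
        return None
    return "chr" + match.group(1)


header = (">gi|224589800|ref|NC_000001.10| Homo sapiens chromosome 1, "
          "GRCh37 primary reference assembly")
print(ucsc_style_name(header))
```

Renaming headers this way only changes labels; it does not make two references equivalent if their sequences differ.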



            • #7
              Wow, I learned a lot.

              Thanks, Leo.



              • #8
                A comment. NCBI/UCSC/Ensembl do not use exactly the same human reference sequence. This is true at least for build 36. UCSC concatenates unassembled contigs into chrN_random. Ensembl masks out the pseudoautosomal regions on Y. I recommend the Ensembl reference sequence. If you use the NCBI/UCSC genome, you can map essentially no reads to the pseudoautosomal regions.



                • #9
                  Thanks for your clarification, Heng! I checked hg19 and NCBI 37.1 the other day and they are exactly the same, i.e., neither includes chrN_random now. Personally I stick with UCSC for convenience most of the time, but this may not be enough depending on what you do (as pointed out by Heng). CarlElit, as you can see, there are quite a few subtleties in bioinformatics and you just have to learn to cope with them along the way (and hopefully contribute to making things better as well).
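
One way to run this kind of "are they exactly the same" check yourself is to hash each sequence and compare the digests while ignoring the headers. This is only a sketch, not necessarily how the comparison above was done: the function names are hypothetical, it assumes the records appear in the same order in both files, and it upper-cases the bases so that lowercase soft-masking does not count as a difference.

```python
import hashlib


def sequence_digests(fasta_lines):
    """Yield (header, sha256 hex digest) for each FASTA record.

    Line breaks are ignored and bases are upper-cased before hashing.
    """
    header, digest = None, None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, digest.hexdigest()
            header, digest = line[1:], hashlib.sha256()
        elif header is not None:
            digest.update(line.upper().encode("ascii"))
    if header is not None:
        yield header, digest.hexdigest()


def same_sequences(lines_a, lines_b):
    """True if both inputs hold the same sequences in the same order."""
    digests_a = [d for _, d in sequence_digests(lines_a)]
    digests_b = [d for _, d in sequence_digests(lines_b)]
    return digests_a == digests_b


# Hypothetical usage with two open files:
# with open("hg19.fa") as a, open("ncbi37.fa") as b:
#     print(same_sequences(a, b))
```

Because the input is consumed line by line, this works on whole-genome files without loading them into memory.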
