Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • SAM output from Bowtie >50GB?

    I'm working with my first Illumina sequence reads and using Bowtie for alignment to the reference genome. I must be doing something wrong, because the SAM files that are generated are over 50 GB in size. It seems to be some sort of loop, because the algorithm is halted once my disk quota is exceeded.

    The Illumina read files are about 4 GB after conversion to FASTQ Sanger format.

    I've posted the script below for anyone who can tell me where I'm messing up:

    -----
    #BSUB -J BowtieSwineSeq4
    #BSUB -o BowtieSwineSeq4.o%J
    #BSUB -e BowtieSwineSeq4.e%J
    #BSUB -n 2

    bowtie -p 2 -a -v 2 -t -S SwineGenome9.58 s_4_sequence.fastq SwineSeq4.sam
    -----

    Many thanks in advance,
    JJW

  • #2
    It's hard to know if there is a problem unless we know (a) how many reads you have and (b) how many lines of output bowtie generates before your disk quota is exceeded. It seems likely that Bowtie's doing what it should do.

    Ben

    Comment


    • #3
      Thanks for the reply. As to the information needed:

      a) The current file I am working with has 29751337 illumina reads, converted into sanger reads.

      b) I can't seem view the SAM output file contents (e.g. in a text editor) in their entirety. The first few lines that do come up are pasted below.

      I'm sure Bowtie is doing what I'm asking it to do. Being new to this type of analysis, I'm sure the glitch is on my end, not the software's. I'm very impressed with the alignment program you have produced.

      Much appreciated!
      JJW

      HWUSI-EAS558R_0001:2:1:1054:8831#0/1 16 15 64381792 255 42M * 0 0 CCCTAACATGGGAAAGGGATCACTCACTCAAATCCATGAAGC 5EEEBEBECBED>@41:ECEE$
      HWUSI-EAS558R_0001:2:1:1054:8831#0/1 16 7 78866817 255 42M * 0 0 CCCTAACATGGGAAAGGGATCACTCACTCAAATCCATGAAGC 5EEEBEBECBED>@41:ECEE$
      HWUSI-EAS558R_0001:2:1:1054:8831#0/1 16 8 36985051 255 42M * 0 0 CCCTAACATGGGAAAGGGATCACTCACTCAAATCCATGAAGC 5EEEBEBECBED>@41:ECEE$
      HWUSI-EAS558R_0001:2:1:1054:8831#0/1 16 7 78903247 255 42M * 0 0 CCCTAACATGGGAAAGGGATCACTCACTCAAATCCATGAAGC 5EEEBEBECBED>@41:ECEE$
      HWUSI-EAS558R_0001:2:1:1054:8831#0/1 16 7 78837623 255 42M * 0 0 CCCTAACATGGGAAAGGGATCACTCACTCAAATCCATGAAGC 5EEEBEBECBED>@41:ECEE$
      HWUSI-EAS558R_0001:2:1:1054:8831#0/1 16 5 22131022 255 42M * 0 0 CCCTAACATGGGAAAGGGATCACTCACTCAAATCCATGAAGC 5EEEBEBECBED>@41:ECEE$
      HWUSI-EAS558R_0001:2:1:1054:8831#0/1 16 9 9081785 255 42M * 0 0 CCCTAACATGGGAAAGGGATCACTCACTCAAATCCATGAAGC 5EEEBEBECBED>@41:ECEECDDECDEE$
      HWUSI-EAS558R_0001:2:1:1054:11106#0/1 4 * 0 0 * * 0 0 TCATATTGCTTTTTGAACTTGATGAACTGTCTGATAGTTTAT B=B=B-AACCDDDDD=DDD?CCCC?=DDD$
      HWUSI-EAS558R_0001:2:1:1054:4407#0/1 4 * 0 0 * * 0 0 CAGAGTGTCTATGTGAAGCCGTATGTCTTGAAGAGAAGCTTT D?BD@@6?@@DD?DD-BD??CDDDDD=:=$
      HWUSI-EAS558R_0001:2:1:1054:5597#0/1 16 15 120471582 255 42M * 0 0 AAAAAGAGAAATTTGATTATAGTATATTCATTCATTCAAGAA AEDDGFGDDCFFDFGGFAG$

      Comment


      • #4
        because you use "-a"

        Comment


        • #5
          Thanks,

          I'll drop the -a option. Out of curiosity, don't I want all valid alignments?

          JJW

          Comment


          • #6
            Did you remove the adapters from your Solexa reads?

            I know from personal experience that if you do trim the adapters off and don't set a minimum length, say 17bp, you could end up with sequences that are only a few base pairs long. Needless to say those sequences will be found thousands of times in the reference genome you are using and create a HUGE SAM file, if you don't run out of memory first.

            If you want the most valid alignments use '-a', but you could also use '-k x' (example: -k 40, which would find the best 40 regions that match a read to the reference genome).

            Welcome to the wonderful world of Solexa sequencing data.

            Hope this might help you.
            Last edited by DrD2009; 06-27-2010, 11:47 AM.

            Comment


            • #7
              Thanks so much for the great advice! I assumed (always a bad idea) that the reads sent back to me from our DNA core had trimmed the adapters. I will verify this before proceeding.

              JJW

              Comment


              • #8
                If it created a file that large I would assume they are already trimmed. If they weren't and still had the adapters on them you would get around a 1.00% alignment to your reference genome and it would complete the alignment in a few minutes or less. I've done this before. haha '-t' is a great parameter.

                Comment


                • #9
                  Actually, I'm glad to hear you say that. I think your earlier suggestion to use the -k option to limit the number of valid alignments returned sounds like a great way to go. I'll give it a try and post the outcome.

                  Thanks for helping out a greenhorn!

                  Comment


                  • #10
                    Thanks, I'm just glad I'm at a point I can actually start helping others with problems I've experienced.

                    Us greenhorns have to stick together.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Essential Discoveries and Tools in Epitranscriptomics
                      by seqadmin




                      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                      04-22-2024, 07:01 AM
                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin


                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 11:49 AM
                    0 responses
                    15 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-24-2024, 08:47 AM
                    0 responses
                    16 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    61 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    60 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X