Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • grep only reads from bam file

    Hi all,
    I am working on TCGA WGS.. and the files are huge..
    Is there a way to copy all the reads directly from the bam file without using samtools and bam2fastx (from tophat)?

    Right now i am using:
    samtools sort -n *.bam *.bam.sorted
    bam2fastx -q -Q -A -P -o out.fq *.bam.sorted.bam

    and it is very slow..
    i dont mind to stay in binary mode, and i am not using the quality + the order isnt relevant..

    Or,
    to spped up what i am doing now..

    Thanks!

  • #2
    Are the reads paired-end and do you care if the output is? If the answer to either of those is "no", then you can skip the sort step. If the answer to both is "yes", then there's really no great way to do that. Realistically, paired-end reads will normally close to each other in a file (since that's how they'll map), so you could just keep them in memory until running into the appropriate mate. You might see if you can just get the fastq files, it might be faster.

    Comment


    • #3
      Originally posted by dpryan View Post
      Are the reads paired-end and do you care if the output is? If the answer to either of those is "no", then you can skip the sort step. If the answer to both is "yes", then there's really no great way to do that. Realistically, paired-end reads will normally close to each other in a file (since that's how they'll map), so you could just keep them in memory until running into the appropriate mate. You might see if you can just get the fastq files, it might be faster.
      If i am interesting only in the first pair, can i use bam2fastx without sorting?
      Or, what you meant is that there are no seperator between the 'paired ends', and i will have sinfle file with the 2 paired?

      thanks

      Comment


      • #4
        If you're only interested in the first mate in a pair, then:

        Code:
        samtools view -f 0x40 sample.bam | awk '{printf("%s\n%s\n+\n%s\n", $1, $10, $11)}' | gzip > sample.fq.gz
        will work and probably be a bit faster. If you removed the "-f 0x40", then you'd have both mates of the pair in the same file, though not necessarily next to each other. Of course, if the file contains non-linear alignments (i.e., a single read can be split across multiple lines, using the 0x800 bit in the flag) then this won't work.

        Comment


        • #5
          Originally posted by dpryan View Post
          If you're only interested in the first mate in a pair, then:

          Code:
          samtools view -f 0x40 sample.bam | awk '{printf("%s\n%s\n+\n%s\n", $1, $10, $11)}' | gzip > sample.fq.gz
          will work and probably be a bit faster. If you removed the "-f 0x40", then you'd have both mates of the pair in the same file, though not necessarily next to each other. Of course, if the file contains non-linear alignments (i.e., a single read can be split across multiple lines, using the 0x800 bit in the flag) then this won't work.
          I tried this and i got that the first end is: 130713161918
          But when i used:
          samtools sort -n *.bam *.bam.sorted
          bam2fastx -q -Q -A -P -o out.fq *.bam.sorted.bam
          i got: 146521575509

          What could be the reason?
          The procedure that you offer was much quicker.. but it might miss some reads?

          Comment


          • #6
            The method I gave will never miss any reads. The difference could only be due to the presence of the "-f 0x40" in the command that I suggested and the lack of that in bam2fastx, which put both reads of a pair in the output. I should note that one down-side to the method I gave is that the reads won't be reverse-complemented as appropriate prior to output. If you don't care about the order, then just skip the sort step and use bam2fastx.

            Comment


            • #7
              Originally posted by dpryan View Post
              The method I gave will never miss any reads. The difference could only be due to the presence of the "-f 0x40" in the command that I suggested and the lack of that in bam2fastx, which put both reads of a pair in the output. I should note that one down-side to the method I gave is that the reads won't be reverse-complemented as appropriate prior to output. If you don't care about the order, then just skip the sort step and use bam2fastx.
              I am interesting in all the reads that are in the bam file, without missing any read.
              When i am using bam2fastx without sorting i am getting this error:
              Error: couldn't retrieve both reads for pair HWI-ST512_115262631:6:1306:12150:22582. Perhaps the input file is not sorted by name?
              (using 'samtools sort -n' might fix this)

              Comment


              • #8
                Originally posted by papori View Post
                I am interesting in all the reads that are in the bam file, without missing any read.
                When i am using bam2fastx without sorting i am getting this error:
                Error: couldn't retrieve both reads for pair HWI-ST512_115262631:6:1306:12150:22582. Perhaps the input file is not sorted by name?
                (using 'samtools sort -n' might fix this)
                Apparently bam2fastx requires the input file to be precollated (have pairs next to each other). There are several other programs that perform this collation process on the fly. One of them is bamtofastq in biobambam. Another is SamToFastq in Picard.

                Comment


                • #9
                  @papori: Have you checked to see if the original fastq files are available for TCGA WGS data?

                  Comment


                  • #10
                    Originally posted by GenoMax View Post
                    @papori: Have you checked to see if the original fastq files are available for TCGA WGS data?
                    As i understood from the FAQ of TCGA only fastq of RNA/WXS available now..
                    Do you know somthing else?

                    Comment


                    • #11
                      Originally posted by papori View Post
                      As i understood from the FAQ of TCGA only fastq of RNA/WXS available now..
                      Do you know somthing else?
                      No I don't. I was wondering if the group that submitted the WGS data also had made the sequences available. Have you tried to ask for sequence data to see if it can be made available?

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Strategies for Sequencing Challenging Samples
                        by seqadmin


                        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                        03-22-2024, 06:39 AM
                      • seqadmin
                        Techniques and Challenges in Conservation Genomics
                        by seqadmin



                        The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                        Avian Conservation
                        Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                        03-08-2024, 10:41 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, Yesterday, 06:37 PM
                      0 responses
                      8 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, Yesterday, 06:07 PM
                      0 responses
                      8 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-22-2024, 10:03 AM
                      0 responses
                      49 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-21-2024, 07:32 AM
                      0 responses
                      66 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X