  • How to pick certain reads out of RNA-seq data into a FASTA file?

    Hello, everyone.
    I searched for a solution to my question but couldn't find one, so I'm starting a new thread. I want to pick out all the Illumina HiSeq reads that contain the sequence "CACAGCTTCTAGTGCTATTCTGCGCCGGTATCC" and collect them into a FASTA file or any other format.

    I have read through the Bowtie and BWA manuals, but I didn't find this usage.

    Could anyone help me out?

    Thanks!

  • #2
    Do you have the original fastq files or do you have the aligned SAM/BAM files? It sounds like you'll be best served by just using grep:

    grep -B 1 -A 2 CACAGCTTCTAGTGCTATTCTGCGCCGGTATCC reads.fastq > matches.fastq

    or

    samtools view aligned.bam | grep CACAGCTTCTAGTGCTATTCTGCGCCGGTATCC > matches.sam

    You could trivially pipe the second one through awk/perl/whatever to get output in fastq or fasta, if you prefer.
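As a rough sketch of that last step, the awk one-liner below turns header-less SAM records into FASTA. It assumes the standard SAM column layout (read name in column 1, sequence in column 10) and, instead of grepping the whole line, it matches the target against column 10 only. The file names and the two toy records are hypothetical stand-ins for real `samtools view` output.

```shell
target=CACAGCTTCTAGTGCTATTCTGCGCCGGTATCC

# Hypothetical stand-in for `samtools view aligned.bam`: two header-less SAM records.
printf 'read1\t0\tchr1\t100\t60\t37M\t*\t0\t0\tAA%sTT\t*\n' "$target" > demo.sam
printf 'read2\t0\tchr1\t500\t60\t12M\t*\t0\t0\tACGTACGTACGT\t*\n' >> demo.sam

# Keep records whose SEQ field (column 10) contains the target and print them as FASTA.
# With real data the input would be piped in: samtools view aligned.bam | awk ...
awk -F '\t' -v t="$target" 'index($10, t) { print ">" $1; print $10 }' demo.sam > demo_from_sam.fa

cat demo_from_sam.fa
```

One thing to keep in mind: SAM stores reads that aligned to the reverse strand as their reverse complement, so a search of aligned data may need to be repeated with the reverse-complemented target.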

    • #3
      Like dpryan said, you could get fasta output like this:

      Code:
      grep --no-group-separator -B 1 CACAGCTTCTAGTGCTATTCTGCGCCGGTATCC FILE.fq | sed 's/@/>/' > NEW_FILE.fa
      Last edited by JamieHeather; 09-11-2013, 03:28 AM. Reason: Taking too long to reply

      • #4
        Originally posted by JamieHeather
        You could do this in bash like this:

        Code:
        grep --no-group-separator -B 1 CACAGCTTCTAGTGCTATTCTGCGCCGGTATCC FILE.fq | sed 's/\@/\>/' > NEW_FILE.fa
        I always forget about the --no-group-separator flag!

        It should be noted that this flag isn't always available, in which case grep -v can be used to remove the separator (this issue has come up a couple times on this forum).
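A minimal sketch of that grep -v workaround, assuming the separator grep prints between non-adjacent groups of context lines is a line containing only `--` (the GNU grep default). The file names and the three toy reads are made up for illustration.

```shell
target=CACAGCTTCTAGTGCTATTCTGCGCCGGTATCC

# Hypothetical toy FASTQ: read1 and read3 contain the target, read2 does not.
i=0
for s in "AA${target}TT" "ACGTACGTACGT" "${target}"; do
  i=$((i + 1))
  # Fake quality string: one 'I' per base.
  printf '@read%s\n%s\n+\n%s\n' "$i" "$s" "$(printf '%s' "$s" | tr 'ACGT' 'IIII')"
done > demo_reads.fastq

# Pull out whole four-line records, then drop the "--" separator lines.
grep -B 1 -A 2 "$target" demo_reads.fastq | grep -v '^--$' > demo_matches.fastq

cat demo_matches.fastq
```

Anchoring the pattern as `^--$` keeps the second grep from discarding lines that merely contain two dashes. A quality line consisting of exactly `--` would still be dropped, but that could only happen for a two-base read.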

        • #5
          Ooh I didn't realise the --no-group-separator isn't always present, good to know.

          I thought the -v flag gave all non-matches? At least it seems to in my version.

          • #6
            Originally posted by dpryan
            Do you have the original fastq files or do you have the aligned SAM/BAM files? It sounds like you'll be best served by just using grep:

            grep -B 1 -A 2 CACAGCTTCTAGTGCTATTCTGCGCCGGTATCC reads.fastq > matches.fastq

            or

            samtools view aligned.bam | grep CACAGCTTCTAGTGCTATTCTGCGCCGGTATCC > matches.sam

            You could trivially pipe the second one through awk/perl/whatever to get output in fastq or fasta, if you prefer.

            Thanks for your reply. I have the original RNA-seq data in *.fastq format.
            Is grep a Linux command? What are -B 1 and -A 2 for?

            Thank you guys!

            • #7
              You're welcome!

              Grep is indeed a Unix command; you can just enter it straight into the command line.

              Grep finds and returns the line(s) that contain the expression or string you supplied (in this case your sequence). The -B (before) and -A (after) flags then tell grep to also output lines above and below each matching line.

              So -B 1 -A 2 would give you the line above the sequence line and the two lines beneath, thus outputting the entire four-line fastq record.

              The code I gave you outputs only the line above, as fasta doesn't use lines 3 and 4 of the record, and then uses sed to change the '@' character to a '>' in the ID line.

              Grep and sed are very useful for hands on playing with your data like this, they're both worth getting to grips with.
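To see the effect of the context flags, here is a small sketch on a made-up two-record FASTQ file (the file names and reads are hypothetical). It runs the same search three ways and counts the output lines, which should come to 1, 2 and 4.

```shell
target=CACAGCTTCTAGTGCTATTCTGCGCCGGTATCC

# Hypothetical two-record FASTQ; only read1 contains the target.
printf '@read1\n%s\n+\n%s\n' "$target" "$(printf '%s' "$target" | tr 'ACGT' 'IIII')" > demo_context.fastq
printf '@read2\nACGTACGT\n+\nIIIIIIII\n' >> demo_context.fastq

# No context flags: only the matching sequence line.
grep "$target" demo_context.fastq > demo_seq_only.txt
# -B 1 adds the line before the match: header + sequence.
grep -B 1 "$target" demo_context.fastq > demo_header_seq.txt
# -B 1 -A 2 also adds the two lines after it: the full four-line record.
grep -B 1 -A 2 "$target" demo_context.fastq > demo_full_record.fastq

wc -l demo_seq_only.txt demo_header_seq.txt demo_full_record.fastq
```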
              Last edited by JamieHeather; 09-11-2013, 03:39 AM.

              • #8
                Originally posted by JamieHeather
                You're welcome!

                Grep is indeed a unix function, you can just enter it straight into the command line.

                Grep finds and returns the line that contains the expression or string you supplied (in this case your sequence), then the A and B flags tell grep to also output lines above and below the matching line.

                It's a very powerful tool, worth getting to grips with.
                Thanks. Thank you very much.

                • #9
                  Originally posted by JamieHeather
                  Ooh I didn't realise the --no-group-separator isn't always present, good to know.

                  I thought the -v flag gave all non-matches? At least it seems to in my version.
                  It does, you just have to pipe the first grep into a grep -v and then the output of that into awk/sed/whatever. I don't recall which platform was missing the --no-group-separator option, I'd have to search back through the forums.
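Spelled out, that chain could look something like the sketch below. It assumes the separator is a line containing only `--` and uses made-up file names and reads. The sed expression is anchored with `^` so that only an '@' at the start of a line is rewritten, a small tweak to the expression posted earlier in the thread.

```shell
target=CACAGCTTCTAGTGCTATTCTGCGCCGGTATCC

# Hypothetical toy FASTQ: read1 and read3 contain the target, read2 does not.
i=0
for s in "AA${target}TT" "ACGTACGTACGT" "${target}"; do
  i=$((i + 1))
  # Fake quality string: one 'I' per base.
  printf '@read%s\n%s\n+\n%s\n' "$i" "$s" "$(printf '%s' "$s" | tr 'ACGT' 'IIII')"
done > demo_reads.fastq

# 1) header + sequence for each hit, 2) drop "--" separators, 3) '@' -> '>' on ID lines.
grep -B 1 "$target" demo_reads.fastq | grep -v '^--$' | sed 's/^@/>/' > demo_matches.fa

cat demo_matches.fa
```

Because only the header and sequence lines reach the second grep here, no quality line can be mistaken for a separator.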

                  • #10
                    Originally posted by dpryan
                    It does, you just have to pipe the first grep into a grep -v and then the output of that into awk/sed/whatever. I don't recall which platform was missing the --no-group-separator option, I'd have to search back through the forums.
                    Ahh I see, that makes sense!
