Seqanswers Leaderboard Ad

**dpryan** · 09-11-2013, 03:17 AM

Do you have the original fastq files or do you have the aligned SAM/BAM files? It sounds like you'll be best served by just using grep:

grep -B 1 -A 2 CACAGCTTCTAGTGCTATTCTGCGCCGGTATCC reads.fastq > matches.fastq

or

samtools view aligned.bam | grep CACAGCTTCTAGTGCTATTCTGCGCCGGTATCC > matches.sam

You could trivially pipe the second one through awk/perl/whatever to get output in fastq or fasta, if you prefer.

**JamieHeather** · 09-11-2013, 03:21 AM

Like dpryan said, you could get fasta output like this:

Code:

grep --no-group-separator -B 1 CACAGCTTCTAGTGCTATTCTGCGCCGGTATCC FILE.fq | sed 's/@/>/' > NEW_FILE.fa

**dpryan** · 09-11-2013, 03:24 AM

Originally posted by JamieHeather View Post

You could do this in bash like this:

Code:

grep --no-group-separator -B 1 CACAGCTTCTAGTGCTATTCTGCGCCGGTATCC FILE.fq | sed 's/\@/\>/' > NEW_FILE.fa

I always forget about the --no-group-separator flag!

It should be noted that this flag isn't always available, in which case grep -v can be used to remove the separator (this issue has come up a couple times on this forum).

**JamieHeather** · 09-11-2013, 03:30 AM

Ooh I didn't realise the --no-group-separator isn't always present, good to know.

I thought the -v flag gave all non-matches? At least it seems to in my version.

**HSV-1** · 09-11-2013, 03:31 AM

Originally posted by dpryan View Post

Do you have the original fastq files or do you have the aligned SAM/BAM files? It sounds like you'll be best served by just using grep:

grep -B 1 -A 2 CACAGCTTCTAGTGCTATTCTGCGCCGGTATCC reads.fastq > matches.fastq

or

samtools view aligned.bam | grep CACAGCTTCTAGTGCTATTCTGCGCCGGTATCC > matches.sam

You could trivially pipe the second one through awk/perl/whatever to get output in fastq or fasta, if you prefer.

Thanks for your reply. I have the original rna-seq data in *.fastq format.
is grep the linux function? what are -B and -A 2 for?

Thank you guys!

**JamieHeather** · 09-11-2013, 03:36 AM

You're welcome!

Grep is indeed a unix function, you can just enter it straight into the command line.

Grep finds and returns the line(s) that contain the expression or string you supplied (in this case your sequence), then the A and B flags tell grep to also output lines above and below the matching line.

So -B 1 -A 2 would give you the line above the sequence line and the two lines beneath, thus outputting the entire fastq record.

The code I gave you output just the one above, as fasta doesn't use lines 3 and 4, and then just used sed to change the '@' character to a '>' in the ID line.

Grep and sed are very useful for hands on playing with your data like this, they're both worth getting to grips with.

**HSV-1** · 09-11-2013, 03:39 AM

Originally posted by JamieHeather View Post

You're welcome!

Grep is indeed a unix function, you can just enter it straight into the command line.

Grep finds and returns the line that contains the expression or string you supplied (in this case your sequence), then the A and B flags tell grep to also output lines above and below the matching line.

It's a very powerful tool, worth getting to grips with.

Thanks. thank you very much.

**dpryan** · 09-11-2013, 03:45 AM

Originally posted by JamieHeather View Post

Ooh I didn't realise the --no-group-separator isn't always present, good to know.

I thought the -v flag gave all non-matches? At least it seems to in my version.

It does, you just have to pipe the first grep into a grep -v and then the output of that into awk/sed/whatever. I don't recall which platform was missing the --no-group-separator option, I'd have to search back through the forums.

**JamieHeather** · 09-11-2013, 04:04 AM

Originally posted by dpryan View Post

It does, you just have to pipe the first grep into a grep -v and then the output of that into awk/sed/whatever. I don't recall which platform was missing the --no-group-separator option, I'd have to search back through the forums.

Ahh I see, that makes sense!

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

How to Pick certain reads out of RNA-seq data to a fasta file?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News