**Brian Bushnell** · 12-16-2015, 05:35 PM

You can extract sequences that share kmers with your sequences with BBDuk:

Code:

bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31

This will print to C all the sequences in A that share 100% of their 31-mers with sequences in B.

You can also do something more precise with Dedupe, as it allows arbitrary set operations; so, let me know if the above method is insufficient.

**cmbetts** · 12-16-2015, 05:41 PM

There are almost certainly lots of ways to do this, depending on what tools you're comfortable with.

In R/Bioconductor, you could read in the FASTA using the Bioconductor ShortRead package, then use vcountPattern to identify the hits to your query sequences and write those out as a new FASTA file.

It's been a long time since I used it, but BioPython also has some nice iterators for going through FASTA files, and would be better suited to a bigger FASTA file. You would essentially write a little loop that iterates through the FASTA and queries each record for your desired sequence. If it finds the sequence, write the record to a new file; otherwise move on to the next record.

I'm sure some folks have grep-based methods for the command line as well.
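
The loop described above could look roughly like the following. This is only a sketch under assumptions: it uses a small standard-library parser (`read_fasta`, a hypothetical helper) as a stand-in for BioPython's `SeqIO.parse` iterator, and the function names and the case-insensitive substring test are illustrative choices, not something specified in the thread.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs; wrapped sequence lines are joined."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def records_containing(handle, queries):
    """Yield records whose sequence contains any query as a substring."""
    wanted = [q.upper() for q in queries if q]
    for header, seq in read_fasta(handle):
        upper = seq.upper()
        if any(q in upper for q in wanted):
            yield header, seq


def write_fasta(records, out):
    """Write records as unwrapped FASTA."""
    for header, seq in records:
        out.write(f">{header}\n{seq}\n")


if __name__ == "__main__":
    demo = io.StringIO(">r1\nACGTAC\nGTTT\n>r2\nGGGGCC\n")
    result = io.StringIO()
    write_fasta(records_containing(demo, ["tacgt"]), result)
    print(result.getvalue(), end="")
```

Because records are streamed one at a time, memory use depends on the longest record rather than on the size of the file. Checking every query against every record gets slow with a long query list, though.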

**entomology** · 12-17-2015, 02:20 AM

Thanks for the info.

**entomology** · 12-17-2015, 02:27 AM

Actually, I prefer command-line tools in a shell environment. grep seems like a great way to do it; thanks for the tips.

**maubp** · 12-17-2015, 07:15 AM

Here are my Python scripts to do this, with Galaxy wrappers:

This filters the FASTA file (loads all the IDs, then goes through the FASTA file once):

https://github.com/peterjc/pico_galaxy/tree/master/tools/seq_filter_by_id

This indexes the FASTA file, then goes through the IDs one by one:

https://github.com/peterjc/pico_galaxy/tree/master/tools/seq_select_by_id

The difference is which output order you want: the order in the FASTA file (faster) or the order in the ID file (slower).
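
To make the trade-off concrete, here is a rough sketch of the two strategies. It is not the code of the linked tools: the function names, the `record_id` helper, and the use of a plain in-memory dict as the "index" are all assumptions made for illustration.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def record_id(header):
    """Treat the first whitespace-delimited word of the header as the ID."""
    parts = header.split()
    return parts[0] if parts else ""


def filter_by_id(fasta, ids):
    """Load all IDs, then pass over the FASTA once: output is in FASTA order."""
    wanted = set(ids)
    return [(h, s) for h, s in parse_fasta(fasta) if record_id(h) in wanted]


def select_by_id(fasta, ids):
    """Index the FASTA, then look up IDs one by one: output is in ID order."""
    index = {record_id(h): (h, s) for h, s in parse_fasta(fasta)}
    return [index[i] for i in ids if i in index]


if __name__ == "__main__":
    text = ">b desc\nCC\n>a\nAA\n>c\nGG\n"
    print(filter_by_id(io.StringIO(text), ["a", "b"]))
    print(select_by_id(io.StringIO(text), ["a", "b"]))
```

Both calls return the same two records; only the order differs.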

**entomology** · 12-17-2015, 08:49 AM

It seems these scripts also work from a list of IDs, using the IDs to fetch sequences from a FASTA file. I want to use the sequences directly and get a subset of a big FASTA file based on the sequences I provide. Thank you for the information as well.

**maubp** · 12-17-2015, 09:03 AM

What scripting/programming language(s) are you learning?

**entomology** · 12-17-2015, 11:48 AM

Actually, I do wet lab work most of the time. Sometimes I also do simple small RNA analysis with ready-to-use Perl scripts and simple shell scripts. That's enough for my work most of the time, but sometimes it's not easy for me.

**maubp** · 12-17-2015, 12:28 PM

Sadly my shell scripting skills are minimal, and my Perl worse, so I can't really help directly.

**entomology** · 12-17-2015, 12:39 PM

No worries, I'll try to use grep to deal with the problem.

**GenoMax** · 12-17-2015, 01:08 PM

Brian's solution should work. Did you try it?

While I like grep and its variants, it may not always work for something as intricate as deciphering nucleotide patterns, especially if your sequences wrap around onto multiple lines.
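
A small illustration of the wrapping problem, as a sketch only: `line_hits` is a hypothetical stand-in for a line-by-line search such as grep, and `unwrap` is one possible way to join wrapped sequence lines before searching. Neither name comes from the thread.

```python
def line_hits(fasta_text, query):
    """Mimic a line-by-line search: return sequence lines containing query."""
    return [
        line
        for line in fasta_text.splitlines()
        if not line.startswith(">") and query in line
    ]


def unwrap(fasta_text):
    """Join wrapped sequence lines so each record has one sequence line."""
    out, chunks = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if chunks:
                out.append("".join(chunks))
                chunks = []
            out.append(line)
        elif line:
            chunks.append(line)
    if chunks:
        out.append("".join(chunks))
    return "".join(line + "\n" for line in out)


if __name__ == "__main__":
    wrapped = ">r1\nAAACG\nTTGGG\n"
    # CGTT spans the line break, so a line-wise search misses it.
    print(line_hits(wrapped, "CGTT"))
    # After unwrapping, the same search finds the record's sequence line.
    print(line_hits(unwrap(wrapped), "CGTT"))
```

The first search finds nothing because the pattern is split across two lines; the second finds it once the record is on a single line.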

**entomology** · 12-17-2015, 02:00 PM

Yes, I've tried bbduk.sh.

bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31

But in my case b.fa is not a FASTA file; it contains one sequence per line. I just want to pull the sequences listed in b out of a.fa and then make a new FASTA file (c.fa).

Since my b.fa is not a FASTA file, bbduk.sh gives an error:

Exception in thread "Thread-9" java.lang.RuntimeException: Error parsing read from text.

**GenoMax** · 12-17-2015, 02:23 PM

If your sequences are one per line, use the following command to convert them to a FASTA-format file (change the file names as needed):

Code:

$ awk '{print ">" NR "\n" $0}' your_file > new_file_as_fasta

Then use the file with BBDuk.

**entomology** · 12-17-2015, 03:03 PM

Thank you for the code. It converts my sequences to a FASTA file. I tried bbduk.sh again, but the result is not what I expected. An example will make it easier to understand. There are two FASTA files:

original.fas
>1123-11234
aaaaaa
>wer
atgcca
>ad
ctaacg
>232-23424
tttttt
>323-342
cacaaa
>416-2
gggggg
>13424241234-23423
cccccc
>5-234
cggcgtcacgttggttgttga

ref.fas (after converting to FASTA with your awk script)
>1
aaaaaa
>2
tttttt
>3
gggggg
>4
cccccc

I ran "bbmap/bbduk.sh in=original.fas ref=ref.fas out=out.fas mkf=1 mm=f k=21"

out.fas looks like this:
>5-234
cggcgtcacgttggttgttga

Actually, I want a FASTA file like this:

>1123-11234
aaaaaa
>232-23424
tttttt
>416-2
gggggg
>13424241234-23423
cccccc

In other words, fetch the records from original.fas, with their original IDs, whose sequences appear in ref.fas.
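
The six-base sequences in this example are shorter than k=21, so they contain no 21-mers to compare. The output wanted here amounts to an exact whole-sequence lookup, which can be sketched as below. This is a minimal illustration under assumptions, not a method proposed in the thread: the function names are hypothetical, matching is case-insensitive, and reverse complements are not considered.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs; wrapped sequence lines are joined."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def load_queries(handle):
    """Read one query sequence per line; blank and '>' lines are skipped."""
    queries = set()
    for line in handle:
        line = line.strip()
        if line and not line.startswith(">"):
            queries.add(line.upper())
    return queries


def exact_matches(fasta, queries):
    """Yield records whose entire sequence equals one of the queries."""
    for header, seq in read_fasta(fasta):
        if seq.upper() in queries:
            yield header, seq


if __name__ == "__main__":
    original = io.StringIO(
        ">1123-11234\naaaaaa\n>wer\natgcca\n>ad\nctaacg\n>232-23424\ntttttt\n"
        ">323-342\ncacaaa\n>416-2\ngggggg\n>13424241234-23423\ncccccc\n"
        ">5-234\ncggcgtcacgttggttgttga\n"
    )
    wanted = load_queries(io.StringIO("aaaaaa\ntttttt\ngggggg\ncccccc\n"))
    for header, seq in exact_matches(original, wanted):
        print(f">{header}\n{seq}")
```

Holding the queries in a set keeps each lookup constant-time on average, so the big FASTA file is read once regardless of how many query sequences there are. Because `load_queries` skips header lines, it also accepts the awk-converted ref.fas as long as its sequences are not wrapped.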

Extract multiple fasta sequences from a fasta file based on sequences
