SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Extract sequence from multi fasta file with PERL andreitudor Bioinformatics 27 07-07-2019 07:45 AM
Extract only sequence ids from fasta file with makeblastdb angeloulivieri Bioinformatics 13 07-30-2012 02:41 AM
extract subsequence from genomic fasta file jwhite Bioinformatics 7 06-28-2012 11:15 AM
PubMed: Statistical Analyses of Next Generation Sequence Data: A Partial Overview. Newsbot! Literature Watch 0 11-30-2010 08:20 AM
In Sequence: Illumina Ships ‘Record’ GAs In Q3; Says Longer Reads Will Spur De Novo S Newsbot! Illumina/Solexa 0 10-28-2008 01:17 PM

Reply
 
Thread Tools
Old 10-30-2012, 03:26 AM   #1
cdlam
Junior Member
 
Location: East Coast

Join Date: Oct 2012
Posts: 7
Default Extract partial sequence from FASTA record

Hi all,

So I'm fairly new to anything bioinformatics related and I've been kind of muddling my way through so far.

I have extracted the introns from three different species and have their sequences stored in three FASTA files. I need a way to extract the first 10 bases from each of these sequences and put them in a new file. I don't know if it helps or not, but the first 10 bases are in caps while the rest of the sequence is in lowercase. I'm not sure if I could use some sort of regex hereor something. So for example,

>Fasta format Identification stuff
ACTTGTACATatgggtatcataatcagggagatcc
>Fasta format Identification stuff
ACTTGTACATatgggtatcataatcagggagatcc
>Fasta format Identification stuff
ACTTGTACATatgggtatcataatcagggagatcc

Ideally I could preserve the ID lines with the extracted 10 base sequences. I have been using UNIX and perl for some of this (doing the actual extractions), but I also have access to Windows with python, biopython, and Emboss. Thanks for any help you guys could give!
cdlam is offline   Reply With Quote
Old 10-30-2012, 04:19 AM   #2
Apexy
Member
 
Location: Africa

Join Date: Apr 2011
Posts: 62
Default

Hello cdlam,
Assuming your sequences are in a file name file.fa. At the command prompt run this
perl -e 'while($id = <>){$seq = <>; $seq = substr($seq,0,10); print $id.$seq."\n"}' file.fa
you should get this:
>Fasta format Identification stuff
ACTTGTACAT
>Fasta format Identification stuff
ACTTGTACAT
>Fasta format Identification stuff
ACTTGTACAT

Enjoy!

MBandi
Apexy is offline   Reply With Quote
Old 10-30-2012, 05:23 AM   #3
pbluescript
Senior Member
 
Location: Boston

Join Date: Nov 2009
Posts: 224
Default

awk solution:

awk '{if (/^>/)
print $0
else
print(substr($1,1,10))
}' file.fasta
pbluescript is offline   Reply With Quote
Old 10-30-2012, 07:44 AM   #4
cdlam
Junior Member
 
Location: East Coast

Join Date: Oct 2012
Posts: 7
Default

Thanks for the replies. I'm not exactly sure how to do the awk solution,is that run from the command prompt?

Apexy, that seems to have worked. However, unforseen problem....one of my files is "file.fasta" but apparently isn't in the format... It's like:

scaffold_8 2234626-2234945 (-) GATCCGTAAGgtgaccacttccagcggcctggtgaagcgcctgatcgaggtgcagacgacgaccgtgacgaagaccgtggcgggcgagacgagcacgactgtggagacggagactcgcgaggagatcgatgacagcagtgccacgagctcgaccacgacgacgtcggatgcgactctggttagctcgtcgacgacgcacgagacggacgaggacggcgctgctggcgcgacggtgaagaccgaggtgttccgtacgatgggtgccaacggcagcgtcatcaccaagacggttcgcacgacgatccgtaagGTGACCACTT asmbl_6481

Except it's just in one line when I open it:
scaffold_8 2234626-2234945 (-) GATCCGTAAG.........

It does do multiple lines per entry, but the breaks seem unpredictable, it's weird. I don't suppose anyone has any ideas for getting this into the proper format?
cdlam is offline   Reply With Quote
Old 10-30-2012, 07:54 AM   #5
pbluescript
Senior Member
 
Location: Boston

Join Date: Nov 2009
Posts: 224
Default

cdlam, the awk script should be run from the command line.

Is the format of the weird fasta file consistent? Does the sequence name always start with "scaffold"? Do the sequences always start with capitol letters?
pbluescript is offline   Reply With Quote
Old 10-30-2012, 08:03 AM   #6
cdlam
Junior Member
 
Location: East Coast

Join Date: Oct 2012
Posts: 7
Default

Hey,

Yes, it seems consistent. The beginning of each on is "scaffold_8 2234626-2234945 (-) " either with a (+) or (-) depending on what strand it came from. So it will always go ( ) GTGAA....and each line ends with "asmbl_####"

So,

scaffold_# ###-##### ( ) SEQUENCE-DATA assembl_###

It starts a new line for a new "scaffold_#...." and the "assembl_###" is just on the end of the last line with sequence data.

Edit: Sorry, yes there are always 10 capitol letters at the beginning and 10 capitol letters at the end of the sequence.
cdlam is offline   Reply With Quote
Old 10-30-2012, 08:23 AM   #7
pbluescript
Senior Member
 
Location: Boston

Join Date: Nov 2009
Posts: 224
Default

Quote:
Originally Posted by cdlam View Post
Hey,

Yes, it seems consistent. The beginning of each on is "scaffold_8 2234626-2234945 (-) " either with a (+) or (-) depending on what strand it came from. So it will always go ( ) GTGAA....and each line ends with "asmbl_####"

So,

scaffold_# ###-##### ( ) SEQUENCE-DATA assembl_###

It starts a new line for a new "scaffold_#...." and the "assembl_###" is just on the end of the last line with sequence data.

Edit: Sorry, yes there are always 10 capitol letters at the beginning and 10 capitol letters at the end of the sequence.
Start with:
Code:
scaffold_8      2234626-2234945 (-) GATCCGTAAGgtgaccacttccagcggcctggtgaagcgcctgatcgaggtgcagacgacgaccgtgacgaagaccgtggcgggcgagacgagcacgactgtggagacggagactcgcgaggagatcgatgacagcagtgccacgagctcgaccacgacgacgtcggatgcgactctggttagctcgtcgacgacgcacgagacggacgaggacggcgctgctggcgcgacggtgaagaccgaggtgttccgtacgatgggtgccaacggcagcgtcatcaccaagacggttcgcacgacgatccgtaagGTGACCACTT asmbl_6481
Run:
Code:
sed 's/^/>/g' input | sed 's/(*)\s/&\n/g' | sed 's/ as.*$//g' | sed 's/\s/ /g' > output
I broke the sed code down into piped steps to make it easier to understand.
's/^/>/g' adds a > to each line.
's/(*)\s/&\n/g' inserts a new line after the strand designation.
's/ as.*$//g' removes the assembl_### portion. You could just move it instead if you want to keep it.
's/\s/ /g' replaces white space with a space. I added this since it seemed like some of the fields in the name line might have been separated by a tab.

End with:
Code:
>scaffold_8 2234626-2234945 (-)
GATCCGTAAGgtgaccacttccagcggcctggtgaagcgcctgatcgaggtgcagacgacgaccgtgacgaagaccgtggcgggcgagacgagcacgactgtggagacggagactcgcgaggagatcgatgacagcagtgccacgagctcgaccacgacgacgtcggatgcgactctggttagctcgtcgacgacgcacgagacggacgaggacggcgctgctggcgcgacggtgaagaccgaggtgttccgtacgatgggtgccaacggcagcgtcatcaccaagacggttcgcacgacgatccgtaagGTGACCACTT
pbluescript is offline   Reply With Quote
Old 10-30-2012, 08:55 AM   #8
cdlam
Junior Member
 
Location: East Coast

Join Date: Oct 2012
Posts: 7
Default

That worked, thanks a lot!

Hmm in terms of my original question, when using the

Quote:
perl -e 'while($id = <>){$seq = <>; $seq = substr($seq,0,10); print $id.$seq."\n"}' file.fa
To get the ending sequence, I just used $seq,-20, which gave me the final 20 bases. Is there a way I can start 5 bases from the end and take the preceding 20 bases?
cdlam is offline   Reply With Quote
Old 10-30-2012, 01:08 PM   #9
Apexy
Member
 
Location: Africa

Join Date: Apr 2011
Posts: 62
Default

I would suggest pragmatism. first extract last 25 and put in a file. From this file extract first 20
OR
perl -e 'while($id = <>){$seq = <>; $seq = substr($seq,-26, 20); print $id.$seq."\n"}' file.fa
Apexy is offline   Reply With Quote
Old 10-30-2012, 02:21 PM   #10
cdlam
Junior Member
 
Location: East Coast

Join Date: Oct 2012
Posts: 7
Default

Haha, wow I really feel like I should have been able to think of that myself

Oh well.


Thanks a lot!
cdlam is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:33 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO