SEQanswers

07-03-2015, 07:21 AM   #1
kurban910
Member
 
Location: urumqi

Join Date: Jul 2014
Posts: 58
Get Protein Sequences FASTA file by using Entrez Gene IDs

I want to get a FASTA file of protein sequences for a given list of Entrez Gene IDs, shown below:

Code:
kurban@kurban-X550VC:~/Desktop$ more Triboliumcastaneum_tf_id.txt
100141790
100142111
100142176
100142203
100142308
654967
655070
655772
655998
How can I extract the corresponding protein sequences in FASTA format for these Tribolium castaneum gene IDs from NCBI? Thanks.
07-03-2015, 08:17 AM   #2
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700

Here's a hint:

Use the efetch utility.
Example for mRNA:
wget "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=100141790,100142111,100142176,100142203,100142308,654967,655070,655772,655998&rettype=fasta&retmode=text" -O out

Getting the protein is the hard part.

Full solution
echo -e "100141790\n100142111\n100142176\n100142203\n100142308\n654967\n655070\n655772\n655998" | while read G; do curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=protein&id=${G}" | grep -A 1 "<Link>" | grep "<Id>" | cut -d '>' -f 2 | cut -d '<' -f 1 | while read S ; do curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=${S}&retmode=text&rettype=fasta" ; done; done


from Pierre Lindenbaum's post at biostars:
https://www.biostars.org/p/52652/

Note that there are multiple isoforms, so a single gene ID can return more than one protein sequence.
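
To make the middle of that one-liner easier to follow, here is a small sketch of just the step that pulls the linked protein IDs out of the elink XML. The function name extract_linked_ids and the sample XML are illustrative assumptions: the sample is an abridged, made-up response in the general shape elink returns, and 111 and 222 are placeholder IDs, not real protein records.

```shell
# Print the IDs that appear inside <Link> elements of an elink XML
# response read from stdin (same grep/cut chain as the one-liner above).
extract_linked_ids() {
    grep -A 1 "<Link>" | grep "<Id>" | cut -d '>' -f 2 | cut -d '<' -f 1
}

# Abridged, made-up elink response; 111 and 222 are placeholder IDs.
sample_xml='<LinkSet>
  <DbFrom>gene</DbFrom>
  <IdList>
    <Id>100141790</Id>
  </IdList>
  <LinkSetDb>
    <DbTo>protein</DbTo>
    <LinkName>gene_protein</LinkName>
    <Link>
      <Id>111</Id>
    </Link>
    <Link>
      <Id>222</Id>
    </Link>
  </LinkSetDb>
</LinkSet>'

# The queried gene ID sits in <IdList>, not in a <Link>, so it is skipped.
printf '%s\n' "$sample_xml" | extract_linked_ids
```

Each ID printed this way is what the inner loop then passes to efetch with db=protein, which is why a gene with several isoforms yields several FASTA records.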
07-03-2015, 10:30 AM   #3
kurban910
Member
 
Location: urumqi

Join Date: Jul 2014
Posts: 58

Thanks @Richard,
the command really works like a charm, but I want to extract 519 sequences in total, so how can I convert my file from this format
Code:
100141790
100142111
100142176
100142203
100142308
654967
655070
655772
655998
to this form: "100141790\n100142111\n100142176\n100142203\n100142308\n654967\n655070\n655772\n655998"?

Sorry, I am new at this.

Last edited by kurban910; 07-03-2015 at 10:50 AM.
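
One possible approach is to skip the conversion entirely: the echo -e "...\n..." string only exists to feed one ID per line into the loop, and a file with one ID per line can be read directly with input redirection. Below is a minimal sketch under some assumptions. The names fetch_url, genes_to_protein_fasta and PAUSE are made up for illustration, the URLs use https, and the one-second pause between requests is an assumed courtesy delay rather than anything stated in this thread.

```shell
EUTILS="https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
PAUSE=${PAUSE:-1}   # assumption: seconds to wait between efetch requests

# Download one URL to stdout; kept as its own function so it is easy to swap.
fetch_url() {
    curl -s "$1"
}

# Read gene IDs (one per line) from the file given as $1 and print the
# FASTA record of every protein linked to each gene.
genes_to_protein_fasta() {
    # The "|| [ -n ... ]" part keeps a last line that lacks a trailing newline.
    while read -r gene_id || [ -n "$gene_id" ]; do
        [ -z "$gene_id" ] && continue   # skip blank lines
        fetch_url "${EUTILS}/elink.fcgi?dbfrom=gene&db=protein&id=${gene_id}" |
            grep -A 1 "<Link>" | grep "<Id>" | cut -d '>' -f 2 | cut -d '<' -f 1 |
            while read -r protein_id; do
                fetch_url "${EUTILS}/efetch.fcgi?db=protein&id=${protein_id}&retmode=text&rettype=fasta"
                sleep "$PAUSE"
            done
    done < "$1"
}
```

With the file from the first post this would be called as genes_to_protein_fasta Triboliumcastaneum_tf_id.txt > proteins.fasta, where proteins.fasta is just a hypothetical output name. The same redirection works on the original one-liner too: drop the echo -e part and put < Triboliumcastaneum_tf_id.txt after the final done.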