SEQanswers


Tsuyoshi 09-10-2012 02:29 AM

Protein ID that blast could not identify
 
Hi,
I downloaded a proteome in FASTA format, which contains hundreds of proteins (http://labs.umassmed.edu/chlamyfp/in...p?content=help). I want to BLAST my data against these proteins using BLAST+. However, when I ran makeblastdb on the proteome dataset, an error occurred:
*******************************************************************
Error: NCBI C++ Exception:
"/am/ncbiapdata/release/blast/src/2.2.26/IntelMAC-universal/c++/GCC401-ReleaseMT--IntelMAC-universal/../src/objects/seq/../seqloc/Seq_id.cpp", line 1679: Error: ncbi::objects::CSeq_id::x_Init() - Unsupported ID type C_1150005
*******************************************************************
I think there must be something wrong with the proteome data, because BLAST+ worked fine when I used data downloaded directly from NCBI.

Therefore, I opened the proteome data in TextEdit. For example, the header of each sequence looks like this:
*****************************************************************
>C_680011|168600 FAP45, Flagellar Associated Protein Weakly Similar to Nasopharyngeal Epithelium Specific Protein 1
MPQTPPRSGGYRSGKQSYVDESLFGGSKRTGAAQVETLDSLKLTAPTRTISPKDRDVVTLTKGDLTRMLKASPIMTAEDVAAAKREAEAKREQLQAVSKA
RKEKMLKLEEEAKKQAPPTETEILQRQLNDATRSRATHMMLEQKDPVKHMNQMMLYSKCVTIRDAQIEEKKQMLAEEEEEQRRLDLMMEIERVKALEQYE
ARERQRVEERRKGAAVLSEQIKERERERIRQEELRDQERLQMLREIERLKEEEMQAQIEKKIQAKQLMEEVAAANSEQIKRKEGMKVREKEEDLRIADYI
LQKEMREQ
*****************************************************************

Here "C_680011|168600" should be the protein ID, I think, but nothing was found when I searched for it at NCBI. I wonder what kind of ID it is and what I should do to make BLAST+ recognise it.

Thanks!

maubp 09-10-2012 03:11 AM

Are you using the -parse_seqids option? If so, try it without this. I only ever use this if my FASTA file identifiers follow the NCBI naming conventions.

It would be useful to show the command you used to run makeblastdb as that might help us to understand what you are doing.
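
The likely cause of the first error is that, with -parse_seqids, makeblastdb tries to read a bar-separated identifier such as C_680011|168600 as an NCBI-style "tag|value" ID, and C_680011 is not a tag it knows. As a rough sketch (not something posted in the thread), the shell functions below list such identifiers and rewrite them so that the bar becomes an underscore. The function names and the list of tags are assumptions for illustration, and the tag list is not exhaustive.

```shell
# Sketch only: helper names and the tag list are hypothetical, not part of BLAST+.

# Print identifiers that contain "|" but do not start with a known NCBI tag.
# Reads FASTA from the named files, or from standard input if none are given.
list_suspect_ids() {
    awk '
        BEGIN {
            # Assumed, non-exhaustive list of NCBI database tags.
            n = split("gi gb emb dbj ref sp tr pdb pir prf lcl gnl tpg tpe tpd pat", tags, " ")
            for (i = 1; i <= n; i++) known[tags[i]] = 1
        }
        /^>/ {
            id = substr($1, 2)          # first word of the header, without ">"
            bar = index(id, "|")        # position of the first bar, 0 if none
            if (bar > 0 && !(substr(id, 1, bar - 1) in known)) print id
        }
    ' "$@"
}

# Replace "|" with "_" in the identifier (first word) of every header line.
# The description after the first whitespace and the sequence lines are unchanged.
sanitize_fasta_ids() {
    awk '
        /^>/ {
            id = $1
            gsub(/[|]/, "_", id)
            rest = substr($0, length($1) + 1)
            print id rest
            next
        }
        { print }
    ' "$@"
}
```

With the bars removed, something like `sanitize_fasta_ids CrFP.fasta > CrFP.clean.fasta` (the output name is made up) followed by makeblastdb with -parse_seqids should treat each ID as a plain local identifier, though this has not been checked against BLAST+ 2.2.26.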

Tsuyoshi 09-10-2012 03:22 AM

Dear Maubp,
Thanks for your reply.
Yes, I used -parse_seqids. Following your suggestion, I ran it without -parse_seqids, and another error showed up:
*******************************************************************
Error: (CArgException::eNoArg) Argument "dbtype". Mandatory value is missing: `String, `nucl', `prot''
Error: (CArgException::eNoArg) Application's initialization failed
*****************************************************************

The command I used was
makeblastdb -in CrFP.fasta -out CrFP

Thanks

maubp 09-10-2012 03:30 AM

That error is clear, isn't it? You have to tell makeblastdb whether your FASTA file is protein or nucleotide, i.e. either:

Code:

makeblastdb -in CrFP.fasta -out CrFP -dbtype nucl
or,

Code:

makeblastdb -in CrFP.fasta -out CrFP -dbtype prot
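
If it is not obvious which -dbtype a file needs, a quick heuristic can help. The sketch below is an assumption-laden illustration rather than anything makeblastdb does itself: it counts residues that are not A, C, G, T, U or N and reports "prot" when they make up more than 10% of the sequence. Both the function name guess_dbtype and the 10% threshold are arbitrary choices made for this example.

```shell
# Sketch only: guess_dbtype is a hypothetical helper, not part of BLAST+.
# Prints "prot" or "nucl" for FASTA given as files or on standard input.
# Returns a non-zero status if no sequence residues were found.
guess_dbtype() {
    awk '
        /^>/ { next }                       # skip header lines
        {
            seq = toupper($0)
            gsub(/[^A-Z]/, "", seq)         # keep letters only
            total += length(seq)
            gsub(/[ACGTUN]/, "", seq)       # drop nucleotide letters
            other += length(seq)            # what remains is not nucleotide-like
        }
        END {
            if (total == 0) exit 1
            if (other / total > 0.1) print "prot"; else print "nucl"
        }
    ' "$@"
}

# Possible use, assuming makeblastdb is on the PATH:
#   makeblastdb -in CrFP.fasta -out CrFP -dbtype "$(guess_dbtype CrFP.fasta)"
```

A short or low-complexity protein could still fool this check, so it is worth looking at the file by eye as well.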

Tsuyoshi 09-10-2012 03:38 AM

YES!
What a stupid mistake I made. It succeeded now!

Thank you!

maubp 09-10-2012 03:41 AM

Quote:

Originally Posted by Tsuyoshi (Post 83614)
It succeeded now!

Oh good. Understanding the NCBI BLAST+ error messages gets easier with practice ;)

Tsuyoshi 09-10-2012 03:45 AM

YEAP!

I couldn't agree with you more. Many thanks!

Tsuyoshi 09-10-2012 04:02 AM

Hi Maubp,
I still have a question about the protein IDs. It seems that no database names proteins in that way. Taking several proteins as examples, they are:

C_1620015|156900
C_10830001|152917
C_2020008|159281
C_510029|166481
C_510029|166481
C_510029|166481
C_510029|166481

I do not think they are NCBI accession numbers for Chlamydomonas, but I want to identify their real NCBI accession numbers. Do you have any idea how to do that?

maubp 09-10-2012 04:09 AM

That's a different question - the only way your sequences would have real NCBI accession numbers would be if they have already been submitted to one of the databases. I would explore the NCBI databases for this using Entrez search term "chlamydomonas[orgn]" and see if anything matches your dataset:

http://www.ncbi.nlm.nih.gov/sites/gq...=chlamydomonas[orgn]
(square brackets in the URL confuse the forum software)

Or you could try BLAST'ing some of your sequences against the NR database to see if any give perfect matches?
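
The "perfect matches" idea can be sketched as a filter over tabular BLAST output. This assumes the search was run with -outfmt "6 qseqid sacc pident length qlen slen", so that the six columns are query ID, subject accession, percent identity, alignment length, query length and subject length. The function name perfect_matches and the file name hits.tsv are made up for this example.

```shell
# Sketch only: perfect_matches is a hypothetical helper, not part of BLAST+.
# Input: tab-separated lines with the columns
#   qseqid  sacc  pident  length  qlen  slen
# Output: "query ID <TAB> subject accession" for hits that are 100% identical
# and whose alignment covers the full length of both query and subject.
perfect_matches() {
    awk -F '\t' '$3 == 100 && $4 == $5 && $4 == $6 { print $1 "\t" $2 }' "$@"
}

# Possible use, assuming blastp is on the PATH and a remote search is acceptable:
#   blastp -query CrFP.fasta -db nr -remote \
#          -outfmt "6 qseqid sacc pident length qlen slen" -out hits.tsv
#   perfect_matches hits.tsv
```

Each output line pairs one of the private identifiers with a candidate NCBI accession. Requiring the alignment to span the whole subject is deliberately strict; dropping the `$4 == $6` condition would also accept hits where the query is contained in a longer database entry.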

Tsuyoshi 09-10-2012 04:12 AM

Quote:

Originally Posted by maubp (Post 83619)
That's a different question - the only way your sequences would have real NCBI accession numbers would be if they have already been submitted to one of the databases. I would explore the NCBI databases for this using Entrez search term "chlamydomonas[orgn]" and see if anything matches your dataset:

http://www.ncbi.nlm.nih.gov/sites/gq...=chlamydomonas[orgn]

Or you could try BLAST'ing some of your sequences against the NR database to see if any give perfect matches?

The sequences themselves perfectly match the submitted Chlamydomonas data. I just have no idea what kind of IDs the authors used.

maubp 09-10-2012 04:14 AM

If you can work out how to get the data from the NCBI with their accessions, that might be simpler than working with the original author's private identifiers.

Tsuyoshi 09-10-2012 04:22 AM

That's right.
Anyway, I will try to extract the accession numbers from NCBI.
Thank you very much, Maubp!


All times are GMT -8.