  • Protein ID that blast could not identify

    Hi,
    I downloaded a proteome in FASTA format, which contains hundreds of proteins (http://labs.umassmed.edu/chlamyfp/in...p?content=help). I want to BLAST my data against these proteins using BLAST+. However, when I ran makeblastdb on the proteome dataset, an error occurred:
    *******************************************************************
    Error: NCBI C++ Exception:
    "/am/ncbiapdata/release/blast/src/2.2.26/IntelMAC-universal/c++/GCC401-ReleaseMT--IntelMAC-universal/../src/objects/seq/../seqloc/Seq_id.cpp", line 1679: Error: ncbi::objects::CSeq_id::x_Init() - Unsupported ID type C_1150005
    *******************************************************************
    I think there must be something wrong with the proteome data, because BLAST+ worked fine when I used data downloaded directly from NCBI.

    Therefore, I opened the proteome file in TextEdit. The header of each sequence looks like this, for example:
    *****************************************************************
    >C_680011|168600 FAP45, Flagellar Associated Protein Weakly Similar to Nasopharyngeal Epithelium Specific Protein 1
    MPQTPPRSGGYRSGKQSYVDESLFGGSKRTGAAQVETLDSLKLTAPTRTISPKDRDVVTLTKGDLTRMLKASPIMTAEDVAAAKREAEAKREQLQAVSKA
    RKEKMLKLEEEAKKQAPPTETEILQRQLNDATRSRATHMMLEQKDPVKHMNQMMLYSKCVTIRDAQIEEKKQMLAEEEEEQRRLDLMMEIERVKALEQYE
    ARERQRVEERRKGAAVLSEQIKERERERIRQEELRDQERLQMLREIERLKEEEMQAQIEKKIQAKQLMEEVAAANSEQIKRKEGMKVREKEEDLRIADYI
    LQKEMREQ
    *****************************************************************

    Here "C_680011|168600" should be the protein ID, I think, but nothing is found when I search for it at NCBI. I wonder what kind of ID it is and what I should do to make BLAST+ recognise it.

    Thanks!

  • #2
    Are you using the -parse_seqids option? If so, try it without this. I only ever use this if my FASTA file identifiers follow the NCBI naming conventions.

    It would be useful to show the command you used to run makeblastdb as that might help us to understand what you are doing.
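
To see which identifiers makeblastdb would be asked to parse, it can help to list the first token of every FASTA header. The sketch below does that, and also writes a copy of the file with the `|` in each identifier replaced by `_`. It rests on an assumption the thread does not confirm: that with -parse_seqids an identifier such as `C_680011|168600` is read as an NCBI-style `db|id` pair, which would explain the "Unsupported ID type" message. The function names `fasta_ids` and `sanitize_ids`, the sample file, and the second record's sequence are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical two-record sample. The first header is taken from the thread;
# the second record pairs an ID from the thread with a placeholder sequence.
workdir=$(mktemp -d)
cat > "$workdir/CrFP_sample.fasta" <<'EOF'
>C_680011|168600 FAP45, Flagellar Associated Protein
MPQTPPRSGGYRSGKQSYVDESLFGGSKRTGAAQVETLDSLKLTAPTRTISPKDRDVVTL
>C_1620015|156900 placeholder description
MKLEEEAKKQAPPTETEILQRQ
EOF

# Print the identifier token of every header line: the text between '>'
# and the first whitespace character.
fasta_ids() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Print the file with every '|' in the identifier token replaced by '_'.
# Sequence lines pass through unchanged. Runs of whitespace in a header
# description collapse to single spaces because awk rebuilds the line.
sanitize_ids() {
    awk '/^>/ { gsub(/[|]/, "_", $1) } { print }' "$1"
}

fasta_ids "$workdir/CrFP_sample.fasta"
sanitize_ids "$workdir/CrFP_sample.fasta" > "$workdir/CrFP_sample.clean.fasta"
```

Renaming identifiers is only one possible workaround. Running makeblastdb without -parse_seqids, as suggested above, leaves the original headers intact.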


    • #3
      Originally posted by maubp
      Are you using the -parse_seqids option? If so, try it without this. I only ever use this if my FASTA file identifiers follow the NCBI naming conventions.

      It would be useful to show the command you used to run makeblastdb as that might help us to understand what you are doing.
      Dear Maubp,
      Thanks for your reply.
      Yes, I used -parse_seqids. Following your suggestion, I ran it without -parse_seqids, and another error showed up:
      *******************************************************************
      Error: (CArgException::eNoArg) Argument "dbtype". Mandatory value is missing: `String, `nucl', `prot''
      Error: (CArgException::eNoArg) Application's initialization failed
      *****************************************************************

      The command I used was
      makeblastdb -in CrFP.fasta -out CrFP

      Thanks


      • #4
        That error is clear isn't it? You have to tell makeblastdb if your FASTA file is protein or nucleotides. i.e. either:

        Code:
        makeblastdb -in CrFP.fasta -out CrFP -dbtype nucl
        or,

        Code:
        makeblastdb -in CrFP.fasta -out CrFP -dbtype prot
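
If it is unclear which -dbtype value a FASTA file needs, a rough check of its alphabet can help. The sketch below is a heuristic, not something BLAST+ provides: the function name `guess_dbtype`, the 10% threshold, and the nucleotide sample record are all assumptions made for illustration. It counts residues that are not A, C, G, T, U or N and reports "prot" when they make up more than a tenth of the sequence.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)

# Protein sample: header and first sequence line from the first post.
cat > "$workdir/prot.fasta" <<'EOF'
>C_680011|168600 FAP45
MPQTPPRSGGYRSGKQSYVDESLFGGSKRTGAAQVETLDSLKLTAPTRTISPKDRDVVTLTKGDLTRMLKASPIMTAEDVAAAKREAEAKREQLQAVSKA
EOF

# Nucleotide sample: a made-up record, for comparison only.
cat > "$workdir/nucl.fasta" <<'EOF'
>hypothetical_nucleotide_record
ATGCCGCAGACCCCGCCGCGCTCGGGCGGC
EOF

# Guess "prot" or "nucl" from the letters used in the sequence lines.
# Assumption: more than 10% non-ACGTUN letters means protein.
guess_dbtype() {
    awk '
        /^>/ { next }                      # skip header lines
        {
            seq = toupper($0)
            total += length(seq)
            # gsub returns how many non-nucleotide letters it removed
            other += gsub(/[^ACGTUN]/, "", seq)
        }
        END {
            if (total == 0) { print "unknown" }
            else if (other / total > 0.1) { print "prot" }
            else { print "nucl" }
        }
    ' "$1"
}

guess_dbtype "$workdir/prot.fasta"
guess_dbtype "$workdir/nucl.fasta"
```

The result could then be passed on, for example `makeblastdb -in CrFP.fasta -out CrFP -dbtype "$(guess_dbtype CrFP.fasta)"`, although for a file you already know to be a proteome it is simpler to type `-dbtype prot` directly.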


        • #5
          Originally posted by maubp
          That error is clear isn't it? You have to tell makeblastdb if your FASTA file is protein or nucleotides. i.e. either:

          Code:
          makeblastdb -in CrFP.fasta -out CrFP -dbtype nucl
          or,

          Code:
          makeblastdb -in CrFP.fasta -out CrFP -dbtype prot
          YES!
          What a stupid mistake I made. It succeeded now!

          Thank you!


          • #6
            Originally posted by Tsuyoshi
            It succeeded now!
            Oh good. Understanding the NCBI BLAST+ error messages gets easier with practice


            • #7
              Originally posted by maubp
              Oh good. Understanding the NCBI BLAST+ error messages gets easier with practice
              Yep!

              I couldn't agree with you more. Many thanks!


              • #8
                Originally posted by maubp
                Oh good. Understanding the NCBI BLAST+ error messages gets easier with practice
                Hi Maubp,
                I still have a question about the protein IDs. It seems that no database names proteins this way. Taking several proteins as examples, they are:

                C_1620015|156900
                C_10830001|152917
                C_2020008|159281
                C_510029|166481
                C_510029|166481
                C_510029|166481
                C_510029|166481

                I do not think they are NCBI accession numbers for Chlamydomonas, but I want to find their real NCBI accession numbers. Do you have any idea how to do that?


                • #9
                  That's a different question - the only way your sequences would have real NCBI accession numbers would be if they have already been submitted to one of the databases. I would explore the NCBI databases for this using Entrez search term "chlamydomonas[orgn]" and see if anything matches your dataset:


                  (square brackets in the URL confuse the forum software)

                  Or you could try BLAST'ing some of your sequences against the NR database to see if any give perfect matches?
                  Last edited by maubp; 09-10-2012, 03:10 AM. Reason: Trying to fix link
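
One way to check for perfect matches is to request tabular BLAST output that includes the query and subject lengths, then keep only hits with 100% identity that span both sequences end to end. The sketch below filters such a table with awk. The BLAST command in the comment, the function name `perfect_hits`, and every row of the sample table are illustrative assumptions; the subject IDs are placeholders, not real accessions.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)

# Hypothetical tabular output, as if produced for a few sequences by:
#   blastp -query some_proteins.fasta -db nr -remote \
#          -outfmt "6 qseqid sseqid pident length qlen slen" -out hits.tsv
# Columns: query ID, subject ID, percent identity, alignment length,
# query length, subject length. All values below are made up.
printf 'C_680011|168600\tSUBJECT_A\t100.000\t308\t308\t308\n'  > "$workdir/hits.tsv"
printf 'C_1620015|156900\tSUBJECT_B\t98.500\t200\t200\t200\n' >> "$workdir/hits.tsv"
printf 'C_2020008|159281\tSUBJECT_C\t100.000\t150\t200\t150\n' >> "$workdir/hits.tsv"

# Keep hits that are 100% identical and cover the full query and subject,
# then print "private ID <TAB> subject ID" as a simple mapping table.
perfect_hits() {
    awk -F '\t' '$3 == 100 && $4 == $5 && $4 == $6 { print $1 "\t" $2 }' "$1"
}

perfect_hits "$workdir/hits.tsv"
```

In this made-up table only the first row survives: the second hit is not identical and the third covers only part of the query. A remote search against NR is slow, so this approach suits a handful of sequences rather than a whole proteome.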


                  • #10
                    Originally posted by maubp
                    That's a different question - the only way your sequences would have real NCBI accession numbers would be if they have already been submitted to one of the databases. I would explore the NCBI databases for this using Entrez search term "chlamydomonas[orgn]" and see if anything matches your dataset:

                    http://www.ncbi.nlm.nih.gov/sites/gq...=chlamydomonas[orgn]

                    Or you could try BLAST'ing some of your sequences against the NR database to see if any give perfect matches?
                    The sequences themselves match the submitted Chlamydomonas data perfectly. I just have no idea what kind of IDs the authors used.


                    • #11
                      If you can work out how to get the data from the NCBI with their accessions, that might be simpler than working with the original author's private identifiers.


                      • #12
                        Originally posted by maubp
                        If you can work out how to get the data from the NCBI with their accessions, that might be simpler than working with the original author's private identifiers.
                        That's right.
                        Anyway, I will try to extract the accession numbers from NCBI.
                        Thank you very much Maubp !
