  • Issue with FASTA header in QIIME

    Dear all,

    I have to analyze a set of 26 samples of 16S amplicon data from 250 nt paired-end Illumina HiSeq reads. When I received the sequences they were already demultiplexed, merged and converted into FASTA format. I have no access to the barcode and primer sequences, since the commercial provider who performed the sequencing refuses to provide them (they say it is confidential information).

    After extensively reading the QIIME documentation and multiple forum questions about how to analyze this kind of sequence data, I'm afraid I'm one step beyond in the difficulty of this issue (or one step behind by not understanding the information I read... we will see).

    I face 2 main problems:

    1) The FASTA header of the sequences.

    The current header has this format:

    >Sample_Name tagX (Where X is the number of each consecutive tag from 1 to N)

    After reading the add_qiime_labels documentation (http://qiime.org/scripts/add_qiime_labels.html) I understand that my header is completely different from that in the examples:

    >Sample.1_0 FLP3FBN01ELBSX length=250 xy=1766_0111 region=1 run=R_2008_12_09_13_51_01_ AACAGATTAGACCAGATTAAGCCGAGATTTACCCGA

    And I have no means of obtaining all the information lacking in my headers.


    2) How to create a functional mapping file for QIIME, taking into account my current FASTA headers.

    I guess this second issue can be fixed easily if the first issue can be fixed.

    Thanks in advance.


    JL

  • #2
    JL,

    It appears that your service provider has already done all this work for you.

    - You do not need to have the barcode sequences because they have already demultiplexed the reads.

    - You probably do not need the primer sequences, because it is likely they already trimmed the primers as part of the merging process. If they did not state explicitly whether or not primer sequences were trimmed, ask them. This is essential for you to know.

    - The header format they provided you is nearly what you need; just change

    Code:
    >Sample_Name tagX
    to
    >Sample_Name_X
    [Honestly, QIIME may be perfectly happy with the format of the FASTA deflines already in the file. I don't use QIIME, so I can't say for sure.]

    - All the other stuff on the example defline in the QIIME manual is worthless to you. The example is from a Roche 454 GS-FLX read, and 454 is a dead platform.
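
    The header change suggested above can be sketched in a few lines of Python. This is only an illustration, assuming every defline looks exactly like ">Sample_Name tagX" with a single space before "tag"; the function names (relabel_header, relabel_fasta) and the example sample name are made up for this sketch and are not part of QIIME.

```python
import re

# Matches deflines of the form ">Sample_Name tagX" described in the thread.
# Assumption: the sample name contains no whitespace and X is an integer.
HEADER_RE = re.compile(r"^>(\S+)\s+tag(\d+)\s*$")


def relabel_header(line):
    """Turn '>Sample_Name tagX' into '>Sample_Name_X'; leave other lines unchanged."""
    match = HEADER_RE.match(line)
    if match is None:
        return line
    sample, number = match.groups()
    return ">{}_{}".format(sample, number)


def relabel_fasta(lines):
    """Yield FASTA lines (without trailing newlines) with matching deflines relabeled."""
    for line in lines:
        line = line.rstrip("\n")
        yield relabel_header(line) if line.startswith(">") else line


if __name__ == "__main__":
    # Hypothetical two-read example; real input would come from the provider's file.
    example = [">SampleA tag1", "ACGTACGT", ">SampleA tag2", "TTGACCA"]
    for out_line in relabel_fasta(example):
        print(out_line)
```

    To run it on a real file, pass an open file handle to relabel_fasta and write the yielded lines to a new file. Deflines that do not match the pattern are passed through untouched, so it is worth checking the output for any that were missed.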

    • #3
      Dear kmcarr,

      Thank you very much for your answer!
      I'm currently on holidays, but I will try to test your solution as soon as I get back to work.

      Best

      JL

      • #4
        Here is how I'm handling demultiplexed data from a MiSeq (I think it should be very similar to HiSeq as far as headers go). Be aware that QIIME uses _ as a field delimiter, so you can't have any in your sample names.



        I'm not a fan of QIIME, so my script only gets you to the start of the clustering process. If you are just starting out with this kind of analysis, I think mothur is much better documented, which makes it easier to learn. Plus, mothur does fully de novo clustering, as opposed to QIIME's approach of closed-reference clustering followed by de novo clustering of the reads that don't match. Clustering your data by two methods based on an incomplete reference is sketchy.
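
        On the underscore point and the mapping file question from the first post, here is a rough Python sketch. It is not the script mentioned in this post. It assumes QIIME 1 wants sample IDs made only of letters, digits and periods, and that a mapping file needs the #SampleID, BarcodeSequence, LinkerPrimerSequence and Description columns even when the barcode and primer fields are left empty; please check both assumptions against the QIIME documentation. All function names are hypothetical.

```python
import re


def sanitize_sample_id(name):
    """Replace every character that is not a letter, digit, or period with a period."""
    return re.sub(r"[^A-Za-z0-9.]", ".", name)


def sample_ids_from_fasta(lines):
    """Collect sanitized sample IDs from '>Sample_Name tagX' deflines, in first-seen order."""
    seen = []
    for line in lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if not fields:
            continue  # skip an empty defline rather than guess a name
        sample = sanitize_sample_id(fields[0])
        if sample not in seen:
            seen.append(sample)
    return seen


def mapping_file_text(sample_ids):
    """Build a tab-separated mapping file skeleton with empty barcode and primer columns."""
    header = ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence", "Description"]
    rows = ["\t".join(header)]
    for sample_id in sample_ids:
        # Assumption: the sample ID doubles as a placeholder description.
        rows.append("\t".join([sample_id, "", "", sample_id]))
    return "\n".join(rows) + "\n"
```

        If sample names are changed this way, the same sanitized names have to be used in the FASTA labels, otherwise the mapping file and the sequences will not match. The barcode and primer columns are left blank here because the provider did not supply them; whether QIIME's mapping file validation accepts that without extra options is something to confirm in its documentation.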
        Microbial ecologist, running a sequencing core. I have lots of strong opinions on how to survey communities, pretty sure some are even correct.
