SEQanswers

08-05-2016, 02:41 AM   #1
Jluis
Member
 
Location: Bilbao

Join Date: Apr 2012
Posts: 44
Issue with FASTA header in QIIME

Dear all,

I have to analyze a set of 26 samples of 16S amplicon data, coming from 250 nt paired-end Illumina HiSeq reads. When I received the sequences they had already been demultiplexed, merged and converted into FASTA format. I have no access to the barcode and primer sequences, since the commercial provider who performed the sequencing refuses to provide them (they say it is confidential information).

After extensively reading the QIIME documentation and multiple forum questions about how to analyze this kind of sequence data, I'm afraid I'm one step beyond in the difficulty of this issue (or one step behind by not understanding the information I read... we will see).

I face 2 main problems:

1) The FASTA header of the sequences.

The current header has this format:

>Sample_Name tagX (Where X is the number of each consecutive tag from 1 to N)

After reading the add_qiime_labels documentation (http://qiime.org/scripts/add_qiime_labels.html) I understand that my header is completely different from that in the examples:

>Sample.1_0 FLP3FBN01ELBSX length=250 xy=1766_0111 region=1 run=R_2008_12_09_13_51_01_ AACAGATTAGACCAGATTAAGCCGAGATTTACCCGA

And I have no means of obtaining all the information lacking in my headers.


2) How to create a functional mapping file for QIIME, taking into account my current FASTA headers.

I guess this second issue can be fixed easily once the first one is fixed.

Thanks in advance.


JL
08-05-2016, 07:59 AM   #2
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,147

JL,

It appears that your service provider has already done all this work for you.

- You do not need to have the barcode sequences because they have already demultiplexed the reads.

- You probably do not need the primer sequences, because it is likely they already trimmed the primers as part of the merging process. If they did not state explicitly whether or not the primer sequences were trimmed, ask them. This is essential for you to know.

- The header format they provided you is nearly what you need; just change

Code:
>Sample_Name tagX
to
>Sample_Name_X
[Honestly QIIME may be perfectly happy with the format of the FASTA deflines already in the file. I don't use QIIME so can't say for sure.]

- All the other stuff on the example defline in the QIIME manual is worthless. The example is from a Roche 454 GS-FLX read which is a dead platform.
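
The header change described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: it assumes every defline looks exactly like ">Sample_Name tagX" as described in the first post, and the function names and file paths are hypothetical. It also goes one step further than the suggestion above by replacing underscores inside the sample name with periods, on the assumption (raised later in this thread) that QIIME treats the underscore as the separator between sample ID and read number.

```python
import re

# Assumed input defline shape, taken from the thread: ">Sample_Name tagX"
HEADER_RE = re.compile(r"^>(\S+)\s+tag(\d+)\s*$")


def relabel_header(line):
    """Turn '>Sample_Name tagX' into '>Sample.Name_X'; return other lines unchanged."""
    match = HEADER_RE.match(line)
    if match is None:
        return line
    sample, number = match.groups()
    # Assumption: underscores in the sample name are replaced by periods so the
    # only underscore left separates the sample ID from the read number.
    return ">{}_{}".format(sample.replace("_", "."), number)


def relabel_fasta(in_path, out_path):
    """Copy a FASTA file, rewriting deflines and leaving sequence lines as they are."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(relabel_header(line.rstrip("\n")) + "\n")
```

Calling relabel_fasta("merged.fasta", "seqs.fna") (both file names are placeholders) would write the relabelled copy. Deflines that do not match the assumed pattern pass through untouched, so it is worth checking the output for any that were missed.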
08-10-2016, 02:45 AM   #3
Jluis
Member
 
Location: Bilbao

Join Date: Apr 2012
Posts: 44

Dear kmcarr,

Thank you very much for your answer!
I'm currently on holidays, but I will try to test your solution as soon as I get back to work.

Best

JL
08-10-2016, 08:21 AM   #4
thermophile
Senior Member
 
Location: CT

Join Date: Apr 2015
Posts: 233

Here is how I'm handling demultiplexed data from a MiSeq (I think it should be very similar to HiSeq as far as headers go). Be aware that QIIME uses _ as a field delimiter, so you can't have any in your sample name.

https://github.com/krmaas/bioinforma...me.process.txt

I'm not a fan of QIIME, so my script only gets you to the beginning of the clustering process. If you are just starting out with this kind of analysis, I think mothur is much better documented, which makes it easier to learn. Plus, mothur does fully de novo clustering, as opposed to QIIME's approach of closed-reference clustering followed by de novo clustering of the reads that don't match. Clustering your data by two methods based on an incomplete reference is sketchy.
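
For the mapping-file question in the first post, a minimal sketch follows. It is an illustration and not the script linked above. It assumes the sequence labels already have the form SampleID_X with no underscore inside the sample ID, that the four column names are the standard QIIME 1 mapping-file columns, and that leaving the barcode and primer columns empty is acceptable for data that was demultiplexed by the provider (QIIME's mapping-file validator may need to be told this explicitly). All function names are hypothetical.

```python
import re

# Assumption: sample IDs may contain only letters, digits and periods.
SAMPLE_ID_RE = re.compile(r"^[A-Za-z0-9.]+$")


def fasta_labels(path):
    """Yield the first word of every defline in a FASTA file, without the '>'."""
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                yield line[1:].split()[0]


def sample_ids_from_labels(labels):
    """Return the sorted unique sample IDs from labels such as 'Sample.Name_12'."""
    ids = set()
    for label in labels:
        sample_id, sep, _ = label.rpartition("_")
        if not sep or not SAMPLE_ID_RE.match(sample_id):
            raise ValueError("unexpected sequence label: {!r}".format(label))
        ids.add(sample_id)
    return sorted(ids)


def mapping_file_text(sample_ids):
    """Build tab-separated mapping-file text with empty barcode and primer columns."""
    header = "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tDescription"
    rows = ["{}\t\t\t{}".format(s, s) for s in sample_ids]
    return "\n".join([header] + rows) + "\n"
```

Writing mapping_file_text(sample_ids_from_labels(fasta_labels("seqs.fna"))) to a file (the file name is a placeholder) gives one row per sample. A label whose sample ID still contains an underscore raises a ValueError, which is a quick way to catch the problem mentioned at the top of this post before running anything in QIIME.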