SEQanswers

Old 07-31-2014, 02:59 AM   #1
gprakhar
Member
 
Location: India

Join Date: Aug 2010
Posts: 78
Question Regarding Qiime Metadata Mapping File

Hello,

Library specs: Paired End, Read length 150 bp, V3 region 16S rRNA gene
Platform : MiSeq, Illumina
Experiment : Wheat Field, rhizosphere samples, Elevated CO2 and temperature
Computational platform : AWS EC2, Qiime 1.8.0

I am a Qiime newbie. I have 39 samples in total (13 x 3): 12 treatments and 1 control, each with 3 replicates.

According to the Qiime documentation, the metadata mapping file requires a sample ID, a barcode sequence, a primer sequence and a description.

The sequencing was done by a commercial provider, who refuses to provide the barcode sequences.

Ques1: What should I use as the Sample ID? Does it have to be a part of the read name?

Ques2: For beta diversity analysis, I would like the 3 replicates of each treatment pooled. How should the mapping file be constructed for this,
given that I do not have the barcode sequences?

Any help / pointers / comments are appreciated.

--
pg
Old 08-04-2014, 02:43 AM   #2
gprakhar
Member
 
Location: India

Join Date: Aug 2010
Posts: 78
Default

bump.

For a MiSeq V3 data set with multiple samples (3 replicates per sample), where barcodes were used but their sequences are not available, how do I create a metadata file so that Qiime can associate the samples with the corresponding SampleID in the file?
Old 08-04-2014, 03:38 AM   #3
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076
Default

Quote:
Originally Posted by gprakhar View Post
Hello,

The sequencing was done by a commercial provider, who refuses to provide the barcode sequences.

Any help / pointers / comments are appreciated.

--
pg
That is odd indeed. Since the barcodes have already done their work of separating the samples, could you use a subset of the Illumina barcode list (or any other codes, for that matter) to go forward?

I assume these sequences were demultiplexed on the MiSeq and you do not have the barcodes available in Fastq ID header.
Old 08-04-2014, 06:01 AM   #4
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178
Default

Quote:
Originally Posted by GenoMax View Post
I assume these sequences were demultiplexed on the MiSeq and you do not have the barcodes available in Fastq ID header.
Unfortunately, no, you would not have the index sequences written in the FASTQ definition line. The MiSeq output only includes the index number (an integer from 1 to N, where N is the number of libraries listed in the sample sheet) in the read definition line. This differs from the behavior of CASAVA/Bcl2fastq, which includes the actual index read in the definition line. Why does Illumina do this? No clue.
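To check which of the two behaviors a given file shows, one can look at the last colon-separated field of the definition line. The following is a minimal Python sketch, assuming the CASAVA 1.8-style layout in which a space is followed by read:filter:control:index; the function name and both example headers are made up for illustration.

```python
def parse_illumina_header(header):
    """Split a CASAVA 1.8-style FASTQ definition line into its fields."""
    if not header.startswith("@"):
        raise ValueError("FASTQ definition lines start with '@'")
    # Everything before the first space identifies the read; the rest is the comment.
    read_id, _, comment = header[1:].strip().partition(" ")
    fields = comment.split(":")
    if len(fields) != 4:
        raise ValueError("expected read:filter:control:index after the space")
    read_number, is_filtered, control_bits, index_field = fields
    return {
        "read_id": read_id,
        "read_number": int(read_number),
        "is_filtered": is_filtered == "Y",
        "index": index_field,
        # Digits only suggests a sample number; letters suggest the index read itself.
        "index_is_sample_number": index_field.isdigit(),
    }


# Hypothetical headers, not taken from any real run.
miseq = parse_illumina_header("@M00001:12:000000000-A1B2C:1:1101:15589:1332 1:N:0:7")
hiseq = parse_illumina_header("@HWI-ST001:100:C1ABCACXX:3:1101:1234:2000 1:N:0:ACAGTG")
print(miseq["index"], miseq["index_is_sample_number"])
print(hiseq["index"], hiseq["index_is_sample_number"])
```

If the field is a plain integer, the barcode sequence cannot be recovered from the FASTQ file alone.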
Old 08-04-2014, 06:06 AM   #5
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076
Default

That is what I figure has happened. I just wanted to confirm.

MiSeq is meant to be a sequencing "appliance" with minimal "user serviceable" parts so I assume things are kept simple.

I do not understand why the provider would not make the barcodes available (it's not like they are a state secret).
Old 08-04-2014, 07:47 AM   #6
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178
Default

Quote:
Originally Posted by GenoMax View Post
I do not understand why the provider would not make the barcodes available (it's not like they are a state secret).
This may just be a communication breakdown. The service provider may mean that the MiSeq software does not report the index sequence for each read (as the HiSeq does), so they simply do not have that data to provide.

I will also add my (completely unsolicited, so feel free to ignore it) 2 cents about Qiime and MiSeq data. I often encounter researchers who want to faithfully reproduce the pipeline in the Qiime tutorial, which assumes the input data still requires demultiplexing and trimming of primers and inline barcodes. That pipeline was designed in the era of 454 data; this isn't the case for MiSeq data. MiSeq data is already demultiplexed; the Illumina sequencing methodology places the index in a separate read, not in your sequence read, so there is no need to trim barcodes. Depending on the method used to generate your 16S amplicons, there may also be no need to trim PCR primer sequences: when the sequencing primers are the same as the PCR primers, no part of the PCR primer ends up in your final read (e.g. the Caporaso & Knight method and the Schloss method).

Qiime is a great tool for studying bacterial community diversity but just be aware that all of these pre-processing steps were designed around a different type of input data (e.g. 454). Instead of trying to shoehorn MiSeq data into this pipeline, you need to adjust your pre-processing steps to the standard output of the MiSeq.
Old 08-04-2014, 11:42 PM   #7
gprakhar
Member
 
Location: India

Join Date: Aug 2010
Posts: 78
Default

Quote:
Originally Posted by GenoMax View Post
That is odd indeed. Since the barcodes have already done their work of separating the samples, could you use a subset of the Illumina barcode list (or any other codes, for that matter) to go forward?


Hello,

I assume these sequences were demultiplexed on the MiSeq and you do not have the barcodes available in Fastq ID header.
So that means that in the Qiime mapping file I can use any barcode sequence, as long as it is unique for each sample and the same for all replicates of a sample?

According to my understanding of the Qiime pre-processing, the barcode sequences are used by split_libraries.py to separate out the samples. Is that correct?

Regards,

Last edited by gprakhar; 08-05-2014 at 12:14 AM.
Old 08-04-2014, 11:46 PM   #8
gprakhar
Member
 
Location: India

Join Date: Aug 2010
Posts: 78
Default

Quote:
Originally Posted by kmcarr View Post
Unfortunately, no, you would not have the index sequences written in the FASTQ definition line. The MiSeq output only includes the index number (an integer from 1 to N, where N is the number of libraries listed in the sample sheet) in the read definition line. This differs from the behavior of CASAVA/Bcl2fastq, which includes the actual index read in the definition line. Why does Illumina do this? No clue.
I am still not clear about what exactly I should use as the SampleID in the Qiime mapping file.
From this post I assume that a part of the FASTQ header can be used to identify the samples uniquely. Is that so?
And in the case of replicates, does the SampleID remain the same?
Old 08-04-2014, 11:53 PM   #9
gprakhar
Member
 
Location: India

Join Date: Aug 2010
Posts: 78
Default

Quote:
Originally Posted by GenoMax View Post
That is what I figure has happened. I just wanted to confirm.

MiSeq is meant to be a sequencing "appliance" with minimal "user serviceable" parts so I assume things are kept simple.

I do not understand why the provider would not make the barcodes available (it's not like they are a state secret).
The sequencing was done by a third-party sequencing provider.
I asked them for (1) the adapter sequences, (2) the barcodes and (3) the primer sequences, for assembling the paired-end reads.

In response, the provider sent an FAQ document containing:
(1) the V3 primer sequences, both F & R
(2) a link to the Illumina chemistry documentation for the adapter sequences
(but no TruSeq version, so I am still not clear which adapters to use, as that document covers about 5 different TruSeq versions)
(3) as for the barcodes,
Quote:
2. What is Barcode sequence used ?
The bar code sequences are proprietary sequences and are unable to provide it.
Old 08-05-2014, 12:13 AM   #10
gprakhar
Member
 
Location: India

Join Date: Aug 2010
Posts: 78
Default

Quote:
Originally Posted by kmcarr View Post
This may just be a communication breakdown. The service provider may mean that the MiSeq software does not report the index sequence for each read (as the HiSeq does), so they simply do not have that data to provide.

I will also add my (completely unsolicited, so feel free to ignore it) 2 cents about Qiime and MiSeq data. I often encounter researchers who want to faithfully reproduce the pipeline in the Qiime tutorial, which assumes the input data still requires demultiplexing and trimming of primers and inline barcodes. That pipeline was designed in the era of 454 data; this isn't the case for MiSeq data. MiSeq data is already demultiplexed; the Illumina sequencing methodology places the index in a separate read, not in your sequence read, so there is no need to trim barcodes. Depending on the method used to generate your 16S amplicons, there may also be no need to trim PCR primer sequences: when the sequencing primers are the same as the PCR primers, no part of the PCR primer ends up in your final read (e.g. the Caporaso & Knight method and the Schloss method).

Qiime is a great tool for studying bacterial community diversity but just be aware that all of these pre-processing steps were designed around a different type of input data (e.g. 454). Instead of trying to shoehorn MiSeq data into this pipeline, you need to adjust your pre-processing steps to the standard output of the MiSeq.
Hello,

I do understand that the pre-processing for MiSeq paired-end data is different.
For my data I first assemble the paired-end reads using PANDAseq, so there is no need for split_libraries.py.

As mentioned in the first post, I have multiple samples, with 3 replicates per sample.
From my understanding of Qiime, I should be able to process all the samples together in a single run, and I assume the mapping file holds the key to this.
Since I do not have the barcodes, I am confused about how to create the mapping file.

As per GenoMax's reply,
would this be achievable with any barcode sequence, with the unique SampleID coming from the FASTQ header?
Old 08-05-2014, 06:55 AM   #11
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076
Default

Quote:
Originally Posted by gprakhar View Post
I am still not clear about what exactly I should use as the SampleID in the Qiime mapping file.
From this post I assume that a part of the FASTQ header can be used to identify the samples uniquely. Is that so?
And in the case of replicates, does the SampleID remain the same?
I am not a Qiime expert, but the following seems logical. kmcarr (or someone else more knowledgeable) can correct the info.

You should use the sampleID you have for the samples, as shown in the example here: http://qiime.org/1.6.0/documentation...-file-overview. Be aware that the sampleID that you use to make the file would have to be added to the demultiplexed data files, as shown in the example (see the "Handling already demultiplexed samples" section). I am not certain whether you can create and use your mapping file without the "barcodes/primers" (see the linked doc). If you can, you would not need to worry about barcodes.
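One practical way to get a sampleID for each demultiplexed file is to derive it from the file name. The sketch below is only an illustration and rests on two assumptions: that the provider kept the usual MiSeq naming pattern (SampleName_S1_L001_R1_001.fastq.gz), and that QIIME 1 sample IDs should contain only letters, digits and periods. The function name and the example file names are hypothetical.

```python
import re

# Assumed MiSeq-style pattern: <SampleName>_S<n>_L<lane>_R<read>_<chunk>.fastq[.gz]
MISEQ_NAME = re.compile(r"^(?P<name>.+)_S\d+_L\d{3}_R[12]_\d{3}\.fastq(\.gz)?$")
# Fallback: just strip a common sequence-file extension.
EXTENSION = re.compile(r"\.(fastq|fq|fasta|fna)(\.gz)?$")


def sample_id_from_filename(filename):
    """Derive a sample ID made of letters, digits and periods from a file name."""
    match = MISEQ_NAME.match(filename)
    name = match.group("name") if match else EXTENSION.sub("", filename)
    # Replace every run of other characters (underscores, dashes, spaces) with a period.
    sample_id = re.sub(r"[^A-Za-z0-9.]+", ".", name).strip(".")
    if not sample_id:
        raise ValueError(f"cannot derive a sample ID from {filename!r}")
    return sample_id


print(sample_id_from_filename("T01_rep2_S5_L001_R1_001.fastq.gz"))
print(sample_id_from_filename("control-3.fasta"))
```

Each replicate keeps its own ID here (for example T01.rep1, T01.rep2, T01.rep3); grouping replicates is better left to a separate metadata column than to a shared sample ID.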
Old 08-05-2014, 06:56 AM   #12
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076
Default

Quote:
Originally Posted by kmcarr View Post

I will also add my (completely unsolicited, so feel free to ignore it) 2 cents about Qiime and MiSeq data. I often encounter researchers who want to faithfully reproduce the pipeline in the Qiime tutorial, which assumes the input data still requires demultiplexing and trimming of primers and inline barcodes. That pipeline was designed in the era of 454 data; this isn't the case for MiSeq data. MiSeq data is already demultiplexed; the Illumina sequencing methodology places the index in a separate read, not in your sequence read, so there is no need to trim barcodes. Depending on the method used to generate your 16S amplicons, there may also be no need to trim PCR primer sequences: when the sequencing primers are the same as the PCR primers, no part of the PCR primer ends up in your final read (e.g. the Caporaso & Knight method and the Schloss method).

Qiime is a great tool for studying bacterial community diversity but just be aware that all of these pre-processing steps were designed around a different type of input data (e.g. 454). Instead of trying to shoehorn MiSeq data into this pipeline, you need to adjust your pre-processing steps to the standard output of the MiSeq.
I agree completely. The Qiime folks should redo this part of the pipeline to account for the switch in predominant sequencing technology from 454 to Illumina. Everyone seems to have to do these transformations just to get their data into Qiime.
Old 08-05-2014, 02:52 PM   #13
id0
Senior Member
 
Location: USA

Join Date: Sep 2012
Posts: 130
Default

Quote:
Originally Posted by GenoMax View Post
I agree completely. The Qiime folks should redo this part of the pipeline to account for the switch in predominant sequencing technology from 454 to Illumina. Everyone seems to have to do these transformations just to get their data into Qiime.
To give Qiime developers some credit, they are making progress in that regard. The latest version 1.8 added join_paired_ends.py and extract_barcodes.py scripts, which is a significant step forward.
Old 10-16-2014, 07:23 PM   #14
ETWang
Junior Member
 
Location: GA

Join Date: Oct 2014
Posts: 1
Default

Quote:
Originally Posted by gprakhar View Post
Hello,

Library specs: Paired End, Read length 150 bp, V3 region 16S rRNA gene
Platform : MiSeq, Illumina
Experiment : Wheat Field, rhizosphere samples, Elevated CO2 and temperature
Computational platform : AWS EC2, Qiime 1.8.0

I am a Qiime newbie. I have 39 samples in total (13 x 3): 12 treatments and 1 control, each with 3 replicates.

According to the Qiime documentation, the metadata mapping file requires a sample ID, a barcode sequence, a primer sequence and a description.

The sequencing was done by a commercial provider, who refuses to provide the barcode sequences.

Ques1: What should I use as the Sample ID? Does it have to be a part of the read name?

Ques2: For beta diversity analysis, I would like the 3 replicates of each treatment pooled. How should the mapping file be constructed for this,
given that I do not have the barcode sequences?

Any help / pointers / comments are appreciated.

--
pg
Not sure if gprakhar has solved this issue. I am new to both QIIME and sequencing, so I think I understand the situation.

The problem is that on the MiSeq platform the machine has already demultiplexed the samples, so a genomics facility will usually provide users only with separate, demultiplexed sample files. The Illumina adapter sequences and barcode sequences (I mean the barcodes you provided to Illumina in the sample sheet) have already been removed from the sequences in those per-sample files. Therefore, even if the sequencing provider gave you the barcode sequences, they would not help with your actual problem, which is getting the data analyzed in QIIME.

For QIIME to analyze your data, all your sequences have to be in one FASTA file, and every sequence needs a unique label that begins with its sample ID. E.g. sequence No. 1 in sample.1 should have a label like sample.1_1. It is easy to combine sequences from different samples into one file, but it is a little tricky to rename all the sequences according to this rule.
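To make the labelling rule concrete, here is a minimal Python sketch of the idea: read each per-sample FASTA, prefix every read with its sample ID plus a running counter, and write everything to one output. It is not QIIME's own implementation; the helper names, the sample IDs and the choice to start the counter at 0 are assumptions made for illustration.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def combine_with_labels(handles_by_sample, out):
    """Write all reads to one FASTA, labelling each as <SampleID>_<counter>."""
    counter = 0  # running counter across all samples keeps every label unique
    for sample_id, handle in handles_by_sample.items():
        for original_header, sequence in read_fasta(handle):
            # Keep the original read name after a space, for traceability.
            out.write(f">{sample_id}_{counter} {original_header}\n{sequence}\n")
            counter += 1
    return counter


# In-memory stand-ins for two per-sample FASTA files (made-up reads).
inputs = {
    "T01.rep1": io.StringIO(">readA\nACGT\n>readB\nGGCC\n"),
    "T01.rep2": io.StringIO(">readC\nTTAA\n"),
}
combined = io.StringIO()
total = combine_with_labels(inputs, combined)
print(combined.getvalue())
```

With real data, the StringIO objects would be replaced by opened files, one per sample.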

Fortunately, I just found that QIIME has a script for this, at least in the latest version (1.8.0): add_qiime_labels.py. You can check the QIIME documentation on how to use it.

A little more on the mapping file. In this case, we do not have to provide barcode or LinkerPrimer sequences in the mapping file. When you check the mapping file with validate_mapping_file.py, add -p -b at the end and the script will not check for barcodes or linker primers.
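As an illustration of such a mapping file for the 39 samples in the first post, the Python sketch below writes a tab-separated table with empty barcode and primer fields. The Treatment and InputFileName columns, the sample naming scheme and the file names are assumptions of mine, not something taken from the QIIME documentation; whether empty fields are accepted should be confirmed with the validation step described above.

```python
import csv
import io

COLUMNS = ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence",
           "Treatment", "InputFileName", "Description"]


def build_mapping_rows(treatments, replicates=3):
    """Return one row per replicate; replicates share a Treatment value."""
    rows = []
    for treatment in treatments:
        for rep in range(1, replicates + 1):
            sample_id = f"{treatment}.rep{rep}"  # unique per replicate
            rows.append({
                "#SampleID": sample_id,
                "BarcodeSequence": "",       # left empty: reads are already demultiplexed
                "LinkerPrimerSequence": "",  # left empty: no primer trimming planned
                "Treatment": treatment,
                "InputFileName": f"{sample_id}.fasta",  # hypothetical file name
                "Description": f"{treatment} replicate {rep}",
            })
    return rows


# 12 treatments plus 1 control, 3 replicates each (labels are placeholders).
treatments = ["Control"] + [f"T{n:02d}" for n in range(1, 13)]
rows = build_mapping_rows(treatments)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
mapping_text = buffer.getvalue()
print(mapping_text.splitlines()[0])
print(len(rows))
```

Keeping one row per replicate, with a shared Treatment value, leaves both options open later: analyzing replicates separately or grouping them by treatment.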

Regarding your question 2, it is actually pretty easy. Since I have already typed a lot, I'll stop here.
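For readers wondering what pooling replicates could look like, here is one possible reading, sketched in Python: sum the OTU counts of all samples that share a treatment label. This is a sketch of the general idea under my own assumptions, not a QIIME script; the function name, the OTU IDs and all counts are invented. Note that plain summing gives more weight to replicates with more reads.

```python
from collections import defaultdict


def pool_by_category(otu_counts, sample_to_category):
    """Sum per-sample OTU counts over all samples sharing a category."""
    pooled = defaultdict(lambda: defaultdict(int))
    for sample_id, counts in otu_counts.items():
        category = sample_to_category[sample_id]
        for otu_id, count in counts.items():
            pooled[category][otu_id] += count
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {category: dict(counts) for category, counts in pooled.items()}


# Invented counts for three replicates of one treatment and one control sample.
otu_counts = {
    "T01.rep1": {"OTU1": 10, "OTU2": 0},
    "T01.rep2": {"OTU1": 7, "OTU2": 3},
    "T01.rep3": {"OTU1": 5, "OTU2": 1},
    "Control.rep1": {"OTU1": 2, "OTU2": 8},
}
sample_to_category = {
    "T01.rep1": "T01",
    "T01.rep2": "T01",
    "T01.rep3": "T01",
    "Control.rep1": "Control",
}
pooled = pool_by_category(otu_counts, sample_to_category)
print(pooled)
```

The sample-to-treatment lookup is exactly what a Treatment column in the mapping file would provide.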

Last edited by ETWang; 10-16-2014 at 07:26 PM.

Tags
metagenomic analysis, qiime
