SEQanswers

Old 05-09-2013, 09:59 AM   #1
silin284
Member
 
Location: ny

Join Date: Jul 2009
Posts: 23
What are the "molecular indices" in the NEXTflex qRNA kit?

Hi

I just saw the new NEXTflex RNA-Seq kit. They have this interesting "molecular indice" that would label each dsDNA molecule.

http://www.biooscientific.com/Detail...m_medium=email

When I took a closer look, it seems that this "index" is most likely a pool of barcoded adapters (as in Craig et al 2004 Nat Methods). I think the 3' end also matches the TruSeq adapter. Unlike the TruSeq adapter, one additional barcode could be placed before the T overhang. NEXTflex claims to have 9,216 barcodes in each adapter. That is 4x4x4x4x4x3x3. They could synthesize and anneal 9,216 adapters, but I think it is more reasonable to use a barcode of 7 random nucleotides (introduced during oligo synthesis) before the T overhang. To make the adapter, they must be able to anneal it to a complementary oligo. I assume they could use one (or a few) with deoxyinosine.

Does anyone have more information? I am just guessing and could be completely wrong.

Cheers
silin
Old 05-09-2013, 02:40 PM   #2
bbeitzel
Member
 
Location: Ft. Detrick, MD

Join Date: Aug 2008
Posts: 50
Default

I've attached the white paper where they describe the kit. They've commercialized the technique described in 2 of the PNAS papers in the references.

Basically, they have a pool of 96 adapters that have an additional 8-mer barcode downstream of the read 1 and read 2 sequencing primer binding sites. The adapters in the pool are added stochastically to each end of a dsDNA, so you have 96 x 96 = 9216 possible combinations of "molecular barcodes." The barcode combination plus the sequence of the insert allow you to tell if a read is a PCR duplicate (generated during enrichment PCR) or if it is a true duplicate. A PCR duplicate will have the same insert and molecular barcodes. A true duplicate will have the same insert with different molecular barcodes.

The white paper explains it pretty well.
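To make the distinction above concrete, here is a minimal Python sketch of the classification logic. It is not taken from the white paper or from any vendor script. The tuple layout, the function name `classify_duplicates`, and the toy barcode sequences are all assumptions made for illustration.

```python
from collections import defaultdict

def classify_duplicates(reads):
    """Summarise reads per insert.

    reads: iterable of (chrom, start, end, barcode1, barcode2) tuples
    (a hypothetical layout chosen for this sketch).
    """
    # insert coordinates -> barcode pair -> number of reads seen
    by_insert = defaultdict(lambda: defaultdict(int))
    for chrom, start, end, bc1, bc2 in reads:
        by_insert[(chrom, start, end)][(bc1, bc2)] += 1

    summary = {}
    for insert, barcodes in by_insert.items():
        total = sum(barcodes.values())
        molecules = len(barcodes)  # distinct barcode pairs = distinct molecules
        summary[insert] = {
            "reads": total,
            "molecules": molecules,
            "pcr_duplicates": total - molecules,
        }
    return summary

# Toy data: sequences and coordinates are invented.
reads = [
    ("chr1", 100, 250, "AACCGGTT", "TTGGCCAA"),
    ("chr1", 100, 250, "AACCGGTT", "TTGGCCAA"),  # same insert, same barcodes: PCR duplicate
    ("chr1", 100, 250, "ACGTACGT", "TTGGCCAA"),  # same insert, different barcodes: true duplicate
    ("chr1", 300, 450, "AACCGGTT", "TTGGCCAA"),
]
result = classify_duplicates(reads)
combinations = 96 * 96  # 9216 possible barcode pairs
```

In this toy input the insert at chr1:100-250 has three reads but only two distinct barcode pairs, so one read is counted as a PCR duplicate.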
Old 05-09-2013, 03:01 PM   #3
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,135
Default

The thing I'm not seeing in any of their literature is any actual data on the rate of PCR duplication detected using their technique. With these libraries you can directly measure the rate of PCR duplicate reads, so why haven't they reported how many they find? If PCR duplication is a significant source of error in RNA-Seq data sets why not present data showing that? On the other hand if PCR duplication is generally not a significant factor then there is no reason to use their kit.

Hmm...did I just answer my own question?
Old 07-01-2013, 08:39 AM   #4
Bioo Scientific
Registered Vendor
 
Location: Austin, Tx

Join Date: Oct 2009
Posts: 99
Default

Hi Kmcarr,

In the set of experiments we describe in the white paper on the NEXTflex qRNA-Seq Kit, our goal was to compare unique fragments determined by reads with unique start and stop sites versus molecular indexing. Figure 5 illustrates underrepresentation of highly expressed ERCC and selected mRNAs when start and stop sites are used as indicators of unique fragments. When molecular indices are used (blue line), the underrepresentation at the highly expressed genes is clear. We’re happy to share the individual ERCC RNAs and mRNAs studied if anyone is interested. In this study we didn’t focus on PCR duplicates; in fact, we purposely performed the experiment with as few PCR cycles as possible to demonstrate the differences between copy number. We are working on data that describes the benefit of this technology for PCR duplication using ChIP-Seq and other applications where PCR cycles are typically higher than they should be.

Cheers,
Dawn

Last edited by Bioo Scientific; 04-06-2015 at 10:51 AM. Reason: updating URL
Old 07-01-2013, 10:00 AM   #5
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,135
Default

Quote:
Originally Posted by Bioo Scientific View Post
Hi Kmcarr,

In the set of experiments we describe in the white paper on the NEXTflex qRNA-Seq Kit, our goal was to compare unique fragments determined by reads with unique start and stop sites versus molecular indexing. Figure 5 illustrates underrepresentation of highly expressed ERCC and selected mRNAs when start and stop sites are used as indicators of unique fragments. When molecular indices are used (blue line), the underrepresentation at the highly expressed genes is clear. We’re happy to share the individual ERCC RNAs and mRNAs studied if anyone is interested. In this study we didn’t focus on PCR duplicates; in fact, we purposely performed the experiment with as few PCR cycles as possible to demonstrate the differences between copy number. We are working on data that describes the benefit of this technology for PCR duplication using ChIP-Seq and other applications where PCR cycles are typically higher than they should be.

Cheers,
Dawn
As described, the benefit of using this kit is to permit the researcher a sensitive method to distinguish whether fragments with the same start-end positions arose from distinct cDNAs (the "8 reads, 8 fragments" scenario of Figure 2) or from PCR duplication (the "8 reads, 4 fragments" scenario).

In the paragraph at the bottom of page 4 (below Figure 3) it states (emphasis mine),
Quote:
Therefore, when multiple reads mapping to the same transcript are encountered, it is not possible to determine whether sequenced reads originate from the same or different cDNA molecule. As a remedy to this re-sampling problem, many researchers evaluate whether or not each read has the same start and stop mapping coordinates. Reads with identical start and stop positions are usually assumed to be clonal duplicates derived from the same parent molecule.
The problem of distinguishing "reads originate from the same or different cDNA molecule" IS the issue of PCR duplication. This whole study focuses on a protocol to distinguish fragments arising from PCR duplication from fragments arising from distinct but identical cDNAs prior to amplification. The paper then makes the assertion that reads with identical start-end coordinates are "usually assumed to be clonal (i.e. PCR) duplicates". That is a reasonable assumption to make for genomic DNA sequencing but not for RNA-Seq, and it's an assumption I never make for RNA-Seq; in fact I assume just the opposite. I never perform any duplicate removal if I am doing an RNA-Seq experiment involving counting reads.

Both panels in Figure 5 need a third line added showing the "Total reads" for each of the ERCC controls or mRNA species. If the "Total reads" curve is not significantly different than the "Molecular indexing" then using this protocol for RNA-Seq doesn't add much.
Old 12-17-2013, 09:41 AM   #6
cement_head
Senior Member
 
Location: Oxford, Ohio

Join Date: Mar 2012
Posts: 187
Question

Hello,

I have a question regarding the molecular indices and demultiplexing them (maybe I'm missing something, let me know if I am).

If one constructs an RNA-Seq library using this kit, one should be able to count each individual read based on the stochastically attached molecular indices. However, when aligning the reads to the genome or another reference, does one need to demultiplex the molecular indices? Or will they not interfere with the alignment to the reference?

Thanks,
CH
Old 03-14-2014, 01:07 PM   #7
Bioo Scientific
Registered Vendor
 
Location: Austin, Tx

Join Date: Oct 2009
Posts: 99
Default

Hi CH,

Here is the analysis workflow we recommend:

1) quality control
2) sample demultiplexing
3) remove 5' stochastic label and 3' adapter sequence (if any)
4) map to the hg19 RefSeq gene library and filter for reads with MAQ>30
5) add back stochastic labels to mapped reads
6) count the number of reads, the number of unique stochastic labels
7) summary/plot
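
Step 3 of the workflow above (removing the 5' stochastic label before mapping) can be sketched in Python as follows. This is an illustrative sketch, not the vendor's script. It assumes the label is exactly the first 8 bases of the read, as suggested by the 8-mer described earlier in the thread. The function name `split_labels` and the example record are hypothetical.

```python
import io

LABEL_LENGTH = 8  # assumption: the label is the first 8 bases of each read

def split_labels(fastq_handle, label_length=LABEL_LENGTH):
    """Yield (read_id, label, trimmed_seq, trimmed_qual) per 4-line FASTQ record."""
    while True:
        header = fastq_handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, ignored
        qual = fastq_handle.readline().rstrip("\n")
        read_id = header[1:].split()[0]  # drop '@' and any comment after whitespace
        yield read_id, seq[:label_length], seq[label_length:], qual[label_length:]

# Invented single-record FASTQ for demonstration.
example = io.StringIO("@read1 1:N:0\nACGTACGTTTGACCA\n+\nIIIIIIIIHHHHHHH\n")
records = list(split_labels(example))

# Keep the labels keyed by read ID so they can be re-attached after mapping (step 5).
labels_by_read = {read_id: label for read_id, label, _, _ in records}
```

The trimmed sequence and quality string would be written out for the aligner, while `labels_by_read` keeps the information needed to re-associate each label with its read after mapping.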

We can send you a whitepaper that contains additional bioinformatics analysis information if you would like us to.

Regards,
Bioo Scientific
Old 02-26-2015, 10:51 AM   #8
cement_head
Senior Member
 
Location: Oxford, Ohio

Join Date: Mar 2012
Posts: 187
Default

Quote:
Originally Posted by Bioo Scientific View Post
Hi CH,

Here is the analysis workflow we recommend:

1) quality control
2) sample demultiplexing
3) remove 5' stochastic label and 3' adapter sequence (if any)
4) map to the hg19 RefSeq gene library and filter for reads with MAQ>30
5) add back stochastic labels to mapped reads
6) count the number of reads, the number of unique stochastic labels
7) summary/plot

We can send you a whitepaper that contains additional bioinformatics analysis information if you would like us to.

Regards,
Bioo Scientific
Hi,

Do you recommend doing a sequence trim and QC BEFORE building a count table of unique reads?

Another question: It seems to me that an improvement over the above method might be to simply build a count table of unique reads PRIOR to mapping reads to the reference. In other words, if one can identify the unique reads based on the sequence ends and the molecular indices, why not just map those reads and use the RAW count data as the quantitative data? What am I missing? I suggest this approach because it would save the bother of tracking all the reads and then re-adding the stochastic labels.

Thanks,
Andor
Old 10-08-2015, 06:15 PM   #9
danwiththeplan
Member
 
Location: Auckland

Join Date: Sep 2011
Posts: 72
Re-adding stochastic labels

Can I ask for specifics about how, exactly, you re-add the stochastic labels (which I assume are the same as the 8 bp molecular index at the start) to "mapped reads" (= BAM file? Sorted how?)?
Or do you mean simply using the previous FASTQ files without the molecular index removed? If you have a specific script to do this, is it possible to share it? I am happy to develop one myself but no sense re-doing something that's been done already.
Old 10-09-2015, 03:00 PM   #10
kerplunk412
Senior Member
 
Location: Bioo Scientific, Austin, TX, USA

Join Date: Jun 2012
Posts: 119
Default

Hi dan,
You can find a script for analysis of qRNA-Seq data here under the Resources tab. Please email us at [email protected] if you have any questions.
Old 10-09-2015, 03:20 PM   #11
danwiththeplan
Member
 
Location: Auckland

Join Date: Sep 2011
Posts: 72
Thanks for the response...

Thanks for the response, but this does not answer my questions, nor does any of the information in the qRNA-Analysis.pdf white paper. Some of the terminology in the description of the dqRNASeq script is quite unclear and I'm still not clear about several things. The reason I'd prefer to have these questions on SeqAnswers is that other people dealing with the same problem can then see the solution. Specifically I'd like to know:
  • Is BWA the required mapping tool or is Bowtie2 going to work too (a mapping is a mapping, right?)
  • Are BAM files to be used with the dqRNASeq script meant to be sorted in any particular way?
  • Where and when are you supposed to remove the 8 bp molecular index from input FASTQ files, and when you say "stochastic label" is this the same thing as "molecular index"?
  • Where and when are you supposed to use FASTQ files with the molecular index still attached and/or with the molecular index removed? (I'm assuming that the mapping must be done without the molecular index, as it would result in mismatches if it were still attached.)
  • When you mention "add back stochastic labels to mapped reads" what exactly do you mean (add back to the mapped reads in the BAM file? add back to the input FASTQ files? use input FASTQ files that haven't had the molecular index removed?)
If this script is not being maintained that's fine, but please let me know so that I can develop my own. I think the technology of molecular indexing is very useful and I'm keen to get resources into the public domain for easy analysis.
Old 10-09-2015, 03:52 PM   #12
kerplunk412
Senior Member
 
Location: Bioo Scientific, Austin, TX, USA

Join Date: Jun 2012
Posts: 119
Default

I understand and agree with your point about wanting to keep this discussion public. I will consult with my colleagues and post answers to your questions on Monday. I can tell you now that molecular index and stochastic label (STL) are used interchangeably, my apologies for the inconsistency.

We are working on an updated script for analysis of molecular indexing data, but we would value your ideas and input.
Old 10-09-2015, 03:55 PM   #13
danwiththeplan
Member
 
Location: Auckland

Join Date: Sep 2011
Posts: 72
Default

Much appreciated. I think it's a very exciting technology. Easy access to well-documented analysis tools would probably increase uptake.
Old 10-12-2015, 09:28 AM   #14
kerplunk412
Senior Member
 
Location: Bioo Scientific, Austin, TX, USA

Join Date: Jun 2012
Posts: 119
Default

Answers to your questions, as promised:

-Bowtie2 or any other aligner that produces a BAM file is fine.
-Sorting is not necessary.
-Stochastic labels should be removed from the FASTQ prior to alignment (Bowtie also has an option to trim bases as part of the alignment command). The terms stochastic label and molecular index are used interchangeably.
-FASTQ files with the molecular indexes removed should be used for aligning. FASTQ files with the indexes still present should be input into the dqRNASeq script.
-Adding back of stochastic labels is performed by the dqRNASeq script in order to identify true PCR duplicates. Both start/stop site information and stochastic label information are required for proper PCR duplicate removal, which is why both mapped data and molecular index data are necessary.
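
One way to picture the "adding back" step is a lookup by read name, which would also explain why the sort order of the alignments does not matter. The Python sketch below is only an illustration under that assumption and does not reproduce the dqRNASeq script. The function names, the tuple layout for alignments, and the toy data are all hypothetical.

```python
def attach_labels(alignments, labels_by_read):
    """Pair each mapped read with its stochastic label.

    alignments: iterable of (read_id, chrom, start, end) tuples, in any order.
    labels_by_read: dict mapping read_id -> label, built from the untrimmed FASTQ.
    """
    labelled = []
    for read_id, chrom, start, end in alignments:
        label = labels_by_read.get(read_id)
        if label is None:
            continue  # no label recorded for this read, so skip it
        labelled.append((chrom, start, end, label))
    return labelled

def count_unique_molecules(labelled):
    """Count distinct (position, label) combinations."""
    return len(set(labelled))

# Toy data: read names, labels, and coordinates are invented.
labels_by_read = {"read1": "ACGTACGT", "read2": "ACGTACGT", "read3": "TTTTCCCC"}
alignments = [
    ("read2", "chr1", 100, 250),
    ("read1", "chr1", 100, 250),
    ("read3", "chr1", 100, 250),
    ("read4", "chr1", 100, 250),  # not present in the label table
]
labelled = attach_labels(alignments, labels_by_read)
unique_molecules = count_unique_molecules(labelled)
```

Because the join is a dictionary lookup keyed on the read name, the alignments can arrive in any order.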
Old 10-26-2015, 05:05 AM   #15
cement_head
Senior Member
 
Location: Oxford, Ohio

Join Date: Mar 2012
Posts: 187
Default

Quote:
Originally Posted by danwiththeplan View Post
Thanks for the response, but this does not answer my questions, nor does any of the information in the qRNA-Analysis.pdf white paper. Some of the terminology in the description of the dqRNASeq script is quite unclear and I'm still not clear about several things. The reason I'd prefer to have these questions on SeqAnswers is that other people dealing with the same problem can then see the solution. Specifically I'd like to know:
  • Is BWA the required mapping tool or is Bowtie2 going to work too (a mapping is a mapping, right?)
  • Are BAM files to be used with the dqRNASeq script meant to be sorted in any particular way?
  • Where and when are you supposed to remove the 8 bp molecular index from input FASTQ files, and when you say "stochastic label" is this the same thing as "molecular index"?
  • Where and when are you supposed to use FASTQ files with the molecular index still attached and/or with the molecular index removed? (I'm assuming that the mapping must be done without the molecular index, as it would result in mismatches if it were still attached.)
  • When you mention "add back stochastic labels to mapped reads" what exactly do you mean (add back to the mapped reads in the BAM file? add back to the input FASTQ files? use input FASTQ files that haven't had the molecular index removed?)
If this script is not being maintained that's fine, but please let me know so that I can develop my own. I think the technology of molecular indexing is very useful and I'm keen to get resources into the public domain for easy analysis.
We (my group) have almost finished a tool to be used with this technology (the molecular indices, aka STL, approach). We are using a different approach and will have it available shortly; it is currently in beta testing. The tool will be a plug-in for the CLC Genomics Workbench and will be a very easy-to-use, GUI-based tool. Our plug-in will include a tutorial, help files, and example files. We think our approach will be much easier for most biologists and bioinformaticians than the current script. PM for more details.

Last edited by cement_head; 10-26-2015 at 05:08 AM. Reason: clarity, spelling
Old 03-04-2016, 04:44 AM   #16
cement_head
Senior Member
 
Location: Oxford, Ohio

Join Date: Mar 2012
Posts: 187
Default

Quote:
Originally Posted by kmcarr View Post
As described, the benefit of using this kit is to permit the researcher a sensitive method to distinguish whether fragments with the same start-end positions arose from distinct cDNAs (the "8 reads, 8 fragments" scenario of Figure 2) or from PCR duplication (the "8 reads, 4 fragments" scenario).

In the paragraph at the bottom of page 4 (below Figure 3) it states (emphasis mine),


The problem of distinguishing "reads originate from the same or different cDNA molecule" IS the issue of PCR duplication. This whole study focuses on a protocol to distinguish fragments arising from PCR duplication from fragments arising from distinct but identical cDNAs prior to amplification. The paper then makes the assertion that reads with identical start-end coordinates are "usually assumed to be clonal (i.e. PCR) duplicates". That is a reasonable assumption to make for genomic DNA sequencing but not for RNA-Seq, and it's an assumption I never make for RNA-Seq; in fact I assume just the opposite. I never perform any duplicate removal if I am doing an RNA-Seq experiment involving counting reads.

Both panels in Figure 5 need a third line added showing the "Total reads" for each of the ERCC controls or mRNA species. If the "Total reads" curve is not significantly different than the "Molecular indexing" then using this protocol for RNA-Seq doesn't add much.
Nope - You've misinterpreted the paper. It doesn't actually say that - it says that using the USS alone is problematic - but the combination of USS and STL is not.
Old 03-04-2016, 10:19 AM   #17
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,135
Default

Quote:
Originally Posted by cement_head View Post
Nope - You've misinterpreted the paper. It doesn't actually say that - it says that using the USS alone is problematic - but the combination of USS and STL is not.
I stand by my statements. USS would be problematic if duplicate removal was standard practice for RNA-Seq data. It is not.

Second, they failed to present any data demonstrating significant levels of PCR duplication in the data in the first place. What is needed in Fig. 5 is a curve plotting the number of reads mapped for each of the 24 ERCC controls or mRNAs without first doing any de-duplication by either USS alone or USS+STL. Only if that curve were significantly higher than the USS+STL curve (blue line in Fig. 5) could they make a case for adding molecular indexes to RNA-Seq libraries.

What the white paper is attempting to do is to sell a solution without first demonstrating a problem which needs solving.
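
The three-way comparison asked for above (total mapped reads, unique start/stop sites, and start/stop sites plus label) can be sketched in Python as below. It is a hedged illustration, not an analysis of the white paper's data. The record layout, the function name `count_three_ways`, and the toy records are assumptions.

```python
from collections import defaultdict

def count_three_ways(records):
    """Return {transcript: (total_reads, uss_count, uss_stl_count)}.

    records: iterable of (transcript, start, end, label) tuples
    (a hypothetical layout chosen for this sketch).
    """
    total = defaultdict(int)
    uss = defaultdict(set)      # unique start/stop sites only
    uss_stl = defaultdict(set)  # unique start/stop sites plus stochastic label
    for transcript, start, end, label in records:
        total[transcript] += 1
        uss[transcript].add((start, end))
        uss_stl[transcript].add((start, end, label))
    return {t: (total[t], len(uss[t]), len(uss_stl[t])) for t in total}

# Toy records: the transcript name, coordinates, and labels are invented.
records = [
    ("toy_transcript", 10, 110, "AAAAAAAA"),
    ("toy_transcript", 10, 110, "AAAAAAAA"),
    ("toy_transcript", 10, 110, "CCCCCCCC"),
    ("toy_transcript", 20, 120, "AAAAAAAA"),
]
counts = count_three_ways(records)
```

For the toy input, the totals are 4 reads, 2 unique start/stop sites, and 3 start/stop-plus-label combinations. The gap between the first and third numbers is the quantity that a "Total reads" curve would expose.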
Old 03-05-2016, 06:57 AM   #18
cement_head
Senior Member
 
Location: Oxford, Ohio

Join Date: Mar 2012
Posts: 187
Default

Quote:
Originally Posted by kmcarr View Post
I stand by my statements. USS would be problematic if duplicate removal was standard practice for RNA-Seq data. It is not.

Second, they failed to present any data demonstrating significant levels of PCR duplication in the data in the first place. What is needed in Fig. 5 is a curve plotting the number of reads mapped for each of the 24 ERCC controls or mRNAs without first doing any de-duplication by either USS alone or USS+STL. Only if that curve were significantly higher than the USS+STL curve (blue line in Fig. 5) could they make a case for adding molecular indexes to RNA-Seq libraries.

What the white paper is attempting to do is to sell a solution without first demonstrating a problem which needs solving.
I believe you are incorrect. Despite the assumption that any fragment that is identical with respect to USS is likely a PCR duplicate, it turns out that this is not the case. Not removing true PCR duplicates would misrepresent (overestimate) the original (pre-PCR) molecule population in the biological sample. If you are not removing duplicates in an RNA-Seq analysis, then you are relying on statistical tests to do so de facto during the analysis stages, especially in the case of DGE experiments. The approach of using molecular indices (STLs) in combination with USSs clarifies which of the fragments are PCR artefacts and which fragments were originally present in the biological sample.


Here is a link to example data that demonstrates the difference between USS usage and USS + STLs usage: http://www.biooscientific.com/Portal...lar-Labels.pdf

I hope this clears up your queries.

Regards,
Andor
Old 10-30-2017, 12:11 PM   #19
cement_head
Senior Member
 
Location: Oxford, Ohio

Join Date: Mar 2012
Posts: 187
Default

For CLC Genomics Workbench:

https://www.qiagenbioinformatics.com...cular-indexing