SEQanswers





Old 03-05-2016, 06:59 PM   #1
Studentlost
Member
 
Location: Sacramento

Join Date: Oct 2014
Posts: 28
Confused about RG ID LB even after reading all posts/GATK best practices

I am getting conflicting information on how to assign RG ID, LB, PU, SM for an exome analysis I am working with.

Can someone just clarify for me how I should assign the RG please?

If you don't want to read the details, take a look at this table. I just need clarification on how to assign the RG ID, SM, LB for these samples, since they were multiplexed and come from different libraries, with some samples being pooled.

Sample  Technical Replicate  Flow Cell ID  Lane ID  Library
A       1                    AXXX2         1        Group 1
A       2                    AXXX2         2        Group 1
B       1                    AXXX2         1        Group 1
B       2                    DXCX5         1        Group 1
G       1                    AXXX2         1        Group 2
G       2                    DXCX5         1        Group 2

Is this correct?

RG ID:AXXX2.1 SM:A LB:Group_1
RG ID:AXXX2.2 SM:A LB:Group_1
RG ID:AXXX2.1 SM:B LB:Group_1
RG ID:DXCX5.1 SM:B LB:Group_1
RG ID:AXXX2.1 SM:G LB:Group_2
RG ID:DXCX5.1 SM:G LB:Group_2

or this

RG ID:AXXX2.A.1 SM:A LB:Group_1
RG ID:AXXX2.A.2 SM:A LB:Group_1
RG ID:AXXX2.B.1 SM:B LB:Group_1
RG ID:DXCX5.B.1 SM:B LB:Group_1
RG ID:AXXX2.G.1 SM:G LB:Group_2
RG ID:DXCX5.G.1 SM:G LB:Group_2


I'm working with 36 different biological samples that were run with 100 bp PE.

The libraries were pooled in batches of 12, so there are three batches.

Here are the three issues I'm considering.

1) For one of the pooled library batches (Group 2), the first 12 samples were sequenced on two different Flow Cell IDs.

2) For the second pooled library batch (Group 1), the second 12 samples were sequenced on the same Flow Cell Id, but on two separate Lanes.

3) For the last pooled library batch (Group 3), the last 12 samples were sequenced on the same Flow Cell ID, and same Lane ID, but twice (two different runs).

How do I assign an appropriate RG ID, LB, and SM for these samples?

From what I understand:

Each 12 samples from a single batch/group will have the same unifying library id.
The SM is unique to each sample, but since each sample has two technical replicates, I need to differentiate the technical replicates for the same sample in the RG ID.

For the read group ID, I have read two conflicting answers.
The first was that the ID should simply be Flow_Cell_ID:Lane_ID.
The second was that the ID should be Flow_Cell_ID:SM:Lane_ID.

Should the read group ID be unique for each SM, or should it identify only the Flow Cell and Lane ID? As I understand it, the read group is used to recalibrate the data for a sample according to the lane it was run on. Since the samples were multiplexed in groups of 12, wouldn't it be more informative for all samples run on the same flow cell and lane to share a read group ID, in order to increase the corrective power?

Last edited by Studentlost; 03-05-2016 at 07:06 PM.
Old 03-06-2016, 01:16 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,479

An "LB" tag should only ever be associated with one sample. This refers to the physical library made from a sample and has absolutely nothing to do with pooling. The hierarchy is:

SM: A biological sample
LB: A library made from a single biological sample (if a sample has more than one, you have technical replicates)
ID: A single instance of a given library. You might have more than one of these per library if you sequenced it on multiple flow cells or multiple lanes (honestly, I would just merge the lanes these days, though).
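To make the hierarchy concrete, here is a minimal sketch that turns the rows of the table in post #1 into @RG header lines. The naming choices are assumptions for illustration only, not something stated in this thread: the ID is built as flowcell.lane.sample, the library names (lib_A and so on) are hypothetical, PU is set to flowcell.lane, and PL is assumed to be ILLUMINA.

```python
# Rows from the table in post #1: (sample, flow cell, lane).
RUNS = [
    ("A", "AXXX2", 1),
    ("A", "AXXX2", 2),
    ("B", "AXXX2", 1),
    ("B", "DXCX5", 1),
    ("G", "AXXX2", 1),
    ("G", "DXCX5", 1),
]


def rg_line(sample, flowcell, lane):
    """Build one tab-separated @RG header line for a single sequencing run."""
    fields = [
        "@RG",
        f"ID:{flowcell}.{lane}.{sample}",  # assumed scheme; any unique string would do
        f"SM:{sample}",                    # the biological sample
        f"LB:lib_{sample}",                # hypothetical name: one library per sample
        f"PU:{flowcell}.{lane}",           # assumed platform unit: flow cell and lane
        "PL:ILLUMINA",                     # assumed platform
    ]
    return "\t".join(fields)


HEADER_LINES = [rg_line(*run) for run in RUNS]

for line in HEADER_LINES:
    print(line)
```

Each sample ends up with one library and two read groups, which is the nesting described above. A string like this is the kind of value typically passed to an aligner's read group option, but check your own tool's documentation for the exact form it expects.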

Practically speaking, the various tags should be unique. Whether you use this:

Code:
RG ID:AXXX2.1 SM:A LB:1
RG ID:AXXX2.2 SM:A LB:1
RG ID:AXXX2.1 SM:B LB:2
RG ID:DXCX5.1 SM:B LB:2
RG ID:AXXX2.1 SM:G LB:3
RG ID:DXCX5.1 SM:G LB:3
or this:

Code:
RG ID:red SM:A LB:Group_1
RG ID:orange SM:A LB:Group_1
RG ID:yellow SM:B LB:Group_2
RG ID:green SM:B LB:Group_2
RG ID:blue SM:G LB:Group_3
RG ID:purple SM:G LB:Group_3
or some other naming scheme, it doesn't matter. The only thing that matters is the association between, and the nesting of, the tags.
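The nesting rule can be checked mechanically. The following is a rough sketch, not part of GATK or Picard: check_read_groups is a hypothetical helper that flags repeated IDs and any LB shared by more than one SM. It assumes all the read groups listed will end up in the same file or the same joint analysis, which is when repeated IDs become a problem.

```python
from collections import defaultdict


def check_read_groups(read_groups):
    """Return a list of problems; an empty list means the nesting is consistent."""
    problems = []
    seen_ids = set()
    samples_per_library = defaultdict(set)
    for rg in read_groups:
        # Every read group ID should appear only once.
        if rg["ID"] in seen_ids:
            problems.append(f"duplicate ID: {rg['ID']}")
        seen_ids.add(rg["ID"])
        samples_per_library[rg["LB"]].add(rg["SM"])
    # Every library should belong to exactly one sample.
    for lb, sms in sorted(samples_per_library.items()):
        if len(sms) > 1:
            problems.append(f"LB {lb} spans samples: {', '.join(sorted(sms))}")
    return problems


# The colour-named scheme from this post.
colour_scheme = [
    {"ID": "red", "SM": "A", "LB": "Group_1"},
    {"ID": "orange", "SM": "A", "LB": "Group_1"},
    {"ID": "yellow", "SM": "B", "LB": "Group_2"},
    {"ID": "green", "SM": "B", "LB": "Group_2"},
    {"ID": "blue", "SM": "G", "LB": "Group_3"},
    {"ID": "purple", "SM": "G", "LB": "Group_3"},
]

# Two rows from the first scheme proposed in post #1 (LB named after the pool).
pooled_scheme = [
    {"ID": "AXXX2.1", "SM": "A", "LB": "Group_1"},
    {"ID": "AXXX2.1", "SM": "B", "LB": "Group_1"},
]

print(check_read_groups(colour_scheme))
print(check_read_groups(pooled_scheme))
```

The colour-named scheme passes with no problems reported, while the pooled scheme is flagged twice: once for the repeated ID and once for a library that spans samples A and B.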
Old 03-06-2016, 02:28 AM   #3
Studentlost

Thank you for your reply! Just to make sure I follow, the library is basically the sample name so long as it's a single sample and only technical replicates, correct?

And the read group id is just a way to identify which flow cell/lane the sample was run on for each technical replicate?

How does base recalibration use the flow cell/lane as a covariate? 12 samples were multiplexed at a time, so the number of reads per sample is lower. Shouldn't all samples run on the same flow cell/lane be used together as a covariate when doing base recalibration of a single sample?

Or am I misunderstanding the method?

Thank you again!
Old 03-08-2016, 03:40 AM   #4
dpryan

Yeah, typically LB and SM are the same and ID is just a random unique identifier. I've never checked the source code for GATK to see exactly how it deals with lane as a covariate.