SEQanswers

Old 03-02-2011, 01:41 AM   #1
Sylphide
Member
 
Location: France

Join Date: Feb 2011
Posts: 11
GATK sample/library/lane meaning in BAM read group @RG

Hello
I am puzzled by the way GATK explains sample/library/lane for SNP detection.
It seems to me that their "sample" is equivalent to an individual. However, after reading the SAM Format Specification, I thought individuals were indicated in the read group @RG ID tag and not in the SM sample tag...

I want to study SNPs in several individuals from Illumina data, so I would like to know how to label individuals in the BAM file:
@RG ID:individual_1
or
@RG SM:individual_1 (and in this case, what is the ID tag for?)

And about the library: if several individuals were tagged and sequenced together in the same Illumina run, do these individuals form a library? If so, what is the difference between a lane and a library?
And just to be sure: is the lane encoded in the PU tag?

Many thanks for your help

Last edited by Sylphide; 03-02-2011 at 01:49 AM.
Old 03-02-2011, 05:19 AM   #2
Sylphide
Member
 
Location: France

Join Date: Feb 2011
Posts: 11

I think I found the answers to my questions. I am posting them here in case they need correcting or are useful to someone else:

The ID tag in the read group @RG line identifies a unique combination of the other tags (sample SM, library LB, lane PU, etc.). This way, when a read carries the read group tag RG:Z:read-group-ID, it refers indirectly to the origin of that read: individual (SM tag), lane (PU tag), library (LB tag)...
Thus the ID tag doesn't really have a biological meaning; it is just a compact key that saves repeating the other tags on every read.

As far as I've understood, lane and library can coincide depending on the experiment. If a unit of DNA (library) is sequenced on a single lane, then lane and library map one-to-one. However, if the library was sequenced on multiple lanes, then several lanes correspond to that one library.
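
To make the indirection concrete, here is a minimal Python sketch of how a read's RG:Z: tag can be resolved to its sample through the @RG header line. The header line, the read and the function names are made up for illustration; a real pipeline would normally use a SAM/BAM library rather than splitting text by hand.

```python
def parse_rg_line(line):
    """Turn one tab-separated @RG header line into a dict of tag -> value."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG line")
    # Each remaining field looks like "SM:individual_1"; split on the first colon only.
    return dict(field.split(":", 1) for field in fields[1:])


def read_group_of(sam_record, groups_by_id):
    """Look up the @RG entry named by a read's RG:Z: optional field."""
    # A SAM alignment line has 11 mandatory columns; optional fields follow.
    for field in sam_record.rstrip("\n").split("\t")[11:]:
        if field.startswith("RG:Z:"):
            return groups_by_id[field[5:]]
    return None


# Hypothetical header line and read, for illustration only.
header = "@RG\tID:rg1\tSM:individual_1\tLB:lib1\tPU:lane3"
groups = {rg["ID"]: rg for rg in [parse_rg_line(header)]}
read = "r001\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:rg1"
print(read_group_of(read, groups)["SM"])
```

The read itself only stores "rg1"; the sample, library and lane are all found by following that ID back to the header.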
Old 07-14-2013, 10:58 PM   #3
Tristanator2021
Junior Member
 
Location: San Diego, CA

Join Date: Mar 2013
Posts: 2

Hi Sylphide, thanks for posting what you found. I know I've read it at least three times, and I'm probably not the only one. Let's continue the conversation, as this is an important aspect of the process.

To summarize, the big read group fields are RGID (or just ID), LB, PL, PU and SM.

ID - A unique identifier for the origin of that sequencing read. By this, do we mean the ID should be specific to the read file or to the read itself? E.g., I have 16 files of reads for a given sample, the first 8 being the mates of the second 8. So when we add the read group (with the Picard tool or with bwa mem -R), should each file have a unique ID? Important for thoroughness, though not really used for much.

LB - The actual library that was prepared from the sample. Important for properly identifying PCR duplicates that may have arisen, so it matters for the MarkDuplicates tool and probably for some genotyping tools. Can be the same as the lane, but not necessarily, of course.

PL, PU - Platform and platform unit. Some tools might need to know the platform type, and you might want it yourself if you're looking back. I can imagine wanting to know the platform unit if you're tracking down the culprit of some serious errors in a major sequencing lab, but most of us won't really care (or know) about PU (mine tends to be set to 1234). Is this terrible? Somebody please correct me if I'm wrong.

SM - The most intuitive tag: what's your sample called this time? I like to keep a table of the ridiculous names I'm handed and the short names I've turned them into, in case we need to backtrack.
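
As a rough illustration of the fields above, this Python sketch builds the string handed to bwa mem -R. It assumes one read group per FASTQ pair (the two mate files are aligned together, so they share one ID) and that each pair comes from a separate lane or run; the sample, library and pair names are hypothetical.

```python
def bwa_read_group(rg_id, sample, library, platform="ILLUMINA", unit=None):
    """Build the string passed to `bwa mem -R`.

    bwa expects a literal backslash followed by 't' between fields and
    converts it to a real tab in the SAM header, hence the raw string.
    """
    fields = ["@RG", f"ID:{rg_id}", f"SM:{sample}", f"LB:{library}", f"PL:{platform}"]
    if unit is not None:
        fields.append(f"PU:{unit}")
    return r"\t".join(fields)


# Hypothetical names: one sample, one library, one read group per FASTQ pair.
pairs = ["pair1", "pair2"]
for pair in pairs:
    print(bwa_read_group(f"sampleA.{pair}", "sampleA", "libA"))
```

Every read group here shares the same SM and LB, so duplicate marking can still treat the reads as one library while the ID keeps the pairs apart.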

I hope this sparks a continued discussion....
Old 07-14-2013, 11:42 PM   #4
Heisman
Senior Member
 
Location: St. Louis

Join Date: Dec 2010
Posts: 535

To clarify if you're confused:

PU - I agree, most will have little use for this.
PL - as you said, some programs may want this. Nothing difficult to understand
SM - as you said, the sample that the data pertains to
LB - as you said, the library that data pertains to
RG - you can think of this as the lane of sequencing the data pertains to

So, if you have one sample (called x) prepped with two different libraries (called y and z), and one library is sequenced on lanes A and B and another is sequenced on lanes B and C (assuming the libraries are indexed), then you can encapsulate all of this information with RG/SM/LB:

lane A: SM:x,LB:y,RG:Ay
lane B: SM:x,LB:y,RG:By and SM:x,LB:z,RG:Bz
lane C: SM:x,LB:z,RG:Cz

Ultimately if you have a merged alignment file for one sample from reads derived from multiple libraries and multiple lanes of sequencing, using RG/LB/SM you can disentangle all of that information.
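
A small Python sketch of that disentangling, using the four read groups from the example. The explicit lane column is added here for illustration (in the example the lane is only encoded in the RG name), and the function name is hypothetical.

```python
from collections import defaultdict

# Read groups from the example above: ID -> (sample, library, lane).
read_groups = {
    "Ay": ("x", "y", "A"),
    "By": ("x", "y", "B"),
    "Bz": ("x", "z", "B"),
    "Cz": ("x", "z", "C"),
}


def group_by(read_groups, index):
    """Collect read group IDs that share the value at position `index`."""
    grouped = defaultdict(list)
    for rg_id, info in sorted(read_groups.items()):
        grouped[info[index]].append(rg_id)
    return dict(grouped)


by_library = group_by(read_groups, 1)  # which read groups share a library
by_lane = group_by(read_groups, 2)     # which read groups share a lane
print(by_library)
print(by_lane)
```

Lane B shows why the ID needs both parts: it holds reads from two libraries, so "B" alone would not separate them.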
Old 07-15-2013, 09:15 AM   #5
KrisWithaK
Junior Member
 
Location: La Jolla, CA

Join Date: Jun 2013
Posts: 1

This is a related question about the necessity of the @RG information.

I had about a dozen human genomes that were EACH broken into 8-10 FASTQ files (paired end). After mapping, I had 4 or 5 BAM files that needed to be merged. By using bwa mem -R I was able to add the RG ID to each read in the respective BAMs while mapping took place, but the PU, PL, SM and LB information was added only to the header.

My problem arose when I merged the BAM files (samtools merge): the @RG header line from ONLY one of the BAM files was carried over to the merged BAM. Thus, the PU/PL/SM/LB information was lost for all but one of the original BAM files.

Is the PU/PL/SM/LB information necessary, given that each read already has its RG ID attached?

Any insight would be appreciated. Cheers.
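
For context: the per-read RG tag holds only the ID, while SM, LB, PL and PU exist only in the @RG header lines, so a read whose header line is missing can no longer be traced to its sample or library. Below is a minimal Python sketch of collecting the @RG lines from several headers so that none are dropped. The header texts and the function name are hypothetical; the combined lines could then be supplied to the merge step (samtools merge has a -h option that takes a header file, but check that against your samtools version).

```python
def merge_rg_lines(headers):
    """Collect the distinct @RG lines from several SAM header texts.

    Raises if two inputs use the same ID for different read groups,
    since the merged reads could then no longer be told apart.
    """
    merged = {}
    for header in headers:
        for line in header.splitlines():
            if not line.startswith("@RG\t"):
                continue
            tags = dict(f.split(":", 1) for f in line.split("\t")[1:])
            rg_id = tags["ID"]
            if merged.get(rg_id, line) != line:
                raise ValueError(f"conflicting @RG lines for ID {rg_id}")
            merged[rg_id] = line
    return list(merged.values())


# Hypothetical headers of two per-lane BAM files from the same sample.
h1 = "@HD\tVN:1.4\n@RG\tID:lane1\tSM:genome1\tLB:lib1\tPL:ILLUMINA"
h2 = "@HD\tVN:1.4\n@RG\tID:lane2\tSM:genome1\tLB:lib1\tPL:ILLUMINA"
for line in merge_rg_lines([h1, h2]):
    print(line)
```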
Old 05-23-2014, 12:39 PM   #6
xrao
Junior Member
 
Location: Houston USA

Join Date: Mar 2014
Posts: 9

I think you can use Picard's AddOrReplaceReadGroups.jar to add the RG information to the merged BAM file. The RG information is important for telling samples apart.
Old 05-27-2014, 09:20 AM   #7
xrao
Junior Member
 
Location: Houston USA

Join Date: Mar 2014
Posts: 9

Quote:
Originally Posted by Heisman (post #4)

By RG you mean RGID, right? Thank you for explaining the concept so clearly. I now have a different situation: one lane with multiple samples.

Lane 1: case1-normal case1-tumor1 case1-tumor2 case2-normal
Lane 2: case2-tumor1 case2-tumor2 .....

So far I have defined RGID=case1-normal SM=case1-normal LB=lib (I do not have the library information) PL=illumina. Would this cause any problems? Or should I define RGID=lane1 so that GATK treats the 4 samples from lane 1 as having the same background?

Any input would be very appreciated!
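
One possible scheme, sketched in Python and not confirmed anywhere in this thread: follow the lane-plus-library naming from post #4 and give every (lane, sample) pair its own ID. The SAM specification requires each @RG line to have a unique ID, so a bare RGID=lane1 shared by four samples would collide once the files are merged. The ID format and the use of PU for the lane are assumptions, and LB is left out because the library is unknown here.

```python
# Lane layout from the question above (the elided samples on lane 2 are left out).
lanes = {
    "lane1": ["case1-normal", "case1-tumor1", "case1-tumor2", "case2-normal"],
    "lane2": ["case2-tumor1", "case2-tumor2"],
}


def read_groups_for(lanes, platform="illumina"):
    """Return one @RG record per (lane, sample) pair, keyed by a unique ID."""
    records = {}
    for lane, samples in lanes.items():
        for sample in samples:
            rg_id = f"{lane}.{sample}"  # hypothetical naming scheme
            records[rg_id] = {"ID": rg_id, "SM": sample, "PL": platform, "PU": lane}
    return records


for rg in read_groups_for(lanes).values():
    print("\t".join(["@RG"] + [f"{k}:{v}" for k, v in rg.items()]))
```

With this layout the lane is still recoverable from every read group, while SM keeps the samples separate.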