**Sylphide** · 03-02-2011, 05:19 AM

I think I found the answers to my questions. I'm posting them here in case they need correcting or are useful to someone else:

The ID tag in the @RG header line refers to a unique combination of the other tags (sample SM, library LB, lane PU, etc.). This way, when a read carries the tag RG:Z:read-group-ID, it refers indirectly to the origin of that read: individual (SM tag), lane (PU tag), library (LB tag)...
Thus the ID tag doesn't really have a biological meaning; it is just a way to compress data.

As far as I've understood, the lane and the library can be identical, depending on the experiment. If a unit of DNA (library) is sequenced in a single run (lane), then lane and library refer to the same thing. However, if the unit of DNA (library) was sequenced over multiple runs (lanes), then several lanes correspond to one library.
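
The indirection described above can be sketched in a few lines of Python. This is only an illustration, assuming tab-separated SAM text; the header lines, the read, and the names `lane1`, `indiv1`, `libA` are made up for the example.

```python
# Minimal sketch: resolve a read's RG:Z: tag through the @RG header lines.
# The header and the read below are invented illustrations, not real data.

def parse_rg_line(line):
    """Turn one '@RG<TAB>ID:..<TAB>SM:..' header line into a dict of tag -> value."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG line: %r" % line)
    # Split on the first ':' only, because values may themselves contain ':'.
    return dict(f.split(":", 1) for f in fields[1:])

def read_group_of(sam_record):
    """Return the read-group ID stored in the optional RG:Z: field, or None."""
    # A SAM record has 11 mandatory fields; optional tags start at index 11.
    for field in sam_record.rstrip("\n").split("\t")[11:]:
        if field.startswith("RG:Z:"):
            return field[len("RG:Z:"):]
    return None

header = [
    "@RG\tID:lane1\tSM:indiv1\tLB:libA\tPU:flowcell1.1\tPL:ILLUMINA",
    "@RG\tID:lane2\tSM:indiv1\tLB:libA\tPU:flowcell1.2\tPL:ILLUMINA",
]
read_groups = {rg["ID"]: rg for rg in map(parse_rg_line, header)}

read = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\tRG:Z:lane2"
rg = read_groups[read_group_of(read)]
print(rg["SM"], rg["LB"], rg["PU"])
```

The read itself stores only `lane2`; sample, library and platform unit are looked up in the header, which is why one library (`libA`) can appear under several IDs.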

**Tristanator2021** · 07-14-2013, 09:58 PM

Hi Sylphide, thanks for posting what you found. I know I've read it at least three times, and I'm probably not the only one. Let's continue the conversation, as this is an important aspect of the process.

To summarize, the big ones are RGID (or just ID), LB, PL, PU and SM.
Read Group
ID - A unique identifier for the origin of that sequencing read. By this, do we mean that the ID should be specific to the read file or to the read itself? E.g., I have 16 files of reads for a given sample, the first 8 being the pairs of the second 8. So when we add the read group (with the Picard tool or with bwa mem -R), should each file have a unique ID? Important for thoroughness, though not really used for much.

LB - The actual library from which the sample was prepared. Important for properly identifying PCR duplicates that may have arisen, so it matters for the MarkDuplicates tool and probably some genotyping tools. Can be the same as the lane, but not necessarily, of course.

PL, PU - platform and platform unit. It might be important for some tools to know what the platform type was, and for yourself if you're looking back. I can imagine wanting to know the platform unit number if you're tracking down the culprit of some serious errors in a major sequencing lab, but most of us won't really care (or know) about PU (mine tends to be set to 1234). Is this terrible? Somebody please correct me if I'm wrong.

SM - The most intuitive tag: what's your sample called this time? I like to keep a good table of the ridiculous names I'm handed and the short names I've turned them into, in case we need to backtrack.

I hope this sparks a continued discussion....
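
For the 16-file case above, one possible layout is a single read group per pair of FASTQ files, since one `bwa mem` run takes both mates and applies one `-R` string to all of them. The sketch below only builds the `-R` argument and prints the commands; the file names, the sample name `s1`, the library name `s1_lib1` and the reference `ref.fa` are hypothetical.

```python
def bwa_rg_string(rg_id, sample, library, platform="ILLUMINA", unit=None):
    r"""Build the argument for `bwa mem -R`, e.g. '@RG\tID:foo\tSM:bar'.

    Tabs are written as a literal backslash followed by 't'; bwa converts
    that sequence to a real TAB in the output SAM header.
    """
    tags = [("ID", rg_id), ("SM", sample), ("LB", library), ("PL", platform)]
    if unit is not None:
        tags.append(("PU", unit))
    for key, value in tags:
        if "\t" in value or "\n" in value:
            raise ValueError("%s value must not contain tabs or newlines" % key)
    return "@RG" + "".join("\\t%s:%s" % kv for kv in tags)

# Hypothetical layout: 8 pairs of FASTQ files for one sample and one library.
pairs = [("s1_L%03d_R1.fastq.gz" % n, "s1_L%03d_R2.fastq.gz" % n)
         for n in range(1, 9)]

for n, (r1, r2) in enumerate(pairs, start=1):
    rg = bwa_rg_string("s1.L%03d" % n, sample="s1", library="s1_lib1",
                       unit="L%03d" % n)
    # Only prints the command; nothing is executed here.
    print("bwa mem -R '%s' ref.fa %s %s" % (rg, r1, r2))
```

SM and LB stay the same across the eight commands, while ID (and PU) change per pair, so the merged BAM can still tell the lanes apart.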

**Heisman** · 07-14-2013, 10:42 PM

To clarify if you're confused:

PU - I agree, most will have little use for this.
PL - as you said, some programs may want this. Nothing difficult to understand
SM - as you said, the sample that the data pertains to
LB - as you said, the library that data pertains to
RG - you can think of this as the lane of sequencing the data pertains to

So, if you have one sample (called x) prepped with two different libraries (called y and z), and one library is sequenced on lanes A and B and another is sequenced on lanes B and C (assuming the libraries are indexed), then you can encapsulate all of this information with RG/SM/LB:

lane A: SM:x,LB:y,RG:Ay
lane B: SM:x,LB:y,RG:By and SM:x,LB:z,RG:Bz
lane C: SM:x,LB:z,RG:Cz

Ultimately if you have a merged alignment file for one sample from reads derived from multiple libraries and multiple lanes of sequencing, using RG/LB/SM you can disentangle all of that information.
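
As a rough illustration of that disentangling, the four read groups from the example can be regrouped by library or by lane. The tuples below simply restate the example; the variable names are arbitrary.

```python
from collections import defaultdict

# The example above as (lane, sample, library, read-group ID) tuples.
read_groups = [
    ("A", "x", "y", "Ay"),
    ("B", "x", "y", "By"),
    ("B", "x", "z", "Bz"),
    ("C", "x", "z", "Cz"),
]

by_library = defaultdict(list)  # (sample, library) -> read-group IDs
by_lane = defaultdict(list)     # lane -> read-group IDs
for lane, sample, library, rg_id in read_groups:
    by_library[(sample, library)].append(rg_id)
    by_lane[lane].append(rg_id)

print(dict(by_library))  # which read groups share a library
print(dict(by_lane))     # which read groups share a lane
```

Library y maps to Ay and By, library z to Bz and Cz, and lane B is the only lane holding two read groups, which is the case that needs indexed libraries.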

**KrisWithaK** · 07-15-2013, 08:15 AM

This is a related question about the necessity of the @RG information.

I had about a dozen human genomes that were EACH broken into 8-10 fastq files (paired end). After mapping, I had 4 or 5 bam files that needed to be merged. By using bwa mem -R I was able to add the @RGID information to each read in the respective bams while mapping took place, but the PU, PL, SM, & LB information was added only as part of the header.

My problem arose when I merged the .bam files (samtools merge) and the @RG header from ONLY one of the bam files was attached to the merged bam. Thus, the PU/PL/SM/LB information was lost for all but one of the original samples.

Is the PU/PL/SM/LB information necessary, given that each read has a unique RG ID attached to it?

Any insight would be appreciated. Cheers.
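
Since the reads carry only the ID, the SM/LB/PL/PU values exist nowhere except in the @RG header lines, so a merged BAM needs the @RG lines from every input. A minimal sketch of collecting them, assuming the headers are available as plain text (for example from `samtools view -H`); the function name and the example headers are hypothetical.

```python
def merge_rg_lines(headers):
    """Union the @RG lines of several SAM headers, keyed by read-group ID.

    `headers` is a list of header texts, one per input BAM.
    Raises if two inputs define the same ID with different tags.
    """
    merged = {}
    for text in headers:
        for line in text.splitlines():
            if not line.startswith("@RG\t"):
                continue  # skip @HD, @SQ, @PG and other header lines
            tags = dict(f.split(":", 1) for f in line.split("\t")[1:])
            rg_id = tags["ID"]
            if rg_id in merged and merged[rg_id] != line:
                raise ValueError("conflicting definitions for read group %s" % rg_id)
            merged[rg_id] = line
    return list(merged.values())

# Invented headers standing in for two of the per-lane BAM files.
header_1 = "@HD\tVN:1.4\n@RG\tID:g1.L1\tSM:g1\tLB:g1_lib\tPL:ILLUMINA\tPU:L1"
header_2 = "@HD\tVN:1.4\n@RG\tID:g1.L2\tSM:g1\tLB:g1_lib\tPL:ILLUMINA\tPU:L2"

for line in merge_rg_lines([header_1, header_2]):
    print(line)
```

The combined lines could then be placed in a header file and applied to the merged BAM, for instance with `samtools reheader`; check the manual of your samtools version for how `merge` itself handles @RG lines.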

**xrao** · 05-23-2014, 11:39 AM

I think you can use Picard's AddOrReplaceReadGroups.jar to add the RG to the merged bam file. The RG information is important for samples.

**xrao** · 05-27-2014, 08:20 AM

Originally posted by Heisman:

To clarify if you're confused:

PU - I agree, most will have little use for this.
PL - as you said, some programs may want this. Nothing difficult to understand
SM - as you said, the sample that the data pertains to
LB - as you said, the library that data pertains to
RG - you can think of this as the lane of sequencing the data pertains to

So, if you have one sample (called x) prepped with two different libraries (called y and z), and one library is sequenced on lanes A and B and another is sequenced on lanes B and C (assuming the libraries are indexed), then you can encapsulate all of this information with RG/SM/LB:

lane A: SM:x,LB:y,RG:Ay
lane B: SM:x,LB:y,RG:By and SM:x,LB:z,RG:Bz
lane C: SM:x,LB:z,RG:Cz

Ultimately if you have a merged alignment file for one sample from reads derived from multiple libraries and multiple lanes of sequencing, using RG/LB/SM you can disentangle all of that information.

RG here means RGID, right? Thank you for explaining the concept so clearly. I now have another situation: one lane with multiple samples.

Lane 1: case1-normal case1-tumor1 case1-tumor2 case2-normal
Lane 2: case2-tumor1 case2-tumor2 .....

In this case, I used to define RGID=case1-normal SM=case1-normal LB=lib (I do not have that information) PL=illumina. Would this cause any problems? Or should I define RGID=lane1 so that GATK will treat the 4 samples from lane1 as having the same background?

Any input would be much appreciated!
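
One thing to keep in mind is that an @RG line holds a single SM value, so a single RGID=lane1 cannot describe four samples. Extending Heisman's scheme from libraries to samples, a sketch would give each (lane, sample) combination its own ID and record the lane in PU. The ID format and the `lib-` placeholder are assumptions for illustration, and only the samples actually listed in the question are included.

```python
# Lane layout as listed in the question; the elided samples are left out.
lanes = {
    "lane1": ["case1-normal", "case1-tumor1", "case1-tumor2", "case2-normal"],
    "lane2": ["case2-tumor1", "case2-tumor2"],
}

def read_groups_for(lanes, platform="illumina"):
    """Return one read group per (lane, sample) combination."""
    groups = []
    for lane, samples in sorted(lanes.items()):
        for sample in samples:
            groups.append({
                "ID": "%s.%s" % (lane, sample),  # unique per lane and sample
                "SM": sample,
                "LB": "lib-" + sample,  # placeholder: real library names unknown
                "PL": platform,
                "PU": lane,
            })
    return groups

for rg in read_groups_for(lanes):
    print("@RG\t" + "\t".join("%s:%s" % (k, v) for k, v in rg.items()))
```

With this layout the lane is still recoverable from PU (or from the ID prefix), while SM keeps the samples separate.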

**xrao** · 06-27-2014, 11:32 AM

http://gatkforums.broadinstitute.org/discussion/2078/how-read-groups-affect-variant-calling?

You may refer to the above link for more details! Hope it helps!


GATK sample/library/lane meaning in BAM read group @RG
