  • GATK sample/library/lane meaning in BAM read group @RG

    Hello
    I got puzzled by the way GATK explains sample/library/lane for SNP detection.
    It seems to me that their "sample" is equivalent to an individual. However, after reading the SAM Format Specification, I thought individuals were indicated by the read group @RG ID tag and not by the SM sample tag...

    I want to study SNPs in various individuals from Illumina data, so I want to know how to label individuals in the BAM file:
    @RG ID:individual_1
    or
    @RG SM:individual_1 (and in this case, what is the ID tag for?)
    ???

    And about the library: if various individuals were tagged and sequenced together in the same Illumina run, do these individuals form a library? But then what is the difference between a lane and a library?
    And just to be sure: is the lane encoded in the PU tag?

    Many thanks for your help
    Last edited by Sylphide; 03-02-2011, 01:49 AM.

  • #2
    I think I found the answers to my questions. I'm putting them here in case they need correcting or are useful to someone else:

    The ID tag in the read group @RG refers to a unique combination of the other tags (sample SM, library LB, lane PU, etc.). This way, when a read has a read group tag RG:Z:read-group-ID, it refers indirectly to the origin of the read: individual (SM tag), lane (PU tag), library (LB tag)...
    Thus the ID tag doesn't really have a biological meaning; it is just a way to compress data.

    As far as I've understood, the lane and the library can be identical depending on the experiment. If a unit of DNA (library) is sequenced in a single run (lane), then lane and library refer to the same thing. However, if the unit of DNA (library) was sequenced across multiple runs (lanes), then there are several lanes corresponding to one library.
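The indirection described above can be sketched in a few lines of Python. This is only an illustration working on plain SAM text: the function names, the read group ID `rg1`, and the tag values are made up for the example, and a real pipeline would normally use a BAM library rather than string splitting.

```python
def parse_rg_line(line):
    """Parse one '@RG' header line into a dict of tag -> value."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG line")
    return dict(f.split(":", 1) for f in fields[1:])


def build_rg_table(header_lines):
    """Index every @RG header line by its ID tag."""
    table = {}
    for line in header_lines:
        if line.startswith("@RG\t"):
            rg = parse_rg_line(line)
            table[rg["ID"]] = rg
    return table


def read_group_of(sam_record, table):
    """Resolve the RG:Z: optional field of one alignment line.

    The first 11 tab-separated columns of a SAM record are mandatory,
    so optional fields start at index 11.
    """
    for field in sam_record.rstrip("\n").split("\t")[11:]:
        if field.startswith("RG:Z:"):
            return table[field[5:]]
    return None


# Hypothetical header and read, for illustration only.
header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@RG\tID:rg1\tSM:individual_1\tLB:lib1\tPU:lane1\tPL:ILLUMINA",
]
read = "r001\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tRG:Z:rg1"

table = build_rg_table(header)
info = read_group_of(read, table)
print(info["SM"], info["LB"], info["PU"])
```

The read itself only carries `rg1`; the sample, library, and lane are all recovered from the header.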


    • #3
      Hi Sylphide, thanks for posting what you found. I know I've read it at least three times, and I'm probably not the only one. Let's continue the conversation, as this is an important aspect of the process.

      To summarize, the big ones are RGID (or just ID), LB, PL, PU, and SM.
      Read Group
      ID - A unique identifier for the origin of that sequencing read. By this, do we mean the ID should be specific to the read file or to the read itself? E.g., I have 16 files of reads for a given sample, the first 8 being the pairs of the second 8. So when we add the read group (with the Picard tool or with bwa mem -R), should each file get a unique ID? Important for thoroughness, though not really used for much.

      LB - The actual library that was prepared from the sample. Important for properly identifying PCR duplicates that may have arisen, so it matters for the MarkDuplicates tool and probably some genotyping tools. Can be the same as the lane, but not necessarily.

      PL, PU - Platform and platform unit. It might be important for some tools to know what the platform type was, and for yourself if you're looking back. I can imagine wanting to know the platform unit number if you're tracking down the culprit of some serious errors in a major sequencing lab, but most of us won't really care (or know) about PU (mine tends to be set to 1234). Is this terrible? Somebody please correct me if I'm wrong.

      SM - The most intuitive tag: what's your sample called this time? I like to make a good table of the ridiculous names I'm handed and the short names I've turned them into, in case we need to backtrack.

      I hope this sparks a continued discussion....


      • #4
        To clarify if you're confused:

        PU - I agree, most will have little use for this.
        PL - as you said, some programs may want this. Nothing difficult to understand
        SM - as you said, the sample that the data pertains to
        LB - as you said, the library that data pertains to
        RG - you can think of this as the lane of sequencing the data pertains to

        So, if you have one sample (called x) prepped with two different libraries (called y and z), and one library is sequenced on lanes A and B and another is sequenced on lanes B and C (assuming the libraries are indexed), then you can encapsulate all of this information with RG/SM/LB:

        lane A: SM:x,LB:y,RG:Ay
        lane B: SM:x,LB:y,RG:By and SM:x,LB:z,RG:Bz
        lane C: SM:x,LB:z,RG:Cz

        Ultimately if you have a merged alignment file for one sample from reads derived from multiple libraries and multiple lanes of sequencing, using RG/LB/SM you can disentangle all of that information.
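The lane A/B/C example above could be written out as @RG header lines with a short Python sketch. The helper names are hypothetical, "RG" in the example is taken to mean the read group ID, and storing the lane in PU is an assumption of this sketch rather than something the example states.

```python
from collections import defaultdict

# (lane, library) combinations from the example: sample x, libraries y and z.
runs = [("A", "y"), ("B", "y"), ("B", "z"), ("C", "z")]


def make_read_groups(sample, runs):
    """One read group per lane/library combination.

    Assumption: the lane goes into PU; the ID is lane + library,
    matching the Ay/By/Bz/Cz names used in the example.
    """
    return [
        {"ID": lane + lib, "SM": sample, "LB": lib, "PU": lane}
        for lane, lib in runs
    ]


def format_rg(rg):
    """Render a read group dict as a tab-separated @RG header line."""
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in rg.items())


read_groups = make_read_groups("x", runs)
for rg in read_groups:
    print(format_rg(rg))

# "Disentangling" the merged file: which read group IDs belong to which library?
by_library = defaultdict(list)
for rg in read_groups:
    by_library[rg["LB"]].append(rg["ID"])
print(dict(by_library))
```

Grouping by LB instead of by ID is what a duplicate-marking step would care about, while grouping by ID keeps the per-lane distinction.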


        • #5
          This is a related question about the necessity of the @RG information.

          I had about a dozen human genomes that were EACH broken into 8-10 fastq files (paired end). After mapping, I had 4 or 5 bam files that needed to be merged. By using bwa mem -R I was able to add the @RGID information to each read in the respective bams while mapping took place, but the PU, PL, SM, & LB information was added only as part of the header.

          My problem arose when I merged the .bam files (samtools merge) and the @RG header from ONLY one of the bam files was attached to the merged bam. Thus, the PU/PL/SM/LB information was lost for all but one of the original samples.

          Is the PU/PL/SM/LB information necessary, since each read has a unique RG ID attached to it?

          Any insight would be appreciated. Cheers.


          • #6
            I think you can use Picard's AddOrReplaceReadGroups.jar to add the RG to the merged BAM file. The RG information is important for telling samples apart.
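As a rough picture of what the merged header needs to contain, here is a Python sketch that combines the @RG lines from several per-file headers and flags conflicting IDs. It assumes each header is already available as a list of text lines (for example, saved from `samtools view -H`); the function name and the sample data are hypothetical, and this is not how any particular merge tool works internally.

```python
def merge_rg_headers(headers):
    """Collect the @RG lines from several headers, keyed by read group ID.

    Raises ValueError if two headers define the same ID differently,
    because reads tagged with that ID would then become ambiguous.
    """
    merged = {}
    for header in headers:
        for line in header:
            if not line.startswith("@RG\t"):
                continue
            tags = dict(f.split(":", 1) for f in line.split("\t")[1:])
            rg_id = tags["ID"]
            if rg_id in merged and merged[rg_id] != line:
                raise ValueError(f"conflicting definitions for read group {rg_id}")
            merged[rg_id] = line
    return list(merged.values())


# Hypothetical per-file headers, for illustration only.
header_1 = ["@HD\tVN:1.6", "@RG\tID:s1.L1\tSM:s1\tLB:libA\tPL:ILLUMINA"]
header_2 = ["@HD\tVN:1.6", "@RG\tID:s1.L2\tSM:s1\tLB:libA\tPL:ILLUMINA"]

all_rg_lines = merge_rg_headers([header_1, header_2])
for line in all_rg_lines:
    print(line)
```

If every read group ID present on the reads also has an @RG line in the merged header, the SM/LB/PL/PU values can still be looked up for every read.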


            • #7
              Originally posted by Heisman
              To clarify if you're confused:

              PU - I agree, most will have little use for this.
              PL - as you said, some programs may want this. Nothing difficult to understand
              SM - as you said, the sample that the data pertains to
              LB - as you said, the library that data pertains to
              RG - you can think of this as the lane of sequencing the data pertains to

              So, if you have one sample (called x) prepped with two different libraries (called y and z), and one library is sequenced on lanes A and B and another is sequenced on lanes B and C (assuming the libraries are indexed), then you can encapsulate all of this information with RG/SM/LB:

              lane A: SM:x,LB:y,RG:Ay
              lane B: SM:x,LB:y,RG:By and SM:x,LB:z,RG:Bz
              lane C: SM:x,LB:z,RG:Cz

              Ultimately if you have a merged alignment file for one sample from reads derived from multiple libraries and multiple lanes of sequencing, using RG/LB/SM you can disentangle all of that information.
              The RG means RGID, right? Thank you for explaining the concept very clearly. I now have another situation, one lane with multiple samples.

              Lane 1: case1-normal case1-tumor1 case1-tumor2 case2-normal
              Lane 2: case2-tumor1 case2-tumor2 .....

              In this case, I used to define RGID=case1-normal SM=case1-normal LB=lib (I do not have this information) PL=illumina. Would this cause any problem? I should define RGID=lane1 so that GATK will treat the 4 samples from lane1 as having the same background, right?
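One possible way to lay this out, sketched in Python: since each @RG line carries a single SM value, a read group ID cannot be shared by four samples, so this sketch gives every sample/lane combination its own ID and records the lane in PU. The `sample.lane` ID scheme, the per-sample LB placeholder, and the function name are assumptions made for illustration, not a GATK requirement, and only the samples actually named above are included.

```python
# Samples per lane, as listed in the post (the elided ones are left out).
lanes = {
    "lane1": ["case1-normal", "case1-tumor1", "case1-tumor2", "case2-normal"],
    "lane2": ["case2-tumor1", "case2-tumor2"],
}


def read_groups_for(lanes, platform="illumina"):
    """Build one read group per (sample, lane) pair.

    Assumptions: ID is 'sample.lane', PU holds the lane, and LB is a
    per-sample placeholder because the real library names are unknown.
    """
    groups = []
    for lane, samples in lanes.items():
        for sample in samples:
            groups.append({
                "ID": f"{sample}.{lane}",
                "SM": sample,
                "LB": f"{sample}.lib",  # placeholder library name
                "PL": platform,
                "PU": lane,
            })
    return groups


groups = read_groups_for(lanes)
ids = [g["ID"] for g in groups]
for g in groups:
    print("@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in g.items()))
```

With this layout the lane is still recoverable from PU for every read, while each ID maps to exactly one sample.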

              Any input would be much appreciated!


              • #8


                You may refer to the above link for more details! Hope it helps!
