  • Confused about RG ID LB even after reading all posts/GATK best practices

    I am getting conflicting information on how to assign RG ID, LB, PU, SM for an exome analysis I am working with.

    Can someone just clarify for me how I should assign the RG please?

    If you don't want to read the details, take a look at this table. I just need clarification on how to assign the RG ID, SM, LB for these samples, since they were multiplexed and come from different libraries, with some samples being pooled.

    Sample  Technical Replicate  Flow Cell ID  Lane ID  Library
    A       1                    AXXX2         1        Group 1
    A       2                    AXXX2         2        Group 1
    B       1                    AXXX2         1        Group 1
    B       2                    DXCX5         1        Group 1
    G       1                    AXXX2         1        Group 2
    G       2                    DXCX5         1        Group 2

    Is this correct?

    RG ID:AXXX2.1 SM:A LB:Group_1
    RG ID:AXXX2.2 SM:A LB:Group_1
    RG ID:AXXX2.1 SM:B LB:Group_1
    RG ID:DXCX5.1 SM:B LB:Group_1
    RG ID:AXXX2.1 SM:G LB:Group_2
    RG ID:DXCX5.1 SM:G LB:Group_2

    or this

    RG ID:AXXX2.A.1 SM:A LB:Group_1
    RG ID:AXXX2.A.2 SM:A LB:Group_1
    RG ID:AXXX2.B.1 SM:B LB:Group_1
    RG ID:DXCX5.B.1 SM:B LB:Group_1
    RG ID:AXXX2.G.1 SM:G LB:Group_2
    RG ID:DXCX5.G.1 SM:G LB:Group_2


    I'm working with 36 different biological samples that were run with 100 bp PE.

    The libraries were pooled in batches of 12, so there are three batches.

    Here are the three issues I'm considering.

    1) For one of the pooled library batches (Group 2), the first 12 samples were sequenced on two different Flow Cell IDs.

    2) For the second pooled library batch (Group 1), the second 12 samples were sequenced on the same Flow Cell Id, but on two separate Lanes.

    3) For the last pooled library batch (Group 3), the last 12 samples were sequenced on the same Flow Cell ID, and same Lane ID, but twice (two different runs).

    How do I assign an appropriate RG ID, LB, and SM for these samples?

    From what I understand:

    All 12 samples from a single batch/group will have the same unifying library ID.
    The SM is unique to each sample, but since each sample has two technical replicates, I need to differentiate the technical replicates for the same sample in the RG ID.

    For the read group ID, I have read two conflicting answers.
    The first was that the ID should simply be Flow_Cell_ID:Lane_ID.
    The second was that the ID should be Flow_Cell_ID:SM:Lane_ID.

    Should the read group ID be unique for each SM? Or should it only identify the Flow Cell and Lane ID? The read group is used to recalibrate the data for the same sample based on whether it was run on the same lane or not, but since the samples were multiplexed in groups of 12, wouldn't it be informative for the read group ID to be common for all samples that were run on the same flow cell and lane in order to increase the corrective power?
    Last edited by Studentlost; 03-05-2016, 08:06 PM.

  • #2
    An "LB" tag should only ever be associated with one sample. This refers to the physical library made from a sample and has absolutely nothing to do with pooling. The hierarchy is:

    SM: A biological sample
    LB: A library made from a single biological sample (if a sample has more than one, you have technical replicates)
    ID: A single instance of a given library. You might have more than one of these per library if you sequenced it on multiple flow cells or multiple lanes (honestly, I would just merge the lanes these days, though).

    Practically speaking, the various tags should be unique. Whether you use this:

    Code:
    RG ID:AXXX2.1 SM:A LB:1
    RG ID:AXXX2.2 SM:A LB:1
    RG ID:AXXX2.1 SM:B LB:2
    RG ID:DXCX5.1 SM:B LB:2
    RG ID:AXXX2.1 SM:G LB:3
    RG ID:DXCX5.1 SM:G LB:3
    or this:

    Code:
    RG ID:red SM:A LB:Group_1
    RG ID:orange SM:A LB:Group_1
    RG ID:yellow SM:B LB:Group_2
    RG ID:green SM:B LB:Group_2
    RG ID:blue SM:G LB:Group_3
    RG ID:purple SM:G LB:Group_3
    Or some other naming scheme; it doesn't matter. The only thing that matters is the association between and nesting of the tags.
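
The nesting described above can be checked mechanically. Here is a minimal Python sketch, not anything taken from GATK or samtools: the function names (`build_read_groups`, `check_nesting`, `header_line`), the `lib_<sample>` library names, and the `flowcell.sample.lane` ID pattern are all assumptions made for illustration. It uses the rows of the table in the first post, and it puts the sample in the ID because the SAM specification requires every @RG line in a file to have a unique ID.

```python
from collections import defaultdict

# Rows from the question's table: (sample, replicate, flow cell, lane).
RUNS = [
    ("A", 1, "AXXX2", 1),
    ("A", 2, "AXXX2", 2),
    ("B", 1, "AXXX2", 1),
    ("B", 2, "DXCX5", 1),
    ("G", 1, "AXXX2", 1),
    ("G", 2, "DXCX5", 1),
]


def build_read_groups(runs):
    """Return one dict of RG fields per sequencing run of a library."""
    groups = []
    for sample, _replicate, flowcell, lane in runs:
        groups.append({
            # Assumed convention: flowcell.sample.lane keeps IDs unique
            # even when several samples shared a lane.
            "ID": f"{flowcell}.{sample}.{lane}",
            "SM": sample,
            # Assumed: one library per sample, named after the sample.
            "LB": f"lib_{sample}",
        })
    return groups


def check_nesting(groups):
    """Raise ValueError if an ID repeats or a library spans several samples."""
    seen_ids = set()
    samples_per_library = defaultdict(set)
    for rg in groups:
        if rg["ID"] in seen_ids:
            raise ValueError(f"duplicate read group ID: {rg['ID']}")
        seen_ids.add(rg["ID"])
        samples_per_library[rg["LB"]].add(rg["SM"])
    for library, samples in samples_per_library.items():
        if len(samples) > 1:
            raise ValueError(
                f"library {library} spans samples {sorted(samples)}"
            )


def header_line(rg):
    """Format the fields as a tab-separated SAM @RG header line."""
    fields = "\t".join(f"{key}:{rg[key]}" for key in ("ID", "SM", "LB"))
    return "@RG\t" + fields


read_groups = build_read_groups(RUNS)
check_nesting(read_groups)
for rg in read_groups:
    print(header_line(rg))
```

If each sample is kept in its own BAM file, a plain flowcell.lane ID would also pass this check per file; the sample only needs to be in the ID once several samples end up in one file.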

    • #3
      Originally posted by dpryan
      An "LB" tag should only ever be associated with one sample. This refers to the physical library made from a sample and has absolutely nothing to do with pooling. The hierarchy is:

      SM: A biological sample
      LB: A library made from a single biological sample (if a sample has more than one, you have technical replicates)
      ID: A single instance of a given library. You might have more than one of these per library if you sequenced it on multiple flow cells or multiple lanes (honestly, I would just merge the lanes these days, though).

      Practically speaking, the various tags should be unique. Whether you use this:

      Code:
      RG ID:AXXX2.1 SM:A LB:1
      RG ID:AXXX2.2 SM:A LB:1
      RG ID:AXXX2.1 SM:B LB:2
      RG ID:DXCX5.1 SM:B LB:2
      RG ID:AXXX2.1 SM:G LB:3
      RG ID:DXCX5.1 SM:G LB:3
      or this:

      Code:
      RG ID:red SM:A LB:Group_1
      RG ID:orange SM:A LB:Group_1
      RG ID:yellow SM:B LB:Group_2
      RG ID:green SM:B LB:Group_2
      RG ID:blue SM:G LB:Group_3
      RG ID:purple SM:G LB:Group_3
      Or some other naming scheme; it doesn't matter. The only thing that matters is the association between and nesting of the tags.

      Thank you for your reply! Just to make sure I follow, the library is basically the sample name so long as it's a single sample and only technical replicates, correct?

      And the read group id is just a way to identify which flow cell/lane the sample was run on for each technical replicate?

      How does base recalibration work in terms of using the flow cell/lane as a covariate? 12 samples were multiplexed at a time, so the read count per sample is lower. Shouldn't all samples run on the same flow cell/lane be used as a covariate when doing base recalibration of a single sample?

      Or am I misunderstanding the method?

      Thank you again!

      • #4
        Yeah, typically LB and SM are the same and ID is just a random unique identifier. I've never checked the source code for GATK to see exactly how it deals with lane as a covariate.
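
For what it's worth, the GATK documentation lists the read group among the covariates that base recalibration tabulates errors against, so whatever shares an ID gets pooled into the same table rows. Below is a toy Python sketch of that idea only, not GATK's implementation: `empirical_quality` is a hypothetical helper, the `(mismatches + 1) / (bases + 2)` smoothing is an assumption chosen for simplicity, and the counts are made up.

```python
import math
from collections import defaultdict


def empirical_quality(observations):
    """Map each read group to a Phred-scaled empirical error rate.

    observations: iterable of (read_group_id, is_mismatch) pairs, one per
    base, taken at sites assumed to be non-variant.
    """
    counts = defaultdict(lambda: [0, 0])  # read group -> [mismatches, bases]
    for read_group, is_mismatch in observations:
        counts[read_group][0] += int(is_mismatch)
        counts[read_group][1] += 1
    return {
        rg: -10 * math.log10((mismatches + 1) / (bases + 2))
        for rg, (mismatches, bases) in counts.items()
    }


# Made-up observations for two read groups from the same flow cell and lane.
toy = (
    [("AXXX2.A.1", False)] * 998
    + [("AXXX2.B.1", False)] * 89
    + [("AXXX2.B.1", True)] * 9
)

for rg, quality in empirical_quality(toy).items():
    print(rg, round(quality, 1))
```

In this toy, giving both samples the same ID would merge their counts into a single estimate, which is the pooling effect asked about above; whether that is desirable in a real pipeline is a separate question this sketch does not answer.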
