
  • how much of the MHC is represented in the reference genome?

    I remember that earlier versions of the reference genome (at least hg18) lacked some genes because closely located, repeated 'genes' were collapsed into a single prototype sequence.

    I am wondering to what extent this is still true in the hg19/b37 builds, and how many genes/regions might suffer from this artifactual simplification. Please correct me if this is wrong, as I am not 100% certain.

    This question aims to inform, and maybe warn, people performing NGS that their reference might not match the true genome of the cell types they analyze.

    Thanks in advance to immunologists or specialists aware of the status of the MHC region and other similar hypervariable loci (Ig genes, TCR...) for their insights.

    Cheers,
    http://www.bits.vib.be/index.php

  • #2
    Just at a glance, here is the list of HLA genes in RefSeq:
    HLA-F
    HLA-F-AS1
    HLA-F
    HLA-G
    HLA-H
    HLA-A
    HLA-J
    HLA-L
    HLA-E
    HLA-B
    HLA-C
    HLA-DRA
    HLA-DRB5
    HLA-DQA1
    HLA-DRB6
    HLA-DQB1
    HLA-DRB1
    HLA-DQA2
    HLA-DQB2
    HLA-DOB
    HLA-DMA
    HLA-DOA
    HLA-DMB
    HLA-DPB2
    HLA-DPA1
    HLA-DPB1
    HLA-DRB3
    HLA-DRB4
    Of course, to get trustworthy allele information (which, I believe, is still far from complete) one should check IMGT/HLA: http://www.ebi.ac.uk/ipd/imgt/

    As for the IGH/K/L and TRA/B/G/D loci, I believe they are also present. RefSeq only maps the loci, while Ensembl transcripts provide a more detailed view of V/D/J genes such as TRBV7-1. The only list of alleles is available in IMGT (http://www.imgt.org); however, this database is very hard to browse and contains many spurious alleles (e.g. a Variable segment allele built from an mRNA reference that lacks part of the sequence near the conserved Cys residue). Still, so far IMGT is the only choice.

    I will try to compile our own list of immune receptor segment genes and upload it (this will take about a week).

    For specialized tasks, like targeted TCR sequencing, one should use specialized software. Check out our MiTCR software at http://mitcr.milaboratory.com



    • #3
      Although this does not answer your question, I might add that I heard recently that the published cod genome is missing quite a few MHC genes.

      I study MHC in non-model organisms using NGS. Can someone please tell me how alleles are designated/characterized in human studies, using traditional approaches and NGS?

      Another thing worth noting is that we estimated in our fish species that MHC IIb genes may be duplicated among loci, and are only distinguishable by variation in intron II.
      I imagine that as with any genome sequence, accurately including CNVs (identical or very close in sequence identity) is pretty tricky.



      • #4
        Yes, hg19 is still missing some MHC regions. This is why some alternate haplotypes can be downloaded from UCSC (chr6_apd_hap1, chr6_cox_hap2, etc.).
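To check whether a given reference FASTA actually carries these extra chr6 haplotype sequences, one could scan its header lines. The following is only a minimal sketch: the class and method names are hypothetical, and the name pattern is an assumption generalized from the two names quoted in post #4 (chr6_apd_hap1, chr6_cox_hap2).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class Hg19MhcHaplotypes {
    // ASSUMPTION: pattern generalized from the two names in the post
    // (chr6_apd_hap1, chr6_cox_hap2), i.e. chr6_<label>_hap<number>.
    private static final Pattern CHR6_HAP = Pattern.compile("chr6_[A-Za-z0-9]+_hap\\d+");

    /** Returns the sequence names that look like chr6 alternate haplotypes. */
    static List<String> chr6Haplotypes(List<String> fastaHeaders) {
        List<String> hits = new ArrayList<>();
        for (String header : fastaHeaders) {
            // Drop the leading '>' and anything after the first whitespace.
            String name = header.startsWith(">") ? header.substring(1) : header;
            name = name.trim().split("\\s+")[0];
            if (CHR6_HAP.matcher(name).matches()) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> headers = Arrays.asList(">chr6", ">chr6_apd_hap1", ">chr6_cox_hap2", ">chr7");
        System.out.println(chr6Haplotypes(headers));
    }
}
```

If the list comes back empty, the reference probably contains only the primary chr6 assembly, and reads from divergent MHC haplotypes may map poorly or not at all.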



        • #5
          Variable, Diversity and Joining segments data

          Ok, just as promised (sorry for the delay due to holidays), here are the lists of segments for the TRA/B/G/D and IGH/K/L genes of human and mouse.
          They were originally filtered from IMGT data. A script was written to parse the HTML from the IMGT web pages, as there is no other way to download the data in bulk.
          The major allele (marked as *01) was taken, and all alleles that are incomplete (e.g. a V segment missing sequence near the conserved Cys residue) or non-functional were removed. We don't use all the alleles, as many of them are incomplete or rest on spurious evidence (e.g. alleles derived from cDNA data). So the idea here is to use as reference the most frequent allele that is in full agreement with the locus it is derived from, and to derive SNPs from your own sequencing data.
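The filtering rule described above (keep the *01 allele, drop incomplete and non-functional entries) might look roughly like this. It is a sketch only, not the script that produced the attached files: the `Allele` record and its `functional`/`complete` flags are hypothetical stand-ins for whatever annotation the parsed IMGT data carries.

```java
import java.util.ArrayList;
import java.util.List;

class MajorAlleleFilter {
    /** Hypothetical record; the two flags are assumed to come from the parsed annotation. */
    static final class Allele {
        final String name;        // e.g. "TRBV7-1*01"
        final boolean functional; // assumed flag: segment is annotated as functional
        final boolean complete;   // assumed flag: sequence covers the conserved residue

        Allele(String name, boolean functional, boolean complete) {
            this.name = name;
            this.functional = functional;
            this.complete = complete;
        }
    }

    /** Keeps only major (*01) alleles that are both functional and complete. */
    static List<Allele> referenceAlleles(List<Allele> alleles) {
        List<Allele> kept = new ArrayList<>();
        for (Allele a : alleles) {
            if (a.name.endsWith("*01") && a.functional && a.complete) {
                kept.add(a);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Allele> input = new ArrayList<>();
        input.add(new Allele("TRBV7-1*01", true, true));
        input.add(new Allele("TRBV7-1*02", true, true));
        for (Allele a : referenceAlleles(input)) {
            System.out.println(a.name);
        }
    }
}
```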

          Two files are attached:

          segments_cdr3.txt, with the structure "Species Gene Segment_type Segment_name ReferencePoint Sequence"
          Here the reference point marks the position of the conserved Cys in a Variable segment or of Phe/Trp in a Joining segment.
          For a Variable segment, the reference point is the coordinate of the first nucleotide after the Cys codon, so to obtain the Cys codon, e.g. in Java:
          Code:
          seq.substring(ref - 3, ref)
          For a Joining segment, the reference point is the coordinate of the first nucleotide before the Phe/Trp codon; to obtain that codon, use:
          Code:
          seq.substring(ref + 1, ref + 4)
          A usage example can be found in a working script that performs CDR3 extraction from HTS data:
          https://github.com/mikessh/migec/blo...r3Blast.groovy
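The reference-point convention described in post #5 can be sketched for one row of segments_cdr3.txt as follows. Several things here are assumptions rather than facts about the attached file: that fields are tab-separated, that coordinates are 0-based (as the `substring` calls in the post imply), and that Segment_type values start with "V" or "J". The class name and the example sequences are made up for illustration.

```java
class Cdr3ReferencePoint {
    /** One row of segments_cdr3.txt; field order follows the structure given in the post. */
    static final class Segment {
        final String species, gene, type, name, sequence;
        final int referencePoint;

        Segment(String species, String gene, String type, String name,
                int referencePoint, String sequence) {
            this.species = species;
            this.gene = gene;
            this.type = type;
            this.name = name;
            this.referencePoint = referencePoint;
            this.sequence = sequence;
        }
    }

    /** ASSUMPTION: six tab-separated fields per line. */
    static Segment parseLine(String line) {
        String[] f = line.split("\t");
        if (f.length != 6) {
            throw new IllegalArgumentException("expected 6 fields, got " + f.length);
        }
        return new Segment(f[0], f[1], f[2], f[3], Integer.parseInt(f[4].trim()), f[5].trim());
    }

    /** Returns the Cys codon (Variable) or the Phe/Trp codon (Joining). */
    static String conservedCodon(Segment s) {
        // ASSUMPTION: the segment type can be recognized by its first letter.
        String t = s.type.toUpperCase();
        int start;
        if (t.startsWith("V")) {
            start = s.referencePoint - 3; // codon ends right before the reference point
        } else if (t.startsWith("J")) {
            start = s.referencePoint + 1; // codon starts right after the reference point
        } else {
            throw new IllegalArgumentException("no conserved residue defined for type " + s.type);
        }
        int end = start + 3;
        if (start < 0 || end > s.sequence.length()) {
            throw new IllegalArgumentException("reference point out of range for " + s.name);
        }
        return s.sequence.substring(start, end);
    }

    public static void main(String[] args) {
        // Made-up rows, not taken from the attached file.
        Segment v = parseLine("DemoSpecies\tTRB\tVariable\tV-demo\t6\tAAATGTGCC");
        Segment j = parseLine("DemoSpecies\tTRB\tJoining\tJ-demo\t1\tACTTTGGC");
        System.out.println(conservedCodon(v));
        System.out.println(conservedCodon(j));
    }
}
```

Checking that the Variable codon is TGT/TGC and the Joining codon is TTT/TTC/TGG is a cheap sanity test that the coordinates are being read with the intended offset.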

          segments_cdr12.txt, with the structure "Species Gene Segment_type Segment_name CDR1start CDR1end CDR2start CDR2end Sequence"
          To get the CDR1 and CDR2 regions, use e.g.
          Code:
          seq.substring(CDR1start, CDR1end)
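For a whole row of segments_cdr12.txt, the same idea can be sketched as below. As before, this assumes tab-separated fields and 0-based, end-exclusive coordinates (matching Java's `substring`); neither is confirmed by the post, and the class name and example row are made up.

```java
class Cdr12Regions {
    /**
     * Returns {CDR1, CDR2} for one row with the structure
     * "Species Gene Segment_type Segment_name CDR1start CDR1end CDR2start CDR2end Sequence".
     * ASSUMPTION: nine tab-separated fields.
     */
    static String[] cdr1AndCdr2(String line) {
        String[] f = line.split("\t");
        if (f.length != 9) {
            throw new IllegalArgumentException("expected 9 fields, got " + f.length);
        }
        String seq = f[8].trim();
        return new String[] { slice(seq, f[4], f[5]), slice(seq, f[6], f[7]) };
    }

    /** Cuts seq[start, end) after validating the coordinates. */
    private static String slice(String seq, String start, String end) {
        int s = Integer.parseInt(start.trim());
        int e = Integer.parseInt(end.trim());
        if (s < 0 || s > e || e > seq.length()) {
            throw new IllegalArgumentException("bad range " + s + ".." + e);
        }
        return seq.substring(s, e);
    }

    public static void main(String[] args) {
        // Made-up row, not taken from the attached file.
        String row = "DemoSpecies\tTRB\tVariable\tV-demo\t3\t6\t9\t12\tAAACCCGGGTTTAAA";
        String[] regions = cdr1AndCdr2(row);
        System.out.println("CDR1=" + regions[0] + " CDR2=" + regions[1]);
    }
}
```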
          Hope this is useful!

          Regards,
          Mike