  • popto
    Junior Member
    • Mar 2010
    • 3

    Haplotype and "random" chromosomes

    Hi all!

    I'm trying to make sense of some sequence data (50-base reads, Illumina) and I noticed a coverage gap on chromosome 6, around the region that aligns very well to the haplotype chromosomes. Some reads mapped to the haplotype chromosomes (e.g. chr6_cox_hap1), but not enough to explain the dip in coverage. I am worried that, because of the high homology between the chromosomes, the "missing" reads might be "hiding" as ##:##:## (i.e., flagged as not mappable because they map equally well to more than one locus).

    So basically what I am wondering - and I apologize if this is a very basic question - is: do you align your reads to all the available chromosomes, or do you omit the "haplotype" and "random" ones from your build? And if you use all of the chromosomes, do you observe the same dip in coverage?

    I would be very grateful for any advice you might have...
    Thanks!!
    Popto
  • Simon Anders
    Senior Member
    • Feb 2010
    • 995

    #2
    I always omit the haplotype sequences from the reference index, for precisely the reason you mention.

    Simon


    • popto
      Junior Member
      • Mar 2010
      • 3

      #3
      Thank you, Simon, this is very helpful.


      • thinkRNA
        Member
        • Jan 2010
        • 94

        #4
        Originally posted by Simon Anders
        I always omit the haplotype sequences from the reference index, for precisely the reason you mention.

        Simon
        How do you determine which region is haplotype sequence?


        • Simon Anders
          Senior Member
          • Feb 2010
          • 995

          #5
          I took my reference from Ensembl: ftp://ftp.ensembl.org/pub/current_fa...o_sapiens/dna/

          All the files with "HSCHR" in the file name are haplotype variants, e.g., the "HSCHR6_MHC" files contain variants of the MHC region of chromosome 6. I suggest simply not including these files when building the reference (unless, of course, you are specifically interested in them, but then you need to do some additional tweaking).

          The "nonchromosomal" file contains the "random" contigs. I usually include them, but these contigs are so short that it does not really matter.

          By the way, do not use the repeat-masked ("rm" in the filename) sequences. You should leave checking for repeats to the aligner.

          Simon
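The file selection described in the post above can be sketched in a few lines of Python. This is only an illustration, not part of the original post: the function name `keep_for_reference` and the example file names are hypothetical, and the check for "dna_rm" is an assumption about how the repeat-masked files are named. Adjust both tests to the names you actually see in your download directory.

```python
def keep_for_reference(filename: str) -> bool:
    """Return True if a FASTA file should go into the aligner index.

    Assumptions (not from the original post): haplotype files carry
    "HSCHR" in their name and repeat-masked files carry "dna_rm".
    """
    name = filename.lower()
    if "hschr" in name:      # haplotype variants, e.g. HSCHR6_MHC_*
        return False
    if "dna_rm" in name:     # repeat-masked sequences
        return False
    return True              # primary chromosomes and "nonchromosomal"


# Hypothetical file names, for illustration only.
candidates = [
    "Homo_sapiens.dna.chromosome.6.fa.gz",
    "Homo_sapiens.dna.chromosome.HSCHR6_MHC_COX.fa.gz",
    "Homo_sapiens.dna_rm.chromosome.6.fa.gz",
    "Homo_sapiens.dna.nonchromosomal.fa.gz",
]

selected = [f for f in candidates if keep_for_reference(f)]
for f in selected:
    print(f)
```

The "nonchromosomal" file is kept here, matching the habit described in the post; dropping it would be a one-line change.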


          • pcg
            Junior Member
            • Jan 2010
            • 8

            #6
            Simon,

            I presume that if you exclude the haplotypes from the index, then you remove those chromosomes from the GTF annotation file as well? Right?

            So, if I understand the reasoning correctly, you remove these haplotypes because the high similarity between the two chromosomes causes an alignment problem, and you may get false mappings to a chromosome?

            Thanks,


            • Simon Anders
              Senior Member
              • Feb 2010
              • 995

              #7
              Originally posted by pcg
              I presume that if you exclude the haplotypes from the index, then you remove those chromosomes from the GTF annotation file as well? Right?
              Actually, no. The aligner does not need a GTF file, and when counting later (e.g. with my htseq-count script), a feature in the GTF file with a chromosome name that does not appear in the SAM file will not collect any counts anyway.

              So, if I understand the reasoning correctly, you remove these haplotypes because the high similarity between the two chromosomes causes an alignment problem, and you may get false mappings to a chromosome?
              Especially when looking for differential expression, it is a good idea to discount all non-unique alignments. Now, if the aligner sees several versions of, e.g., the MHC, it does not know that these are all variants of the same region but rather treats them as paralogs at different places. So, if a read maps there, the aligner will think that there are multiple mappings, flag the read accordingly, and you will exclude it, ending up with no signal at all at the variant regions, even (or: especially) at the parts of the variant region that are actually conserved and would hence have posed no problem for mapping.

              Simon
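The effect described in the post above can be reproduced with a toy example. The sketch below is not a real aligner: it uses exact substring matching on made-up sequences, and all names (`find_hits`, `unique_reads`, the sequences themselves) are hypothetical. It only illustrates why a read from a conserved stretch stops being "unique" once a haplotype copy of the same region is added to the reference.

```python
def find_hits(read, reference):
    """Return (sequence_name, position) for every exact match of read."""
    hits = []
    for name, seq in reference.items():
        start = seq.find(read)
        while start != -1:
            hits.append((name, start))
            start = seq.find(read, start + 1)
    return hits


def unique_reads(reads, reference):
    """Keep only reads with exactly one hit, as one would when
    discounting non-unique alignments."""
    return [r for r in reads if len(find_hits(r, reference)) == 1]


# Made-up sequences: a conserved core flanked by differing ends.
conserved = "ACGTTGCAAGGCTA"
primary_only = {"chr6": "TTTT" + conserved + "CCGGAATT"}
with_haplotype = dict(primary_only,
                      chr6_cox_hap1="GGGG" + conserved + "CAGGTATT")

reads = [
    "GTTGCAAGG",  # lies inside the conserved core
    "CCGGAATT",   # lies in the part specific to chr6
]

kept_primary = unique_reads(reads, primary_only)
kept_with_hap = unique_reads(reads, with_haplotype)
print(kept_primary)
print(kept_with_hap)
```

With only the primary chromosome, both reads are kept. With the haplotype copy added, the read from the conserved core has two hits and is discarded, which is the loss of signal described above.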


              • pcg
                Junior Member
                • Jan 2010
                • 8

                #8
                Thanks Simon for your reply.

                As you rightly point out, you do not need a GTF for alignment. But if you want to run a Cufflinks analysis on the alignment and only want expression for what is currently annotated (in the GTF), then unless you remove those haplotypes from the GTF file, won't you still see hits to them and expression values?

                Thanks in advance,
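For anyone who does want to strip the haplotype entries from a GTF file, a minimal sketch follows. It does not settle what Cufflinks reports for such entries. The function name `drop_seqnames`, the "HSCHR" prefix test, and the two example records are assumptions made for illustration; check the first column of your own GTF for the names actually used.

```python
def drop_seqnames(gtf_lines, is_unwanted):
    """Yield GTF lines whose first column (seqname) is not unwanted.

    Comment lines starting with '#' are passed through unchanged.
    """
    for line in gtf_lines:
        if line.startswith("#"):
            yield line
            continue
        seqname = line.split("\t", 1)[0]
        if not is_unwanted(seqname):
            yield line


# Two made-up GTF records, for illustration only.
example = [
    '6\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1";\n',
    'HSCHR6_MHC_COX\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G2";\n',
]

kept = list(drop_seqnames(example, lambda s: s.startswith("HSCHR")))
print("".join(kept), end="")
```

For a real file, pass an open file handle as `gtf_lines` and write the yielded lines to a new file instead of collecting them in a list.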
