  • popto
    Junior Member
    • Mar 2010
    • 3

    Haplotype and "random" chromosomes

    Hi all!

    I'm trying to make sense of some sequence data (50-base Illumina reads), and I noticed a coverage gap on chromosome 6, around the region that aligns very well to the haplotype chromosomes. Some reads mapped to the haplotype chromosomes (e.g. chr6_cox_hap1), but not enough to explain the dip in coverage. I am worried that, because of the high homology between these chromosomes, the "missing" reads might be "hiding" as ##:##:## (i.e., reported as not mappable because they map equally well to more than one locus).

    So what I am wondering (and I apologize if this is a very basic question) is: do you align your reads to all the available chromosomes, or do you omit the "haplotype" and "random" ones from your build? And if you are using all of the chromosomes, do you observe the same dip in coverage?

    I would be very grateful for any advice you might have...
    Thanks!!
    Popto
  • Simon Anders
    Senior Member
    • Feb 2010
    • 995

    #2
    I always omit the haplotype sequences from the reference index, for precisely the reason you mention.

    Simon


    • popto
      Junior Member
      • Mar 2010
      • 3

      #3
      Thank you, Simon, this is very helpful.


      • thinkRNA
        Member
        • Jan 2010
        • 94

        #4
        Originally posted by Simon Anders:
        "I always omit the haplotype sequences from the reference index, for precisely the reason you mention.

        Simon"

        How do you determine which region is haplotype sequence?


        • Simon Anders
          Senior Member
          • Feb 2010
          • 995

          #5
          I took my reference from Ensembl: ftp://ftp.ensembl.org/pub/current_fa...o_sapiens/dna/

          All the files with "HSCHR" in the file name are haplotype variants, e.g., the "HSCHR6_MHC" files contain variants of the MHC region of chromosome 6. I suggest simply not including these files when building the reference (unless, of course, you are specifically interested in them, but then you need to do some additional tweaking).

          The "nonchromosomal" file contains the "random" contigs. I usually include them, but these contigs are so short that it does not really matter.

          By the way, do not use the repeat-masked ("rm" in the filename) sequences. You should leave checking for repeats to the aligner.
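The file selection described in this post can be sketched in a few lines of Python. This is only an illustration: the file names in the example list are invented, and the check for "dna_rm" assumes that the repeat-masked files carry that token in their names. Check it against the actual directory listing before relying on it.

```python
def select_reference_files(filenames):
    """Keep unmasked FASTA files, dropping haplotype and repeat-masked ones."""
    selected = []
    for name in filenames:
        if "HSCHR" in name:
            # Haplotype variant (e.g. HSCHR6_MHC_...): leave out of the index.
            continue
        if "dna_rm" in name:
            # Assumed marker for repeat-masked sequence: leave out as well.
            continue
        selected.append(name)
    return selected


# Hypothetical file names, for illustration only.
example_files = [
    "Homo_sapiens.dna.chromosome.6.fa.gz",
    "Homo_sapiens.dna.chromosome.HSCHR6_MHC_COX.fa.gz",
    "Homo_sapiens.dna_rm.chromosome.6.fa.gz",
    "Homo_sapiens.dna.nonchromosomal.fa.gz",
]

kept = select_reference_files(example_files)
for name in kept:
    print(name)
```

The "nonchromosomal" file is kept here, matching the habit of including the "random" contigs.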

          Simon


          • pcg
            Junior Member
            • Jan 2010
            • 8

            #6
            Simon,

            I presume that if you exclude the haplotypes from the index, then you remove those chromosomes from the GTF annotation file as well? Right?

            So, if I understand the reason correctly, you remove these haplotypes because the high similarity between the two chromosomes causes an alignment problem, and you may get false mappings to a chromosome?

            Thanks,


            • Simon Anders
              Senior Member
              • Feb 2010
              • 995

              #7
              Originally posted by pcg:
              "I presume that if you exclude the haplotypes from the index, then you remove those chromosomes from the GTF annotation file as well? Right?"
              Actually, no. The aligner does not need a GTF file, and when counting later (e.g. with my htseq-count script), a feature in the GTF file with a chromosome name that does not appear in the SAM file will not collect any counts anyway.
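Why an unused chromosome name in the annotation is harmless at the counting stage can be seen in a toy counter. This is a simplified sketch, not the htseq-count implementation: the feature tuples, the read positions, and the function name are all made up for illustration.

```python
def count_reads_per_feature(features, alignments):
    """Count aligned read positions falling inside each feature.

    features:   list of (feature_id, chrom, start, end), coordinates inclusive
    alignments: list of (chrom, pos)
    """
    counts = {feature_id: 0 for feature_id, _, _, _ in features}
    for chrom, pos in alignments:
        for feature_id, f_chrom, start, end in features:
            # A feature only collects a read if the chromosome names match.
            if chrom == f_chrom and start <= pos <= end:
                counts[feature_id] += 1
    return counts


# One gene on chromosome 6 and a copy annotated on a haplotype sequence
# that was left out of the alignment index (hypothetical names).
features = [
    ("geneA", "6", 100, 200),
    ("geneA_hap", "HSCHR6_MHC_COX", 100, 200),
]
alignments = [("6", 150), ("6", 180), ("6", 500)]

counts = count_reads_per_feature(features, alignments)
print(counts)
```

Since no alignment ever carries the haplotype name, the haplotype feature stays at zero without any editing of the annotation.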

              "So, if I understand the reason correctly, you remove these haplotypes because the high similarity between the two chromosomes causes an alignment problem, and you may get false mappings to a chromosome?"
              Especially when looking for differential expression, it is a good idea to discount all non-unique alignments. Now, if the aligner sees several versions of, e.g., the MHC, it does not know that these are all variants of the same region but rather treats them as paralogs at different places. So, if a read maps there, the aligner will think that there are multiple mappings, flag the read accordingly, and you will exclude it, ending up with no signal at all at the variant regions, even (or: especially) at the parts of the variant region that are actually conserved and would hence have posed no problem for mapping.
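The loss of signal can be reproduced on a toy example. The sketch below assumes the aligner writes the optional NH:i tag (number of reported alignments for the read) into its SAM output; not every aligner does, and some signal multiple mappings differently. Read names, positions, and the helper functions are invented for illustration.

```python
def is_unique(sam_line):
    """Return True if the record carries NH:i:1 (assumed uniqueness marker)."""
    fields = sam_line.rstrip("\n").split("\t")
    # The 11 mandatory SAM columns come first; optional tags start at index 11.
    for tag in fields[11:]:
        if tag.startswith("NH:i:"):
            return int(tag[5:]) == 1
    # No NH tag: uniqueness is unknown, so this sketch discards the read.
    return False


def keep_unique(sam_lines):
    """Drop header lines and every record that is not uniquely aligned."""
    return [l for l in sam_lines if not l.startswith("@") and is_unique(l)]


def sam_record(name, chrom, pos, nh):
    """Build a minimal SAM record for the example (hypothetical data)."""
    return "\t".join(
        [name, "0", chrom, str(pos), "255", "4M", "*", "0", "0",
         "ACGT", "IIII", "NH:i:%d" % nh]
    )


records = [
    sam_record("read1", "chr6", 1000, 1),           # outside the variant region
    sam_record("read2", "chr6", 30000000, 2),       # conserved part of the MHC
    sam_record("read2", "chr6_cox_hap1", 5000, 2),  # same read, haplotype copy
]

unique_names = [r.split("\t")[0] for r in keep_unique(records)]
print(unique_names)
```

With the haplotype sequence in the index, read2 is reported twice and discarded; without it, the same read would have a single alignment on chr6 and be kept.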

              Simon


              • pcg
                Junior Member
                • Jan 2010
                • 8

                #8
                Thanks Simon for your reply.

                As you rightly point out, a GTF is not needed for alignment. But if you want to run a Cufflinks analysis on the alignment and only want expression for what is currently annotated (in the GTF), then unless you remove those haplotypes from the GTF file, will you not still see hits to them and expression values?
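If one does decide to restrict the annotation to the sequences used in the index, a small filter is enough. This is a sketch under assumptions: the GTF lines and sequence names below are invented, and the set of reference names would in practice come from the FASTA files or the SAM header.

```python
def drop_unindexed_records(gtf_lines, reference_names):
    """Keep comment lines and records whose seqname is in reference_names."""
    kept = []
    for line in gtf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        # The sequence name is the first tab-separated column of a GTF record.
        seqname = line.split("\t", 1)[0]
        if seqname in reference_names:
            kept.append(line)
    return kept


# Hypothetical annotation lines (9 tab-separated GTF columns each).
gtf_lines = [
    "#!example annotation",
    '6\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "geneA";',
    'HSCHR6_MHC_COX\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "geneA_hap";',
]
reference_names = {"6"}

filtered = drop_unindexed_records(gtf_lines, reference_names)
print(len(filtered))
```

Only the comment line and the chromosome 6 record survive; the haplotype record is dropped.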

                Thanks in advance,

