  • orphan reads - any advice?

    Hi all,

    I'm having some trouble with the analysis of my ChIP-seq data. From a ChIP-seq experiment of mouse pancreas, I get a reasonable number of reads that map to the mouse genome (good!), some map to the human genome (contamination), and around 35% don't map anywhere.

    To start with, the FastQC analysis doesn't reveal any overrepresented sequences (I thought adaptors might be contaminating the library, but that doesn't seem to be the case).

    I've checked whether the orphan reads match different microorganism genomes: no hits. I've also checked them against a database of adaptor sequences: no hits.

    When I BLAST a read against mouse/human, I get a perfect match for half of the sequence, but no match for the rest of the read.
    If I BLAST the non-matching half against everything, I get a list of matches to different microorganisms. They are always the same ones, so my guess is that these are conserved regions, not specific to a single microorganism.
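
For anyone wanting to repeat the half-read check on many reads at once, here is a minimal sketch that splits each read at its midpoint and writes both halves as separate FASTA records, so they can be searched independently. The function names and the `readN_left` / `readN_right` labels are made up for illustration, and the midpoint split is an assumption: the true junction in a chimeric read could sit anywhere.

```python
def split_read(seq):
    """Split a read at its midpoint into (left, right) halves."""
    mid = len(seq) // 2
    return seq[:mid], seq[mid:]


def halves_to_fasta(reads):
    """Return FASTA text with two records (left and right half) per read."""
    lines = []
    for i, seq in enumerate(reads, start=1):
        left, right = split_read(seq.strip().upper())
        lines.append(f">read{i}_left")   # hypothetical naming scheme
        lines.append(left)
        lines.append(f">read{i}_right")
        lines.append(right)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # One of the orphan reads posted in this thread.
    example = ["CCACTGAAGGTGAATTTGTCTTTTACGAAGGTCCACCAAC"]
    print(halves_to_fasta(example), end="")
```

The resulting FASTA can then be searched with whatever tool you already use. If the left and right halves consistently hit different sources, that would be consistent with chimeric reads rather than a single contaminating genome.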

    I would appreciate any advice on how to work out what these reads are.

    Many thanks,
    Francesc

    P.S. I'm attaching some of the orphan reads mentioned above, in case you wish to check anything:
    CCACTGAAGGTGAATTTGTCTTTTACGAAGGTCCACCAAC
    CGACCACGGGAGCATCGTTCGCGTCCAGCGCGAAACGGCG
    CCAATTCCTTCCGCGCCTTGGCTGCGCTAATATCTCCCGT
    CAATAATTCTTGGCAATGGTTCAATCGTACTGGTCGAGCT
    TGATAAGAAATAATTGTAAGTAGCTAACAATATTCCAAGT
    GCATTCTCTCGCCGCGACTGTCCTCGATAGACACCAACTC
    GATGCTGGTCCACTCGCCGACGAGGATCTGATCGTGAGCG
    GTGTTATTTATTTACTCACATCGATAACAGTGATAAACTC
    CTCATCGACGGCGTGCGCGCGCTGCGGGCCCGGCAGATGG
    GGTACTCTCTCAGCAAGGAGAGATGAAGGAGGAAGAAGTT
    CCATCTTCATTTTCGATGAATGAGTATGCTTGGATTTCAA
    CTTTGCAAGGCGTCTGCCAATTGTTGGTTCGCCTCTTCGA
    CCAGGATTGAAAAGTTTGTCAAAAAGGCGGTTATTCAGGA
    ATTATTTAGTGGTTTTAACTAACGATTTCGTCTAGAAATG
    ATCTATATCGTCTTCACGCAGAAGGTGACCGATTGGCGCA
    CGCCGCTTCTATCGAAAGGAGCTCTAAGATGGTCAAATTG
    AGAAAAATGAAATGCGTTGCGTGGCTAAAAGCATATAACG

  • #2
    You could try assembling the unmapped reads and BLASTing the resulting contigs.

    Possibly it's fish DNA from the bead-blocking step.


    • #3
      Similar problem

      I am having a similar issue with mouse ChIP-seq data (HiSeq, SE, 50 bp). In particular, the alignment rates appear to be very antibody-specific: for one protein I get ~20%, for a second ~40%, for a third ~65%, and for non-IP input ~90% (these approximate percentages hold in two replicates for each sample).

      Contamination does not seem to be the issue: FastQC did not show any adaptors left on the reads, and not much in the way of overrepresented sequences. I used fastq_screen to check against human, rat, mouse, fly, yeast, C. elegans, E. coli, staph, and phiX, and the best matches were still to mouse, by far. BLAST showed a similar mix of things, as fmadriles noted. At any rate, if it were contamination I would expect to see similar issues in all the samples, rather than a rate that depends so strongly on the antibody/protein of interest.

      Short of attempted assembly on the unmapped reads, which I may try, does anyone have any other suggestions about what the issue could be, or other things to try? Has anyone else seen this kind of thing in ChIP-seq data before?


      • #4
        Possibly chimeric sequences from amplification.


        • #5
          For mmaiensc especially.
          In the end an expert helped me, ran some analyses, and concluded the following:

          I did a species screen with the original data and found (as you did) that
          most of the sequence comes from human and mouse (more mouse than
          human). Most of the reads map uniquely and there is a
          bit of overlap between mouse and rat (as you’d expect). There are however
          around 35% of reads which don’t map to any of the genomes or contaminants
          we screen for. I’ve extracted these to a new dataset and did an assembly
          with velvet. It’s not a great assembly since the reads are short, but it
          gave some extra information.
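
The extraction step described above can be sketched roughly as follows. This is only an illustration and not the expert's actual procedure: it assumes single-end alignments in uncompressed SAM text and relies on the standard SAM FLAG bit 0x4 ("segment unmapped"). The function name is hypothetical.

```python
def unmapped_to_fastq(sam_lines):
    """Yield FASTQ records for reads whose SAM FLAG has the 0x4 (unmapped) bit set."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        name, flag = fields[0], int(fields[1])
        seq, qual = fields[9], fields[10]
        if flag & 0x4:
            yield f"@{name}\n{seq}\n+\n{qual}\n"


if __name__ == "__main__":
    import sys
    # Usage sketch: python extract_unmapped.py < aligned.sam > unmapped.fastq
    for record in unmapped_to_fastq(sys.stdin):
        sys.stdout.write(record)
```

The resulting FASTQ file is what would then be handed to an assembler. Paired-end data would need extra handling of the mate flags, which this sketch ignores.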

          I’ve included the set of contigs of at least 100bp and have sorted these
          both by coverage and length. All of the high coverage contigs appear to
          be human alpha satellite DNA or general AT rich repeats.
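
A rough sketch of the filtering and sorting described above, for anyone who wants to reproduce it on their own contigs. It assumes a FASTA file whose headers carry a coverage value in the form `NODE_<id>_length_<n>_cov_<value>` (the naming velvet uses for its contigs file). Contig length is measured from the sequence itself rather than taken from the header, and the function names are made up for illustration.

```python
import re

# Coverage is read from the header; contigs without a match get 0.0.
COV_RE = re.compile(r"_cov_(\d+(?:\.\d+)?)")


def _make_contig(name, chunks):
    seq = "".join(chunks)
    match = COV_RE.search(name)
    coverage = float(match.group(1)) if match else 0.0
    return {"name": name, "length": len(seq), "coverage": coverage}


def parse_contigs(fasta_text):
    """Parse FASTA text into a list of {name, length, coverage} dicts."""
    contigs, name, chunks = [], None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                contigs.append(_make_contig(name, chunks))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        contigs.append(_make_contig(name, chunks))
    return contigs


def rank_contigs(contigs, min_length=100):
    """Keep contigs of at least min_length bp; return (by_coverage, by_length)."""
    kept = [c for c in contigs if c["length"] >= min_length]
    by_coverage = sorted(kept, key=lambda c: c["coverage"], reverse=True)
    by_length = sorted(kept, key=lambda c: c["length"], reverse=True)
    return by_coverage, by_length
```

The top few entries of each list are the natural candidates to BLAST first: high-coverage contigs tend to point at repeats, and long contigs at larger contaminating or unexpected sequences.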

          For the long contigs, a bunch of these turned out to be rRNA from both
          mouse and human so a chunk of your extra sequence comes from these. In
          addition there is also a large set of sequences which come from a
          bacterial source. However, you don’t appear to have the whole genome
          present; more specifically, you have a region of the genome around an
          integrase gene. This strongly suggests that either you have a high copy
          number transgene in your mouse, or that this sequence has contaminated
          one of the reagents in your library prep process.

          I think this is as far as I can justify taking this analysis. I’ve
          included the sequences and contigs I generated if you really want to
          pursue this, but I suspect the satellite sequence, the rRNA and the
          bacterial DNA should account for a significant chunk of the previously
          unknown sequence, and there really isn’t anything else I consistently
          found in there. If anything the contamination with really high levels of
          human sequence should probably be more of a concern in your case since
          this is certainly something we shouldn’t expect to see.




          I hope this is as useful to other people as it has been for me!
