SEQanswers

Old 02-10-2014, 09:48 AM   #1
fmadriles
Junior Member
 
Location: madrid

Join Date: Feb 2014
Posts: 2
orphan reads - any advice?

Hi all,

I'm having some trouble with the analysis of my ChIP-seq data. From a ChIP-seq experiment of mouse pancreas, I get a reasonable number of reads that map to the mouse genome (good!), some map to the human genome (contamination), and around 35% don't map anywhere.

To start with, the FastQC analysis doesn't reveal any overrepresented sequences (I thought adaptors might be contaminating the reads, but that doesn't seem to be the case).

I've checked whether the orphan reads match different microorganism genomes: no hits. I've also checked a database that contains adaptor sequences: no hits.

When I BLAST a read against mouse/human, I get a perfect match for half of the sequence, but no matches for the rest of the read.
If I BLAST the non-matching sequence against everything, I get a list of matches against different microorganisms. But they are always the same ones, so my guess is that these are conserved regions, not specific to a single microorganism.
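
One way to check this half-matching pattern on many reads at once, rather than one read at a time, is to split each orphan read into two halves and submit each half as its own BLAST query. A minimal Python sketch of that idea follows; the function names and the `readN_partM` record names are illustrative assumptions, not part of any existing tool:

```python
def split_read(read, n_parts=2):
    """Split a read into n_parts contiguous segments of near-equal length."""
    size, extra = divmod(len(read), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` segments get one additional base.
        end = start + size + (1 if i < extra else 0)
        parts.append(read[start:end])
        start = end
    return parts


def halves_to_fasta(reads):
    """Return FASTA text with one record per read half (hypothetical naming)."""
    lines = []
    for idx, read in enumerate(reads, start=1):
        for part_no, part in enumerate(split_read(read), start=1):
            lines.append(f">read{idx}_part{part_no}")
            lines.append(part)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    orphan_reads = ["CCACTGAAGGTGAATTTGTCTTTTACGAAGGTCCACCAAC"]
    print(halves_to_fasta(orphan_reads), end="")
```

If the two halves of many reads consistently hit different sources, that would be consistent with chimeric fragments rather than a single contaminating genome.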

I would appreciate any advice on what I can do to find out what these reads are.

Many thanks,
Francesc

P.S. I'm attaching some of the orphan reads mentioned above in case you want to check anything:
CCACTGAAGGTGAATTTGTCTTTTACGAAGGTCCACCAAC
CGACCACGGGAGCATCGTTCGCGTCCAGCGCGAAACGGCG
CCAATTCCTTCCGCGCCTTGGCTGCGCTAATATCTCCCGT
CAATAATTCTTGGCAATGGTTCAATCGTACTGGTCGAGCT
TGATAAGAAATAATTGTAAGTAGCTAACAATATTCCAAGT
GCATTCTCTCGCCGCGACTGTCCTCGATAGACACCAACTC
GATGCTGGTCCACTCGCCGACGAGGATCTGATCGTGAGCG
GTGTTATTTATTTACTCACATCGATAACAGTGATAAACTC
CTCATCGACGGCGTGCGCGCGCTGCGGGCCCGGCAGATGG
GGTACTCTCTCAGCAAGGAGAGATGAAGGAGGAAGAAGTT
CCATCTTCATTTTCGATGAATGAGTATGCTTGGATTTCAA
CTTTGCAAGGCGTCTGCCAATTGTTGGTTCGCCTCTTCGA
CCAGGATTGAAAAGTTTGTCAAAAAGGCGGTTATTCAGGA
ATTATTTAGTGGTTTTAACTAACGATTTCGTCTAGAAATG
ATCTATATCGTCTTCACGCAGAAGGTGACCGATTGGCGCA
CGCCGCTTCTATCGAAAGGAGCTCTAAGATGGTCAAATTG
AGAAAAATGAAATGCGTTGCGTGGCTAAAAGCATATAACG
Old 02-10-2014, 11:53 AM   #2
ffinkernagel
Senior Member
 
Location: Marburg, Germany

Join Date: Oct 2009
Posts: 110

You can try assembling the reads and BLASTing the results.

Possibly it's fish DNA from the bead blocking.
Old 02-18-2014, 09:02 AM   #3
mmaiensc
Junior Member
 
Location: Chicago

Join Date: Jan 2014
Posts: 1
Similar problem

I am having a similar issue for ChIP-seq mouse data (HiSeq, SE, 50 bp). In particular, alignment statistics appear to be very antibody specific: for one protein I get ~20%, for a second ~40%, for a third ~65%, and for non-IP input ~90% (these approx %'s are borne out in two replicates for each sample).

Contamination does not seem to be an issue: FastQC did not show any adaptors left on the reads, and not much in the way of over-represented sequences. I used fastq_screen to check against human, rat, mouse, fly, yeast, C. elegans, E. coli, staph, and phiX, and the best matches were still to mouse, by far. BLAST showed a similar mix of things, as fmadriles noted. At any rate, if it were contamination I would expect to see similar issues in all the samples, rather than depending so strongly on the antibody/protein of interest.

Short of attempted assembly on the unmapped reads, which I may try, does anyone have any other suggestions about what the issue could be, or other things to try? Has anyone else seen this kind of thing in ChIP-seq data before?
Old 02-19-2014, 08:54 AM   #4
MU Core
Member
 
Location: Columbia, Missouri

Join Date: Apr 2008
Posts: 57

Possibly chimeric sequences from amplification.
Old 02-24-2014, 04:35 AM   #5
fmadriles
Junior Member
 
Location: madrid

Join Date: Feb 2014
Posts: 2

This is especially for mmaiensc.
In the end an expert helped me, ran some analyses, and concluded the following:

I did a species screen with the original data and found (as you did) that most of the sequence comes from human and mouse (more mouse than human). Most of the reads map uniquely and there is a bit of overlap between mouse and rat (as you’d expect). There are however around 35% of reads which don’t map to any of the genomes or contaminants we screen for. I’ve extracted these to a new dataset and did an assembly with velvet. It’s not a great assembly since the reads are short, but it gave some extra information.
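
As a rough illustration of the extraction step (the exact procedure used here isn't described, so this is not the expert's method), unmapped reads can be pulled out of an aligner's SAM output by testing the 0x4 FLAG bit, which the SAM format defines as "segment unmapped". The function name below is hypothetical, and the sketch assumes plain-text SAM lines rather than BAM:

```python
def unmapped_to_fastq(sam_lines):
    """Return FASTQ records (as strings) for SAM alignment lines flagged unmapped."""
    records = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment line
        name, flag = fields[0], int(fields[1])
        seq, qual = fields[9], fields[10]
        if flag & 0x4:  # 0x4 = segment unmapped
            records.append(f"@{name}\n{seq}\n+\n{qual}\n")
    return records


if __name__ == "__main__":
    example = [
        "@HD\tVN:1.0\tSO:unsorted",
        "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tIIII",
    ]
    print("".join(unmapped_to_fastq(example)), end="")
```

The resulting FASTQ could then be handed to an assembler such as velvet, as described above.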

I’ve included the set of contigs of at least 100bp and have sorted these both by coverage and length. All of the high coverage contigs appear to be human alpha satellite DNA or general AT-rich repeats.
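
For anyone wanting to reproduce the contig filtering and sorting, a minimal Python sketch is below. It assumes a FASTA file of contigs whose header names contain a `_cov_<number>` field (as in velvet-style names like `NODE_1_length_80_cov_5.0`); that naming, the function names, and the 100 bp default are assumptions for illustration. Length is measured from the sequence itself rather than taken from the header:

```python
import re

COV_RE = re.compile(r"_cov_([0-9]+(?:\.[0-9]+)?)")


def parse_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    contigs, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                contigs.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        contigs.append((name, "".join(chunks)))
    return contigs


def summarize_contigs(text, min_len=100):
    """Keep contigs of at least min_len bp; return them sorted by length and by coverage."""
    rows = []
    for name, seq in parse_fasta(text):
        if len(seq) < min_len:
            continue
        match = COV_RE.search(name)
        coverage = float(match.group(1)) if match else None
        rows.append({"name": name, "length": len(seq), "coverage": coverage})
    by_length = sorted(rows, key=lambda r: r["length"], reverse=True)
    # Contigs without a parsable coverage value are placed last.
    by_coverage = sorted(
        rows, key=lambda r: (r["coverage"] is None, -(r["coverage"] or 0.0))
    )
    return by_length, by_coverage
```

The top entries of each list are then the natural candidates to BLAST first.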

For the long contigs, a bunch of these turned out to be rRNA from both mouse and human so a chunk of your extra sequence comes from these. In addition there is also a large set of sequences which come from a bacterial source. However, you don’t appear to have the whole genome present, but more specifically you have a region of the genome around an integrase gene. This strongly suggests that either you have a high copy number transgene in your mouse, or it could be that this has contaminated one of the reagents in your library prep process.

I think this is as far as I can justify taking this analysis. I’ve included the sequences and contigs I generated if you really want to pursue this, but I suspect the satellite sequence, the rRNA and the bacterial DNA should account for a significant chunk of the previously unknown sequence, and there really isn’t anything else I consistently found in there. If anything the contamination with really high levels of human sequence should probably be more of a concern in your case since this is certainly something we shouldn’t expect to see.




I hope this is as useful for other people as it has been for me!
All times are GMT -8.