SEQanswers

Old 09-11-2013, 07:27 AM   #1
foehn
Junior Member
 
Location: china

Join Date: Sep 2013
Posts: 5
RRBS overrepresented sequences

Hi all, I am processing RRBS data generated on an Illumina HiSeq 2000 (50 bp, single end). I used FastQC for quality checking and found that one sample has many more overrepresented sequences than the others:

Code:
Sequence 	Count 	Percentage 	Possible Source
CTCCCACTTATTCTACACCTCTCATGTCTCTTCACCGTGCCAGACTAGAG 	4168787 	3.694709642846457 	No Hit
CGCCTGATTATCTCACCGGCAGTCTTGCCGGTGACAATGGGTTTGACCCG 	2534233 	2.2460382606066713 	No Hit
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGT 	2493949 	2.2103353851053744 	Illumina Paired End PCR Primer 2 (100% over 50bp)
TCATCAGTTACATTGGAATCCAAATTGCCAACAAAAATAGTAGTGTTATT 	1344836 	1.1919003147071456 	No Hit
AAATTATGCAGTCGAGTTTCCCACATTTGGGGAAATCGCAGGGGTCAGCA 	1281767 	1.1360035652534837 	No Hit
GATAAAGACTCATTCCTTGTAGAGCAATAAAATTTATCGTGGCTTAACTA 	1028733 	0.9117447677260468 	No Hit
CACAGAGTGGAACGTCCCTTTAGACAGAGCAGATTTGAAACACTCTTTTT 	1027822 	0.9109373672796742 	No Hit
GATCTCTTTCACTGTCATAATTTCCTCAGTTATAATTTTGCAAAGGCGGT 	980751 	0.8692193141389342 	No Hit
CAGATCATGGGCACCAGACAGGCAAGACAGGTTTGTTAAGAGATGGGTGG 	941876 	0.8347652061776363 	No Hit
CCCGCACCGTCCCTGGCCAGATGTGAGTCCTCCCACCCCTGTCGGGGCTC 	896901 	0.7949047944590668 	No Hit
CTCCTATTTCCAAAAATCCATTTAATATATTGTCCTCGGATAGAGGACGT 	857061 	0.7595954269689545 	No Hit
CCGAGTTTTGTGGAGGAACCCACATAACAAACACCCGCGAAGCCAAAGCA 	841097 	0.7454468641523844 	No Hit
GGGGGGAATAAGGAATGACTGCAAATGGGTATGGAGTTTTTTAGGATGTT 	764633 	0.6776784034153376 	No Hit
GTCAGCCTTGACTACATAGCAAAACTCAAGGCCAACCAGACCTAAAAACA 	738731 	0.6547219968709378 	No Hit
AGTGAGAACCTGGTGCTATGGACAGCTAAGAGCTCACATCCCAAACTGCA 	689985 	0.6115194258952095 	No Hit
AGAGATTAAGGCCGAGTACTGGGCGGTGTCCTCCCCCACTGCACTAACCT 	677824 	0.6007413832735414 	No Hit
CCAATATAGGATGGCCCCCTACCAAAAGCTGAGTTTTAGACTACATCCCT 	648286 	0.5745624651780862 	No Hit
GCTGGGCGTGGTGGTGGGCGCCTGTAGTCCCAGCTGCTCGGGAGGCTGAG 	635748 	0.5634502952586327 	No Hit
CACTTTCTCAGGTATAGAGAGACTCACTTCCTCCTGTGGAGGAAAGCCTG 	627786 	0.556393739436437 	No Hit
GGGGCTTATCACAGCAATAGAACAGCAATTATGACTGGAGTATGATAGTT 	620923 	0.5503112045698546 	No Hit
TGAAAAAACATGAAGACCTCGGTCTGGATCTCTAACACCCGCACTTTCCA 	617973 	0.5476966806216661 	No Hit
CTCGCCTCCGGTGCACCTCAGGTACACGACTTTGACCTCGTTGGGGTCGA 	610104 	0.5407225487747862 	No Hit
GTGTGGAATGCCCAGGAGGCCCAGGCTGACTTTGCCAAGGTGTTGGAGCT 	578537 	0.51274536997056 	No Hit
TGTTGTTCTGAGGGTCTACCCGAACTGCTCCTGAGGGGCCCAGGTTTGTA 	573695 	0.5084540055783129 	No Hit
AGAGCACAGCAGACTTACGGCCTTAGGAAGAAAGCTGCTCACCACATACT 	561256 	0.4974295773100019 	No Hit
CTGCCATCTTTGCGGGTGACTTTCCATCCCTTGAACCAAGGCATATTAGC 	491273 	0.43540509274522954 	No Hit
GGCAAATGAAAGGATTCTCCAGGGGCAACACAAATCAGGTTTTCAATTAT 	486016 	0.4307459224416272 	No Hit
AGACTTCATTGCTCATAGCTATATAGCCTTCATGCTGGGTTAGCTAGCTT 	478411 	0.4240057683311276 	No Hit
TCTTTTGTGCAGGAAGCAGGGGAAGGACCAGGTGTCTACCACCTGTAGAA 	458030 	0.4059425098267104 	No Hit
CTCTCCTTGCTTTCTCCTTGCTAGCTGCCCCCTCCTTGGCAGCCCACACC 	448639 	0.39761946087842615 	No Hit
CATAGCAAATCCTGTACAACCTTCAAAATGGATGCAGAATGCCTCCACTC 	440726 	0.39060633274214956 	No Hit
CTCCCCGGGGCTCCCGCCGGCTTCTCCGGGATCGGTCGCGTTACCGCACT 	425394 	0.3770178984460049 	No Hit
GTGGACCAACCCATGTACTGTGGTACATCTACACAAAATCAGTGACTTCA 	421541 	0.3736030642858794 	No Hit
GCTTCGCGCCCCAGCCCGACCGACCCAGCCCTTAGAGCCAATCCTTATCC 	421527 	0.3735906563756168 	No Hit
TAGGGAATTGAAACACCACAAGTGGTAGGAAGTGCGGCCACAAGGTCGGT 	414823 	0.3676490399184453 	No Hit
GGTGCCCTTCCGTCAATTCCTTTAAGTTTCAGCTTTGCAACCATACTCCC 	403885 	0.3579549168861449 	No Hit
TACCTTCTTCTGGTGTGTCTGAAGACAGTACAGTGTACGTGCATACGTAC 	349789 	0.3100107516314984 	No Hit
CCCACCATACATATTAACTGTAGTCACAATGTGACCGACTTCTTTTTGCT 	344734 	0.30553060974739904 	No Hit
CCTCAGGTTCTGGTGACATTCCTGCTACTCCCACACTACTAGCTTATATT 	342766 	0.3037864120762008 	No Hit
CGCTCTGGTCCGTCTTGCGCCGGTCCAAGAATTTCACCTCTAGCGGCGCA 	325915 	0.28885171951656513 	No Hit
ATGTCTCTCAGACCAACAGAATGTGAAGACAATGGCTGTACATGGCGGCC 	325544 	0.2885229098946065 	No Hit
CAATTCGATGGTGTTTCCATTCGATTCATTCGATGTTGATTCCATTAGCT 	323500 	0.28671135499626843 	No Hit
ATCATTTGAGGTCAAGAGTTCGAGACCAGCCTGGCCAACATGGTGAAACC 	318014 	0.2818492267319422 	No Hit
CTGGTCAGCCACAGCAAAGACTGGGAAGAGCACCTGAGGGAAGGACGCAC 	307797 	0.27279411107816515 	No Hit
ACCTGATCTAGCCTAGAGACCAGACCCTAGGTGACAGTACTGTTTCAAGC 	290243 	0.2572363641674867 	No Hit
TGACTTTGTATGTTCATTGTAACTTCTTTGTTGATTCATCTAGCTTTCTC 	289391 	0.2564812542000776 	No Hit
CCAACTGTTGCCTCGGTGCCACACTCCATCATCAATGGGTACAAGCGCGT 	280591 	0.24868199632073554 	No Hit
CGCCTAGAAATTTTGATTCCATTCGTGAAAATTTTTCTATATCCCGAACA 	278477 	0.24680840187108452 	No Hit
AAAGTCGAAATGCAGGATGGGATTTTAAAATGGTAGAAGAGTAGGAAGCT 	276112 	0.24471235131601132 	No Hit
GCCCTTTTTCTTGTGCAGTTTGAGTTTGGAAATGTCTTAGAGCATGTCTT 	264968 	0.23483565474698995 	No Hit
CCTGGTACAACTCCTGGTGGTGGGTCTGGGAGGGCTGACTGGGCAGGGAG 	263449 	0.23348939648349898 	No Hit
CTATACAATTCTCTGTTATGTGGGTCTGTCATGTGCACTGTAGGACATTT 	261867 	0.23208730262382635 	No Hit
GTCTGTGATGCCCTTAGATGTCCGGGGCTGCACGCGCGCTACACTGACTG 	260688 	0.2310423793238554 	No Hit
CCAGTGTTGTGATTGAGCTATCCCACCAAAAGTATCGAGACCCACCTGTG 	258187 	0.22882579478337417 	No Hit
TCTCTCTCAATTTGGTCTTCTAGGTGATTCTAGTTCCAGTCAGTTGACAA 	254677 	0.22571495442468206 	No Hit
TGGGATTATAGGCGTGCGTCACCACGCCCAGCTAATTTTGTTGTATTTTT 	242812 	0.21519925047713734 	No Hit
CTGGTCAAGTGAAGCAGTGGGAGCGGAGAAGGAACAAAGAAATCTGTAAC 	230735 	0.2044956553170448 	No Hit
CTCCTATTCCATCTCCCTGCTCCAAAAATCCATTTAATATATTGTCCTCG 	230627 	0.20439993715216198 	No Hit
GTTTGATATGGTTTGGCTGTATTCCCATCCAACTATCACCTTGAATTGTA 	217060 	0.1923757858284081 	No Hit
CTGTGCCATCTATGAGGGACAGCCGCTGACGTGTCCTCATTGGCAGTGTG 	211325 	0.1872929740172687 	No Hit
CTTTTCAGGAGCACCCCACTTGTGGTACCAATTTACTCTGTGAGTCCATT 	210579 	0.18663180965613355 	No Hit
CCTGGTAGTATACTTTTCTGGTAGAGAGTAGTATATGTATTTTGTGGAAC 	208492 	0.18478214474770324 	No Hit
GTCGCTTCTTGGAACCCAATTGCTTCTCATGGGTTGGGTGGAGAGCAAAC 	202247 	0.17924733049128375 	No Hit
CCCCCCAAGCACCCCACCTTGTCCCCCAGGATGGTCAGGCATCTAGGGAT 	200278 	0.17750224654078098 	No Hit
AGAAGCAGGGCTCTACCATAACTAGAGCTCTGAGGCGGGATGTCAGTTAG 	198273 	0.17572525653531723 	No Hit
CAGTTAACACTATAATCAAATGTACTTATAAAATCTGGACCTAACAGCAT 	198031 	0.17551077694363532 	No Hit
ATTATATAAGTGTTTGTTCATTTGCGGGTGAAGCTACCATTTCCCACAAA 	197166 	0.1747441453452682 	No Hit
CCTTAAAGTATTTTTGAACTATGAAACAAAAACTAAACTGGCTTTATCCA 	195457 	0.1732294940139278 	No Hit
GTGCACCGGCTGCTCCGCAAGGGCAACTACTCGGAGCGCGTGGGCGCCGG 	194906 	0.17274115411716442 	No Hit
TGGACAATGACAGGAGGTAAAACCATGGGGAAAGAATGTTACCTACTGAG 	192060 	0.17021880321664082 	No Hit
CATCCATATCAGAATCCTGTCAACAAGCACTCCTGTCTTCATTAAGTTTT 	191966 	0.17013549296202057 	No Hit
AATCATCGAGTGGAATCGAATGGAATTATGATCAAATGGAATCGAATGTA 	190779 	0.16908347942761387 	No Hit
AGTTACTTGGTGACTTCAGTTCATTCTCACTTGGACACGCTTGTATTTAG 	186902 	0.16564737456418102 	No Hit
CTCAAGCATTATTACAGAGCAATAGTTAAAAAACTTATATGGTATTGGTA 	185477 	0.16438442655531027 	No Hit
CTCCAGTACCTGTCTAGGCATACACAACTGCACCTGGTTTGTGTGGTGCT 	183150 	0.16232205461380697 	No Hit
CTATAATTCCTTTATACCACACTTGAAATTATCCTGGTTGTAATTTTTTT 	181595 	0.16094389029535505 	No Hit
TATCGAACAGATATCTGCCATGTTTATTGCAGCACTATTCACAATAGCCA 	179886 	0.15942923896401465 	No Hit
GCGGCTGAGGCGGCCGTCGGCTGGGTGGGCAGGAGTGGTCGGGCGAACCC 	178773 	0.15844281009813876 	No Hit
AATTTATTAGTATAAAGCAGGGACAGGAGAGATGGTTCTAGAAGTAAAAG 	170043 	0.15070559177010964 	No Hit
CAAGCACAAGGCCCTGGGTTCAGTCCCCAGCTCCAAAAAAAAAAATTATT 	167454 	0.14841101465083503 	No Hit
CTGATAAATGCACGCATCCCCCCCCGGGAAGGGGGGTCAGCGCCCGTCGG 	162444 	0.14397075533543685 	No Hit
AAGAGCACACCGACAGGTACCAGCAAATGCTGACGGGCCATCAATGCGGG 	161423 	0.1430658641655723 	No Hit
ATAATATTTTAGAGGCAGAAGATCATAAAGTCCACAGAGAAACTGAGAGC 	158692 	0.14064543538506283 	No Hit
CCCAGGCTGGAGTGCAGTGGCACAATCTCGGCCCACTGCAACCTCCGCCT 	145570 	0.12901567835179842 	No Hit
TCATTGAGATTAGCCAGACCCAAAGCTTGTACACCTCAATGAACTTAATA 	143644 	0.1273087044113879 	No Hit
TGGTTAGGTGGAGGGAAAAAATAGTTAAATTTATGGATGTTTTAGTATGG 	140787 	0.12477660443851511 	No Hit
AACGTATAAGGTCATCCACTATTAGACCACATGGGTATAAGGCTGTCCCT 	139600 	0.12372459090410841 	No Hit
CTTCCACAACTTCCTTCTTCTCCTTTAAGTCCTTGGTGGTGATTTCGGAG 	139514 	0.12364837088392393 	No Hit
GTTTGCTTCAGAGGCACTGTGTTCCACCCAGAAACATAGACTGCAAGACC 	138876 	0.12308292468767162 	No Hit
CATTCATTCCTCCATGGCTTCTGCTTCAGTTCCTGCCTCCAGGTTCCTGC 	136648 	0.12110829439731094 	No Hit
ATCTCATGGCAGAAGAGCATCACATGCTGAGCAGCTGCACAAGATAGAGC 	134427 	0.11913986806208153 	No Hit
CTCTGCTGCTTAATTTCAGGAATGGCAAATTATCAGCATTACTGACATAT 	133329 	0.1181667333857727 	No Hit
CTTCCTTCTACTGTTCAGTCTATGTCATATCAAATAAATTTACTCATTAG 	128815 	0.11416606860539201 	No Hit
CTCTAGTAAACATGTCATCTCACTAGCACAAATGTCCTCGTTAGCCAGTC 	127186 	0.11272231961840926 	No Hit
TGGTATGAGATTGATAGTTAATAAGTATTGTAAGGGAAAGTTGAAAAGAA 	125995 	0.11166676096678466 	No Hit
TACACACACCTTTAAATTTACGAATTCCCAAAACTAAGTCAAGCAGGGTA 	123011 	0.10902210352224412 	No Hit
ATCTGGACGTCCCTGAAGCAGGGGGACAGGTGTACAGACATGTTCTTGTG 	122365 	0.10844956709155605 	No Hit
ATTAGCCTTGTCTTTGGAAGGAGACTTACTGTCTCTCTTCCTAAATTTAA 	121134 	0.10735855726775265 	No Hit
CCTCCTCTCATTTTTGTTTTGCCTTTGAATATTGCTTTCACTAATTTTAG 	119193 	0.10563828913777502 	No Hit
Here is a relatively normal sample for comparison:
Code:
Sequence 	Count 	Percentage 	Possible Source
TGGGAGTTTGAGGAGATGTTAGTTGATGTGAGAGAGAATTGAGGTAGATG 	270138 	0.22391259681883607 	No Hit
TGGTATGAGATTGATAGTTAATAAGTATTGTAAGGGAAAGTTGAAAAGAA 	263551 	0.2184527493510764 	No Hit
CGGGAGTTTTAGTGTATTAGGGTTTTAGATGGTTTTTGGTTTTTTTTTTT 	215863 	0.17892501198315086 	No Hit
CGGTTTTAGAGGAATTTTGTTTTTGTGTGTTTTGAGTTTATTAGGTAGGT 	210263 	0.17428327130917876 	No Hit
TGGAGGGAGGAGTGGGGATGGTGATGGTGGGATGTGGGGAGGGGGGAGAG 	200400 	0.16610800554714536 	No Hit
TGGGGAGTGGGGTTTTGTGAGTAGATTTTTAGTTGTGTGATGTGATTTTT 	187398 	0.15533087836089793 	No Hit
TGGGGATATAGTATTTTTTTGGTTTTAGAGGTTTAGGTTTTTGTTATTAG 	178946 	0.14832516547225286 	No Hit
TGGTTGTTGTGGTTGTGGTGGTGTGTTTTGTTTGGTTTTTTGGAGGTGTG 	176769 	0.14652068878524618 	No Hit
GCTGTCCACACGTCGTTGAAAGGCACTGACTGCCCCTGAGCTACTTAGGG 	170937 	0.1416866474262095 	No Hit
CGGTACGAGATCGATAGTTAATAAGTATCGTAAGGGAAAGTTGAAAAGAA 	168924 	0.14001810743036916 	No Hit
CATTTCAGGCCTTGTGCCAACATCATTAAACTCCCAGTCATACCCAAAAC 	158301 	0.13121289114829668 	No Hit
TAGGCAGTACCATTCAGGACATAGGCATGGGCAAGGACTTCATGTCTAAA 	152310 	0.12624705750940973 	No Hit
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGT 	135347 	0.11218672767859023 	Illumina Paired End PCR Primer 2 (100% over 50bp)
TGGTTGTGGGAATGTTGTTGTGGAAGGGGGGGATGAGGTGGTAATTGTAG 	124767 	0.10341715333383573 	No Hit
CGGGGGACGTTTTAATCGCGTAGGTTTTGGGATTCGTGAGAGACGTTTTA 	124016 	0.10279466275416556 	No Hit
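The sequence flagged as "Illumina Paired End PCR Primer 2" (GATCGGAAGAGCGG..., shown in red in the original post) is one of the two adapters used. In total, two adapters and two PCR primers were used in the sequencing process. They are: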

PE Adapters
Code:
5' P-GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
5' ACACTCTTTCCCTACACGACGCTCTTCCGATCT
PE PCR Primer 1.0
Code:
5' AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
PE PCR Primer 2.0
Code:
5' CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
While it is easy to spot one adapter as contamination, I have difficulty in finding the possible sources for other overrepresented sequences. They don't seem to stem from the other adapter and primers.
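One way to test that impression is to search each overrepresented sequence against the adapters and primers in both orientations, since a read can show an oligo as its reverse complement. The following is a minimal Python sketch, assuming that an exact match of the first 20 bases is sufficient evidence. The function names and the seed length are illustrative choices and are not part of FastQC.

```python
# Oligo sequences copied from the post above (5' to 3', 5' phosphate omitted).
OLIGOS = {
    "PE Adapter 1": "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG",
    "PE Adapter 2": "ACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "PE PCR Primer 1.0": "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "PE PCR Primer 2.0": "CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT",
}

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def adapter_hits(read, oligos=OLIGOS, seed_len=20):
    """List (oligo name, orientation) pairs whose sequence contains the read's first seed_len bases."""
    seed = read.upper()[:seed_len]
    hits = []
    for name, oligo in oligos.items():
        if seed in oligo:
            hits.append((name, "forward"))
        if seed in reverse_complement(oligo):
            hits.append((name, "reverse complement"))
    return hits


# Two overrepresented sequences taken from the FastQC table above.
adapter_read = "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGT"
top_read = "CTCCCACTTATTCTACACCTCTCATGTCTCTTCACCGTGCCAGACTAGAG"

print(adapter_hits(adapter_read))
print(adapter_hits(top_read))
```

With these inputs the adapter-derived read matches PE Adapter 1 directly and PE PCR Primer 2.0 as a reverse complement, while the most abundant sequence matches none of the four oligos, which is consistent with a non-adapter origin.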

My colleague suggested that I try BLAST, so I ran UCSC BLAT on the first overrepresented sequence (shown in green in the original post), "CTCCCACTTATTCTACACCTCTCATGTCTCTTCACCGTGCCAGACTAGAG". The result shows that it came from chrUn:23414511-23414560:

cDNA YourSeq
Code:
CTCCCACTTA TTCTACACCT CTCATGTCTC TTCACCGTGC CAGACTAGAG  50
Genomic chrUn (reverse strand):
Code:
cagtgaaaaa acgatgagag tagtggtatt tcaccggcgg cccgcgaggc  23414611
cggcggaccc cgccccgacc cctcgcgggg aacggggggg cgccgggggc  23414561
CTCCCACTTA TTCTACACCT CTCATGTCTC TTCACCGTGC CAGACTAGAG  23414511
tcaagctcaa cagggtcttc tttccccgct gattccgcca agcccgttcc  23414461
cttggctgtg gtttcgctgg atagtaggta gggacagtgg gaatctcgtt  23414411
Side by Side Alignment
Code:
00000001 ctcccacttattctacacctctcatgtctcttcaccgtgccagactagag 00000050
<<<<<<<< |||||||||||||||||||||||||||||||||||||||||||||||||| <<<<<<<<
23414560 ctcccacttattctacacctctcatgtctcttcaccgtgccagactagag 23414511
I am puzzled that a sequence from the genome could be abundant enough to make up 3.69% of the total reads.

I also tried BLAT for "CGCCTGATTATCTCACCGGCAGTCTTGCCGGTGACAATGGGTTTGACCCG" and "TCATCAGTTACATTGGAATCCAAATTGCCAACAAAAATAGTAGTGTTATT", but no matches were found. I therefore have a couple of questions:

1) How can I find the possible origins of the overrepresented sequences?
2) How should I filter them?
2.1) Is it safe to filter all of them out (provided it is certain that they are pure contamination)?
2.2) FastQC only reports overrepresented sequences whose frequency is above 0.1%. Do I need to search for more such sequences? If so, how should I determine the threshold? (BSMAP uses a parameter -k to filter the top overrepresented k-mers, its default being 1e-6.)
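For question 2.2, exact-sequence frequencies can be tabulated at any threshold rather than at a fixed 0.1%. The following is a minimal Python sketch, assuming a standard 4-line-per-record FASTQ file. The function names and the toy data are hypothetical. Holding every distinct sequence of a full lane in memory may be impractical, so running this on a subsample of reads is a reasonable assumption.

```python
import io
from collections import Counter


def read_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def overrepresented(handle, min_fraction=0.001):
    """Return (sequence, count, fraction) for sequences at or above min_fraction, most frequent first."""
    counts = Counter(read_sequences(handle))
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total >= min_fraction
    ]


# Toy FASTQ with four reads, three of them identical (illustrative data only).
TOY_FASTQ = (
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nTTTT\n+\nIIII\n"
    "@r4\nACGT\n+\nIIII\n"
)

print(overrepresented(io.StringIO(TOY_FASTQ), min_fraction=0.5))
```

Lowering min_fraction lengthens the list quickly, so the threshold is best chosen by looking at where the sequences stop being identifiable as contaminants rather than by a fixed rule.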

Lastly, two less relevant questions. First, about the raw reads that do not begin with C or T: since my data are MspI digested (cut at C-CGG), fragments are supposed to begin with C or T, so is it safe to discard the reads that do not? Second, as the methylation information is concentrated at the head of the reads, is it necessary or feasible to study methylation contexts other than CpG (e.g. CHG, CHH) from my data?
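The start-site expectation can be checked directly on the two FastQC tables. After MspI digestion and bisulfite conversion, reads from a directional library would be expected to begin with CGG or TGG. The following is a minimal Python sketch under that assumption. The function name is hypothetical and the inputs are the first ten bases of the top four sequences of each sample.

```python
from collections import Counter

# Read starts expected after MspI digestion (C^CGG) and bisulfite conversion,
# assuming a directional library.
EXPECTED_STARTS = ("CGG", "TGG")


def start_site_summary(seqs):
    """Count sequences that do or do not begin with an expected MspI start."""
    return Counter(
        "MspI-like" if s.upper().startswith(EXPECTED_STARTS) else "other"
        for s in seqs
    )


# First ten bases of the top four sequences in each FastQC table above.
CONTAMINATED_TOP = ["CTCCCACTTA", "CGCCTGATTA", "GATCGGAAGA", "TCATCAGTTA"]
NORMAL_TOP = ["TGGGAGTTTG", "TGGTATGAGA", "CGGGAGTTTT", "CGGTTTTAGA"]

print("contaminated:", dict(start_site_summary(CONTAMINATED_TOP)))
print("normal:", dict(start_site_summary(NORMAL_TOP)))
```

In this small comparison the top sequences of the normal sample all start with CGG or TGG, while none of the top sequences of the problem sample do, which suggests those reads did not come from MspI-digested fragments.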

Thanks for any advice.

PS. I forgot to mention the species is rat. Thanks.

Last edited by foehn; 09-11-2013 at 07:30 AM.
Old 09-11-2013, 08:11 AM   #2
Blahah404
Member
 
Location: Cambridge, UK

Join Date: Dec 2011
Posts: 48

I used NCBI BLAST to check the identity of the first two sequences. The first one is mammalian ribosomal RNA...

Code:
TPA: Mus musculus ribosomal DNA, complete repeating unit	99.6	99.6	100%	1e-18	100%	BK000964.3
Chain 5, Structure Of The H. Sapiens 60s Rrna	99.6	99.6	100%	1e-18	100%	3J3F_5
The second is Arabidopsis mRNA:

Code:
Arabidopsis thaliana clone 2531 mRNA, complete sequence	99.6	99.6	100%	1e-18	100%	AY086470.1
What species are you sequencing?

Last edited by Blahah404; 09-11-2013 at 08:35 AM.
Old 09-11-2013, 06:59 PM   #3
foehn
Junior Member
 
Location: china

Join Date: Sep 2013
Posts: 5

Hi Blahah404, it is rat, so I didn't search other species. It is quite a surprise to learn there may be mouse and Arabidopsis mixed in.
Old 09-12-2013, 12:58 AM   #4
Blahah404
Member
 
Location: Cambridge, UK

Join Date: Dec 2011
Posts: 48

It's not that unusual to get contamination, either at the wetlab stage, or at the sequencing centre. If you don't work on Arabidopsis you might want to check some more of the overrepresented sequences in NCBI, and if there's a significant amount of contamination you can filter it out using bowtie2 against the Arabidopsis transcriptome. Same for rRNA using the Silva rRNA database.
Old 09-12-2013, 11:17 PM   #5
foehn
Junior Member
 
Location: china

Join Date: Sep 2013
Posts: 5

I've checked with our sequencing experimentalist, and it is now certain that there are contaminants from other species. According to the BLAST results for roughly the top 20 overrepresented sequences, there are at least Arabidopsis, human, and mouse sequences. Filtering against the Arabidopsis genome may work, but for the human and mouse contamination, would similar alignments also filter out rat reads because of mammalian homology? Any advice would be appreciated, thanks.
Old 09-12-2013, 11:36 PM   #6
ewels
Phil Ewels
 
Location: SciLifeLab, Stockholm, Sweden

Join Date: Mar 2011
Posts: 32

It might be worth having a look at the data with FastQ Screen - we routinely use this along with FastQC to check for potential contamination.

In addition to showing you what other species you have contamination from, it will show you whether the reads matching those species are unique. If so, you can safely ignore them and just map against the reference genome you're interested in. If they come up red (matching multiple genomes) then you'll need to filter them out.

Final plug - we use Trim Galore! to remove adapter contamination. If your contaminants are only a few sequences, it's relatively easy to get Trim Galore! to remove these from your library as well.
Old 09-13-2013, 12:15 AM   #7
foehn
Junior Member
 
Location: china

Join Date: Sep 2013
Posts: 5

Hi tallphil, thanks for introducing the software. The problem is that this sample is not simply contaminated by adapters (only ~2%); there is a huge amount (>40%) of contamination from foreign species, including human and mouse, which may share homology with rat, so it is difficult to decide what to remove.
Old 09-13-2013, 01:05 AM   #8
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 618

Hi Foehn,

Do you have any idea where all these contaminants are coming from? In any case, expanding on what Phil has recommended, I'd like to suggest the following strategy:

Running FastQ Screen is normally a good way to get a quick idea of whether you've got contaminating species. However, it has the limitation that it doesn't normally work for bisulfite-converted sequences unless you use specially prepared genomes (and even then you would get problems with methylated sequences). Looking at some of the sequences in the list, however, it would appear that your contaminating sequences are not bisulfite converted, in which case FastQ Screen should work just fine. Since normal genomic sequences look like fully methylated sequences, it is all the more important to remove them, as they could potentially affect the conclusions you draw from your experiment later on.
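The observation that the contaminants are not bisulfite converted can be illustrated by counting cytosines outside a CpG context, which should be rare in converted reads from a directional library. The following is a minimal Python sketch under that assumption. The function name is hypothetical and the two sequences are taken from the FastQC tables earlier in the thread.

```python
def non_cpg_c_fraction(seq):
    """Fraction of bases that are C and not followed by G.

    A C at the very last position is counted as non-CpG because its context is unknown.
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    non_cpg = sum(
        1
        for i, base in enumerate(seq)
        if base == "C" and seq[i + 1 : i + 2] != "G"
    )
    return non_cpg / len(seq)


# Top sequence of the problem sample and top sequence of the normal sample.
suspect = "CTCCCACTTATTCTACACCTCTCATGTCTCTTCACCGTGCCAGACTAGAG"
converted = "TGGGAGTTTGAGGAGATGTTAGTTGATGTGAGAGAGAATTGAGGTAGATG"

print("suspect:", non_cpg_c_fraction(suspect))
print("converted:", non_cpg_c_fraction(converted))
```

A high non-CpG cytosine fraction is only a heuristic, since genuine non-CpG methylation or incomplete conversion would also raise it, but a large gap between samples is a useful warning sign.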

Here is what I would do:
1) Identify contaminating species using FastQ Screen or a similar tool (you have already identified human, mouse and Arabidopsis)
2) Align sequences against the contaminating genomes with Bismark using the option --unmapped. This will then write out FastQ files of all sequences that did not map against the contaminant; in other words, it removes sequences that align to the contaminants.
3) Repeat step 2) for all contaminants
4) Use the remaining unmapped FastQ files to align against the rat genome and see if the results make any sense
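The logic of steps 2) and 3) amounts to dropping every read that aligned to any contaminant. The following is a minimal Python sketch of that logic only, assuming the IDs of aligned reads have already been collected from the alignment output. It does not call Bismark, and all names and the toy data are hypothetical.

```python
import io


def fastq_records(handle):
    """Yield (header, sequence, plus, quality) tuples from a 4-line FASTQ file."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file or a trailing blank line
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        yield tuple(line.rstrip("\n") for line in (header, seq, plus, qual))


def read_id(header):
    """Return the read ID: the header without '@' and without any description."""
    return header[1:].split()[0]


def drop_contaminant_reads(handle, contaminant_ids):
    """Yield only the records whose read ID did not align to any contaminant."""
    for record in fastq_records(handle):
        if read_id(record[0]) not in contaminant_ids:
            yield record


# Toy input: three reads, two of which aligned to a contaminant genome.
TOY_FASTQ = (
    "@r1 lane1\nCGGTTTTAGA\n+\nIIIIIIIIII\n"
    "@r2 lane1\nCTCCCACTTA\n+\nIIIIIIIIII\n"
    "@r3 lane1\nCGCCTGATTA\n+\nIIIIIIIIII\n"
)
mapped_to_human = {"r2"}        # hypothetical IDs from one contaminant alignment
mapped_to_arabidopsis = {"r3"}  # hypothetical IDs from another
contaminant_ids = mapped_to_human | mapped_to_arabidopsis

for record in drop_contaminant_reads(io.StringIO(TOY_FASTQ), contaminant_ids):
    print("\n".join(record))
```

Taking the union of the ID sets is equivalent to repeating the filter once per contaminant. Reads from regions conserved between rat and another mammal would be discarded as well, so the surviving fraction is worth checking before step 4).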
Old 09-13-2013, 01:46 AM   #9
foehn
Junior Member
 
Location: china

Join Date: Sep 2013
Posts: 5

Hi fkrueger, I have no idea about the source, and the sequencing staff don't know for certain either; they only told me the contamination may have been introduced after library preparation or during sequencing.
Old 09-13-2013, 02:56 AM   #10
Blahah404
Member
 
Location: Cambridge, UK

Join Date: Dec 2011
Posts: 48

Foehn,

In addition to the good advice given by others above, because you've got over 40% contamination, I would consider asking the sequencing centre to resequence that sample free of charge. We usually get some consideration from the sequencing centre in these cases.

Tags
bisulfite-seq, contamination, filtering reads, overrepresentation, rrbs
