  • RRBS overrepresented sequences

    Hi all, I am processing RRBS data generated on an Illumina HiSeq 2000 (50 bp, single end). I ran FastQC for quality control and found that one sample has many more overrepresented sequences than the others:

    Code:
    Sequence 	Count 	Percentage 	Possible Source
    [COLOR="Green"]CTCCCACTTATTCTACACCTCTCATGTCTCTTCACCGTGCCAGACTAGAG[/COLOR] 	4168787 	3.694709642846457 	No Hit
    CGCCTGATTATCTCACCGGCAGTCTTGCCGGTGACAATGGGTTTGACCCG 	2534233 	2.2460382606066713 	No Hit
    [COLOR="Red"]GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGT[/COLOR] 	2493949 	2.2103353851053744 	Illumina Paired End PCR Primer 2 (100% over 50bp)
    TCATCAGTTACATTGGAATCCAAATTGCCAACAAAAATAGTAGTGTTATT 	1344836 	1.1919003147071456 	No Hit
    AAATTATGCAGTCGAGTTTCCCACATTTGGGGAAATCGCAGGGGTCAGCA 	1281767 	1.1360035652534837 	No Hit
    GATAAAGACTCATTCCTTGTAGAGCAATAAAATTTATCGTGGCTTAACTA 	1028733 	0.9117447677260468 	No Hit
    CACAGAGTGGAACGTCCCTTTAGACAGAGCAGATTTGAAACACTCTTTTT 	1027822 	0.9109373672796742 	No Hit
    GATCTCTTTCACTGTCATAATTTCCTCAGTTATAATTTTGCAAAGGCGGT 	980751 	0.8692193141389342 	No Hit
    CAGATCATGGGCACCAGACAGGCAAGACAGGTTTGTTAAGAGATGGGTGG 	941876 	0.8347652061776363 	No Hit
    CCCGCACCGTCCCTGGCCAGATGTGAGTCCTCCCACCCCTGTCGGGGCTC 	896901 	0.7949047944590668 	No Hit
    CTCCTATTTCCAAAAATCCATTTAATATATTGTCCTCGGATAGAGGACGT 	857061 	0.7595954269689545 	No Hit
    CCGAGTTTTGTGGAGGAACCCACATAACAAACACCCGCGAAGCCAAAGCA 	841097 	0.7454468641523844 	No Hit
    GGGGGGAATAAGGAATGACTGCAAATGGGTATGGAGTTTTTTAGGATGTT 	764633 	0.6776784034153376 	No Hit
    GTCAGCCTTGACTACATAGCAAAACTCAAGGCCAACCAGACCTAAAAACA 	738731 	0.6547219968709378 	No Hit
    AGTGAGAACCTGGTGCTATGGACAGCTAAGAGCTCACATCCCAAACTGCA 	689985 	0.6115194258952095 	No Hit
    AGAGATTAAGGCCGAGTACTGGGCGGTGTCCTCCCCCACTGCACTAACCT 	677824 	0.6007413832735414 	No Hit
    CCAATATAGGATGGCCCCCTACCAAAAGCTGAGTTTTAGACTACATCCCT 	648286 	0.5745624651780862 	No Hit
    GCTGGGCGTGGTGGTGGGCGCCTGTAGTCCCAGCTGCTCGGGAGGCTGAG 	635748 	0.5634502952586327 	No Hit
    CACTTTCTCAGGTATAGAGAGACTCACTTCCTCCTGTGGAGGAAAGCCTG 	627786 	0.556393739436437 	No Hit
    GGGGCTTATCACAGCAATAGAACAGCAATTATGACTGGAGTATGATAGTT 	620923 	0.5503112045698546 	No Hit
    TGAAAAAACATGAAGACCTCGGTCTGGATCTCTAACACCCGCACTTTCCA 	617973 	0.5476966806216661 	No Hit
    CTCGCCTCCGGTGCACCTCAGGTACACGACTTTGACCTCGTTGGGGTCGA 	610104 	0.5407225487747862 	No Hit
    GTGTGGAATGCCCAGGAGGCCCAGGCTGACTTTGCCAAGGTGTTGGAGCT 	578537 	0.51274536997056 	No Hit
    TGTTGTTCTGAGGGTCTACCCGAACTGCTCCTGAGGGGCCCAGGTTTGTA 	573695 	0.5084540055783129 	No Hit
    AGAGCACAGCAGACTTACGGCCTTAGGAAGAAAGCTGCTCACCACATACT 	561256 	0.4974295773100019 	No Hit
    CTGCCATCTTTGCGGGTGACTTTCCATCCCTTGAACCAAGGCATATTAGC 	491273 	0.43540509274522954 	No Hit
    GGCAAATGAAAGGATTCTCCAGGGGCAACACAAATCAGGTTTTCAATTAT 	486016 	0.4307459224416272 	No Hit
    AGACTTCATTGCTCATAGCTATATAGCCTTCATGCTGGGTTAGCTAGCTT 	478411 	0.4240057683311276 	No Hit
    TCTTTTGTGCAGGAAGCAGGGGAAGGACCAGGTGTCTACCACCTGTAGAA 	458030 	0.4059425098267104 	No Hit
    CTCTCCTTGCTTTCTCCTTGCTAGCTGCCCCCTCCTTGGCAGCCCACACC 	448639 	0.39761946087842615 	No Hit
    CATAGCAAATCCTGTACAACCTTCAAAATGGATGCAGAATGCCTCCACTC 	440726 	0.39060633274214956 	No Hit
    CTCCCCGGGGCTCCCGCCGGCTTCTCCGGGATCGGTCGCGTTACCGCACT 	425394 	0.3770178984460049 	No Hit
    GTGGACCAACCCATGTACTGTGGTACATCTACACAAAATCAGTGACTTCA 	421541 	0.3736030642858794 	No Hit
    GCTTCGCGCCCCAGCCCGACCGACCCAGCCCTTAGAGCCAATCCTTATCC 	421527 	0.3735906563756168 	No Hit
    TAGGGAATTGAAACACCACAAGTGGTAGGAAGTGCGGCCACAAGGTCGGT 	414823 	0.3676490399184453 	No Hit
    GGTGCCCTTCCGTCAATTCCTTTAAGTTTCAGCTTTGCAACCATACTCCC 	403885 	0.3579549168861449 	No Hit
    TACCTTCTTCTGGTGTGTCTGAAGACAGTACAGTGTACGTGCATACGTAC 	349789 	0.3100107516314984 	No Hit
    CCCACCATACATATTAACTGTAGTCACAATGTGACCGACTTCTTTTTGCT 	344734 	0.30553060974739904 	No Hit
    CCTCAGGTTCTGGTGACATTCCTGCTACTCCCACACTACTAGCTTATATT 	342766 	0.3037864120762008 	No Hit
    CGCTCTGGTCCGTCTTGCGCCGGTCCAAGAATTTCACCTCTAGCGGCGCA 	325915 	0.28885171951656513 	No Hit
    ATGTCTCTCAGACCAACAGAATGTGAAGACAATGGCTGTACATGGCGGCC 	325544 	0.2885229098946065 	No Hit
    CAATTCGATGGTGTTTCCATTCGATTCATTCGATGTTGATTCCATTAGCT 	323500 	0.28671135499626843 	No Hit
    ATCATTTGAGGTCAAGAGTTCGAGACCAGCCTGGCCAACATGGTGAAACC 	318014 	0.2818492267319422 	No Hit
    CTGGTCAGCCACAGCAAAGACTGGGAAGAGCACCTGAGGGAAGGACGCAC 	307797 	0.27279411107816515 	No Hit
    ACCTGATCTAGCCTAGAGACCAGACCCTAGGTGACAGTACTGTTTCAAGC 	290243 	0.2572363641674867 	No Hit
    TGACTTTGTATGTTCATTGTAACTTCTTTGTTGATTCATCTAGCTTTCTC 	289391 	0.2564812542000776 	No Hit
    CCAACTGTTGCCTCGGTGCCACACTCCATCATCAATGGGTACAAGCGCGT 	280591 	0.24868199632073554 	No Hit
    CGCCTAGAAATTTTGATTCCATTCGTGAAAATTTTTCTATATCCCGAACA 	278477 	0.24680840187108452 	No Hit
    AAAGTCGAAATGCAGGATGGGATTTTAAAATGGTAGAAGAGTAGGAAGCT 	276112 	0.24471235131601132 	No Hit
    GCCCTTTTTCTTGTGCAGTTTGAGTTTGGAAATGTCTTAGAGCATGTCTT 	264968 	0.23483565474698995 	No Hit
    CCTGGTACAACTCCTGGTGGTGGGTCTGGGAGGGCTGACTGGGCAGGGAG 	263449 	0.23348939648349898 	No Hit
    CTATACAATTCTCTGTTATGTGGGTCTGTCATGTGCACTGTAGGACATTT 	261867 	0.23208730262382635 	No Hit
    GTCTGTGATGCCCTTAGATGTCCGGGGCTGCACGCGCGCTACACTGACTG 	260688 	0.2310423793238554 	No Hit
    CCAGTGTTGTGATTGAGCTATCCCACCAAAAGTATCGAGACCCACCTGTG 	258187 	0.22882579478337417 	No Hit
    TCTCTCTCAATTTGGTCTTCTAGGTGATTCTAGTTCCAGTCAGTTGACAA 	254677 	0.22571495442468206 	No Hit
    TGGGATTATAGGCGTGCGTCACCACGCCCAGCTAATTTTGTTGTATTTTT 	242812 	0.21519925047713734 	No Hit
    CTGGTCAAGTGAAGCAGTGGGAGCGGAGAAGGAACAAAGAAATCTGTAAC 	230735 	0.2044956553170448 	No Hit
    CTCCTATTCCATCTCCCTGCTCCAAAAATCCATTTAATATATTGTCCTCG 	230627 	0.20439993715216198 	No Hit
    GTTTGATATGGTTTGGCTGTATTCCCATCCAACTATCACCTTGAATTGTA 	217060 	0.1923757858284081 	No Hit
    CTGTGCCATCTATGAGGGACAGCCGCTGACGTGTCCTCATTGGCAGTGTG 	211325 	0.1872929740172687 	No Hit
    CTTTTCAGGAGCACCCCACTTGTGGTACCAATTTACTCTGTGAGTCCATT 	210579 	0.18663180965613355 	No Hit
    CCTGGTAGTATACTTTTCTGGTAGAGAGTAGTATATGTATTTTGTGGAAC 	208492 	0.18478214474770324 	No Hit
    GTCGCTTCTTGGAACCCAATTGCTTCTCATGGGTTGGGTGGAGAGCAAAC 	202247 	0.17924733049128375 	No Hit
    CCCCCCAAGCACCCCACCTTGTCCCCCAGGATGGTCAGGCATCTAGGGAT 	200278 	0.17750224654078098 	No Hit
    AGAAGCAGGGCTCTACCATAACTAGAGCTCTGAGGCGGGATGTCAGTTAG 	198273 	0.17572525653531723 	No Hit
    CAGTTAACACTATAATCAAATGTACTTATAAAATCTGGACCTAACAGCAT 	198031 	0.17551077694363532 	No Hit
    ATTATATAAGTGTTTGTTCATTTGCGGGTGAAGCTACCATTTCCCACAAA 	197166 	0.1747441453452682 	No Hit
    CCTTAAAGTATTTTTGAACTATGAAACAAAAACTAAACTGGCTTTATCCA 	195457 	0.1732294940139278 	No Hit
    GTGCACCGGCTGCTCCGCAAGGGCAACTACTCGGAGCGCGTGGGCGCCGG 	194906 	0.17274115411716442 	No Hit
    TGGACAATGACAGGAGGTAAAACCATGGGGAAAGAATGTTACCTACTGAG 	192060 	0.17021880321664082 	No Hit
    CATCCATATCAGAATCCTGTCAACAAGCACTCCTGTCTTCATTAAGTTTT 	191966 	0.17013549296202057 	No Hit
    AATCATCGAGTGGAATCGAATGGAATTATGATCAAATGGAATCGAATGTA 	190779 	0.16908347942761387 	No Hit
    AGTTACTTGGTGACTTCAGTTCATTCTCACTTGGACACGCTTGTATTTAG 	186902 	0.16564737456418102 	No Hit
    CTCAAGCATTATTACAGAGCAATAGTTAAAAAACTTATATGGTATTGGTA 	185477 	0.16438442655531027 	No Hit
    CTCCAGTACCTGTCTAGGCATACACAACTGCACCTGGTTTGTGTGGTGCT 	183150 	0.16232205461380697 	No Hit
    CTATAATTCCTTTATACCACACTTGAAATTATCCTGGTTGTAATTTTTTT 	181595 	0.16094389029535505 	No Hit
    TATCGAACAGATATCTGCCATGTTTATTGCAGCACTATTCACAATAGCCA 	179886 	0.15942923896401465 	No Hit
    GCGGCTGAGGCGGCCGTCGGCTGGGTGGGCAGGAGTGGTCGGGCGAACCC 	178773 	0.15844281009813876 	No Hit
    AATTTATTAGTATAAAGCAGGGACAGGAGAGATGGTTCTAGAAGTAAAAG 	170043 	0.15070559177010964 	No Hit
    CAAGCACAAGGCCCTGGGTTCAGTCCCCAGCTCCAAAAAAAAAAATTATT 	167454 	0.14841101465083503 	No Hit
    CTGATAAATGCACGCATCCCCCCCCGGGAAGGGGGGTCAGCGCCCGTCGG 	162444 	0.14397075533543685 	No Hit
    AAGAGCACACCGACAGGTACCAGCAAATGCTGACGGGCCATCAATGCGGG 	161423 	0.1430658641655723 	No Hit
    ATAATATTTTAGAGGCAGAAGATCATAAAGTCCACAGAGAAACTGAGAGC 	158692 	0.14064543538506283 	No Hit
    CCCAGGCTGGAGTGCAGTGGCACAATCTCGGCCCACTGCAACCTCCGCCT 	145570 	0.12901567835179842 	No Hit
    TCATTGAGATTAGCCAGACCCAAAGCTTGTACACCTCAATGAACTTAATA 	143644 	0.1273087044113879 	No Hit
    TGGTTAGGTGGAGGGAAAAAATAGTTAAATTTATGGATGTTTTAGTATGG 	140787 	0.12477660443851511 	No Hit
    AACGTATAAGGTCATCCACTATTAGACCACATGGGTATAAGGCTGTCCCT 	139600 	0.12372459090410841 	No Hit
    CTTCCACAACTTCCTTCTTCTCCTTTAAGTCCTTGGTGGTGATTTCGGAG 	139514 	0.12364837088392393 	No Hit
    GTTTGCTTCAGAGGCACTGTGTTCCACCCAGAAACATAGACTGCAAGACC 	138876 	0.12308292468767162 	No Hit
    CATTCATTCCTCCATGGCTTCTGCTTCAGTTCCTGCCTCCAGGTTCCTGC 	136648 	0.12110829439731094 	No Hit
    ATCTCATGGCAGAAGAGCATCACATGCTGAGCAGCTGCACAAGATAGAGC 	134427 	0.11913986806208153 	No Hit
    CTCTGCTGCTTAATTTCAGGAATGGCAAATTATCAGCATTACTGACATAT 	133329 	0.1181667333857727 	No Hit
    CTTCCTTCTACTGTTCAGTCTATGTCATATCAAATAAATTTACTCATTAG 	128815 	0.11416606860539201 	No Hit
    CTCTAGTAAACATGTCATCTCACTAGCACAAATGTCCTCGTTAGCCAGTC 	127186 	0.11272231961840926 	No Hit
    TGGTATGAGATTGATAGTTAATAAGTATTGTAAGGGAAAGTTGAAAAGAA 	125995 	0.11166676096678466 	No Hit
    TACACACACCTTTAAATTTACGAATTCCCAAAACTAAGTCAAGCAGGGTA 	123011 	0.10902210352224412 	No Hit
    ATCTGGACGTCCCTGAAGCAGGGGGACAGGTGTACAGACATGTTCTTGTG 	122365 	0.10844956709155605 	No Hit
    ATTAGCCTTGTCTTTGGAAGGAGACTTACTGTCTCTCTTCCTAAATTTAA 	121134 	0.10735855726775265 	No Hit
    CCTCCTCTCATTTTTGTTTTGCCTTTGAATATTGCTTTCACTAATTTTAG 	119193 	0.10563828913777502 	No Hit
    Here is a relatively normal sample for comparison:
    Code:
    Sequence 	Count 	Percentage 	Possible Source
    TGGGAGTTTGAGGAGATGTTAGTTGATGTGAGAGAGAATTGAGGTAGATG 	270138 	0.22391259681883607 	No Hit
    TGGTATGAGATTGATAGTTAATAAGTATTGTAAGGGAAAGTTGAAAAGAA 	263551 	0.2184527493510764 	No Hit
    CGGGAGTTTTAGTGTATTAGGGTTTTAGATGGTTTTTGGTTTTTTTTTTT 	215863 	0.17892501198315086 	No Hit
    CGGTTTTAGAGGAATTTTGTTTTTGTGTGTTTTGAGTTTATTAGGTAGGT 	210263 	0.17428327130917876 	No Hit
    TGGAGGGAGGAGTGGGGATGGTGATGGTGGGATGTGGGGAGGGGGGAGAG 	200400 	0.16610800554714536 	No Hit
    TGGGGAGTGGGGTTTTGTGAGTAGATTTTTAGTTGTGTGATGTGATTTTT 	187398 	0.15533087836089793 	No Hit
    TGGGGATATAGTATTTTTTTGGTTTTAGAGGTTTAGGTTTTTGTTATTAG 	178946 	0.14832516547225286 	No Hit
    TGGTTGTTGTGGTTGTGGTGGTGTGTTTTGTTTGGTTTTTTGGAGGTGTG 	176769 	0.14652068878524618 	No Hit
    GCTGTCCACACGTCGTTGAAAGGCACTGACTGCCCCTGAGCTACTTAGGG 	170937 	0.1416866474262095 	No Hit
    CGGTACGAGATCGATAGTTAATAAGTATCGTAAGGGAAAGTTGAAAAGAA 	168924 	0.14001810743036916 	No Hit
    CATTTCAGGCCTTGTGCCAACATCATTAAACTCCCAGTCATACCCAAAAC 	158301 	0.13121289114829668 	No Hit
    TAGGCAGTACCATTCAGGACATAGGCATGGGCAAGGACTTCATGTCTAAA 	152310 	0.12624705750940973 	No Hit
    [COLOR="Red"]GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGT[/COLOR] 	135347 	0.11218672767859023 	Illumina Paired End PCR Primer 2 (100% over 50bp)
    TGGTTGTGGGAATGTTGTTGTGGAAGGGGGGGATGAGGTGGTAATTGTAG 	124767 	0.10341715333383573 	No Hit
    CGGGGGACGTTTTAATCGCGTAGGTTTTGGGATTCGTGAGAGACGTTTTA 	124016 	0.10279466275416556 	No Hit
    The sequence in red is one of the two adapters used. In total, two adapters and two PCR primers were used in the sequencing process:

    PE Adapters
    Code:
    5' P-GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
    5' ACACTCTTTCCCTACACGACGCTCTTCCGATCT
    PE PCR Primer 1.0
    Code:
    5' AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
    PE PCR Primer 2.0
    Code:
    5' CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
    While it is easy to spot one adapter as contamination, I have difficulty in finding the possible sources for other overrepresented sequences. They don't seem to stem from the other adapter and primers.
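One way to test the "not from the other adapter and primers" impression systematically is to compare each overrepresented sequence against every oligo and its reverse complement. The sketch below is only an illustration: the oligo sequences are the ones listed in this post, while the function names and the `min_run` cutoff of 20 bases are assumptions chosen for the example.

```python
# Oligos copied from the post above (5' phosphate annotation dropped).
OLIGOS = {
    "PE adapter 1": "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG",
    "PE adapter 2": "ACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "PE PCR primer 1.0": "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "PE PCR primer 2.0": "CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT",
}

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def longest_shared_run(a, b):
    """Length of the longest exact substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for base_a in a:
        cur = [0] * (len(b) + 1)
        for j, base_b in enumerate(b, start=1):
            if base_a == base_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def adapter_hits(read, min_run=20):
    """Map (oligo name, strand) to the shared run length, if >= min_run."""
    hits = {}
    for name, oligo in OLIGOS.items():
        for strand, candidate in (("+", oligo), ("-", reverse_complement(oligo))):
            run = longest_shared_run(read.upper(), candidate)
            if run >= min_run:
                hits[(name, strand)] = run
    return hits


red = "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGT"
green = "CTCCCACTTATTCTACACCTCTCATGTCTCTTCACCGTGCCAGACTAGAG"
print(adapter_hits(red))
print(adapter_hits(green))
```

For the red sequence this reports a 32-base match to the first PE adapter and a 50-base match to the reverse complement of PE PCR Primer 2.0, which agrees with the FastQC annotation. The green sequence shares no run of 20 or more bases with any of the four oligos.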

    My colleague suggested trying BLAST, so I ran UCSC BLAT on the first overrepresented sequence (in green), "CTCCCACTTATTCTACACCTCTCATGTCTCTTCACCGTGCCAGACTAGAG". The result shows it comes from chrUn:23414511-23414560:

    cDNA YourSeq
    Code:
    CTCCCACTTA TTCTACACCT CTCATGTCTC TTCACCGTGC CAGACTAGAG  50
    Genomic chrUn (reverse strand):
    Code:
    cagtgaaaaa acgatgagag tagtggtatt tcaccggcgg cccgcgaggc  23414611
    cggcggaccc cgccccgacc cctcgcgggg aacggggggg cgccgggggc  23414561
    CTCCCACTTA TTCTACACCT CTCATGTCTC TTCACCGTGC CAGACTAGAG  23414511
    tcaagctcaa cagggtcttc tttccccgct gattccgcca agcccgttcc  23414461
    cttggctgtg gtttcgctgg atagtaggta gggacagtgg gaatctcgtt  23414411
    Side by Side Alignment
    Code:
    00000001 ctcccacttattctacacctctcatgtctcttcaccgtgccagactagag 00000050
    <<<<<<<< |||||||||||||||||||||||||||||||||||||||||||||||||| <<<<<<<<
    23414560 ctcccacttattctacacctctcatgtctcttcaccgtgccagactagag 23414511
    I am puzzled that a single genomic sequence like this could make up 3.69% of the total reads.

    I also ran BLAT on "CGCCTGATTATCTCACCGGCAGTCTTGCCGGTGACAATGGGTTTGACCCG" and "TCATCAGTTACATTGGAATCCAAATTGCCAACAAAAATAGTAGTGTTATT", and no matches were found. Therefore I have a couple of questions:

    1) How can I find the possible origins of the overrepresented sequences?
    2) How should I filter them?
    2.1) Is it safe to filter all of them out (assuming, of course, that they are certain to be pure contamination)?
    2.2) FastQC only reports overrepresented sequences whose frequency is above 0.1%. Do I need to search for more such sequences? If so, how should I determine the threshold? (BSMAP uses a parameter -k to filter the top overrepresented k-mers, its default being 1e-6.)
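On question 2.2, one could look below the 0.1% reporting cutoff by counting exact duplicate reads directly and applying whatever threshold seems useful. This is a rough sketch under stated assumptions: it expects plain 4-line FASTQ records, it holds all distinct sequences in memory (so a subsample is advisable for a full lane), and the names `read_sequences`, `overrepresented` and `min_fraction` are made up for the example.

```python
from collections import Counter


def read_sequences(fastq_lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            yield line.strip()


def overrepresented(fastq_lines, min_fraction=0.001):
    """Return (sequence, count, fraction) for exact duplicates at or above min_fraction."""
    counts = Counter(read_sequences(fastq_lines))
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total >= min_fraction
    ]


# Tiny in-memory example; for a real file, pass an open file handle instead.
example = [
    "@r1", "AAAA", "+", "IIII",
    "@r2", "AAAA", "+", "IIII",
    "@r3", "CCCC", "+", "IIII",
    "@r4", "AAAA", "+", "IIII",
]
print(overrepresented(example, min_fraction=0.5))
```

Lowering `min_fraction` shows how quickly the list grows. Where it stops being informative is a judgment call rather than something this sketch can decide.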

    Lastly, two less related questions. First, about raw reads that do not begin with C or T: since my data are MspI-digested (cut at C-CGG), fragments are supposed to begin with C or T, so is it safe to discard reads that do not? Second, as the methylation information is concentrated at the head of the reads, is it necessary or feasible to study methylation contexts other than CpG (e.g. CHG, CHH) from my data?
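Before discarding anything, it may help to measure how many reads actually carry the expected MspI start. The sketch below assumes a directional library, where a bisulfite-converted MspI fragment should start with CGG or TGG. The function name and the returned keys are illustrative only.

```python
def start_site_summary(sequences):
    """Fraction of reads starting with C/T and with the MspI signature CGG/TGG."""
    total = 0
    first_base_ok = 0
    mspi_ok = 0
    for seq in sequences:
        total += 1
        s = seq.upper()
        if s[:1] in ("C", "T"):
            first_base_ok += 1
        if s[:3] in ("CGG", "TGG"):
            mspi_ok += 1
    if total == 0:
        return {"total": 0, "starts_C_or_T": 0.0, "starts_CGG_or_TGG": 0.0}
    return {
        "total": total,
        "starts_C_or_T": first_base_ok / total,
        "starts_CGG_or_TGG": mspi_ok / total,
    }


# Read starts taken from the two tables above: two from the normal sample,
# two from the suspicious one.
print(start_site_summary(["TGGGAG", "CGGTTT", "CTCCCA", "GATCGG"]))
```

The three-base check is stricter than the first-base check. The green sequence starts with C and passes the latter, but it does not begin with CGG or TGG. A low CGG/TGG fraction in one sample compared with the others is itself a hint that many reads are not genuine RRBS fragments.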

    Thanks for any advice.

    PS. I forgot to mention the species is rat. Thanks.
    Last edited by foehn; 09-11-2013, 07:30 AM.

  • #2
    I used NCBI BLAST to check the identity of the first two sequences. The first one is mammalian ribosomal RNA...

    Code:
    TPA: Mus musculus ribosomal DNA, complete repeating unit	99.6	99.6	100%	1e-18	100%	BK000964.3
    Chain 5, Structure Of The H. Sapiens 60s Rrna	99.6	99.6	100%	1e-18	100%	3J3F_5
    The second is Arabidopsis mRNA:

    Code:
    Arabidopsis thaliana clone 2531 mRNA, complete sequence	99.6	99.6	100%	1e-18	100%	AY086470.1
    What species are you sequencing?
    Last edited by Blahah404; 09-11-2013, 08:35 AM.

    • #3
      Hi Blahah404, it is rat, so I didn't search other species. It is quite a surprise to learn there may be mouse and Arabidopsis mixed in.

      • #4
        It's not that unusual to get contamination, either at the wetlab stage, or at the sequencing centre. If you don't work on Arabidopsis you might want to check some more of the overrepresented sequences in NCBI, and if there's a significant amount of contamination you can filter it out using bowtie2 against the Arabidopsis transcriptome. Same for rRNA using the Silva rRNA database.

        • #5
          I've checked with our sequencing experimentalist, and it is now certain that there are contaminants from other species. According to the BLAST results for roughly the top 20 overrepresented sequences, there are at least Arabidopsis, human, and mouse. Filtering against the Arabidopsis genome may work, but for the human and mouse contamination, would similar alignments also filter out rat reads because of mammalian homology? Any advice is appreciated, thanks.

          • #6
            It might be worth having a look at the data with FastQ Screen - we routinely use this along with FastQC to check for potential contamination.

            In addition to showing you what other species you have contamination from, it will show you whether the reads matching those species are unique. If so, you can safely ignore them and just map against the reference genome you're interested in. If they come up red (matching multiple genomes) then you'll need to filter them out.
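The unique-versus-multiple distinction can be illustrated with plain sets of read IDs. This is not FastQ Screen's own output format or algorithm, only a hypothetical sketch: it assumes you already have, for each genome, the set of read IDs that aligned to it, and the name `classify_reads` and the category labels are invented for the example.

```python
def classify_reads(hits_by_genome, target="rat"):
    """Split read IDs by whether they hit only the target, only contaminants, or both.

    hits_by_genome maps a genome name to the set of read IDs aligned to it.
    """
    classes = {"target_only": set(), "contaminant_only": set(), "ambiguous": set()}
    all_ids = set().union(*hits_by_genome.values())
    for read_id in all_ids:
        genomes = {g for g, ids in hits_by_genome.items() if read_id in ids}
        if genomes == {target}:
            classes["target_only"].add(read_id)
        elif target in genomes:
            classes["ambiguous"].add(read_id)
        else:
            classes["contaminant_only"].add(read_id)
    return classes


hits = {
    "rat": {"r1", "r2", "r3"},
    "mouse": {"r2", "r4"},
    "arabidopsis": {"r5"},
}
result = classify_reads(hits)
print({name: sorted(ids) for name, ids in result.items()})
```

Reads in `contaminant_only` would not map to the rat reference anyway. The `ambiguous` group is where the homology concern raised earlier in the thread applies, and its size shows how much data a strict filter would cost.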

            Final plug - we use Trim Galore! to remove adapter contamination. If your contaminants are only a few sequences, it's relatively easy to get Trim Galore! to remove these from your library as well.

            • #7
              Hi tallphil, thanks for the software suggestions. The problem is that this sample is not simply contaminated by adapters (only ~2%); there is a huge amount (>40%) of contamination from foreign species, including human and mouse, which may share homology with rat, so it is difficult to decide what to remove.

              • #8
                Hi Foehn,

                Do you have any idea where all these contaminants are coming from? In any case, expanding on what Phil has recommended, I'd like to suggest the following strategy:

                Running FastQ Screen is normally a good way to get a quick idea of whether you've got contaminating species. However, it has the limitation that it doesn't normally work for bisulfite-converted sequences unless you use specially prepared genomes (and even then you would get problems with methylated sequences). Looking at some of the sequences in the list, it would appear that your contaminating sequences are not bisulfite converted, in which case FastQ Screen should work just fine. Since normal genomic sequences look like fully methylated sequences, it is all the more important to remove them, because they could affect the conclusions you draw from your experiment later on.

                Here is what I would do:
                1) Identify contaminating species using FastQ Screen or a similar tool (you have already identified human, mouse and Arabidopsis)
                2) Align sequences against the contaminating genomes with Bismark using the option --unmapped. This will write out FastQ files of all sequences that did not map against the contaminant; in other words, it removes sequences that align to the contaminants.
                3) Repeat step 2) for all contaminants
                4) Use the remaining unmapped FastQ files to align against the rat genome and see if the results make any sense
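The chaining in steps 2) to 4) can be sketched as a dry run that only builds the command lines. Several things here are assumptions rather than facts from this thread: the positional `bismark <genome_folder> <reads>` form (check `bismark --help` for your version), the folder names, and above all the name of the unmapped-reads file, which differs between Bismark versions and is therefore supplied by the caller through the hypothetical `unmapped_name` callback.

```python
def plan_contaminant_filter(reads, contaminant_genomes, target_genome, unmapped_name):
    """Build (but do not run) one Bismark command per contaminant, then the final one.

    unmapped_name: caller-supplied function mapping an input FastQ name to the
    file Bismark writes with --unmapped (version dependent, so not hard-coded).
    """
    commands = []
    current = reads
    for genome in contaminant_genomes:
        # Reads that fail to align to this contaminant feed the next round.
        commands.append(["bismark", "--unmapped", genome, current])
        current = unmapped_name(current)
    # Whatever survived every contaminant filter is aligned to the real target.
    commands.append(["bismark", target_genome, current])
    return commands


# Hypothetical folder names and naming rule, for illustration only.
plan = plan_contaminant_filter(
    "sample.fq",
    ["genomes/human", "genomes/mouse", "genomes/arabidopsis"],
    "genomes/rat",
    unmapped_name=lambda path: path + "_unmapped_reads.fq",
)
for cmd in plan:
    print(" ".join(cmd))
```

Printing the plan first makes it easy to confirm the file names against what Bismark actually produced before running each step, for example with `subprocess.run(cmd, check=True)`.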

                • #9
                  Hi fkrueger, I have no idea about the source, and the sequencing staff don't know for certain either; they only told me the contamination may have been introduced after library preparation or during sequencing.

                  • #10
                    Foehn,

                    In addition to the good advice given by others above, because you've got over 40% contamination, I would consider asking the sequencing centre to resequence that sample free of charge. We usually get some consideration from the sequencing centre in these cases.
