SEQanswers

12-11-2012, 02:56 PM   #1
Gorgarian
Junior Member
 
Location: Canberra

Join Date: Dec 2012
Posts: 2
Kmer spectrum question

Hi all,

This is my first post on the forum, and I am new to genomic analysis so please bear with me.

I am doing a k-mer analysis on Illumina reads from 250bp, 500bp and 800bp insert libraries, with K=17, using jellyfish:

jellyfish count -m 17 -o a.out -C -c 7 -s 1000000000 -t 24 a.fas

The kmer spectra look normal when I run the analysis separately for each library: a huge number of kmers represented only once or twice (read errors?), a single mode, and a long tail out to the right (non-orthologous kmers arising from repetitive elements?).

Here is the rub. Only the 250bp analysis, using number of kmers/peak/2, yields a sensible estimate of genome size (1.8 Gb, estimated independently), and when I combine the spectra

jellyfish merge -o 250+500.out 250.out 500.out

I get two peaks. I would have thought the Illumina runs on the same samples using different short insert libraries would have been sampling the same overall sequence, and so the unimodal spectra should combine to yield a unimodal spectrum.
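
For readers following along, the estimate described above (number of kmers / peak / 2) can be sketched in Python. This is only an illustration under assumptions: the histogram is a made-up toy, not the attached data; `estimate_genome_size`, `min_depth` and `divisor` are hypothetical names; and `divisor` merely stands in for the "/2" in the formula, whose rationale the post does not state.

```python
def estimate_genome_size(histogram, min_depth=3, divisor=1):
    """Estimate genome size from a k-mer spectrum.

    histogram maps depth -> number of distinct k-mers seen at that depth.
    min_depth and divisor are illustrative parameters (assumptions).
    """
    # Drop low-depth k-mers, assumed here to be mostly read errors.
    kept = {d: n for d, n in histogram.items() if d >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    # The peak is the depth shared by the largest number of distinct k-mers.
    peak_depth = max(kept, key=kept.get)
    # Total k-mer occurrences: depth times distinct k-mers, summed over depths.
    total_kmers = sum(d * n for d, n in kept.items())
    return total_kmers / peak_depth / divisor, peak_depth


# Toy spectrum: an error spike at depth 1-2, a main peak near 20, a repeat tail.
histogram = {1: 5000, 2: 800, 18: 100, 19: 300, 20: 400, 21: 300, 22: 100, 40: 10}
size, peak = estimate_genome_size(histogram)
print(peak, size)
```

Passing `divisor=2` reproduces the "/2" variant. A broad or double peak makes `peak_depth` unstable, which is one way the estimate can go wrong for a single library.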

Any Illumina buffs or bioinformaticists out there who can shed some light on what might be happening here?

I have attached a file with the spectra.
kmer_spectra.pdf
12-12-2012, 01:34 AM   #2
pallevillesen
Member
 
Location: Bioinformatics Research Center, Aarhus University, Denmark

Join Date: May 2012
Posts: 19

Your 250 bp library looks a little weird (nearly bimodal - or a very "broad" peak).

Other than that you're right - though you'd expect up to 4 peaks:

1. Depth 1-2: sequencing errors
2. Heterozygous positions (small peak) - a small bump at low depth from kmers covering heterozygous positions
3. The large peak - the typical coverage (the one you clearly see in your 500bp library) - used for genome size estimate.
4. The repeat peak - a small bump with high depth covering repeat regions

This is assuming that a random 17mer is typically unique in the genome. But no matter what: two libraries may scale differently (i.e. different coverage due to library size differences) - but the shape of the kmer spectrum should NOT be different - and it is in your case.
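
The assumption that a random 17mer is typically unique can be sanity-checked with a rough calculation. The sketch below treats the genome as a random sequence (real genomes are not), uses the 1.8 Gb figure from the first post, and assumes a kmer and its reverse complement are counted together, as the -C option in the jellyfish command appears to do. The function name and the Poisson approximation are illustrative choices, not part of the thread.

```python
import math


def chance_repeat_probability(k, genome_size, canonical=True):
    """Probability that a given k-mer turns up at least once more by chance
    in a random sequence of genome_size bases (Poisson approximation)."""
    possible = 4 ** k
    if canonical:
        # Counting a k-mer together with its reverse complement roughly
        # halves the number of distinct k-mers.
        possible //= 2
    expected_other_hits = genome_size / possible
    return 1.0 - math.exp(-expected_other_hits)


p17 = chance_repeat_probability(17, 1.8e9)
p25 = chance_repeat_probability(25, 1.8e9)
print(f"k=17: {p17:.3f}")
print(f"k=25: {p25:.2e}")
```

Under these assumptions the k=17 value comes out a little under 0.2, so a noticeable minority of 17mers would recur by chance alone in a genome of this size, while for k=25 the chance is negligible.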

Have you run a quality check on the two libraries (fastqc?)
12-12-2012, 02:24 PM   #3
Gorgarian
Junior Member
 
Location: Canberra

Join Date: Dec 2012
Posts: 2

OK, thanks for that advice. I installed fastqc and ran the 250 and 500 fastq files through it, and all looks good. I have attached examples of the fastqc output.

Maybe there is a double peak in there in the 250 set (leading to the "broad peak") and the double peak becomes better defined as I add in more data from the 500 and 800bp reads. There may be no problem at all?

Maybe do a subtraction between the 250 kmer set and the 500 kmer set to see if there is any systematic difference in representation. Might that clear the issue up? Any idea on how to do such a subtraction on jellyfish output files?
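
One possible way to do such a subtraction outside jellyfish is sketched here in Python. It assumes each library's counts have first been exported as two-column text (kmer, count); jellyfish has a dump subcommand for text export, but its exact options should be checked against the installed version. The function names, the toy counts and the 0.5 threshold are all hypothetical. Depths are divided by each library's peak depth before subtracting, so that a difference in overall coverage does not show up as a difference in representation.

```python
def read_counts(lines):
    """Parse 'KMER COUNT' text lines into a dict of kmer -> count."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        counts[parts[0]] = int(parts[1])
    return counts


def normalized_difference(counts_a, peak_a, counts_b, peak_b):
    """Per-kmer depth difference after scaling by each library's peak depth."""
    diff = {}
    for kmer in counts_a.keys() | counts_b.keys():
        a = counts_a.get(kmer, 0) / peak_a
        b = counts_b.get(kmer, 0) / peak_b
        diff[kmer] = a - b
    return diff


# Toy data: short kmers stand in for 17mers; peak depths are 20 and 10.
lib250 = read_counts(["AAAC 20", "ACGT 21", "GGCA 60", "TTAG 19"])
lib500 = read_counts(["AAAC 10", "ACGT 10", "GGCA 10"])
diff = normalized_difference(lib250, 20, lib500, 10)
outliers = sorted(k for k, d in diff.items() if abs(d) > 0.5)
print(outliers)
```

Holding every 17mer of a 1.8 Gb genome in Python dictionaries is unlikely to fit in memory, so for real data the same comparison would probably need to stream over two sorted dumps instead.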
Attached Files
File Type: pdf qc.pdf (46.4 KB, 31 views)
12-13-2012, 12:06 AM   #4
pallevillesen
Member
 
Location: Bioinformatics Research Center, Aarhus University, Denmark

Join Date: May 2012
Posts: 19

I think there is a problem - but maybe that relates to the genome of your sample(?)

Is it a secret organism - or can you reveal anything? I thought a little more and I have a more ugly suggestion: contamination (if you're sampling two genomes with different coverage, you'll also get two peaks).

I would probably try to assemble it (if it's an unknown organism) - and then maybe remap all the 500bp library reads to the assembly - the scaffolds that attract reads are from your target organism.

The scaffolds that get hits only from the 250bp library and NOT from the 500bp library are then the "contaminant" - you can blast them to check.
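
The filtering step could look roughly like the Python sketch below. It assumes the reads from each library have already been mapped and reduced to per-scaffold read counts (for example from the mapped-read column of samtools idxstats). The scaffold names, the counts and the `min_reads` cutoff are made up for illustration.

```python
def flag_candidate_contaminants(hits_250, hits_500, min_reads=10):
    """Return scaffolds with reads from the 250bp library but none from the 500bp one.

    hits_250 and hits_500 map scaffold name -> number of mapped reads.
    min_reads is an arbitrary cutoff to ignore scaffolds with only stray hits.
    """
    flagged = []
    for scaffold, n250 in hits_250.items():
        n500 = hits_500.get(scaffold, 0)
        if n250 >= min_reads and n500 == 0:
            flagged.append(scaffold)
    return sorted(flagged)


hits_250 = {"scaffold_1": 5200, "scaffold_2": 310, "scaffold_3": 4}
hits_500 = {"scaffold_1": 4800, "scaffold_3": 6}
candidates = flag_candidate_contaminants(hits_250, hits_500)
print(candidates)
```

The flagged scaffolds are only candidates; blasting them, as suggested above, is what would show whether they come from another organism.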

A lot of work - maybe it's not worth it - depends on your question/project.

On topic: I don't know how to subtract two jellyfish kmer spectra.
All times are GMT -8.

