SEQanswers > Bioinformatics

Gene Copy Number

#1 | AdrianP | 02-19-2014, 04:36 PM

Hello,

I am trying to identify genes that have a very high copy number in the genome. I work on a non-model organism, and I have Illumina paired-end sequencing data.

What is the best approach to identify genes that have 2, 3, or even 10 copies in the genome?

My first thought is that the approach will most likely be coverage based; some tools also use paired-end information. My problem is deciding which tool fits my objective: I am not interested in regions of the genome with high copy number, but specifically in which genes have a high copy number.

I have a genome assembly that has been annotated. I have tried to do the analysis as follows:
- Extract all ORFs (longer than 100 aa) into a FASTA file.
- Map my reads to all ORFs simultaneously. (Here, if a read can map to two ORFs, I allow it to map to both, because I suspect my assembly contains ORFs with 90-100% identity; the point is to see the copy number of each ORF, and if two ORFs are the same, both should show the same copy number.)
- Compute the mean coverage of each individual ORF and divide it by the genome mean coverage to estimate the copy number (sketched below).
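
To illustrate the last step, a minimal sketch, assuming a per-base depth table from samtools depth (columns: reference, position, depth); the file name and the single-copy baseline value are placeholder assumptions:

Code:
    # Sketch: estimate per-ORF copy number from `samtools depth -a` output.
    # DEPTH_FILE and BASELINE are placeholder assumptions; the baseline
    # should reflect the single-copy coverage of the genome.
    from collections import defaultdict

    DEPTH_FILE = "orf_depth.tsv"   # e.g. samtools depth -a aln.bam > orf_depth.tsv
    BASELINE = 50.0                # assumed single-copy mean coverage

    totals = defaultdict(int)      # summed depth per ORF
    counts = defaultdict(int)      # number of positions per ORF

    with open(DEPTH_FILE) as fh:
        for line in fh:
            orf, _pos, depth = line.rstrip("\n").split("\t")
            totals[orf] += int(depth)
            counts[orf] += 1

    for orf in sorted(totals):
        mean_cov = totals[orf] / counts[orf]
        # columns: ORF id, mean coverage, estimated copy number
        print(f"{orf}\t{mean_cov:.1f}\t{mean_cov / BASELINE:.2f}")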

This method seems a bit primitive, and I am sure there are better ways to do it. A similar approach would be to map the reads to the ORFs and calculate FPKMs with Cufflinks, but that would be prone to the same pitfalls as the method I employed.

Any ideas?

#2 | JackieBadger | 02-19-2014, 05:57 PM

While I only have experience estimating CNV in targeted amplicons and can't answer your question directly, I know there are at least a couple of reviews of CNV detection methods.

Here is one from a quick Google search: http://www.plosone.org/article/info%...l.pone.0059128

#3 | jgibbons1 | 02-20-2014, 10:28 AM

I think your approach will work fine for rough estimates of high-CN genes. However, I think you are going to underestimate copy number, because you are normalizing high-CN ORFs by the mean genome coverage, and the mean genome coverage includes high-CN regions, repetitive regions, etc.

I just went through this myself and had to find an appropriate "baseline" measurement to normalize by. You really want your "baseline" coverage to reflect the coverage of a "single copy" region in the genome.

The way I went about this was to use the coverage of single-copy exons as my baseline normalizing factor; in your case, single-copy ORFs would work fine. I BLASTed all exons against all exons and excluded any sequence that had hits to anything other than itself. This worked quite well.
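
For concreteness, a sketch of that filter, assuming an all-vs-all blastn run in tabular format (the file names are placeholders):

Code:
    # Sketch of the single-copy filter described above. Assumes an
    # all-vs-all BLAST was run first, e.g.:
    #   makeblastdb -in orfs.fa -dbtype nucl
    #   blastn -query orfs.fa -db orfs.fa -outfmt 6 -out all_vs_all.tsv
    # Columns 1-2 of -outfmt 6 are query id and subject id.
    from collections import defaultdict

    hits = defaultdict(set)
    with open("all_vs_all.tsv") as fh:   # placeholder file name
        for line in fh:
            query, subject = line.split("\t")[:2]
            hits[query].add(subject)

    # Single-copy ORFs: the only BLAST hit is the sequence itself.
    single_copy = sorted(q for q, subjects in hits.items() if subjects == {q})
    print("\n".join(single_copy))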

Good luck!

#4 | Jeremy | 02-20-2014, 07:34 PM

I think you would be better off mapping to the entire genome and then extracting coverage over the ORFs afterwards, using a GFF of the ORF locations and something like HTSeq-count. The mapping quality will be much better; otherwise you could end up with a lot of falsely mapped reads.
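
A rough sketch of that route; the commands, file names, read length, and baseline below are placeholder assumptions, and the GFF feature type (-t) and ID attribute (-i) depend on your annotation:

Code:
    # Sketch: map genomic reads to the whole assembly, count reads per ORF,
    # then convert counts to approximate coverage and copy number, e.g.:
    #   bwa index assembly.fa
    #   bwa mem assembly.fa reads_1.fq reads_2.fq | samtools sort -o aln.bam -
    #   htseq-count -f bam -r pos -t CDS -i ID aln.bam orfs.gff > counts.tsv
    # Note: htseq-count skips multi-mapped and low-MAPQ alignments by default.

    READ_LEN = 100     # assumed read length (bp)
    BASELINE = 50.0    # assumed single-copy mean coverage

    def fasta_lengths(path):
        """Sequence lengths from a FASTA file, keyed by the header's first token."""
        lengths, name = {}, None
        with open(path) as fh:
            for line in fh:
                if line.startswith(">"):
                    name = line[1:].split()[0]
                    lengths[name] = 0
                elif name is not None:
                    lengths[name] += len(line.strip())
        return lengths

    orf_len = fasta_lengths("orfs.fa")   # placeholder ORF FASTA

    with open("counts.tsv") as fh:
        for line in fh:
            orf, count = line.split("\t")[:2]
            if orf.startswith("__") or orf not in orf_len:  # skip summary lines
                continue
            cov = int(count) * READ_LEN / orf_len[orf]      # approx. mean coverage
            print(f"{orf}\t{cov:.1f}\t{cov / BASELINE:.2f}")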

#5 | jgibbons1 | 02-20-2014, 08:05 PM

Quote: Originally Posted by Jeremy
I think you would be better off mapping to the entire genome and then extracting coverage over the ORFs afterwards, using a GFF of the ORF locations and something like HTSeq-count. ...
That's true, but that is why using the "single copy ORFs" would circumvent the problem of mapping quality and falsely mapped reads.

#6 | AdrianP | 02-20-2014, 08:12 PM

Quote: Originally Posted by jgibbons1
That's true, but that is why using the "single copy ORFs" would circumvent the problem of mapping quality and falsely mapped reads.
Could you elaborate? I do not understand how to use single-copy ORFs, or what problem that circumvents.

#7 | Jeremy | 02-20-2014, 08:14 PM

Wait, what sort of sequence data do you have? (Normalised) RNA-seq? Genomic?

I assumed you had genomic data, which led me to think along these lines: while I don't know what species you are working with, the percentage of your genome that is ORF is going to be, what, 1-3%? That means you will be mapping 97-99% of the reads against a reference that doesn't contain the sequence they should map to, giving a reasonable chance that they falsely map to one of the ORFs in the reference.

#8 | AdrianP | 02-20-2014, 08:22 PM

Quote: Originally Posted by Jeremy
Wait, what sort of sequence data do you have? (Normalised) RNA-seq? Genomic? ...
Yes, it is genomic. That makes sense.

#9 | jgibbons1 | 02-21-2014, 09:07 AM

Quote: Originally Posted by AdrianP
Could you elaborate? I do not understand how to use single-copy ORFs, or what problem that circumvents.
Using only single-copy regions of the genome/transcriptome gives you a better idea of the true single-copy coverage. When you take the average coverage of the whole genome/transcriptome, you are also including values from copy-number-variable regions and/or highly homologous gene families. Your estimate of background coverage is therefore skewed high, and your copy-number estimates come out low, because the background coverage is the denominator. For example, if the true single-copy coverage is 50x but repeats inflate the genome-wide average to 60x, an ORF at 150x is called at 2.5 copies instead of 3. Does that make sense? It's worth putting the effort into being confident about your background coverage.

#10 | AdrianP | 02-21-2014, 09:14 AM

Quote: Originally Posted by jgibbons1
Using only single-copy regions of the genome/transcriptome gives you a better idea of the true single-copy coverage. ...
When I estimated the average coverage of the genome, I didn't take the average coverage of all contigs; rather, I found a 20 kb region where the coverage is fairly evenly distributed and took the average of that. In fact, most of the contigs show that coverage over most of their length; I checked a couple of other locations that do not look repetitive.

It is slightly lower than the average coverage of all contigs, but that makes sense.

Really, my biggest problem is that among my ORFs there are pairs with 90-100% pairwise identity at the nucleotide level. The analysis would work better if such duplicate ORFs were removed, because at 90% identity we would still consider them the same gene, one that was likely duplicated.

So if I map my reads to the entire assembly and extract coverage over the ORFs, should I select the option that allows a read to map to multiple locations? Otherwise, if a gene has 4 copies but only 2 are present in the assembly, each of those 2 ORFs will look like it has 2 copies, when in fact it is one ORF with 4 copies.

#11 | ashokrags | 02-21-2014, 11:33 AM

You could use a statistical approach based on coverage. A very simple model assumes reads are Poisson distributed, so you can look at deviations from the average coverage: if the single-copy average coverage is L, two copies would generate 2L, three copies 3L, and so on. Based on the mean and variance, you can test whether the coverage at a locus is 2L, 3L, etc., to estimate the number of copies.
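
A toy version of that test, assuming per-base depths are roughly Poisson (real coverage is usually overdispersed, so a negative binomial would be more robust); the numbers are made-up examples:

Code:
    # Sketch of the Poisson idea above: given single-copy coverage L, pick
    # the integer copy number k whose Poisson likelihood best explains the
    # total depth summed over a locus.
    from scipy.stats import poisson

    def best_copy_number(total_depth, n_positions, L, k_max=20):
        """Copy number k maximizing the Poisson likelihood of the summed depth."""
        # The sum over n_positions of independent Poisson(k*L) depths is
        # Poisson(n_positions * k * L).
        return max(range(1, k_max + 1),
                   key=lambda k: poisson.logpmf(total_depth, n_positions * k * L))

    # Example: a 300 bp ORF with 30,000 total depth at L = 50x -> ~2 copies.
    print(best_copy_number(total_depth=30_000, n_positions=300, L=50.0))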

#12 | Jeremy | 02-23-2014, 06:18 PM

Quote: Originally Posted by AdrianP
Really, my biggest problem is that among my ORFs there are pairs with 90-100% pairwise identity at the nucleotide level. ...
It seems like that could easily be solved by BLASTing the ORFs against each other and identifying which ones are similar. You don't need an elaborate mapping approach for genes that are that different: if a gene is different enough to be separated out in the assembly, it is pretty different. A gene with 90% identity can have a completely different function; are you sure you want to lump them together?

Your mapping approach can identify duplicate genes with just a few nucleotides of difference, or even none at all. Such highly similar duplicates can be difficult to separate during assembly, resulting in assemblies where multiple identical copies have been collapsed into a single locus; these are the cases you can find by looking at read depth.
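
For what it's worth, a sketch of grouping near-identical ORFs from the all-vs-all blastn table (-outfmt 6: query, subject, percent identity, ...); the 90% threshold and file name are placeholder assumptions:

Code:
    # Sketch: group ORFs that hit each other at >= 90% identity in the
    # all-vs-all blastn table, using a small union-find over hit pairs.
    from collections import defaultdict

    parent = {}

    def find(x):
        """Union-find root with path compression."""
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    with open("all_vs_all.tsv") as fh:   # placeholder file name
        for line in fh:
            query, subject, pident = line.split("\t")[:3]
            if query != subject and float(pident) >= 90.0:
                union(query, subject)

    clusters = defaultdict(list)
    for orf in list(parent):
        clusters[find(orf)].append(orf)

    for members in clusters.values():
        print("\t".join(sorted(members)))   # one cluster of near-identical ORFs per line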

#13 | AdrianP | 02-23-2014, 06:32 PM

Quote: Originally Posted by Jeremy
It seems like that could easily be solved by BLASTing the ORFs against each other and identifying which ones are similar. ...
Here we are getting into whether the alleles of a diploid genome have enough SNPs to be placed in two different contigs (one of them redundant). I have taken care of that by running dipSPAdes. But in this organism there are genes that are actually duplicated and 100% identical. Some are at 90% because they are truncated at their 5' or 3' end and therefore shorter; others are indeed 90% identical due to SNPs.

I agree that 90% identity, especially due to truncation, can mean different, or at least slightly different, functions. I am okay with that, because I will be checking whether there is any KOG enrichment among the genes with many copies.

Now back to BLASTing. Sure, I can do this, but then I would need to add up the copy numbers of those genes that are very similar but located at different loci. This will be somewhat challenging.
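
That aggregation may be simpler than it sounds: given clusters of near-identical ORFs (e.g. from the grouping sketch above) and per-ORF copy-number estimates, it is just a sum per cluster. A sketch with made-up example inputs:

Code:
    # Sketch: sum per-ORF copy-number estimates within each cluster of
    # near-identical ORFs, so duplicates at different loci are reported
    # once with a combined copy number. Inputs below are example values.

    def cluster_copy_numbers(clusters, copy_number):
        """Combined copy number per cluster; each cluster is named after its first member."""
        return {sorted(members)[0]: sum(copy_number[m] for m in members)
                for members in clusters}

    clusters = [{"orf1", "orf7"}, {"orf3"}]                # example clusters
    copy_number = {"orf1": 2.1, "orf7": 1.9, "orf3": 3.0}  # example estimates
    print(cluster_copy_numbers(clusters, copy_number))     # {'orf1': 4.0, 'orf3': 3.0}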