Hello,
I am trying to identify genes that have a very high copy number in the genome. My organism of interest is a non-model organism, and I have Illumina paired-end sequencing data.
What is the best approach to identify genes that have 2, 3 or even 10 copies in the genome?
My first thought is that the approach will most likely be coverage-based, and some tools also use paired-end information. My problem is deciding which tool to use given my objective: I am not interested in genomic regions with a high copy number in general, but specifically in which genes have a high copy number.
I have a genome assembly that has been annotated. I have tried to do the analysis as follows:
- Extract all ORFs (longer than 100 aa) into a FASTA file.
- Map my reads to all ORFs simultaneously. (If a read can map to two ORFs, I allow it to map to both. I suspect that my assembly contains ORFs with 90% to 100% identity, and since the point is to see the copy number of each ORF, two identical ORFs should show the same copy number.)
- After I determine the mean coverage for each individual ORF, I divide it by the genome mean coverage to estimate the copy number.
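
To make the last step concrete, here is a minimal sketch of the calculation. It assumes the per-base depth for each ORF is available as tab-separated lines of reference name, position and depth (the layout that `samtools depth -a` writes), and that the genome mean coverage has already been computed separately. The function names and the toy values are hypothetical and only illustrate the arithmetic.

```python
from collections import defaultdict


def mean_coverage_per_orf(depth_lines):
    """Return {orf_id: mean depth} from lines of 'orf_id<TAB>position<TAB>depth'."""
    totals = defaultdict(int)   # summed depth per ORF
    counts = defaultdict(int)   # number of positions seen per ORF
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        orf_id, _position, depth = line.split("\t")
        totals[orf_id] += int(depth)
        counts[orf_id] += 1
    return {orf_id: totals[orf_id] / counts[orf_id] for orf_id in totals}


def estimate_copy_number(orf_coverage, genome_mean_coverage):
    """Divide each ORF's mean coverage by the genome-wide mean coverage."""
    if genome_mean_coverage <= 0:
        raise ValueError("genome_mean_coverage must be positive")
    return {
        orf_id: coverage / genome_mean_coverage
        for orf_id, coverage in orf_coverage.items()
    }


# Toy input (made-up values): two hypothetical ORFs and an assumed
# genome mean coverage of 30x.
toy_depth_lines = [
    "orf1\t1\t30",
    "orf1\t2\t30",
    "orf1\t3\t30",
    "orf2\t1\t60",
    "orf2\t2\t60",
]
toy_coverage = mean_coverage_per_orf(toy_depth_lines)
toy_copies = estimate_copy_number(toy_coverage, genome_mean_coverage=30.0)
```

In this toy case `orf2` has twice the genome mean coverage, so its estimated copy number is 2. Positions with zero depth only contribute to the mean if they are present in the input, which is why the sketch assumes every position is reported.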
This method seems a bit primitive, and I am sure there are better ways to do it. A similar approach would be to map the reads to the ORFs and calculate FPKMs with Cufflinks, but it would be prone to the same pitfalls as the one I used.
Any ideas?