Hello All,
I had a question about per gene read-count data. I wrote 2 short perl scripts to find the read-count per gene, from the sam file I get in the Tophat output. However, I am not feeling too confident on the logic I used to find the read count. Therefore, I would appreciate any help from the community on this.
My project is to find differentially expressed genes between a normal drosophila fly and a mutant. Since Drosophila has a robust (compared to some other eukaryotes) gene annotation, I decided to run a bowtie build on the Drosophila gene sequences, rather than on the individual Chromosomes. I then used Tophat on this to align the sequencing reads (The files are fastq format with solexa1.3 quality values. The length of the reads is 42bp and they are NOT paired-end).
In the generated SAM file I would have Ambiguous reads (One read matching 2 or more different genes) and Unique reads (2 or more different reads matching the same gene). I decided to get rid of the ambiguos reads from the Sam file with the first perl script and then count the number of unique reads going to each gene with the second script. I am uncomfortable with this logic and here's why.
Lets consider Gene A and Gene B. In the original sam file lets suppose gene A has 25 reads (all ambiguous) while gene B has the same 25 ambiguous reads plus an additional 75 unique reads. If I use my method, I drop the read count of gene A from 25 to zero, while the read-count of gene B drops from 100 to 75. In other words, gene A will have "no expression" however in reality it does. I'd like to know what you guys think, since it does not sound ok to me. But I don't know what to do.
Is there a method to incorporate ambiguous reads? Is there a score associated with each read in the sam file or elsewhere, that outlines the strength of the gene match? Was my idea of using the gene sequences as subject for alignment rather than the chromosome sequences flawed? The total number of reads is about 1000 Mb and I get about 100 Mb of unaligned reads when I run against the gene sequences. Can this might improve if I run the sequences against the chromosomes?
Please let me know.
Thanks,
Abhijit
I had a question about per gene read-count data. I wrote 2 short perl scripts to find the read-count per gene, from the sam file I get in the Tophat output. However, I am not feeling too confident on the logic I used to find the read count. Therefore, I would appreciate any help from the community on this.
My project is to find differentially expressed genes between a normal drosophila fly and a mutant. Since Drosophila has a robust (compared to some other eukaryotes) gene annotation, I decided to run a bowtie build on the Drosophila gene sequences, rather than on the individual Chromosomes. I then used Tophat on this to align the sequencing reads (The files are fastq format with solexa1.3 quality values. The length of the reads is 42bp and they are NOT paired-end).
In the generated SAM file I would have Ambiguous reads (One read matching 2 or more different genes) and Unique reads (2 or more different reads matching the same gene). I decided to get rid of the ambiguos reads from the Sam file with the first perl script and then count the number of unique reads going to each gene with the second script. I am uncomfortable with this logic and here's why.
Lets consider Gene A and Gene B. In the original sam file lets suppose gene A has 25 reads (all ambiguous) while gene B has the same 25 ambiguous reads plus an additional 75 unique reads. If I use my method, I drop the read count of gene A from 25 to zero, while the read-count of gene B drops from 100 to 75. In other words, gene A will have "no expression" however in reality it does. I'd like to know what you guys think, since it does not sound ok to me. But I don't know what to do.
Is there a method to incorporate ambiguous reads? Is there a score associated with each read in the sam file or elsewhere, that outlines the strength of the gene match? Was my idea of using the gene sequences as subject for alignment rather than the chromosome sequences flawed? The total number of reads is about 1000 Mb and I get about 100 Mb of unaligned reads when I run against the gene sequences. Can this might improve if I run the sequences against the chromosomes?
Please let me know.
Thanks,
Abhijit
Comment