I have a long list of gene names with corresponding read counts. I mainly want to collapse the tRNAs that share an identical anticodon and sum their read counts.
So, something like this:
Collapse the names in lines containing "tRNA" based on an exact match of the last 6 characters of the gene name (e.g. GluCTC) and sum the corresponding read counts. The new gene name can be "tRNA-" followed by those 6 characters (e.g. tRNA-GluCTC).
Any ideas on how to do this? Perhaps with awk?
The input (tab-delimited) looks like this:
Code:
Gm26624	5761
Bre	5658
chr10.tRNA90-GluCTC	5573
chr3.tRNA303-GluCTC	5558
chr1.tRNA709-GluCTC	5489
chr1.tRNA706-GlyGCC	4891
chr1.tRNA704-GlyGCC	4838
chr1.tRNA702-GlyGCC	4796
chr13.tRNA110-GlyGCC	4753
Gm13247	4105
Rny3	3736
chr1.tRNA485-LysTTT	3548
Rn7s2	3385
chr19.tRNA107-LysTTT	3363
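
To make the intended result concrete, here is a minimal sketch of the collapsing step in Python rather than awk. It rests on some assumptions: the function name collapse_trnas is made up for illustration, non-tRNA lines are assumed to be kept unchanged, and every tRNA name is assumed to end in exactly 6 characters of amino acid plus anticodon (names that do not follow that pattern would need special handling).

```python
def collapse_trnas(lines):
    """Sum read counts of tRNA genes that share the same last 6 characters.

    Assumption: each line is "name<TAB>count". Non-tRNA genes are passed
    through under their original name (the post does not say what should
    happen to them).
    """
    totals = {}  # dicts keep insertion order in Python 3.7+
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        name, count = fields[0], int(fields[1])
        if "tRNA" in name:
            # e.g. chr10.tRNA90-GluCTC -> tRNA-GluCTC
            name = "tRNA-" + name[-6:]
        totals[name] = totals.get(name, 0) + count
    return totals


# A few lines from the sample input above, used as a quick demonstration.
sample = [
    "Gm26624\t5761",
    "chr10.tRNA90-GluCTC\t5573",
    "chr3.tRNA303-GluCTC\t5558",
    "chr1.tRNA709-GluCTC\t5489",
    "chr1.tRNA485-LysTTT\t3548",
    "chr19.tRNA107-LysTTT\t3363",
]

for gene, reads in collapse_trnas(sample).items():
    print(f"{gene}\t{reads}")
```

For a real file, the same function can be given an open file handle instead of the sample list, since it only iterates over lines. Collapsed names appear in the order in which each anticodon is first seen in the input.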