Hello,
I have environmental metatranscriptome data (Illumina HiSeq reads), that contains transcript reads of a mixture of prokaryotic organisms. I now want to identify all reads that belong to a certain bacterial genome via mapping (e.g. via MUMmer promer) the reads from the metatranscriptome to a reference genome. The mapped reads I want to count (works well with MUMmer results for the total number) specifically for every gene.
The problem is now the annotation part to get the number of reads for the specific genes. I would like to not only have the information on which part of the genome (bases 12345-12377) the read is mapped but also which gene (locus: gene_1, gene_2 etc.) that is.
In the end I basically want to end up with a result like this:
MUMer output is producing a table looking like this:
I know that there is GFF files that basically give you the information on which region on the genome there what gene (gene_1, gene_2 etc.) - now I would need a script or tool that brings both information together.
Since I am more a user and not so advanced at scripting it would be great to find a tool that already exits doing this. Maybe someone knows of such a tool, this would be great!
Thank you very much!
cumulonimbus
I have environmental metatranscriptome data (Illumina HiSeq reads), that contains transcript reads of a mixture of prokaryotic organisms. I now want to identify all reads that belong to a certain bacterial genome via mapping (e.g. via MUMmer promer) the reads from the metatranscriptome to a reference genome. The mapped reads I want to count (works well with MUMmer results for the total number) specifically for every gene.
The problem is now the annotation part to get the number of reads for the specific genes. I would like to not only have the information on which part of the genome (bases 12345-12377) the read is mapped but also which gene (locus: gene_1, gene_2 etc.) that is.
In the end I basically want to end up with a result like this:
Code:
gene_1 23 hits gene_3 0 hits gene_4 237 hits ...
Code:
[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] [% SIM] [% STP] | [LEN R] [LEN Q] | [COV R] [COV Q] | [FRM] [TAGS] ======================================================================================================================================================== 17690 17770 | 19 99 | 81 81 | 88.89 88.89 0.00 | 1645259 99 | 0.00 81.82 | 2 1 REF_genome HWI-ST908_0052:3:1101:2512:19931#CGATGG/1 27104 27190 | 98 12 | 87 87 | 89.66 89.66 0.00 | 1645259 100 | 0.01 87.00 | 2 -3 REF_genome HWI-ST908_0052:3:1101:21043:110201#CGATGT/1 27158 27238 | 18 98 | 81 81 | 85.19 88.89 0.00 | 1645259 100 | 0.00 81.00 | 2 3 REF_genome HWI-ST908_0052:3:1101:20301:105330#CGATGT/1 72735 72640 | 99 4 | 96 96 | 78.12 84.38 0.00 | 1645259 100 | 0.01 96.00 | -3 -2 REF_genome HWI-ST908_0052:3:1101:11083:19486#CGATGT/1 ...
I know that there is GFF files that basically give you the information on which region on the genome there what gene (gene_1, gene_2 etc.) - now I would need a script or tool that brings both information together.
Since I am more a user and not so advanced at scripting it would be great to find a tool that already exits doing this. Maybe someone knows of such a tool, this would be great!
Thank you very much!
cumulonimbus
Comment