Dear Members
I'm just starting my postdoctoral fellowship, and my new project involves sRNA deep sequencing. I'm having problems with the data normalization.
I've been reading papers about it, and the variety of normalization methods is huge!
The simplest one that I found is TPM or RPM (transcripts per million or reads per million), in which each raw read count is multiplied by 1 million and then divided by the total read count of the whole library.
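To make the calculation concrete, here is a minimal Python sketch of RPM as I understand it. The function name `rpm` and the raw count of 230 reads are made up for illustration; only the 8 million total corresponds to my data.

```python
def rpm(raw_count, library_total):
    """Reads per million: rescale a raw count to a library of one million reads."""
    if library_total <= 0:
        raise ValueError("library_total must be positive")
    return raw_count * 1_000_000 / library_total


# Hypothetical example: a sequence seen 230 times in a library of 8 million reads.
print(rpm(230, 8_000_000))  # 28.75
```

The only open choice in this formula is what to pass as `library_total`.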
The question is: which library!?
I have 3 treatments. In one of them, the number of raw sequences was 10 million. After filtering out adaptors, reads shorter than 15 nt, and sequences with a count below 3, we obtained 8 million mappable reads. Then, once we mapped them against the Arabidopsis genome, 3 libraries were generated: the library of reads mapped to mRNA, tRNA, rRNA, Rfam and Repbase (5.7 million; 72%); the library mapped to the custom virus we sent (20; 0%); and the miRNA (sRNA) library (2.3 million; 28%). In the 3 treatments the numbers were different, and the miRNA library was always smaller than the mRNA library.
To compare the data sets directly, we have to normalize the counts. If I want to do it with RPM, which library do I have to use: the total mappable reads (including mRNA, tRNA, etc.) or only the miRNA library?
I have noticed that the result is very different with one or the other, because the proportions of the libraries are not the same between treatments.
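Here is a small Python sketch of what I mean. Treatment "A" uses the library sizes I described above; treatment "B" and the raw count of 1,000 reads are invented numbers, chosen only so that the miRNA fraction differs between treatments. The names `libraries` and `normalize` are just for this illustration.

```python
def rpm(raw_count, library_total):
    """Reads per million for a given choice of denominator."""
    return raw_count * 1_000_000 / library_total


# Library sizes per treatment. "A" uses the sizes from my data;
# "B" is hypothetical, with a larger miRNA fraction.
libraries = {
    "A": {"total_mappable": 8_000_000, "mirna": 2_300_000},
    "B": {"total_mappable": 8_000_000, "mirna": 4_000_000},
}

raw_count = 1_000  # hypothetical: same raw count of one miRNA in both treatments


def normalize(denominator):
    """RPM of the example miRNA in each treatment, using the chosen library size."""
    return {name: rpm(raw_count, sizes[denominator]) for name, sizes in libraries.items()}


by_total = normalize("total_mappable")
by_mirna = normalize("mirna")

for name in libraries:
    print(f"{name}: {by_total[name]:.1f} RPM (total) vs {by_mirna[name]:.1f} RPM (miRNA only)")

print(f"B/A ratio, total mappable reads: {by_total['B'] / by_total['A']:.3f}")
print(f"B/A ratio, miRNA library only:   {by_mirna['B'] / by_mirna['A']:.3f}")
```

With these made-up numbers, the miRNA looks unchanged between treatments when I divide by the total mappable reads, but it appears to drop to a bit under 60% of its level in A when I divide by the miRNA library only.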
Any hint you can provide will help me a lot.
Thanks in advance