Hello everyone,
we have just received our first RNA-Seq results/analysis, and since I am completely new to interpreting the data, I would like to ask for some pointers.
My main concern is the reliability/quality of the sequencing. Our samples contained amplified mRNAs from cortical neurons (rat and human) and the sequencing was done on the Illumina HiSeq 2000 platform.
Here are my questions:
1. The report file states that reads should be evenly distributed along the reference genes, otherwise the low level of randomness will affect the downstream analysis. How exactly is the analysis affected by this? Our reads are all concentrated at the 3' end.
2. The base percentage composition curves do not really overlap (A with T and C with G), and according to the figures the GC content of our samples is very low, around 20 %. Is that normal? I can't find any data on the expected base composition of the transcriptome of rat cortical neurons. I guess it should be tissue- and species-specific, but in the mouse brain transcriptome data I was able to find, the base percentages were very even.
3. Two of our samples were sequenced twice, and the subsequent analysis found hundreds of differentially expressed genes between the identical samples. What does this say about the overall reliability of the results?
4. The percentage of reads that could not be mapped to the genome is 60-70 %. I would say that this is really high, but again, I don't know what proportion of the reads can normally be mapped back in a sequencing run. The percentage of multi-position matches also seems high, at 20-35 %.
5. Regarding the pathway analysis, a lot of signaling pathways are reported as affected in these samples even though they have nothing to do with neurons, such as bacterial and viral infections and several types of tumors. My main concern here is: how can you trust the portion of the data that seems relevant when the whole contains such unlikely details?
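In case it helps with question 2: this is roughly how I understand the base composition figures to be computed, so I could re-check them on the raw reads myself. It is only a minimal sketch in Python, not what the sequencing provider actually ran. The function names and the `example_reads` list are made up for illustration; real reads would come from the FASTQ files.

```python
from collections import Counter


def gc_fraction(reads):
    """Fraction of G and C among all A/C/G/T bases across the reads."""
    counts = Counter()
    for read in reads:
        counts.update(read.upper())
    acgt = sum(counts[b] for b in "ACGT")  # ignore N and other symbols
    if acgt == 0:
        return 0.0
    return (counts["G"] + counts["C"]) / acgt


def per_position_composition(reads):
    """Base fractions at each read position (one dict per sequencing cycle)."""
    length = max((len(r) for r in reads), default=0)
    result = []
    for i in range(length):
        column = Counter(r[i].upper() for r in reads if i < len(r))
        total = sum(column[b] for b in "ACGT")
        result.append(
            {b: (column[b] / total if total else 0.0) for b in "ACGT"}
        )
    return result


# Hypothetical toy reads, NOT taken from our data.
example_reads = ["AATTAATTAA", "ATATGATATA", "TTTTACAAAA", "AAGTTTTCAA"]

if __name__ == "__main__":
    print(f"GC fraction: {gc_fraction(example_reads):.2f}")
    for cycle, fractions in enumerate(per_position_composition(example_reads), 1):
        print(cycle, fractions)
```

If the A and T curves (or the C and G curves) from something like this diverge as strongly as in our report, I would at least know the figures reflect the reads themselves and not a plotting problem.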