After trying out various command line options with tophat2 (followed by htseq-count), as well as different sources for annotation files and genome sequences, I am somewhat lost as to what the best practices should be.
My samples are human and I would like to use the hg38 genome release. I think I understand the various differences between the UCSC, NCBI, and Ensembl annotations sufficiently now, and have convinced myself that Ensembl might be the best way to go (feedback on this choice would still be very appreciated).
However, I am still stuck on how to deal with genes that have been mapped to multiple "regular" as well as alternative assemblies from the hg38 releases. If I exclude alternative assemblies, I would, for example not have CCL3L1 included. If I included them, I would definitely need to use Ensembl's unique IDs, which complicates my downstream analysis somewhat (our biologists would prefer to use gene ids like 'CCL3L1').
As for tophat2 options, I have been comparing output from the default options plus --b2--very-sensitive with that using a transcriptome index. For the latter, I thought it would make sense to include hits to the transcriptome only and pre-filter them by excluding any reads with more than one hit. This (probably misguided) approach was thought to speed up counting of reads with htseq-count, which also would have discarded multiple hits of the same read. Alas, I end up with very different results compared to the default options, which I think I can explain but would love someone else's opinion on as well... basically, with the second approach, I am seeing reads INCLUDED that would have mapped better to the whole genome but also mapped to a transcript; vice versa, I am seeing reads EXCLUDED that only overlap splice junctions such that part of the read is exonic, the other part is intronic. htseq-count happily includes the latter in the total count.
Any input and advice would be greatly appreciated!
My samples are human and I would like to use the hg38 genome release. I think I understand the various differences between the UCSC, NCBI, and Ensembl annotations sufficiently now, and have convinced myself that Ensembl might be the best way to go (feedback on this choice would still be very appreciated).
However, I am still stuck on how to deal with genes that have been mapped to multiple "regular" as well as alternative assemblies from the hg38 releases. If I exclude alternative assemblies, I would, for example not have CCL3L1 included. If I included them, I would definitely need to use Ensembl's unique IDs, which complicates my downstream analysis somewhat (our biologists would prefer to use gene ids like 'CCL3L1').
As for tophat2 options, I have been comparing output from the default options plus --b2--very-sensitive with that using a transcriptome index. For the latter, I thought it would make sense to include hits to the transcriptome only and pre-filter them by excluding any reads with more than one hit. This (probably misguided) approach was thought to speed up counting of reads with htseq-count, which also would have discarded multiple hits of the same read. Alas, I end up with very different results compared to the default options, which I think I can explain but would love someone else's opinion on as well... basically, with the second approach, I am seeing reads INCLUDED that would have mapped better to the whole genome but also mapped to a transcript; vice versa, I am seeing reads EXCLUDED that only overlap splice junctions such that part of the read is exonic, the other part is intronic. htseq-count happily includes the latter in the total count.
Any input and advice would be greatly appreciated!
Comment