Hi @all,
I'm currently working on human whole-genome NGS data and have some questions about repeat masking:
1) Would you say it's advisable to do repeat masking before mapping to the genome? Why or why not?
2) What tools would you suggest for masking reads? I am familiar with, and have already tried, RepeatMasker/RepeatModeler, but they run damn slow on reads. I thought of instead mapping the reads against RepBase and discarding whatever aligns (see the first sketch after this list). Alternatively, I could map against a repeat-masked genome (where repeats are actually removed, not just lower-cased, of course) and hope that repetitive reads no longer map.
3) My existing filtering criteria already force me to discard ~8% of my reads before any alignment against the reference, and I expect that adding a repeat filter will increase this drastically. Could this affect downstream analysis? Do you account for such losses when estimating how many reads are required to reach a certain coverage (see the second sketch after this list)?
4) (Not directly related to the above) I recently tried mapping against the "new" hg38 and saw that I got more multi-mapping reads. Having a brief look at the data, these were mainly repetitive sequences... Has anyone observed similar behaviour?
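To make question 2 concrete, here is a rough sketch of the read-filtering idea I have in mind: align the reads against a repeat library and keep only the reads that do not align. This assumes a bwa-indexed FASTA built from RepBase (the name `repbase.fa` and all other filenames are placeholders, not a tested pipeline):

```python
import subprocess

REPEAT_LIB = "repbase.fa"            # placeholder: bwa-indexed repeat library from RepBase
READS_IN = "reads.fastq"             # placeholder: raw reads
READS_OUT = "reads.norepeat.fastq"   # placeholder: reads surviving the repeat filter

# Align reads to the repeat library; bwa mem writes SAM to stdout.
bwa = subprocess.Popen(
    ["bwa", "mem", "-t", "8", REPEAT_LIB, READS_IN],
    stdout=subprocess.PIPE,
)
# Keep only records that did NOT align (SAM flag 4 = unmapped),
# i.e. the non-repeat reads, and write them back out as FASTQ.
subprocess.run(
    ["samtools", "fastq", "-f", "4", "-o", READS_OUT, "-"],
    stdin=bwa.stdout,
    check=True,
)
bwa.stdout.close()
bwa.wait()
```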
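And to make question 3 concrete, this is roughly how I currently estimate the required read number; the loss fraction is the part I am unsure about. All numbers here are illustrative (the 30% total loss is just a guess at what repeat filtering might add), not from a real run:

```python
# Back-of-the-envelope: coverage = reads * read_length * (1 - loss) / genome_size,
# so the read count needed for a target coverage is the inverse.
GENOME_SIZE = 3.1e9   # human genome size in bp (approximate)
READ_LEN = 2 * 100    # paired-end 2x100 bp, i.e. bp per read pair
TARGET_COV = 30       # desired mean coverage

def pairs_needed(target_cov, loss_fraction):
    """Read pairs needed to hit target coverage after discarding a fraction of reads."""
    usable_bp_per_pair = READ_LEN * (1 - loss_fraction)
    return target_cov * GENOME_SIZE / usable_bp_per_pair

print(f"{pairs_needed(TARGET_COV, 0.08):.3e} pairs at  8% loss (current filtering)")
print(f"{pairs_needed(TARGET_COV, 0.30):.3e} pairs at 30% loss (guessed, with repeat filter)")
```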
Thanks for reading, and for any suggestions!