Hi,
I am a computer programmer with a negligible biology background, working on an application framework for analyzing the human genome. I now have access to the genome dataset from NCBI's FTP site.
I have decided to use the GRCh38 sequence files for my application. However, since these files contain multiple overlapping sequences for the individual chromosomes, I would like to extract the entire stretch with non-overlapping/unique sequences only.
I need some guidance as to how I can proceed with this.
Based on some preliminary research, I found that I can use the FASTX-Toolkit for the tasks I am looking to accomplish. However, I cannot work out from the available documentation what tools such as fasta_formatter or fastx_collapser actually do, so I cannot tell whether what I am doing is correct.