SEQanswers

08-12-2015, 11:30 PM   #1
cyberbeast
Junior Member
 
Location: Asia

Join Date: Aug 2015
Posts: 3
Generating unique sequence from a fasta file

Hi,

I am a computer programmer with a negligible biology background, working on an application framework for analyzing the human genome. I have access to the genome dataset from NCBI's FTP site.

I have decided to use the GRCh38 sequence files for my application. However, since these files contain multiple overlapping sequences for the individual chromosomes, I would like to extract the entire stretch with non-overlapping/unique sequences only.

I need some guidance on how to proceed with this.

From some preliminary research, I found that I might be able to use the FASTX Toolkit for what I want to accomplish. However, I cannot work out from the available documentation what tools like fasta_formatter or fastx_collapser are for, so I cannot tell whether what I am doing is correct.
08-13-2015, 10:48 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

This is kind of difficult... You could try a program like Minimus2 or Dedupe to get rid of redundancy, but I think it would be best to either use the whole genome or use only the primary chromosomes (1-22, X, Y, M) and throw away all the little addenda and alt contigs.
08-13-2015, 08:17 PM   #3
cyberbeast
Junior Member
 
Location: Asia

Join Date: Aug 2015
Posts: 3

Thank you for your reply. I am sorry, but how can I consume the entire genome?
What I have right now are the FASTA files that represent the sequences of the individual chromosomes 1-22, X, Y and M. Is there any documentation you could point me to so that I can understand the jargon associated with such data (alt contigs and the like)?

Thank You
08-13-2015, 10:16 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

If you have 25 files, you're fine. Just use those. The alt contigs are smaller files that are not really necessary in most cases. They represent differences that are present in some people.

NCBI's FTP site does have some files that describe the contents of each directory, but they are a little hard to understand... I'm not really sure where to find a good resource describing the human genome files.

Suffice it to say, a "typical" person should have DNA corresponding to the 25 files 1-22, X, M, and possibly Y, depending on sex. If you have more than 25 files, the remainder are more controversial: maybe only some people have them, or maybe everyone has them but it's not clear where they go in the chromosome. For most analyses it's safe to ignore them. I'd say it's best to ignore them unless you understand exactly what they are and how to use them properly, since using them when mapping, for example, can cause spurious multimapping of reads, which gives you inferior results.
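
To make the count of 25 concrete, here is a minimal Python sketch that lists the expected primary labels and reports which ones are missing from, or extra in, a set you already have. The function names and the bare labels ("1", "X", "M") are assumptions for illustration only; real file names would first have to be mapped to labels like these.

```python
def primary_labels():
    """Return the 25 primary labels: 1-22, X, Y, M."""
    return [str(n) for n in range(1, 23)] + ["X", "Y", "M"]


def compare_labels(have):
    """Return (missing, extra) relative to the primary set, both sorted."""
    expected = set(primary_labels())
    have = set(have)
    return sorted(expected - have), sorted(have - expected)


if __name__ == "__main__":
    # "some_alt_contig" is a made-up label standing in for anything
    # that is not one of the 25 primary sequences.
    missing, extra = compare_labels(["1", "2", "X", "M", "some_alt_contig"])
    print("missing:", len(missing), "extra:", extra)
```

Anything reported as extra would fall into the "ignore unless you know what it is" group described above.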
08-13-2015, 10:39 PM   #5
cyberbeast
Junior Member
 
Location: Asia

Join Date: Aug 2015
Posts: 3

Thank you for your prompt reply. Your confirmation of what those files represent makes things a lot clearer for me.

I have one question, though. These files contain multiple sequences, and my application logic consumes an entire file at a time rather than a single sequence from it. Can these sequences overlap?

The first line of the file for chromosome 1 is:
>gi|568815364|ref|NT_077402.3| Homo sapiens chromosome 1 genomic scaffold, GRCh38 Primary Assembly HSCHR1_CTG1
There are multiple entries like the one above in the rest of the file. Each record begins with a descriptor like this, followed by a huge sequence.

For the purpose of analysis, is it a sound idea to get rid of the descriptor entries and concatenate the sequences together? If sequences in different descriptor records overlap, that will hinder my analysis. If there is no possibility of overlap, it will make my life much easier with respect to programming.
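
For reference, the record layout described above (a ">" descriptor line followed by sequence lines) can be read record by record instead of as one concatenated string. The following is a minimal Python sketch under that assumption; `read_fasta` and `summarize` are hypothetical helper names, not part of any toolkit mentioned in this thread. It does not tell you whether records overlap; it only keeps them apart so that nothing is merged by accident.

```python
import io


def read_fasta(handle):
    """Yield (descriptor, sequence) pairs from an open text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new descriptor closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(handle):
    """Return (first word of descriptor, sequence length) per record."""
    return [((h.split() or [""])[0], len(s)) for h, s in read_fasta(handle)]


if __name__ == "__main__":
    # Tiny made-up input; real files would be opened with open() or gzip.open().
    demo = io.StringIO(">rec1 first record\nACGT\nAC\n>rec2\nGG\n")
    for name, length in summarize(demo):
        print(name, length)
```

Listing the records and their lengths first is a cheap way to see how many separate sequences a file really holds before deciding whether concatenating them makes sense.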
08-14-2015, 01:19 PM   #6
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

You may have downloaded the wrong files.

Go here:
ftp://ftp.ncbi.nlm.nih.gov/genomes/H...romosomes/seq/

And download these:

ftp://ftp.ncbi.nlm.nih.gov/genomes/H....p2_chr1.fa.gz

...etc. Generally,

ftp://ftp.ncbi.nlm.nih.gov/genomes/H....p2_chr*.fa.gz

There are also other things in the directory like "hs_ref_GRCh38.p2_unplaced.fa.gz" and "hs_ref_GRCh38.p2_alts.fa.gz" and "hs_ref_GRCh38.p2_unlocalized.fa.gz". You can get those if you want.

But you do not want any of the .mfa.gz files, or the ones that look like this:
"hs_alt_CHM1_1.1_chr1.fa.gz".
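
The naming rules above can be written down as a small Python sketch for sorting a directory listing. The `classify` function and its category names are hypothetical, and the full per-chromosome name `hs_ref_GRCh38.p2_chr1.fa.gz` is an assumption inferred from the other file names quoted here, since the URLs above are truncated.

```python
def classify(name):
    """Return 'primary', 'optional', 'skip', or 'unknown' for a file name."""
    # Explicitly unwanted: .mfa.gz files and the hs_alt_ assemblies.
    if name.endswith(".mfa.gz") or name.startswith("hs_alt_"):
        return "skip"
    if not (name.startswith("hs_ref_") and name.endswith(".fa.gz")):
        return "unknown"
    stem = name[: -len(".fa.gz")]
    suffix = stem.rsplit("_", 1)[-1]  # e.g. "chr1", "unplaced"
    if suffix in ("unplaced", "alts", "unlocalized"):
        return "optional"
    if suffix.startswith("chr"):
        return "primary"
    return "unknown"


if __name__ == "__main__":
    for name in [
        "hs_ref_GRCh38.p2_chr1.fa.gz",  # assumed full name, see note above
        "hs_ref_GRCh38.p2_unplaced.fa.gz",
        "hs_alt_CHM1_1.1_chr1.fa.gz",
    ]:
        print(name, "->", classify(name))
```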
Tags
fasta, fastx toolkit, fastx_clipper
