**Brian Bushnell** · 08-13-2015, 09:48 AM

This is kind of difficult. You could try a program like Minimus2 or Dedupe to get rid of redundancy, but I think it would be best to either use the whole genome or else use only the primary chromosomes (1-22, X, Y, M) and throw away all the little addenda and alt contigs.

**cyberbeast** · 08-13-2015, 07:17 PM

Thank you for your reply. I am sorry, but how can I consume the entire genome?
What I have with me right now are the FASTA files that represent the sequences of the individual chromosomes 1-22, X, Y and M. Is there any documentation that you could point me to so that I can understand the jargon associated with such data (alt contigs and the like)?

Thank You

**Brian Bushnell** · 08-13-2015, 09:16 PM

If you have 25 files, you're fine. Just use those. The alt contigs are smaller files that are not really necessary in most cases. They represent differences that are present in some people.

NCBI's FTP site does have some files that describe the contents of each directory, but they are a little hard to understand. I'm not really sure where to find a good resource describing the human genome files.

Suffice it to say, a "typical" person should have DNA corresponding to the 25 files 1-22, X, M, and possibly Y, depending on sex. If you have more than 25 files, the remainder are more controversial: maybe only some people have them, or maybe everyone has them but it's not clear where they go in the chromosome. For most analyses it's safe to ignore them. I'd say it's best to ignore them unless you understand exactly what they are and how to use them properly, since using them when mapping, for example, can cause spurious multimapping of reads, which gives you inferior results.

**cyberbeast** · 08-13-2015, 09:39 PM

Thank you for your prompt reply. A confirmation from you about what those files represent makes things a lot clearer for me.

I just have one doubt, though. These files contain multiple sequences, and my application logic consumes an entire file for processing rather than just one sequence from the file. My question is, can these sequences overlap?

The first line of the file referencing chromosome 1 begins with

>gi|568815364|ref|NT_077402.3| Homo sapiens chromosome 1 genomic scaffold, GRCh38 Primary Assembly HSCHR1_CTG1

There are multiple entries like the one above in the rest of the file. Each record begins with a descriptor like this and is then followed by a huge sequence.

For the purpose of analysis, is it a sound idea to get rid of the descriptor entries and concatenate the sequences together? If sequences overlap between descriptor records, it will hinder my analysis. However, if there is no scope for overlap, it will make my life so much easier with respect to programming.
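
For readers following along, the record layout described above (a ">" descriptor line followed by sequence lines) can be read one record at a time rather than as a whole file. This is only a minimal sketch in Python: the function name `read_fasta` and the two short records are made up for illustration and are not taken from the NCBI files.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (descriptor, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new descriptor closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record input, standing in for a multi-record chromosome file.
example = """>scaffold_A hypothetical record
ACGTAC
GTTA
>scaffold_B hypothetical record
GGCC
""".splitlines()

records = list(read_fasta(example))
for descriptor, sequence in records:
    print(descriptor, len(sequence))
```

Keeping each descriptor paired with its sequence leaves the choice open: the records can be processed separately, or joined later if that turns out to be appropriate for the analysis.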

**Brian Bushnell** · 08-14-2015, 12:19 PM

You may have downloaded the wrong files.

Go here:
ftp://ftp.ncbi.nlm.nih.gov/genomes/H...romosomes/seq/

And download these:

ftp://ftp.ncbi.nlm.nih.gov/genomes/H....p2_chr1.fa.gz

...etc. Generally,

ftp://ftp.ncbi.nlm.nih.gov/genomes/H....p2_chr*.fa.gz

There are also other things in the directory like "hs_ref_GRCh38.p2_unplaced.fa.gz" and "hs_ref_GRCh38.p2_alts.fa.gz" and "hs_ref_GRCh38.p2_unlocalized.fa.gz". You can get those if you want.

But you do not want any of the .mfa.gz files, or the ones that look like this:
"hs_alt_CHM1_1.1_chr1.fa.gz".
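
The selection rule above (keep the per-chromosome .fa.gz files, skip the .mfa.gz files and the hs_alt_ ones) could be sketched as a filename filter. This is an assumption-heavy illustration: it assumes the per-chromosome files share the "hs_ref_" prefix of the other files named above and end in "_chr<label>.fa.gz", with the mitochondrial label being either M or MT. The helper name `is_primary_chromosome_file` is hypothetical, so check the pattern against the actual directory listing before relying on it.

```python
import re

# Assumed naming pattern (not confirmed by the directory listing):
# hs_ref_<assembly>_chr<label>.fa.gz, where label is 1-22, X, Y, M or MT.
PRIMARY_PATTERN = re.compile(r"^hs_ref_.*_chr(\d{1,2}|X|Y|MT?)\.fa\.gz$")


def is_primary_chromosome_file(name: str) -> bool:
    """Return True if a filename looks like a primary chromosome file."""
    match = PRIMARY_PATTERN.match(name)
    if match is None:
        return False
    label = match.group(1)
    if label.isdigit():
        return 1 <= int(label) <= 22  # autosomes only
    return True  # X, Y, M or MT


# Names quoted in the post, plus assumed per-chromosome names.
candidates = [
    "hs_ref_GRCh38.p2_chr1.fa.gz",       # assumed name
    "hs_ref_GRCh38.p2_chrX.fa.gz",       # assumed name
    "hs_ref_GRCh38.p2_chr1.mfa.gz",      # assumed name, .mfa.gz is unwanted
    "hs_ref_GRCh38.p2_unplaced.fa.gz",
    "hs_ref_GRCh38.p2_alts.fa.gz",
    "hs_ref_GRCh38.p2_unlocalized.fa.gz",
    "hs_alt_CHM1_1.1_chr1.fa.gz",
]

selected = [name for name in candidates if is_primary_chromosome_file(name)]
print(selected)
```

The filter is deliberately strict, so anything that does not match the assumed pattern is left out rather than guessed at.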

Generating unique sequence from a fasta file
