SEQanswers

Old 07-19-2010, 08:00 AM   #1
culmen
Member
 
Location: virginia

Join Date: Jul 2010
Posts: 12
Question small part of all 1000 sequences from 1000genomes data needed?

Hi,

I am a newbie to next-gen data (I have only been working with it for the past couple of days). I am working on 1000genomes data for my thesis.
I need to extract the sequence of each of the 1000 individual genomes at a particular position,
e.g. chr 8 + 125975261-125977441

I don't have the computing resources to download all the 1000genomes sequence read and alignment data (which is > 200TB) from the ftp site.
Is there any way that I could extract only a particular part of all 1000 genome sequences without downloading them?

Appreciate your help,

Thanks in advance,
Culmen.
Old 07-19-2010, 09:31 AM   #2
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

One way might be to use Amazon EC2 to do this. You would create an Amazon EC2 instance, for example with an Ubuntu image, and then access the 1000 genomes data, which is apparently available through S3.

See also this thread

http://seqanswers.com/forums/showthread.php?t=4874

There might be other, easier ways of doing it .. but this is one method of avoiding downloading the data locally.
Old 07-19-2010, 09:47 AM   #3
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747
Default

samtools can access the 1000 genomes BAM files on their websites; it will download the index file for each alignment you access but not the entire alignment.

There are various wrappers for samtools & I don't know if this will work in them. It definitely works at the command line & in the current version of pysam (Python binding) with a few small mods.
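
A rough sketch of the command-line route described above, wrapped in Python so it can be looped over many BAM files. The URL is a placeholder (an assumption, not a real path), and the sketch assumes `samtools` is on the PATH and that an index file sits next to each remote BAM. Whether the chromosome is called `8` or `chr8` depends on the reference the BAM was aligned to.

```python
import subprocess


def region_string(chrom, start, end):
    """Format a 1-based inclusive region the way samtools expects: chrom:start-end."""
    return f"{chrom}:{start}-{end}"


def build_view_command(bam_url, chrom, start, end, out_path):
    """Build a `samtools view` call that writes only the reads overlapping the region."""
    return ["samtools", "view", "-b", "-o", out_path, bam_url,
            region_string(chrom, start, end)]


def fetch_region(bam_url, chrom, start, end, out_path):
    """Run samtools on a remote BAM and save the reads for one region locally."""
    subprocess.run(build_view_command(bam_url, chrom, start, end, out_path), check=True)


if __name__ == "__main__":
    # Placeholder URL: substitute a real BAM path from the 1000 Genomes FTP site.
    example_url = "ftp://example.org/path/to/NA19238.bam"
    print(" ".join(build_view_command(example_url, "8", 125975261, 125977441,
                                      "NA19238.region.bam")))
```

Calling `fetch_region` once per BAM URL would leave one small region BAM per individual on disk.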
Old 07-19-2010, 01:21 PM   #4
culmen
Member
 
Location: virginia

Join Date: Jul 2010
Posts: 12
Default

Thanks a lot nickloman and Robison for your help.

Quote:
samtools can access the 1000 genomes BAM files on their websites; it will download the index file for each alignment you access but not the entire alignment.
--krobison
The alignment in the BAM file shows how the reads align to the reference sequence. Is there any way that I could get the consensus of that particular part (as shown in the Ensembl browser of 1000genomes data with NA19238 selected) for each genome in the 1000genomes data?

Are there any tools to blast each genome sequence of 1000genomes data (without downloading data) with a query sequence (primer)?

Thanks a lot,
Culmen

Old 07-20-2010, 06:08 AM   #5
culmen
Member
 
Location: virginia

Join Date: Jul 2010
Posts: 12
Default

Basically, I am looking for all the SNPs in the region of an STR (e.g. [TCTA]8, marker D6S502) with 1000 bp flanks on either side (upstream and downstream), from all 1000 genomes.

So I thought it would be great if I could extract those particular regions (1 kbp < STR > 1 kbp) from all the 1000 genomes.
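
As a small illustration of the "1 kbp < STR > 1 kbp" window, here is a sketch that turns STR coordinates into a region string usable with samtools or tabix. The function name and the coordinates are made up for the example; they are not the real position of D6S502.

```python
def flanked_region(chrom, str_start, str_end, flank=1000):
    """Return a chrom:start-end string covering the STR plus `flank` bp on each side.

    Coordinates are 1-based and inclusive; the start is clamped at 1.
    """
    if str_end < str_start:
        raise ValueError("str_end must not be smaller than str_start")
    start = max(1, str_start - flank)
    end = str_end + flank
    return f"{chrom}:{start}-{end}"


# Made-up coordinates for a 32 bp repeat such as [TCTA]8 (not the real D6S502 position).
print(flanked_region("6", 5001, 5032))  # prints 6:4001-6032
```

The resulting window is 2032 bp long: the 32 bp repeat plus 1000 bp on each side.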

Expecting this table as a result of my data extraction.

Appreciate any kind of help or suggestion,
Culmen

Old 08-09-2010, 02:16 AM   #6
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151
Default

If what you are after is variant calls, then you are better off looking at the results in the July release of data:

ftp://ftp.1000genomes.ebi.ac.uk/vol1...010_07_release

You can even download subsets of SNPs in VCF format using tabix:

tabix ftp://ftp.1000genomes.ebi.ac.uk/vol1...notypes.vcf.gz 1:233411980-245804116
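
For scripting the same call, a minimal Python sketch follows. It assumes `tabix` is installed and on the PATH; the function names are made up for illustration, and the column order used in the parser is the standard leading VCF columns (CHROM, POS, ID, REF, ALT).

```python
import subprocess


def build_tabix_command(vcf_url, chrom, start, end):
    """Build `tabix <file> chrom:start-end`, the same call shape as in the post."""
    return ["tabix", vcf_url, f"{chrom}:{start}-{end}"]


def parse_vcf_line(line):
    """Split one VCF data line into its first five fixed columns."""
    fields = line.rstrip("\n").split("\t")
    return {
        "chrom": fields[0],
        "pos": int(fields[1]),
        "id": fields[2],
        "ref": fields[3],
        "alt": fields[4].split(","),
    }


def fetch_variants(vcf_url, chrom, start, end):
    """Run tabix and return the parsed records in the region (header lines skipped)."""
    result = subprocess.run(build_tabix_command(vcf_url, chrom, start, end),
                            check=True, capture_output=True, text=True)
    return [parse_vcf_line(row) for row in result.stdout.splitlines()
            if row and not row.startswith("#")]
```

Something like `fetch_variants(url, "8", 125975261, 125977441)` would then return only the calls in the region from the first post, where `url` is the full path to a bgzipped, tabix-indexed VCF in the release directory.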


You can get tabix from the samtools website

https://sourceforge.net/projects/samtools/files/

and then vcftools is a set of Perl and C++ scripts/programs for handling VCF files

http://vcftools.sourceforge.net/
Old 08-09-2010, 12:52 PM   #7
tumorim
Junior Member
 
Location: US

Join Date: Aug 2010
Posts: 2
Default

Download the three 1000G files from http://www.openbioinformatics.org/an..._download.html.

Then just do

perl -ne 'm/(\d+)\t(\d+)/ and $1 eq "8" and $2>=125975261 and $2<=125977441 and print' < hg18_CEU.sites.2010_03.txt

You'll get all variants in CEU population. Do the same for YRI/ASN.
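
If Perl is not to hand, roughly the same filter can be sketched in Python. One difference from the one-liner: this version assumes the chromosome (written as a bare `8`, no `chr` prefix) and the position are the first two tab-separated columns, whereas the Perl regex takes the first digits-tab-digits pair it finds anywhere on the line. The sample rows are made up for illustration and are not taken from the ANNOVAR files.

```python
def in_region(line, chrom, start, end):
    """True if column 1 equals `chrom` and column 2 is a position in [start, end]."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        return False
    try:
        pos = int(fields[1])
    except ValueError:
        return False
    return fields[0] == chrom and start <= pos <= end


def filter_sites(lines, chrom, start, end):
    """Yield the lines (from any iterable, e.g. an open file) inside the region."""
    for line in lines:
        if in_region(line, chrom, start, end):
            yield line


# Made-up rows, only to show the behaviour.
sample = ["8\t125975260\tA\tG\n", "8\t125976000\tC\tT\n", "9\t125976000\tG\tA\n"]
print(list(filter_sites(sample, "8", 125975261, 125977441)))
```

For the real data, pass an open file handle instead of the sample list, once per population file.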
Old 08-09-2010, 11:16 PM   #8
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 198
Default

Caveat: I haven't done this yet, so I might be way wrong.
But since you only have 'variant data' for a stretch of 2 kb, why not upload your BAM / WIG file to UCSC instead?
2 kb sounds quite manageable.
Old 08-10-2010, 06:56 AM   #9
culmen
Member
 
Location: virginia

Join Date: Jul 2010
Posts: 12
Default

Thanks a lot for your suggestions guys.

@laura: Thanks I am following similar steps.

@tumorim: ANNOVAR looks cool. Thanks for letting me know about it.

@KevinLam: That's a good idea. I would have tried UCSC, but I have more than 13 x (1000 files of 2 kbp).
Old 11-05-2012, 10:57 PM   #10
genesquared
Junior Member
 
Location: SF, CA

Join Date: Dec 2010
Posts: 6
Default any update on this method in 2012?

Since the recent 1000 Genomes Nature paper (Nov 1, 2012), is there any update on how to download a 2+ kb segment?

thanks in advance!
Old 11-06-2012, 06:10 AM   #11
KevinLam
Senior Member
 
Location: SEA

Join Date: Nov 2009
Posts: 198
Default

Quote:
Originally Posted by genesquared
Since the recent 1000 Genomes Nature paper (Nov 1, 2012), is there any update on how to download a 2+ kb segment?

thanks in advance!
Hmm, is your problem related to the thread starter's?

If not, you could see whether Galaxy already has the data; otherwise, upload it to Galaxy via the FTP link, then extract the portion you want via the UCSC link on the data.

This way you won't have to 'download' all the info; the 1KG data is on Galaxy.
Old 11-07-2012, 10:23 AM   #12
laura
Senior Member
 
Location: Cambridge UK

Join Date: Sep 2008
Posts: 151
Default

Quote:
Originally Posted by genesquared
Since the recent 1000 Genomes Nature paper (Nov 1, 2012), is there any update on how to download a 2+ kb segment?

thanks in advance!

Like I told the previous poster, the best way to do this is to use samtools or tabix.

There is much more info about this in our FAQ:

http://www.1000genomes.org/faq/how-d...ction-vcf-file
Tags
1000genomes, data extraction, sequence alignments, sequence data
