Seqanswers Leaderboard Ad

**simonandrews** · 04-14-2011, 11:50 PM

Originally posted by empyrean View Post

Chr 1 is in following fasta format

I need regions from 2 - 10 should give me AATTCCAAA

How can i do it either in perl or bioperl or awk or any other way?

We the basis of a solution in bioperl would be:

Code:

#!/usr/bin/perl
use warnings;
use strict;
use Bio::SeqIO;

my $in = Bio::SeqIO->new(-file => 'chr1.fa',
                         -format => 'Fasta');

# Assume only one sequence in the file
my $seq = $in->next_seq();

# Extract a subsequence
my $subseq = $seq->subseq(2,10);

print "The region from 2-10 is ",$subseq->seq(),"\n";

We have a script which does this on a larger scale (but is somewhat tied to our local infrastructure). What we found was that you either need to able to hold all of your genome in memory (which is doable on many machines these days), or if not, then pre-sort your list of regions so you do all the regions for one chromosome at the same time. This saves your script from spending all of its time loading in sequence files from disk.

**dariober** · 04-15-2011, 12:09 AM

Hi,

For sequence retrieval I quite like to use BSgenome from Bioconductor. It's quite straightforward, even if you are not very comfortable with R, and you don't need to hold the reference sequence in memory. (Well... provided that your genome is in the repository!)

Code:

library(BSgenome)
library("BSgenome.Hsapiens.UCSC.hg19")

myseq<- getSeq(Hsapiens, names= c('chr2', 'chrM'), start= c(10000, 100), end= c(10010, 110))
myseq
[1] "NCGTATCCCAC" "GGAGCCGGAGC"

## Write to file with write.table() or similar. E.g.:
writeLines(myseq, 'myseq.txt')

Which genomes can I use?

Code:

available.genomes()

BioC_mirror = http://bioconductor.org
Change using chooseBioCmirror().
 [1] "BSgenome.Amellifera.BeeBase.assembly4"
 [2] "BSgenome.Amellifera.UCSC.apiMel2"     
 [3] "BSgenome.Athaliana.TAIR.01222004"     
 [4] "BSgenome.Athaliana.TAIR.04232008"     
 [5] "BSgenome.Athaliana.TAIR.TAIR9"        
 [6] "BSgenome.Btaurus.UCSC.bosTau3"        
 [7] "BSgenome.Btaurus.UCSC.bosTau4"        
 [8] "BSgenome.Celegans.UCSC.ce2"           
 [9] "BSgenome.Celegans.UCSC.ce6"           
[10] "BSgenome.Cfamiliaris.UCSC.canFam2"    
[11] "BSgenome.Dmelanogaster.UCSC.dm2"      
[12] "BSgenome.Dmelanogaster.UCSC.dm3"      
[13] "BSgenome.Drerio.UCSC.danRer5"         
[14] "BSgenome.Drerio.UCSC.danRer6"         
[15] "BSgenome.Ecoli.NCBI.20080805"         
[16] "BSgenome.Ggallus.UCSC.galGal3"        
[17] "BSgenome.Hsapiens.UCSC.hg17"          
[18] "BSgenome.Hsapiens.UCSC.hg18"          
[19] "BSgenome.Hsapiens.UCSC.hg19"          
[20] "BSgenome.Mmusculus.UCSC.mm8"          
[21] "BSgenome.Mmusculus.UCSC.mm9"          
[22] "BSgenome.Ptroglodytes.UCSC.panTro2"   
[23] "BSgenome.Rnorvegicus.UCSC.rn4"        
[24] "BSgenome.Scerevisiae.UCSC.sacCer1"    
[25] "BSgenome.Scerevisiae.UCSC.sacCer2"

Hope this helps

Dario

**quinlana** · 04-15-2011, 03:43 AM

Originally posted by empyrean View Post

Hi.. I have a seperate chromosome sequences and i wanted to parse some regions of chromosome based on start site and end site.. how can i achieve this?

For Example Chr 1 is in following fasta format

I need regions from 2 - 10 should give me AATTCCAAA

and i a similar way 15- 25 should give me AAGATTGCAT

How can i do it either in perl or bioperl or awk or any other way?

You can do this very easily with the BEDTools fastaFromBed command. Also, here is a thread covering other solutions:

Site not found

http://biostar.stackexchange.com/questions/7504/batch-fetching-fasta-sequences-from-bed-file

We make Stack Overflow and 170+ other community-powered Q&A sites.

**empyrean** · 04-15-2011, 07:22 AM

use strict;
use Bio::SeqIO;

my $seqio = Bio::SeqIO->new(-file => "chr1.fa", -format => "fasta");
while(my $seq = $seqio->next_seq) {
print $seq->subseq(2,10), "\n";
}

The above code works perfectly fine for me but its becoming tricky now when i need to extract negative strand, How can i do that???

**simonandrews** · 04-15-2011, 07:31 AM

Originally posted by empyrean View Post

The above code works perfectly fine for me but its becoming tricky now when i need to extract negative strand, How can i do that???

Code:

$seq = $seq->revcom();

In this case $seq would be the subsequence you had extracted from the chromosome - not the whole chromosome.

**empyrean** · 04-15-2011, 07:58 AM

Got the solution.. used reverse complement for it..

**rahulvarma** · 09-08-2011, 08:00 AM

hi guys,
i would sum up the second colum of individual file and would like to save the respective sum(value) in individual files:i.e example i hav 1 to 10000 txt files.using shell script and awk am trying to do this job but it is not working .please kindly hlep me! any suggestions are welcome

files:
files

1.txt:
a 1
b 3
c 2

2.txt:
d 5
e 4
f 6

…….
…….
……..
10000.txt

list:
1.txt
2.txt
…..
…..
1000.txt

shell scripts to sum all $2 of indival 1000 and save in 1000 different files

count.sh
n=1
exec<$1
while read line
do

awk '{ SUM += $2} END { print SUM} ' $n > The_$n_sum

n = 'exep $n+1'
done

my directory i have files of 1 to 100o txt,list,count.sh
command in linux terminal: sh count.sh list

output should be

file name : The_1_sum:
content inside the file: 6

file name The_2_sum:
content inside the file: 15
……
……
The_1000_sum:
likewise 1000 files.

but am not getting this ..!!!!??!?!??!?

**westerman** · 09-08-2011, 08:47 AM

rahulvarma: So what are you getting? In my opinion half of debugging programs is not only knowing what you want but also knowing what you are generating.

By the way, it is polite to start up a new topic if you have a new topic.

BTW#2: I suspect your problem is in the increment of 'n'. You are using a method that I am not familiar with.

**rahulvarma** · 09-08-2011, 09:00 AM

hi westerman,

sorry for confusion.

i have 1000 files and i would like sum up each individual files and would like save those values in there respective files.

hope u got my problem!if not let me know, ll explain in detail

**westerman** · 09-08-2011, 09:25 AM

I understand the problem. But in order to debug your script I would like to know what type of output you are getting. Are you getting any output files? Or just the first output file? Does your shell script give an error? I know what *I* get when I run your script. But what do you get?

**rahulvarma** · 09-08-2011, 07:29 PM

hi westerman,

need to change
awk '{ SUM += $2} END { print SUM} ' $n > The_$n_sum
into
awk '{ SUM += $2} END { print SUM} ' $line > The_$n_sum

stil am getting error saying

count.sh: line 6: exep: command not found
more over it generates a empty files named as "The_"

**westerman** · 09-09-2011, 04:10 AM

Originally posted by rahulvarma View Post

hi westerman,

stil am getting error saying

count.sh: line 6: exep: command not found

That means to look at line 6 in your count.sh program and furthermore that the command 'exep' does not exist on your system. Why not? I'll give a broad hint: the command 'exep' does not exist on anyone's system. What command did you want to use?

more over it generates a empty files named as "The_"

Ah, that is a more subtle problem. Your first problem above is obvious and I am letting you figure it out. Either you are (1) creating this program on your own in which case you need to learn how to use commands on your system included 'apropos', 'man', and plain old trial-and-error or (2) you are copying this program from somewhere in which case you need to be able to copy properly -- I can guarantee that no one uses 'exep' in a shell program.

But the problem with 'The_' is one that not obvious so I will give you the answer.

Variable names included alphanumeric characters plus underscore. Thus when the shell sees 'The_$n_sum' then it thinks that you are referring to the variable '$n_sum' which, of course, does not exist. Therefore the shell makes 'The_$n_sum' into 'The_'. There are at least two solutions.

First you can simply follow '$n' with a non-dash. E.g., use 'The_$n-sum'. Of course that makes your file names different than what you want.

Second, and more proper, is to surround your variable with braces, e.g,. use 'The_${n}_sum'. That will tell the shell to just use the variable 'n'.

Even with the above fixes to 'exep' and '${n}' your program will not run properly. But I suspect that you will find this out soon and, I hope, come up with a solution.

**rahulvarma** · 09-10-2011, 09:23 PM

thanks alot

hi westerman,

tnq so much for suggestions and effort . i sort out this prob with help of u. ya just located the folder & files ext.thn using shell executed and retrieved output.

as u said i i removed "_" and tryed with "." it works once i alter the script a bit.

Topics	Statistics	Last Post
Evaluating Genome Sequencing for ECMO Patients in the NICU by seqadmin Started by seqadmin, 12-17-2024, 10:28 AM	0 responses 25 views 0 likes	Last Post by seqadmin 12-17-2024, 10:28 AM
New Genetic Toolkit Refines Studies on Gene Function and Disease by seqadmin Started by seqadmin, 12-13-2024, 08:24 AM	0 responses 42 views 0 likes	Last Post by seqadmin 12-13-2024, 08:24 AM
Study Links Brain Mechanism to Emotional Responses in Animals and Humans by seqadmin Started by seqadmin, 12-12-2024, 07:41 AM	0 responses 28 views 0 likes	Last Post by seqadmin 12-12-2024, 07:41 AM
Study Identifies Ribosomal RNA Fingerprints as Early Cancer Biomarkers by seqadmin Started by seqadmin, 12-11-2024, 07:45 AM	0 responses 42 views 0 likes	Last Post by seqadmin 12-11-2024, 07:45 AM

Seqanswers Leaderboard Ad

Announcement

Parsing a Chromosome sequence with start and end coordinates

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News