Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Parsing a Chromosome sequence with start and end coordinates

    Hi.. I have a seperate chromosome sequences and i wanted to parse some regions of chromosome based on start site and end site.. how can i achieve this?

    For Example Chr 1 is in following fasta format

    > chr1
    GAATTCCAAAGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCTCCTATCATTCTTTCTGATGTTGAAAATGATATTAAG
    I need regions from 2 - 10 should give me AATTCCAAA

    and i a similar way 15- 25 should give me AAGATTGCAT

    How can i do it either in perl or bioperl or awk or any other way?

  • #2
    Originally posted by empyrean View Post
    Chr 1 is in following fasta format

    I need regions from 2 - 10 should give me AATTCCAAA

    How can i do it either in perl or bioperl or awk or any other way?
    We the basis of a solution in bioperl would be:

    Code:
    #!/usr/bin/perl
    use warnings;
    use strict;
    use Bio::SeqIO;
    
    my $in = Bio::SeqIO->new(-file => 'chr1.fa',
                             -format => 'Fasta');
    
    # Assume only one sequence in the file
    my $seq = $in->next_seq();
    
    # Extract a subsequence
    my $subseq = $seq->subseq(2,10);
    
    print "The region from 2-10 is ",$subseq->seq(),"\n";
    We have a script which does this on a larger scale (but is somewhat tied to our local infrastructure). What we found was that you either need to able to hold all of your genome in memory (which is doable on many machines these days), or if not, then pre-sort your list of regions so you do all the regions for one chromosome at the same time. This saves your script from spending all of its time loading in sequence files from disk.

    Comment


    • #3
      Hi,

      For sequence retrieval I quite like to use BSgenome from Bioconductor. It's quite straightforward, even if you are not very comfortable with R, and you don't need to hold the reference sequence in memory. (Well... provided that your genome is in the repository!)

      Code:
      library(BSgenome)
      library("BSgenome.Hsapiens.UCSC.hg19")
      
      myseq<- getSeq(Hsapiens, names= c('chr2', 'chrM'), start= c(10000, 100), end= c(10010, 110))
      myseq
      [1] "NCGTATCCCAC" "GGAGCCGGAGC"
      
      ## Write to file with write.table() or similar. E.g.:
      writeLines(myseq, 'myseq.txt')
      Which genomes can I use?
      Code:
      available.genomes()
      
      BioC_mirror = http://bioconductor.org
      Change using chooseBioCmirror().
       [1] "BSgenome.Amellifera.BeeBase.assembly4"
       [2] "BSgenome.Amellifera.UCSC.apiMel2"     
       [3] "BSgenome.Athaliana.TAIR.01222004"     
       [4] "BSgenome.Athaliana.TAIR.04232008"     
       [5] "BSgenome.Athaliana.TAIR.TAIR9"        
       [6] "BSgenome.Btaurus.UCSC.bosTau3"        
       [7] "BSgenome.Btaurus.UCSC.bosTau4"        
       [8] "BSgenome.Celegans.UCSC.ce2"           
       [9] "BSgenome.Celegans.UCSC.ce6"           
      [10] "BSgenome.Cfamiliaris.UCSC.canFam2"    
      [11] "BSgenome.Dmelanogaster.UCSC.dm2"      
      [12] "BSgenome.Dmelanogaster.UCSC.dm3"      
      [13] "BSgenome.Drerio.UCSC.danRer5"         
      [14] "BSgenome.Drerio.UCSC.danRer6"         
      [15] "BSgenome.Ecoli.NCBI.20080805"         
      [16] "BSgenome.Ggallus.UCSC.galGal3"        
      [17] "BSgenome.Hsapiens.UCSC.hg17"          
      [18] "BSgenome.Hsapiens.UCSC.hg18"          
      [19] "BSgenome.Hsapiens.UCSC.hg19"          
      [20] "BSgenome.Mmusculus.UCSC.mm8"          
      [21] "BSgenome.Mmusculus.UCSC.mm9"          
      [22] "BSgenome.Ptroglodytes.UCSC.panTro2"   
      [23] "BSgenome.Rnorvegicus.UCSC.rn4"        
      [24] "BSgenome.Scerevisiae.UCSC.sacCer1"    
      [25] "BSgenome.Scerevisiae.UCSC.sacCer2"
      Hope this helps

      Dario

      Comment


      • #4
        Originally posted by empyrean View Post
        Hi.. I have a seperate chromosome sequences and i wanted to parse some regions of chromosome based on start site and end site.. how can i achieve this?

        For Example Chr 1 is in following fasta format



        I need regions from 2 - 10 should give me AATTCCAAA

        and i a similar way 15- 25 should give me AAGATTGCAT

        How can i do it either in perl or bioperl or awk or any other way?
        You can do this very easily with the BEDTools fastaFromBed command. Also, here is a thread covering other solutions:

        Comment


        • #5
          use strict;
          use Bio::SeqIO;

          my $seqio = Bio::SeqIO->new(-file => "chr1.fa", -format => "fasta");
          while(my $seq = $seqio->next_seq) {
          print $seq->subseq(2,10), "\n";
          }
          The above code works perfectly fine for me but its becoming tricky now when i need to extract negative strand, How can i do that???

          Comment


          • #6
            Originally posted by empyrean View Post
            The above code works perfectly fine for me but its becoming tricky now when i need to extract negative strand, How can i do that???
            Code:
            $seq = $seq->revcom();
            In this case $seq would be the subsequence you had extracted from the chromosome - not the whole chromosome.
            Last edited by simonandrews; 04-15-2011, 07:38 AM. Reason: Minor clarification.

            Comment


            • #7
              Got the solution.. used reverse complement for it..

              Comment


              • #8
                hi guys,
                i would sum up the second colum of individual file and would like to save the respective sum(value) in individual files:i.e example i hav 1 to 10000 txt files.using shell script and awk am trying to do this job but it is not working .please kindly hlep me! any suggestions are welcome

                files:
                files

                1.txt:
                a 1
                b 3
                c 2

                2.txt:
                d 5
                e 4
                f 6

                …….
                …….
                ……..
                10000.txt




                list:
                1.txt
                2.txt
                …..
                …..
                1000.txt


                shell scripts to sum all $2 of indival 1000 and save in 1000 different files


                count.sh
                n=1
                exec<$1
                while read line
                do

                awk '{ SUM += $2} END { print SUM} ' $n > The_$n_sum

                n = 'exep $n+1'
                done


                my directory i have files of 1 to 100o txt,list,count.sh
                command in linux terminal: sh count.sh list

                output should be

                file name : The_1_sum:
                content inside the file: 6

                file name The_2_sum:
                content inside the file: 15
                ……
                ……
                The_1000_sum:
                likewise 1000 files.


                but am not getting this ..!!!!??!?!??!?

                Comment


                • #9
                  rahulvarma: So what are you getting? In my opinion half of debugging programs is not only knowing what you want but also knowing what you are generating.

                  By the way, it is polite to start up a new topic if you have a new topic.

                  BTW#2: I suspect your problem is in the increment of 'n'. You are using a method that I am not familiar with.

                  Comment


                  • #10
                    hi westerman,

                    sorry for confusion.

                    i have 1000 files and i would like sum up each individual files and would like save those values in there respective files.


                    hope u got my problem!if not let me know, ll explain in detail

                    Comment


                    • #11
                      I understand the problem. But in order to debug your script I would like to know what type of output you are getting. Are you getting any output files? Or just the first output file? Does your shell script give an error? I know what *I* get when I run your script. But what do you get?

                      Comment


                      • #12
                        hi westerman,

                        need to change
                        awk '{ SUM += $2} END { print SUM} ' $n > The_$n_sum
                        into
                        awk '{ SUM += $2} END { print SUM} ' $line > The_$n_sum

                        stil am getting error saying

                        count.sh: line 6: exep: command not found
                        more over it generates a empty files named as "The_"

                        Comment


                        • #13
                          Originally posted by rahulvarma View Post
                          hi westerman,

                          stil am getting error saying

                          count.sh: line 6: exep: command not found
                          That means to look at line 6 in your count.sh program and furthermore that the command 'exep' does not exist on your system. Why not? I'll give a broad hint: the command 'exep' does not exist on anyone's system. What command did you want to use?

                          more over it generates a empty files named as "The_"
                          Ah, that is a more subtle problem. Your first problem above is obvious and I am letting you figure it out. Either you are (1) creating this program on your own in which case you need to learn how to use commands on your system included 'apropos', 'man', and plain old trial-and-error or (2) you are copying this program from somewhere in which case you need to be able to copy properly -- I can guarantee that no one uses 'exep' in a shell program.

                          But the problem with 'The_' is one that not obvious so I will give you the answer.

                          Variable names included alphanumeric characters plus underscore. Thus when the shell sees 'The_$n_sum' then it thinks that you are referring to the variable '$n_sum' which, of course, does not exist. Therefore the shell makes 'The_$n_sum' into 'The_'. There are at least two solutions.

                          First you can simply follow '$n' with a non-dash. E.g., use 'The_$n-sum'. Of course that makes your file names different than what you want.

                          Second, and more proper, is to surround your variable with braces, e.g,. use 'The_${n}_sum'. That will tell the shell to just use the variable 'n'.


                          Even with the above fixes to 'exep' and '${n}' your program will not run properly. But I suspect that you will find this out soon and, I hope, come up with a solution.

                          Comment


                          • #14
                            thanks alot

                            hi westerman,

                            tnq so much for suggestions and effort . i sort out this prob with help of u. ya just located the folder & files ext.thn using shell executed and retrieved output.

                            as u said i i removed "_" and tryed with "." it works once i alter the script a bit.

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Advancing Precision Medicine for Rare Diseases in Children
                              by seqadmin




                              Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                              12-16-2024, 07:57 AM
                            • seqadmin
                              Recent Advances in Sequencing Technologies
                              by seqadmin



                              Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                              Long-Read Sequencing
                              Long-read sequencing has seen remarkable advancements,...
                              12-02-2024, 01:49 PM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, 12-17-2024, 10:28 AM
                            0 responses
                            25 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 12-13-2024, 08:24 AM
                            0 responses
                            42 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 12-12-2024, 07:41 AM
                            0 responses
                            28 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 12-11-2024, 07:45 AM
                            0 responses
                            42 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X