Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to remove N from fasta sequences

    Hi everyone,

    I have a fasta file which is containing lots of N in sequences, that i want to remove. If that are coming constitutive greater than or equal to 25 times it need to be removed else N <25 times let it be there don't remove it.
    how to do this by perl, any other language or any other scripting?

  • #2
    Got the solution guys....
    in Vi can do that
    :%s/N\{25,\}//g

    Comment


    • #3
      It's working but only on for small file..
      it isn't working on big genome file ......
      please tell the some other way.......

      Comment


      • #4
        Originally posted by M.Verma View Post
        Hi everyone,

        I have a fasta file which is containing lots of N in sequences, that i want to remove. If that are coming constitutive greater than or equal to 25 times it need to be removed else N <25 times let it be there don't remove it.
        how to do this by perl, any other language or any other scripting?
        The 'N's are there intentionally for a very good reason. They indicate that there is unkown sequence in this region of the genome. They are placeholders to fill out the sequnce string to the proper length and to maintain the proper distance between sequnce regions flanking the 'N's.

        What is your reason for wanting to remove them?

        Comment


        • #5
          Do you want to remove all sequences with N's or just remove the N's from the file? I agree with kmcarr that they are there for a very important reason. I'm not a vi guru, but I don't think that command will delete the entire read, and I'm not aware of any vi implementations that can handle NGS sized fasta/fastq files...

          Comment


          • #6
            So, instead of no room in the inn , it's no room for the N. This sounds like the n-less bummer ... but ... what you're asking for is this ...

            This awk program will remove all upper and lower case Ns in a fasta file except for the ">" lines , it will not output lines left blank by removing Ns ...

            {
            if (index($0,">")==1) print $0; else {
            if (length($0)==0){printf("\n");}
            else {
            split($0, s, "") notN = 0;
            for (i=1; i <= length(s); i++) { if ((s[i]=="N")||(s[i]=="n")) {continue;} notN = 1; printf("%s", s[i]) }
            if (notN==1) { printf("\n");}
            }
            }
            }


            Example usage on command line ...
            cat withn.fa | awk '{if (index($0,">")==1) print $0; else { if (length($0)==0){printf("\n");} else { split($0, s, "") notN = 0; for (i=1; i <= length(s); i++) { if ((s[i]=="N")||(s[i]=="n")) {continue;} notN = 1; printf("%s", s[i]) } if (notN==1) { printf("\n");} } } }' > nless.fa
            Last edited by Richard Finney; 01-19-2013, 09:41 AM.

            Comment


            • #7
              This can be done with Biopieces (www.biopieces.org) like this:

              Code:
              read_fasta -i in.fna | substitute_vals -k SEQ -s 'N{25,}' -r '' -ig | write_fasta -o out.fna -x

              Comment


              • #8
                you can write a program in perl or C++ to solve this problem.

                #/usr/bin/perl
                open my $in,"gzip -dc $ARGV[0] | ";
                while(<$in>)
                {
                chomp;
                my $id = $_;
                my $seq=<$in>;
                my $temp = <$in>;
                my $qual=<$in>;
                print "$id\n$seq\n$temp\n$qual\n" unless( (()= $seq =~ /N/gi) >= 25);
                }
                close($in);

                Comment


                • #9
                  Thank you guys for help........

                  Comment


                  • #10
                    another solution for this question is faSplit from UCSC.

                    From the documentation, this program seems can split fasta at the gap (Ns) boundary, but I have tried a test and failed. the command is:
                    Code:
                     faGapLocs test.fa test.lift
                    faSplit  gap test.fa 2000000 out -lift=test.lift -minGapSize=1 -noGapDrops
                    or
                    faSplit  gap test.fa 2000000 out -minGapSize=1 -noGapDrops
                    Code:
                    faSplit - Split an fa file into several files.
                    usage:
                       faSplit how input.fa count outRoot
                    where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.  
                    Files split by sequence will be broken at the nearest fa record boundary. 
                    Files split by base will be broken at any base.  
                    Files broken by size will be broken every count bases.
                    
                    Examples:
                       faSplit sequence estAll.fa 100 est
                    This will break up estAll.fa into 100 files
                    (numbered est001.fa est002.fa, ... est100.fa
                    Files will only be broken at fa record boundaries
                    
                       faSplit base chr1.fa 10 1_
                    This will break up chr1.fa into 10 files
                    
                       faSplit size input.fa 2000 outRoot
                    This breaks up input.fa into 2000 base chunks
                    
                       faSplit about est.fa 20000 outRoot
                    This will break up est.fa into files of about 20000 bytes each by record.
                    
                       faSplit byname scaffolds.fa outRoot/ 
                    This breaks up scaffolds.fa using sequence names as file names.
                           Use the terminating / on the outRoot to get it to work correctly.
                    
                       faSplit gap chrN.fa 20000 outRoot
                    This breaks up chrN.fa into files of at most 20000 bases each, 
                    at gap boundaries if possible.  If the sequence ends in N's, the last
                    piece, if larger than 20000, will be all one piece.
                    
                    Options:
                        -verbose=2 - Write names of each file created (=3 more details)
                        -maxN=N - Suppress pieces with more than maxN n's.  Only used with size.
                                  default is size-1 (only suppresses pieces that are all N).
                        -oneFile - Put output in one file. Only used with size
                        -extra=N - Add N extra bytes at the end to form overlapping pieces.  Only used with size.
                        -out=outFile Get masking from outfile.  Only used with size.
                        -lift=file.lft Put info on how to reconstruct sequence from
                                       pieces in file.lft.  Only used with size and gap.
                        -minGapSize=X Consider a block of Ns to be a gap if block size >= X.
                                      Default value 1000.  Only used with gap.
                        -noGapDrops - include all N's when splitting by gap.
                        -outDirDepth=N Create N levels of output directory under current dir.
                                       This helps prevent NFS problems with a large number of
                                       file in a directory.  Using -outDirDepth=3 would
                                       produce ./1/2/3/outRoot123.fa.
                        -prefixLength=N - used with byname option. create a separate output
                                       file for each group of sequences names with same prefix
                                       of length N.
                    Last edited by pengchy; 06-06-2013, 06:19 PM.

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Advancing Precision Medicine for Rare Diseases in Children
                      by seqadmin




                      Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                      12-16-2024, 07:57 AM
                    • seqadmin
                      Recent Advances in Sequencing Technologies
                      by seqadmin



                      Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                      Long-Read Sequencing
                      Long-read sequencing has seen remarkable advancements,...
                      12-02-2024, 01:49 PM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 12-17-2024, 10:28 AM
                    0 responses
                    26 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 12-13-2024, 08:24 AM
                    0 responses
                    42 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 12-12-2024, 07:41 AM
                    0 responses
                    28 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 12-11-2024, 07:45 AM
                    0 responses
                    42 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X