  • Remove N's and split contigs

    Hi,

    I have some genomes that I will be uploading to NCBI soon. I have been told that all N's need to be removed and the contigs split at those positions.

    I am new to the command line, so I was hoping someone could recommend a program or a simple script that could do this for me. I would like to remove all N's and split each contig where the N's were, so that one contig becomes two new contigs. For example:

    Contig 1: ATCGGATAANNNNNNNNNATCGCCGAT

    Contig 1.1: ATCGGATAA

    Contig 1.2: ATCGCCGAT


    Thanks!

  • #2
    perl -ne 'if($_ =~ /([^N]+)N+([^N]+)/){print $1;print stderr $1}' input.seq >contig1.txt 2>contig2.txt

    It will split the input file (input.seq) into contig1.txt (written via standard output) and contig2.txt (written via standard error).



    • #3
      Should that be:

      Code:
      print stderr $2



      • #4
        Thanks for that!!

        Will this rename the contigs?

        Will the contig that is split have the same name in contig1.txt and contig2.txt?

        Is it possible to rename the contigs when they are split? For example, if contig 84 is split into two contigs, can the two halves be renamed contig 84.1 and contig 84.2, respectively?



        • #5
          mastal: you are right!
          Dagga: This script does not handle the contig names, only the sequences, because you did not tell us what input format you have.
          Last edited by TiborNagy; 02-18-2014, 05:42 AM.



          • #6
            TiborNagy: Sorry, the file will be in FASTA format, as produced by de novo assembly.

            Would you be able to alter the script to handle contig names, please?

            Thanks!



            • #7
              If you are doing your assemblies with velvet, setting '-scaffolding no' will stop velvet from joining contigs together with stretches of Ns.



              • #8
                Excellent!

                Whilst this does help with some genomes that I am assembling right now, we have some older genomes that were sequenced by BGI and these contain N's that we still need to have removed...



                • #9
                  Just for you :-)
                  Code:
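                  #!/usr/bin/perl
                  use strict;
                  use warnings;

                  # Splits each FASTA record at its first run of N's.
                  # The first half is printed to STDOUT, the second half to STDERR.
                  # Limitations: only the first run of N's in a record is handled,
                  # and records without N's are not printed at all.

                  my $id  = "";   # header line of the current record
                  my $seq = "";   # sequence of the current record, joined across lines

                  while(<>){
                     chomp;

                     if(/^>/){
                        split_record($id, $seq);   # flush the previous record
                        $seq = "";
                        $id = $_;
                     }
                     else{
                        $seq .= $_;
                     }
                  }

                  split_record($id, $seq);         # flush the last record

                  # Print the two halves of a record if it contains a run of N's.
                  sub split_record {
                     my ($id, $seq) = @_;
                     if($seq =~ /([^N]+)N+([^N]+)/){
                        print "$id.1\n$1\n";
                        print STDERR "$id.2\n$2\n";
                     }
                  }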
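
The script above splits a record only at its first run of N's and prints nothing for contigs that contain no N's. A more general variant might look like the sketch below. It assumes plain FASTA input, treats both N and n as gap characters, splits at every run, and writes everything to standard output. The sub names split_at_ns and format_contig are illustrative and not from the thread. Each sequence is written on a single line, with no line wrapping.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one sequence at every run of N/n and return the non-empty pieces.
sub split_at_ns {
    my ($seq) = @_;
    return grep { length } split /[Nn]+/, $seq;
}

# Format one contig as FASTA text.
# A contig without N's keeps its original header. Otherwise the pieces are
# numbered .1, .2, ... and the suffix is attached to the first word of the
# header (the sequence ID); any description after the ID is kept unchanged.
sub format_contig {
    my ($header, $seq) = @_;
    my @pieces = split_at_ns($seq);
    return "" unless @pieces;                       # empty or all-N sequence
    return "$header\n$pieces[0]\n" if @pieces == 1; # nothing to split

    my ($name, $desc) = split ' ', $header, 2;
    $desc = defined $desc ? " $desc" : "";
    my $n = 0;
    return join "", map { $n++; "$name.$n$desc\n$_\n" } @pieces;
}

# Read FASTA from the files named on the command line (or from STDIN).
my ($header, $seq) = (undef, "");
while (my $line = <>) {
    chomp $line;
    $line =~ s/\r$//;            # tolerate Windows line endings
    next if $line eq "";
    if ($line =~ /^>/) {
        print format_contig($header, $seq) if defined $header;
        ($header, $seq) = ($line, "");
    }
    else {
        $seq .= $line;
    }
}
print format_contig($header, $seq) if defined $header;
```

If saved under a hypothetical name such as split_ns.pl, it could be run as `perl split_ns.pl contigs.fasta > contigs_split.fasta`. With the example from the first post, a record `>Contig1` containing ATCGGATAANNNNNNNNNATCGCCGAT would come out as `>Contig1.1` with ATCGGATAA and `>Contig1.2` with ATCGCCGAT.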



                  • #10
                    Thanks!! I appreciate it!
