  • mgaldos
    Member
    • May 2013
    • 12

    Concatenate GFF Files

    Dear everyone,
    I went through the threads but couldn't find anyone trying to do the same thing as I am.

    I am working with around 13k GFF files that need to be concatenated into a single one. Normally, a simple "cat" function would do, but I am trying to actually turn all those files into a single one that will have a new coordinate system.

    For example, if one of the GFF files has annotations that range from 0 - 1000kb, the next GFF file's table should be appended to that one and its annotations should begin at 1000kb+.

    I've been looking everywhere for a way to do this and have had no luck.

    If anyone has any suggestions I'd greatly appreciate it.

    Thanks a bunch!
  • muthu545
    Member
    • Jul 2011
    • 32

    #2
    Just a Suggestion...
    You could use cat function and then do sorting [sort function] on the coordinates to put them in order.

    Thanks
    --

    • mgaldos
      Member
      • May 2013
      • 12

      #3
      Thanks for the quick reply muthu.

      I guess that's not a bad idea but it won't work for me. I think I wasn't being very clear.

      I have a different GFF file for each scaffold that I'm working with. What I am trying to do is put all the scaffolds together into a giant one and preserve the coordinate scheme. The problem is that each GFF file has its own coordinates starting at 0 and ending at some number. I need to make it so that I can merge all the GFF files and make a single file with continuous coordinates.
      Last edited by mgaldos; 05-28-2013, 03:31 PM.

      • muthu545
        Member
        • Jul 2011
        • 32

        #4
        If I understand it right, what you want to get to....
        Input:
        File 1: 0 - 1000
        File 2: 0 - 1000

        Output:
        File: 0 - 2000

        For this you must add the end coordinate of File 1 to all of File 2's coordinates [I'm just thinking aloud].

        If you could run head -n 5 and tail -n 5 on both files and paste the output here, it would be easier.
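A minimal sketch of the idea above, assuming two made-up files (the names `file1.gff` and `file2.gff` and their contents are hypothetical, not the poster's data). It takes the largest end coordinate (column 5) of the first file and adds it to the start and end columns of the second. Whether one more base should be added so the two ranges do not touch depends on the coordinate convention in use.

```shell
#!/bin/sh
# Hypothetical two-file example; file names and contents are made up.
tmp=$(mktemp -d)
printf 'scaf1\tsrc\tgene\t0\t400\t.\t+\t.\tID=a\nscaf1\tsrc\tgene\t600\t1000\t.\t-\t.\tID=b\n' > "$tmp/file1.gff"
printf 'scaf2\tsrc\tgene\t0\t250\t.\t+\t.\tID=c\n' > "$tmp/file2.gff"

# Largest end coordinate (column 5) found anywhere in file 1.
offset=$(awk -F'\t' 'BEGIN { max = 0 } $5 + 0 > max { max = $5 + 0 } END { print max }' "$tmp/file1.gff")

# File 1 is copied unchanged; every start/end in file 2 is shifted by the offset.
{
  cat "$tmp/file1.gff"
  awk -F'\t' -v OFS='\t' -v o="$offset" '{ $4 += o; $5 += o; print }' "$tmp/file2.gff"
} > "$tmp/merged.gff"

cat "$tmp/merged.gff"
```

With these inputs the feature from the second file ends up at 1000 - 1250 in the merged output.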

        Thanks
        --
        Muthu

        • mgaldos
          Member
          • May 2013
          • 12

          #5
          That's exactly what I'm trying to do, except that it's for 13,000 files. I'll send you the top and bottom from the first two files tomorrow since I don't have them with me right now.

          Thanks a bunch!
          Last edited by mgaldos; 05-28-2013, 03:43 PM. Reason: Forgot to change something as I typed

          • muthu545
            Member
            • Jul 2011
            • 32

            #6
            Sure, once you have the head and tail of 3 Files... We could figure out some code that could concatenate all your 13000 files into one.

            Thanks
            --
            Muthu

            • sdriscoll
              I like code
              • Sep 2009
              • 436

              #7
              unfortunately these gene annotation formats are not strictly sorted by position, so checking the end of the file for the offset value for the next file may not be reliable. additionally it may be a hassle to sort the files by position to find that value, because then you'll have to re-sort them back by feature.
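To illustrate the ordering point, here is a small sketch using a made-up file grouped by feature rather than by position (the file name and its contents are hypothetical). It compares the end coordinate on the last line with the true maximum found by scanning column 5 of every line.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Made-up file grouped by feature, so the last line is not the rightmost one.
printf 'scaf1\tsrc\tgene\t500\t900\t.\t+\t.\tID=g1\nscaf1\tsrc\texon\t500\t900\t.\t+\t.\tParent=g1\nscaf1\tsrc\tgene\t10\t200\t.\t-\t.\tID=g2\n' > "$tmp/scaf1.gff"

# End coordinate taken from the last line of the file.
last_end=$(tail -n 1 "$tmp/scaf1.gff" | cut -f 5)

# True maximum end coordinate: scan every non-comment line.
max_end=$(awk -F'\t' 'BEGIN { max = 0 } !/^#/ && $5 + 0 > max { max = $5 + 0 } END { print max }' "$tmp/scaf1.gff")

echo "last line end: $last_end, true max end: $max_end"
```

Here the last line ends at 200 while the largest end in the file is 900, so an offset taken from `tail` would be too small.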

              i think a brute force attack may be appropriate. for example: parse the first file and find the maximum position value in the 5th column (feature end coordinate) while at the same time printing its content out to the new concatenated file. increment that maximum position and then parse the second file, translating its coordinates by that offset while simultaneously tracking the maximum position from its translated coordinates to use for the next file.

              it's quite possible this will work (or at least it's a good start). You want to pass all of the GTF file names to it at once, so the usage string I included at the top is appropriate. if your GTF files are scattered around in folders you could replace 'ls *.gtf' with 'find . -name "*.gtf"' run from the topmost folder containing them all. hope it works!

              Code:
              #!/usr/bin/perl
              #
              # concatenates and translates GFF/GTF and sends output to stdout
              # as it goes
              #
              # WARNING: UNTESTED
              #
              # Usage: ls *.gtf | xargs ./this-script.pl > concatenated.gtf
              #
              
              use strict;
              
              my $offset = 0;
              my $max_pos = 0;
              my @arl;
              my $fname;
              
              #
              # get first offset
              #
              
              $fname = shift @ARGV;
              open FIN, '<', $fname or die($!);
              while(<FIN>) {
              	# print out
              	print STDOUT $_;
              	
              	# process offset
              	chomp;
              	@arl = split(/\t/);
              	if($arl[4] > $max_pos) {
              		$max_pos = $arl[4];
              	}
              }
              
              close FIN;
              
              # shift offset forward a base
              $offset = $max_pos+1;
              
              while(scalar @ARGV) {
              
              	$fname = shift @ARGV;
              	$max_pos = 0;
              	open FIN, '<', $fname or die($!);
              	
              	while(<FIN>) {
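              		chomp;
              		
              		# pass comment and blank lines through untranslated
              		if(/^#/ || /^\s*$/) {
              			print STDOUT "$_\n";
              			next;
              		}
              		
              		@arl = split(/\t/);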
              		
              		# translate this line's coordinates
              		$arl[3] += $offset;
              		$arl[4] += $offset;
              		
              		# update max position from translated file
              		if($arl[4] > $max_pos) {
              			$max_pos = $arl[4];
              		}
              		
              		# print translated line out
              		print STDOUT join("\t", @arl) . "\n";
              		
              	}
              	
              	close FIN;
              	
              	# update offset for next file
              	$offset = $max_pos + 1;
              
              }
              Last edited by sdriscoll; 05-28-2013, 05:24 PM. Reason: forgot something
              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
              Salk Institute for Biological Studies, La Jolla, CA, USA */

              • malcook
                Member
                • Sep 2009
                • 24

                #8
                perl fu bwahaha

                assuming all your files are in the contig directory, how about this perl one-liner:

                Code:
                perl -F'\t' -lape '$F[4]+=$o; $F[3]+=$o; $_=join("\t",@F); $m=($m,$F[4])[$m < $F[4]]; $o=$m if eof;'  contig/*.gff > contigs.gff
                The approach is to split each line on tabs and add an offset, $o, to the 4th and 5th columns, resetting the offset to the maximum offsetted value seen in column 5, $m, at each file boundary.

                • sdriscoll
                  I like code
                  • Sep 2009
                  • 436

                  #9
                  ah, very nice! now there's a readable and non-readable option.
                  /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                  Salk Institute for Biological Studies, La Jolla, CA, USA */

                  • mgaldos
                    Member
                    • May 2013
                    • 12

                    #10
                    Wow guys, this is awesome. I'll try out the suggestions now and let you know how it all worked.

                    Thanks a lot!

                    • mgaldos
                      Member
                      • May 2013
                      • 12

                      #11
                      Alright, I tried both suggestions but both gave me the same error:

                      -bash: /usr/bin/perl: Argument list too long

                      I think that perl just refuses to work with 13000 files at a time. Is there any way to bypass this?

                      • muthu545
                        Member
                        • Jul 2011
                        • 32

                        #12
                        Hi,
                        If the issue is only on the # of files, you could try combining batches of 100 files - which will leave you with 130 combined files (13000/100) ---> then you could combine these 130 files to 1 file.
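As an alternative to batching, here is a rough sketch that avoids the limit altogether. "Argument list too long" is reported when a command line exceeds the operating system's limit on argument size, so the sketch passes the file names on standard input instead of on the command line. Assumptions: the `contig` directory and the three tiny files are made up for the demo, the files are processed in whatever order `sort` gives, and the offset rule (largest translated end plus one) follows the Perl script posted earlier in this thread. It uses awk rather than Perl and has only been tried on toy input like this.

```shell
#!/bin/sh
# Sketch only: the directory layout and file contents below are made up.
tmp=$(mktemp -d)
mkdir "$tmp/contig"
printf 'c1\tsrc\tgene\t0\t1000\t.\t+\t.\tID=a\n' > "$tmp/contig/c1.gff"
printf 'c2\tsrc\tgene\t0\t500\t.\t+\t.\tID=b\n' > "$tmp/contig/c2.gff"
printf 'c3\tsrc\tgene\t100\t300\t.\t-\t.\tID=c\n' > "$tmp/contig/c3.gff"

# find prints one path per line, so no command ever receives 13,000 arguments.
# awk treats each input line as a file name and reads that file with getline.
find "$tmp/contig" -name '*.gff' | sort | awk '
BEGIN { offset = 0 }
{
    file = $0
    max = offset                          # largest translated end seen so far
    while ((getline line < file) > 0) {
        n = split(line, f, "\t")
        if (line ~ /^#/ || n < 5) {       # pass comments and short lines through
            print line
            continue
        }
        # %.0f keeps large coordinates as plain integers in any awk.
        f[4] = sprintf("%.0f", f[4] + offset)
        f[5] = sprintf("%.0f", f[5] + offset)
        if (f[5] + 0 > max) max = f[5] + 0
        out = f[1]
        for (i = 2; i <= n; i++) out = out "\t" f[i]
        print out
    }
    close(file)
    offset = max + 1                      # next file starts one past the largest end
}' > "$tmp/contigs.gff"

cat "$tmp/contigs.gff"
```

On the toy input the three features come out at 0 - 1000, 1001 - 1501 and 1602 - 1802. For real data, replace `"$tmp/contig"` with the directory that holds the scaffold files.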
                        Last edited by muthu545; 05-29-2013, 09:03 AM.
