Hello All,
I have a problem that I am looking for some input on.
I have been using the genome assembler SPAdes, which outputs assembled contigs as a fasta file. I would like to sort my contigs based on average coverage (i.e. remove contigs with low coverage), which sounds like it should be a fairly easy task. However, all of the coverage information is contained within the fasta sequence header itself, for example (taken from my assembly file):
Code:
>NODE_100_length_628_cov_0.818363_ID_199
So the average coverage for this particular contig is about 0.8.
The fact that the information is contained within the header itself and is not in a table format prevents me from doing some sort of copy/paste sorting shenanigans in Excel. So I figured I could write something up in Perl and use a regular expression to sort based on the value of the number following "cov_" in the header. But I ran into some issues with that, likely because I am a beginner and I still don't really know what I'm doing. I know I need to use BioPerl for the sequence/multifasta handling, and I know I need to restrict the matching to the header only and not the sequence itself, and then I need a way to delete all sequences whose header value does not meet a certain threshold (e.g. all values less than 10).
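To make the header-matching step concrete, here is a minimal sketch of pulling the coverage value out of a header line. It is written in Python purely for illustration rather than in Perl/BioPerl as planned above, and the names COV_PATTERN and coverage_from_header are hypothetical. It assumes every header follows the same layout as the example from my assembly file, with a decimal number directly after "cov_".

```python
import re

# Assumption: every header looks like the example header, with the coverage
# written as a plain decimal number right after "_cov_".
COV_PATTERN = re.compile(r"_cov_(\d+(?:\.\d+)?)")


def coverage_from_header(header):
    """Return the coverage found in a FASTA header as a float, or None if absent."""
    match = COV_PATTERN.search(header)
    return float(match.group(1)) if match else None


# Only header lines (those starting with ">") should ever be passed in,
# so the sequence itself is never matched against the pattern.
print(coverage_from_header(">NODE_100_length_628_cov_0.818363_ID_199"))
```

The same pattern should carry over to a Perl regular expression more or less unchanged.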
I've done some research via the almighty Google and come across people trying to complete similar tasks, but in all of the cases I found the individuals knew the EXACT header/sequence name of the sequence they wanted to extract. These methods are not very applicable to me since I am looking to sort sequences based on whether or not they meet a specific condition.
Any input or advice to lead me in the right direction would be greatly appreciated. So far all my code does is read the input file *golf clap*
I also know that because my file is relatively large (4MB) it is not efficient to have a script that reads everything line by line, but I'm not sure what else to do or how to address that issue.
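On the efficiency worry: a 4MB file is small, and reading it line by line should not be a bottleneck, since only one line needs to be held in memory at a time. The sketch below shows that streaming approach, again in Python for illustration rather than Perl/BioPerl. The function name filter_by_coverage and the command-line layout are hypothetical, and it assumes headers follow the "cov_" layout of the example header.

```python
import re
import sys

# Assumption: coverage appears as a decimal number right after "_cov_".
COV_PATTERN = re.compile(r"_cov_(\d+(?:\.\d+)?)")


def filter_by_coverage(lines, min_cov):
    """Yield only the lines of records whose header coverage is >= min_cov."""
    keep = False  # lines before the first header are dropped
    for line in lines:
        if line.startswith(">"):
            # A new record starts here: decide once whether to keep it.
            match = COV_PATTERN.search(line)
            # Headers without a coverage value are treated as failing the filter.
            keep = match is not None and float(match.group(1)) >= min_cov
        if keep:
            yield line


# Hypothetical usage: python filter_cov.py contigs.fasta filtered.fasta 10
if __name__ == "__main__" and len(sys.argv) == 4:
    in_path, out_path, threshold = sys.argv[1], sys.argv[2], float(sys.argv[3])
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(filter_by_coverage(src, threshold))
```

Because the keep/drop decision is made once per header and then applied to the sequence lines that follow, multi-line sequences are handled without loading the whole file.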
Please help! And thanks in advance,
~Ana