Hello,
I am completely new to next-generation sequencing and need some help.
I have a multi-sequence FASTA file from Cufflinks (a small portion of the file is below), and I would like to calculate the percentage of n's in each transcript. Is there an easy way to do the calculation for all 11,804 transcripts at once?
Thanks so much for your help!!
>NM_001034679 loc:chr1|327877-339078|+ exons:327877-328043,331233-331406,333963-334122,337306-339078 segs:1-167,168-341,342-501,502-2274 FPKM=0.0000000000 frac=0.000000 conf_lo=0.000000 conf_hi=0.000000 cov=0.000000 full_read_support=yes
AACAGCTTCGGGTAAAGgaactccatccacttggggcttgaccgcgggagtcgctgtagctctcactgtg
agaaagcaagatgcactttagaaattttaactacagttttagctccctgattgcctgtgtggcaaacagt
gagatcttcagcgaaagtgaaaccagggccaaatttgaatccctctttaggacttatgacaaagatatca
cctttcagtattttaaaagcttcaaacgggtcagaataaacttcagcaaccccttatctgcagcagatgc
caggctccagctacataagactgaatttcttggaaaggaaatgaaattatattttgctcagactttacac
ataggcagttcgcacctggcccctccgaatccagacaaacagttcctcatctctcctcctgcctcaccgc
cggtcggctggaaacaagtggaagacgctactcccgtcataaattatgaccttttatacgctatctccaa
gctagggccaggggaaaagtatgaactgcatgcagccaccgacaccacccccagcgtggtggtccacgtc
tgtgagagcgatcaggagaacgaggaagaagacgagatggaaagaatgaagagaccgaagcccaaaattatccagaccaggaggccagagtacacgcccatccacttaagctgaaccggcgcccggacgaagacgtgctc
caaaccatgctcgcaagaaggcatcttttactgtggaagcagccggtcacagctttggaggcggcagccg
tgaccgctgtggcggaaattccagttcacgttgctcagaagagaatcgaggcttcgtcccctggttctaa
cgctgcgcctcagtcagtgttcgaggctcctggccaggccccgagccaatcactgagcttggggtgatcg
cacaaggacatctgggagcatcgcgggaaaaccaataatgatagtcttttgtacttgttctcttctggta
ggttctgtcttggccaggggcagattgatccgtgggccccggggagagtctttgtgtttaatcagtctac
aaggtagacgcactctctctCCTGGTGGGAAAAGGCGCCACGnnnnnnnnnnnngCGTCTGGTGCAGAAAGGTTGTGAAAGCAACCGTGCAACGTggaaactgtagcgtttcaatttcccccttcatgttctgatgTTTGTGCATGTGTATTACTgatTTCTCAGAACTAACCTTTGTTTGTATGTAGAGTTGCGCCACTGCTGTTTTACATCTTCTGGGGAGATAAGAAGGCATctgtgaagtctgtcacctttgcagattcGTGACTGTCTTTGCAAGGGCACCCACGGCGGGGTGGAGGGGATTCTACCTGGAATACACACACCATTCCGCATCCTGTCCGATGCGACCAAnnnnnCGTGTTTTTGCAAAAGAAgtcgatctggaattcctgtgtagcgtttcgcttataaaattcagaaaatagcactttcactgccaactactagtgggtgagaaattttagtttagatgttttagatcaggcaatacgtaggtttcatttgtttctttgacgtggtggtttatataacatgaatcatagccaaaacccttttcggggggaatagtcagttgagatcattaatttttttacccccactaatacatcaagataaacttgtaaataaagccggtagtatatattcacacctgttgtgcacttgggtgagacatatatggccagggaagactagggtcagatgtgttgacctccccgtgaatcatatgttgtagaaaatgcctttcagatgtttgatgggacttgaattcaaagcacgtgaagtggatagtggatataagaagggtgcagtgcctttcccattaattcctggtggagttgtcacactaggttaacgtttgtaatttttttctagtgtccTGTGTATGTsTGGTCGATGGGTACTCCCTTTTGGCCTTACAAtattgtaacaatgtttgtccttttgaaatacctaatgccaagtaacagtgcatgctttagaaaaggggaagagggctttctttaagaagtaaaggcgtttggctgttcctgtcaagaaactgactgaatggtctccaaaccctgtttacaggacctggtggggtgtgggggacaaatgagcaagagatGCGTGCATAGTCGTTCAAGTGTTCGTAGTTCAGTGCTTTTAAACTGGGGAGGCTAACCACGAGATATTTTTTTTAACTGCATTCTCTAATAAATCGACGCAATATGCTCTTTA
>NM_001077977 loc:chr1|377376-383871|+ exons:377376-377447,378427-378521,383243-383871 segs:1-72,73-167,168-796 FPKM=0.0000000000 frac=0.000000 conf_lo=0.000000 conf_hi=0.000000 cov=0.000000 full_read_support=yes
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnngCACCCGGCTCGCTCCGCGCCTCAGACCCGgttctgaggaagcggattgttttgaaaataaagtacaccgtgaagacctccaaggaactcgctcttgtcatctcaagtctctgaaggcagagctgacagttttcctgggaagttcaccctccgggagtgaaacctgactctcaggatgatcctgtctaacgccacggccgtgacacctgtgctgaccaagttatggcaggggacagtt
caacagggcagcaacacatctgggctggcccgcaggttcccaggccacgaggacggcaagctggcggcgc
tctacatcctcatggtcctcggctttttcggcttcttcaccctgggcatcatgctgagttacatccgctc
caagaaactggagcactcccatgacccatacaacgtgtacattgagtctgacacttggcaggagcaggac
aaggcgtacttccaggcccggattctggagagctgcagggcgtgttacgtcattgagaaccaactggctg
tagagcgacccagtgcataccttcctgagatgaagcggtcgtcctgaccccaggaccagtcaaaactgga
cagagcctccctgatgagctgatttttctaatcacatgttccttttttctttattgtatgagtattattg
gggtttttgtctatcataaggggtgaaagggggatttaatatcactatatttctaaaatcacattccttc
tataatagattgtcagtcattcccca
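For reference, the calculation asked about above could be sketched in plain Python along these lines. This is only a minimal sketch, not a Cufflinks feature: the function name `n_percent_per_record` and the file name `transcripts.fa` are made up for illustration, and it assumes that both `n` and `N` should be counted and that the record ID is the first word after `>`.

```python
def n_percent_per_record(lines):
    """Yield (record_id, length, n_count, percent_n) for each FASTA record.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Hypothetical helper for illustration; counts both 'n' and 'N'.
    """
    name, length, n_count = None, 0, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                pct = 100.0 * n_count / length if length else 0.0
                yield name, length, n_count, pct
            name = line[1:].split()[0]  # assumption: ID is the first word
            length, n_count = 0, 0
        elif name is not None:
            # Sequence line: add its bases and its n/N count to the totals.
            length += len(line)
            n_count += line.count("n") + line.count("N")
    # Emit the final record, which has no following header to close it.
    if name is not None:
        pct = 100.0 * n_count / length if length else 0.0
        yield name, length, n_count, pct


def write_report(fasta_path, out):
    """Write one tab-separated line per transcript to the file object `out`."""
    with open(fasta_path) as handle:
        for name, length, n_count, pct in n_percent_per_record(handle):
            out.write(f"{name}\t{length}\t{n_count}\t{pct:.2f}\n")
```

Something like `write_report("transcripts.fa", sys.stdout)` (after `import sys`) would then print the ID, length, N count, and N percentage for every transcript in one pass. Because the file is read line by line, it should not matter that some sequences are wrapped at 70 characters and others are on one long line.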