Hi,
I have a problem with my perl script can anyone please help me out.
I work on Meta transcriptomics data.I assembled my data files(after clustering) using Trinity assembler and also mapped reads back to transcripts using Bowtie2 alignment tool.Now I want to find out the sequence abundance in my libraries for that I am trying to modify my perl script.
For instance I have two files
1) mapping output in sam format(modified with only 3 columns I am interested on)
HWI-ST3635:262:C0RY1ACXX:6:1307:9852:25251_1:N:0:GTGAAA_weight|1 comp2872_c0_seq1 ACTGAGATTTGCGGAACCCGTGCCCACCATCCGATTCGAGCAGGAAGGCCAGCAGGTTGGCTGCATCGAGGGCGCAAATCTGCGTAACGCTGCACTTGAT
HWI-ST3655:262:C0RY1ACXX:6:2201:4395:24339_1:N:0:GTGAAA_weight|1 * CGCTTGAGCCCCGTGAAGTGCAAATAATCAGCGACTCCTTCGGCCTTGAAGGAGCCTCCTGCAAAACGTTAGAAGAAATTGGCGGTGAAATGGGCGTAAC
HWI-ST5365:262:C0RY1ACXX:6:2109:15325:63331_1:N:0:GTGAAA_weight|1 comp1371_c0_seq1 CTTCCGGTCGGAGAGTGACAGACCGAATGGCTGAGGAGCGCGTTCAACCGATCGCCCTGCATCAGGAGATGCAGCGCTCCTACCTCGAATACG
HWI-ST3365:262:C0RY1ACXX:6:1107:21149:80481_1:N:0:GTGAAA_weight|1 comp25106_c0_seq1 AATTCCTCTTTCTCGAATTGAAGAAAAAGAAAGTATTCGCCTGATGAATATGGAGGAAGAGTTATCGGAGAAAGGTGTCGGTCAGTGAG
HWI-ST3655:262:C0RY1ACXX:6:1113:18720:38413_1:N:0:GTGAAA_weight|6 comp16g27_c0_seq1 CAGCAAGCAGGGCACTGCGgGCCACCACAGCGGTAATTGCCAAGCTTAAAGGATATGAAGGAGATGCTTGCCAGGAATGCGGgCCTAACACTGGTA
HWI-ST365:262:C0RY1ACXX:6:1310:17075:45332_1:N:0:GTGAAA_weight|3 * ATAGTTATATGCATGCATATGCAGAAAGGATATTTTAATTGAAGACATGACTGCCATTATTTGTAAAGATCTTGATGGTTAATTATTGTGG
Weight:denotes the number of times the read exists in the data file
2)Trinity contigs file
>comp0_c0_seq1 len=159 path=[12:0-66 885:67-79 1106:80-158]
GGTTAATATTCCCGAGCCACGAGATTGGAGGGACGGCGTCCAGAGCACCTGCGGACTGAT
AGAATAGCCCGTTGCACGTACTGAGTTGGAGAGGATTAAAAGACTCTCATAATACAAGGC
CCGTACTAAGCCCACTTACGATGGAATAGATGGTAGGAA
>comp0_c0_seq2 len=159 path=[12:0-66 2349:67-79 1106:80-158]
GGTTAATATTCCCGAGCCACGAGATTGGAGGGACGGCGTCCAGAGCACCTGCGGACTGAT
AGAATAGTCCGTTGGATCAAATGAGTTGGAGAGGATTAAAAGACTCTCATAATACAAGGC
CCGTACTAAGCCCACTTACGATGGAATAGATGGTAGGAA
First I want to find out how many unicals were mapped back to each contig or transcript(assembly output) and list of reads(add the weights),secondly extract the sequence of the contig and its length from the Trinity contig file and print out something like as following
Contig ID, Length, Number of unique clusters mapped, Number of reads mapped, Sequence
and here is script with which I could get the number of reads mapped to each contig by modifying it, but it is generating errors when I proceed further to generate list of sequences for each contig and add their weights...
#!usr/bin/perl-w
use strict;
use warnings;
my %hash = ();
my $seq = ();
while(<>){
chomp;
next if($_ =~ /^@/); ## remove the headers in sam file
my @s =split;
my $seq = $s[0];
#print "$seq, \n";
my $contig = $s[1];
#print " $seq \t $contig \n";
$hash{$contig}++;
}
foreach (sort keys %hash){
print join ("\t", $_,$hash{$_}, $seq{$_}),"\n"
I could get a tabular column as mentioned above by using a list of bash commands where I cannot do it manually for each contig as the data file is large.
Can anyone please help me how to modify the script to get the results I am trying to.
I have a problem with my perl script can anyone please help me out.
I work on Meta transcriptomics data.I assembled my data files(after clustering) using Trinity assembler and also mapped reads back to transcripts using Bowtie2 alignment tool.Now I want to find out the sequence abundance in my libraries for that I am trying to modify my perl script.
For instance I have two files
1) mapping output in sam format(modified with only 3 columns I am interested on)
HWI-ST3635:262:C0RY1ACXX:6:1307:9852:25251_1:N:0:GTGAAA_weight|1 comp2872_c0_seq1 ACTGAGATTTGCGGAACCCGTGCCCACCATCCGATTCGAGCAGGAAGGCCAGCAGGTTGGCTGCATCGAGGGCGCAAATCTGCGTAACGCTGCACTTGAT
HWI-ST3655:262:C0RY1ACXX:6:2201:4395:24339_1:N:0:GTGAAA_weight|1 * CGCTTGAGCCCCGTGAAGTGCAAATAATCAGCGACTCCTTCGGCCTTGAAGGAGCCTCCTGCAAAACGTTAGAAGAAATTGGCGGTGAAATGGGCGTAAC
HWI-ST5365:262:C0RY1ACXX:6:2109:15325:63331_1:N:0:GTGAAA_weight|1 comp1371_c0_seq1 CTTCCGGTCGGAGAGTGACAGACCGAATGGCTGAGGAGCGCGTTCAACCGATCGCCCTGCATCAGGAGATGCAGCGCTCCTACCTCGAATACG
HWI-ST3365:262:C0RY1ACXX:6:1107:21149:80481_1:N:0:GTGAAA_weight|1 comp25106_c0_seq1 AATTCCTCTTTCTCGAATTGAAGAAAAAGAAAGTATTCGCCTGATGAATATGGAGGAAGAGTTATCGGAGAAAGGTGTCGGTCAGTGAG
HWI-ST3655:262:C0RY1ACXX:6:1113:18720:38413_1:N:0:GTGAAA_weight|6 comp16g27_c0_seq1 CAGCAAGCAGGGCACTGCGgGCCACCACAGCGGTAATTGCCAAGCTTAAAGGATATGAAGGAGATGCTTGCCAGGAATGCGGgCCTAACACTGGTA
HWI-ST365:262:C0RY1ACXX:6:1310:17075:45332_1:N:0:GTGAAA_weight|3 * ATAGTTATATGCATGCATATGCAGAAAGGATATTTTAATTGAAGACATGACTGCCATTATTTGTAAAGATCTTGATGGTTAATTATTGTGG
Weight:denotes the number of times the read exists in the data file
2)Trinity contigs file
>comp0_c0_seq1 len=159 path=[12:0-66 885:67-79 1106:80-158]
GGTTAATATTCCCGAGCCACGAGATTGGAGGGACGGCGTCCAGAGCACCTGCGGACTGAT
AGAATAGCCCGTTGCACGTACTGAGTTGGAGAGGATTAAAAGACTCTCATAATACAAGGC
CCGTACTAAGCCCACTTACGATGGAATAGATGGTAGGAA
>comp0_c0_seq2 len=159 path=[12:0-66 2349:67-79 1106:80-158]
GGTTAATATTCCCGAGCCACGAGATTGGAGGGACGGCGTCCAGAGCACCTGCGGACTGAT
AGAATAGTCCGTTGGATCAAATGAGTTGGAGAGGATTAAAAGACTCTCATAATACAAGGC
CCGTACTAAGCCCACTTACGATGGAATAGATGGTAGGAA
First I want to find out how many unicals were mapped back to each contig or transcript(assembly output) and list of reads(add the weights),secondly extract the sequence of the contig and its length from the Trinity contig file and print out something like as following
Contig ID, Length, Number of unique clusters mapped, Number of reads mapped, Sequence
and here is script with which I could get the number of reads mapped to each contig by modifying it, but it is generating errors when I proceed further to generate list of sequences for each contig and add their weights...
#!usr/bin/perl-w
use strict;
use warnings;
my %hash = ();
my $seq = ();
while(<>){
chomp;
next if($_ =~ /^@/); ## remove the headers in sam file
my @s =split;
my $seq = $s[0];
#print "$seq, \n";
my $contig = $s[1];
#print " $seq \t $contig \n";
$hash{$contig}++;
}
foreach (sort keys %hash){
print join ("\t", $_,$hash{$_}, $seq{$_}),"\n"
I could get a tabular column as mentioned above by using a list of bash commands where I cannot do it manually for each contig as the data file is large.
Can anyone please help me how to modify the script to get the results I am trying to.
Comment