SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
ChIP-Seq: Enabling Data Analysis on High-Throughput Data in Large Data Depository Usi Newsbot! Literature Watch 1 04-18-2018 09:50 PM
Metatranscriptomics and residual human sequences harlequin Metagenomics 3 02-07-2014 08:01 AM
Metatranscriptomics - RNA gel image capsicum Sample Prep / Library Generation 0 02-12-2013 12:42 PM
RNA-Seq: Directed culturing of microorganisms using metatranscriptomics. Newsbot! Literature Watch 0 04-07-2011 02:00 AM

Reply
 
Thread Tools
Old 02-18-2014, 08:06 AM   #1
bambus
Member
 
Location: Germany

Join Date: Nov 2013
Posts: 20
Question Metatranscriptomics data analysis:

Hi,

I have a problem with my perl script can anyone please help me out.

I work on Meta transcriptomics data.I assembled my data files(after clustering) using Trinity assembler and also mapped reads back to transcripts using Bowtie2 alignment tool.Now I want to find out the sequence abundance in my libraries for that I am trying to modify my perl script.

For instance I have two files
1) mapping output in sam format(modified with only 3 columns I am interested on)

HWI-ST3635:262:C0RY1ACXX:6:1307:9852:25251_1:N:0:GTGAAA_weight|1 comp2872_c0_seq1 ACTGAGATTTGCGGAACCCGTGCCCACCATCCGATTCGAGCAGGAAGGCCAGCAGGTTGGCTGCATCGAGGGCGCAAATCTGCGTAACGCTGCACTTGAT
HWI-ST3655:262:C0RY1ACXX:6:2201:4395:24339_1:N:0:GTGAAA_weight|1 * CGCTTGAGCCCCGTGAAGTGCAAATAATCAGCGACTCCTTCGGCCTTGAAGGAGCCTCCTGCAAAACGTTAGAAGAAATTGGCGGTGAAATGGGCGTAAC
HWI-ST5365:262:C0RY1ACXX:6:2109:15325:63331_1:N:0:GTGAAA_weight|1 comp1371_c0_seq1 CTTCCGGTCGGAGAGTGACAGACCGAATGGCTGAGGAGCGCGTTCAACCGATCGCCCTGCATCAGGAGATGCAGCGCTCCTACCTCGAATACG
HWI-ST3365:262:C0RY1ACXX:6:1107:21149:80481_1:N:0:GTGAAA_weight|1 comp25106_c0_seq1 AATTCCTCTTTCTCGAATTGAAGAAAAAGAAAGTATTCGCCTGATGAATATGGAGGAAGAGTTATCGGAGAAAGGTGTCGGTCAGTGAG
HWI-ST3655:262:C0RY1ACXX:6:1113:18720:38413_1:N:0:GTGAAA_weight|6 comp16g27_c0_seq1 CAGCAAGCAGGGCACTGCGgGCCACCACAGCGGTAATTGCCAAGCTTAAAGGATATGAAGGAGATGCTTGCCAGGAATGCGGgCCTAACACTGGTA
HWI-ST365:262:C0RY1ACXX:6:1310:17075:45332_1:N:0:GTGAAA_weight|3 * ATAGTTATATGCATGCATATGCAGAAAGGATATTTTAATTGAAGACATGACTGCCATTATTTGTAAAGATCTTGATGGTTAATTATTGTGG


Weight:denotes the number of times the read exists in the data file

2)Trinity contigs file

>comp0_c0_seq1 len=159 path=[12:0-66 885:67-79 1106:80-158]
GGTTAATATTCCCGAGCCACGAGATTGGAGGGACGGCGTCCAGAGCACCTGCGGACTGAT
AGAATAGCCCGTTGCACGTACTGAGTTGGAGAGGATTAAAAGACTCTCATAATACAAGGC
CCGTACTAAGCCCACTTACGATGGAATAGATGGTAGGAA
>comp0_c0_seq2 len=159 path=[12:0-66 2349:67-79 1106:80-158]
GGTTAATATTCCCGAGCCACGAGATTGGAGGGACGGCGTCCAGAGCACCTGCGGACTGAT
AGAATAGTCCGTTGGATCAAATGAGTTGGAGAGGATTAAAAGACTCTCATAATACAAGGC
CCGTACTAAGCCCACTTACGATGGAATAGATGGTAGGAA



First I want to find out how many unicals were mapped back to each contig or transcript(assembly output) and list of reads(add the weights),secondly extract the sequence of the contig and its length from the Trinity contig file and print out something like as following

Contig ID, Length, Number of unique clusters mapped, Number of reads mapped, Sequence

and here is script with which I could get the number of reads mapped to each contig by modifying it, but it is generating errors when I proceed further to generate list of sequences for each contig and add their weights...

#!usr/bin/perl-w
use strict;
use warnings;
my %hash = ();
my $seq = ();
while(<>){
chomp;
next if($_ =~ /^@/); ## remove the headers in sam file
my @s =split;
my $seq = $s[0];
#print "$seq, \n";
my $contig = $s[1];
#print " $seq \t $contig \n";
$hash{$contig}++;

}
foreach (sort keys %hash){
print join ("\t", $_,$hash{$_}, $seq{$_}),"\n"



I could get a tabular column as mentioned above by using a list of bash commands where I cannot do it manually for each contig as the data file is large.

Can anyone please help me how to modify the script to get the results I am trying to.
Attached Files
File Type: doc expected_output.doc (14.0 KB, 5 views)
bambus is offline   Reply With Quote
Old 02-18-2014, 12:26 PM   #2
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default

What error messages are you getting, which lines in the script are causing the problem?

I am not sure which of the two files your script is trying to parse, but in any case the lines in the sam file begin with the read ids 'HWI...' not '@'

Code:
 next if($_ =~ /^@/); ## remove the headers in sam file
so that could be one problem.

Last edited by mastal; 02-18-2014 at 12:29 PM.
mastal is offline   Reply With Quote
Old 02-18-2014, 11:31 PM   #3
bambus
Member
 
Location: Germany

Join Date: Nov 2013
Posts: 20
Default

Hi Mastal,thank you for the reply.

Yes,the lines begin with 'HWI' as you mentioned because I modified the original SAM output file.In general the there are headers in Sam file,but it doesn't harm in anyway to the script.

For instance an error message I am getting

Global symbol "%seq" requires explicit package name at read_count.pl line 22.
Execution of read_count.pl aborted due to compilation errors.
bambus is offline   Reply With Quote
Old 02-19-2014, 12:44 AM   #4
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default

The error may refer to the $seq{$_) in your print statement in the last line of the script.

There is no previous reference to a hash named 'seq' in your script, but you do have a local scalar variable named $seq (my $seq = $s[0]).
mastal is offline   Reply With Quote
Old 02-19-2014, 12:58 AM   #5
bambus
Member
 
Location: Germany

Join Date: Nov 2013
Posts: 20
Default

Okay,how can I solve this?

Can you please help me.
bambus is offline   Reply With Quote
Old 02-19-2014, 12:59 AM   #6
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

The error is caused, because in the corresponding line you are refering to a hash ($seq{$_}), while you did not define a hash for seq in line 5;
my $seq = ();

In addition, your 'seq' hash is not filled anywhere.

To solve this, I would recommend the following pseudocode;

- Read your alignment, and store the results (as you did already)

- Read contig file and go through each contig like the code belowbelow;
Code:
open(IN,$file);
while(<IN>){
  chomp;
  $contigSeq.= $_ if(eof(IN));
  if (/\>(\S+)/ || eof(IN)){
     my $head=$1;
     if($contigSeq ne ''){
       #$contigSeq is the contig sequence, $prevhead is your contig
       my $len = length($contigSeq);
       #Now print the results
       print "$prevhead\t$len\t$hash{$prevhead}\t$contigSeq\n";
     }
     $prevhead = $head;
     $contigSeq='';
  }else{
     $contigSeq .= $_;
  }
}
close IN;
Quote:
Originally Posted by bambus View Post
Hi Mastal,thank you for the reply.

Yes,the lines begin with 'HWI' as you mentioned because I modified the original SAM output file.In general the there are headers in Sam file,but it doesn't harm in anyway to the script.

For instance an error message I am getting

Global symbol "%seq" requires explicit package name at read_count.pl line 22.
Execution of read_count.pl aborted due to compilation errors.
boetsie is offline   Reply With Quote
Old 02-19-2014, 01:09 AM   #7
bambus
Member
 
Location: Germany

Join Date: Nov 2013
Posts: 20
Default

I am sorry,i am not clear.

So now how does this code works.Do I have to input two files (Mapped sam file to count contig frequency and contig file to extract the sequence)or?
bambus is offline   Reply With Quote
Old 02-19-2014, 01:16 AM   #8
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

Quote:
Originally Posted by bambus View Post
I am sorry,i am not clear.

So now how does this code works.Do I have to input two files (Mapped sam file to count contig frequency and contig file to extract the sequence)or?
Yes. The first part (reading the mapped sam file) you should do yourself. But basically, this is what you already have. The second part (the code shown above) is for extracting the contig sequence.
boetsie is offline   Reply With Quote
Old 02-19-2014, 01:21 AM   #9
bambus
Member
 
Location: Germany

Join Date: Nov 2013
Posts: 20
Default

Yes I have an ouput file named 'read_count' with "contig ID" Number of Unicals mapped".

Now I also want to print out the "list of Unicals(sequences) that were mapped" "contig sequence" "contig length".

So I tried to execute your code with two input files

1) read count
2) contig_file as follows

perl contig_seq.pl read count contig_file >seq.txt

but,it gives me no output.
bambus is offline   Reply With Quote
Old 02-19-2014, 01:22 AM   #10
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

PM me your code and the two input files, so I can help you with that..
boetsie is offline   Reply With Quote
Old 02-19-2014, 01:38 AM   #11
bambus
Member
 
Location: Germany

Join Date: Nov 2013
Posts: 20
Default

Quote:
Originally Posted by boetsie View Post
PM me your code and the two input files, so I can help you with that..
I am sorry I could attach only 1 file at a time
Attached Files
File Type: pl read_count.pl (297 Bytes, 3 views)
bambus is offline   Reply With Quote
Old 02-19-2014, 01:47 AM   #12
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

I did not test it, but I think it would be something like this. Run it as;

perl script <samfile> <contigfile>

Code:
#!usr/bin/perl-w
use strict;
use warnings;

open(SAM,$ARGV[0]);
my %hash = ();
while(<SAM>){
  chomp;
  next if($_ =~ /^@/); ## remove the headers in sam file
  my @s =split;
  my $contig = $s[2];
  $hash{$contig}++;
}
close SAM;

open(CTG,$ARGV[1]);
my ($contigSeq,$prevhead) = ("","");
while(<CTG>){
  chomp;
  $contigSeq.= $_ if(eof(CTG));
  if (/\>(\S+)/ || eof(CTG)){
     my $head=$1;
     if($contigSeq ne ''){
       #$contigSeq is the contig sequence, $prevhead is your contig
       my $len = length($contigSeq);
       #Now print the results
       print "$prevhead\t$len\t$hash{$prevhead}\t$contigSeq\n";
     }
     $prevhead = $head;
     $contigSeq='';
  }else{
     $contigSeq .= $_;
  }
}
close CTG;
boetsie is offline   Reply With Quote
Old 02-19-2014, 01:48 AM   #13
bambus
Member
 
Location: Germany

Join Date: Nov 2013
Posts: 20
Default

Quote:
Originally Posted by bambus View Post
I am sorry I could attach only 1 file at a time

input file 1 with mapped reads
Attached Files
File Type: txt Sample_aligned_reads.txt (19.0 KB, 2 views)
bambus is offline   Reply With Quote
Old 02-19-2014, 01:49 AM   #14
bambus
Member
 
Location: Germany

Join Date: Nov 2013
Posts: 20
Default

Quote:
Originally Posted by bambus View Post
input file 1 with mapped reads
Input file 2 -list of contigs
Attached Files
File Type: txt sample_contig.txt (2.1 KB, 3 views)
bambus is offline   Reply With Quote
Old 02-19-2014, 01:53 AM   #15
bambus
Member
 
Location: Germany

Join Date: Nov 2013
Posts: 20
Default

yes I did,and I got this following error many time with different line numbers


error

Use of uninitialized value in concatenation (.) or string at contig_table.pl line 27, <CTG> line 140568.
bambus is offline   Reply With Quote
Old 02-19-2014, 02:09 AM   #16
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

Here you go. You had a mistake in your code, you took the third column as your contig in this line;

my $contig = $s[2];

while the contig was in the second column.

In addition, since you only have a subset of your data, you'll get these warnings. I removed the use warnings line. Run the script as;

perl script.pl Sample_aligned_reads.txt sample_contig.txt

Code:
#!usr/bin/perl-w
use strict;

open(SAM,$ARGV[0]);
my %hash = ();
while(<SAM>){
  chomp;
  next if($_ =~ /^@/); ## remove the headers in sam file
  #split the line and obtain the read and contig
  my ($read,$contig,$sequence) =split;
#split the read on the '|' character, to obtain the weight
  my (undef, $weight) = split(/\|/,$read);
#save the total number of reads and clusters in the hash
  $hash{$contig}{'clusters'}++;
  $hash{$contig}{'total'}+=$weight;
}
close SAM;


open(CTG,$ARGV[1]);
my ($contigSeq,$prevhead) = ("","");
while(<CTG>){
  chomp;
  $contigSeq.= $_ if(eof(CTG));
  if (/\>(\S+)/ || eof(CTG)){
     my $head=$1;
     if($contigSeq ne ''){
       #$contigSeq is the contig sequence, $prevhead is your contig
       my $len = length($contigSeq);
       #Now print the results
       print "$prevhead\t$len\t$hash{$prevhead}{'clusters'}\t$hash{$prevhead}{'total'}\t$contigSeq\n" if(defined $hash{$prevhead});
     }
     $prevhead = $head;
     $contigSeq='';
  }else{
     $contigSeq .= $_;
  }
}
close CTG;
boetsie is offline   Reply With Quote
Old 02-19-2014, 02:14 AM   #17
bambus
Member
 
Location: Germany

Join Date: Nov 2013
Posts: 20
Default

Yes,I am sorry forgot to mention , I modified the data while extracting the columns of interest from original SAM file,but now I executed the script on complete data files.

Again I will try with this modified script.
bambus is offline   Reply With Quote
Old 02-19-2014, 02:25 AM   #18
bambus
Member
 
Location: Germany

Join Date: Nov 2013
Posts: 20
Default

Oh great! its working,thank you very much.
bambus is offline   Reply With Quote
Old 02-19-2014, 02:30 AM   #19
boetsie
Senior Member
 
Location: NL, Leiden

Join Date: Feb 2010
Posts: 245
Default

Great, good luck with your analysis.
boetsie is offline   Reply With Quote
Old 02-19-2014, 02:30 AM   #20
bambus
Member
 
Location: Germany

Join Date: Nov 2013
Posts: 20
Default

Thank you once again.
bambus is offline   Reply With Quote
Reply

Tags
mapping reads, meta-transcriptomics, perl script, rna-seq

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:06 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO