Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Metatranscriptomics data analysis:

    Hi,

    I have a problem with my perl script can anyone please help me out.

    I work on Meta transcriptomics data.I assembled my data files(after clustering) using Trinity assembler and also mapped reads back to transcripts using Bowtie2 alignment tool.Now I want to find out the sequence abundance in my libraries for that I am trying to modify my perl script.

    For instance I have two files
    1) mapping output in sam format(modified with only 3 columns I am interested on)

    HWI-ST3635:262:C0RY1ACXX:6:1307:9852:25251_1:N:0:GTGAAA_weight|1 comp2872_c0_seq1 ACTGAGATTTGCGGAACCCGTGCCCACCATCCGATTCGAGCAGGAAGGCCAGCAGGTTGGCTGCATCGAGGGCGCAAATCTGCGTAACGCTGCACTTGAT
    HWI-ST3655:262:C0RY1ACXX:6:2201:4395:24339_1:N:0:GTGAAA_weight|1 * CGCTTGAGCCCCGTGAAGTGCAAATAATCAGCGACTCCTTCGGCCTTGAAGGAGCCTCCTGCAAAACGTTAGAAGAAATTGGCGGTGAAATGGGCGTAAC
    HWI-ST5365:262:C0RY1ACXX:6:2109:15325:63331_1:N:0:GTGAAA_weight|1 comp1371_c0_seq1 CTTCCGGTCGGAGAGTGACAGACCGAATGGCTGAGGAGCGCGTTCAACCGATCGCCCTGCATCAGGAGATGCAGCGCTCCTACCTCGAATACG
    HWI-ST3365:262:C0RY1ACXX:6:1107:21149:80481_1:N:0:GTGAAA_weight|1 comp25106_c0_seq1 AATTCCTCTTTCTCGAATTGAAGAAAAAGAAAGTATTCGCCTGATGAATATGGAGGAAGAGTTATCGGAGAAAGGTGTCGGTCAGTGAG
    HWI-ST3655:262:C0RY1ACXX:6:1113:18720:38413_1:N:0:GTGAAA_weight|6 comp16g27_c0_seq1 CAGCAAGCAGGGCACTGCGgGCCACCACAGCGGTAATTGCCAAGCTTAAAGGATATGAAGGAGATGCTTGCCAGGAATGCGGgCCTAACACTGGTA
    HWI-ST365:262:C0RY1ACXX:6:1310:17075:45332_1:N:0:GTGAAA_weight|3 * ATAGTTATATGCATGCATATGCAGAAAGGATATTTTAATTGAAGACATGACTGCCATTATTTGTAAAGATCTTGATGGTTAATTATTGTGG


    Weight:denotes the number of times the read exists in the data file

    2)Trinity contigs file

    >comp0_c0_seq1 len=159 path=[12:0-66 885:67-79 1106:80-158]
    GGTTAATATTCCCGAGCCACGAGATTGGAGGGACGGCGTCCAGAGCACCTGCGGACTGAT
    AGAATAGCCCGTTGCACGTACTGAGTTGGAGAGGATTAAAAGACTCTCATAATACAAGGC
    CCGTACTAAGCCCACTTACGATGGAATAGATGGTAGGAA
    >comp0_c0_seq2 len=159 path=[12:0-66 2349:67-79 1106:80-158]
    GGTTAATATTCCCGAGCCACGAGATTGGAGGGACGGCGTCCAGAGCACCTGCGGACTGAT
    AGAATAGTCCGTTGGATCAAATGAGTTGGAGAGGATTAAAAGACTCTCATAATACAAGGC
    CCGTACTAAGCCCACTTACGATGGAATAGATGGTAGGAA



    First I want to find out how many unicals were mapped back to each contig or transcript(assembly output) and list of reads(add the weights),secondly extract the sequence of the contig and its length from the Trinity contig file and print out something like as following

    Contig ID, Length, Number of unique clusters mapped, Number of reads mapped, Sequence

    and here is script with which I could get the number of reads mapped to each contig by modifying it, but it is generating errors when I proceed further to generate list of sequences for each contig and add their weights...

    #!usr/bin/perl-w
    use strict;
    use warnings;
    my %hash = ();
    my $seq = ();
    while(<>){
    chomp;
    next if($_ =~ /^@/); ## remove the headers in sam file
    my @s =split;
    my $seq = $s[0];
    #print "$seq, \n";
    my $contig = $s[1];
    #print " $seq \t $contig \n";
    $hash{$contig}++;

    }
    foreach (sort keys %hash){
    print join ("\t", $_,$hash{$_}, $seq{$_}),"\n"



    I could get a tabular column as mentioned above by using a list of bash commands where I cannot do it manually for each contig as the data file is large.

    Can anyone please help me how to modify the script to get the results I am trying to.
    Attached Files

  • #2
    What error messages are you getting, which lines in the script are causing the problem?

    I am not sure which of the two files your script is trying to parse, but in any case the lines in the sam file begin with the read ids 'HWI...' not '@'

    Code:
     next if($_ =~ /^@/); ## remove the headers in sam file
    so that could be one problem.
    Last edited by mastal; 02-18-2014, 01:29 PM.

    Comment


    • #3
      Hi Mastal,thank you for the reply.

      Yes,the lines begin with 'HWI' as you mentioned because I modified the original SAM output file.In general the there are headers in Sam file,but it doesn't harm in anyway to the script.

      For instance an error message I am getting

      Global symbol "%seq" requires explicit package name at read_count.pl line 22.
      Execution of read_count.pl aborted due to compilation errors.

      Comment


      • #4
        The error may refer to the $seq{$_) in your print statement in the last line of the script.

        There is no previous reference to a hash named 'seq' in your script, but you do have a local scalar variable named $seq (my $seq = $s[0]).

        Comment


        • #5
          Okay,how can I solve this?

          Can you please help me.

          Comment


          • #6
            The error is caused, because in the corresponding line you are refering to a hash ($seq{$_}), while you did not define a hash for seq in line 5;
            my $seq = ();

            In addition, your 'seq' hash is not filled anywhere.

            To solve this, I would recommend the following pseudocode;

            - Read your alignment, and store the results (as you did already)

            - Read contig file and go through each contig like the code belowbelow;
            Code:
            open(IN,$file);
            while(<IN>){
              chomp;
              $contigSeq.= $_ if(eof(IN));
              if (/\>(\S+)/ || eof(IN)){
                 my $head=$1;
                 if($contigSeq ne ''){
                   #$contigSeq is the contig sequence, $prevhead is your contig
                   my $len = length($contigSeq);
                   #Now print the results
                   print "$prevhead\t$len\t$hash{$prevhead}\t$contigSeq\n";
                 }
                 $prevhead = $head;
                 $contigSeq='';
              }else{
                 $contigSeq .= $_;
              }
            }
            close IN;
            Originally posted by bambus View Post
            Hi Mastal,thank you for the reply.

            Yes,the lines begin with 'HWI' as you mentioned because I modified the original SAM output file.In general the there are headers in Sam file,but it doesn't harm in anyway to the script.

            For instance an error message I am getting

            Global symbol "%seq" requires explicit package name at read_count.pl line 22.
            Execution of read_count.pl aborted due to compilation errors.

            Comment


            • #7
              I am sorry,i am not clear.

              So now how does this code works.Do I have to input two files (Mapped sam file to count contig frequency and contig file to extract the sequence)or?

              Comment


              • #8
                Originally posted by bambus View Post
                I am sorry,i am not clear.

                So now how does this code works.Do I have to input two files (Mapped sam file to count contig frequency and contig file to extract the sequence)or?
                Yes. The first part (reading the mapped sam file) you should do yourself. But basically, this is what you already have. The second part (the code shown above) is for extracting the contig sequence.

                Comment


                • #9
                  Yes I have an ouput file named 'read_count' with "contig ID" Number of Unicals mapped".

                  Now I also want to print out the "list of Unicals(sequences) that were mapped" "contig sequence" "contig length".

                  So I tried to execute your code with two input files

                  1) read count
                  2) contig_file as follows

                  perl contig_seq.pl read count contig_file >seq.txt

                  but,it gives me no output.

                  Comment


                  • #10
                    PM me your code and the two input files, so I can help you with that..

                    Comment


                    • #11
                      Originally posted by boetsie View Post
                      PM me your code and the two input files, so I can help you with that..
                      I am sorry I could attach only 1 file at a time
                      Attached Files

                      Comment


                      • #12
                        I did not test it, but I think it would be something like this. Run it as;

                        perl script <samfile> <contigfile>

                        Code:
                        #!usr/bin/perl-w
                        use strict;
                        use warnings;
                        
                        open(SAM,$ARGV[0]);
                        my %hash = ();
                        while(<SAM>){
                          chomp;
                          next if($_ =~ /^@/); ## remove the headers in sam file
                          my @s =split;
                          my $contig = $s[2];
                          $hash{$contig}++;
                        }
                        close SAM;
                        
                        open(CTG,$ARGV[1]);
                        my ($contigSeq,$prevhead) = ("","");
                        while(<CTG>){
                          chomp;
                          $contigSeq.= $_ if(eof(CTG));
                          if (/\>(\S+)/ || eof(CTG)){
                             my $head=$1;
                             if($contigSeq ne ''){
                               #$contigSeq is the contig sequence, $prevhead is your contig
                               my $len = length($contigSeq);
                               #Now print the results
                               print "$prevhead\t$len\t$hash{$prevhead}\t$contigSeq\n";
                             }
                             $prevhead = $head;
                             $contigSeq='';
                          }else{
                             $contigSeq .= $_;
                          }
                        }
                        close CTG;

                        Comment


                        • #13
                          Originally posted by bambus View Post
                          I am sorry I could attach only 1 file at a time

                          input file 1 with mapped reads
                          Attached Files

                          Comment


                          • #14
                            Originally posted by bambus View Post
                            input file 1 with mapped reads
                            Input file 2 -list of contigs
                            Attached Files

                            Comment


                            • #15
                              yes I did,and I got this following error many time with different line numbers


                              error

                              Use of uninitialized value in concatenation (.) or string at contig_table.pl line 27, <CTG> line 140568.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              17 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              46 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X