Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Converting Solexa FASTQ file to unique sequence tags

    Hello everyone,

    I'm trying to use DSAP: Deep-Sequencing Analysis Pipeline. In order to start processing your data the sequencing files must be in unique sequence tags instead of Solexa FASTQ format. The site suggests a PERL program associated with BioPieces called "read_solexa". I've attempted to install this program with no luck and errors saying no such program module found.

    Can someone suggest a program that will convert Solexa FASTQ files into unique sequence tags.


    Thanks,
    Brandon

  • #2
    Hello DrD2009,

    Are your sequences TagSeq or other else?
    If you just want to get the uniue sequence in whole length of reads, you can just use the command in Linux:
    awk '{if (NR%4==2){t[$1]++}}END{for (i in t){print i,t[i]}}' IN.SolexaReads.fastq > OUT.UniqTags.copyNumber

    Or if it is TagSeq in 35bp reads length, mostly the last 17bp ("TCGTAGCCGTCTTCTGC", DpnII Enzyme Solexa adapter) are adapter sequence. the other is tag reads.
    So you can try this:
    awk 'NR%4==2' IN.SolexaReads.fastq | cut -c 1-18 |sort |uniq -c |awk 'BEGIN{OFS="\t"}{print $2,$1}' >OUT.UniqTags.copyNumber
    Last edited by BENM; 06-28-2010, 12:22 AM.

    Comment


    • #3
      awk 'NR%4==2' in.fastq|sort |uniq -c|awk '{print $1"\t"$2}' > out

      Comment


      • #4
        @lh3:
        That worked, thanks.
        I really need to learn awk it seems to be extremely useful for these file manipulations.

        @BENM
        What I have are files in this format:

        @SEQ_ID
        GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
        +
        !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

        They are already trimmed so their length varies.

        And I need the data to be in this form:

        23445 ATCGCGGGAGCCCG
        23489 ACTGCTAGTGCATGCTATGTCA
        49732 ACTGTCATCGTAGCTCAC

        Where the first column is the number of unique counts and the sequence is comprises of the second column.

        Comment


        • #5
          How about:

          Code:
          #!/usr/bin/perl
          use warnings;
          use strict;
          
          my %seqs;
          
          while (<>) {
            if (/^\@/) {
              my $seq = <>;
              chomp $seq;
              ++$seqs{$seq};
            }
          }
          
          foreach my $seq (keys %seqs) {
            print join("\t",($seqs{$seq},$seq)),"\n";
          }
          Pass the file you want to process as an argument.

          perl this_script.pl some_fastq_file.txt

          Comment


          • #6
            Thanks Simon, I will give this a try as well and get back to you on it.

            The program I'm using, DSAP, suggests I use a program from Biopieces called 'read_solexa' to perform the conversion. After trying to install Biopieces and the module itself I had zero luck. So anything that will perform the same function as 'read_solexa' would be perfect.

            Comment


            • #7
              @DrD2009

              Go to www.biopieces.org

              Locate the link to Installation and follow the instructions step-by-step.

              Then you get what you want and much more!


              Cheers,


              Martin

              Comment


              • #8
                Also, the other suggestions in this thread don't take into account reverse-complement tags. You really want the pipeline suggested by DSAP:



                read_solexa -i INPUT.fastq |
                uniq_seq -c |
                sort_records -r -k SEQ_COUNTn |
                write_tab -k SEQ_COUNT,SEQ -xo OUTPUT.tag


                Martin

                Comment


                • #9
                  Originally posted by maasha View Post
                  Also, the other suggestions in this thread don't take into account reverse-complement tags. You really want the pipeline suggested by DSAP:



                  read_solexa -i INPUT.fastq |
                  uniq_seq -c |
                  sort_records -r -k SEQ_COUNTn |
                  write_tab -k SEQ_COUNT,SEQ -xo OUTPUT.tag


                  Martin
                  Martin,

                  I actually tried using biopieces before. I've tried installing it on two separate occasions and have never been able to get it to work. I'd like to think I'm relatively decent with running program in Unix systems, but this has defeated me.

                  Comment


                  • #10
                    Originally posted by DrD2009 View Post
                    Martin,

                    I actually tried using biopieces before. I've tried installing it on two separate occasions and have never been able to get it to work. I'd like to think I'm relatively decent with running program in Unix systems, but this has defeated me.
                    Now that is not good. I am very interested in making the Biopieces installation as smooth as possible. So, what seems to be the problem?

                    I am also thinking about writing a installer/setup script, but those always need to be tested on a lot of different platforms and may not be very robust.


                    Martin

                    Comment


                    • #11
                      Well, honestly I'm not sure. I'm used to just unzipping files and copying the binaries to the path so the way you install biopieces is a little new to me.

                      I think going through it this third time did the charm, although I don't really understand how to make use of the biopieces.

                      I'm running Ubuntu 10.04.

                      I already had Perl installed so then I installed the modules needed. This was my first problem I couldn't get any of the modules to install. After doing some searching I discovered the resolution. "sudo cpan", then I was able to install all of the modules.

                      Then,
                      Basically you can checkout a read-only version like this:

                      cd ~/ # or wherever you want the Biopiece source code tree rooted

                      svn checkout http://biopieces.googlecode.com/svn/trunk/ biopieces

                      After you have installed the source code tree, you need to install the wiki tree as well:

                      cd ~/biopieces

                      svn checkout http://biopieces.googlecode.com/svn/wiki bp_usage
                      I skipped this part:

                      If you are a recognized developer you simply add your username to the checkout command,

                      cd ~/ # or wherever you want the Biopiece source code tree rooted

                      svn checkout https://biopieces.googlecode.com/svn/trunk/ biopieces --username <username>

                      and when prompted you enter the password from



                      And for the wiki pages:

                      cd ~/biopieces

                      svn checkout https://biopieces.googlecode.com/svn/wiki bp_usage --username <username>
                      I next added the below to my .bashrc file

                      # >>>>>>>>>>>>>>>>>>>>>>> Enabling Biopieces if installed <<<<<<<<<<<<<<<<<<<<<<<

                      # Modify the below paths according to your settings.
                      # If you have followed the installation step-by-step as described above,
                      # the below should work just fine.

                      export BP_DIR="$HOME/biopieces" # Directory where biopieces are installed
                      export BP_DATA="$HOME/BP_DATA" # Contains genomic data etc.
                      export BP_TMP="$HOME/tmp" # Required temporary directory.
                      export BP_LOG="$HOME/BP_LOG" # Required log directory.

                      if [ -f "$BP_DIR/bp_conf/bashrc" ]; then
                      source "$BP_DIR/bp_conf/bashrc"
                      fi

                      alias bp_update="cd $BP_DIR && svn update; cd $BP_DIR/bp_usage && svn update;"

                      # >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
                      I created the needed directories. I used the
                      Code:
                      mkdir $BP_DATA $BP_TMP $BP_LOG
                      within the biopieces folder in my home directory. The BP_TMP became tmp in my home directory and BP_LOG and BP_DATA were also placed in the home directory.

                      After that, I tested Biopieces:
                      Code:
                      $BP_DIR/bp_test/test_all
                      Which resulted in:

                      Code:
                      brandon@brandon:~/biopieces/bp_test$ ./test_all
                      Testing add_ident -I /home/brandon/biopieces/bp_test/in/add_ident.in -O /home/brandon/tmp/add_ident.out ... OK
                      Testing add_ident -I /home/brandon/biopieces/bp_test/in/add_ident.in -k CUSTOM_KEY -O /home/brandon/tmp/add_ident.out ... OK
                      Testing add_ident -I /home/brandon/biopieces/bp_test/in/add_ident.in -p PREFIX -O /home/brandon/tmp/add_ident.out ... OK
                      Testing align_seq -I /home/brandon/biopieces/bp_test/in/align_seq.in -O /home/brandon/tmp/align_seq.out ... FAIL
                      Testing analyze_gc -I /home/brandon/biopieces/bp_test/in/analyze_gc.in -O /home/brandon/tmp/analyze_gc.out ... OK
                      Testing analyze_vals -I /home/brandon/biopieces/bp_test/in/analyze_vals.in -o /home/brandon/tmp/analyze_vals.out -x ... OK
                      Testing analyze_vals -I /home/brandon/biopieces/bp_test/in/analyze_vals.in -k V0 -o /home/brandon/tmp/analyze_vals.out -x ... OK
                      Testing analyze_vals -I /home/brandon/biopieces/bp_test/in/analyze_vals.in -K V0 -o /home/brandon/tmp/analyze_vals.out -x ... OK
                      Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -p SEQ -O /home/brandon/tmp/grab.out ... OK
                      Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -p SEQ,COUNT -O /home/brandon/tmp/grab.out ... OK
                      Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -P /home/brandon/biopieces/bp_test/in/grab.in.pat -O /home/brandon/tmp/grab.out ... OK
                      Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -p SEQ -i -O /home/brandon/tmp/grab.out ... OK
                      Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -p SEQ -K -O /home/brandon/tmp/grab.out ... OK
                      Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -p SEQ -V -O /home/brandon/tmp/grab.out ... OK
                      Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -p SEQ -k PAT -O /home/brandon/tmp/grab.out ... OK
                      Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -r a -k SEQ -O /home/brandon/tmp/grab.out ... OK
                      Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -r a -k SEQ -c -O /home/brandon/tmp/grab.out ... OK
                      Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -e 'SEQ_LEN<10' -O /home/brandon/tmp/grab.out ... OK
                      Testing lowercase_seq -I /home/brandon/biopieces/bp_test/in/lowercase_seq.in -O /home/brandon/tmp/lowercase_seq.out ... OK
                      Testing read_fasta -i /home/brandon/biopieces/bp_test/in/read_fasta.in -O /home/brandon/tmp/read_fasta.out ... OK
                      Testing read_fasta -i /home/brandon/biopieces/bp_test/in/read_fasta.in -n 1 -O /home/brandon/tmp/read_fasta.out ... OK
                      Testing read_solexa -i /home/brandon/biopieces/bp_test/in/read_solexa.in -O /home/brandon/tmp/read_solexa.out ... OK
                      Testing read_solexa -i /home/brandon/biopieces/bp_test/in/read_solexa.in -n 1 -O /home/brandon/tmp/read_solexa.out ... OK
                      Testing read_solexa -i /home/brandon/biopieces/bp_test/in/read_solexa.in -c -O /home/brandon/tmp/read_solexa.out ... OK
                      Testing read_solexa -i /home/brandon/biopieces/bp_test/in/read_solexa.in -s -O /home/brandon/tmp/read_solexa.out ... OK
                      Testing read_solexa -i /home/brandon/biopieces/bp_test/in/read_solexa.in -s -C 30 -O /home/brandon/tmp/read_solexa.out ... OK
                      Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.1 -O /home/brandon/tmp/read_tab.out ... OK
                      Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.1 -s 1 -O /home/brandon/tmp/read_tab.out ... OK
                      Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.1 -s 1 -k ORGANISM,SEQ,COUNT -O /home/brandon/tmp/read_tab.out ... OK
                      Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.1 -s 1 -c 2,1 -O /home/brandon/tmp/read_tab.out ... OK
                      Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.1 -s 1 -c 2,1 -k COUNT,SEQ -O /home/brandon/tmp/read_tab.out ... OK
                      Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.2 -O /home/brandon/tmp/read_tab.out ... OK
                      Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.2 -n 1 -O /home/brandon/tmp/read_tab.out ... OK
                      Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.3 -d ';' -O /home/brandon/tmp/read_tab.out ... OK
                      Testing swapcase_seq -I /home/brandon/biopieces/bp_test/in/swapcase_seq.in -O /home/brandon/tmp/swapcase_seq.out ... diff: /home/brandon/tmp/swapcase_seq.out: No such file or directory
                      OK
                      rm: cannot remove `/home/brandon/tmp/swapcase_seq.out': No such file or directory
                      Testing uppercase_seq -I /home/brandon/biopieces/bp_test/in/uppercase_seq.in -O /home/brandon/tmp/uppercase_seq.out ... OK
                      Testing write_fasta -I /home/brandon/biopieces/bp_test/in/write_fasta.in -o /home/brandon/tmp/write_fasta.out -x ... OK
                      Testing write_fasta -I /home/brandon/biopieces/bp_test/in/write_fasta.in -w 4 -o /home/brandon/tmp/write_fasta.out -x ... OK
                      Testing write_fasta -I /home/brandon/biopieces/bp_test/in/write_fasta.in -w 4 -Z -o /home/brandon/tmp/write_fasta.out.gz -x ... OK
                      Testing write_tab -I /home/brandon/biopieces/bp_test/in/write_tab.in -o /home/brandon/tmp/write_tab.out -x ... OK
                      Testing write_tab -I /home/brandon/biopieces/bp_test/in/write_tab.in -c -o /home/brandon/tmp/write_tab.out -x ... OK
                      Testing write_tab -I /home/brandon/biopieces/bp_test/in/write_tab.in -d ',' -o /home/brandon/tmp/write_tab.out -x ... OK
                      Testing write_tab -I /home/brandon/biopieces/bp_test/in/write_tab.in -Z -o /home/brandon/tmp/write_tab.out.gz -x ... OK
                      Testing write_tab -I /home/brandon/biopieces/bp_test/in/write_tab.in -k 'Count' -o /home/brandon/tmp/write_tab.out -x ... OK
                      Testing write_tab -I /home/brandon/biopieces/bp_test/in/write_tab.in -K 'Count' -o /home/brandon/tmp/write_tab.out -x ... OK
                      Biopieces tested: 13   Tests run: 45   OK: 44   FAIL: 1   Time: 7 secs
                      So I'm guessing it's installed correctly? I just need to understand how to run the programs now. I'm used to just typing the program in the command line and running it, but it seems biopieces operates a little different so I'm still trying to wrap my head around it.

                      Comment


                      • #12
                        Just a caveat, does unique tags deal with SNPs?
                        A tag could be the same, except one base; should these tags be counted as separate tags?
                        If there is a model genome, then maybe counting tags based on genome position?

                        Comment


                        • #13
                          @Drd2009,

                          I know that the installation of the Biopieces is a bit untypical since they are installed using a checkout from a code repository and not using some installer or tarball. The advantage of doing it this way is clearly on the developers side, but at the same time I promise to deal with bugs immediately.

                          Now, it looks like you have managed to deal with the installation almost perfectly, there is only one test FAIL - and that is actually my fault - I forgot to add Ruby to the prerequisites (as of last Friday!). apt-get install ruby1.9 or grab it from ruby-lang.org (I shall add this to the Installation Guide in a sec). Also, I did write in the Installation Guide that you might need admin rights to install CPAN modules so using 'sudo' is just perfect.

                          Now, for getting started using Biopieces there is the Introduction:



                          Feedback is welcome.

                          Cheers,


                          Martin

                          Comment


                          • #14
                            Originally posted by poisson200 View Post
                            Just a caveat, does unique tags deal with SNPs?
                            A tag could be the same, except one base; should these tags be counted as separate tags?
                            If there is a model genome, then maybe counting tags based on genome position?
                            uniq_seq does not deal with SNPs. It only checks for identical sequences (and reverse-complement). SNP-finding is a whole different game, where you need to take into account sequencing error/quality and gene copy number. There is a lot of papers on the subject.


                            Martin

                            Comment


                            • #15
                              Hi Martin,
                              I only mention SNPs as they invent new reads into your counts. So if you have a 50mer read called A, that maps to position X in genome, then read B that also maps to position X but differs to A by one SNP. Depending on why you are counting reads, it could be useful to treat A and B as the same read. Anyway, it could be not important, it was just an observation.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              18 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              17 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X