SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Converting FASTA/qual file pair from 454 to FASTQ oiiio Bioinformatics 9 01-01-2016 03:55 PM
Obtaining unique sequence tag file from fastQ format ramadatta.88 Introductions 0 09-26-2011 01:25 AM
Solexa - same sequence but unique identifier Layla Bioinformatics 5 11-27-2009 05:08 AM
Converting Solexa new format to FASTQ asafle Bioinformatics 3 08-01-2009 10:07 AM
solexa protocol: unique tags fabio25 Illumina/Solexa 0 12-22-2008 03:47 AM

Reply
 
Thread Tools
Old 06-26-2010, 03:16 PM   #1
DrD2009
Member
 
Location: Kansas City

Join Date: Oct 2009
Posts: 88
Default Converting Solexa FASTQ file to unique sequence tags

Hello everyone,

I'm trying to use DSAP: Deep-Sequencing Analysis Pipeline. In order to start processing your data the sequencing files must be in unique sequence tags instead of Solexa FASTQ format. The site suggests a PERL program associated with BioPieces called "read_solexa". I've attempted to install this program with no luck and errors saying no such program module found.

Can someone suggest a program that will convert Solexa FASTQ files into unique sequence tags.


Thanks,
Brandon
DrD2009 is offline   Reply With Quote
Old 06-26-2010, 10:24 PM   #2
BENM
Member
 
Location: PRC

Join Date: May 2009
Posts: 33
Default

Hello DrD2009,

Are your sequences TagSeq or other else?
If you just want to get the uniue sequence in whole length of reads, you can just use the command in Linux:
awk '{if (NR%4==2){t[$1]++}}END{for (i in t){print i,t[i]}}' IN.SolexaReads.fastq > OUT.UniqTags.copyNumber

Or if it is TagSeq in 35bp reads length, mostly the last 17bp ("TCGTAGCCGTCTTCTGC", DpnII Enzyme Solexa adapter) are adapter sequence. the other is tag reads.
So you can try this:
awk 'NR%4==2' IN.SolexaReads.fastq | cut -c 1-18 |sort |uniq -c |awk 'BEGIN{OFS="\t"}{print $2,$1}' >OUT.UniqTags.copyNumber

Last edited by BENM; 06-28-2010 at 12:22 AM.
BENM is offline   Reply With Quote
Old 06-27-2010, 06:08 AM   #3
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

awk 'NR%4==2' in.fastq|sort |uniq -c|awk '{print $1"\t"$2}' > out
lh3 is offline   Reply With Quote
Old 06-27-2010, 01:49 PM   #4
DrD2009
Member
 
Location: Kansas City

Join Date: Oct 2009
Posts: 88
Default

@lh3:
That worked, thanks.
I really need to learn awk it seems to be extremely useful for these file manipulations.

@BENM
What I have are files in this format:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

They are already trimmed so their length varies.

And I need the data to be in this form:

23445 ATCGCGGGAGCCCG
23489 ACTGCTAGTGCATGCTATGTCA
49732 ACTGTCATCGTAGCTCAC

Where the first column is the number of unique counts and the sequence is comprises of the second column.
DrD2009 is offline   Reply With Quote
Old 06-27-2010, 11:16 PM   #5
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

How about:

Code:
#!/usr/bin/perl
use warnings;
use strict;

my %seqs;

while (<>) {
  if (/^\@/) {
    my $seq = <>;
    chomp $seq;
    ++$seqs{$seq};
  }
}

foreach my $seq (keys %seqs) {
  print join("\t",($seqs{$seq},$seq)),"\n";
}
Pass the file you want to process as an argument.

perl this_script.pl some_fastq_file.txt
simonandrews is offline   Reply With Quote
Old 06-28-2010, 10:31 AM   #6
DrD2009
Member
 
Location: Kansas City

Join Date: Oct 2009
Posts: 88
Default

Thanks Simon, I will give this a try as well and get back to you on it.

The program I'm using, DSAP, suggests I use a program from Biopieces called 'read_solexa' to perform the conversion. After trying to install Biopieces and the module itself I had zero luck. So anything that will perform the same function as 'read_solexa' would be perfect.
DrD2009 is offline   Reply With Quote
Old 08-06-2010, 07:10 AM   #7
maasha
Senior Member
 
Location: Denmark

Join Date: Apr 2009
Posts: 153
Default

@DrD2009

Go to www.biopieces.org

Locate the link to Installation and follow the instructions step-by-step.

Then you get what you want and much more!


Cheers,


Martin
maasha is offline   Reply With Quote
Old 08-06-2010, 11:02 PM   #8
maasha
Senior Member
 
Location: Denmark

Join Date: Apr 2009
Posts: 153
Default

Also, the other suggestions in this thread don't take into account reverse-complement tags. You really want the pipeline suggested by DSAP:

http://dsap.cgu.edu.tw/tutorial.html

read_solexa -i INPUT.fastq |
uniq_seq -c |
sort_records -r -k SEQ_COUNTn |
write_tab -k SEQ_COUNT,SEQ -xo OUTPUT.tag


Martin
maasha is offline   Reply With Quote
Old 08-07-2010, 05:04 PM   #9
DrD2009
Member
 
Location: Kansas City

Join Date: Oct 2009
Posts: 88
Default

Quote:
Originally Posted by maasha View Post
Also, the other suggestions in this thread don't take into account reverse-complement tags. You really want the pipeline suggested by DSAP:

http://dsap.cgu.edu.tw/tutorial.html

read_solexa -i INPUT.fastq |
uniq_seq -c |
sort_records -r -k SEQ_COUNTn |
write_tab -k SEQ_COUNT,SEQ -xo OUTPUT.tag


Martin
Martin,

I actually tried using biopieces before. I've tried installing it on two separate occasions and have never been able to get it to work. I'd like to think I'm relatively decent with running program in Unix systems, but this has defeated me.
DrD2009 is offline   Reply With Quote
Old 08-08-2010, 12:47 AM   #10
maasha
Senior Member
 
Location: Denmark

Join Date: Apr 2009
Posts: 153
Default

Quote:
Originally Posted by DrD2009 View Post
Martin,

I actually tried using biopieces before. I've tried installing it on two separate occasions and have never been able to get it to work. I'd like to think I'm relatively decent with running program in Unix systems, but this has defeated me.
Now that is not good. I am very interested in making the Biopieces installation as smooth as possible. So, what seems to be the problem?

I am also thinking about writing a installer/setup script, but those always need to be tested on a lot of different platforms and may not be very robust.


Martin
maasha is offline   Reply With Quote
Old 08-08-2010, 01:58 AM   #11
DrD2009
Member
 
Location: Kansas City

Join Date: Oct 2009
Posts: 88
Default

Well, honestly I'm not sure. I'm used to just unzipping files and copying the binaries to the path so the way you install biopieces is a little new to me.

I think going through it this third time did the charm, although I don't really understand how to make use of the biopieces.

I'm running Ubuntu 10.04.

I already had Perl installed so then I installed the modules needed. This was my first problem I couldn't get any of the modules to install. After doing some searching I discovered the resolution. "sudo cpan", then I was able to install all of the modules.

Then,
Quote:
Basically you can checkout a read-only version like this:

cd ~/ # or wherever you want the Biopiece source code tree rooted

svn checkout http://biopieces.googlecode.com/svn/trunk/ biopieces

After you have installed the source code tree, you need to install the wiki tree as well:

cd ~/biopieces

svn checkout http://biopieces.googlecode.com/svn/wiki bp_usage
I skipped this part:

Quote:
If you are a recognized developer you simply add your username to the checkout command,

cd ~/ # or wherever you want the Biopiece source code tree rooted

svn checkout https://biopieces.googlecode.com/svn/trunk/ biopieces --username <username>

and when prompted you enter the password from

http://code.google.com/hosting/settings

And for the wiki pages:

cd ~/biopieces

svn checkout https://biopieces.googlecode.com/svn/wiki bp_usage --username <username>
I next added the below to my .bashrc file

Quote:
# >>>>>>>>>>>>>>>>>>>>>>> Enabling Biopieces if installed <<<<<<<<<<<<<<<<<<<<<<<

# Modify the below paths according to your settings.
# If you have followed the installation step-by-step as described above,
# the below should work just fine.

export BP_DIR="$HOME/biopieces" # Directory where biopieces are installed
export BP_DATA="$HOME/BP_DATA" # Contains genomic data etc.
export BP_TMP="$HOME/tmp" # Required temporary directory.
export BP_LOG="$HOME/BP_LOG" # Required log directory.

if [ -f "$BP_DIR/bp_conf/bashrc" ]; then
source "$BP_DIR/bp_conf/bashrc"
fi

alias bp_update="cd $BP_DIR && svn update; cd $BP_DIR/bp_usage && svn update;"

# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
I created the needed directories. I used the
Code:
mkdir $BP_DATA $BP_TMP $BP_LOG
within the biopieces folder in my home directory. The BP_TMP became tmp in my home directory and BP_LOG and BP_DATA were also placed in the home directory.

After that, I tested Biopieces:
Code:
$BP_DIR/bp_test/test_all
Which resulted in:

Code:
brandon@brandon:~/biopieces/bp_test$ ./test_all
Testing add_ident -I /home/brandon/biopieces/bp_test/in/add_ident.in -O /home/brandon/tmp/add_ident.out ... OK
Testing add_ident -I /home/brandon/biopieces/bp_test/in/add_ident.in -k CUSTOM_KEY -O /home/brandon/tmp/add_ident.out ... OK
Testing add_ident -I /home/brandon/biopieces/bp_test/in/add_ident.in -p PREFIX -O /home/brandon/tmp/add_ident.out ... OK
Testing align_seq -I /home/brandon/biopieces/bp_test/in/align_seq.in -O /home/brandon/tmp/align_seq.out ... FAIL
Testing analyze_gc -I /home/brandon/biopieces/bp_test/in/analyze_gc.in -O /home/brandon/tmp/analyze_gc.out ... OK
Testing analyze_vals -I /home/brandon/biopieces/bp_test/in/analyze_vals.in -o /home/brandon/tmp/analyze_vals.out -x ... OK
Testing analyze_vals -I /home/brandon/biopieces/bp_test/in/analyze_vals.in -k V0 -o /home/brandon/tmp/analyze_vals.out -x ... OK
Testing analyze_vals -I /home/brandon/biopieces/bp_test/in/analyze_vals.in -K V0 -o /home/brandon/tmp/analyze_vals.out -x ... OK
Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -p SEQ -O /home/brandon/tmp/grab.out ... OK
Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -p SEQ,COUNT -O /home/brandon/tmp/grab.out ... OK
Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -P /home/brandon/biopieces/bp_test/in/grab.in.pat -O /home/brandon/tmp/grab.out ... OK
Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -p SEQ -i -O /home/brandon/tmp/grab.out ... OK
Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -p SEQ -K -O /home/brandon/tmp/grab.out ... OK
Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -p SEQ -V -O /home/brandon/tmp/grab.out ... OK
Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -p SEQ -k PAT -O /home/brandon/tmp/grab.out ... OK
Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -r a -k SEQ -O /home/brandon/tmp/grab.out ... OK
Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -r a -k SEQ -c -O /home/brandon/tmp/grab.out ... OK
Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -e 'SEQ_LEN<10' -O /home/brandon/tmp/grab.out ... OK
Testing lowercase_seq -I /home/brandon/biopieces/bp_test/in/lowercase_seq.in -O /home/brandon/tmp/lowercase_seq.out ... OK
Testing read_fasta -i /home/brandon/biopieces/bp_test/in/read_fasta.in -O /home/brandon/tmp/read_fasta.out ... OK
Testing read_fasta -i /home/brandon/biopieces/bp_test/in/read_fasta.in -n 1 -O /home/brandon/tmp/read_fasta.out ... OK
Testing read_solexa -i /home/brandon/biopieces/bp_test/in/read_solexa.in -O /home/brandon/tmp/read_solexa.out ... OK
Testing read_solexa -i /home/brandon/biopieces/bp_test/in/read_solexa.in -n 1 -O /home/brandon/tmp/read_solexa.out ... OK
Testing read_solexa -i /home/brandon/biopieces/bp_test/in/read_solexa.in -c -O /home/brandon/tmp/read_solexa.out ... OK
Testing read_solexa -i /home/brandon/biopieces/bp_test/in/read_solexa.in -s -O /home/brandon/tmp/read_solexa.out ... OK
Testing read_solexa -i /home/brandon/biopieces/bp_test/in/read_solexa.in -s -C 30 -O /home/brandon/tmp/read_solexa.out ... OK
Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.1 -O /home/brandon/tmp/read_tab.out ... OK
Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.1 -s 1 -O /home/brandon/tmp/read_tab.out ... OK
Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.1 -s 1 -k ORGANISM,SEQ,COUNT -O /home/brandon/tmp/read_tab.out ... OK
Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.1 -s 1 -c 2,1 -O /home/brandon/tmp/read_tab.out ... OK
Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.1 -s 1 -c 2,1 -k COUNT,SEQ -O /home/brandon/tmp/read_tab.out ... OK
Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.2 -O /home/brandon/tmp/read_tab.out ... OK
Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.2 -n 1 -O /home/brandon/tmp/read_tab.out ... OK
Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.3 -d ';' -O /home/brandon/tmp/read_tab.out ... OK
Testing swapcase_seq -I /home/brandon/biopieces/bp_test/in/swapcase_seq.in -O /home/brandon/tmp/swapcase_seq.out ... diff: /home/brandon/tmp/swapcase_seq.out: No such file or directory
OK
rm: cannot remove `/home/brandon/tmp/swapcase_seq.out': No such file or directory
Testing uppercase_seq -I /home/brandon/biopieces/bp_test/in/uppercase_seq.in -O /home/brandon/tmp/uppercase_seq.out ... OK
Testing write_fasta -I /home/brandon/biopieces/bp_test/in/write_fasta.in -o /home/brandon/tmp/write_fasta.out -x ... OK
Testing write_fasta -I /home/brandon/biopieces/bp_test/in/write_fasta.in -w 4 -o /home/brandon/tmp/write_fasta.out -x ... OK
Testing write_fasta -I /home/brandon/biopieces/bp_test/in/write_fasta.in -w 4 -Z -o /home/brandon/tmp/write_fasta.out.gz -x ... OK
Testing write_tab -I /home/brandon/biopieces/bp_test/in/write_tab.in -o /home/brandon/tmp/write_tab.out -x ... OK
Testing write_tab -I /home/brandon/biopieces/bp_test/in/write_tab.in -c -o /home/brandon/tmp/write_tab.out -x ... OK
Testing write_tab -I /home/brandon/biopieces/bp_test/in/write_tab.in -d ',' -o /home/brandon/tmp/write_tab.out -x ... OK
Testing write_tab -I /home/brandon/biopieces/bp_test/in/write_tab.in -Z -o /home/brandon/tmp/write_tab.out.gz -x ... OK
Testing write_tab -I /home/brandon/biopieces/bp_test/in/write_tab.in -k 'Count' -o /home/brandon/tmp/write_tab.out -x ... OK
Testing write_tab -I /home/brandon/biopieces/bp_test/in/write_tab.in -K 'Count' -o /home/brandon/tmp/write_tab.out -x ... OK
Biopieces tested: 13   Tests run: 45   OK: 44   FAIL: 1   Time: 7 secs
So I'm guessing it's installed correctly? I just need to understand how to run the programs now. I'm used to just typing the program in the command line and running it, but it seems biopieces operates a little different so I'm still trying to wrap my head around it.
DrD2009 is offline   Reply With Quote
Old 08-08-2010, 03:12 AM   #12
poisson200
Member
 
Location: united kingdom

Join Date: Feb 2010
Posts: 63
Default

Just a caveat, does unique tags deal with SNPs?
A tag could be the same, except one base; should these tags be counted as separate tags?
If there is a model genome, then maybe counting tags based on genome position?
poisson200 is offline   Reply With Quote
Old 08-08-2010, 03:24 AM   #13
maasha
Senior Member
 
Location: Denmark

Join Date: Apr 2009
Posts: 153
Default

@Drd2009,

I know that the installation of the Biopieces is a bit untypical since they are installed using a checkout from a code repository and not using some installer or tarball. The advantage of doing it this way is clearly on the developers side, but at the same time I promise to deal with bugs immediately.

Now, it looks like you have managed to deal with the installation almost perfectly, there is only one test FAIL - and that is actually my fault - I forgot to add Ruby to the prerequisites (as of last Friday!). apt-get install ruby1.9 or grab it from ruby-lang.org (I shall add this to the Installation Guide in a sec). Also, I did write in the Installation Guide that you might need admin rights to install CPAN modules so using 'sudo' is just perfect.

Now, for getting started using Biopieces there is the Introduction:

http://code.google.com/p/biopieces/wiki/Introduction

Feedback is welcome.

Cheers,


Martin
maasha is offline   Reply With Quote
Old 08-08-2010, 03:33 AM   #14
maasha
Senior Member
 
Location: Denmark

Join Date: Apr 2009
Posts: 153
Default

Quote:
Originally Posted by poisson200 View Post
Just a caveat, does unique tags deal with SNPs?
A tag could be the same, except one base; should these tags be counted as separate tags?
If there is a model genome, then maybe counting tags based on genome position?
uniq_seq does not deal with SNPs. It only checks for identical sequences (and reverse-complement). SNP-finding is a whole different game, where you need to take into account sequencing error/quality and gene copy number. There is a lot of papers on the subject.


Martin
maasha is offline   Reply With Quote
Old 08-08-2010, 05:04 AM   #15
poisson200
Member
 
Location: united kingdom

Join Date: Feb 2010
Posts: 63
Default

Hi Martin,
I only mention SNPs as they invent new reads into your counts. So if you have a 50mer read called A, that maps to position X in genome, then read B that also maps to position X but differs to A by one SNP. Depending on why you are counting reads, it could be useful to treat A and B as the same read. Anyway, it could be not important, it was just an observation.
poisson200 is offline   Reply With Quote
Old 08-08-2010, 10:13 PM   #16
DrD2009
Member
 
Location: Kansas City

Join Date: Oct 2009
Posts: 88
Default

Martin,

Thanks for telling me how to fix the one failed problem. Actually I understand how to use Biopieces. For some reason when I type programs I've used before I could start the name of it in the terminal command and then just tab over and it would complete the program name instead of having to type the complete name. Biopieces doesn't seem to do that, but when I just type the full name and enter it works perfectly. That was where my confusion was.

Thanks again. I look forward to working with Biopieces.


-Brandon
DrD2009 is offline   Reply With Quote
Old 08-08-2010, 11:30 PM   #17
maasha
Senior Member
 
Location: Denmark

Join Date: Apr 2009
Posts: 153
Default

Quote:
Originally Posted by DrD2009 View Post
Martin,

For some reason when I type programs I've used before I could start the name of it in the terminal command and then just tab over and it would complete the program name instead of having to type the complete name. Biopieces doesn't seem to do that, but when I just type the full name and enter it works perfectly.

-Brandon
Tab completion should work. That has nothing to do with Biopieces, but is a feature of the shell (BASH on Ubuntu IIRC). Check if $BP_DIR/bp_bin/ is included in your $PATH. Perhaps you need to "source ~/.bashrc" or relogin.


Martin
maasha is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:30 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO