Seqanswers Leaderboard Ad

**BENM** · 06-26-2010, 10:24 PM

Hello DrD2009,

Are your sequences TagSeq or other else?
If you just want to get the uniue sequence in whole length of reads, you can just use the command in Linux:
awk '{if (NR%4==2){t[$1]++}}END{for (i in t){print i,t[i]}}' IN.SolexaReads.fastq > OUT.UniqTags.copyNumber

Or if it is TagSeq in 35bp reads length, mostly the last 17bp ("TCGTAGCCGTCTTCTGC", DpnII Enzyme Solexa adapter) are adapter sequence. the other is tag reads.
So you can try this:
awk 'NR%4==2' IN.SolexaReads.fastq | cut -c 1-18 |sort |uniq -c |awk 'BEGIN{OFS="\t"}{print $2,$1}' >OUT.UniqTags.copyNumber

**lh3** · 06-27-2010, 06:08 AM

awk 'NR%4==2' in.fastq|sort |uniq -c|awk '{print $1"\t"$2}' > out

**DrD2009** · 06-27-2010, 01:49 PM

@lh3:
That worked, thanks.
I really need to learn awk it seems to be extremely useful for these file manipulations.

@BENM
What I have are files in this format:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

They are already trimmed so their length varies.

And I need the data to be in this form:

23445 ATCGCGGGAGCCCG
23489 ACTGCTAGTGCATGCTATGTCA
49732 ACTGTCATCGTAGCTCAC

Where the first column is the number of unique counts and the sequence is comprises of the second column.

**simonandrews** · 06-27-2010, 11:16 PM

How about:

Code:

#!/usr/bin/perl
use warnings;
use strict;

my %seqs;

while (<>) {
  if (/^\@/) {
    my $seq = <>;
    chomp $seq;
    ++$seqs{$seq};
  }
}

foreach my $seq (keys %seqs) {
  print join("\t",($seqs{$seq},$seq)),"\n";
}

Pass the file you want to process as an argument.

perl this_script.pl some_fastq_file.txt

**DrD2009** · 06-28-2010, 10:31 AM

Thanks Simon, I will give this a try as well and get back to you on it.

The program I'm using, DSAP, suggests I use a program from Biopieces called 'read_solexa' to perform the conversion. After trying to install Biopieces and the module itself I had zero luck. So anything that will perform the same function as 'read_solexa' would be perfect.

**maasha** · 08-06-2010, 07:10 AM

@DrD2009

Go to www.biopieces.org

Locate the link to Installation and follow the instructions step-by-step.

Then you get what you want and much more!

Cheers,

Martin

**maasha** · 08-06-2010, 11:02 PM

Also, the other suggestions in this thread don't take into account reverse-complement tags. You really want the pipeline suggested by DSAP:

http://dsap.cgu.edu.tw/tutorial.html

read_solexa -i INPUT.fastq |
uniq_seq -c |
sort_records -r -k SEQ_COUNTn |
write_tab -k SEQ_COUNT,SEQ -xo OUTPUT.tag

Martin

**DrD2009** · 08-07-2010, 05:04 PM

Originally posted by maasha View Post

Also, the other suggestions in this thread don't take into account reverse-complement tags. You really want the pipeline suggested by DSAP:

http://dsap.cgu.edu.tw/tutorial.html

read_solexa -i INPUT.fastq |
uniq_seq -c |
sort_records -r -k SEQ_COUNTn |
write_tab -k SEQ_COUNT,SEQ -xo OUTPUT.tag

Martin

Martin,

I actually tried using biopieces before. I've tried installing it on two separate occasions and have never been able to get it to work. I'd like to think I'm relatively decent with running program in Unix systems, but this has defeated me.

**maasha** · 08-08-2010, 12:47 AM

Originally posted by DrD2009 View Post

Martin,

I actually tried using biopieces before. I've tried installing it on two separate occasions and have never been able to get it to work. I'd like to think I'm relatively decent with running program in Unix systems, but this has defeated me.

Now that is not good. I am very interested in making the Biopieces installation as smooth as possible. So, what seems to be the problem?

I am also thinking about writing a installer/setup script, but those always need to be tested on a lot of different platforms and may not be very robust.

Martin

**DrD2009** · 08-08-2010, 01:58 AM

Well, honestly I'm not sure. I'm used to just unzipping files and copying the binaries to the path so the way you install biopieces is a little new to me.

I think going through it this third time did the charm, although I don't really understand how to make use of the biopieces.

I'm running Ubuntu 10.04.

I already had Perl installed so then I installed the modules needed. This was my first problem I couldn't get any of the modules to install. After doing some searching I discovered the resolution. "sudo cpan", then I was able to install all of the modules.

Then,

Basically you can checkout a read-only version like this:

cd ~/ # or wherever you want the Biopiece source code tree rooted

svn checkout http://biopieces.googlecode.com/svn/trunk/ biopieces

After you have installed the source code tree, you need to install the wiki tree as well:

cd ~/biopieces

svn checkout http://biopieces.googlecode.com/svn/wiki bp_usage

I skipped this part:

If you are a recognized developer you simply add your username to the checkout command,

cd ~/ # or wherever you want the Biopiece source code tree rooted

svn checkout https://biopieces.googlecode.com/svn/trunk/ biopieces --username <username>

and when prompted you enter the password from

http://code.google.com/hosting/settings

And for the wiki pages:

cd ~/biopieces

svn checkout https://biopieces.googlecode.com/svn/wiki bp_usage --username <username>

I next added the below to my .bashrc file

# >>>>>>>>>>>>>>>>>>>>>>> Enabling Biopieces if installed <<<<<<<<<<<<<<<<<<<<<<<

# Modify the below paths according to your settings.
# If you have followed the installation step-by-step as described above,
# the below should work just fine.

export BP_DIR="$HOME/biopieces" # Directory where biopieces are installed
export BP_DATA="$HOME/BP_DATA" # Contains genomic data etc.
export BP_TMP="$HOME/tmp" # Required temporary directory.
export BP_LOG="$HOME/BP_LOG" # Required log directory.

if [ -f "$BP_DIR/bp_conf/bashrc" ]; then
source "$BP_DIR/bp_conf/bashrc"
fi

alias bp_update="cd $BP_DIR && svn update; cd $BP_DIR/bp_usage && svn update;"

# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

I created the needed directories. I used the

Code:

mkdir $BP_DATA $BP_TMP $BP_LOG

within the biopieces folder in my home directory. The BP_TMP became tmp in my home directory and BP_LOG and BP_DATA were also placed in the home directory.

After that, I tested Biopieces:

Code:

$BP_DIR/bp_test/test_all

Which resulted in:

Code:

brandon@brandon:~/biopieces/bp_test$ ./test_all
Testing add_ident -I /home/brandon/biopieces/bp_test/in/add_ident.in -O /home/brandon/tmp/add_ident.out ... OK
Testing add_ident -I /home/brandon/biopieces/bp_test/in/add_ident.in -k CUSTOM_KEY -O /home/brandon/tmp/add_ident.out ... OK
Testing add_ident -I /home/brandon/biopieces/bp_test/in/add_ident.in -p PREFIX -O /home/brandon/tmp/add_ident.out ... OK
Testing align_seq -I /home/brandon/biopieces/bp_test/in/align_seq.in -O /home/brandon/tmp/align_seq.out ... FAIL
Testing analyze_gc -I /home/brandon/biopieces/bp_test/in/analyze_gc.in -O /home/brandon/tmp/analyze_gc.out ... OK
Testing analyze_vals -I /home/brandon/biopieces/bp_test/in/analyze_vals.in -o /home/brandon/tmp/analyze_vals.out -x ... OK
Testing analyze_vals -I /home/brandon/biopieces/bp_test/in/analyze_vals.in -k V0 -o /home/brandon/tmp/analyze_vals.out -x ... OK
Testing analyze_vals -I /home/brandon/biopieces/bp_test/in/analyze_vals.in -K V0 -o /home/brandon/tmp/analyze_vals.out -x ... OK
Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -p SEQ -O /home/brandon/tmp/grab.out ... OK
Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -p SEQ,COUNT -O /home/brandon/tmp/grab.out ... OK
Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -P /home/brandon/biopieces/bp_test/in/grab.in.pat -O /home/brandon/tmp/grab.out ... OK
Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -p SEQ -i -O /home/brandon/tmp/grab.out ... OK
Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -p SEQ -K -O /home/brandon/tmp/grab.out ... OK
Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -p SEQ -V -O /home/brandon/tmp/grab.out ... OK
Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -p SEQ -k PAT -O /home/brandon/tmp/grab.out ... OK
Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -r a -k SEQ -O /home/brandon/tmp/grab.out ... OK
Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -r a -k SEQ -c -O /home/brandon/tmp/grab.out ... OK
Testing grab -I /home/brandon/biopieces/bp_test/in/grab.in -e 'SEQ_LEN<10' -O /home/brandon/tmp/grab.out ... OK
Testing lowercase_seq -I /home/brandon/biopieces/bp_test/in/lowercase_seq.in -O /home/brandon/tmp/lowercase_seq.out ... OK
Testing read_fasta -i /home/brandon/biopieces/bp_test/in/read_fasta.in -O /home/brandon/tmp/read_fasta.out ... OK
Testing read_fasta -i /home/brandon/biopieces/bp_test/in/read_fasta.in -n 1 -O /home/brandon/tmp/read_fasta.out ... OK
Testing read_solexa -i /home/brandon/biopieces/bp_test/in/read_solexa.in -O /home/brandon/tmp/read_solexa.out ... OK
Testing read_solexa -i /home/brandon/biopieces/bp_test/in/read_solexa.in -n 1 -O /home/brandon/tmp/read_solexa.out ... OK
Testing read_solexa -i /home/brandon/biopieces/bp_test/in/read_solexa.in -c -O /home/brandon/tmp/read_solexa.out ... OK
Testing read_solexa -i /home/brandon/biopieces/bp_test/in/read_solexa.in -s -O /home/brandon/tmp/read_solexa.out ... OK
Testing read_solexa -i /home/brandon/biopieces/bp_test/in/read_solexa.in -s -C 30 -O /home/brandon/tmp/read_solexa.out ... OK
Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.1 -O /home/brandon/tmp/read_tab.out ... OK
Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.1 -s 1 -O /home/brandon/tmp/read_tab.out ... OK
Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.1 -s 1 -k ORGANISM,SEQ,COUNT -O /home/brandon/tmp/read_tab.out ... OK
Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.1 -s 1 -c 2,1 -O /home/brandon/tmp/read_tab.out ... OK
Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.1 -s 1 -c 2,1 -k COUNT,SEQ -O /home/brandon/tmp/read_tab.out ... OK
Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.2 -O /home/brandon/tmp/read_tab.out ... OK
Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.2 -n 1 -O /home/brandon/tmp/read_tab.out ... OK
Testing read_tab -i /home/brandon/biopieces/bp_test/in/read_tab.in.3 -d ';' -O /home/brandon/tmp/read_tab.out ... OK
Testing swapcase_seq -I /home/brandon/biopieces/bp_test/in/swapcase_seq.in -O /home/brandon/tmp/swapcase_seq.out ... diff: /home/brandon/tmp/swapcase_seq.out: No such file or directory
OK
rm: cannot remove `/home/brandon/tmp/swapcase_seq.out': No such file or directory
Testing uppercase_seq -I /home/brandon/biopieces/bp_test/in/uppercase_seq.in -O /home/brandon/tmp/uppercase_seq.out ... OK
Testing write_fasta -I /home/brandon/biopieces/bp_test/in/write_fasta.in -o /home/brandon/tmp/write_fasta.out -x ... OK
Testing write_fasta -I /home/brandon/biopieces/bp_test/in/write_fasta.in -w 4 -o /home/brandon/tmp/write_fasta.out -x ... OK
Testing write_fasta -I /home/brandon/biopieces/bp_test/in/write_fasta.in -w 4 -Z -o /home/brandon/tmp/write_fasta.out.gz -x ... OK
Testing write_tab -I /home/brandon/biopieces/bp_test/in/write_tab.in -o /home/brandon/tmp/write_tab.out -x ... OK
Testing write_tab -I /home/brandon/biopieces/bp_test/in/write_tab.in -c -o /home/brandon/tmp/write_tab.out -x ... OK
Testing write_tab -I /home/brandon/biopieces/bp_test/in/write_tab.in -d ',' -o /home/brandon/tmp/write_tab.out -x ... OK
Testing write_tab -I /home/brandon/biopieces/bp_test/in/write_tab.in -Z -o /home/brandon/tmp/write_tab.out.gz -x ... OK
Testing write_tab -I /home/brandon/biopieces/bp_test/in/write_tab.in -k 'Count' -o /home/brandon/tmp/write_tab.out -x ... OK
Testing write_tab -I /home/brandon/biopieces/bp_test/in/write_tab.in -K 'Count' -o /home/brandon/tmp/write_tab.out -x ... OK
Biopieces tested: 13   Tests run: 45   OK: 44   FAIL: 1   Time: 7 secs

So I'm guessing it's installed correctly? I just need to understand how to run the programs now. I'm used to just typing the program in the command line and running it, but it seems biopieces operates a little different so I'm still trying to wrap my head around it.

**poisson200** · 08-08-2010, 03:12 AM

Just a caveat, does unique tags deal with SNPs?
A tag could be the same, except one base; should these tags be counted as separate tags?
If there is a model genome, then maybe counting tags based on genome position?

**maasha** · 08-08-2010, 03:24 AM

@Drd2009,

I know that the installation of the Biopieces is a bit untypical since they are installed using a checkout from a code repository and not using some installer or tarball. The advantage of doing it this way is clearly on the developers side, but at the same time I promise to deal with bugs immediately.

Now, it looks like you have managed to deal with the installation almost perfectly, there is only one test FAIL - and that is actually my fault - I forgot to add Ruby to the prerequisites (as of last Friday!). apt-get install ruby1.9 or grab it from ruby-lang.org (I shall add this to the Installation Guide in a sec). Also, I did write in the Installation Guide that you might need admin rights to install CPAN modules so using 'sudo' is just perfect.

Now, for getting started using Biopieces there is the Introduction:

500 Internal Server Error

http://code.google.com/p/biopieces/wiki/Introduction

Feedback is welcome.

Cheers,

Martin

**maasha** · 08-08-2010, 03:33 AM

Originally posted by poisson200 View Post

Just a caveat, does unique tags deal with SNPs?
A tag could be the same, except one base; should these tags be counted as separate tags?
If there is a model genome, then maybe counting tags based on genome position?

uniq_seq does not deal with SNPs. It only checks for identical sequences (and reverse-complement). SNP-finding is a whole different game, where you need to take into account sequencing error/quality and gene copy number. There is a lot of papers on the subject.

Martin

**poisson200** · 08-08-2010, 05:04 AM

Hi Martin,
I only mention SNPs as they invent new reads into your counts. So if you have a 50mer read called A, that maps to position X in genome, then read B that also maps to position X but differs to A by one SNP. Depending on why you are counting reads, it could be useful to treat A and B as the same read. Anyway, it could be not important, it was just an observation.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 18 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 17 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 49 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Converting Solexa FASTQ file to unique sequence tags

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News