**pmiguel** · 10-14-2011, 04:24 AM

BTW, SOLiD does no read filtering, so the worst reads, taken from beads sitting at the very edge of the flowcell, make the reads at the beginning and end of the files very low quality. You might want to try a test on reads pulled from the middle of the file. If those are okay, just filter your input data by throwing out reads that have missing-data (".") bases. It's not really worth the effort to get a conversion program to stop choking on garbage reads.

--
Phillip

**maubp** · 10-14-2011, 05:15 AM

Your color space FASTA file:

Code:

# Title: Corrida_16_01RMDSPFR004_1
>1117_10_107_F3
T02...0..03.120.2...3.300..00.3.2..31.203.3...1..03
>1117_10_146_F3
T30...2..10.303.2...2.110..11.0.1..32.033.1...2..33
>1117_10_1017_F3
T32...1..30.210.3...2.013..01.2.0..23.233.2...0..33
>1117_11_136_F3
T20...3..30.203.2...0.232..31.2.32.22.222.3...0..03

Each read is a leading base followed by fifty colour calls, total length 51.

Your color space QUAL file:

Code:

# Title: Corrida_16_01RMDSPFR004_1
>1117_10_107_F3
23 31 -1 -1 -1 29 -1 -1 20 32 -1 18 25 7 -1 6 -1 -1 -1 30 -1 20 13 7 -1 -1 21 30 -1 24 -1 22 -1 -1 22 14 -1 12 26 21 -1 5 -1 -1 -1 20 -1 -1 12 28 
>1117_10_146_F3
20 33 -1 -1 -1 29 -1 -1 28 28 -1 7 16 5 -1 30 -1 -1 -1 14 -1 4 13 4 -1 -1 11 13 -1 5 -1 7 -1 -1 10 16 -1 4 12 15 -1 8 -1 -1 -1 16 -1 -1 10 4 
>1117_10_1017_F3
33 33 -1 -1 -1 27 -1 -1 17 16 -1 28 24 11 -1 6 -1 -1 -1 29 -1 8 29 24 -1 -1 8 8 -1 20 -1 13 -1 -1 8 13 -1 28 10 24 -1 10 -1 -1 -1 4 -1 -1 7 6 
>1117_11_136_F3
16 22 -1 -1 -1 33 -1 -1 30 27 -1 27 28 32 -1 29 -1 -1 -1 27 -1 18 9 6 -1 -1 23 16 -1 26 -1 5 7 -1 22 7 -1 18 14 8 -1 8 -1 -1 -1 11 -1 -1 4 24

These have 50 quality scores, as expected. I'm not sure why there are some -1 scores, since PHRED only goes down to zero, but I would expect your FASTQ to look like this (treating those as PHRED 0, which becomes ! in FASTQ):

Code:

@1117_10_107_F3
T02...0..03.120.2...3.300..00.3.2..31.203.3...1..03
+
8@!!!>!!5A!3:(!'!!!?!5.(!!6?!9!7!!7/!-;6!&!!!5!!-=
@1117_10_146_F3
T30...2..10.303.2...2.110..11.0.1..32.033.1...2..33
+
5B!!!>!!==!(1&!?!!!/!%.%!!,.!&!(!!+1!%-0!)!!!1!!+%
@1117_10_1017_F3
T32...1..30.210.3...2.013..01.2.0..23.233.2...0..33
+
BB!!!<!!21!=9,!'!!!>!)>9!!))!5!.!!).!=+9!+!!!%!!('
@1117_11_136_F3
T20...3..30.203.2...0.232..31.2.32.22.222.3...0..03
+
17!!!B!!?<!<=A!>!!!<!3*'!!81!;!&(!7(!3/)!)!!!,!!%9
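
As a rough illustration of the encoding used above, here is a minimal Python sketch. The function names are hypothetical, and clamping -1 to PHRED 0 is the assumption made in this post, not something any particular converter is known to do. It mirrors the layout shown here: a 51-character colour-space read (leading base plus 50 colours) with 50 quality characters.

```python
def qual_to_fastq_string(scores, offset=33):
    """Encode integer PHRED scores as a Sanger-style FASTQ quality string.

    Negative scores (the -1 values that accompany "." colour calls) are
    clamped to PHRED 0, which encodes as '!'. This is an assumption for
    illustration only.
    """
    return "".join(chr(max(0, q) + offset) for q in scores)


def csfasta_qual_to_fastq(name, cs_read, qual_line):
    """Build one FASTQ record from a colour-space read and its QUAL line."""
    scores = [int(q) for q in qual_line.split()]
    # The read is one leading base plus one colour call per quality score.
    if len(scores) != len(cs_read) - 1:
        raise ValueError(
            f"{name}: {len(scores)} scores for {len(cs_read) - 1} colours"
        )
    return f"@{name}\n{cs_read}\n+\n{qual_to_fastq_string(scores)}\n"
```

For example, the first ten scores of the first read (23 31 -1 -1 -1 29 -1 -1 20 32) encode to `8@!!!>!!5A`, matching the start of the quality line above.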

**pepperoni** · 10-14-2011, 05:17 AM

Ok thanks pmiguel, I'll try that

**maubp** · 10-14-2011, 05:21 AM

Originally posted by pmiguel

BTW, SOLiD does no read filtering, so the worst reads, taken from beads sitting at the very edge of the flowcell, make the reads at the beginning and end of the files very low quality. You might want to try a test on reads pulled from the middle of the file. If those are okay, just filter your input data by throwing out reads that have missing-data (".") bases. It's not really worth the effort to get a conversion program to stop choking on garbage reads.

If that is the problem, it does seem worth reporting it and getting it fixed, to stop someone else wasting their time on this kind of issue.

My guess is that solid2fastq from maq doesn't like these -1 quality scores.

**pepperoni** · 10-14-2011, 07:33 AM

Hello again, I have already removed the sequences with dots in the .csfasta file and created a file with the list of IDs.
>1117_22_215_F3
T32332201112312003133333333333333333033333333333103
>1117_22_218_F3
T13321013031133113333112332130011113223331203321333
>1117_22_388_F3
T32022222220031010131122221332210302310301030210322

Now I need to choose the corresponding lines in the .qual file.

I tried to convert the .qual file into .tab first but it removed the spaces:

original .qual
>1117_10_107_F3
23 31 -1 -1 -1 29 -1 -1 20 32 -1 18 25 7 -1 6 -1 -1 -1 30 -1 20 13 7 -1 -1 21 30 -1 24 -1 22 -1 -1 22 14 -1 12 26 21 -1 5 -1 -1 -1 20 -1 -1 12 28
>1117_10_146_F3
20 33 -1 -1 -1 29 -1 -1 28 28 -1 7 16 5 -1 30 -1 -1 -1 14 -1 4 13 4 -1 -1 11 13 -1 5 -1 7 -1 -1 10 16 -1 4 12 15 -1 8 -1 -1 -1 16 -1 -1 10 4

.tab

1117_10_107_F3 2331-1-1-129-1-12032-118257-16-1-1-130-120137-1-12130-124-122-1-12214-1122621-15-1-1-120-1-11228
1117_10_146_F3 2033-1-1-129-1-12828-17165-130-1-1-114-14134-1-11113-15-17-1-11016-141215-18-1-1-116-1-1104

Does anyone know how I can choose the corresponding .qual data?
thanks

Alejandra

**pmiguel** · 10-14-2011, 08:03 AM

Hi Alejandra,
Previously I was just tossing out ideas.
But, originally you wanted to pull a set of records out of a fastq file. For this I would recommend cdbfasta/cdbyank.
Phillip

**maubp** · 10-14-2011, 08:34 AM

Originally posted by pepperoni

Does anyone know how I can choose the corresponding .qual data?

It is quite possible given basic scripting/programming skills. What languages are you learning?

If Biopython didn't regard your QUAL file as invalid (something I have tweaked for the next release), you could use the script I originally posted for "sff" or "fastq", but substitute "qual" for the file format.

My personal preference is to combine FASTA+QUAL into FASTQ as early as possible, to avoid all the headaches of keeping them in sync for filtering or trimming operations.
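
For picking out the matching .qual records without Biopython, a minimal plain-Python sketch could look like the following. It assumes the ">" header lines carry the read IDs, as in the files quoted in this thread, and the function names are hypothetical. Because it never touches the data lines, the spaces between quality scores are preserved.

```python
def read_ids(lines):
    """Collect record IDs from the '>' header lines of a FASTA-style file."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])
    return ids


def filter_records(lines, wanted):
    """Yield only the records whose ID is in `wanted`, lines left untouched."""
    keep = False
    for line in lines:
        if line.startswith("#"):
            continue  # skip "# Title: ..." comment lines
        if line.startswith(">"):
            parts = line[1:].split()
            keep = bool(parts) and parts[0] in wanted
        if keep:
            yield line
```

With real files this might be used as `wanted = read_ids(open("filtered.csfasta"))` followed by writing `filter_records(open("reads.qual"), wanted)` to a new file; both file names are placeholders.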

**maasha** · 10-14-2011, 11:54 AM

@maubp OK, I wrote an installer for Biopieces. Feedback welcome (not here).

**pepperoni** · 10-17-2011, 09:09 AM

Originally posted by maubp

It is quite possible given basic scripting/programming skills. What languages are you learning?

If Biopython didn't regard your QUAL file as invalid (something I have tweaked for the next release), you could use the script I originally posted for "sff" or "fastq", but substitute "qual" for the file format.

My personal preference is to combine FASTA+QUAL into FASTQ as early as possible, to avoid all the headaches of keeping them in sync for filtering or trimming operations.

Yes Phillip, originally I wanted to extract some sequences from a fastq file. I tried the strategies that were recommended in this thread and I got the same error with all of them: "the quality values are longer than the sequences".

Since one reason could be that the conversion from .csfasta & .qual to .fastq has mistakes and may not handle the non-called bases "." very well, I was trying to remove the dots before converting them to fastq.

For that purpose I removed the reads with dots from the .csfasta and tried your scripts, Peter, to extract the corresponding .qual data, but the scripts regard the QUAL format as invalid. Then I tried some Perl scripts from the Scriptome, but they are for FASTA and cannot handle the spaces in the second row. Any suggestions? Or does anyone know what I can change in the following script made for FASTA? I know very, very little programming.

perl -e ' ($id,$fasta)=@ARGV; open(ID,$id); while (<ID>) { s/\r?\n//; /^>?(\S+)/; $ids{$1}++; } $num_ids = keys %ids; open(F, $fasta); $s_read = $s_wrote = $print_it = 0; while (<F>) { if (/^>(\S+)/) { $s_read++; if ($ids{$1}) { $s_wrote++; $print_it = 1; delete $ids{$1} } else { $print_it = 0 } }; if ($print_it) { print $_ } }; END { warn "Searched $s_read FASTA records.\nFound $s_wrote IDs out of $num_ids in the ID list.\n" } ' id_list a.fsa > found.fsa

thanks

**maubp** · 10-17-2011, 09:19 AM

Hi pepperoni,

Looking at your .csfasta & .qual files, do you also see lots of -1 quality scores? My guess is those are what is breaking your conversion to FASTQ.

Peter

**pepperoni** · 10-17-2011, 09:42 AM

Originally posted by maubp

Hi pepperoni,

Looking at your .csfasta & .qual files, do you also see lots of -1 quality scores? My guess is those are what is breaking your conversion to FASTQ.

Peter

Yes I do, and I guess they correspond to the dots in the csfasta, don't they? That's why it would be better to remove them before converting, isn't it?

**pepperoni** · 10-17-2011, 09:58 AM

Originally posted by pepperoni

Yes I do, and I guess they correspond to the dots in the csfasta, don't they? That's why it would be better to remove them before converting, isn't it?

The script that I posted before actually worked (it didn't work earlier because of memory problems). So now I have my .csfasta & .qual without the dots and -1s. I'll proceed and post my results.
thank you all

**maubp** · 10-17-2011, 10:01 AM

Originally posted by pepperoni

Yes I do, and I guess they correspond to the dots in the csfasta, don't they? That's why it would be better to remove them before converting, isn't it?

Looks like the dots and the -1 quality scores do go together, yes.

I don't think you can just remove the individual dots, but as I've never worked with color-space data first hand, hopefully someone on here can give a more authoritative answer.

**pmiguel** · 10-17-2011, 10:10 AM

Hi Peter,
You have to discard the entire read (and possibly the read pair, depending on your downstream processing), not just the base.
The dots are failures to collect data on the bead for that cycle. There are rare, but painful, cases where a single cycle fails for one reason or another for all the beads in a flow cell, but all the other cycles are okay. However, except in these rare cases, I don't think there is a compelling reason to keep reads that have dots in them. They are probably junk.
That said, if your software transparently deals with them, you can keep them around. But the decision to denote them with negative quality values seems unfortunate to me.

--
Phillip

**maubp** · 10-17-2011, 11:07 AM

Originally posted by pmiguel

But the decision to denote them with negative quality values seems unfortunate to me.

Very misguided, given PHRED zero would have been fine for this.

Thanks for the information. I'm not sure what off the shelf solution to recommend here - personally I'd write a Python script to filter out these duff reads...
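
A minimal sketch of such a filter, assuming the .csfasta and .qual files list the same reads in the same order with one data line per record (as in the excerpts in this thread); the function names are hypothetical. It drops the whole read whenever the colour-space sequence contains a ".", and keeps the two files in sync by reading them in lockstep.

```python
def records(lines):
    """Yield (name, data) pairs from two-line records, skipping '#' comments."""
    name = None
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            name = line[1:]
        elif name is not None:
            yield name, line
            name = None


def drop_reads_with_missing_calls(csfasta_lines, qual_lines):
    """Yield (name, read, quals) for reads with no '.' colour calls."""
    pairs = zip(records(csfasta_lines), records(qual_lines))
    for (name, read), (qual_name, quals) in pairs:
        if name != qual_name:
            raise ValueError(f"files out of sync: {name} vs {qual_name}")
        if "." in read:
            continue  # discard the entire read, not just the missing call
        yield name, read, quals
```

Each surviving tuple can then be written back out as a `>name` line plus data line to a new .csfasta and a new .qual file. Note that `zip` stops at the shorter input, so a truncated file would go unnoticed in this sketch.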
