
  • #16
    BTW, SOLiD does no read filtering, so the worst reads, taken from beads sitting at the very edge of the flowcell, make the reads at the beginning and end of the files very low quality. You might want to try a test on reads pulled from the middle of the file. If those are okay, just filter your input data by throwing out reads that have missing data (".") bases. It's not really worth the effort to get a conversion program to stop choking on garbage reads.

    --
    Phillip
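One way to try the "reads from the middle of the file" test suggested above is sketched below. It is only an illustration: the function names `read_records` and `middle_sample` are made up here, and it assumes each record is one `>` header line followed by exactly one data line, as in the examples in this thread.

```python
from itertools import islice


def read_records(lines):
    """Yield (name, body) pairs from csfasta- or qual-style lines.

    Assumes one body line per record; '#' comment lines are skipped.
    """
    name = None
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            name = line[1:]
        elif name is not None:
            yield name, line
            name = None


def middle_sample(lines, skip, take):
    """Return `take` records after skipping the first `skip` records."""
    return list(islice(read_records(lines), skip, skip + take))
```

Passing an open file handle as `lines` streams the file, so a large `skip` value does not load the whole run into memory. The same function works on the .qual file, so the two samples stay aligned if both files list the reads in the same order.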



    • #17
      Your color space FASTA file:
      Code:
      # Title: Corrida_16_01RMDSPFR004_1
      >1117_10_107_F3
      T02...0..03.120.2...3.300..00.3.2..31.203.3...1..03
      >1117_10_146_F3
      T30...2..10.303.2...2.110..11.0.1..32.033.1...2..33
      >1117_10_1017_F3
      T32...1..30.210.3...2.013..01.2.0..23.233.2...0..33
      >1117_11_136_F3
      T20...3..30.203.2...0.232..31.2.32.22.222.3...0..03
      Each record is a leading base plus fifty colour calls, for a total length of 51.

      Your color space QUAL file:
      Code:
      # Title: Corrida_16_01RMDSPFR004_1
      >1117_10_107_F3
      23 31 -1 -1 -1 29 -1 -1 20 32 -1 18 25 7 -1 6 -1 -1 -1 30 -1 20 13 7 -1 -1 21 30 -1 24 -1 22 -1 -1 22 14 -1 12 26 21 -1 5 -1 -1 -1 20 -1 -1 12 28 
      >1117_10_146_F3
      20 33 -1 -1 -1 29 -1 -1 28 28 -1 7 16 5 -1 30 -1 -1 -1 14 -1 4 13 4 -1 -1 11 13 -1 5 -1 7 -1 -1 10 16 -1 4 12 15 -1 8 -1 -1 -1 16 -1 -1 10 4 
      >1117_10_1017_F3
      33 33 -1 -1 -1 27 -1 -1 17 16 -1 28 24 11 -1 6 -1 -1 -1 29 -1 8 29 24 -1 -1 8 8 -1 20 -1 13 -1 -1 8 13 -1 28 10 24 -1 10 -1 -1 -1 4 -1 -1 7 6 
      >1117_11_136_F3
      16 22 -1 -1 -1 33 -1 -1 30 27 -1 27 28 32 -1 29 -1 -1 -1 27 -1 18 9 6 -1 -1 23 16 -1 26 -1 5 7 -1 22 7 -1 18 14 8 -1 8 -1 -1 -1 11 -1 -1 4 24
      Each record has 50 quality scores, as expected. I'm not sure why some of the scores are -1, since PHRED only goes down to zero, but I would expect your FASTQ to look like this (treating those as PHRED 0, which becomes ! in FASTQ):
      Code:
      @1117_10_107_F3
      T02...0..03.120.2...3.300..00.3.2..31.203.3...1..03
      +
      8@!!!>!!5A!3:(!'!!!?!5.(!!6?!9!7!!7/!-;6!&!!!5!!-=
      @1117_10_146_F3
      T30...2..10.303.2...2.110..11.0.1..32.033.1...2..33
      +
      5B!!!>!!==!(1&!?!!!/!%.%!!,.!&!(!!+1!%-0!)!!!1!!+%
      @1117_10_1017_F3
      T32...1..30.210.3...2.013..01.2.0..23.233.2...0..33
      +
      BB!!!<!!21!=9,!'!!!>!)>9!!))!5!.!!).!=+9!+!!!%!!('
      @1117_11_136_F3
      T20...3..30.203.2...0.232..31.2.32.22.222.3...0..03
      +
      17!!!B!!?<!<=A!>!!!<!3*'!!81!;!&(!7(!3/)!)!!!,!!%9
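The conversion shown above can be sketched in a few lines of Python. This is a minimal sketch, not what any existing converter does: the function names are hypothetical, negative scores are clamped to PHRED 0 as described in this post, and the usual Sanger offset of 33 is assumed. It mirrors the layout above, where the sequence line keeps the leading base and is therefore one character longer than the quality line.

```python
def qual_to_fastq_string(qual_line, offset=33):
    """Encode space-separated PHRED scores as a FASTQ quality string.

    Negative scores (the -1 placeholders) are clamped to PHRED 0, i.e. '!'.
    """
    scores = [int(tok) for tok in qual_line.split()]
    return "".join(chr(max(score, 0) + offset) for score in scores)


def to_fastq_record(name, colours, qual_line):
    """Build one four-line FASTQ record from a csfasta read and its scores."""
    quality = qual_to_fastq_string(qual_line)
    # The colour string starts with one leading base, which has no score.
    if len(quality) != len(colours) - 1:
        raise ValueError(
            f"{name}: {len(quality)} scores for {len(colours) - 1} colour calls"
        )
    return f"@{name}\n{colours}\n+\n{quality}\n"
```

For example, `qual_to_fastq_string("23 31 -1 -1 -1 29")` gives `8@!!!>`, matching the start of the first quality line above.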



      • #18
        Ok thanks pmiguel, I'll try that



        • #19
          Originally posted by pmiguel
          BTW, SOLiD does no read filtering, so the worst reads, taken from beads sitting at the very edge of the flowcell, make the reads at the beginning and end of the files very low quality. You might want to try a test on reads pulled from the middle of the file. If those are okay, just filter your input data by throwing out reads that have missing data (".") bases. It's not really worth the effort to get a conversion program to stop choking on garbage reads.
          If that is the problem, it does seem worth reporting it and getting it fixed to stop someone else wasting their time with this kind of issue.

          My guess is solid2fastq from maq doesn't like these -1 quality scores.



          • #20
            Hello again, I have already removed the sequences with dots in the .csfasta file and created a file with the list of IDs.
            >1117_22_215_F3
            T32332201112312003133333333333333333033333333333103
            >1117_22_218_F3
            T13321013031133113333112332130011113223331203321333
            >1117_22_388_F3
            T32022222220031010131122221332210302310301030210322

            Now I need to choose the corresponding lines in the .qual file.

            I tried to convert the .qual file into .tab first but it removed the spaces:

            original .qual
            >1117_10_107_F3
            23 31 -1 -1 -1 29 -1 -1 20 32 -1 18 25 7 -1 6 -1 -1 -1 30 -1 20 13 7 -1 -1 21 30 -1 24 -1 22 -1 -1 22 14 -1 12 26 21 -1 5 -1 -1 -1 20 -1 -1 12 28
            >1117_10_146_F3
            20 33 -1 -1 -1 29 -1 -1 28 28 -1 7 16 5 -1 30 -1 -1 -1 14 -1 4 13 4 -1 -1 11 13 -1 5 -1 7 -1 -1 10 16 -1 4 12 15 -1 8 -1 -1 -1 16 -1 -1 10 4

            .tab

            1117_10_107_F3 2331-1-1-129-1-12032-118257-16-1-1-130-120137-1-12130-124-122-1-12214-1122621-15-1-1-120-1-11228
            1117_10_146_F3 2033-1-1-129-1-12828-17165-130-1-1-114-14134-1-11113-15-17-1-11016-141215-18-1-1-116-1-1104

            Does anyone know how I can choose the corresponding .qual data?
            thanks

            Alejandra
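Selecting the matching .qual records does not require converting to .tab at all. A minimal sketch, assuming the filtered .csfasta is used as the source of wanted IDs and that headers are identical in both files (the function names are hypothetical): every kept line is passed through unchanged, so the spaces between scores survive.

```python
def ids_from_csfasta(lines):
    """Return the set of read IDs found on '>' header lines."""
    return {line[1:].strip() for line in lines if line.startswith(">")}


def select_qual(qual_lines, wanted):
    """Yield QUAL lines (header and scores, untouched) for IDs in `wanted`."""
    keep = False
    for line in qual_lines:
        if line.startswith("#"):
            continue  # drop '# Title: ...' comment lines
        if line.startswith(">"):
            keep = line[1:].strip() in wanted
        if keep:
            yield line
```

With real files, the output of `select_qual` can be written straight to a new .qual file with `writelines`. Only the ID set is held in memory, and the .qual file itself is read line by line.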



            • #21
              Hi Alejandra,
              Previously I was just tossing out ideas.
              But, originally you wanted to pull a set of records out of a fastq file. For this I would recommend cdbfasta/cdbyank.
              Phillip



              • #22
                Originally posted by pepperoni
                Does anyone know how I can choose the corresponding .qual data?
                It is quite possible given basic scripting/programming skills. What languages are you learning?

                If Biopython didn't regard your QUAL file as invalid (something I have tweaked for the next release), you could use the script I originally posted for "sff" or "fastq", but substitute "qual" for the file format.

                My personal preference is to combine FASTA+QUAL into FASTQ as early as possible, to avoid all the headaches of keeping them in sync for filtering or trimming operations.



                • #23
                  @maubp OK, I wrote an installer for Biopieces. Feedback welcome (not here).



                  • #24
                    Originally posted by maubp
                    It is quite possible given basic scripting/programming skills. What languages are you learning?

                    If Biopython didn't regard your QUAL file as invalid (something I have tweaked for the next release), you could use the script I originally posted for "sff" or "fastq", but substitute "qual" for the file format.

                    My personal preference is to combine FASTA+QUAL into FASTQ as early as possible, to avoid all the headaches of keeping them in sync for filtering or trimming operations.
                    Yes Phillip, originally I wanted to extract some sequences from a fastq file. I tried the strategies recommended in this thread and got the same error with all of them: "the quality values are longer than the sequences".

                    One possible reason is that the conversion from .csfasta & .qual to .fastq goes wrong because it does not handle the non-called bases (".") well, so I was trying to remove the reads with dots before converting to fastq.

                    For that purpose I removed the sequences with dots from the .csfasta and tried your scripts, Peter, to extract the corresponding .qual data, but the scripts regard the QUAL format as invalid. Then I tried some Perl scripts from the Scriptome, but they are written for FASTA and cannot handle the spaces in the second row. Any suggestions? Or does anyone know what I can change in the following script, which was made for FASTA? I know very, very little programming.

                    perl -e ' ($id,$fasta)=@ARGV; open(ID,$id); while (<ID>) { s/\r?\n//; /^>?(\S+)/; $ids{$1}++; } $num_ids = keys %ids; open(F, $fasta); $s_read = $s_wrote = $print_it = 0; while (<F>) { if (/^>(\S+)/) { $s_read++; if ($ids{$1}) { $s_wrote++; $print_it = 1; delete $ids{$1} } else { $print_it = 0 } }; if ($print_it) { print $_ } }; END { warn "Searched $s_read FASTA records.\nFound $s_wrote IDs out of $num_ids in the ID list.\n" } ' id_list a.fsa > found.fsa


                    thanks



                    • #25
                      Hi pepperoni,

                      Looking at your .csfasta & .qual files, do you also see lots of -1 quality scores? My guess is those are what is breaking your conversion to FASTQ.

                      Peter



                      • #26
                        Originally posted by maubp
                        Hi pepperoni,

                        Looking at your .csfasta & .qual files, do you also see lots of -1 quality scores? My guess is those are what is breaking your conversion to FASTQ.

                        Peter
                        Yes I do, and I guess they correspond to the dots in the csfasta, don't they? That's why it would be better to remove them before converting, isn't it?



                        • #27
                          Originally posted by pepperoni
                          Yes I do, and I guess they correspond to the dots in the csfasta, don't they? That's why it would be better to remove them before converting, isn't it?
                          The script that I posted before actually worked (it didn't work earlier because of memory problems). So now I have my .csfasta & .qual without the dots and -1s. I'll proceed and post my results.
                          thank you all



                          • #28
                            Originally posted by pepperoni
                            Yes I do, and I guess they correspond to the dots in the csfasta, don't they? That's why it would be better to remove them before converting, isn't it?
                            Looks like the dots and the -1 quality scores do go together, yes.

                            I don't think you can just remove them, but as I've never worked with color-space data first hand, hopefully someone on here can give a more authoritative answer.



                            • #29
                              Hi Peter,
                              You have to discard the entire read (and possibly the read pair, depending on your downstream processing), not just the base.
                              The dots are failures to collect data on the bead for that cycle. There are rare, but painful, cases where a single cycle fails for one reason or another for all the beads in a flow cell, but all the other cycles are okay. However, except in these rare cases, I don't think there is a compelling reason to keep reads that have dots in them. They are probably junk.
                              That said, if your software transparently deals with them, you can keep them around. But the decision to denote them with negative quality values seems unfortunate to me.

                              --
                              Phillip
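If whole read pairs need to go, the decision has to be made per bead rather than per read. The sketch below is only an illustration under one loud assumption: that both mates of a pair share the same bead ID and differ only in the final tag after the last underscore (for example `_F3` and `_R3`). Only `_F3` reads appear in this thread, so check the naming in your own files. The function names are hypothetical.

```python
def bead_id(read_name):
    """Strip the tag suffix: '1117_10_107_F3' -> '1117_10_107'."""
    return read_name.rsplit("_", 1)[0]


def clean_beads(reads):
    """Return bead IDs for which every supplied read is free of '.' calls.

    `reads` is an iterable of (read_name, colour_string) pairs and may mix
    reads from both mate files.
    """
    seen, bad = set(), set()
    for name, colours in reads:
        bead = bead_id(name)
        seen.add(bead)
        if "." in colours:
            bad.add(bead)  # one bad mate condemns the whole bead
    return seen - bad
```

Note that a bead with only one mate present counts as clean if that single read is clean. Whether such orphans should be kept depends on the downstream processing.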



                              • #30
                                Originally posted by pmiguel
                                But the decision to denote them with negative quality values seems unfortunate to me.
                                Very misguided, given PHRED zero would have been fine for this.

                                Thanks for the information. I'm not sure what off the shelf solution to recommend here - personally I'd write a Python script to filter out these duff reads...
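A Python filter of the kind mentioned above could look roughly like this. It is a sketch under stated assumptions, not a tested tool: both files must list the same reads in the same order, each record must be one header line plus one data line, and the function names are invented for illustration. It drops any read that has a "." colour call or a negative quality score, and keeps the two files in sync by walking them together.

```python
def records(lines):
    """Yield (name, body) pairs, skipping '#' comments and blank lines."""
    name = None
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            name = line[1:]
        elif name is not None and line and not line.startswith("#"):
            yield name, line
            name = None


def filter_reads(csfasta_lines, qual_lines):
    """Yield (name, colours, quals) for reads with no missing data."""
    paired = zip(records(csfasta_lines), records(qual_lines))
    for (name, colours), (qname, quals) in paired:
        if name != qname:
            raise ValueError(f"files out of sync: {name} vs {qname}")
        if "." in colours:
            continue  # missing colour call
        if any(int(tok) < 0 for tok in quals.split()):
            continue  # -1 placeholder score
        yield name, colours, quals
```

Because `zip` stops at the shorter input, a truncated file is not reported. Comparing record counts beforehand would catch that case.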
