  • fastq with reads missing 3rd line

    Hi all,
    I have a set of fastq files. Some of the fastq files have reads which are missing the 3rd line (which begins with +).

    Code:
    @HWI-ST750:151:C1C6AACXX:5:2316:17997:100881 1:N:0:
    AGCGGTNCGCAATATTTTAGTAGCTCGTTACAGTCCGGTGCGTTTTTGGTTTTTTGAAAGTGCGTCTTCAGAGCGCTTTTGGTTTTCAAAAGCGCTCTGAAGTT+
    1:+B+0#2<DDDDIIIIIIFIIIIIIIIIIIIIEEIID?DDDBBADIII@DDDDD@@AAAAAAA??A?ADAAAA?????A>?8<>?>AAAAA8>>?><AAAA:4
    @HWI-ST750:151:C1C6AACXX:5:2316:19129:100793 1:N:0:
    AGCGCTNCTTGATATCAATCAACTGCTAGACAAATCCAATAGTAAATTGGGTAAACCAAATCTCGATATCGACAGCAAAGTATCACAATATGCCTATAACTACA+
    ;1=DD?#2ACDDDIDEEIIIEIEIEIEEIIIIDEIEIIIDEDDCEDDDEID?BBDIDIIIIEECDIDA@DD=A?D@DAAAA@DDDA>AAABE>AAAA>A>A>AA
    @HWI-ST750:151:C1C6AACXX:5:2316:19695:100854 1:N:0:
    AGCGCTNCACCGCGGTAAGCTTTAGCAGATCTCACTTTGTCTAGCGTTTGAACCATGTTTTCAAGGATATTGGCTCTAAGTTGTGGGTATTTTTCGATCACTTC+
    @<1DDD#2<DFDDGI@CEEGHIIIIIIIEGIIHCGHIIIHGGIGIIAFFHFHAHHIG?CCHFHEEBBC@CDCCCCACCCC5>CCBBB'>ACDECCBBDB7?CC>
    Also, the sequence line has a + at the end. I guess the 3rd line has been concatenated onto the end of the 2nd line.
    Any thoughts on how to proceed with this kind of data? Are there any scripts to convert it into the proper format?

  • #2
    If this problem is consistent throughout the file

    Code:
    sed 's/+$/\'$'\n''+/g' bad.fastq > good.fastq
    should do the trick.

    EDIT: I don't remember adding a newline with sed being this complicated, but I just tested on a Mac and this was required. It may be simpler on Linux, but I do not have a system here to test.
    Last edited by jiaco; 12-22-2012, 12:16 AM.

    • #3
      Thanks for the snippet, jiaco.
      I had tried this before, but it also messes up quality score lines that end with +.

      There are also reads with empty lines in between. My question is: what is the source of this kind of output? Is this some sort of sequencing error?

      Code:
      @@1D4A#2AFHHFIHIIIIIIIIIIIIIIIIIBHHIIIIIIIIIIIIIIIIIIIIIIIDEHIIIHFFHFEEBDEEECCCBBBCC?CB?CCCBBBBB@BBBBBBB/1
      
      @HWI-ST750:151:C1C6AACXX:5:2316:9996:50328/1
      GGCCCCNATACATTTACTGATTCATCCTCAGCGGACTCTGATATGACATCCACTAAAAAATATGTCAGACCACCACCAATGTTAACCTCACCTAATGACTTTCC+
      =71?A@#23CDCD@E@ED?FEFCEI<ECFEA>CDDD6?BDEEC9<DBEEIC<BEEIE3@8?;=>?BA>A:(;;@;=???3:>>D####################/1
      
      @HWI-ST750:151:C1C6AACXX:5:2316:9999:44022/1
      GGCCACNATCTCGATAATTATAAGATATCTTTAGCACAGGCAAATTGGAACGCAAGCGAAGTTTCGAAAAAGCTAGTAAATATTCAAACAGATGGGTCTATTTC+
      ???D;B#2ADDDDIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIEIIIIIIIIIIID?CDEDDD@AAA?AAAADEAEEDDDEDBA?AAAAA?A>?ADDD3/1

      • #4
        You could expand the expression to match
        Code:
        /^[ACGTN]*+$/
        to avoid quality lines, but I have no idea where you got the file, let alone how it got corrupted.
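A rough sketch of that idea, extended to track the position within each record so that a quality line ending in + is never touched. It assumes the file starts at a header line, that each read occupies exactly one sequence line and one quality line, and that the only damage is a missing separator line plus stray blank lines. The name repair_fastq is made up for illustration, and the pattern accepts only A, C, G, T and N as bases.

```shell
# Hypothetical helper (not from the thread): rebuild four-line FASTQ records.
repair_fastq() {
  awk '
    BEGIN { state = 0 }                      # 0=header, 1=sequence, 2=separator, 3=quality
    /^[[:space:]]*$/ { next }                # drop blank lines between records
    state == 0 { print; state = 1; next }    # header line, printed unchanged
    state == 1 {
      if ($0 ~ /^[ACGTNacgtn]+[+]$/) {       # sequence with the separator glued on
        print substr($0, 1, length($0) - 1)  # sequence without the trailing +
        print "+"                            # restored separator line
        state = 3
      } else {                               # sequence looks intact
        print
        state = 2
      }
      next
    }
    state == 2 { print; state = 3; next }    # separator already present
    state == 3 { print; state = 0; next }    # quality line, printed unchanged
  ' "$@"
}
```

Usage would look like `repair_fastq bad.fastq > good.fastq`. If the corruption goes beyond these assumptions, the output will still be wrong, so the result should be checked before it is used.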

        EDIT: I just saw your new example; there is an issue with this file. Maybe someone else has seen it before.
        But I would not try to fix this mess. You need to re-acquire the data.
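To judge how widespread the damage is before deciding what to do, a quick structural check can help. The sketch below counts lines that break the four-line layout: a header not starting with @, a separator not starting with +, a quality line whose length differs from its sequence line, or a truncated final record. check_fastq is a hypothetical name, and the check assumes unwrapped, four-line records.

```shell
# Hypothetical helper (not from the thread): count layout problems in a FASTQ file.
check_fastq() {
  awk '
    NR % 4 == 1 { if ($0 !~ /^@/) bad++ }            # header must start with @
    NR % 4 == 2 { seqlen = length($0) }              # remember the sequence length
    NR % 4 == 3 { if ($0 !~ /^[+]/) bad++ }          # separator must start with +
    NR % 4 == 0 { if (length($0) != seqlen) bad++ }  # quality length must match sequence
    END {
      if (NR % 4 != 0) bad++                         # file ends in the middle of a record
      print bad + 0                                  # total number of problems found
    }
  ' "$@"
}
```

Running `check_fastq bad.fastq` prints the problem count. A result of 0 only means the layout is consistent; it says nothing about whether reads ended up in the right sample after demultiplexing.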

        • #5
          Yes, the sequence files were given to me by our sequence provider, and I demultiplexed them. This is the result after demultiplexing, so maybe there is an issue with this. Anyway, I will contact them. Thanks for the suggestion.

          • #6
            What do the original files look like? What format?
            How did you demultiplex? Which program?

            Sven
