  • Demultiplexing disabling FastQC

    Hi all,

    I got a big file of data back from the sequencing centre that went through FastQC fine, but after demultiplexing it into individuals, FastQC complains that the ID lines don't start with @. I've tried two demultiplexers (Stacks' process_radtags and GBSX) and it happens with both of them.

    So the demultiplexing step seems to be causing this, which is odd, as my other two datasets went through exactly the same process without problems.

    Has anyone got experience of why this might be?

    Cheers,
    Steve

  • #2
    Can you show a couple of fastq records from your demultiplexed files?



    • #3
      Picking one at random:


      more EL12.fq
      @5_1112_1374_2158_1
      TGCATAAAGGCTTGTAAATTGTAGCATGCAAAAATTATAACAATTAATTAAACAAAAACAAAGAAAGTAAGAACATAAGAACCTT
      +
      FFFBFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFBBFFFFFFFFFFFFBFFFFFFFFFFFFF<FFFFFFFFFFBFFF
      @5_1112_1498_2210_1
      TGCATGCGGAATGGTTTGTTCAATGCAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGC
      +
      FFFFFFFFF<FFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFF
      @5_1112_2030_2132_1
      TGCATACTACCTGTACATTCGGCAGATCATGCAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTC
      +
      FFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
      @5_1112_2477_2111_1
      TGCATTTCCATAATTTTTAAATTATTAGTCAATTGATTGAAAATGCAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGA
      +
      FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFF
      @5_1112_2381_2221_1
      TGCATGAAATGAATGAATTCTCAATGGAACAACTAGCCCACCATGATGTTATGCCAACTTACATGCAAGATCGGAAGAGCGGTTC
      +
      FFFFFFFFFFFFFFFFFFFFFFFFFBFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFF<BFFFFFBFFF
      @5_1112_2774_2200_1
      TGCATGGCAAGTCTCCCAATGCAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTGA
      +
      FFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFBFF
      @5_1112_3092_2050_1
      TGCATTATGACATCACAATATACATTATGACATCACAATATGCAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT
      +
      FFFFFFFFFFFFFFFBFFFFBFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFBFFFFFF
      @5_1112_3360_2165_1
      TGCATATCATGTACCTTGGGCTTAATCGGATACTGTGTGTACAGAATACTATGAGATGCTAAGGTTTGGAATATGAGATACTTAG
      +
      BFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFF
      @5_1112_3573_2051_1
      TGCATAATGGACAATGGTAGTACTAGTATCTATTTATAAAACAATTTGTATCTTGTTTTTGTGCCTTTATTCACGAAAAATCAGT
      +
      FFFFFF/FFFFFF<FFFFFFFFFFFFFFFFF<F/FFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFB/<BFBFFFFFBFFF
      @5_1112_4698_2165_1
      TGCATGGTAGGCATATACCTGTTTACTTGTGTTTAAATGCAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGT
      +
      FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFF
      @5_1112_4817_2244_1
      TGCATGCAATTAACAAAAAAAACACATAAAGTTCTACAGCCAGTGTCTTTCATTCAACAGGTTAAATCGAACTCTCTGTATATTG
      +
      FFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
      @5_1112_5056_2182_1
      TGCATTTAGAACTAACATATTTATTGGTACAGCTAGATGCACAGGGGTGAGATACGGCAATCGATGCATAAAACAAATGCGAAAA
      +
      FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFBFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFF
      @5_1112_5830_2069_1
      TGCATTGGGGTAGAATCTCAATTTTTTGACTTTGGCAAAAATTCAATTTTTTTGAGTATTTTCACAAACACATGATACCGATCAT
      +
      FFFFFFFFFFFFF/BFFFFFFFFFFFFFFFFFFFFFFBFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
      @5_1112_5767_2206_1
      TGCATATTTTCACTACTAGTCTCCAAAAGGTTAAAACTTGCAAATTAGGCCAATATTGACACCATCAGTAAAGGCTACAAGTGAT
      +
      FFFFFFFFFF/<FFBFFFFFFFFBFFFFFFFFFFFF<FBFF<FFFF<BFFFFFBFBFFB<BFF<BBFFFFFFFFFFFF/FFFFFF
      @5_1112_6137_2149_1
      TGCATGTCTAATTTTGACACCGCCTACACTAATCTAAATACACCCCAGGGTGCATGATATTGGCCAATGGGGTTTGAACTGAATG
      And going to the bottom:

      tail EL12.fq
      +
      FFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
      @5_1204_19608_101358_1
      TGCATGCTGGAGCATGTATGACTGTACCACATTTTCATGAAATGATGTCAAACATGCAACCATCATATCCACCAGGCAGATTAGT
      +
      FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
      @5_1204_20838_101289_1
      TGCATCTATCCCATGCCCAGGAGTTGACTGCCGACAGCAACTGTTTGTTTCCTGTCTTTCCTAAATGCTCCCTGCATAATACATG
      +
      FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
      And the FastQC output:

      Started analysis of EL12.fq
      Failed to process file EL12.fq
      uk.ac.babraham.FastQC.Sequence.SequenceFormatException: ID line didn't start with '@'
      at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:158)
      at uk.ac.babraham.FastQC.Sequence.FastQFile.next(FastQFile.java:125)
      at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:76)
      at java.lang.Thread.run(Thread.java:722)



      • #4
        Hi Steve,
        Maybe something went wrong along the way. You could check whether the number of lines is a multiple of four, or whether every fourth line (starting with the first) begins with an @:
        Code:
        awk 'NR%4==1{t[substr($1,1,1)]++}END{for(i in t){print i"\t"t[i]}}' EL12.fq
        NR mod 4 selects every fourth line (1, 5, 9, ...), and the associative array counts how often each first character occurs. (awk strings are indexed from 1, hence substr($1,1,1).) If your file has a flaw somewhere, you'll get something other than:
        @ <number of reads>
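
For anyone who prefers Python, here is a rough sketch of both checks described above (line count divisible by four, plus a tally of the first character on lines 1, 5, 9, ...). The function name header_char_tally and the two-record demo input are made up for illustration; they are not from the thread.

```python
from collections import Counter
import io


def header_char_tally(handle):
    """Return (total_lines, Counter of first characters on lines 1, 5, 9, ...)."""
    tally = Counter()
    total = 0
    for total, line in enumerate(handle, start=1):
        if total % 4 == 1:          # header position in a 4-line record
            tally[line[:1]] += 1
    return total, tally


# Made-up example: the second record has lost its header line.
demo = io.StringIO(
    "@read1\nACGT\n+\nFFFF\n"
    "ACGT\n+\nFFFF\n"
)
total, tally = header_char_tally(demo)
print("lines:", total, "multiple of four:", total % 4 == 0)
print(dict(tally))
```

On a real file you would pass an open file handle instead of the demo, e.g. `with open("EL12.fq") as fh: header_char_tally(fh)`. A clean file should give a line count divisible by four and a tally containing only "@".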

        Cheers,

        Michael
        Last edited by Michael.Ante; 03-14-2016, 03:37 AM. Reason: Typo



        • #5
          It appears that one (or more) fastq records must have gotten mangled in the demultiplexing process.

          You can download fastq validator (https://github.com/statgen/fastQValidator) and see if you can find out where the problem is (is it with all files?). Simon Andrews had posted a perl script to do something similar a while ago on SeqAnswers. I will see if I can find that post.



          • #6
            Originally posted by Michael.Ante View Post
            Hi Steve,
            Maybe something went wrong along the way. You could check whether the number of lines is a multiple of four, or whether every fourth line (starting with the first) begins with an @:
            Code:
            awk 'NR%4==1{t[substr($1,1,1)]++}END{for(i in t){print i"\t"t[i]}}' EL12.fq
            NR mod 4 selects every fourth line (1, 5, 9, ...), and the associative array counts how often each first character occurs. (awk strings are indexed from 1, hence substr($1,1,1).) If your file has a flaw somewhere, you'll get something other than:
            @ <number of reads>

            Cheers,

            Michael
            Hi Michael,

            It looks like you're right. Running the awk script gave:

            B 40
            + 331
            F 729
            T 134
            / 21
            < 20
            1 1
            @ 3163170
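
The tally shows that some header positions hold something else (1,276 of them here), but not where the first one is. A minimal sketch for finding that spot, assuming strict four-line records; first_bad_record and the demo data are hypothetical names and values, not part of any tool mentioned in this thread.

```python
import io
from itertools import islice


def first_bad_record(handle):
    """Read four lines at a time and return (line_number, header_line) for the
    first record whose 1st line lacks '@', whose 3rd line lacks '+', or which
    is cut short. Return None if every record passes."""
    line_no = 1
    while True:
        record = list(islice(handle, 4))
        if not record:
            return None
        if not record[0].startswith("@"):
            return line_no, record[0].rstrip("\n")
        if len(record) < 4 or not record[2].startswith("+"):
            return line_no, record[0].rstrip("\n")
        line_no += 4


# Made-up example: the header of the second record is missing.
demo = io.StringIO(
    "@read1\nACGT\n+\nFFFF\n"
    "ACGT\n+\nFFFF\n"
    "@read3\nACGT\n+\nFFFF\n"
)
result = first_bad_record(demo)
print(result)  # (5, 'ACGT')
```

The reported line number can then be inspected in the file itself. Because the check is frame-based, the real damage may sit a line or two before the reported position.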
            Originally posted by GenoMax View Post
            It appears that one (or more) fastq records must have gotten mangled in the demultiplexing process.

            You can download fastq validator (https://github.com/statgen/fastQValidator) and see if you can find out where the problem is (is it with all files?). Simon Andrews had posted a perl script to do something similar a while ago on SeqAnswers. I will see if I can find that post.
            Yep, the error occurs with every individual. Cheers, I'll give the validator a go.



            • #7
              @Bourney: Here is that post with Simon's code: http://seqanswers.com/forums/showpos...75&postcount=8



              • #8
                I had a look at Simon's Perl code. It also seems to throw an error if the quality string starts with an @ (quality 31 in Illumina 1.8; 0 in Illumina 1.3).
                You might lose too many reads if you run it as is.
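
To see where those two numbers come from: a quality character is just its ASCII code minus the encoding offset. A tiny worked example (the helper name phred is made up):

```python
def phred(char, offset):
    """Convert one FASTQ quality character to a Phred score for a given ASCII offset."""
    return ord(char) - offset


print(phred("@", 33))  # 31 -> Phred+33 (Illumina 1.8+)
print(phred("@", 64))  # 0  -> Phred+64 (Illumina 1.3 to 1.7)
print(phred("+", 33))  # 10 -> '+' is also a legal quality character in Phred+33
```

So neither a leading "@" nor a leading "+" identifies a header or separator line on its own; the position within the four-line record matters too.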



                • #9
                  @Michael.Ante: Good point. It should only require a minor update to the original code.

                  The following code (derived from Simon's example) should help pull out the IDs of problem fastq records and write them to a problem_id.txt file. Those records would then need to be dealt with separately.

                  Code:
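                  #!/usr/bin/perl
                  use warnings;
                  use strict;

                  # Write the first line of every malformed FASTQ record to problem_id.txt.
                  # Assumes strict 4-line records: after a dropped or extra line,
                  # every later record will be reported as well.
                  die "usage: file.pl <sequence.fq> \n" unless @ARGV == 1;
                  open(my $out, '>', 'problem_id.txt') or die "can't open the output file: $!\n";

                  while (my $id1 = <>) {
                    # Always consume a whole record, so a quality line that
                    # starts with '@' is not tested as if it were a header.
                    my $seq  = <>;
                    my $id2  = <>;
                    my $qual = <>;
                    chomp $id1;

                    # First line of the record must start with '@'.
                    if ($id1 !~ /^\@/) {
                      print $out "$id1\tmissing \@\n";
                      next;
                    }
                    # Third line must exist (file not truncated) and start with '+'.
                    if (!defined $id2 or $id2 !~ /^\+/) {
                      print $out "$id1\tmissing +\n";
                    }
                  }
                  close $out;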
                  Last edited by GenoMax; 03-14-2016, 10:43 AM.
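
If the file has slipped out of its four-line frame, a checker that always steps four lines forward will flag everything after the first problem. One possible alternative, sketched here in Python under my own assumptions (it is not Simon's or GenoMax's code, and split_records is a made-up name): accept a four-line window only when line 1 starts with "@", line 3 starts with "+" and the sequence and quality strings have equal length; otherwise set one line aside and slide forward by a single line to resynchronise.

```python
def split_records(lines):
    """Separate well-formed 4-line FASTQ records from stray lines.

    Returns (good, stray): good is a list of 4-line records, stray is a list
    of (1-based line number, text) for lines that fit into no valid record.
    """
    lines = [ln.rstrip("\n") for ln in lines]
    good, stray = [], []
    i = 0
    while i < len(lines):
        window = lines[i:i + 4]
        if (len(window) == 4
                and window[0].startswith("@")
                and window[2].startswith("+")
                and len(window[1]) == len(window[3])):
            good.append(window)
            i += 4                               # jump past the whole record
        else:
            stray.append((i + 1, lines[i]))
            i += 1                               # slide by one line to resync


    return good, stray


# Made-up example: record 2 lost its header and its quality starts with '@'.
demo = ["@r1", "ACGT", "+", "FFFF",
        "ACGT", "+", "@FFF",
        "@r3", "ACGT", "+", "FFFF"]
good, stray = split_records(demo)
print(len(good), "good records")
print(stray)
```

This sketch holds all lines in memory and its heuristics could in principle be fooled by an unlucky combination of lines, so treat the output as a list of places to inspect rather than a verdict.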



                  • #10
                    Originally posted by Michael.Ante View Post
                    I had a look at Simon's Perl code. It also seems to throw an error if the quality string starts with an @ (quality 31 in Illumina 1.8; 0 in Illumina 1.3).
                    You might lose too many reads if you run it as is.
                    Originally posted by GenoMax View Post
                    @Michael.Ante: Good point. It should only require a minor update to the original code.

                    The following code (derived from Simon's example) should help pull out the IDs of problem fastq records and write them to a problem_id.txt file. Those records would then need to be dealt with separately.
                    Cheers guys
