SEQanswers

Old 03-14-2016, 02:14 AM   #1
Bourney
Junior Member
 
Location: South England

Join Date: Oct 2014
Posts: 5
Demultiplexing disabling FastQC

Hi all,

I got a big file of data back from the sequencing centre that worked fine when I put it through FastQC, but after demultiplexing it into individuals, FastQC complains that the ID lines don't start with @. I've used two demultiplexers (Stacks' process_radtags and GBSX) and it occurs with both of them.

So the demultiplexing step seems to be causing this, which is odd, as it worked fine with my other two datasets, which went through exactly the same process.

Has anyone got experience of why this might be?

Cheers,
Steve
Old 03-14-2016, 02:59 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,961

Can you show a couple of fastq records from your demultiplexed files?
Old 03-14-2016, 03:16 AM   #3
Bourney
Junior Member
 
Location: South England

Join Date: Oct 2014
Posts: 5

Picking one at random:


Quote:
more EL12.fq
@5_1112_1374_2158_1
TGCATAAAGGCTTGTAAATTGTAGCATGCAAAAATTATAACAATTAATTAAACAAAAACAAAGAAAGTAAGAACATAAGAACCTT
+
FFFBFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFBBFFFFFFFFFFFFBFFFFFFFFFFFFF<FFFFFFFFFFBFFF
@5_1112_1498_2210_1
TGCATGCGGAATGGTTTGTTCAATGCAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGC
+
FFFFFFFFF<FFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFF
@5_1112_2030_2132_1
TGCATACTACCTGTACATTCGGCAGATCATGCAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTC
+
FFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@5_1112_2477_2111_1
TGCATTTCCATAATTTTTAAATTATTAGTCAATTGATTGAAAATGCAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFF
@5_1112_2381_2221_1
TGCATGAAATGAATGAATTCTCAATGGAACAACTAGCCCACCATGATGTTATGCCAACTTACATGCAAGATCGGAAGAGCGGTTC
+
FFFFFFFFFFFFFFFFFFFFFFFFFBFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFF<BFFFFFBFFF
@5_1112_2774_2200_1
TGCATGGCAAGTCTCCCAATGCAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTGA
+
FFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFBFF
@5_1112_3092_2050_1
TGCATTATGACATCACAATATACATTATGACATCACAATATGCAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT
+
FFFFFFFFFFFFFFFBFFFFBFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFBFFFFFF
@5_1112_3360_2165_1
TGCATATCATGTACCTTGGGCTTAATCGGATACTGTGTGTACAGAATACTATGAGATGCTAAGGTTTGGAATATGAGATACTTAG
+
BFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFF
@5_1112_3573_2051_1
TGCATAATGGACAATGGTAGTACTAGTATCTATTTATAAAACAATTTGTATCTTGTTTTTGTGCCTTTATTCACGAAAAATCAGT
+
FFFFFF/FFFFFF<FFFFFFFFFFFFFFFFF<F/FFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFB/<BFBFFFFFBFFF
@5_1112_4698_2165_1
TGCATGGTAGGCATATACCTGTTTACTTGTGTTTAAATGCAAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFF
@5_1112_4817_2244_1
TGCATGCAATTAACAAAAAAAACACATAAAGTTCTACAGCCAGTGTCTTTCATTCAACAGGTTAAATCGAACTCTCTGTATATTG
+
FFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@5_1112_5056_2182_1
TGCATTTAGAACTAACATATTTATTGGTACAGCTAGATGCACAGGGGTGAGATACGGCAATCGATGCATAAAACAAATGCGAAAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFBFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFF
@5_1112_5830_2069_1
TGCATTGGGGTAGAATCTCAATTTTTTGACTTTGGCAAAAATTCAATTTTTTTGAGTATTTTCACAAACACATGATACCGATCAT
+
FFFFFFFFFFFFF/BFFFFFFFFFFFFFFFFFFFFFFBFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@5_1112_5767_2206_1
TGCATATTTTCACTACTAGTCTCCAAAAGGTTAAAACTTGCAAATTAGGCCAATATTGACACCATCAGTAAAGGCTACAAGTGAT
+
FFFFFFFFFF/<FFBFFFFFFFFBFFFFFFFFFFFF<FBFF<FFFF<BFFFFFBFBFFB<BFF<BBFFFFFFFFFFFF/FFFFFF
@5_1112_6137_2149_1
TGCATGTCTAATTTTGACACCGCCTACACTAATCTAAATACACCCCAGGGTGCATGATATTGGCCAATGGGGTTTGAACTGAATG
And going to the bottom:

Quote:
tail EL12.fq
+
FFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@5_1204_19608_101358_1
TGCATGCTGGAGCATGTATGACTGTACCACATTTTCATGAAATGATGTCAAACATGCAACCATCATATCCACCAGGCAGATTAGT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@5_1204_20838_101289_1
TGCATCTATCCCATGCCCAGGAGTTGACTGCCGACAGCAACTGTTTGTTTCCTGTCTTTCCTAAATGCTCCCTGCATAATACATG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
And the FastQC output:

Quote:
Started analysis of EL12.fq
Failed to process file EL12.fq
uk.ac.babraham.FastQC.Sequence.SequenceFormatException: ID line didn't start with '@'
at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:158)
at uk.ac.babraham.FastQC.Sequence.FastQFile.next(FastQFile.java:125)
at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:76)
at java.lang.Thread.run(Thread.java:722)
Old 03-14-2016, 03:36 AM   #4
Michael.Ante
Senior Member
 
Location: Vienna

Join Date: Oct 2011
Posts: 121

Hi Steve,
Maybe something went wrong along the way. You could check whether the number of lines is a multiple of four, or whether every fourth line (starting with the first) begins with an @:
Code:
awk 'NR%4==1{t[substr($0,1,1)]++}END{for(i in t){print i"\t"t[i]}}' EL12.fq
With NR mod 4 you pick every fourth line (1, 5, 9, ...), and with the associative array you count the occurrences of each first character. If your file has a flaw somewhere, you'll get something other than:
@ #reads
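
For anyone who prefers Python, the same check could be sketched roughly as below. This is only an illustration of the idea described above, not the awk one-liner itself; the function name `first_char_counts` is made up for the example, and it additionally reports whether the total line count is a multiple of four.

```python
from collections import Counter


def first_char_counts(lines):
    """Count the first character of lines 1, 5, 9, ... and check the line total.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Returns (counts, total_is_multiple_of_four).
    """
    counts = Counter()
    total = 0
    for number, line in enumerate(lines, start=1):
        total = number
        if number % 4 == 1:
            # An empty line is counted under the empty string.
            counts[line.rstrip("\n")[:1]] += 1
    return counts, total % 4 == 0
```

It can be called on an open file, for example `with open("EL12.fq") as fh: counts, complete = first_char_counts(fh)`. For an intact file `counts` should contain only the `@` key.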

Cheers,

Michael

Last edited by Michael.Ante; 03-14-2016 at 03:37 AM. Reason: Typo
Old 03-14-2016, 03:36 AM   #5
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,961

It appears that one (or more) fastq records must have gotten mangled in the demultiplexing process.

You can download fastq validator (https://github.com/statgen/fastQValidator) and see if you can find out where the problem is (is it with all files?). Simon Andrews had posted a perl script to do something similar a while ago on SeqAnswers. I will see if I can find that post.
Old 03-14-2016, 03:51 AM   #6
Bourney
Junior Member
 
Location: South England

Join Date: Oct 2014
Posts: 5

Quote:
Originally Posted by Michael.Ante View Post
Hi Steve,
Maybe something went wrong along the way. You could check whether the number of lines is a multiple of four, or whether every fourth line (starting with the first) begins with an @:
Code:
awk 'NR%4==1{t[substr($0,1,1)]++}END{for(i in t){print i"\t"t[i]}}' EL12.fq
With NR mod 4 you pick every fourth line (1, 5, 9, ...), and with the associative array you count the occurrences of each first character. If your file has a flaw somewhere, you'll get something other than:
@ #reads

Cheers,

Michael
Hi Michael,

It looks like you're right. Running the awk script gave:

Quote:
B 40
+ 331
F 729
T 134
/ 21
< 20
1 1
@ 3163170
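
Counts like these are consistent with the four-line frame slipping in places: header positions are occupied by quality lines (B, F, /, <), separator lines (+) and probably sequence lines (T, since the reads shown start with TGCAT). They do not say where the slips are, though. A minimal Python sketch for listing the first few offending line numbers follows; the helper name `first_bad_records` is hypothetical and not part of any tool mentioned in this thread.

```python
from itertools import islice


def first_bad_records(lines, limit=5):
    """Return (line_number, text) for the first `limit` lines in header
    position (1, 5, 9, ...) that do not start with '@'."""
    bad = (
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if number % 4 == 1 and not line.startswith("@")
    )
    # islice stops reading as soon as `limit` offenders have been found.
    return list(islice(bad, limit))
```

The reported line numbers can then be inspected in context, for example with `sed -n '<start>,<end>p' EL12.fq`, to see what the demultiplexer wrote around the first break.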
Quote:
Originally Posted by GenoMax View Post
It appears that one (or more) fastq records must have gotten mangled in the demultiplexing process.

You can download fastq validator (https://github.com/statgen/fastQValidator) and see if you can find out where the problem is (is it with all files?). Simon Andrews had posted a perl script to do something similar a while ago on SeqAnswers. I will see if I can find that post.
Yep, the error occurs with every individual. Cheers, I'll give the validator a go.
Old 03-14-2016, 03:59 AM   #7
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,961

@Bourney: Here is that post with Simon's code: http://seqanswers.com/forums/showpos...75&postcount=8
Old 03-14-2016, 05:14 AM   #8
Michael.Ante
Senior Member
 
Location: Vienna

Join Date: Oct 2011
Posts: 121

I had a look at Simon's Perl code. It also seems to throw an error if the quality string starts with an @ (a quality of 31 in Illumina 1.8; 0 in Illumina 1.3).
You might lose too many reads if you run it as is.
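
The two quality values follow from the ASCII code of `@` (64) minus the encoding offset. A small Python illustration, assuming the usual offsets of 33 for Illumina 1.8+ (Sanger encoding) and 64 for Illumina 1.3; the function name `phred_score` is just for this example:

```python
def phred_score(char, offset):
    """Quality score encoded by `char` under the given ASCII offset."""
    return ord(char) - offset


# '@' is ASCII 64.
illumina18_q = phred_score("@", 33)  # Phred+33 (Sanger / Illumina 1.8+)
illumina13_q = phred_score("@", 64)  # Phred+64 (Illumina 1.3)
print(illumina18_q, illumina13_q)  # prints: 31 0
```

So a perfectly valid quality line can begin with `@`, which is why a check that tests every line for a leading `@` or its absence, rather than only the first line of each record, can flag good reads.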
Old 03-14-2016, 06:02 AM   #9
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,961

@Michael.Ante: Good point. It should only require a minor update to the original code.

The following code (derived from Simon's example) should help pull out the IDs of problem FASTQ records and write them to a problem_id.txt file. They would need to be dealt with separately.

Code:
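#!/usr/bin/perl
use warnings;
use strict;

die "usage: file.pl <sequence.fq> \n" unless @ARGV == 1;
open(my $out, '>', 'problem_id.txt') or die "can't open the output file: $!\n";

# Read the input in blocks of four lines: ID, sequence, '+' separator, quality.
while (my $id1 = <>) {
  my $seq  = <>;
  my $id2  = <>;
  my $qual = <>;
  chomp $id1;

  # Only the first line of each block is tested for '@', so a quality
  # string that happens to start with '@' is not reported.
  unless ($id1 =~ /^\@/) {
    print $out "$id1\tmissing \@\n";
    next;
  }

  # The third line of the block must be the '+' separator
  # (it is undefined if the file ends in the middle of a record).
  unless (defined $id2 && $id2 =~ /^\+/) {
    print $out "$id1\tmissing +\n";
  }
}
close $out;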
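
The script above reads the file in fixed blocks of four lines, so after a lost or duplicated line the following blocks can all be reported even if the records themselves are fine. One possible alternative, sketched here in Python as an assumption rather than anything posted in this thread, is to slide forward one line at a time after a bad block until a well-formed record is found again. It assumes unwrapped four-line records, treats "ID starts with @, third line starts with +, sequence and quality have equal length" as the definition of well-formed, and loads the whole file into memory, so it is a heuristic for inspection rather than a validator. The name `split_records` is made up for the example.

```python
def split_records(lines):
    """Separate well-formed four-line FASTQ records from stray lines.

    Returns (records, strays): `records` is a list of (id, seq, plus, qual)
    tuples, `strays` is a list of (line_number, text) for lines that do not
    belong to any well-formed record.
    """
    lines = [line.rstrip("\n") for line in lines]
    records, strays = [], []
    i = 0
    while i < len(lines):
        block = lines[i:i + 4]
        well_formed = (
            len(block) == 4
            and block[0].startswith("@")
            and block[2].startswith("+")
            and len(block[1]) == len(block[3])
        )
        if well_formed:
            records.append(tuple(block))
            i += 4  # jump to the next expected header
        else:
            strays.append((i + 1, lines[i]))  # 1-based line number
            i += 1  # slide forward one line and try to resynchronise
    return records, strays
```

The `strays` list shows exactly which lines fall outside any record, and `records` could be written back out as a cleaned file once the stray lines have been looked at.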

Last edited by GenoMax; 03-14-2016 at 10:43 AM.
Old 03-15-2016, 12:02 AM   #10
Bourney
Junior Member
 
Location: South England

Join Date: Oct 2014
Posts: 5

Quote:
Originally Posted by Michael.Ante View Post
I had a look at Simon's Perl code. It also seems to throw an error if the quality string starts with an @ (a quality of 31 in Illumina 1.8; 0 in Illumina 1.3).
You might lose too many reads if you run it as is.
Quote:
Originally Posted by GenoMax View Post
@Michael.Ante: Good point. It should only require a minor update to the original code.

The following code (derived from Simon's example) should help pull out the IDs of problem FASTQ records and write them to a problem_id.txt file. They would need to be dealt with separately.
Cheers guys
Tags
demultiplex, fastqc, stacks
