SEQanswers

Old 11-13-2012, 10:19 AM   #1
[email protected]
Member
 
Location: Burnaby

Join Date: Sep 2012
Posts: 17
How to remove the newlines in a Pacific Biosciences FASTQ file

Hi All,

I hope someone can help me out here.
I am trying to analyze a PacBio data set. Because the reads are long, each sequence and its quality string are wrapped across multiple lines of 51 characters each. When I run the file through FastQC to check quality and statistics, it complains about the format because of the wrapped lines. How can I concatenate each sequence onto one line and each quality string onto another?

Thank you very much!
zszong@hotmail.com is offline   Reply With Quote
Old 11-13-2012, 10:39 AM   #2
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 699

Please post an example of the data you need to fix.
Old 11-13-2012, 11:12 AM   #3
[email protected]
Member
 
Location: Burnaby

Join Date: Sep 2012
Posts: 17

Thanks, Richard!

An example record is below. The sequence and the quality string each span 5 lines. I am trying to concatenate each of them separately.

Original format:

@chlamy1234
ATGTGGGCCCAATTTATGTGGGCCCAATTTATGTGGGCC
CAATTTATGTGGGCCCAATTTATGTGGGCCCAATTTATG
GGGCCCAATTTATGTGGGCCCAATTTATGTGGGCCCAAT
TTATGTGGGCCCAATTTATGTGGGCCCAATTTATGTGGG
CCCAATTTATGTGGGCCCAATTTATGTGGGCCCAATTT
+
%^&*$%^&^%^&*$%^&^%^&*$%^&^%^&*$%
^&^%^&*$%^&^%^&*$%^&^%^&*$%^&^%^&
*$%^&^%^&*$%^&^%^&*$%^&^%^&*$%^&^
%^&*$%^&^%^&*$%^&^%^&*$%^&^%^&*$%
^&^%^&*$%^&^%^&*$%^&^%^&*$%^&^$)


Format to convert to:
@chlamy1234
ATGTGGGCCCAATTTATGTGGGCCCAATTTATGTGGGCCCAATTTATGTGGGCCCAATTATGTGGGCCCAATTTATGGGGCCCAATTTATGTGGGCCCAATTTATGTGGGCCCAATTTATGTGGGCCCAATTTATGTGGGCCCAATTTATGTGGGCCCAATTTATGTGGGCCCAATTTATGTGGGCCCAATTT
+
%^&*$%^&^%^&*$%^&^%^&*$%^&^%^&*$%^&^%^&*$%^&^%^&*$%^&^%^&*$%^&^%^&*$%^&^%^&*$%^&^%^&*$%^&^%^&*$%^&^%^&*$%^&^%^&*$%^&^%^&*$%^&^%^&*$%^&^%^&*$%^&^%^&*$%^&^%^&*$%^&^$)
Old 11-13-2012, 11:44 AM   #4
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 699

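This assumes every record spans exactly 12 lines (header, 5 sequence lines, "+", 5 quality lines), as in your example:

awk '{p=(NR%12); printf "%s",$0 ; if ((p==1)||(p==6)||(p==7)||(p==0)) printf "\n"}' filename.fastq > newfilename.fastq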
Old 11-13-2012, 12:41 PM   #5
[email protected]
Member
 
Location: Burnaby

Join Date: Sep 2012
Posts: 17

Thank you, Richard! As you can tell, I am new to this area. It works for this particular example, but I am dealing with about 1 million reads, and the read length varies: one record may span five lines (as in this example) while another spans 20. I think there must be a better way to decide which lines need to be concatenated.

Your help is greatly appreciated.

Stuart
Old 11-14-2012, 01:26 AM   #6
flxlex
Moderator
 
Location: Oslo, Norway

Join Date: Nov 2008
Posts: 415

seqtk to the rescue: https://github.com/lh3/seqtk

Code:
seqtk seq -l 0 infile.fastq > outfile.fastq
should do it: -l sets the number of residues per line, and 0 turns line wrapping off.
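
For anyone who cannot install seqtk, the same unwrapping can be sketched in plain awk. This is only a minimal sketch and not how seqtk itself is implemented: the function name unwrap_fastq is made up for illustration, and it assumes well-formed records in which the quality string is exactly as long as the sequence. Counting characters, rather than looking for a leading "@", is what lets it handle records with any number of lines, since a quality line may itself start with "@" or "+".

```shell
# Hypothetical helper (name chosen for illustration): unwrap a multi-line FASTQ.
# Reads the named files, or standard input if none are given.
unwrap_fastq() {
  awk '
    # state 0: expecting an "@" header line
    state == 0 {
      if ($0 == "") next              # tolerate blank lines between records
      print; seq = ""; qual = ""; state = 1; next
    }
    # state 1: collect sequence lines until the "+" separator line
    state == 1 {
      if (substr($0, 1, 1) == "+") { print seq; print $0; state = 2 }
      else seq = seq $0
      next
    }
    # state 2: collect quality lines until they are as long as the sequence
    state == 2 {
      qual = qual $0
      if (length(qual) >= length(seq)) { print qual; state = 0 }
      next
    }
  ' "$@"
}
```

Called as unwrap_fastq infile.fastq > outfile.fastq, it should write exactly four lines per record. If a record has a quality string shorter than its sequence, the sketch will run into the next record, so it is worth checking the output before trusting it.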
Old 11-14-2012, 02:00 PM   #7
[email protected]
Member
 
Location: Burnaby

Join Date: Sep 2012
Posts: 17

Thank you, flxlex. I will try it out and let you know if it works.
Old 11-14-2012, 06:03 PM   #8
[email protected]
Member
 
Location: Burnaby

Join Date: Sep 2012
Posts: 17

It worked perfectly. Thanks, flxlex!
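
One way to confirm that a converted file really is in four-line form before handing it to FastQC is a quick structural check. The following is a rough sketch under stated assumptions: check_unwrapped is a made-up name, and it only tests layout (header starts with "@", third line starts with "+", quality length equals sequence length, line count divisible by 4), not whether the bases or quality characters are valid.

```shell
# Hypothetical helper (name chosen for illustration): exit status 0 if the
# input looks like a four-line-per-record FASTQ, 1 otherwise.
check_unwrapped() {
  awk '
    BEGIN { bad = 0 }
    NR % 4 == 1 && substr($0, 1, 1) != "@" { bad = 1 }   # header line
    NR % 4 == 2 { seqlen = length($0) }                  # sequence line
    NR % 4 == 3 && substr($0, 1, 1) != "+" { bad = 1 }   # separator line
    NR % 4 == 0 && length($0) != seqlen { bad = 1 }      # quality line
    END { if (NR % 4 != 0) bad = 1; exit bad }
  ' "$@"
}
```

For example, check_unwrapped outfile.fastq && echo OK prints OK only when every record passes these checks.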