SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
hisat2 output more reads than are in the file frymor Bioinformatics 1 04-12-2016 11:01 AM
Chromosome sizes don't match sjb343 General 1 01-20-2016 12:49 AM
de novo contigs don't match mapped reads bdorsey Bioinformatics 1 09-24-2015 11:39 PM
DESeq2 - COLUMN and ROWS number don't match KYR RNA Sequencing 1 07-16-2014 01:21 PM
454IsotigsLayout.txt - contigs don't match to any isotig? dschika Bioinformatics 9 10-11-2010 04:58 AM

Reply
 
Thread Tools
Old 06-19-2016, 08:28 PM   #1
ronaldrcutler
Member
 
Location: Virginia

Join Date: May 2016
Posts: 80
Default hisat2 file reads don't match up

Hello all. For a few of my samples, I am getting this error that the reads aren't matching up between my two mates:
Code:
Error, fewer reads in file specified with -1 than in file specified with -2
libc++abi.dylib: terminating with uncaught exception of type int
(ERR): hisat2-align died with signal 6 (ABRT)
Here is the command that I used for reference:
Code:
hisat2 -q -p 8 -x /Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/HISAT2_INDEXES/Xenopus_Laevis -1 /Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L005_R1_001.fastq_sub_sample_0.25,/Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L006_R1_001.fastq_sub_sample_0.25,/Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L007_R1_001.fastq_sub_sample_0.25,/Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L008_R1_001.fastq_sub_sample_0.25 -2 /Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L005_R2_001.fastq_sub_sample_0.25,/Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L006_R2_001.fastq_sub_sample_0.25,/Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L007_R2_001.fastq_sub_sample_0.25,/Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L008_R2_001.fastq_sub_sample_0.25 -S Sample_10_hisat_results.sam
It should be noted that these samples have been subsampled using this script that utilizes htseq to subsample mate pairs:
Code:
tophat_folders = [os.path.join(root, name)
             for root, dirs, files in os.walk(os.getcwd()) #does for the current directory in the shell!!!!     
             for name in files
             if name.endswith(".fastq")]

#within the sample file, need to go into results folder to get the .fastq file 
for files in tophat_folders:
	print(files)

fraction = float(.1)

print("fastq files being ran through")
for v, w in zip(tophat_folders[::2], tophat_folders[1::2]):
	in1 = v 
	in2 = w
	print("Mate 1 is", v)
	print("Mate 2 is", w)
	iter1 = iter( HTSeq.FastqReader( in1 ) )
	iter2 = iter( HTSeq.FastqReader( in2 ) )

	output1 = in1 + "_sub_sample_" + str(fraction)
	output2 = in2 + "_sub_sample_" + str(fraction)
	out1 = open( output1, "w" )
	out2 = open( output2, "w" )

	for read1, read2 in itertools.izip( iter1, iter2 ):
	   if random.random() < fraction:
	      read1.write_to_fastq_file( out1 )
	      read2.write_to_fastq_file( out2 )
	      
	out1.close()
	out2.close()
When I ran the above script it executed quite smoothly, but it seems to have had errors in a couple files. When I re-run the subsampling script through samples that get the above error when running through hisat2, it works fine. Any idea on this?
ronaldrcutler is offline   Reply With Quote
Old 06-20-2016, 12:34 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

No clue why the script is giving you issues. In the future, though, you might use seqtk to subsample files. It's quite quick (it'll very likely be faster than python) and you can give it a seed, so the subsampling is reproducible.
dpryan is offline   Reply With Quote
Old 06-20-2016, 06:23 AM   #3
ronaldrcutler
Member
 
Location: Virginia

Join Date: May 2016
Posts: 80
Default

What do you mean by seed?
ronaldrcutler is offline   Reply With Quote
Old 06-20-2016, 06:28 AM   #4
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Have a look at wikipedia.
dpryan is offline   Reply With Quote
Old 06-20-2016, 09:12 AM   #5
ronaldrcutler
Member
 
Location: Virginia

Join Date: May 2016
Posts: 80
Default

Thanks for bringing that to my attention. So currently each time a file is being subsampled it is presumably using a different seed.
Code:
random.seed(None)
However, to fix this and establish a seed for better consistency using this line at the beginning of the script will establish the seed:
Code:
random.seed(100)
My question when using a consistent seed is does it make a difference when subsampling multiple mate pairs of the same sample? Presumably the reads in the mate pairs are random when comparing the pairs against each other.
ronaldrcutler is offline   Reply With Quote
Old 06-20-2016, 12:40 PM   #6
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Stated differently, "Suppose I manually specify a seed before going through each file, if the 'i'th read in file 1 is selected will the 'i'th read of other files also be selected?" The answer is yes. Having said that, if you use something pre-existing (e.g., seqtk), then you don't need to spend time debugging your own scripts.
dpryan is offline   Reply With Quote
Old 06-20-2016, 01:58 PM   #7
ronaldrcutler
Member
 
Location: Virginia

Join Date: May 2016
Posts: 80
Default

Yes, I understand the 'i'th read will be selected from both files if specified with the same seed. Although this is good for reproducibility, what are the other advantages? Why is it important to use the same seed when subsamplng paired-end mates?

Would it be wise to use this across samples that one is subsampling, or does it not make a difference?
ronaldrcutler is offline   Reply With Quote
Old 06-20-2016, 11:30 PM   #8
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

If you don't use the same seed for mates then the results will be out of sync.

It's often convenient to use the same seed for all samples.
dpryan is offline   Reply With Quote
Old 06-21-2016, 03:44 AM   #9
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,888
Default

If you were using BBMap you could set sampling parameters with the aligner itself and save yourself a step.

Code:
reads=-1                Set to a positive number N to only process the first N
                        reads (or pairs), then quit.  -1 means use all reads.
samplerate=1            Set to a number from 0 to 1 to randomly select that
                        fraction of reads for mapping. 1 uses all reads.
Any specific reason you are only using a sample of the data?
GenoMax is offline   Reply With Quote
Old 06-21-2016, 06:16 AM   #10
ronaldrcutler
Member
 
Location: Virginia

Join Date: May 2016
Posts: 80
Default

Quote:
Originally Posted by dpryan View Post
If you don't use the same seed for mates then the results will be out of sync.

It's often convenient to use the same seed for all samples.
I am not using mates but paired-end reads (but that shouldn't make too much of a difference). Though I am still unsure how it will be out of sync if we don't use the same seed. I say this because as far as I can tell, the order of reads in each file of the 2 paired-end reads are sorted in a random order to begin with - please correct me if I'm wrong.

So please explain why paired reads need to have the same 'random' reads subsampled. I only see this making sense if the paired-end reads were in some kind of order relative to each other.

Thanks
ronaldrcutler is offline   Reply With Quote
Old 06-21-2016, 06:23 AM   #11
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,888
Default

Quote:
Originally Posted by ronaldrcutler View Post
I say this because as far as I can tell, the order of reads in each file of the 2 paired-end reads are sorted in a random order to begin with - please correct me if I'm wrong.
That should not be the case, unless something has been done to the files to perturb the order of reads (e.g. using a trimming program that is not PE aware). Most aligners expect the reads to be in the same order in the R1/R2 files (and don't check if that is so). If the reads are out of order in the files then you are likely to get odd results (discordant alignments etc).
GenoMax is offline   Reply With Quote
Old 06-21-2016, 06:44 AM   #12
ronaldrcutler
Member
 
Location: Virginia

Join Date: May 2016
Posts: 80
Default

Quote:
Originally Posted by GenoMax View Post
If you were using BBMap you could set sampling parameters with the aligner itself and save yourself a step.

Code:
reads=-1                Set to a positive number N to only process the first N
                        reads (or pairs), then quit.  -1 means use all reads.
samplerate=1            Set to a number from 0 to 1 to randomly select that
                        fraction of reads for mapping. 1 uses all reads.
Any specific reason you are only using a sample of the data?
I will go ahead and check out BBmap, do you know how it compares against hisat2 and tophat2 when mapping?

I am subsampling at different fractions of an original DE experiment to determine what is the lowest amount of reads we can use for 3 replicates while maintaining power to get a high percentage of the first 300 DE genes ordered by padj value using 100% of reads.
ronaldrcutler is offline   Reply With Quote
Old 06-21-2016, 06:52 AM   #13
ronaldrcutler
Member
 
Location: Virginia

Join Date: May 2016
Posts: 80
Default

Quote:
Originally Posted by GenoMax View Post
Most aligners expect the reads to be in the same order in the R1/R2 files (and don't check if that is so).
Ah, and this order should be the same as both forward and reverse reads are sequenced from the 5' end right? I'm also assuming that both of these reads contains the same number of reads.
ronaldrcutler is offline   Reply With Quote
Old 06-21-2016, 06:53 AM   #14
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,888
Default

BBMap is splice-aware and should do very well against any published aligner out there. Choice of aligners becomes a personal choice (since they offer so many options it is hard to master them all and switching between aligners becomes a hassle after a time). It is multi-threaded and should be one of the faster aligners (I am biased, so make your own judgement ). @Brian (author of BBMap) participates here regularly and can answer any questions (including advanced usage that sometimes only he knows about).

Here is a thread about BBMap aligner. I have tried to collect all things that BBMap suite of programs do here.
GenoMax is offline   Reply With Quote
Old 06-21-2016, 07:01 AM   #15
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,888
Default

Quote:
Originally Posted by ronaldrcutler View Post
Ah, and this order should be the same as both forward and reverse reads are sequenced from the 5' end right? I'm also assuming that both of these reads contains the same number of reads.
Order with respect to the fragment to ensure that R1 and R2 from a specific fragment will be at the same relative location in both files. Sequencing is always in 5'->3' direction.

If a PE aware aligner is not used (or PE files are not trimmed together) then one of the files may end up with more reads than the other and the order of reads would no longer be maintained.
GenoMax is offline   Reply With Quote
Old 06-21-2016, 07:14 AM   #16
ronaldrcutler
Member
 
Location: Virginia

Join Date: May 2016
Posts: 80
Default

Thanks for the insightful info GenoMax. I had always thought reads were in a random order in .fastq files. Then I would also think that reads from two biological replicates where each had 3 sets of PE reads from different lanes would also have a relatively similar order.

In essence R1 and R2 are technical replicates due to bridge amplification. I am going to try and compare any differences I find in read counts using a seed in a sample and not using one. And might even compare alignment between BBMap, Hisat2, and Tophat2
ronaldrcutler is offline   Reply With Quote
Old 06-21-2016, 08:00 AM   #17
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,888
Default

Quote:
Originally Posted by ronaldrcutler View Post
Thanks for the insightful info GenoMax. I had always thought reads were in a random order in .fastq files. Then I would also think that reads from two biological replicates where each had 3 sets of PE reads from different lanes would also have a relatively similar order.
Reads include positional information (lane, X, Y location) in the header which is only valid for a specific fragment/cluster. Ultimately the physical position of a cluster does not matter (as far as sequence goes).

Quote:
In essence R1 and R2 are technical replicates due to bridge amplification. I am going to try and compare any differences I find in read counts using a seed in a sample and not using one. And might even compare alignment between BBMap, Hisat2, and Tophat2
Only if they are fully overlapping though I have never thought of R1/R2 as technical replicates of each other. In most cases R1/R2 would not be expected to completely overlap (there are special applications where they are expected to overlap partially).

You could use reformat.sh from BBMap to sample reads for use with all three programs. It will accept this parameter for reproducible sampling.
Code:
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling)

Last edited by GenoMax; 06-21-2016 at 08:02 AM.
GenoMax is offline   Reply With Quote
Old 06-21-2016, 08:47 AM   #18
ronaldrcutler
Member
 
Location: Virginia

Join Date: May 2016
Posts: 80
Default

Quote:
Originally Posted by GenoMax View Post
Reads include positional information (lane, X, Y location) in the header which is only valid for a specific fragment/cluster. Ultimately the physical position of a cluster does not matter (as far as sequence goes).
Quote:
Originally Posted by GenoMax View Post
Order with respect to the fragment to ensure that R1 and R2 from a specific fragment will be at the same relative location in both files. Sequencing is always in 5'->3' direction.
Just for clarification - If I wanted to look at a fragment that is close to the beginning of the R1 file, the same fragment would be near the end of the R2 file. This preserves relative location so the two reads can match up.

In regards to the physical location of a cluster, it does not matter in the reads because it is the job of the aligner to map it to the transcriptome?
ronaldrcutler is offline   Reply With Quote
Old 06-21-2016, 09:44 AM   #19
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,888
Default

No. The order of reads is the same in both R1/R2 files.

Quote:
R1_file

Cluster_1_R1_read
Cluster_2_R1_read
Cluster_3_R1_read

R2 file

Cluster_1_R2_read
Cluster_2_R2_read
Cluster_3_R2_read
Position of the clusters (which is encoded in the fastq header) is only important for the sequencer during sequencing since instrument software keeps track of each individual cluster (there are millions per lane) over the duration of the run. The position itself has no influence on the sequence of that cluster (as long as the software that does the base-calling has done its job right).
GenoMax is offline   Reply With Quote
Old 06-21-2016, 11:04 AM   #20
ronaldrcutler
Member
 
Location: Virginia

Join Date: May 2016
Posts: 80
Default

Alright, that clears a lot up. Your expertise is appreciated.
ronaldrcutler is offline   Reply With Quote
Reply

Tags
bowtie 2, hisat2, htseq, subsample

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:21 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO