![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
hisat2 output more reads than are in the file | frymor | Bioinformatics | 1 | 04-12-2016 12:01 PM |
Chromosome sizes don't match | sjb343 | General | 1 | 01-20-2016 01:49 AM |
de novo contigs don't match mapped reads | bdorsey | Bioinformatics | 1 | 09-25-2015 12:39 AM |
DESeq2 - COLUMN and ROWS number don't match | KYR | RNA Sequencing | 1 | 07-16-2014 02:21 PM |
454IsotigsLayout.txt - contigs don't match to any isotig? | dschika | Bioinformatics | 9 | 10-11-2010 05:58 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: Virginia Join Date: May 2016
Posts: 80
|
![]()
Hello all. For a few of my samples, I am getting this error that the reads aren't matching up between my two mates:
Code:
Error, fewer reads in file specified with -1 than in file specified with -2 libc++abi.dylib: terminating with uncaught exception of type int (ERR): hisat2-align died with signal 6 (ABRT) Code:
hisat2 -q -p 8 -x /Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/HISAT2_INDEXES/Xenopus_Laevis -1 /Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L005_R1_001.fastq_sub_sample_0.25,/Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L006_R1_001.fastq_sub_sample_0.25,/Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L007_R1_001.fastq_sub_sample_0.25,/Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L008_R1_001.fastq_sub_sample_0.25 -2 /Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L005_R2_001.fastq_sub_sample_0.25,/Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L006_R2_001.fastq_sub_sample_0.25,/Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L007_R2_001.fastq_sub_sample_0.25,/Volumes/cachannel/RNA_SEQ/Notch_RNASeq/in_silico_test/25\%/Sample_10/10_TAGCTT_L008_R2_001.fastq_sub_sample_0.25 -S Sample_10_hisat_results.sam Code:
tophat_folders = [os.path.join(root, name) for root, dirs, files in os.walk(os.getcwd()) #does for the current directory in the shell!!!! for name in files if name.endswith(".fastq")] #within the sample file, need to go into results folder to get the .fastq file for files in tophat_folders: print(files) fraction = float(.1) print("fastq files being ran through") for v, w in zip(tophat_folders[::2], tophat_folders[1::2]): in1 = v in2 = w print("Mate 1 is", v) print("Mate 2 is", w) iter1 = iter( HTSeq.FastqReader( in1 ) ) iter2 = iter( HTSeq.FastqReader( in2 ) ) output1 = in1 + "_sub_sample_" + str(fraction) output2 = in2 + "_sub_sample_" + str(fraction) out1 = open( output1, "w" ) out2 = open( output2, "w" ) for read1, read2 in itertools.izip( iter1, iter2 ): if random.random() < fraction: read1.write_to_fastq_file( out1 ) read2.write_to_fastq_file( out2 ) out1.close() out2.close() |
![]() |
![]() |
![]() |
#2 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
![]()
No clue why the script is giving you issues. In the future, though, you might use seqtk to subsample files. It's quite quick (it'll very likely be faster than python) and you can give it a seed, so the subsampling is reproducible.
|
![]() |
![]() |
![]() |
#3 |
Member
Location: Virginia Join Date: May 2016
Posts: 80
|
![]()
What do you mean by seed?
|
![]() |
![]() |
![]() |
#5 |
Member
Location: Virginia Join Date: May 2016
Posts: 80
|
![]()
Thanks for bringing that to my attention. So currently each time a file is being subsampled it is presumably using a different seed.
Code:
random.seed(None) Code:
random.seed(100) |
![]() |
![]() |
![]() |
#6 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
![]()
Stated differently, "Suppose I manually specify a seed before going through each file, if the 'i'th read in file 1 is selected will the 'i'th read of other files also be selected?" The answer is yes. Having said that, if you use something pre-existing (e.g., seqtk), then you don't need to spend time debugging your own scripts.
|
![]() |
![]() |
![]() |
#7 |
Member
Location: Virginia Join Date: May 2016
Posts: 80
|
![]()
Yes, I understand the 'i'th read will be selected from both files if specified with the same seed. Although this is good for reproducibility, what are the other advantages? Why is it important to use the same seed when subsamplng paired-end mates?
Would it be wise to use this across samples that one is subsampling, or does it not make a difference? |
![]() |
![]() |
![]() |
#8 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
![]()
If you don't use the same seed for mates then the results will be out of sync.
It's often convenient to use the same seed for all samples. |
![]() |
![]() |
![]() |
#9 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,091
|
![]()
If you were using BBMap you could set sampling parameters with the aligner itself and save yourself a step.
Code:
reads=-1 Set to a positive number N to only process the first N reads (or pairs), then quit. -1 means use all reads. samplerate=1 Set to a number from 0 to 1 to randomly select that fraction of reads for mapping. 1 uses all reads. |
![]() |
![]() |
![]() |
#10 | |
Member
Location: Virginia Join Date: May 2016
Posts: 80
|
![]() Quote:
So please explain why paired reads need to have the same 'random' reads subsampled. I only see this making sense if the paired-end reads were in some kind of order relative to each other. Thanks |
|
![]() |
![]() |
![]() |
#11 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,091
|
![]()
That should not be the case, unless something has been done to the files to perturb the order of reads (e.g. using a trimming program that is not PE aware). Most aligners expect the reads to be in the same order in the R1/R2 files (and don't check if that is so). If the reads are out of order in the files then you are likely to get odd results (discordant alignments etc).
|
![]() |
![]() |
![]() |
#12 | |
Member
Location: Virginia Join Date: May 2016
Posts: 80
|
![]() Quote:
I am subsampling at different fractions of an original DE experiment to determine what is the lowest amount of reads we can use for 3 replicates while maintaining power to get a high percentage of the first 300 DE genes ordered by padj value using 100% of reads. |
|
![]() |
![]() |
![]() |
#13 |
Member
Location: Virginia Join Date: May 2016
Posts: 80
|
![]()
Ah, and this order should be the same as both forward and reverse reads are sequenced from the 5' end right? I'm also assuming that both of these reads contains the same number of reads.
|
![]() |
![]() |
![]() |
#14 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,091
|
![]()
BBMap is splice-aware and should do very well against any published aligner out there. Choice of aligners becomes a personal choice (since they offer so many options it is hard to master them all and switching between aligners becomes a hassle after a time). It is multi-threaded and should be one of the faster aligners (I am biased, so make your own judgement
![]() Here is a thread about BBMap aligner. I have tried to collect all things that BBMap suite of programs do here. |
![]() |
![]() |
![]() |
#15 | |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,091
|
![]() Quote:
If a PE aware aligner is not used (or PE files are not trimmed together) then one of the files may end up with more reads than the other and the order of reads would no longer be maintained. |
|
![]() |
![]() |
![]() |
#16 |
Member
Location: Virginia Join Date: May 2016
Posts: 80
|
![]()
Thanks for the insightful info GenoMax. I had always thought reads were in a random order in .fastq files. Then I would also think that reads from two biological replicates where each had 3 sets of PE reads from different lanes would also have a relatively similar order.
In essence R1 and R2 are technical replicates due to bridge amplification. I am going to try and compare any differences I find in read counts using a seed in a sample and not using one. And might even compare alignment between BBMap, Hisat2, and Tophat2 |
![]() |
![]() |
![]() |
#17 | ||
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,091
|
![]() Quote:
Quote:
You could use reformat.sh from BBMap to sample reads for use with all three programs. It will accept this parameter for reproducible sampling. Code:
sampleseed=-1 Set to a positive number to use that prng seed for sampling (allowing deterministic sampling) Last edited by GenoMax; 06-21-2016 at 09:02 AM. |
||
![]() |
![]() |
![]() |
#18 | ||
Member
Location: Virginia Join Date: May 2016
Posts: 80
|
![]() Quote:
Quote:
In regards to the physical location of a cluster, it does not matter in the reads because it is the job of the aligner to map it to the transcriptome? |
||
![]() |
![]() |
![]() |
#19 | |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,091
|
![]()
No. The order of reads is the same in both R1/R2 files.
Quote:
|
|
![]() |
![]() |
![]() |
#20 |
Member
Location: Virginia Join Date: May 2016
Posts: 80
|
![]()
Alright, that clears a lot up. Your expertise is appreciated.
|
![]() |
![]() |
![]() |
Tags |
bowtie 2, hisat2, htseq, subsample |
Thread Tools | |
|
|