UniqSketch is a sensitive tool for identifying strains and their relative abundances in metagenomics samples. The algorithm consists of two stages: Indexing and Querying.
-
Indexing: takes a list of target sequences (usually reference genomes) and identifies all k-mers within each target that uniquely belong to that specific sequence. It then checks a set of criteria such as low-complexity on the k-mer candidates to refine the list and provide the final set of unique k-mers for each target. To efficiently keep track of k-mers and their frequencies, UniqSketch utilizes the Bloom filter data structure as an alternative to hash tables.
-
Querying: raw read sequencing data are queried against the sketch index. To remove k-mers caused by sequencing errors (k-mers with very low depth), UniqSketch utilizes a cascading Bloom filter to discard weak k-mers. The solid k-mers are then queried against the index and the count for the related reference target is updated accordingly. After all solid k-mers in the raw read input data are queried, the reference count table is computed to generate the final output of most abundant references.
conda install -c bioconda -c conda-forge uniqsketchgit clone https://github.com/amazon-science/uniqsketch
cd uniqsketch
cmake -S . -B build
cmake --build buildBinaries are placed in build/bin/.
git clone https://github.com/amazon-science/uniqsketch
cd uniqsketch
makeBinaries are placed in bin/.
uniqsketch --help# CMake
cmake --build build --target test
# Make
make testUniqSketch consists of three utilities:
uniqsketch: build a signature index from a reference/target databasequerysketch: query input read files against a constructed indexcomparesketch: compute pairwise similarity between two sets of reference genomes
- Construct a sketch index of size
81bpfrom three references and store it insketch_index.tsv:
uniqsketch --sensitive -k81 -o sketch_index.tsv ref_1.fa ref_2.fa ref_3.fa- Cluster nearly identical references and build a sketch from representatives only:
uniqsketch --cluster=2000 --sensitive -k81 -o sketch_index.tsv @refs.txtReferences with fewer than 2000 unique k-mers between them are merged into clusters. A clusters.tsv file is written to the output directory mapping each reference to its cluster representative.
- Query a metagenomics sample with read files
r1.fq.gzandr2.fq.gz:
querysketch --r1 r1.fq.gz --r2 r2.fq.gz --ref sketch_index.tsv --out out_sample.tsvsketch_uniq.tsv: primary output — tab-separated file storing the final set of signatures for each reference.outdir_uniqsketch/: folder containing a file per reference with all signature candidates.db_uniq_count.tsv: tab-separated summary of all references and total number of candidate signatures.clusters.tsv(when--clusteris used): tab-separated file mapping each reference to its cluster representative, with columns:cluster_id,representative,member,num_kmers.
out_sample.tsv: primary output reporting references and their abundance within the sample:
ref abundance count
ref_1 0.7862 3033
ref_2 0.1993 769
ref_3 0.01426 55
log_out_sample.tsv: detailed log showing for each reference the total number of assigned reads and allsignature:countpairs.logread_out_sample.tsv: read-level log listing all matched read IDs per reference.
Usage: uniqsketch [OPTION] @LIST_FILES (or FILES)
Creates unique sketch from a list of fasta reference files.
A list of files containing file names in each row can be passed with @ prefix.
Options:
-t, --threads=N use N parallel threads [1]
-k, --kmer=N the length of kmer [81]
-b, --bit=N use N bits per element in Bloom filter [128]
-d, --hash1=N distinct Bloom filter hash number [5]
-s, --hash2=N repeat Bloom filter hash number [5]
-c, --cov=N number of unique k-mers to represent a reference [100]
-f, --outdir=STRING dir for universe unique k-mers [outdir_uniqsketch]
-o, --out=STRING the output sketch file name [sketch_uniq.tsv]
-r, --stat=STRING the output unique kmer stat file name [db_uniq_count.tsv]
-e, --entropy sets the aggregate entropy rate threshold [0.65]
--cluster=N cluster similar references with N unique k-mer threshold [0=off]
--sensitive sets sensitivity parameter c to 100
--very-sensitive sets sensitivity parameter c to 1000
--help display this help and exit
--version output version information and exit
Usage: querysketch [OPTIONS] [ARGS]
Identify references and their abundance in FILE(S).
Acceptable file formats: fastq in compressed formats gz, bz, zip, xz.
Options:
-t, --threads=N use N parallel threads [1]
-b, --bit=N use N bits per element in Bloom filter [64]
-o, --out=STRING the output file name
-l, --r1=STRING input read 1
-r, --r2=STRING input read 2
-g, --ref=STRING input uniqsketch reference
-h, --hit=N number of uniqsketch hits to call a reference [10]
-a, --acutoff=N abundance cutoff to report [0.0]
-s, --rcutoff=N read cutoff to report [2]
--sensitive sets sensitivity parameter h=10
--very-sensitive sets sensitivity parameter h=5
--solid only use solid k-mers in reads
--help display this help and exit
--version version information and exit
Usage: comparesketch [OPTION] LIST1 LIST2
Compare two sets of fasta reference files |LIST1|*|LIST2|.
Two lists of files containing file paths in each row.
Options:
-t, --threads=N use N parallel threads [1]
-k, --kmer=N the length of kmer [81]
-b, --bit=N use N bits per element in Bloom filter [16]
-d, --hash=N Bloom filter hash number [3]
-g, --gsize=N approximate size for reference sequence [5000000]
-o, --out=STRING the output similarity file name [reference_similarity.tsv]
--help display this help and exit
--version output version information and exit
UniqSketch vendors the following libraries:
- ntHash — recursive nucleotide hashing
- ntCard — streaming k-mer cardinality estimation
- kseq.h — fast FASTA/FASTQ parsing (Heng Li)
System requirements:
- C++17 compiler (GCC 7+, Clang 5+)
- zlib
- OpenMP (optional, for multi-threading)