The SeqWare Query Engine is intended to be a universal store for sequence variants. Features include fine-grained control over versioning, a framework for new MapReduce plug-ins, rich value tags/annotations on most objects, and set organization. The current iteration of the SeqWare Query Engine can store and search data in HBase, a modern NoSQL database, while using Google’s Protocol Buffers for serialization.
Most users will want to use our pre-configured VMs; see the SeqWare Install Guide for how to get the VM.
Please see the Install Guide for installing Query Engine from scratch.
We currently load data via our command-line programs. In order to do this, you will want to go to the root directory and compile a full version of our jar with dependencies included. You will probably wish to skip the tests for our other components. This should look something like:
cd ~/gitroot/seqware/
mvn clean install -DskipTests
cd seqware-distribution/target
In this distribution directory, you can run our command line tools for import.
The first time you run these tools in a new namespace, you may wish to create a common reference and then create an ad hoc tag set that will store all tags that do not match known tags. Most of our command-line tools output a key value file that you can keep in order to record the ID of your created objects.
~/seqware_github/seqware-distribution/target$ java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.ReferenceCreator
Only 0 arguments found
ReferenceCreator[output_file]
~/seqware_github/seqware-distribution/target$ java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.ReferenceCreator hg_19 keyValue_ref.out
Reference written with an ID of: hg_19
~/seqware_github/seqware-distribution/target$ cat keyValue_ref.out
referenceID hg_19
~/seqware_github/seqware-distribution/target$ java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.TagSetCreator
Only 0 arguments found
TagSetCreator [output_file]
~/seqware_github/seqware-distribution/target$ java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.TagSetCreator ad_hoc keyValue_adHoc.out
TagSet written with an ID of: ad_hoc
~/seqware_github/seqware-distribution/target$ cat keyValue_adHoc.out
TagSetID ad_hoc
namespace BATMAN
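The key value files shown above are handy when a later step needs an ID from an earlier one. The following is a minimal sketch of reading such a file, assuming each line holds one key and one value separated by whitespace (the exact separator is an assumption based on the output above). The class name KeyValueFile is hypothetical and is not part of SeqWare.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of SeqWare: reads a key value output file. */
final class KeyValueFile {

    /** Parses lines of the assumed form "key<whitespace>value", skipping blank lines. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // Split on the first run of whitespace only, so a value may contain spaces.
            String[] parts = trimmed.split("\\s+", 2);
            values.put(parts[0], parts.length > 1 ? parts[1] : "");
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: KeyValueFile <key_value_file>");
            System.exit(1);
        }
        Map<String, String> values = parse(Files.readAllLines(Paths.get(args[0])));
        values.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Running it against keyValue_adHoc.out would list the TagSetID and namespace entries, which can then be passed to later commands.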
You may also wish to pre-populate the database with Sequence Ontology (SO) terms in a TagSet:
~/seqware_github/seqware-distribution/target$ java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.importers.OBOImporter
Only 0 arguments found
OBOImporter[output_file]
~/seqware_github/seqware-distribution/target$ java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.importers.OBOImporter ../../seqware-queryengine-backend/src/test/resources/com/github/seqware/queryengine/system/so.obo keyValueOBO.out
6861 terms written to a TagSet written with an ID of: 42860461-0620-4990-bf15-32e6d34701b3
~/seqware_github/seqware-distribution/target$ cat keyValueOBO.out
TagSetID 42860461-0620-4990-bf15-32e6d34701b3
namespace BATMAN
The previous steps should only need to be done once, when first setting up a namespace. Afterwards, the VCF file importer can be called repeatedly for each of your datasets. Note that you will need to substitute the TagSetID from the previous step into the next step.
~/seqware_github/seqware-distribution/target$ java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.importers.SOFeatureImporter
usage: SOFeatureImporter
 -a (optional) an ID for an ad hoc TagSet, Tags will either be found or added to this set, a new TagSet will be generated if this option is not used
 -b (optional) batch-size for the number of features in memory to keep before a flush, will automatically be chosen if not specified, we use 100000 for now
 -c (optional) whether we are working with compressed input
 -f (optional) for benchmarking for now, append features to an existing featureset
 -i (required) comma separated input files
 -o (optional) output file with our resulting key values
 -r (required) the reference ID to attach our FeatureSet to
 -s (optional) comma separated TagSet IDs, new Tags will be linked to the first set that they appear, these TagSets will not be modified
 -t (optional: default 1) the number of threads to use in our import
 -w (required) the work module and thus the type of file we are working with
~/seqware_github/seqware-distribution/target$ java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.importers.SOFeatureImporter -i ../../seqware-queryengine-backend/src/test/resources/com/github/seqware/queryengine/system/FeatureImporter/consequences_annotated.vcf -o keyValueVCF.out -r hg_19 -s 42860461-0620-4990-bf15-32e6d34701b3 -a ad_hoc -w VCFVariantImportWorker
FeatureSet written with an ID of: 4bd2ced0-5e37-4930-bffe-207b862a09a6
~/seqware_github/seqware-distribution/target$ echo $?
0
~/seqware_github/seqware-distribution/target$ cat keyValueVCF.out
FeatureSetID 4bd2ced0-5e37-4930-bffe-207b862a09a6
The IDs of your feature sets and parameters will change from run to run. Depending on the size of your data, you may also need to either tune the size of your batches (via the -b option) when loading features or allocate more memory to Java (java -Xmx4096m -classpath ...).
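To make the memory trade-off behind the batch size concrete, here is a toy sketch of batching: items are held in memory until the batch is full and are then flushed together. This is only an illustration of the general technique, not the importer's actual code. The class BatchingWriter and its sink callback are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Toy illustration of batching; not the SOFeatureImporter implementation. */
final class BatchingWriter<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> buffer = new ArrayList<>();

    BatchingWriter(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Buffers one item and flushes once the buffer reaches the batch size. */
    void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Hands the buffered items to the sink and empties the buffer. */
    void flush() {
        if (!buffer.isEmpty()) {
            sink.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public static void main(String[] args) {
        BatchingWriter<String> writer =
                new BatchingWriter<>(2, batch -> System.out.println("flushing " + batch));
        writer.add("feature1");
        writer.add("feature2");
        writer.add("feature3");
        writer.flush(); // flush the final, partially filled batch
    }
}
```

A smaller batch size keeps fewer items in memory at once at the cost of more frequent flushes, which is the same trade-off the two tuning options above address.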
The SOFeatureImporter will output a FeatureSet ID that should be used as part of the input to the VCFDumper command in order to export a FeatureSet.
~/seqware_github/seqware-distribution/target$ java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.exporters.VCFDumper
0 arguments found
VCFDumper[outputFile]
~/seqware_github/seqware-distribution/target$ java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.exporters.VCFDumper 4bd2ced0-5e37-4930-bffe-207b862a09a6 test_out.vcf
~/seqware_github/seqware-distribution/target$ sort test_out.vcf > sorted_test_out.vcf
~/seqware_github/seqware-distribution/target$ sort ../../seqware-queryengine-backend/src/test/resources/com/github/seqware/queryengine/system/FeatureImporter/consequences_annotated.vcf > control.vcf
~/seqware_github/seqware-distribution/target$ diff -b sorted_test_out.vcf control.vcf
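The sort and diff -b steps above check that the export round-trips the imported file regardless of line order and whitespace amounts. The same check can be sketched in Java as follows. This only approximates diff -b (runs of whitespace are collapsed and trailing whitespace is dropped before comparing), and the class name RoundTripCheck is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: order-insensitive, whitespace-tolerant file comparison. */
final class RoundTripCheck {

    /** Collapses whitespace runs, drops trailing whitespace, and sorts the lines. */
    static List<String> normalize(List<String> lines) {
        List<String> normalized = new ArrayList<>();
        for (String line : lines) {
            normalized.add(line.replaceAll("\\s+", " ").replaceAll(" $", ""));
        }
        Collections.sort(normalized);
        return normalized;
    }

    /** Returns true when both inputs contain the same lines after normalization. */
    static boolean sameContent(List<String> exported, List<String> original) {
        return normalize(exported).equals(normalize(original));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: RoundTripCheck <exported.vcf> <original.vcf>");
            System.exit(2);
        }
        boolean same = sameContent(
                Files.readAllLines(Paths.get(args[0])),
                Files.readAllLines(Paths.get(args[1])));
        System.out.println(same ? "files match" : "files differ");
        System.exit(same ? 0 : 1);
    }
}
```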
Most of these commands use non-zero return codes to indicate that an error occurred. For example:
~/seqware_github/seqware-distribution/target$ java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.importers.SOFeatureImporter -i ../../seqware-queryengine-backend/src/test/resources/com/github/seqware/queryengine/system/FeatureImporter/test_invalid.vcf -o keyValueVCF.out -r hg_19 -s 42860461-0620-4990-bf15-32e6d34701b3 -a ad_hoc -w VCFVariantImportWorker
[SeqWare Query Engine] 1 [Thread-3] FATAL com.github.seqware.queryengine.system.importers.workers.VCFVariantImportWorker - Exception thrown with file: ../../seqware-queryengine-backend/src/test/resources/com/github/seqware/queryengine/system/FeatureImporter/test_invalid.vcf
java.lang.NumberFormatException: For input string: "51xxx"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Integer.parseInt(Integer.java:458)
    at java.lang.Integer.parseInt(Integer.java:499)
    at com.github.seqware.queryengine.system.importers.workers.VCFVariantImportWorker.run(VCFVariantImportWorker.java:285)
~/seqware_github/seqware-distribution/target$ echo $?
10
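When these tools are driven from another program rather than a shell, the return code can be checked in the same way. The following is a minimal sketch using the standard ProcessBuilder API. The class name ExitCodeCheck is hypothetical, and the command to run is whatever you pass on the command line.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical wrapper: runs a command and reports a non-zero exit code. */
final class ExitCodeCheck {

    /** Runs the command with its output forwarded to this console; returns its exit code. */
    static int run(List<String> command) {
        try {
            Process process = new ProcessBuilder(command).inheritIO().start();
            return process.waitFor();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for command", e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: ExitCodeCheck <command> [args...]");
            System.exit(2);
        }
        int exitCode = run(Arrays.asList(args));
        if (exitCode != 0) {
            System.err.println("command failed with exit code " + exitCode);
        }
        // Propagate the exit code so callers of this wrapper can check it too.
        System.exit(exitCode);
    }
}
```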
The default VCFDumper simply exports all Features that are in a FeatureSet. However, it is also an excellent starting point for experimenting with new queries and various features of the query engine.
One example of how you may wish to extend and adapt the VCFDumper is available in com.github.seqware.queryengine.tutorial.BrianTest, while specific queries can be found in com.github.seqware.queryengine.model.test.QueryInterfaceTest and copied and pasted into BrianTest.
Note: More details on the inner workings of the query engine will be available in “Extending the Query Engine.” However, for a developer who will primarily interact with the model objects and command-line tools, the most important distinction to be aware of is which operations update individual features in a feature set and which operations perform a copy-on-write operation for the whole feature set, since the latter is much more expensive. In general, operations that iterate through a feature set read or update individual features. Operations that go through the Query Interface and call plug-ins will perform a copy-on-write. Operations that perform a copy-on-write always provide a TTL (time-to-live) parameter.
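The cost difference described in the note can be pictured with a toy model. The sketch below is not the engine's implementation and does not use any SeqWare classes: ToyFeatureSet is a hypothetical in-memory list where update touches a single entry, while filter builds a whole new set and leaves the original untouched, which is the copy-on-write behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

/** Toy model of in-place updates versus copy-on-write; not SeqWare code. */
final class ToyFeatureSet {
    private final List<String> features;

    ToyFeatureSet(List<String> features) {
        this.features = new ArrayList<>(features);
    }

    int size() {
        return features.size();
    }

    List<String> features() {
        return Collections.unmodifiableList(features);
    }

    /** In-place update: only one feature is touched. */
    void update(int index, String feature) {
        features.set(index, feature);
    }

    /** Copy-on-write: every feature is visited and the matches go into a new set. */
    ToyFeatureSet filter(Predicate<String> keep) {
        List<String> copy = new ArrayList<>();
        for (String feature : features) {
            if (keep.test(feature)) {
                copy.add(feature);
            }
        }
        return new ToyFeatureSet(copy);
    }

    public static void main(String[] args) {
        ToyFeatureSet original = new ToyFeatureSet(Arrays.asList("21:100", "1:5", "21:200"));
        ToyFeatureSet onlyChr21 = original.filter(f -> f.startsWith("21:"));
        original.update(0, "21:101"); // does not affect the filtered copy
        System.out.println(original.features());
        System.out.println(onlyChr21.features());
    }
}
```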
QueryInterfaceTest brings us to the code in the testing directories, which demonstrates many of the features available in the Query Engine.
The QueryVCFDumper allows you to test queries written in our query language. This class takes a feature set ID as input, performs a query, and outputs the last feature set in VCF format. Our query language is currently specified here but will be migrated to this site.
Here is an example of how to interact with the utility; we start with its usage message.
~/seqware_github/seqware-distribution/target$ java -cp seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.exporters.QueryVCFDumper
usage: QueryVCFDumper
 -f (required) the ID of the featureset that we will be querying and exporting
 -k (optional) a key value file that includes the featureset ID of each featureset that is created during querying and the final featureset ID
 -o
The QueryVCFDumper also allows you to quickly test Java queries written against the QueryInterface directly. This class takes a feature set ID and the name of a class that implements com.github.seqware.queryengine.system.exporters.QueryDumperInterface as input, performs a few queries, and outputs the last feature set in VCF format.
Here is an example of how to interact with the utility; we run through compiling a few queries and running them.
~/seqware_github/seqware-distribution/target$ cp ../../seqware-queryengine-backend/src/test/java/com/github/seqware/queryengine/system/test/queryDumper/VCFDumperParameterExample.java QueryTutorial.java
~/seqware_github/seqware-distribution/target$ gvim QueryTutorial.java
~/seqware_github/seqware-distribution/target$ javac -cp seqware-distribution-1.0.6-qe-full.jar QueryTutorial.java
~/seqware_github/seqware-distribution/target$ java -cp .:seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.exporters.QueryVCFDumper -f 4bd2ced0-5e37-4930-bffe-207b862a09a6 -k keyValue.out -o output.vcf -p QueryTutorial
During the gvim step, it is important to delete the package line, delete the import from the test package, change the class name, and then make the required changes to the queries. For example, to search for intron_variant rather than non_synonymous_codon, we need to change the file to the following:
/*
* Copyright (C) 2012 SeqWare
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/>.
*/
import com.github.seqware.queryengine.factory.SWQEFactory;
import com.github.seqware.queryengine.kernel.RPNStack;
import com.github.seqware.queryengine.model.FeatureSet;
import com.github.seqware.queryengine.model.QueryFuture;
import com.github.seqware.queryengine.model.QueryInterface;
import com.github.seqware.queryengine.system.exporters.QueryDumperInterface;
/**
 * An example of a parameter file. See more possible Queries in {@link QueryInterfaceTest}.
 * @author dyuen
 */
public class QueryTutorial implements QueryDumperInterface {

    @Override
    public int getNumQueries() {
        // we will run three queries
        return 3;
    }

    @Override
    public QueryFuture getQuery(FeatureSet set, int queryNum) {
        if (queryNum == 0) {
            // limits us to CHROM #21
            return SWQEFactory.getQueryInterface().getFeaturesByAttributes(0, set, new RPNStack(new RPNStack.FeatureAttribute("seqid"), new RPNStack.Constant("21"), RPNStack.Operation.EQUAL));
        } else if (queryNum == 1) {
            // limits us to the range of 20000000 through 30000000
            return SWQEFactory.getQueryInterface().getFeaturesByRange(0, set, QueryInterface.Location.INCLUDES, "21", 20000000, 30000000);
        } else {
            // limits us to features with a particular tag
            return SWQEFactory.getQueryInterface().getFeaturesByAttributes(0, set, new RPNStack(new RPNStack.TagOccurrence("ad_hoc", "intron_variant")));
        }
    }
}
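The first query above lists its operands before the operator (attribute, constant, then EQUAL), which is reverse Polish notation. To show how such a sequence is evaluated, here is a toy evaluator. It does not use or reproduce SeqWare's RPNStack: the class ToyRpn, the "@" prefix for attribute lookups, and the AND operator are assumptions made for this illustration only.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy reverse Polish notation evaluator; not SeqWare's RPNStack. */
final class ToyRpn {

    /**
     * Evaluates tokens left to right. "@name" pushes an attribute value,
     * EQUAL and AND pop two operands and push a boolean, and any other
     * token is pushed as a constant.
     */
    static boolean matches(List<String> tokens, Map<String, String> attributes) {
        Deque<Object> stack = new ArrayDeque<>();
        for (String token : tokens) {
            switch (token) {
                case "EQUAL": {
                    Object right = stack.pop();
                    Object left = stack.pop();
                    stack.push(left.equals(right));
                    break;
                }
                case "AND": {
                    boolean right = (Boolean) stack.pop();
                    boolean left = (Boolean) stack.pop();
                    stack.push(left && right);
                    break;
                }
                default:
                    if (token.startsWith("@")) {
                        // ArrayDeque rejects null, so a missing attribute becomes "".
                        stack.push(attributes.getOrDefault(token.substring(1), ""));
                    } else {
                        stack.push(token);
                    }
            }
        }
        return (Boolean) stack.pop();
    }

    public static void main(String[] args) {
        Map<String, String> feature = new HashMap<>();
        feature.put("seqid", "21");
        // Same shape as the first query: attribute, constant, operator.
        System.out.println(matches(Arrays.asList("@seqid", "21", "EQUAL"), feature));
    }
}
```

Because operators always follow their operands, compound conditions need no parentheses: two EQUAL comparisons followed by AND combine into one test.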
The testing directories are com.github.seqware.queryengine.model.test, com.github.seqware.queryengine.impl.test, and com.github.seqware.queryengine.system.test. These directories test the model objects that outside developers can manipulate and interact with, specific features of the back-end, and the command-line tools, respectively. Note that the tests can be run from a TestSuite that is available in each directory, while new tests should be added to the DynamicSuiteBuilder in each directory. Note also that the tests in the model directory can be run against a variety of back-ends and two serialization techniques.
By default, tests will run against a mini-HBase cluster that starts up on the localhost. However, if you override this via Constants or the ~/.seqware/settings file, the tests that run against the simpler back-ends with no optimization will slow down after the full test suite has been run multiple times. This can be fixed by running the following commands via the HBase shell in order to clear out all stored data:
hbase shell
disable_all '.*'
drop_all '.*'
Note: This will destroy all data on the HBase storage. If you have any doubt or are working in a production environment, it is better to restrict the delete to one namespace, for example disable_all '
Our unit tests are designed to test and highlight various features that are available in the Query Engine. Some highlights include:
Seeing how the command-line tools work in com.github.seqware.queryengine.system is a good introduction to our Query Engine. In particular, VCFImportWorker and VCFDumper show how a front-end developer should interact with our code. New entities are created via the CreateUpdateManager and queries are performed through the QueryInterface. Note the use of a Builder Design Pattern in the ImportWorker. This allows us to abstract the actual implementation of many of our model classes and emphasizes the idea that we create objects that are largely immutable once we write them to the database, although we can always write new versions.
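For readers unfamiliar with the Builder Design Pattern mentioned above, the following toy sketch shows the general idea: an object is assembled through a mutable builder, is immutable once built, and a changed "new version" is produced by building again. ToyFeature and all of its methods are hypothetical and do not mirror SeqWare's actual model classes.

```java
/** Toy immutable value object with a builder; not a SeqWare model class. */
final class ToyFeature {
    private final String seqid;
    private final long start;
    private final long stop;

    private ToyFeature(Builder builder) {
        this.seqid = builder.seqid;
        this.start = builder.start;
        this.stop = builder.stop;
    }

    String getSeqid() { return seqid; }
    long getStart() { return start; }
    long getStop() { return stop; }

    static Builder newBuilder() {
        return new Builder();
    }

    /** Starts a new version pre-filled with this object's values. */
    Builder toBuilder() {
        return new Builder().setSeqid(seqid).setStart(start).setStop(stop);
    }

    static final class Builder {
        private String seqid = "";
        private long start;
        private long stop;

        Builder setSeqid(String seqid) { this.seqid = seqid; return this; }
        Builder setStart(long start) { this.start = start; return this; }
        Builder setStop(long stop) { this.stop = stop; return this; }

        /** Validates the collected values and creates the immutable object. */
        ToyFeature build() {
            if (stop < start) {
                throw new IllegalStateException("stop must not be before start");
            }
            return new ToyFeature(this);
        }
    }

    public static void main(String[] args) {
        ToyFeature first = ToyFeature.newBuilder().setSeqid("21").setStart(100).setStop(101).build();
        ToyFeature second = first.toBuilder().setStop(150).build();
        System.out.println(first.getStop() + " " + second.getStop());
    }
}
```

Because callers only see the builder and the getters, the concrete class behind them can change without affecting client code, which matches the abstraction benefit described above.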
Currently, the plug-in infrastructure allows you to:
installAnalysisPlugin, getAnalysisPlugins, and getFeaturesByPlugin methods in the QueryInterface class and demonstrated in QueryInterfaceTest.
com.github.seqware.queryengine.plugins.inmemory directory.
AnalysisPluginInterface directly or (preferred) extending the plugins in com.github.seqware.queryengine.plugins.hbasemr
We intend to further clean up the plug-in architecture, improve the persistence of analysis events, and add support for scan plug-ins.
In the future, we will be developing: