SeqWare › SeqWare Query Engine

Overview

The SeqWare Query Engine is intended to be a universal store for sequence variants. Features include fine-grained control over versioning, a plug-in framework for new MapReduce plug-ins, rich value tags/annotations on most objects, and set organization. The current iteration of the SeqWare Query engine has the ability to store and search data stored in a modern NoSQL database, HBase, while using Google’s Protocol Buffers for serialization.

Building and Installing

Most users will want to use our pre-configured VMs, see the SeqWare Install Guide for how to get the VM.

Please see the Install Guide for installing Query Engine from scratch.

Loading Data

We currently load data via our command-line programs. In order to do this, you will want to go to the root directory and compile a full version of our jar with dependencies included. You will probably wish to skip the tests for our other components. This should look something like:

  cd ~/gitroot/seqware/
  mvn clean install
  cd seqware-distribution/target

In this distribution directory, you can run our command line tools for import.

The first time you run these tools in a new namespace, you may wish to create a common reference and then create an ad hoc tag set that will store all tags that do not match known tags. Most of our command-line tools output a key value file that you can keep in order to record the ID of your created objects.

~/seqware_github/seqware-distribution/target$ java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.ReferenceCreator
Only 0 arguments found
ReferenceCreator  [output_file]
~/seqware_github/seqware-distribution/target$ java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.ReferenceCreator hg_19 keyValue_ref.out
Reference written with an ID of:
hg_19
~/seqware_github/seqware-distribution/target$ cat keyValue_ref.out
referenceID    hg_19
~/seqware_github/seqware-distribution/target$ java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.TagSetCreator
Only 0 arguments found
TagSetCreator  [output_file]
~/seqware_github/seqware-distribution/target$ java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.TagSetCreator ad_hoc keyValue_adHoc.out
TagSet written with an ID of:
ad_hoc
~/seqware_github/seqware-distribution/target$ cat keyValue_adHoc.out
TagSetID    ad_hoc
namespace   BATMAN

You may also wish to pre-populate the database with (Sequence Ontology) SO terms in a TagSet:

~/seqware_github/seqware-distribution/target$ java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.importers.OBOImporter
Only 0 arguments found
OBOImporter  [output_file]
~/seqware_github/seqware-distribution/target$ java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.importers.OBOImporter ../../seqware-queryengine-backend/src/test/resources/com/github/seqware/queryengine/system/so.obo keyValueOBO.out
6861 terms written to a TagSet written with an ID of:
42860461-0620-4990-bf15-32e6d34701b3
dyuen@odl-dyuen:~/seqware_github/seqware-distribution/target$ cat keyValueOBO.out
TagSetID    42860461-0620-4990-bf15-32e6d34701b3
namespace   BATMAN

The previous steps should only really need to be done once when first setting up a namespace. Afterwards, the VCF file importer can be called repeatedly for each of your datasets. Note that you will need to substitute the TagSetID from the previous step into the next step.

~/seqware_github/seqware-distribution/target$ java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.importers.SOFeatureImporter
usage: SOFeatureImporter
 -a    (optional) an ID for an ad hoc TagSet, Tags will
                    either be found or added to this set, a new TagSet
                    will be generated if this option is not used
 -b     (optional) batch-size for the number of features in
                    memory to keep before a flush, will automatically be
                    chosen if not specified, we use 100000 for now
 -c                 (optional) whether we are working with compressed
                    input
 -f     (optional) for benchmarking for now, append features
                    to an existing featureset
 -i      (required) comma separated input files
 -o     (optional) output file with our resulting key values
 -r      (required) the reference ID to attach our FeatureSet
                    to
 -s         (optional) comma separated TagSet IDs, new Tags will
                    be linked to the first set that they appear, these
                    TagSets will not be modified
 -t        (optional: default 1) the number of threads to use in
                    our import
 -w         (required) the work module and thus the type of file
                    we are working with
~/seqware_github/seqware-distribution/target$ java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.importers.SOFeatureImporter -i ../../seqware-queryengine-backend/src/test/resources/com/github/seqware/queryengine/system/FeatureImporter/consequences_annotated.vcf -o keyValueVCF.out -r hg_19 -s 42860461-0620-4990-bf15-32e6d34701b3 -a ad_hoc -w VCFVariantImportWorker 
FeatureSet written with an ID of:
4bd2ced0-5e37-4930-bffe-207b862a09a6
~/seqware_github/seqware-distribution/target$  echo $? 
0
~/seqware_github/seqware-distribution/target$  cat keyValueVCF.out 
FeatureSetID    4bd2ced0-5e37-4930-bffe-207b862a09a6

The ID of your feature sets and parameters will obviously change from run to run. Depending on the size of your data, you may also need to either tune the size of your batches (via the '-b option) when loading features or allocate more memory to java (java -Xmx4096m -classpath ... ).

Querying Data

The SOFeatureImporter will output a FeatureSet ID that should be used as part of the input to the VCFDumper command in order to export a FeatureSet.

~/seqware_github/seqware-distribution/target$   java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.exporters.VCFDumper
0 arguments found
VCFDumper  [outputFile]
~/seqware_github/seqware-distribution/target$   java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.exporters.VCFDumper 4bd2ced0-5e37-4930-bffe-207b862a09a6 test_out.vcf
~/seqware_github/seqware-distribution/target$   sort test_out.vcf > sorted_test_out.vcf
~/seqware_github/seqware-distribution/target$   sort ../../seqware-queryengine-backend/src/test/resources/com/github/seqware/queryengine/system/FeatureImporter/consequences_annotated.vcf  > control.vcf
~/seqware_github/seqware-distribution/target$   diff -b sorted_test_out.vcf control.vcf

Most of these commands use non-zero return codes to indicate that an error occurred. For example:

~/seqware_github/seqware-distribution/target$ java -classpath seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.importers.SOFeatureImporter -i ../../seqware-queryengine-backend/src/test/resources/com/github/seqware/queryengine/system/FeatureImporter/test_invalid.vcf -o keyValueVCF.out -r hg_19 -s 42860461-0620-4990-bf15-32e6d34701b3 -a ad_hoc -w VCFVariantImportWorker
[SeqWare Query Engine] 1    [Thread-3] FATAL com.github.seqware.queryengine.system.importers.workers.VCFVariantImportWorker  - Exception thrown with file: ../../seqware-queryengine-backend/src/test/resources/com/github/seqware/queryengine/system/FeatureImporter/test_invalid.vcf
java.lang.NumberFormatException: For input string: “51xxx”
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Integer.parseInt(Integer.java:458)
    at java.lang.Integer.parseInt(Integer.java:499)
    at com.github.seqware.queryengine.system.importers.workers.VCFVariantImportWorker.run(VCFVariantImportWorker.java:285)
~/seqware_github/seqware-distribution/target$   echo $?
10

The default VCFDumper simply exports all Features that are in a FeatureSet. However, it is also an excellent starting point for experimenting with new queries and various features of the query engine.

One example of how one may wish to extend and adapt the VCFDumper is available in com.github.seqware.queryengine.tutorial.BrianTest while specific queries can be found in com.github.seqware.queryengine.model.test.QueryInterfaceTest and copy-and-pasted into BrianTest.

Note: More details on the inner workings of the query engine will be available in “Extending the Query Engine.” However, for a developer that will primarily interact with the model objects and command-line tools, the most important distinction to be aware of is which operations update individual features in a feature set and which operations perform a copy-on-write operation for the whole feature set since the latter is much more expensive. In general, operations that iterate through a feature set, read or update individual features. Operations that go through the Query Interface and call plug-ins will perform a copy-on-write. Operations that perform a copy-on-write always provide a TTL (time-to-live) parameter.

QueryInterfaceTest brings us to the code in the testing directories which demonstrate many of the features available to the Query Engine.

Querying Data via Query Language

The QueryVCFDumper allows you to test queries written in our query language. This class takes in a featureID as input in order to perform a query and output the last feature set to VCF format. Our query language is currently specified here but will be migrated to this site.

Here is an example of how to interact with the utility, here we run through compiling a few queries and running them.

~/seqware_github/seqware-distribution/target$   java -cp seqware-distribution-1.0.6-qe-full.jar  com.github.seqware.queryengine.system.exporters.QueryVCFDumper
usage: QueryVCFDumper
 -f      (required) the ID of the featureset that we will
                         be querying and exporting
 -k       (optional) a key value file that includes the
                         featureset ID of each featureset that is created
                         during querying and the final featureset ID
 -o         (optional) output file for the VCF
 -p    (required) full classname for the class
                         implementing QueryDumperInterface
 -s        (required) plain text query
[SeqWare Query Engine] 0    [main] FATAL com.github.seqware.queryengine.system.exporters.QueryVCFDumper  - 
org.apache.commons.cli.MissingOptionException: Missing required options: f, [-s (required) plain text query, -p (required) full classname for the class implementing QueryDumperInterface]
    at org.apache.commons.cli.Parser.checkRequiredOptions(Parser.java:299)
    at org.apache.commons.cli.Parser.parse(Parser.java:231)
    at org.apache.commons.cli.Parser.parse(Parser.java:85)
    at com.github.seqware.queryengine.system.exporters.QueryVCFDumper.runMain(QueryVCFDumper.java:78)
    at com.github.seqware.queryengine.system.exporters.QueryVCFDumper.main(QueryVCFDumper.java:42)
~/seqware_github/seqware-distribution/target$ java -classpath  seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.exporters.QueryVCFDumper -f 0737bf87-28c5-4323-a285-7898137e22ab -k keyValue.out -o output.vcf -s "seqid==\"21\""
~/seqware_github/seqware-distribution/target$ java -classpath  seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.exporters.QueryVCFDumper -f 0737bf87-28c5-4323-a285-7898137e22ab -k keyValue.out -o output.vcf -s "seqid==\"21\" && start >= 20000000 && stop <= 30000000"
~/seqware_github/seqware-distribution/target$ java -classpath  seqware-distribution-1.0.6-qe-full.jar com.github.seqware.queryengine.system.exporters.QueryVCFDumper -f 0737bf87-28c5-4323-a285-7898137e22ab -k keyValue.out -o output.vcf -s "seqid==\"21\" && start >= 20000000 && stop <= 30000000 && tagOccurrence(\"ad_hoc\",\"non_synonymous_codon\")"

Querying Data via Java API Calls

The QueryVCFDumper also allows you to quickly test Java queries written against the QueryInterface directly. This class takes in a featureID as input and a class name for a class that implements the com.github.seqware.queryengine.system.exporters.QueryDumperInterface in order to perform a few queries and output the last feature set to VCF format.

Here is an example of how to interact with the utility, here we run through compiling a few queries and running them.

~/seqware_github/seqware-distribution/target$   cp ../../seqware-queryengine-backend/src/test/java/com/github/seqware/queryengine/system/test/queryDumper/VCFDumperParameterExample.java QueryTutorial.java
~/seqware_github/seqware-distribution/target$  gvim QueryTutorial.java
~/seqware_github/seqware-distribution/target$  javac -cp seqware-distribution-1.0.6-qe-full.jar QueryTutorial.java
~/seqware_github/seqware-distribution/target$  java -cp .:seqware-distribution-1.0.6-qe-full.jar  com.github.seqware.queryengine.system.exporters.QueryVCFDumper -f 4bd2ced0-5e37-4930-bffe-207b862a09a6 -k keyValue.out -o output.vcf -p QueryTutorial

During the gvim step, it is important to delete the package line, delete the import from the test package, change the classname, and then perform the required changes to the queries. For example, to search for intron_variants rather than non_synonymous_codon, we will need to change the file to the following:

/*
 * Copyright (C) 2012 SeqWare
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program.  If not, see .
 */

import com.github.seqware.queryengine.factory.SWQEFactory;
import com.github.seqware.queryengine.kernel.RPNStack;
import com.github.seqware.queryengine.model.FeatureSet;
import com.github.seqware.queryengine.model.QueryFuture;
import com.github.seqware.queryengine.model.QueryInterface;
import com.github.seqware.queryengine.system.exporters.QueryDumperInterface;

/**
 * An example of a parameter file. See more possible Queries in {@link QueryInterfaceTest}.
 * @author dyuen
 */
public class QueryTutorial implements QueryDumperInterface{

    @Override
    public int getNumQueries() {
        // we will run three queries
        return 3;
    }

    @Override
    public QueryFuture getQuery(FeatureSet set, int queryNum) {
        if (queryNum == 0){
            /// limits us to CHROM #21
            return SWQEFactory.getQueryInterface().getFeaturesByAttributes(0, set, new RPNStack(new RPNStack.FeatureAttribute("seqid"), new RPNStack.Constant("21"), RPNStack.Operation.EQUAL));
        } else if (queryNum == 1){
            // limits us to the range of 20000000 through 30000000
            return SWQEFactory.getQueryInterface().getFeaturesByRange(0, set, QueryInterface.Location.INCLUDES, "21", 20000000, 30000000);
        } else{
            // limits us to features with a particular tag
            return SWQEFactory.getQueryInterface().getFeaturesByAttributes(0, set, new RPNStack(new RPNStack.TagOccurrence("ad_hoc", "intron_variant")));
        }
    }
    
}

Testing

The testing directories are com.github.seqware.queryengine.model.test, com.github.seqware.queryengine.impl.test, and com.github.seqware.queryengine.system.test. These directories test the model objects that outside developers can manipulate and interact with, specific features of the back-end, and the command-line tools respectively. Note that the tests can be run from a TestSuite that is available in each directory while new tests should be added to the DynamicSuiteBuilder in each directory. Note that the tests in the model directory can be run against a variety of back-ends and two serialization techniques.

By default, tests will run against a mini-HBase cluster that starts up on the localhost. However, if you override this via Constants or the ~/.seqware/settings file, after running through the full test suite multiple times, the tests run against the simpler back-ends with no optimization will slow down. This can be fixed by running the following code via the HBase shell in order to clear out all stored data:

  hbase shell
  disable_all '.*'
  drop_all '.*'

Note: This will destroy all data on the HBase storage. If you have any doubt or are working in a production environment, it is better to restrict the delete to one namespace, for example disable_all '.*'

Features

Our unit tests are designed to test and highlight various features that are available in the Query Engine. Some highlights include:

QueryInterface - demonstrates our query language and running queries that create new filtered feature sets
TaggableTest - demonstrates how to associate tags, read tags, and what can be stored in tags
TTLTest - demonstrates storing and reading time-to-live fields
FriendlyNameTest - an interface for explicitly naming objects rather than using randomly generated IDs
SOFeatureImporterTest - demonstrates various some options available to the feature importer command-line tool

Extending the Query Engine

Seeing how the command-line tools work in com.github.seqware.queryengine.system is a good introduction to our QueryEngine. In particular, VCFImportWorker and VCFDumper show how a front-end developer should interact with our code. New entities are created via the CreateUpdateManager and Queries are performed through the QueryInterface. Note the use of a Builder Design Pattern in the ImportWorker. This allows us to abstract the actual implementation of many of our model classes and emphasizes the idea that we create objects that are largely immutable once we write them to the database although we can always write new versions.

Making Custom Plugins

Currently, the plug-in infrastructure allows you to:

Install, list, and call arbitrary plug-ins via the installAnalysisPlugin, getAnalysisPlugins, and getFeaturesByPlugin methods in the QueryInterface class and demonstrated in QueryInterfaceTest.
Specify and prototype new plug-ins by working in the com.github.seqware.queryengine.plugins.inmemory directory.
Create new Map/Reduce plugins by extending AnalysisPluginInterface directly or (preferred) extending the plugins in com.github.seqware.queryengine.plugins.hbasemr

We intend on further cleaning up the plug-in architecture, improving the persistence of analysis events, and adding support for scan plug-ins.

More to Come …

In the future, we will be developing:

REST API: We will be developing a RESTful API to allow for the development of web applications and to allow users to query and view their data easily.
Web App: A sample web application demoing genome browsing