The SeqWare Pipeline sub-project is the heart of the overall SeqWare
project. It provides the core functionality of SeqWare: a workflow
developer environment and a series of tools for installing, running, and
monitoring workflows.
We currently support one workflow language (Java) and four
workflow engines (oozie, oozie-sge, whitestar, and whitestar-sge).
(Previously, we also supported Pegasus/Condor/Globus as a workflow engine).
Our current recommended combination is
Java workflows with the Oozie-sge engine.
- Oozie uses the Hadoop workflow scheduler to schedule workflow steps on the Hadoop ecosystem (JobTrackers and TaskTrackers).
- Oozie-sge uses Oozie in conjunction with an oozie-sge plugin to schedule workflow steps on a pre-existing SGE cluster.
- WhiteStar is a synchronous workflow engine that SeqWare developers use for debugging; it runs steps locally via Bash.
- WhiteStar-sge runs steps on a local SGE cluster.
We highly recommend you go through the
Admin tutorials since the
documentation below assumes you already have.
SeqWare Pipeline has several key features that distinguish it from other open source and private workflow solutions. These include:
- focuses on being a developer framework
- focuses on automated analysis
- includes cluster abstraction
- supports detailed provenance tracking
- supports user-created workflows
- implements a self-contained workflow packaging standard
- includes fault tolerance
- focuses on meeting workflow needs of big projects (thousands of samples)
- is open source
See About for more information.
Building and Installing
- This is our installation guide based on VMs that we recommend for most users. You will be left with a functioning SeqWare install including SeqWare Pipeline.
- Installation From Scratch
- This guide walks you through how we built the VMs and will be of interest to anyone that needs to see the details of SeqWare setup starting with an empty Linux server. It is complicated so we highly recommend using a VM (which can be connected to a real cluster).
- Building from Source
- These directions show you how to build the whole project, including SeqWare Pipeline, using Maven.
- User Settings
- Information about configuring user settings files.
- Monitor Configuration
- Setting up the SeqWare-associated tools that must be running in order to launch and monitor workflows.
- Connecting to a Real Cluster
- Once you are happy with writing, installing, and running workflows on a stand-alone VM you will want to connect to a “real” cluster. This guide walks you through the process of connecting a VM to a cluster (HPC & Hadoop, depending on your workflow engine of choice).
Workflows define a series of steps and how they relate to each other.
Typically, these encode a series of calls to command line tools that operate on
files read from and written to a shared filesystem. Individual steps usually
run on a randomly chosen cluster node.
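To make the idea of steps and their relationships concrete, here is a minimal Python sketch. It is not SeqWare's Java workflow API. It models a workflow as a mapping from step names to a shell command plus a list of parent steps, then computes an order in which an engine could run them. The step names, commands, and paths are all hypothetical placeholders.

```python
from collections import deque

# Hypothetical workflow: step name -> (shell command, names of parent steps).
# The commands and paths are placeholders chosen only for illustration.
WORKFLOW = {
    "fetch_input": ("cp /shared/in/sample.txt work/sample.txt", []),
    "count_lines": ("wc -l work/sample.txt > work/lines.txt", ["fetch_input"]),
    "count_words": ("wc -w work/sample.txt > work/words.txt", ["fetch_input"]),
    "merge_report": (
        "cat work/lines.txt work/words.txt > /shared/out/report.txt",
        ["count_lines", "count_words"],
    ),
}


def execution_order(workflow):
    """Return step names ordered so that every step follows all of its parents."""
    # Number of parents each step is still waiting on.
    remaining = {name: len(parents) for name, (_, parents) in workflow.items()}
    # Reverse lookup: which steps depend on a given step.
    children = {name: [] for name in workflow}
    for name, (_, parents) in workflow.items():
        for parent in parents:
            children[parent].append(name)

    # Start with the steps that have no parents (sorted for a stable result).
    ready = deque(sorted(name for name, count in remaining.items() if count == 0))
    order = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for child in sorted(children[step]):
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    if len(order) != len(workflow):
        raise ValueError("workflow contains a dependency cycle")
    return order


if __name__ == "__main__":
    for step in execution_order(WORKFLOW):
        print(step, "->", WORKFLOW[step][0])
```

Steps with no dependency between them (here `count_lines` and `count_words`) could run in parallel on different cluster nodes; the sketch only produces one valid serial order.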
- Java Workflows
- This is our newer workflow language. It is much simpler and more expressive than the FTL, and we recommend it for all new workflow development.
- Workflow Bundle Conventions
- We rely on a bundle format for packaging up and exchanging workflows. This document describes the format and directory structure.
- Workflow Config Files
- This document describes the ini configuration file used to describe (and type) workflow parameters.
- Workflow Metadata File
- This document describes the metadata XML file used to describe workflows. It provides workflow names, versions, descriptions, and information for running and testing the workflow.
- File Type Conventions
- This document describes the standardized file meta types (MIME-like types) we use in the project and how to add files to a community-writable file type registration.
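As a rough illustration of typed workflow parameters, the following Python sketch reads key=value lines and converts each value to a declared type. The parameter names, the `PARAM_TYPES` table, and the plain key=value layout are assumptions made for this example; the Workflow Config Files document remains the reference for the real ini syntax.

```python
# Hypothetical ini content; keys and values are made up for illustration.
EXAMPLE_INI = """\
# lines starting with '#' are treated as comments in this sketch
input_file=/shared/in/sample.txt
output_dir=results
threads=4
keep_temp=false
"""

# Hypothetical type declarations for the parameters above.
PARAM_TYPES = {
    "input_file": str,
    "output_dir": str,
    "threads": int,
    "keep_temp": bool,
}


def parse_bool(text):
    """Accept only 'true' or 'false' (case-insensitive)."""
    lowered = text.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    raise ValueError(f"not a boolean: {text!r}")


def parse_ini(text, types):
    """Parse key=value lines into a dict, converting values to declared types."""
    params = {}
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"line {line_number}: expected key=value, got {raw!r}")
        key, value = key.strip(), value.strip()
        kind = types.get(key, str)  # undeclared keys stay strings
        params[key] = parse_bool(value) if kind is bool else kind(value)
    return params


if __name__ == "__main__":
    print(parse_ini(EXAMPLE_INI, PARAM_TYPES))
```

Rejecting values that do not match their declared type at parse time means a mistyped parameter fails before any workflow step is scheduled.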
Modules are really optional for those interested in workflow development since
most workflows simply refer to command line tools bundled inside the workflow.
For those interested in extending the underlying SeqWare system, Modules
provide a way to define new step types and could be useful for writing custom
steps that interact with databases, trigger analysis in other frameworks
(Pig/Hive/MapReduce), make calls to web services, etc. We use Modules to
provide core services in SeqWare (such as file provisioning and bash shell
execution). Again, Modules are mainly targeted at core SeqWare developers, not
general workflow developers.
- Writing Modules
- How to extend SeqWare with Java tool wrappers. These can be used in workflows or as stand-alone utilities that know how to record provenance data back to SeqWare MetaDB.
The Deciders framework allows for the automatic parameterization and calling of workflows in SeqWare Pipeline. It allows you to easily encode the parent workflow and file types that, when present, enable a subsequent workflow to be launched.
- Basic Deciders
- A generic Decider that can be used to launch a workflow using simple criteria like parent workflow and input file type.
- Making a Custom Decider
- How to create a custom decider for your workflow, useful if your logic for running your workflow is more complicated than simple parent workflow + input file requirements.
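The selection logic described above might look roughly like the following Python sketch. It is not the SeqWare Decider API: `FileRecord`, `decide`, the workflow name `Aligner`, and the meta type strings are hypothetical and only illustrate matching on parent workflow plus file type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """A simplified, hypothetical view of a file tracked in the metadata database."""
    path: str
    meta_type: str        # MIME-like file type
    parent_workflow: str  # workflow that produced the file
    sample: str


def decide(files, parent_workflow, meta_types, already_processed=frozenset()):
    """Return (sample, input paths) pairs for which a follow-on run should launch.

    A file qualifies when it came from the given parent workflow, has one of
    the accepted meta types, and its sample has not been processed already.
    """
    grouped = {}
    for record in files:
        if record.parent_workflow != parent_workflow:
            continue
        if record.meta_type not in meta_types:
            continue
        if record.sample in already_processed:
            continue
        grouped.setdefault(record.sample, []).append(record.path)
    return sorted(grouped.items())


# Hypothetical records for three samples.
FILES = [
    FileRecord("/shared/s1.bam", "application/bam", "Aligner", "s1"),
    FileRecord("/shared/s1.log", "text/plain", "Aligner", "s1"),
    FileRecord("/shared/s2.bam", "application/bam", "Aligner", "s2"),
    FileRecord("/shared/s3.bam", "application/bam", "OtherWorkflow", "s3"),
]

if __name__ == "__main__":
    for sample, inputs in decide(FILES, "Aligner", {"application/bam"}, {"s2"}):
        print("launch for", sample, "with", inputs)
```

Tracking already-processed samples keeps the sketch from proposing the same run twice when it is invoked repeatedly, which matters for any decider that runs on a schedule.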
A major focus of the SeqWare Web Service is providing reporting resources. These are command line tools that are particularly useful for generating reports for SeqWare entities such as workflow runs and their outputs.
- seqware files report
- Gives you a view of all files and their position in the database hierarchy, from study on down.
- Workflow Run Reporter
- Finds the identity and library samples, as well as the input and output files, for one or more workflow runs.
Command Line Reference
We have provided a new, simplified command line interface. The best way to learn its features is to add --help to any command:
    $ seqware --help
    Usage: seqware [<flag>]
           seqware <command> [--help]

      annotate      Add arbitrary key/value pairs to seqware objects
      bundle        Interact with a workflow bundle during development/admin
      copy          Copy files between local and remote file systems
      create        Create new seqware objects (e.g., study)
      files         Extract information about workflow output files
      study         Extract information about studies
      workflow      Interact with workflows
      workflow-run  Interact with workflow runs
      checkdb       Check the seqware database for convention errors
      check         Check the seqware environment for configuration issues

      --help        Print help out
      --version     Print Seqware's version
    $ seqware workflow --help
    Usage: seqware workflow [--help]
           seqware workflow <sub-command> [--help]

    Interact with workflows.

      ini       Generate an ini file for a workflow
      list      List all installed workflows
      report    List the details of all runs of a given workflow
      schedule  Schedule a workflow to be run
Most commands will print the help if no arguments are provided.
The old command line still exists. Its auto-generated documentation covers Plugins (utility tools used outside of workflows) and Modules (which model custom steps in workflows and know how to integrate with the SeqWare MetaDB for metadata writeback).
- Plugins
- The command line utilities of SeqWare.
- Modules
- Can be used as custom steps in workflows or on the command line. The most important modules are GenericCommandRunner and ProvisionFiles, which are used to call individual Bash steps in workflows and to move inputs and outputs around, respectively.