SeqWare › SeqWare Pipeline: Basic Deciders

Overview

So far you’ve seen how to write a module to wrap a given tool in a consistent way. You’ve also seen how to write a workflow that links together multiple modules and parameterizes them through a simple ini file. Finally, you can put these together and run the workflow using the commands you saw in the introduction with HelloWorld. But the reality of large projects dictates that you can’t run each workflow manually. That’s where deciders come in, they essentially are a bit of code that links up the metadata in the MetaDB with a given workflow based on a set of rules encoded in the decider. At UNC and OICR we have deciders for each of our workflows and they let us process data on hourly cron jobs. These deciders, for example, look in the database for all RNASeq human samples that have not previously been processed through a workflows, pulls back the needed data from the MetaDB, and launches a workflow run for that particular lane. This is the heart of what a decider is trying to do, it links metadata to the actual execution of workflows in an automated way.

So we created the deciders for:

automation, running workflows as a cron
a place for nasty code, frequently changing code, other code that is site specific

For the last point we’ve tried to make modules very clean and generic with little business logic, mainly focused on parameterization for behavior. The workflows are very dumb as well and really depend on parameterization. So we’ve essentially used the deciders as places to contain logic that might be site specific or simply frequently changing.

API Classes

The interface for a decider is DeciderInterface. In practice, when creating your own deciders you will probably extend BasicDecider

BasicDecider

Full Command-line options for the Basic Decider are available. Below, we will high-light some key points.

Required parameters of note:

wf-accession: The SWID for the workflow that you wish to run
study-name OR sample-name OR all OR sequencer-run-name: Limit the subset of samples to look for files associated with particular study, sample, sequencer-run, or just run over everything.
parent-wf-accessions OR meta-types: Limit the files under consideration to those of specific file types (such as (gz) files or those generated by particular parent workflows.

Optional parameters of note:

check-wf-accessions: Comma-separated accessions of workflows that you may already be satisfied with (for example, satisfactory runs of an older version of a workflow). If the input files in a previous workflow run in this list match the current workflow run under consideration, it will not be re-run.
rerun-max: The maximum number of times to re-launch a workflowrun if failed. Default: 5.
group-by: Group by one of the headings in FindAllTheFilesHeader. Default: FILE_SWA.

When extending the BasicDecider, you may also place the parent-wf-accessions, check-wf-accessions, and wf-accession into a decider.properties file to permanently set these values.

Decision Logic

The BasicDecider performs the following steps:

Pull down the list of relevant file paths from study down to file (based on a starting point of a specific study, sample, sequencer-run or across all studies)
Group file paths based on the grouping strategy determined by “–group-by”. This behaviour can be overridden in custom methods by overriding “handleGroupByAttribute(String)” in the BasicDecider
Files can be omitted from groups in custom deciders by overridding the “checkFileDetails(ReturnValue, FileMetadata)” method
For each grouping determine whether there exists a blocking workflow run in the metadb based on the following decision table
Create the ini file. This behaviour can be overriden by overriding “modifyIniFile(String, String)” method

Decision Table

A general guideline here is that we only count failures when they are associated with workflow runs that ran on the same files (in the metadb) as we are currently attempting to run on.

Status of previous workflow run	1 (Input files are disjoint from previous run)	2 (Input files partially overlap a previous run)	3 (Input files match a previous run exactly)	4 (Past run contains all current input files)
Failed	ignore (allow rerun)	allow re-run	count as failure, re-run if less than max-failures	allow re-run (with warning message)
Other (pending, running, no status, etc.)	ignore (allow re-run)	allow re-run	block re-run	block re-run
Completed	ignore (allow rerun)	allow re-run	block re-run	do not re-run

In addition, if for a group of files, we also count up more failures than rerun-max, then we also block a re-run.

Search for Relevant Workflow Runs

In order to determine whether there are existing workflow runs in the database that would block re-running a workflow we use a different method depending on the version of decider.

For 1.0.X deciders, when looking for previous workflow runs, we look in a table, workflow_run_input_files. This new table specifically stores the input files and is populated at schedule time by the workflow launcher.

For 0.13.6.X deciders, when looking for previous workflow runs run on these files, we will search under three search patterns.

Look for Workflow Runs which are linked to a File via the File, processing_file, Processing, processing_relationship, Processing, Workflow_Run path. i.e. Are there completed workflow runs (whether previous runs or previous workflows to check) that disallow running on this file?
Look for Workflow Runs which are linked to a File via the File, processing_file, Processing, Workflow_Run, IUS, Workflow_Run path. In addition, for each of these resulting Workflow Runs, we ensure that they are not connected to Processing. i.e. Are there workflow runs in progress that share the same IUS as one of our files (and thus are not hooked up to the processing hierarchy)?
The same as 2. except with Lane. i.e. Are there workflow runs in progress that share the same Lane as one of our files (and thus are not hooked up to the processing hierarchy)?

Tutorial

We will demonstrate usage of the basic decider by creating a few toy workflows and connecting them up via basic decider calls.

As a pre-requisite, you need a clean meta-db. You can clean your postgres database quickly if it is named “test_seqware_meta_db” with the following postgres commands from the seqware-meta-db directory:

~/seqware_github/seqware-meta-db$ dropdb test_seqware_meta_db
~/seqware_github/seqware-meta-db$ createdb test_seqware_meta_db
~/seqware_github/seqware-meta-db$ psql test_seqware_meta_db < seqware_meta_db.sql
~/seqware_github/seqware-meta-db$ psql test_seqware_meta_db < seqware_meta_db_data.sql

Next, create required metadata that represents the way your experiments are organized. These steps mimic the start of the user tutorial User Tutorial:

$ seqware create study --title 'Study1' --description 'This is a test description' --accession 'InternalID123' --center-name 'SeqWare' --center-project-name 'SeqWare Test Project' --study-type 4

Created study with SWID: 1 

$ seqware create experiment --title 'New Test Experiment' --description 'This is a test description' --platform-id 26 --study-accession 1

Created experiment with SWID:2 

$ seqware create sample --title 'New Test Sample' --description 'This is a test description' --organism-id 26 --experiment-accession 2 

Created sample with SWID: 3

Next, create some files that represent data you will be operating on and a target directory for results:

$ mkdir -p /datastore/seqware-results
$ echo 'testing basic workflows' > /datastore/input.txt

First, create a sequencer_run, lane, and ius. These structures represent a lane from a sequencer run and an independent unit of sequencing. For this toy example, we will link new files to these starting points. (Note that the created SWIDs will vary depending on how much work you’ve done)

$ seqware create sequencer-run --description description --file-path file_path --name name --paired-end paired_end --platform-accession 26  --skip false

Created sequencer run with SWID: 4 

$ seqware create lane --sequencer-run-accession 4 --study-type-accession 4 --cycle-descriptor cycle_descriptor --description description --lane-number 1 --library-selection-accession 25 --library-source-accession 5 --library-strategy-accession 21 --name name --skip false

Created lane with SWID: 5 

$ seqware create ius --barcode barcode --description description --lane-accession 5 --name name --sample-accession  3 --skip false

Created IUS with SWID: 6

Next, you need to inject your input file that will be used as input for the workflows that you are about to create. When using the decider framework, the starting point for your deciders is a root workflow run. So we create a “root” workflow and create a workflow run that has as its output the input file from the tutorial.

$ seqware create workflow --name FileImport --version 1.0 --description description
Added 'FileImport' (SWID: 7)
Created workflow 'FileImport' version 1.0 with SWID: 7

$ seqware create workflow-run  --workflow-accession 7 --file imported_file::text/plain::/datastore/input.txt --parent-accession 5 --parent-accession 6
Created workflow run with SWID: 8

Next, we create two workflows, one which converts text files to tar format, and a second which converts tar files to tar.gz format. Here, we assume that you have already run through the Developer Tutorial. Use Maven Archetypes to generate two workflows in the workflow-dev directory called TarWorkflow and GZWorkflow. Use the following parameters for your archetypes.

Define value for property 'package': : com.github.seqware
Define value for property 'groupId':  com.github.seqware: : 
Define value for property 'workflow-name': : Tar
Define value for property 'artifactId':  workflow-Tar: : Tar
Define value for property 'version':  1.0-SNAPSHOT: : 
[INFO] Using property: package = com.github.seqware
Confirm properties configuration:
package: com.github.seqware
groupId: com.github.seqware
workflow-name: Tar
artifactId: Tar
version: 1.0-SNAPSHOT
package: com.github.seqware

Define value for property 'package': : com.github.seqware
Define value for property 'groupId':  com.github.seqware: : 
Define value for property 'workflow-name': : GZ
Define value for property 'artifactId':  workflow-GZ: : GZ
Define value for property 'version':  1.0-SNAPSHOT: : 
[INFO] Using property: package = com.github.seqware
Confirm properties configuration:
package: com.github.seqware
groupId: com.github.seqware
workflow-name: GZ
artifactId: GZ
version: 1.0-SNAPSHOT
package: com.github.seqware

You will need to edit two files in each workflow, the workflow Java client and the workflow.ini.

For TarWorkflow/src/main/java/com/github/seqware/TarWorkflow.java:

package com.github.seqware;

import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;
import net.sourceforge.seqware.pipeline.workflowV2.AbstractWorkflowDataModel;
import net.sourceforge.seqware.pipeline.workflowV2.model.Job;
import net.sourceforge.seqware.pipeline.workflowV2.model.SqwFile;

public class TarWorkflow extends AbstractWorkflowDataModel {

    @Override
    public Map<String, SqwFile> setupFiles() {

	try {
	    // register an input file from the workflow.ini
	    SqwFile file0 = this.createFile("file_in_0");
	    file0.setSourcePath(getProperty("input_files"));
	    file0.setType("text/plain");
	    file0.setIsInput(true);

	    // register an output file
	    SqwFile file1 = this.createFile("file_out");
	    file1.setSourcePath("input.tar");
	    file1.setType("application/x-tar");
	    file1.setIsOutput(true);
	    file1.setForceCopy(true);
	    // determines an output path based on the contents of the workflow.ini to follow and a random number (to avoid overwriting) 
 		    file1.setOutputPath(this.getMetadata_output_file_prefix() + getMetadata_output_dir() + "/" + this.getName() + "_" + this.getVersion() + "/" + this.getRandom()+"/"+"input.tar");
 
	    return this.getFiles();

	} catch (Exception ex) {
	    Logger.getLogger(this.getClass().getName()).log(Level.SEVERE, null, ex);
	    throw new RuntimeException(ex);
	}
    }

    @Override
    public void buildWorkflow() {
	Job job11 = this.getWorkflow().createBashJob("bash_tar");
	job11.setCommand("tar -cvf input.tar " + this.getFiles().get("file_in_0").getProvisionedPath());
    }
}

Note that input_files is automatically populated in the workflow.ini when the decider runs and generates ini files.

For GZ/src/main/java/com/github/seqware/GZWorkflow.java

package com.github.seqware;

import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;
import net.sourceforge.seqware.pipeline.workflowV2.AbstractWorkflowDataModel;
import net.sourceforge.seqware.pipeline.workflowV2.model.Job;
import net.sourceforge.seqware.pipeline.workflowV2.model.SqwFile;

public class GZWorkflow extends AbstractWorkflowDataModel {

    @Override
    public Map<String, SqwFile> setupFiles() {
	try {
	    // register an input file
	    SqwFile file0 = this.createFile("file_in_0");
	    file0.setSourcePath(getProperty("input_files"));
	    file0.setType("application/x-tar");
	    file0.setIsInput(true);
	    // register an output file
	    SqwFile file1 = this.createFile("file_out");
	    file1.setSourcePath("input.tar.gz");
	    file1.setType("application/gzip");
	    file1.setIsOutput(true);
	    file1.setForceCopy(true);
 		    // determines an output path based on the contents of the workflow.ini to follow and a random number (to avoid overwriting) 
 		    file1.setOutputPath(this.getMetadata_output_file_prefix() + getMetadata_output_dir() + "/" + this.getName() + "_" + this.getVersion() + "/" + this.getRandom() + "/" + "input.tar.gz");
	    return this.getFiles();
	} catch (Exception ex) {
	    Logger.getLogger(this.getClass().getName()).log(Level.SEVERE, null, ex);
	    return (null);
	}
    }

    @Override
    public void buildWorkflow() {
	Job job11 = this.getWorkflow().createBashJob("bash_gz");
	job11.setCommand("tar -zcvf input.tar.gz " + this.getFiles().get("file_in_0").getProvisionedPath());
    }
}

For Tar/workflow/config/TarWorkflow.ini and GZ/workflow/config/GZworkflow.ini:

# the output_prefix is used to specify the root of the absolute output path or an S3 bucket name 
# you should pick a path that is available on all cluster nodes and can be written by your user
output_prefix=/datastore/
# this indicates in what directory your results will be placed into under the output_prefix, in this example
# your results will reside in /datastore/seqware-results
output_dir=seqware-results

Build, package, and install both workflows. Record the accession for both workflows:

$ cd Tar/
$ mvn clean install 
(output snipped)
$ cd ..
$ cd GZ/
$ mvn clean install 
(output snipped)
$ cd ..
$ seqware bundle package --dir GZ/target/Workflow_Bundle_GZ_1.0-SNAPSHOT_SeqWare_1.1.0/
Validating Bundle structure
Packaging Bundle
Bundle has been packaged to /home/seqware/workflow-dev
$ seqware bundle package --dir Tar/target/Workflow_Bundle_Tar_1.0-SNAPSHOT_SeqWare_1.1.0/
Validating Bundle structure
Packaging Bundle
Bundle has been packaged to /home/seqware/workflow-dev
$ seqware bundle install --zip Workflow_Bundle_Tar_1.0-SNAPSHOT_SeqWare_1.1.0.zip 
Installing Bundle
Bundle: Workflow_Bundle_Tar_1.0-SNAPSHOT_SeqWare_1.1.0.zip
Now transferring /home/seqware/workflow-dev/Workflow_Bundle_Tar_1.0-SNAPSHOT_SeqWare_1.1.0.zip to the directory: /home/seqware/released-bundles Please be aware, this process can take hours if the bundle is many GB in size.
Processing input: /home/seqware/workflow-dev/Workflow_Bundle_Tar_1.0-SNAPSHOT_SeqWare_1.1.0.zip
      output-dir: /home/seqware/released-bundles

Added 'Tar' (SWID: 13)
Bundle Has Been Installed to the MetaDB and Provisioned to Workflow_Bundle_Tar_1.0-SNAPSHOT_SeqWare_1.1.0.zip!
$ seqware bundle install --zip Workflow_Bundle_GZ_1.0-SNAPSHOT_SeqWare_1.1.0.zip 
Installing Bundle
Bundle: Workflow_Bundle_GZ_1.0-SNAPSHOT_SeqWare_1.1.0.zip
Now transferring /home/seqware/workflow-dev/Workflow_Bundle_GZ_1.0-SNAPSHOT_SeqWare_1.1.0.zip to the directory: /home/seqware/released-bundles Please be aware, this process can take hours if the bundle is many GB in size.
Processing input: /home/seqware/workflow-dev/Workflow_Bundle_GZ_1.0-SNAPSHOT_SeqWare_1.1.0.zip
      output-dir: /home/seqware/released-bundles

Added 'GZ' (SWID: 14)
Bundle Has Been Installed to the MetaDB and Provisioned to Workflow_Bundle_GZ_1.0-SNAPSHOT_SeqWare_1.1.0.zip!

Next, deciders rely upon an up-to-date files report. Refresh this with the following command (this should probably be in a cron on a production system):

$ seqware files refresh

Now, you are ready to use the basic decider to configure and schedule your workflows. The following command lets the BasicDecider run across the entire database, looking for plain text files that have been imported, in order to schedule TarWorkflows.

$ java -jar ~/.seqware/self-installs/seqware-distribution-1.1.0-full.jar -p net.sourceforge.seqware.pipeline.deciders.BasicDecider -- --all --meta-types text/plain --wf-accession 13 

Created workflow run with SWID: 15
Scheduling.

java -jar seqware-distribution-1.1.0-full.jar --plugin io.seqware.pipeline.plugins.WorkflowScheduler -- -- --workflow-accession 13 --ini-files /tmp/7745435635015076544835954197.ini --input-files 12 --parent-accessions 9 --link-workflow-run-to-parents 6 --host master 
 
$ seqware workflow-run launch-scheduled 

[2014/07/16 13:58:48] | Number of submitted workflows: 1
Working Run: 15
Valid run by host check: 15
Launching via new launcher: 15
WARNING: No entry in settings for OOZIE_SGE_THREADS_PARAM_FORMAT, omitting threads option from qsub. Fix by providing the format of qsub threads option, using the '${threads}' variable.
Using working directory: /usr/tmp/oozie/oozie-f2b1939f-d9cc-46f9-b09e-a3640695fd28
[SeqWare Pipeline] WARN [2014/07/16 13:58:48] | Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Files copied to /usr/tmp/oozie/oozie-f2b1939f-d9cc-46f9-b09e-a3640695fd28
Submitted Oozie job: 0000071-140609173140913-oozie-oozi-W

Next, wait for the workflow to run to completion, you can use either the ‘seqware workflow-run report’ or refresh the files report in order to determine when the workflow is complete, but it should take roughly 2 minutes. Another option is the Oozie web interface at http://localhost:11000/oozie/ You may also need to run ‘seqware workflow-run propagate-statuses’ in order to update the seqware meta-db with workflow run statuses if it is not already in your crontab.

$ seqware files refresh
$ java -jar ~/.seqware/self-installs/seqware-distribution-1.1.0-full.jar -p net.sourceforge.seqware.pipeline.deciders.BasicDecider -- --all --meta-types application/x-tar --wf-accession 14 --schedule

Created workflow run with SWID: 21
Launching.

java -jar seqware-distribution-1.0.11-full.jar --plugin net.sourceforge.seqware.pipeline.plugins.WorkflowLauncher -- --workflow-accession 14 --ini-files /tmp/-15384347942082223855192608272.ini --input-files 20 --parent-accessions 19 --link-workflow-run-to-parents 6 --schedule --host master --

$ seqware workflow-run launch-scheduled

Results

In this tutorial you have learned the following things:

how to create simple workflows with a single job (tar or gz)
how to create metadata that can trigger workflows via deciders
how to run the basic decider
how to pass along data from one workflow to another