So far you’ve seen how to write a module to wrap a given tool in a consistent way. You’ve also seen how to write a workflow that links together multiple modules and parameterizes them through a simple ini file. Finally, you can put these together and run the workflow using the commands you saw in the introduction with HelloWorld. But the reality of large projects dictates that you can’t run each workflow manually. That’s where deciders come in: a decider is a piece of code that links the metadata in the MetaDB to a given workflow, based on a set of rules encoded in the decider. At UNC and OICR we have deciders for each of our workflows, and they let us process data on hourly cron jobs. These deciders, for example, look in the database for all RNASeq human samples that have not previously been processed through a workflow, pull back the needed data from the MetaDB, and launch a workflow run for that particular lane. This is the heart of what a decider does: it links metadata to the actual execution of workflows in an automated way.
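For example, an hourly crontab entry for such a decider might look like the sketch below; the jar path, meta-type, and workflow accession are placeholders drawn from commands that appear later in this tutorial:

# hypothetical hourly crontab entry; adjust the jar path, meta-type, and accession for your site
0 * * * * java -jar /home/seqware/.seqware/self-installs/seqware-distribution-1.1.0-full.jar -p net.sourceforge.seqware.pipeline.deciders.BasicDecider -- --all --meta-types text/plain --wf-accession 13 --schedule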
So we created the deciders for:

* launching workflows automatically on a regular basis
* mapping metadata in the MetaDB onto the parameters of a given workflow
* containing site-specific business logic in one place

For the last point, we’ve tried to make modules very clean and generic with little business logic, mainly focused on parameterization of behavior. The workflows are deliberately dumb as well and depend heavily on parameterization. So we’ve essentially used the deciders as the place to contain logic that is site specific or simply frequently changing.
The interface for a decider is DeciderInterface. In practice, when creating your own deciders you will probably extend BasicDecider.
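A minimal sketch of a custom decider follows. The overridden methods (init, modifyIniFile) and the setMetaType call reflect our understanding of the BasicDecider API in this release; treat them as assumptions and verify against the version you are running:

package com.github.seqware;

import java.util.Arrays;
import java.util.Map;
import net.sourceforge.seqware.common.module.ReturnValue;
import net.sourceforge.seqware.pipeline.deciders.BasicDecider;

public class MyTarDecider extends BasicDecider {

    @Override
    public ReturnValue init() {
        // restrict the search to plain-text files, the same meta-type used later in this tutorial
        this.setMetaType(Arrays.asList("text/plain"));
        return super.init();
    }

    @Override
    protected Map<String, String> modifyIniFile(String commaSeparatedFilePaths, String commaSeparatedParentAccessions) {
        // start from the ini the BasicDecider would generate, then layer on site-specific parameters
        Map<String, String> ini = super.modifyIniFile(commaSeparatedFilePaths, commaSeparatedParentAccessions);
        ini.put("output_dir", "seqware-results");
        return ini;
    }
}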
Full command-line options for the BasicDecider are available. Below, we highlight some key points.
Required parameters of note:

* --wf-accession: the accession of the workflow that the decider schedules
* --meta-types (or --parent-wf-accessions): how input files are selected, either by file meta-type or by the workflows that produced them
* a scope such as --all: how much of the MetaDB is searched

Optional parameters of note:

* --check-wf-accessions: accessions of additional workflows whose previous runs are checked before re-running
* --rerun-max: the maximum number of failed previous runs tolerated before a re-run is blocked
* --schedule: schedule the workflow runs for later launching rather than launching them immediately
When extending the BasicDecider, you may also place the parent-wf-accessions, check-wf-accessions, and wf-accession into a decider.properties file to permanently set these values.
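A decider.properties file uses standard Java properties syntax; a sketch with hypothetical accession values:

# decider.properties -- the accession values below are hypothetical examples
parent-wf-accessions=7
check-wf-accessions=13,14
wf-accession=13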
The BasicDecider performs the following steps:

1. searches the files report for files matching the requested meta-types, within the scope you specify (e.g. --all or a particular study)
2. groups those files into candidate workflow runs
3. checks the MetaDB for previous workflow runs on the same files and applies the re-run rules described below
4. generates an ini file for each group that is not blocked and schedules (or launches) a workflow run
A general guideline here is that we only count failures when they are associated with workflow runs that ran on the same files (in the MetaDB) as we are currently attempting to run on.
| Status of previous workflow run | 1 (input files are disjoint from previous run) | 2 (input files partially overlap a previous run) | 3 (input files match a previous run exactly) | 4 (past run contains all current input files) |
|---|---|---|---|---|
| Failed | ignore (allow re-run) | allow re-run | count as failure; re-run if fewer than max-failures | allow re-run (with warning message) |
| Other (pending, running, no status, etc.) | ignore (allow re-run) | allow re-run | block re-run | block re-run |
| Completed | ignore (allow re-run) | allow re-run | block re-run | do not re-run |
In addition, if for a given group of files we count more failures than rerun-max, we also block the re-run.
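To make the table concrete, here is a rough Java sketch of the decision logic it describes. This is an illustration under stated assumptions, not the actual BasicDecider source; PreviousRun is a hypothetical stand-in for workflow-run state in the MetaDB:

import java.util.Collections;
import java.util.List;
import java.util.Set;

class RerunRules {

    // hypothetical view of a previous workflow run recorded in the MetaDB
    interface PreviousRun {
        Set<Integer> getInputFileAccessions();
        boolean isFailed();
    }

    static boolean blockRerun(List<PreviousRun> previousRuns, Set<Integer> currentInputs, int rerunMax) {
        int failures = 0;
        for (PreviousRun run : previousRuns) {
            Set<Integer> previousInputs = run.getInputFileAccessions();
            if (Collections.disjoint(previousInputs, currentInputs)) {
                continue; // column 1: disjoint input files are ignored
            }
            boolean exactMatch = previousInputs.equals(currentInputs);
            boolean pastContainsCurrent = previousInputs.containsAll(currentInputs);
            if (run.isFailed()) {
                if (exactMatch) {
                    failures++; // column 3, Failed: count the failure; re-run only if under the limit
                }
                // columns 2 and 4, Failed: allow the re-run (column 4 logs a warning)
            } else if (exactMatch || pastContainsCurrent) {
                return true; // columns 3 and 4, Completed or Other: block the re-run
            }
        }
        return failures >= rerunMax; // too many failures on the same files also blocks
    }
}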
In order to determine whether there are existing workflow runs in the database that would block re-running a workflow, we use a different method depending on the version of the decider.
For 1.0.X deciders, when looking for previous workflow runs we look in the workflow_run_input_files table. This table specifically stores the input files for each workflow run and is populated at schedule time by the workflow launcher.
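If you are curious, you can inspect this table directly with psql, using the tutorial’s database name (SELECT * avoids assuming particular column names; check the schema in seqware_meta_db.sql for details):

~/seqware_github/seqware-meta-db$ psql test_seqware_meta_db -c 'SELECT * FROM workflow_run_input_files LIMIT 10;'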
For 0.13.6.X deciders, when looking for previous workflow runs on these files, we search using three patterns.
We will demonstrate usage of the BasicDecider by creating a few toy workflows and connecting them together via BasicDecider calls.
As a prerequisite, you need a clean MetaDB. If your database is named “test_seqware_meta_db”, you can reset it quickly with the following PostgreSQL commands from the seqware-meta-db directory:
~/seqware_github/seqware-meta-db$ dropdb test_seqware_meta_db
~/seqware_github/seqware-meta-db$ createdb test_seqware_meta_db
~/seqware_github/seqware-meta-db$ psql test_seqware_meta_db < seqware_meta_db.sql
~/seqware_github/seqware-meta-db$ psql test_seqware_meta_db < seqware_meta_db_data.sql
Next, create the required metadata that represents the way your experiments are organized. These steps mimic the start of the User Tutorial:
$ seqware create study --title 'Study1' --description 'This is a test description' --accession 'InternalID123' --center-name 'SeqWare' --center-project-name 'SeqWare Test Project' --study-type 4
Created study with SWID: 1
$ seqware create experiment --title 'New Test Experiment' --description 'This is a test description' --platform-id 26 --study-accession 1
Created experiment with SWID: 2
$ seqware create sample --title 'New Test Sample' --description 'This is a test description' --organism-id 26 --experiment-accession 2
Created sample with SWID: 3
Next, create some files that represent data you will be operating on and a target directory for results:
$ mkdir -p /datastore/seqware-results
$ echo 'testing basic workflows' > /datastore/input.txt
Now, create a sequencer_run, lane, and IUS. These structures represent a sequencer run, one of its lanes, and an independent unit of sequencing. For this toy example, we will link new files to these starting points. (Note that the created SWIDs will vary depending on how much work you’ve done.)
$ seqware create sequencer-run --description description --file-path file_path --name name --paired-end paired_end --platform-accession 26 --skip false
Created sequencer run with SWID: 4
$ seqware create lane --sequencer-run-accession 4 --study-type-accession 4 --cycle-descriptor cycle_descriptor --description description --lane-number 1 --library-selection-accession 25 --library-source-accession 5 --library-strategy-accession 21 --name name --skip false
Created lane with SWID: 5
$ seqware create ius --barcode barcode --description description --lane-accession 5 --name name --sample-accession 3 --skip false
Created IUS with SWID: 6
Next, you need to inject the input file that will be used by the workflows you are about to create. When using the decider framework, the starting point for your deciders is a root workflow run, so we create a “root” FileImport workflow and a workflow run that has the input file above as its output.
$ seqware create workflow --name FileImport --version 1.0 --description description
Added 'FileImport' (SWID: 7)
Created workflow 'FileImport' version 1.0 with SWID: 7
$ seqware create workflow-run --workflow-accession 7 --file imported_file::text/plain::/datastore/input.txt --parent-accession 5 --parent-accession 6
Created workflow run with SWID: 8
Next, we create two workflows: one that converts text files to tar format, and a second that converts tar files to tar.gz format. Here we assume that you have already run through the Developer Tutorial. Use Maven archetypes to generate two workflows in the workflow-dev directory called Tar and GZ (the generated workflow classes are TarWorkflow and GZWorkflow). Use the following parameters for your archetypes:
Define value for property 'package': : com.github.seqware
Define value for property 'groupId': com.github.seqware: :
Define value for property 'workflow-name': : Tar
Define value for property 'artifactId': workflow-Tar: : Tar
Define value for property 'version': 1.0-SNAPSHOT: :
[INFO] Using property: package = com.github.seqware
Confirm properties configuration:
package: com.github.seqware
groupId: com.github.seqware
workflow-name: Tar
artifactId: Tar
version: 1.0-SNAPSHOT
package: com.github.seqware
Define value for property 'package': : com.github.seqware
Define value for property 'groupId': com.github.seqware: :
Define value for property 'workflow-name': : GZ
Define value for property 'artifactId': workflow-GZ: : GZ
Define value for property 'version': 1.0-SNAPSHOT: :
[INFO] Using property: package = com.github.seqware
Confirm properties configuration:
package: com.github.seqware
groupId: com.github.seqware
workflow-name: GZ
artifactId: GZ
version: 1.0-SNAPSHOT
package: com.github.seqware
You will need to edit two files in each workflow: the workflow Java client and the workflow.ini.
For Tar/src/main/java/com/github/seqware/TarWorkflow.java:
package com.github.seqware;

import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

import net.sourceforge.seqware.pipeline.workflowV2.AbstractWorkflowDataModel;
import net.sourceforge.seqware.pipeline.workflowV2.model.Job;
import net.sourceforge.seqware.pipeline.workflowV2.model.SqwFile;

public class TarWorkflow extends AbstractWorkflowDataModel {

    @Override
    public Map<String, SqwFile> setupFiles() {
        try {
            // register an input file from the workflow.ini
            SqwFile file0 = this.createFile("file_in_0");
            file0.setSourcePath(getProperty("input_files"));
            file0.setType("text/plain");
            file0.setIsInput(true);
            // register an output file
            SqwFile file1 = this.createFile("file_out");
            file1.setSourcePath("input.tar");
            file1.setType("application/x-tar");
            file1.setIsOutput(true);
            file1.setForceCopy(true);
            // build an output path from the workflow.ini settings plus a random
            // number (to avoid overwriting results from previous runs)
            file1.setOutputPath(this.getMetadata_output_file_prefix() + getMetadata_output_dir() + "/"
                    + this.getName() + "_" + this.getVersion() + "/" + this.getRandom() + "/" + "input.tar");
            return this.getFiles();
        } catch (Exception ex) {
            Logger.getLogger(this.getClass().getName()).log(Level.SEVERE, null, ex);
            throw new RuntimeException(ex);
        }
    }

    @Override
    public void buildWorkflow() {
        Job job11 = this.getWorkflow().createBashJob("bash_tar");
        job11.setCommand("tar -cvf input.tar " + this.getFiles().get("file_in_0").getProvisionedPath());
    }
}
Note that input_files is automatically populated in the workflow.ini when the decider runs and generates ini files.
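For instance, after the decider groups files for a run of the Tar workflow, the generated ini would contain a line like the following (a hypothetical excerpt; the rest of the generated ini is omitted):

input_files=/datastore/input.txt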
For GZ/src/main/java/com/github/seqware/GZWorkflow.java:
package com.github.seqware;

import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

import net.sourceforge.seqware.pipeline.workflowV2.AbstractWorkflowDataModel;
import net.sourceforge.seqware.pipeline.workflowV2.model.Job;
import net.sourceforge.seqware.pipeline.workflowV2.model.SqwFile;

public class GZWorkflow extends AbstractWorkflowDataModel {

    @Override
    public Map<String, SqwFile> setupFiles() {
        try {
            // register an input file
            SqwFile file0 = this.createFile("file_in_0");
            file0.setSourcePath(getProperty("input_files"));
            file0.setType("application/x-tar");
            file0.setIsInput(true);
            // register an output file
            SqwFile file1 = this.createFile("file_out");
            file1.setSourcePath("input.tar.gz");
            file1.setType("application/gzip");
            file1.setIsOutput(true);
            file1.setForceCopy(true);
            // build an output path from the workflow.ini settings plus a random
            // number (to avoid overwriting results from previous runs)
            file1.setOutputPath(this.getMetadata_output_file_prefix() + getMetadata_output_dir() + "/"
                    + this.getName() + "_" + this.getVersion() + "/" + this.getRandom() + "/" + "input.tar.gz");
            return this.getFiles();
        } catch (Exception ex) {
            Logger.getLogger(this.getClass().getName()).log(Level.SEVERE, null, ex);
            // rethrow rather than returning null so a misconfiguration fails fast
            throw new RuntimeException(ex);
        }
    }

    @Override
    public void buildWorkflow() {
        Job job11 = this.getWorkflow().createBashJob("bash_gz");
        job11.setCommand("tar -zcvf input.tar.gz " + this.getFiles().get("file_in_0").getProvisionedPath());
    }
}
For Tar/workflow/config/TarWorkflow.ini and GZ/workflow/config/GZWorkflow.ini:
# the output_prefix is used to specify the root of the absolute output path or an S3 bucket name
# you should pick a path that is available on all cluster nodes and can be written by your user
output_prefix=/datastore/
# output_dir indicates the directory under the output_prefix where your results will be
# placed; in this example your results will reside in /datastore/seqware-results
output_dir=seqware-results
Build, package, and install both workflows. Record the accession for both workflows:
$ cd Tar/
$ mvn clean install
(output snipped)
$ cd ..
$ cd GZ/
$ mvn clean install
(output snipped)
$ cd ..
$ seqware bundle package --dir GZ/target/Workflow_Bundle_GZ_1.0-SNAPSHOT_SeqWare_1.1.0/
Validating Bundle structure
Packaging Bundle
Bundle has been packaged to /home/seqware/workflow-dev
$ seqware bundle package --dir Tar/target/Workflow_Bundle_Tar_1.0-SNAPSHOT_SeqWare_1.1.0/
Validating Bundle structure
Packaging Bundle
Bundle has been packaged to /home/seqware/workflow-dev
$ seqware bundle install --zip Workflow_Bundle_Tar_1.0-SNAPSHOT_SeqWare_1.1.0.zip
Installing Bundle
Bundle: Workflow_Bundle_Tar_1.0-SNAPSHOT_SeqWare_1.1.0.zip
Now transferring /home/seqware/workflow-dev/Workflow_Bundle_Tar_1.0-SNAPSHOT_SeqWare_1.1.0.zip to the directory: /home/seqware/released-bundles Please be aware, this process can take hours if the bundle is many GB in size.
Processing input: /home/seqware/workflow-dev/Workflow_Bundle_Tar_1.0-SNAPSHOT_SeqWare_1.1.0.zip
output-dir: /home/seqware/released-bundles
Added 'Tar' (SWID: 13)
Bundle Has Been Installed to the MetaDB and Provisioned to Workflow_Bundle_Tar_1.0-SNAPSHOT_SeqWare_1.1.0.zip!
$ seqware bundle install --zip Workflow_Bundle_GZ_1.0-SNAPSHOT_SeqWare_1.1.0.zip
Installing Bundle
Bundle: Workflow_Bundle_GZ_1.0-SNAPSHOT_SeqWare_1.1.0.zip
Now transferring /home/seqware/workflow-dev/Workflow_Bundle_GZ_1.0-SNAPSHOT_SeqWare_1.1.0.zip to the directory: /home/seqware/released-bundles Please be aware, this process can take hours if the bundle is many GB in size.
Processing input: /home/seqware/workflow-dev/Workflow_Bundle_GZ_1.0-SNAPSHOT_SeqWare_1.1.0.zip
output-dir: /home/seqware/released-bundles
Added 'GZ' (SWID: 14)
Bundle Has Been Installed to the MetaDB and Provisioned to Workflow_Bundle_GZ_1.0-SNAPSHOT_SeqWare_1.1.0.zip!
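If you lose track of the accessions reported above, you can list installed workflows; we believe the CLI supports this subcommand, but check 'seqware workflow --help' if it differs in your version:

$ seqware workflow list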
Next, deciders rely upon an up-to-date files report. Refresh it with the following command (on a production system this should probably run from cron):
$ seqware files refresh
Now, you are ready to use the BasicDecider to configure and schedule your workflows. The following command lets the BasicDecider run across the entire database, looking for plain text files that have been imported, in order to schedule runs of the Tar workflow:
$ java -jar ~/.seqware/self-installs/seqware-distribution-1.1.0-full.jar -p net.sourceforge.seqware.pipeline.deciders.BasicDecider -- --all --meta-types text/plain --wf-accession 13
Created workflow run with SWID: 15
Scheduling.
java -jar seqware-distribution-1.1.0-full.jar --plugin io.seqware.pipeline.plugins.WorkflowScheduler -- -- --workflow-accession 13 --ini-files /tmp/7745435635015076544835954197.ini --input-files 12 --parent-accessions 9 --link-workflow-run-to-parents 6 --host master
$ seqware workflow-run launch-scheduled
[2014/07/16 13:58:48] | Number of submitted workflows: 1
Working Run: 15
Valid run by host check: 15
Launching via new launcher: 15
WARNING: No entry in settings for OOZIE_SGE_THREADS_PARAM_FORMAT, omitting threads option from qsub. Fix by providing the format of qsub threads option, using the '${threads}' variable.
Using working directory: /usr/tmp/oozie/oozie-f2b1939f-d9cc-46f9-b09e-a3640695fd28
[SeqWare Pipeline] WARN [2014/07/16 13:58:48] | Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Files copied to /usr/tmp/oozie/oozie-f2b1939f-d9cc-46f9-b09e-a3640695fd28
Submitted Oozie job: 0000071-140609173140913-oozie-oozi-W
Next, wait for the workflow to run to completion. You can use either ‘seqware workflow-run report’ or a refresh of the files report to determine when the workflow is complete; it should take roughly two minutes. Another option is the Oozie web interface at http://localhost:11000/oozie/. You may also need to run ‘seqware workflow-run propagate-statuses’ to update the SeqWare MetaDB with workflow run statuses if it is not already in your crontab.
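For example, to check on the run scheduled above (the --accession flag here is our assumption; see ‘seqware workflow-run report --help’ for the exact options):

$ seqware workflow-run report --accession 15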
$ seqware files refresh
$ java -jar ~/.seqware/self-installs/seqware-distribution-1.1.0-full.jar -p net.sourceforge.seqware.pipeline.deciders.BasicDecider -- --all --meta-types application/x-tar --wf-accession 14 --schedule
Created workflow run with SWID: 21
Scheduling.
java -jar seqware-distribution-1.1.0-full.jar --plugin io.seqware.pipeline.plugins.WorkflowScheduler -- -- --workflow-accession 14 --ini-files /tmp/-15384347942082223855192608272.ini --input-files 20 --parent-accessions 19 --link-workflow-run-to-parents 6 --host master
$ seqware workflow-run launch-scheduled
In this tutorial you have learned the following things:

* how deciders link metadata in the MetaDB to the automated execution of workflows
* how the BasicDecider decides whether previous workflow runs should block a re-run
* how to create the metadata (study, experiment, sample, sequencer run, lane, and IUS) that deciders navigate
* how to import a root workflow run whose output files feed a decider
* how to chain workflows together by meta-type using BasicDecider calls