Developer Tutorial

Note: This guide assumes you have already installed SeqWare. If you have not, please install SeqWare by either downloading the VirtualBox VM or launching the AMI on the Amazon cloud. See Installation for directions. We also recommend you follow the User Tutorial before this guide.

This guide picks up where the User Tutorial left off. In that previous guide we showed you how to start up your local VM, create studies, experiments, and samples, associate an input file with a sample, and then launch a workflow to process that file. These workflows can be complex (they include branching and looping) and in future tutorials you will see how to string multiple workflows together (output of one as input for the next) using deciders for automation.

In this tutorial the focus is on creating a workflow of your own based on the HelloWorld that comes bundled with the VM. In theory you could use either a local VM or an Amazon instance to follow the tutorial below but in our case we will base it on the local VM.

By the End of This Tutorial

By the end of this tutorial you will be able to:

  • create a new SeqWare Pipeline workflow bundle based on HelloWorld
  • test your workflow bundle locally
  • package your new workflow as a bundle for hand-off to an administrator for installation into SeqWare Pipeline

The Theory Behind a SeqWare Workflow Bundle

In many workflow environments the concept of a workflow is encoded as a simple XML markup file that defines a series of steps, data inputs, etc. This may be interpreted by a user interface of some sort, e.g. a drag-n-drop workflow creation tool. These workflow systems tend to treat workflows as very lightweight representations of steps. One problem with this lightweight approach is that dependencies for steps in the workflow, such as genome indexes for an aligner, are often treated as parameters and are not managed by the workflow system. SeqWare’s concept of a workflow is much more akin to a Linux distribution package (like RPM or DEB files) in which all necessary components are packaged inside a single binary file. In SeqWare we use Zip64 files to group the workflow definition file, the workflow itself, sample settings, and data dependencies in a single file that can be exchanged between SeqWare users or archived. This allows SeqWare bundles to be much more portable than lightweight workflows that reference external tools and data. Being self-contained is at the core of the design goals for SeqWare bundles, at the expense of often large workflow bundle sizes.

Note About Working Directory vs. Workflow Bundle Directory

Just to be clear, there are two directory locations to consider when working with workflows. The workflow bundle directory (often referred to as ${workflow_bundle_dir} in various components) refers to the location of the installed workflow bundle. You use this variable throughout your workflow bundle to refer to the install directory, since that will only be known after the workflow bundle is installed. For example, in the Java workflow language you would refer to a script called foo.pl installed in the bin directory within the workflow bundle as this.getWorkflowBaseDir()+"/bin/foo.pl". Similarly, from inside the workflow INI config file you can refer to a data file as my_file=${workflow_bundle_dir}/data/data_file.txt.

The second directory is the current working directory. Every time a workflow is launched, a temporary working directory is created for just that particular run of the workflow. A shared filesystem (NFS, gluster, etc) is required to ensure each job in a workflow is able to access this shared workflow working location regardless of what cluster node is selected to run a particular job. Before a job in a workflow executes the current working directory is set so workflow authors can assume their individual jobs are already in the correct location.

First Steps

Please launch your local VM in VirtualBox or the cloud AMI on Amazon now. For the local VM, log in as user seqware with password seqware. Click on the “SeqWare Directory” link on the desktop, which will open a terminal in the location where we installed the SeqWare tools.

Alternatively, on the Amazon AMI follow the directions to log in here. Make sure that you launch our VM with the “cc1.4xlarge” instance type. Also, please wait roughly 10 minutes for our startup scripts to run and fully set up your instance.

After logging into the remote instance you need to “switch user” to seqware, e.g.:

$ sudo su - seqware

In some instances on AWS, it may be necessary to run sudo umount /dev/xvdc as the ubuntu user before switching to the seqware user.

Both the VirtualBox VM and Amazon AMI include a start page that links to key information for the VM such as the URLs for the installed Portal, Web Service, key file locations, etc. On the VirtualBox VM, just click the “Start Here” link on the desktop. For the Amazon instance use the instance name provided by the AWS console. For example, it will look similar to:

http://ec2-54-224-22-195.compute-1.amazonaws.com

You fill in your instance DNS name from the Amazon console in place of ec2-54-224-22-195.compute-1.amazonaws.com above. Make sure you check your security group settings to ensure port 80 (and the other ports referenced in the landing page) are open.

Overview of Workflow Development Using the VM

You should be in the /home/seqware/ directory now. Notice there are two important directories: provisioned-bundles (SW_BUNDLE_DIR in the config) which contains unzipped workflow bundles and released-bundles (SW_BUNDLE_REPO_DIR in the config) which contains zip versions of the workflows that you create when you package up these bundles and install them.

There are two ways to create a new workflow. The first is simply to copy an existing workflow bundle from provisioned-bundles, rename it, and modify the workflow to meet your needs. The second is to use Maven Archetypes, a template system that generates a workflow skeleton, making it fast and easy to get started with workflow development. In this tutorial we will use the Maven Archetypes system. We will also test the new bundle and, once you finish, package it up, ready for handoff to an admin who will install it in the SeqWare system so users can run it.

Common Steps

Most workflow developers follow a similar series of steps when developing new workflows. Generally one goes through the following process:

  • Plan your workflow
    Most developers are bioinformaticists and will spend some time exploring the tools and algorithms they want to use in this workflow and decide what problems their workflow is trying to solve.
  • Find your tools and sample data
    Tools are collected, synthetic or real test datasets are prepared, prototyping is done
  • Make and test the workflow
    The developer writes and tests the workflow both with bundled test data and real data locally and on a cluster or cloud resource
  • Packaging and handoff
    The developer zips up the finished workflow and hands off to an admin that installs the workflow so users can use it

Creating a New Workflow

The SeqWare workflow archetype allows workflow developers to quickly create new workflows. The Maven archetypes generate skeletons that contain a simple example program that can be modified to create a new workflow. The archetypes also take parameters prior to skeleton generation that allow the name of the workflow to be specified and all configuration files to be adjusted with respect to these parameters.

The code generated by the archetype contains all the necessary files to generate a complete workflow bundle when the mvn install command is issued. This command will combine the workflow definition, the config file, the Java workflow file, and other external dependencies pulled in by Maven to create a complete workflow bundle. This makes the code generated by the archetype ideal to place under version control. As maintenance changes are made to the Java file or any other aspect of the workflow, these files can be updated and a new workflow reflecting these changes can be generated by re-issuing the mvn install command.

The first step to get started is to generate your workflow skeleton using Maven archetypes. You will want to do this in a directory without pom.xml files (i.e. outside of the SeqWare development directories). Here we are working in the workflow-dev directory:

$ cd ~/workflow-dev
$ mvn archetype:generate
...
829: local -> com.github.seqware:seqware-archetype-decider (SeqWare Java Decider archetype)
830: local -> com.github.seqware:seqware-archetype-java-workflow (SeqWare Java workflow archetype)
831: local -> com.github.seqware:seqware-archetype-module (SeqWare module archetype)
Choose a number or apply filter (format: [groupId:]artifactId, case sensitive contains): 294: 

Note: You can inform Maven about the archetypes by downloading the latest archetype-catalog.xml file and placing it in ~/.m2/.

The numbers used to identify the archetypes will vary depending on what you have installed, so you will need to scan through the list to find the SeqWare archetype you are looking for, in this case “SeqWare Java workflow archetype”. You can also enter “seqware” as a filter in order to narrow down the possibilities. Following the prompts, use “com.github.seqware” as the groupId and package, “workflow-MyHelloWorld” as the artifactId, “MyHelloWorld” as the workflow-name, and 1.0 for the version. These values match the non-interactive command and the directory names used in the rest of this tutorial.

Alternatively, the archetype can be generated without any interaction:

$ mvn archetype:generate \
  -DinteractiveMode=false \
  -DarchetypeCatalog=local \
  -DarchetypeGroupId=com.github.seqware \
  -DarchetypeArtifactId=seqware-archetype-java-workflow \
  -DgroupId=com.github.seqware \
  -DartifactId=workflow-MyHelloWorld \
  -Dworkflow-name=MyHelloWorld \
  -Dversion=1.0

A Tour of Workflow Bundle Components

In this section we will examine the internals of the Workflow Bundle that was just generated. The first thing you should do is take a look at the workflow manifest showing which workflows are present in this bundle (a single Workflow Bundle can contain many workflows).

$ cd workflow-MyHelloWorld
$ mvn install
...
$ seqware bundle list --dir target/Workflow*

List Workflows:

 Workflow:
  Name: MyHelloWorld
  Version: 1.0
  Description: Add a description of the workflow here.
  Workflow Class: ${workflow_bundle_dir}/Workflow_Bundle_MyHelloWorld/1.0/classes/com/github/seqware/MyHelloWorldWorkflow.java
  Config Path: ${workflow_bundle_dir}/Workflow_Bundle_MyHelloWorld/1.0/config/MyHelloWorldWorkflow.ini
  Requirements Compute: single Memory: 20M Network: local

This shows the one workflow in the generated workflow bundle.

Directory Organization

The directory structure created by the maven archetype includes a pom.xml file which is our Maven build file, a src directory which contains the Java workflow, and a workflow directory that contains any bundled data, the basic workflow config file which includes all the parameters this workflow accepts, the metadata.xml which defines the workflows available in this bundle, and any scripts, binaries, or libraries your workflow needs (in bin and lib respectively).

When you issue the mvn install command the target directory is created which contains the compiled workflow along with the various necessary files all correctly assembled in the proper directory structure. You can run the workflow in test mode or package up the workflow as a zip file for exchange with others. Both topics are covered later in this tutorial.

The Maven archetype workflows are quite nice, too, since it is easy to check in everything but the target directory into source control like git or subversion. This makes it a lot easier to share the development of workflows between developers.

|-- pom.xml
|-- src
|   `-- main
|       `-- java
|           `-- com
|               `-- github
|                   `-- seqware
|                       `-- MyHelloWorldWorkflow.java
|-- target
|   `-- Workflow_Bundle_MyHelloWorld_1.0_SeqWare_1.1.0
|       `-- Workflow_Bundle_MyHelloWorld
|           `-- 1.0
|               |-- bin
|               |-- classes
|               |   `-- com
|               |       `-- github
|               |           `-- seqware
|               |               `-- MyHelloWorldWorkflow.class
|               |-- config
|               |   `-- MyHelloWorldWorkflow.ini
|               |-- data
|               |   `-- input.txt
|               |-- lib
|               |   `-- seqware-distribution-1.1.0-full.jar
|               `-- metadata.xml
|-- workflow
|   |-- config
|   |   `-- MyHelloWorldWorkflow.ini
|   |-- data
|   |   `-- input.txt
|   |-- lib
|   |-- metadata.xml
|   `-- workflows
`-- workflow.properties

Here are some additional details about these files:

  • pom.xml
    A Maven project file. Edit this to change the version of the workflow and to add or modify workflow dependencies such as programs, modules, and data.
  • workflow
    This directory contains the workflow skeleton. Look in here to modify the workflow configuration and data files.
  • src
    This directory contains the Java client. Look in here to modify the .java files (Java workflow). The examples of Java workflows can be found here.
  • workflow.properties
    You can edit the description and workflow names in this file.

Workflow Manifest

The workflow manifest (metadata.xml) includes the workflow name, version, description, test command, and enough information so that the SeqWare tools can test, execute, and install the workflow. Here is an example from the MyHelloWorld workflow:

<bundle version="1.0">
  <workflow name="MyHelloWorld" version="1.0" seqware_version="1.1.0"
  basedir="${workflow_bundle_dir}/Workflow_Bundle_MyHelloWorld/1.0">
    <description>Add a description of the workflow here.</description>
    <workflow_class path="${workflow_bundle_dir}/Workflow_Bundle_MyHelloWorld/1.0/classes/com/github/seqware/MyHelloWorldWorkflow.java"/>
    <config path="${workflow_bundle_dir}/Workflow_Bundle_MyHelloWorld/1.0/config/MyHelloWorldWorkflow.ini"/>
    <requirements compute="single" memory="20M" network="local"  workflow_engine="Pegasus,Oozie" workflow_type="java"/>
  </workflow>
</bundle>

As mentioned above, you can edit the description and workflow name in the workflow.properties file.

Workflow Java Class

You can see the full Java workflow source code by looking at Workflow Examples or, in this case, just the src/main/java/com/github/seqware/MyHelloWorldWorkflow.java file produced by the Maven Archetype above.

This Java class is pretty simple in its construction. It is used to define input and output files along with the individual steps in the workflow and how they relate to each other. It is used to create a workflow object model which is then handed off to a workflow engine that knows how to turn that into a directed acyclic graph of jobs that can run on a cluster (local VM, an HPC cluster, a cloud-based cluster, etc).

Files

    @Override
    public Map<String, SqwFile> setupFiles() {
      try {
        // register a plaintext input file using the information from the INI
        // provisioning this file to the working directory will be the first step in the workflow
        SqwFile file0 = this.createFile("file_in_0");
        file0.setSourcePath(getProperty("input_file"));
        file0.setType("text/plain");
        file0.setIsInput(true);

      } catch (Exception ex) {
        ex.printStackTrace();
        System.exit(1);
      }
      return this.getFiles();
    }

This method sets up files that are inputs for this workflow. In this example the input data/input.txt comes from the workflow bundle itself. Here we pull in the information from the INI file using getProperty.

Output files are defined differently.

    @Override
    public void buildWorkflow() {
        ...

        Job copyJob2 = this.getWorkflow().createBashJob("bash_cp");
        ...
        copyJob2.addFile(createOutputFile("dir1/output", "txt/plain", manualOutput));
    }

    private SqwFile createOutputFile(String workingPath, String metatype, boolean manualOutput) {
    // register an output file
        SqwFile file1 = new SqwFile();
        file1.setSourcePath(workingPath);
        file1.setType(metatype);
        file1.setIsOutput(true);
        file1.setForceCopy(true);

        // if manual_output is set in the ini then use it to set the destination of this file
        if (manualOutput) {
            file1.setOutputPath(this.getMetadata_output_file_prefix() + getMetadata_output_dir() + "/" + workingPath);
        } else {
            file1.setOutputPath(this.getMetadata_output_file_prefix() + getMetadata_output_dir() + "/"
                + this.getName() + "_" + this.getVersion() + "/" + this.getRandom() + "/" + workingPath);
        }
        return file1;
    }

The ultimate location of the output file is determined by two parameters passed into the WorkflowLauncher, which actually runs the workflow: --metadata-output-file-prefix (or output_prefix in the ini file) and --metadata-output-dir (or output_dir in the ini file). Alternatively, you can override the automatic location, as is the case with manual_output above. When manual_output is false, the workflow name, version, and a random integer are inserted into the path (output_prefix + output_dir + "/" + name_version + "/" + random + "/" + workingPath) to ensure the final output path is unique. When it is true, the file is written directly to output_prefix + output_dir + "/" + workingPath and may overwrite existing files.

The job that produces the output file is linked to that file by using job.addFile(SqwFile).
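
The path logic in createOutputFile() can be tried in isolation. The following is a minimal standalone sketch, not SeqWare code: the class OutputPathSketch and its method are hypothetical names, and the prefix, directory, name, version, and random value are passed in as plain arguments instead of being read from the workflow object.

```java
// Standalone sketch; OutputPathSketch is a hypothetical helper, not part of the SeqWare API.
final class OutputPathSketch {

    // Mirrors the string concatenation used in createOutputFile() above.
    static String outputPath(String prefix, String outputDir, String workflowName,
                             String version, int random, String workingPath,
                             boolean manualOutput) {
        if (manualOutput) {
            // manual_output=true: fixed location, existing files may be overwritten
            return prefix + outputDir + "/" + workingPath;
        }
        // manual_output=false: name, version, and a random integer keep the path unique
        return prefix + outputDir + "/" + workflowName + "_" + version + "/"
                + random + "/" + workingPath;
    }

    public static void main(String[] args) {
        // Values taken from the example INI file; 12345 stands in for the random integer.
        System.out.println(outputPath("./", "seqware-results", "MyHelloWorld", "1.0",
                12345, "dir1/output", true));
        System.out.println(outputPath("./", "seqware-results", "MyHelloWorld", "1.0",
                12345, "dir1/output", false));
    }
}
```

Note that the prefix is concatenated directly with the output directory, so the prefix is expected to end in a path separator (as ./ does in the example INI).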

Directories

    @Override
    public void setupDirectory() {
        // creates a dir1 directory in the current working directory where the workflow runs
        this.addDirectory("dir1");
    }

This method sets up directories in the working directory that the workflow runs in. In this case the workflow creates a directory called “dir1”.

Workflow Steps

    @Override
    public void buildWorkflow() {

        // a simple bash job to call mkdir
        // note that this job uses the system's mkdir (which depends on the system being *nix)
        // this also translates into a 3000 h_vmem limit when using sge
        Job mkdirJob = this.getWorkflow().createBashJob("bash_mkdir").setMaxMemory("3000");
        mkdirJob.getCommand().addArgument("mkdir test1");

        String inputFilePath = this.getFiles().get("file_in_0").getProvisionedPath();

        // a simple bash job to cat a file into a test file
        // the file is not saved to the metadata database
        Job copyJob1 = this.getWorkflow().createBashJob("bash_cp").setMaxMemory("3000");
        copyJob1.setCommand(catPath + " " + inputFilePath + "> test1/test.out");
        copyJob1.addParent(mkdirJob);
        // this will annotate the processing event associated with the cat of the file above
        copyJob1.getAnnotations().put("command.annotation.key.1", "command.annotation.value.1");
        copyJob1.getAnnotations().put("command.annotation.key.2", "command.annotation.value.2");

        // a simple bash job to echo to an output file and concat an input file
        // the file IS saved to the metadata database
        Job copyJob2 = this.getWorkflow().createBashJob("bash_cp").setMaxMemory("3000");
        copyJob2.getCommand().addArgument(echoPath).addArgument(greeting).addArgument(" > ").addArgument("dir1/output");
        copyJob2.getCommand().addArgument(";");
        copyJob2.getCommand().addArgument(catPath + " " + inputFilePath + " >> dir1/output");
        copyJob2.addParent(mkdirJob);
        SqwFile outputFile = createOutputFile("dir1/output", "txt/plain", manualOutput);
        // this will annotate the processing event associated with copying your output file to its final location
        outputFile.getAnnotations().put("provision.file.annotation.key.1", "provision.annotation.value.1");
        copyJob2.addFile(outputFile);

    }

In this buildWorkflow() method three jobs are created. You can see that the createBashJob can be used to run any arbitrary command. In the future we will add more job types (such as Map/Reduce for the Oozie engine). Each child job is linked to its parent using the addParent method. This information is enough to correctly schedule these jobs and run them in the correct order locally on the VM, on an HPC cluster, or on the cloud. The more detailed Pipeline documentation will cover optional useful job methods including examples of how to control memory requirements for particular jobs.
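
To see why parent links alone are enough to order the jobs, here is a small standalone sketch. It is not how SeqWare or the workflow engine is implemented: JobOrderSketch is a hypothetical class that records parent relationships by job name and returns one order in which every job runs after all of its parents.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Standalone sketch; JobOrderSketch is a hypothetical helper, not a SeqWare class.
final class JobOrderSketch {
    // job name -> names of its parent jobs, kept in insertion order
    private final Map<String, Set<String>> parents = new LinkedHashMap<>();

    void addJob(String name) {
        parents.putIfAbsent(name, new LinkedHashSet<>());
    }

    // Plays the role of child.addParent(parent) in the workflow class.
    void addParent(String child, String parent) {
        addJob(parent);
        addJob(child);
        parents.get(child).add(parent);
    }

    // Returns one valid execution order: each job appears after all of its parents.
    List<String> executionOrder() {
        List<String> order = new ArrayList<>();
        Set<String> done = new LinkedHashSet<>();
        while (done.size() < parents.size()) {
            boolean progressed = false;
            for (Map.Entry<String, Set<String>> e : parents.entrySet()) {
                if (!done.contains(e.getKey()) && done.containsAll(e.getValue())) {
                    done.add(e.getKey());
                    order.add(e.getKey());
                    progressed = true;
                }
            }
            if (!progressed) {
                // no job became runnable, so the parent links must form a cycle
                throw new IllegalStateException("cycle detected among jobs");
            }
        }
        return order;
    }

    public static void main(String[] args) {
        JobOrderSketch workflow = new JobOrderSketch();
        workflow.addParent("copyJob1", "mkdirJob");
        workflow.addParent("copyJob2", "mkdirJob");
        System.out.println(workflow.executionOrder()); // [mkdirJob, copyJob1, copyJob2]
    }
}
```

In this example copyJob1 and copyJob2 both depend only on mkdirJob, so a real engine is free to run them in parallel once mkdirJob has finished; the sketch simply lists them in insertion order.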

Tip: It can be confusing at first but there are two directories to think about when working with workflows. The first is the ${workflow_bundle_dir} which is the location where the workflow bundle has been unzipped. This variable can be used both in the Java object (via the getWorkflowBaseDir() method) and the various config and metadata files (via ${workflow_bundle_dir}). You use this to access the location of data and other file types that you have included in the workflow bundle. The second directory is the current working directory that your workflow steps will be executed in. This is a directory created at runtime by the underlying workflow engine and is shared for all steps in your workflow. You can use this as your temporary directory to process intermediate files.

Configuration File

Each workflow uses a simple INI file to record which variables it accepts, their types, and default values. For example:

# key=input_file:type=file:display=F:file_meta_type=text/plain
input_file=${workflow_bundle_dir}/Workflow_Bundle_MyHelloWorld/${workflow-version}/data/input.txt
# key=greeting:type=text:display=T:display_name=Greeting
greeting=Testing

cat=${workflow_bundle_dir}/Workflow_Bundle_MyHelloWorld/${workflow-version}/bin/gnu-coreutils-5.67/cat
echo=${workflow_bundle_dir}/Workflow_Bundle_MyHelloWorld/${workflow-version}/bin/gnu-coreutils-5.67/echo

# the output directory is a convention used in many workflows to specify a relative output path
output_dir=seqware-results
# the output_prefix is a convention used to specify the root of the absolute output path or an S3 bucket name 
# you should pick a path that is available on all cluster nodes and can be written by your user
output_prefix=./
# manual output determines whether or not SeqWare should enforce the uniqueness of the final directory or not. 
# If false, SeqWare places files in a directory specified by output_prefix/output_dir/workflowname_version/RANDOM/<files>
# where RANDOM is an integer. If true, SeqWare places the files at output_prefix/output_dir and may overwrite existing
# files
manual_output=false

# Optional: This controls the default number of lines of stdout and stderr that jobs in a workflow will report as metadata
# Otherwise, the default in GenericCommandRunner will be used (currently, 10)
seqware-output-lines-number=20

You access these variables in the Java workflow using the getProperty() method. When installing the workflow the ini file is parsed and extra metadata about each parameter is examined. This gives the system information about the type of the variable (integer, string, etc) and any default values.

The ini file(s) follow the general pattern of:

# comment/specification
key=value

To achieve this overloaded role for ini files you need to include hints to ensure the BundleManager that installs workflow bundles has enough information. Here is what the annotation syntax looks like:

# key=<name>:type=[integer|float|text|pulldown|file]:display=[T|F][:display_name=<name_to_display>][:file_meta_type=<mime_meta_type>][:pulldown_items=<key1>|<value1>;<key2>|<value2>]
key=default_value

The file_meta_type is only used for type=file.

The pulldown type means that the pulldown_items should be defined as well. This looks like:

pulldown_items=<key1>|<value1>;<key2>|<value2>

The default value for this will refer to either value1 or value2 above. If you fail to include a metadata line for a particular key/value then it is assumed to be:

key=<name>:type=text:display=F

This is convenient since many of the values in an INI file should not be displayed to the end user.
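
The annotation syntax above is simple enough to parse by hand. The sketch below is an illustration only and is not the parser used by the SeqWare BundleManager: IniAnnotationSketch, parse, and defaultsFor are hypothetical names, and the sketch assumes that field values never contain a colon.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Standalone sketch; IniAnnotationSketch is a hypothetical helper, not the SeqWare parser.
final class IniAnnotationSketch {

    // Parses "# key=<name>:type=...:display=..." into field/value pairs.
    // Returns an empty map for ordinary comments that do not start with "key=".
    static Map<String, String> parse(String commentLine) {
        Map<String, String> fields = new LinkedHashMap<>();
        String body = commentLine.trim().replaceFirst("^#\\s*", "");
        if (!body.startsWith("key=")) {
            return fields;
        }
        // Assumption: fields are separated by ':' and values contain no ':'
        for (String part : body.split(":")) {
            int eq = part.indexOf('=');
            if (eq > 0) {
                fields.put(part.substring(0, eq), part.substring(eq + 1));
            }
        }
        return fields;
    }

    // Defaults described in the text for a key that has no metadata line.
    static Map<String, String> defaultsFor(String key) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("key", key);
        fields.put("type", "text");
        fields.put("display", "F");
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("# key=greeting:type=text:display=T:display_name=Greeting"));
        System.out.println(defaultsFor("cat"));
    }
}
```

Applied to the example INI file, greeting would be shown to the user with the label Greeting, while cat and echo would fall back to the hidden text default.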

Required INI Entries

There are (currently) two required entries that all workflows should define in their ini files. These are related to output file provisioning. In your workflow, if you produce output files and use the file provisioning mechanism built into workflows these two entries are used to construct the output location for the output file.

  • output_dir
  • output_prefix

For example, if you have a SqwFile object and call file.setIsOutput(true); the workflow engine constructs an output path for this file using the following:

<output_prefix>/<output_dir>/<file_name>

You can use s3://bucketname/ or a local path as the prefix.

Note: While the above entries are required, it is STRONGLY suggested that workflow developers no longer rely on them to decide the output path of a provisioned file. Instead we recommend explicitly providing in the ini file whatever paths you may require, possibly using the variables described below, and then assigning that path to the output file via SqwFile.setOutputPath(String path).

Reserved INI Entries

There are a number of entries that are used by SeqWare and should be avoided in your own workflows. They are:

  • parent_accessions
  • parent-accessions
  • parent_accession
  • workflow-run-accession
  • workflow_run_accession
  • metadata
  • workflow_bundle_dir

INI Variables

The ini files support variables, in the format ${variable-name}, that will be replaced when the workflow run is launched. The variable name can refer to another entry in the ini file, or can refer to the following SeqWare generated values:

  • sqw.bundle-dir: the path to the directory of this workflow’s bundle. Support for the legacy version of this variable, workflow_bundle_dir, may be removed in a future version.
  • sqw.date: the current date in ISO 8601 format, e.g., 2013-10-31. Support for the legacy version of this variable, date, may be removed in a future version.
  • sqw.datetime: the current datetime in ISO 8601 format, e.g., 2013-10-31T16:45:30.
  • sqw.random: a randomly generated integer from 0 to 2147483647. Support for the legacy version of this variable, random, may be removed in a future version.
  • sqw.timestamp: the current number of milliseconds since January 1, 1970.
  • sqw.uuid: a randomly generated universally unique identifier.
  • sqw.bundle-seqware-version: the version of seqware that this workflow was built with. You should not have to use this often, but it may be useful if you want to trigger different behaviour based on a version of seqware.

Each instance of the above sqw.* variables in an ini file will be replaced with a separately resolved value, e.g., multiple instances of ${sqw.uuid} will each resolve to different values. If you want to reuse the same generated value, do something akin to the following:

dirname=output
filename=${sqw.random}
text_file=${dirname}/${filename}.txt
json_file=${dirname}/${filename}.json

Thus if filename resolved to a value of 12345, then text_file will have a value of output/12345.txt and json_file will have a value of output/12345.json.
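
The resolution behaviour described above can be modelled in a few lines. This is a sketch under stated assumptions, not SeqWare's resolver: IniVariableSketch and resolveAll are hypothetical names, generated values are supplied through a map of suppliers, and cyclic references between entries are not detected.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone sketch; IniVariableSketch is a hypothetical helper, not SeqWare code.
final class IniVariableSketch {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    // Resolves every entry; generated values (e.g. sqw.random) come from suppliers.
    static Map<String, String> resolveAll(Map<String, String> ini,
                                          Map<String, Supplier<String>> generated) {
        Map<String, String> resolved = new LinkedHashMap<>();
        for (String key : ini.keySet()) {
            resolve(key, ini, generated, resolved);
        }
        return resolved;
    }

    private static String resolve(String key, Map<String, String> ini,
                                  Map<String, Supplier<String>> generated,
                                  Map<String, String> resolved) {
        if (resolved.containsKey(key)) {
            return resolved.get(key); // an entry is resolved once, then reused
        }
        Matcher m = VAR.matcher(ini.get(key));
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String value;
            if (generated.containsKey(name)) {
                value = generated.get(name).get(); // fresh value for each occurrence
            } else if (ini.containsKey(name)) {
                value = resolve(name, ini, generated, resolved);
            } else {
                value = m.group(); // unknown variables are left untouched
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        String result = out.toString();
        resolved.put(key, result);
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> ini = new LinkedHashMap<>();
        ini.put("dirname", "output");
        ini.put("filename", "${sqw.random}");
        ini.put("text_file", "${dirname}/${filename}.txt");
        ini.put("json_file", "${dirname}/${filename}.json");

        Map<String, Supplier<String>> generated = new LinkedHashMap<>();
        // fixed value for a repeatable demo; a real run would produce a random integer
        generated.put("sqw.random", () -> "12345");

        resolveAll(ini, generated).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Because filename is resolved once and then reused, text_file and json_file share the same generated value, whereas writing ${sqw.random} directly in both entries would call the supplier twice.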

Modifying the Workflow

At this point, one would normally want to edit the workflow by modifying the MyHelloWorldWorkflow.java file as is appropriate for the workflow. In the example below I just added an extra job that does a simple shell operation (dateJob). I also moved the addFile method appropriately since dateJob is now the final job that manipulates the dir1/output file.

    @Override
    public void buildWorkflow() {

        // a simple bash job to call mkdir
        // note that this job uses the system's mkdir (which depends on the system being *nix)
        Job mkdirJob = this.getWorkflow().createBashJob("bash_mkdir").setMaxMemory("3000");
        mkdirJob.getCommand().addArgument("mkdir test1");

        String inputFilePath = this.getFiles().get("file_in_0").getProvisionedPath();

        // a simple bash job to cat a file into a test file
        // the file is not saved to the metadata database
        Job copyJob1 = this.getWorkflow().createBashJob("bash_cp").setMaxMemory("3000");
        copyJob1.setCommand(catPath + " " + inputFilePath + "> test1/test.out");
        copyJob1.addParent(mkdirJob);

        // a simple bash job to echo to an output file and concat an input file
        // the file IS saved to the metadata database
        Job copyJob2 = this.getWorkflow().createBashJob("bash_cp").setMaxMemory("3000");
        copyJob2.getCommand().addArgument(echoPath).addArgument(greeting).addArgument(" > ").addArgument("dir1/output");
        copyJob2.getCommand().addArgument(";");
        copyJob2.getCommand().addArgument(catPath + " " +inputFilePath+ " >> dir1/output");
        copyJob2.addParent(mkdirJob);
        SqwFile outputFile = createOutputFile("dir1/output", "txt/plain", manualOutput);
        // this will annotate the processing event associated with copying your output file to its final location
        outputFile.getAnnotations().put("provision.file.annotation.key.1", "provision.annotation.value.1");


        Job dateJob = this.getWorkflow().createBashJob("date").setMaxMemory("3000");
        dateJob.setCommand("date >> dir1/output");
        dateJob.addParent(copyJob2);
        dateJob.addFile(outputFile);
    }

Building the Workflow

If you made changes to the workflow files, now would be a good time to use “mvn clean install” to refresh the workflow bundle in the target directory. For example:

    $ cd ~/workflow-dev/workflow-MyHelloWorld/
    $ mvn clean install
    [INFO] Scanning for projects...
    ...
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 40.100s
    [INFO] Finished at: Thu Aug 15 19:59:22 UTC 2013
    [INFO] Final Memory: 28M/67M
    [INFO] ------------------------------------------------------------------------

Testing the Workflow

During the launch process, a number of files are generated into the generated-scripts directory inside the run’s working directory. For each job in a workflow, a <job-name>.sh script contains the content of the associated BashJob defined by the developer.

When using the Oozie-SGE engine, some additional files are included:

  • <job-name>-runner.sh: The script invoked by SGE, which will either perform a file provisioning (copying files into/out of the working directory), or invoke <job-name>.sh
  • <job-name>-qsub.opts: The options to be provided to the qsub command, e.g. setting max memory.

Prior to testing your bundle, it will be worthwhile to ensure that the files generated are what you expect. You can accomplish this with the dry-run command:

$ seqware bundle dry-run --dir target/Workflow_Bundle_*
Performing dry-run of workflow 'MyHelloWorld' version '1.0'
Using working directory: /usr/tmp/seqware-oozie/oozie-3d971491-ca43-48fb-a5d8-a73e18e7db44
Files copied to hdfs://master:8020/user/seqware/seqware_workflow/oozie-3d971491-ca43-48fb-a5d8-a73e18e7db44

$ ls /usr/tmp/seqware-oozie/oozie-3d971491-ca43-48fb-a5d8-a73e18e7db44
bash_cp_4.sh  bash_cp_5.sh  bash_mkdir_3.sh  start_0.sh

In the above, /usr/tmp/seqware-oozie is the configured location of OOZIE_WORK_DIR. Each of the scripts represents a bash job specified by the developer, with the exception of start_0.sh which creates the directories specified in the workflow’s setupDirectory() method.

At this point, the individual scripts can be executed to ensure they do what you expect.
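Before executing any of the generated scripts, it can be useful to confirm that they at least parse. The following is a minimal sketch, not part of SeqWare itself: check_generated_scripts is a hypothetical helper name, and the directory you pass it is assumed to be the working directory reported by your own dry-run.

```shell
# Syntax-check every generated job script in a directory without running it.
# Returns non-zero if any script fails the check.
check_generated_scripts() {
    local dir="$1" status=0 script
    for script in "$dir"/*.sh; do
        [ -e "$script" ] || continue      # skip the literal glob when nothing matches
        if bash -n "$script"; then        # -n parses the script but executes nothing
            echo "OK   $script"
        else
            echo "FAIL $script"
            status=1
        fi
    done
    return "$status"
}

# Example call; replace the placeholder with the working directory
# reported by your own dry-run:
# check_generated_scripts /path/to/working-directory
```

A syntax check only catches parse errors. To confirm that a script does what you expect, you still need to run it and inspect its output.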

After authoring your workflow in the Java workflow language and verifying the generated scripts, the next step is to run it:

$ seqware bundle launch --dir target/Workflow_Bundle_*
Performing launch of workflow 'MyHelloWorld' version '1.0'
Using working directory: /usr/tmp/seqware-oozie/oozie-eccfb3b6-cda5-46c3-89ce-7839d4210531
Files copied to hdfs://master:8020/user/seqware/seqware_workflow/oozie-eccfb3b6-cda5-46c3-89ce-7839d4210531
Submitted Oozie job: 0000001-130930203123321-oozie-oozi-W

Polling workflow run status every 10 seconds.
Terminating this program will NOT affect the running workflow.

Workflow job running ...
Application Path   : hdfs://master:8020/user/seqware/seqware_workflow/oozie-eccfb3b6-cda5-46c3-89ce-7839d4210531
Application Name   : MyHelloWorld
Application Status : RUNNING
Application Actions:
   Name: :start: Type: :START: Status: OK
   Name: start_0 Type: java Status: RUNNING
...

The above bypasses the workflow scheduling and asynchronous launching process that you saw in the User Tutorial. What you lose is the metadata tracking functionality: the command runs the workflow and produces its output files, but no record of the run is stored in the MetaDB.

Running with the Oozie-SGE Workflow Engine

By default workflows are executed on the Oozie workflow engine, with each step treated as a MapReduce job. In the future, using this workflow engine will allow for mixed workflows that include traditional scripts along with steps using MapReduce, Pig, Hive, and other Hadoop-associated technologies.

There are a few caveats for the Oozie workflow engine in SeqWare. For example, to run the workflow above you will need to do the following:

  • Ensure your .seqware/settings file includes the correct parameters. If you are using our VM this will be true.
  • Jobs are run by the ‘mapred’ user, not the seqware user (this is not the case with oozie-sge, mentioned below). So when you author and run workflows, make sure the output destination can be written to by mapred. In the future we will eliminate this constraint.
  • Workflows currently support only bash jobs. Other Hadoop-specific job types (e.g. MapReduce) are planned but not yet implemented.
  • This engine will only work on the 1.0.0 release of SeqWare or newer. The 0.13.6.x and earlier releases will only work with the Pegasus workflow engine.
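The caveat above about the mapred user can be checked before a launch. The sketch below is one possible check for an output destination on the local filesystem, not a SeqWare feature. The function name is_world_writable is hypothetical. A world-writable directory is only one way of letting mapred write to it, since group membership or ownership would also work.

```shell
# Report whether a local directory grants write permission to "other" users,
# which is one (not the only) way for the mapred user to be able to write to it.
is_world_writable() {
    local dir="$1"
    # -maxdepth 0 restricts find to the directory itself;
    # -perm -002 matches when the "other" write bit is set.
    [ -n "$(find "$dir" -maxdepth 0 -type d -perm -002 2>/dev/null)" ]
}

# Example call with a placeholder path:
# is_world_writable /path/to/output && echo "writable by other users"
```

If you have sudo rights on the machine, running `sudo -u mapred test -w /path/to/output` is a more direct test of the same condition.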

We also provide an alternate “engine” that will allow jobs to be managed and scheduled by Oozie but run on a traditional Sun Grid Engine (SGE) cluster. For workflows that execute on SGE, either change the SeqWare settings file to specify SW_DEFAULT_ENGINE=oozie-sge, or add --engine oozie-sge to the command line. Unlike the oozie engine, the oozie-sge engine will execute jobs as the submitting user.
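For the settings-file route, the change amounts to a single KEY=value line. The sketch below assumes the settings file is a plain list of KEY=value lines and that it lives at ~/.seqware/settings. The helper name set_default_engine is hypothetical.

```shell
# Set SW_DEFAULT_ENGINE in a SeqWare settings file, replacing an existing
# entry or appending one if the key is absent.
set_default_engine() {
    local settings="$1" engine="$2"
    if grep -q '^SW_DEFAULT_ENGINE=' "$settings"; then
        # -i.bak edits in place and keeps a backup copy next to the file
        sed -i.bak "s/^SW_DEFAULT_ENGINE=.*/SW_DEFAULT_ENGINE=${engine}/" "$settings"
    else
        echo "SW_DEFAULT_ENGINE=${engine}" >> "$settings"
    fi
}

# Example call, assuming the settings file lives in your home directory:
# set_default_engine ~/.seqware/settings oozie-sge
```

Editing the file by hand works just as well. The function only makes the change repeatable.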

In the following example, the same workflow as above is executed with the oozie-sge engine:

$ cd /home/seqware/workflow-dev/workflow-MyHelloWorld
$ seqware bundle launch --dir target/Workflow_Bundle_* --engine oozie-sge

This will cause the workflow to run, and the command will not exit until the workflow finishes. You can also monitor the workflow using the Hue web application installed at http://hostname:11000/oozie/. For our VMs the username and password are both “seqware”. This is a great way to monitor and debug workflows; for example, you can very easily get to the logs for each step.
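To find a particular run in the web application, it helps to have its Oozie job ID at hand. The launch output shown earlier contains a "Submitted Oozie job:" line. The sketch below pulls the ID out of saved launch output. It assumes the oozie-sge launch prints the same line, and the helper name extract_oozie_job_id and the log file name are hypothetical.

```shell
# Read launch output on stdin and print the Oozie job ID from the
# "Submitted Oozie job: <id>" line, if present.
extract_oozie_job_id() {
    sed -n 's/^Submitted Oozie job: *//p'
}

# Example usage (launch.log is a placeholder file name):
#   seqware bundle launch --dir target/Workflow_Bundle_* | tee launch.log
# and, from a second terminal while the launch is still polling:
#   extract_oozie_job_id < launch.log
```

If the standard Oozie command-line client is installed, the ID can also be passed to `oozie job -oozie http://hostname:11000/oozie -info <job-id>` to get the same status information as text.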

Packaging the Workflow into a Workflow Bundle

Assuming the workflow above ran successfully, the next step is to package it.

$ mkdir ~/packaged-bundles
$ seqware bundle package --dir target/Workflow_Bundle_* --to ~/packaged-bundles/
Validating Bundle structure
Packaging Bundle
Bundle has been packaged to /home/seqware/packaged-bundles

What happens here is that the Workflow_Bundle_MyHelloWorld_1.0_SeqWare_1.1.0 directory is zipped up into your output directory (~/packaged-bundles); the resulting file can be provided to an admin for installation.
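Before handing the bundle off, you may want to confirm that an archive actually landed in the output directory. This is a small sketch and makes two assumptions: that the packaged file carries a .zip extension, and the hypothetical helper name list_packaged_bundles.

```shell
# List packaged bundle archives in a directory.
# Returns non-zero when no *.zip file is found.
list_packaged_bundles() {
    local dir="$1" found=1 archive
    for archive in "$dir"/*.zip; do
        [ -f "$archive" ] || continue   # skip the literal glob when nothing matches
        ls -lh "$archive"
        found=0
    done
    return "$found"
}

# Example call using the output directory from the packaging step:
# list_packaged_bundles ~/packaged-bundles
```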

Next Steps

The next step is the Admin Tutorial which will show you how to install the workflow created above so other users can call it.