Tip: The Java workflow language is our recommended workflow language for new development.
This document focuses on the format of the Java workflow language. For more information about the entire workflow bundle, please see the Developer Tutorial; you should read that guide before this page.
Java workflows work with both the Pegasus and Oozie Workflow Engines. Note, however, that if you use MapReduce or other Hadoop-specific job types in your Java workflows, they will not function in the Pegasus Workflow Engine, since Hadoop is only present for the Oozie Workflow Engine.
In the Developer Tutorial you saw how to create a MyHelloWorld Java workflow using a Maven archetype.
The first step is to generate your workflow skeleton using Maven archetypes. You will want to do this in a directory without pom.xml files (i.e. outside of the SeqWare development directories). Here we are working in the workflow-dev directory:
$ cd ~/workflow-dev
$ mvn archetype:generate
...
829: local -> com.github.seqware:seqware-archetype-decider (SeqWare Java Decider archetype)
830: local -> com.github.seqware:seqware-archetype-java-workflow (SeqWare Java workflow archetype)
831: local -> com.github.seqware:seqware-archetype-module (SeqWare module archetype)
Choose a number or apply filter (format: [groupId:]artifactId, case sensitive contains): 294:
Note: You can inform Maven about the archetypes by downloading the latest archetype-catalog.xml file and placing it in ~/.m2/.
The numbers used to identify the archetypes will vary depending on what you have installed, so you will need to scan through the list to find the SeqWare archetype you are looking for, in this case “SeqWare Java workflow archetype”. You can also enter “seqware” as a filter in order to narrow down the possibilities. Following the prompts, use “com.github.seqware” as the package, “MyHelloWorld” as the artifactId and workflow-name, and 1.0 for the version.
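For example, entering the “seqware” filter narrows the list to the three SeqWare archetypes; the numbering shown here is illustrative and will differ on your system:

$ mvn archetype:generate
...
Choose a number or apply filter (format: [groupId:]artifactId, case sensitive contains): seqware
Choose archetype:
1: local -> com.github.seqware:seqware-archetype-decider (SeqWare Java Decider archetype)
2: local -> com.github.seqware:seqware-archetype-java-workflow (SeqWare Java workflow archetype)
3: local -> com.github.seqware:seqware-archetype-module (SeqWare module archetype)
Choose a number or apply filter (format: [groupId:]artifactId, case sensitive contains): 2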
Alternatively, the archetype can be generated without any interaction:
$ mvn archetype:generate \
-DinteractiveMode=false \
-DarchetypeCatalog=local \
-DarchetypeGroupId=com.github.seqware \
-DarchetypeArtifactId=seqware-archetype-java-workflow \
-DgroupId=com.github.seqware \
-DartifactId=workflow-MyHelloWorld \
-Dworkflow-name=MyHelloWorld \
-Dversion=1.0
Either way, substitute the workflow name and version you want to use. When complete, you can cd into the new workflow directory and use mvn install to build the workflow. This copies files to the correct location and pulls in needed dependencies.
$ cd ~/workflow-dev/workflow-MyHelloWorld/
$ mvn clean install
[INFO] Scanning for projects...
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 40.100s
[INFO] Finished at: Thu Aug 15 19:59:22 UTC 2013
[INFO] Final Memory: 28M/67M
[INFO] ------------------------------------------------------------------------
You will now have a workflow directory called target/Workflow_Bundle_<WorkflowName> which contains your assembled workflow.
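The layout under target/ follows a standard pattern. A sketch of the typical structure, assuming the MyHelloWorld name and version used above (the SeqWare version suffix will match your installed release):

target/Workflow_Bundle_MyHelloWorld_1.0_SeqWare_<seqware-version>/
  Workflow_Bundle_MyHelloWorld/
    1.0/
      bin/      (bundled tools and scripts)
      config/   (the default workflow.ini for this workflow)
      data/     (bundled data files and symlinked dependencies)
      lib/      (jars, including the seqware-distribution jar)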
You can see the full Java workflow source code by looking at Workflow Examples or, in this case, just the src/main/java/com/github/seqware/MyHelloWorldWorkflow.java file produced by the Maven Archetype above.
This Java class is fairly simple in its construction. It defines input and output files along with the individual steps in the workflow and how they relate to each other. It is used to create a workflow object model which is then handed off to a workflow engine that knows how to turn it into a directed acyclic graph of jobs that can run on a cluster (a local VM, an HPC cluster, a cloud-based cluster, etc.).
The full contents of the MyHelloWorldWorkflow.java are included below. We will describe each section in more detail next:
package com.github.seqware;

import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;
import net.sourceforge.seqware.pipeline.workflowV2.AbstractWorkflowDataModel;
import net.sourceforge.seqware.pipeline.workflowV2.model.Job;
import net.sourceforge.seqware.pipeline.workflowV2.model.SqwFile;

/**
 * For more information on developing workflows, see the documentation at
 * SeqWare Java Workflows.
 *
 * Quick reference for the order of methods called:
 * 1. setupDirectory
 * 2. setupFiles
 * 3. setupWorkflow
 * 4. setupEnvironment
 * 5. buildWorkflow
 *
 * See the SeqWare API for
 * AbstractWorkflowDataModel
 * for more information.
 */
public class MyHelloWorldWorkflow extends AbstractWorkflowDataModel {

    private boolean manualOutput = false;
    private String catPath, echoPath;
    private String greeting = "";

    private void init() {
        try {
            // optional properties
            if (hasPropertyAndNotNull("manual_output")) {
                manualOutput = Boolean.valueOf(getProperty("manual_output"));
            }
            if (hasPropertyAndNotNull("greeting")) {
                greeting = getProperty("greeting");
            }
            // these two properties are essential to the workflow. If they are null or do not
            // exist in the INI, the workflow should exit.
            catPath = getProperty("cat");
            echoPath = getProperty("echo");
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }

    @Override
    public void setupDirectory() {
        // since setupDirectory is the first method run, we use it to initialize variables too.
        init();
        // creates a dir1 directory in the current working directory where the workflow runs
        this.addDirectory("dir1");
    }

    @Override
    public Map<String, SqwFile> setupFiles() {
        try {
            // register a plaintext input file using the information from the INI
            // provisioning this file to the working directory will be the first step in the workflow
            SqwFile file0 = this.createFile("file_in_0");
            file0.setSourcePath(getProperty("input_file"));
            file0.setType("text/plain");
            file0.setIsInput(true);
        } catch (Exception ex) {
            ex.printStackTrace();
            System.exit(1);
        }
        return this.getFiles();
    }

    @Override
    public void buildWorkflow() {
        // a simple bash job to call mkdir
        // note that this job uses the system's mkdir (which depends on the system being *nix)
        Job mkdirJob = this.getWorkflow().createBashJob("bash_mkdir");
        mkdirJob.getCommand().addArgument("mkdir test1");

        String inputFilePath = this.getFiles().get("file_in_0").getProvisionedPath();

        // a simple bash job to cat a file into a test file
        // the file is not saved to the metadata database
        Job copyJob1 = this.getWorkflow().createBashJob("bash_cp");
        copyJob1.setCommand(catPath + " " + inputFilePath + " > test1/test.out");
        copyJob1.addParent(mkdirJob);

        // a simple bash job to echo to an output file and concatenate an input file
        // the file IS saved to the metadata database
        Job copyJob2 = this.getWorkflow().createBashJob("bash_cp");
        copyJob2.getCommand().addArgument(echoPath).addArgument(greeting).addArgument(" > ").addArgument("dir1/output");
        copyJob2.getCommand().addArgument(";");
        copyJob2.getCommand().addArgument(catPath + " " + inputFilePath + " >> dir1/output");
        copyJob2.addParent(mkdirJob);
        copyJob2.addFile(createOutputFile("dir1/output", "text/plain", manualOutput));
    }

    private SqwFile createOutputFile(String workingPath, String metatype, boolean manualOutput) {
        // register an output file
        SqwFile file1 = new SqwFile();
        file1.setSourcePath(workingPath);
        file1.setType(metatype);
        file1.setIsOutput(true);
        file1.setForceCopy(true);
        // if manual_output is set in the ini then use it to set the destination of this file
        if (manualOutput) {
            file1.setOutputPath(this.getMetadata_output_file_prefix() + getMetadata_output_dir() + "/" + workingPath);
        } else {
            file1.setOutputPath(this.getMetadata_output_file_prefix() + getMetadata_output_dir() + "/"
                    + this.getName() + "_" + this.getVersion() + "/" + this.getRandom() + "/" + workingPath);
        }
        return file1;
    }
}
Variables are defined just as any other object variables would be in Java. To access values from the workflow’s INI file, use the getProperty(“key”) method.
// showing how to access the workflow bundle install location
file0.setSourcePath(this.getWorkflowBaseDir()+"/data/input.txt");
// showing how to access a property from the INI file
file0.setSourcePath(getProperty("input_file"));
// showing how to access variables out of the ini file
this.getProperty("output_file_1");
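For reference, a minimal INI file for the MyHelloWorld workflow above might look like the following. The property names match those read in init() and setupFiles(); the values shown here are only placeholders:

# required: provisioned as the workflow's input in setupFiles()
input_file=/tmp/input.txt
# required: paths to the tools used by the bash jobs
cat=/bin/cat
echo=/bin/echo
# optional: text echoed into dir1/output
greeting=Testing
# optional: when true, output goes directly under the configured output prefix/dir
# instead of a per-run <name>_<version>/<random>/ subdirectory
manual_output=false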
Files that are inputs or outputs from workflows need to be copied in or out, respectively. Under the hood, this uses the ProvisionFiles module, which knows how to move around local files, files over HTTP, or remote files on Amazon’s S3. The Java syntax simplifies the declaration of these input and output files by providing the method below. Keep in mind, you can also transfer input or output files with a standard job that calls the necessary command line tool, bypassing this built-in system (see the sketch after the method below). But the setupFiles() method will likely work for most purposes and is the easiest way to register workflow files.
@Override
public Map<String, SqwFile> setupFiles() {
    try {
        // register a plaintext input file using the information from the INI
        // provisioning this file to the working directory will be the first step in the workflow
        SqwFile file0 = this.createFile("file_in_0");
        file0.setSourcePath(getProperty("input_file"));
        file0.setType("text/plain");
        file0.setIsInput(true);
    } catch (Exception ex) {
        ex.printStackTrace();
        System.exit(1);
    }
    return this.getFiles();
}
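If you do bypass the built-in provisioning, the transfer is simply another job in buildWorkflow(). A minimal sketch using only the job API shown in this document; the URL and file name are placeholders (in real code you would read them from the INI):

// fetch an input ourselves instead of declaring it in setupFiles()
Job fetchJob = this.getWorkflow().createBashJob("bash_fetch");
// placeholder URL; assumes wget is available on the worker node
fetchJob.getCommand().addArgument("wget -O input.txt http://example.com/input.txt");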
Output files are linked to the jobs that produce those files with addFile(SqwFile). A convenience method is provided in the archetype for creating output files. Note that these files are not available from this.getFiles().get(String).
@Override
public void buildWorkflow() {
    ...
    Job copyJob2 = this.getWorkflow().createBashJob("bash_cp");
    ...
    copyJob2.addFile(createOutputFile("dir1/output", "text/plain", manualOutput));
}

private SqwFile createOutputFile(String workingPath, String metatype, boolean manualOutput) {
    // register an output file
    SqwFile file1 = new SqwFile();
    file1.setSourcePath(workingPath);
    file1.setType(metatype);
    file1.setIsOutput(true);
    file1.setForceCopy(true);
    // if manual_output is set in the ini then use it to set the destination of this file
    if (manualOutput) {
        file1.setOutputPath(this.getMetadata_output_file_prefix() + getMetadata_output_dir() + "/" + workingPath);
    } else {
        file1.setOutputPath(this.getMetadata_output_file_prefix() + getMetadata_output_dir() + "/"
                + this.getName() + "_" + this.getVersion() + "/" + this.getRandom() + "/" + workingPath);
    }
    return file1;
}
You can also specify directories to be created in the working directory of your workflow.
@Override
public void setupDirectory() {
    // creates a dir1 directory in the current working directory where the workflow runs
    this.addDirectory("dir1");
}
Jobs need to have distinct IDs; if you generate jobs in a Java loop, use the loop index to keep the names unique (see the sketch after the listing below). You can put any command in the argument section, but mostly this is used to call GenericCommandRunner, which runs the command provided in a Bash shell.
@Override
public void buildWorkflow() {
    // a simple bash job to call mkdir
    // note that this job uses the system's mkdir (which depends on the system being *nix)
    // this also translates into a 3000 h_vmem limit when using SGE
    Job mkdirJob = this.getWorkflow().createBashJob("bash_mkdir").setMaxMemory("3000");
    mkdirJob.getCommand().addArgument("mkdir test1");

    String inputFilePath = this.getFiles().get("file_in_0").getProvisionedPath();

    // a simple bash job to cat a file into a test file
    // the file is not saved to the metadata database
    Job copyJob1 = this.getWorkflow().createBashJob("bash_cp").setMaxMemory("3000");
    copyJob1.setCommand(catPath + " " + inputFilePath + " > test1/test.out");
    copyJob1.addParent(mkdirJob);
    // this will annotate the processing event associated with the cat of the file above
    copyJob1.getAnnotations().put("command.annotation.key.1", "command.annotation.value.1");
    copyJob1.getAnnotations().put("command.annotation.key.2", "command.annotation.value.2");

    // a simple bash job to echo to an output file and concatenate an input file
    // the file IS saved to the metadata database
    Job copyJob2 = this.getWorkflow().createBashJob("bash_cp").setMaxMemory("3000");
    copyJob2.getCommand().addArgument(echoPath).addArgument(greeting).addArgument(" > ").addArgument("dir1/output");
    copyJob2.getCommand().addArgument(";");
    copyJob2.getCommand().addArgument(catPath + " " + inputFilePath + " >> dir1/output");
    copyJob2.addParent(mkdirJob);

    SqwFile outputFile = createOutputFile("dir1/output", "text/plain", manualOutput);
    // this will annotate the processing event associated with copying your output file to its final location
    outputFile.getAnnotations().put("provision.file.annotation.key.1", "provision.annotation.value.1");
    copyJob2.addFile(outputFile);
}
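Because each call to createBashJob() needs a distinct name, a Java loop that fans out jobs should bake the loop index into the name. A minimal sketch, reusing the mkdirJob and echoPath variables from the listing above; the chunk count of 3 is purely illustrative:

// fan out one job per chunk; "bash_chunk_0", "bash_chunk_1", ... keep the names unique
for (int i = 0; i < 3; i++) {
    Job chunkJob = this.getWorkflow().createBashJob("bash_chunk_" + i).setMaxMemory("3000");
    chunkJob.getCommand().addArgument(echoPath).addArgument("processing chunk " + i);
    chunkJob.addParent(mkdirJob);
}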
Currently the only supported job type is the Bash job created via the createBashJob() method. In the future we will provide an expanded list of convenience job types, for example MapReduce, Pig, Java jar, etc.
The dependencies section links together all the individual jobs in the correct order so they can be executed successfully. Parent/child relationships are used to specify job prerequisites.
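For example, a job that consumes the outputs of two upstream jobs declares both as parents, and the engine will not schedule it until both have finished. A minimal sketch reusing copyJob1 and copyJob2 from the listing above; the merge command is illustrative:

// merge the two earlier outputs; this job runs only after both parents complete
Job mergeJob = this.getWorkflow().createBashJob("bash_merge").setMaxMemory("3000");
mergeJob.getCommand().addArgument(catPath + " test1/test.out dir1/output > merged.out");
mergeJob.addParent(copyJob1);
mergeJob.addParent(copyJob2);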
Workflows can also depend on external files and tools via symbolic links. This type of dependency is not generally recommended; in most cases you will want to check a dependency into a Maven repository or add it into the binary or data directories of a workflow. When you do need to rely on symbolic links, the conventions below apply.
By convention, the symlinks go into a directory called links in the root of the workflow. They should link to a directory, not to a single file, for the purposes of copying the dependencies to the final bundle (Maven doesn’t accept single files for copying upon install). The SeqWare archetype for Java workflows symlinks the entire links folder to the data directory in the final workflow bundle.
First, create the link:
[seqware@seqwarevm workflow-MyHelloWorld]$ cd links
[seqware@seqwarevm links]$ rm -Rf *
[seqware@seqwarevm links]$ ln -s ../workflow/data/
[seqware@seqwarevm links]$ ls -alhtr
total 8.0K
drwxrwxr-x 5 seqware seqware 4.0K May 31 17:39 ..
lrwxrwxrwx 1 seqware seqware 17 May 31 17:41 data -> ../workflow/data/
drwxrwxr-x 2 seqware seqware 4.0K May 31 17:41 .
Second, check your bundle’s pom.xml to verify that the maven-junction-plugin is present. This plugin will create a link when compiling your bundle:
<build>
  ...
  <plugins>
    ...
    <plugin>
      <groupId>com.pyx4j</groupId>
      <artifactId>maven-junction-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>link</goal>
          </goals>
        </execution>
        <execution>
          <id>unlink</id>
          <phase>clean</phase>
          <goals>
            <goal>unlink</goal>
          </goals>
        </execution>
      </executions>
      <configuration>
        <links>
          <link>
            <dst>${project.build.directory}/Workflow_Bundle_${workflow-name}_${project.version}_SeqWare_${seqware-version}/Workflow_Bundle_${workflow-directory-name}/${project.version}/data/data</dst>
            <src>${basedir}/links/data</src>
          </link>
        </links>
      </configuration>
    </plugin>
  </plugins>
</build>
After “mvn clean install”, your bundle’s data directory will contain a symbolic link to your dependency; the link is only included in the bundle when the bundle is packaged (and/or installed). Note that the destination directory cannot exist before the junction occurs or the build will fail.
You can test the workflow using the process shown in the Developer Tutorial. For example (here shown for a workflow named simple-legacy-ftl-workflow):
java -jar ~/seqware-full.jar -p net.sourceforge.seqware.pipeline.plugins.BundleManager -- -b `pwd` -t --workflow simple-legacy-ftl-workflow --version 1.0
Running Plugin: net.sourceforge.seqware.pipeline.plugins.BundleManager
Setting Up Plugin: net.sourceforge.seqware.pipeline.plugins.BundleManager@e80d1ff
Testing Bundle
Running Test Command:
java -jar /home/seqware/Temp/simple-legacy-ftl-workflow/target/Workflow_Bundle_simple-legacy-ftl-workflow_1.0-SNAPSHOT_SeqWare_0.13.6.x/Workflow_Bundle_simple-legacy-ftl-workflow/1.0-SNAPSHOT/lib/seqware-distribution-0.13.6.x-full.jar --plugin net.sourceforge.seqware.pipeline.plugins.WorkflowLauncher --provisioned-bundle-dir /home/seqware/Temp/simple-legacy-ftl-workflow/target/Workflow_Bundle_simple-legacy-ftl-workflow_1.0-SNAPSHOT_SeqWare_0.13.6.x --workflow simple-legacy-ftl-workflow --version 1.0 --ini-files /home/seqware/Temp/simple-legacy-ftl-workflow/target/Workflow_Bundle_simple-legacy-ftl-workflow_1.0-SNAPSHOT_SeqWare_0.13.6.x/Workflow_Bundle_simple-legacy-ftl-workflow/1.0-SNAPSHOT/config/workflow.ini
MONITORING PEGASUS STATUS:
RUNNING: step 1 of 5 (20%)
RUNNING: step 2 of 5 (40%)
...
See the Developer Tutorial document for more information. For Java API documentation consult the SeqWare Javadocs.