Writing Modules

Overview

Tip: Most workflow developers do not need to write Modules, since a workflow step will usually just run a Bash command rather than a custom Java program. These directions are intended primarily for core SeqWare developers who wish to extend SeqWare itself.

Modules are small Java objects that wrap one or more tools, typically command line programs. Modules are parameterized and called from the XML workflow documents that are fed into Pegasus for running on a cluster. We use this approach, rather than calling command line utilities directly in the workflow XML documents, because it gives us a place to add code that tests and documents the metadata generated by each step. Modules can also be called directly from the command line for testing purposes; there is an example of this below.

The key to designing modules is to keep them simple, self-contained, and reusable. For example, it would be an error to:

  • wrap dozens of system calls in one module
  • hard-code paths to executables (e.g. /usr/bin/bfast); pass these as arguments instead
  • write a module for one particular task in one particular workflow; instead, focus on well-defined steps that can be used over and over again by many different workflows

Good module design attempts to do the following:

  • perform just one particular reusable task; call only one or a few tools in each module
  • pass all options, inputs, outputs, and executable programs as arguments (see the sketch below)
  • be reusable; new workflows should be able to use any module, so leave out logic specific to one workflow
  • use standard input and output files; modules can create many temp files but should converge on standard formats for input/output
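
As a minimal sketch of these rules, using only standard Java and deliberately not tied to any particular SeqWare API, note how the executable, input, and output below all arrive as arguments rather than being hard-coded:

import java.io.File;

// Generic illustration of the design rules above: the wrapped executable and
// its input/output files are all passed in, never hard-coded.
public class RunWrappedTool {
    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("Usage: RunWrappedTool <path-to-executable> <input-file> <output-file>");
            System.exit(1);
        }
        String executable = args[0]; // e.g. /usr/bin/bfast, supplied by the workflow
        String inputFile = args[1];
        String outputFile = args[2];

        // Assemble and run the command, capturing the tool's stdout as the output file.
        ProcessBuilder pb = new ProcessBuilder(executable, inputFile);
        pb.redirectOutput(new File(outputFile));
        int exitCode = pb.start().waitFor();
        System.exit(exitCode);
    }
}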

Modules can now be developed simply and quickly by using the [[Creating_Workflow_Bundles_and_Modules_Using_Maven_Archetypes#SeqWare_Module_Archetype|Maven module archetype]].

Why Wrappers

You develop a module for a given tool, where a tool is generally a command line program. The wrapper handles testing, collecting metadata, checking parameters, checking output, and running the command line tool. The wrapper is called via an XML workflow document, described on another page.

We use this wrapper approach because it provides a mechanism for abstracting command line tools and because it records metadata that a direct call to a command line tool would not capture.

Writing a New Wrapper

The wrapper objects live in the Java package net.sourceforge.seqware.pipeline.applications. The best way to get started with a new wrapper for a command line tool you’re interested in is to copy an existing wrapper object and customize it for your purposes. Take a look at the interface to see the minimal methods required for a wrapper: net.sourceforge.seqware.pipeline.wrapper.WrapperInterface. Then look at one of the existing wrappers to base yours on; for example, net.sourceforge.seqware.pipeline.modules.examples.HelloWorld is a good place to start.

The module API includes the following methods:

  • init(): initialization goes here; DB setup, temp file creation, etc.
  • do_verify_parameters(): make sure all needed parameters are passed in
  • do_verify_input(): ensure needed input files exist, output directories are writable, etc.
  • do_test(): perform any testing here; make sure a database is online, make sure command line tools function, you can even write a test suite here that runs the program you’re wrapping on a known-good input and then verifies the output
  • do_run(): the core of a module, this is where you actually execute your task
  • do_verify_output(): check that the expected output has been created
  • clean_up(): clean up any temp files and directories

These methods give you well-defined places to implement the running of a given command line utility (or anything else really) in a robust way. In the background the runner, which calls your modules, handles logging information (e.g. success or failure messages) back to the meta database. As a developer, all you have to do is implement these methods however you choose in order to make a new module available.
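
To make this lifecycle concrete, here is a skeletal module in the spirit of the HelloWorld example. The base class name, the ReturnValue status type (assumed to default to success), its constants, and the getParameters() accessor are assumptions that vary between SeqWare versions, so treat this as a sketch of the method flow rather than copy-and-paste code; consult the HelloWorld source in your checkout for the exact signatures.

import java.io.File;
import java.util.List;

// Sketch only: the superclass, the ReturnValue type, and the accessors used
// below are assumptions; check the HelloWorld module for the exact signatures.
public class HelloGoodbye extends Module {

    private File tempDir = null;

    public ReturnValue init() {
        // Set up temp space, database handles, etc.
        ReturnValue ret = new ReturnValue(); // assumed to default to success
        try {
            tempDir = File.createTempFile("hello-goodbye", "");
            tempDir.delete();
            tempDir.mkdirs();
        } catch (Exception e) {
            ret.setExitStatus(ReturnValue.FAILURE); // assumed failure constant
            ret.setStderr(e.getMessage());
        }
        return ret;
    }

    public ReturnValue do_verify_parameters() {
        // Make sure every required argument was passed in; nothing is hard-coded.
        ReturnValue ret = new ReturnValue();
        List<String> params = getParameters(); // assumed accessor for module arguments
        if (!params.contains("--greeting") || !params.contains("--output-file")) {
            ret.setExitStatus(ReturnValue.INVALIDPARAMETERS); // assumed constant
            ret.setStderr("Required: --greeting <text> --output-file <path>");
        }
        return ret;
    }

    public ReturnValue do_verify_input() {
        // Check that input files exist and output locations are writable.
        return new ReturnValue();
    }

    public ReturnValue do_test() {
        // Optionally run the wrapped tool on a known-good input and compare the output.
        return new ReturnValue();
    }

    public ReturnValue do_run() {
        // The actual work: run the wrapped command line tool and register output files.
        return new ReturnValue();
    }

    public ReturnValue do_verify_output() {
        // Confirm the expected output file exists and is non-empty.
        return new ReturnValue();
    }

    public ReturnValue clean_up() {
        // Remove the temp files and directories created in init().
        ReturnValue ret = new ReturnValue();
        if (tempDir != null) {
            tempDir.delete();
        }
        return ret;
    }
}

The Runner calls these methods in the order listed above, as you can see in the stderr log captured in the metadata writeback example later on this page.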

Compiling

If you use the Maven archetype system for generating Modules then you can simply do the following:

mvn install

Standalone Testing

After you’ve written and compiled your new wrapper, test it on your cluster submit node or local machine. The key is to ensure your wrapper works fully on its own before you try integrating it into a workflow; you should be able to fully test your module locally or on a cluster node.

java -jar seqware.jar --module Module -- [ModuleParameters]

Testing with Metadata Writeback

One of the powerful features of SeqWare Pipeline is the ability to record metadata as modules are run. This metadata includes not only which modules were run but also what inputs they used, what versions of modules/tools were involved, and how the runs relate to each other in a hierarchy of processing events. This allows someone to trace back through a complex graph of analysis operations and piece together the individual steps performed, making it much easier to re-run an analysis in the future. In order for this to work there needs to be a metadatabase to record this information. If you haven’t yet set up the SeqWare MetaDB, see [[Setup SeqWare MetaDB]]. Assuming you followed those directions, we can pick up where we left off with the HelloWorld example from earlier.

You need to pass in additional parameters to the Runner so it knows how to reach the MetaDB; if you look at the Runner’s current options you’ll see the ones related to the MetaDB.

Next, make sure there is a processing record in the database. If you followed the directions on [[Setup SeqWare MetaDB]] there should be one processing event already in that table:

seqware_meta_db=> select * from processing;
-[ RECORD 1 ]-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
processing_id       | 1
workflow_run_id     | 
algorithm           | seqware-qe
status              | processed
description         | This is a very small test database which includes SNVs, small indels, coverage, and consequences. It is based on the sample files from the backend seqware-queryengine sample files.
url                 | 
url_label           | 
version             | 0.4.0
parameters          | cache_size=52428800,lock_counts=1000
stdout              | 
stderr              | 
exit_status         | 
process_exit_status | 
task_group          | f
sw_accession        | 1
create_tstmp        | 2009-09-18 12:48:48.336709
update_tstmp        |

So you can see the processing_id is “1” here.

Now fill in the values as appropriate for your metadb setup:

java -jar seqware.jar --parent-accession 1 --metadata-processing-accession-file processing_IDs.txt --module net.sourceforge.seqware.pipeline.modules.examples.HelloWorld -- --greeting Hello --repeat 4 --output-file greeting.txt

If everything worked correctly you should now see something very similar to the following:

seqware_meta_db=> select * from processing;
...
-[ RECORD 2 ]-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
processing_id       | 2
workflow_run_id     | 
algorithm           | hello-world-module
status              | success
description         | This demonstrates how to write a simple module
url                 | 
url_label           | 
version             | 0.7.0
parameters          | 
stdout              | nullOutput: greeting.txt
                    : 
                    : 
stderr              | MetaDB ProcessingID for this run is: 2
                    : net.sourceforge.seqware.pipeline.modules.examples.HelloWorld.init
                    : net.sourceforge.seqware.pipeline.modules.examples.HelloWorld.do_verify_parameters
                    : net.sourceforge.seqware.pipeline.modules.examples.HelloWorld.do_verify_input
                    : net.sourceforge.seqware.pipeline.modules.examples.HelloWorld.do_test
                    : net.sourceforge.seqware.pipeline.modules.examples.HelloWorld.do_run
                    : net.sourceforge.seqware.pipeline.modules.examples.HelloWorld.do_verify_output
                    : net.sourceforge.seqware.pipeline.modules.examples.HelloWorld.clean_up
                    : 
exit_status         | 0
process_exit_status | 0
task_group          | f
sw_accession        | 2
create_tstmp        | 2010-05-10 17:57:28.119139
update_tstmp        | 2010-05-10 17:57:28.3592

In addition you should see “processing_IDs.txt” in the current directory. This contains the processing_id for the new record, so you can use it as an input for the next processing event and, in that way, build up a tree structure of processing events with parent-child relationships. Processing events are connected to each other via the processing_relationship table; for example, here is the record linking the parent and child events in the previous example:

seqware_meta_db=> select * from processing_relationship;
 processing_relationship_id | parent_id | child_id | relationship 
----------------------------+-----------+----------+--------------
                          1 |         1 |        2 | parent-child
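
If you script this chaining yourself, the value written to processing_IDs.txt is what you would hand to the next step’s --parent-accession flag, as in the command earlier on this page. A trivial, purely illustrative sketch for picking it up:

import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: read the ID recorded by the previous step so it can be passed
// as --parent-accession to the next module invocation.
public class ReadParentAccession {
    public static void main(String[] args) throws Exception {
        String parentAccession = Files.readAllLines(Paths.get("processing_IDs.txt")).get(0).trim();
        System.out.println("--parent-accession " + parentAccession);
    }
}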

As for file outputs, you can also see that the HelloWorld module registers an output file, “greeting.txt”, in its do_run() method:

seqware_meta_db=> select * from file;
-[ RECORD 1 ]+-------------------------------
file_id      | 1
file_path    | greeting.txt
url          | null
url_label    | null
type         | hello-world-text-output
meta_type    | text/plain
description  | A text output for hello world.
sw_accession | 5

Processing and file records are joined with the processing_files table:

seqware_meta_db=> select * from processing_files;
 processing_files_id | processing_id | file_id 
---------------------+---------------+---------
                   1 |             2 |       1

One potential issue is the relative nature of the file_path. In this example the path is relative to the current working directory, but real jobs running in a cluster environment work in a temporary directory, and files are then provisioned back to a final destination (using Pegasus). In this case greeting.txt may end up in /data/experiment12/analysis/hello/greeting.txt. For that reason, modules should accept a parameter that dictates the final destination for output files so the file_paths entered in the file table are correct. It is not necessarily the job of the module to move the file to that location (most often this is done by Pegasus), but it is the responsibility of the module to collect enough information to make the file_path correct in the database. This can be difficult for the module to validate, so users are encouraged to check the file_paths (via a cron job or at the end of a workflow) to ensure they are correct.
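
To illustrate, a do_run() that honors such a parameter might look like the following sketch, continuing the skeleton shown earlier. The --output-prefix parameter name, the getParameterValue() accessor, the FileMetadata-style record, and the way it is attached to the ReturnValue are assumptions chosen to line up with the columns of the file table above, not a guaranteed API:

// Sketch only: the parameter name, FileMetadata, and ReturnValue.getFiles()
// are assumptions; the point is that the *final* path gets recorded.
public ReturnValue do_run() {
    ReturnValue ret = new ReturnValue();
    // The workflow tells the module where the file will ultimately live; the
    // module itself may only be able to write to the current (temporary) directory.
    String outputPrefix = getParameterValue("--output-prefix"); // e.g. /data/experiment12/analysis/hello/
    String fileName = "greeting.txt";

    // ... run the wrapped tool here, writing fileName into the working directory ...

    // Record the final path so the file table ends up correct, even though
    // Pegasus (not the module) performs the actual provisioning.
    FileMetadata out = new FileMetadata();
    out.setFilePath(outputPrefix + fileName);
    out.setType("hello-world-text-output");
    out.setMetaType("text/plain");
    out.setDescription("A text output for hello world.");
    ret.getFiles().add(out);
    return ret;
}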

Module Development: Overview Level

Here are step-by-step directions (aimed at non-programmers) for editing, testing, and committing new modules:

  1. Open your module in Eclipse.
  2. Edit your module in Eclipse.
  3. Save your module in Eclipse.
  4. If it is a new module, go to where it is stored (for example, in ~/svnroot/seqware-complete/trunk/seqware-pipeline/java/src/net/sourceforge/seqware/pipeline/modules/). Type ‘svn add’ followed by the file name to add the module. Repeat this step for any other files you want to add, such as a Perl script that your module calls. You only need to add each file once; you can edit the file(s) after this but you do not need to add them again.
  5. Go to ~/svnroot/seqware-complete/trunk/seqware-pipeline/java and type ‘ant clean all’. It has compiled successfully if it reports “BUILD SUCCESSFUL”.
  6. Test your module.

Repeat steps 2-6 (although you can skip step 4 on subsequent cycles) until your module works like you want it to.

  1. Now you are ready to check in your code, but it is very important to be absolutely sure that it compiles. If you’re not certain, repeat step 5 to be sure that it reports “BUILD SUCCESSFUL”.
  2. Go to ~/svnroot/seqware-complete/trunk/seqware-pipeline. Type ‘svn ci -m “text here”’. (Obviously, the ‘text here’ should be replaced by a brief description of changes or updates you are making).
  3. Details of the available module help are documented on the SourceForge wiki: [[SeqWare modules help overview]].

Standards Documentation

A help page, [[SeqWare modules help overview]], documents the parameters needed to run each module.

The [[Module Conventions]] page lists all the meta types for the file outputs from modules. It’s important that SeqWare is consistent and specific in its use of these meta types since they will eventually be the mechanism by which we specify the file input formats supported by a module or workflow.

Running Modules Standalone on the Command Line

This is a new feature as of version 0.10.0 of SeqWare Pipeline. In version 0.9.0 we introduced the concept of Plugins for SeqWare Pipeline, which are basically a simple way to write tools that can all be packaged in the SeqWare Pipeline jar. This is how we’re implementing core features of SeqWare Pipeline going forward; many of the Perl scripts we currently use will be deprecated in favor of porting their functionality to Plugins. Plugins differ from Modules in that they have no concept of state, e.g. they don’t write back to the MetaDB. This provides a convenient way to make all of our tools accessible via a single command line call, for example:

java -jar dist/seqware-pipeline-0.10.0.jar --list

This shows all the Plugins currently in the classpath.

For more information on Plugins see the proposal at [[Unified SeqWare Pipeline Command Proposal]].

In 0.10.0 we’re exposing Modules via this Plugin system. This gives you a single command line interface to access all the Modules using the ModuleRunner Plugin. This will develop over time but currently a user would do the following:

java -jar dist/seqware-pipeline-0.10.0.jar -p net.sourceforge.seqware.pipeline.plugins.ModuleRunner

This lists all the Modules in the classpath.

java -jar dist/seqware-pipeline-0.10.0.jar -p net.sourceforge.seqware.pipeline.plugins.ModuleRunner -- --module net.sourceforge.seqware.pipeline.modules.utilities.ListFiles --no-metadata

This selects the ListFiles module for use and prints its help message since there are no arguments.

java -jar dist/seqware-pipeline-0.10.0.jar -p net.sourceforge.seqware.pipeline.plugins.ModuleRunner -- --module net.sourceforge.seqware.pipeline.modules.utilities.ListFiles --no-metadata -- --s3-url s3://

This actually triggers the module with an appropriate parameter.

Notice the “--” used here to delimit 1) the PluginRunner parameters (“-p net.sourceforge.seqware.pipeline.plugins.ModuleRunner”), 2) the parameters for the ModuleRunner (“--module net.sourceforge.seqware.pipeline.modules.utilities.ListFiles --no-metadata”), and 3) the parameters for the actual Module (“--s3-url s3://”).