Tip: Most workflow developers do not need to write Modules, since most steps simply run a Bash command rather than a custom Java program. These directions are intended primarily for core SeqWare developers who wish to extend SeqWare itself.
Modules are small Java objects that wrap one or more tools, typically command line programs. Modules are parameterized and called from the XML workflow documents that are fed into Pegasus for running on a cluster. We use this approach, rather than calling command line utilities directly in the workflow XML documents, because it gives us a place to add code that tests and documents the metadata generated by each step. Modules can also be called directly from the command line for testing purposes; there is an example of this below.
The key to designing modules is to keep them simple, self-contained, and reusable. For example, it would be an error to:
Good module design attempts to do the following:
Modules can now be developed simply and quickly using the [[Creating_Workflow_Bundles_and_Modules_Using_Maven_Archetypes#SeqWare_Module_Archetype | Maven module archetype]].
You develop a module for a given tool where a tool is generally command line-based. The wrapper handles testing, collecting metadata, checking parameters, checking output, and running the command line tool. The wrapper is called via an XML workflow document described on another page.
We use this wrapper approach because it provides a mechanism to abstract command line tools, and it records metadata that a direct call to a command line tool would not capture.
The wrapper objects live in the Java package net.sourceforge.seqware.pipeline.applications. The best way to get started with a new wrapper for a command line tool you're interested in is to copy an existing wrapper object and customize it for your purposes. Take a look at the interface to see the minimal methods a wrapper must implement: net.sourceforge.seqware.pipeline.wrapper.WrapperInterface. Then look at an existing wrapper to base yours on; for example, net.sourceforge.seqware.pipeline.modules.examples.HelloWorld is a good place to start.
The module API includes the following methods:
These methods give you well-defined places to implement the running of a given command line utility (or anything else really) in a robust way. In the background the runner, which calls your modules, handles logging information (e.g. success or failure messages) back to the meta database. As a developer, all you have to do is implement these methods however you choose in order to make a new module available.
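To make the lifecycle concrete, here is a standalone, simplified sketch of how the runner invokes a module's methods in order. This is not the real SeqWare API (the actual WrapperInterface and return-value classes differ); the method names and call order match the lifecycle the Runner logs for the HelloWorld module, and the ReturnValue class here is a hypothetical stand-in.

```java
import java.util.Arrays;
import java.util.List;

public class ModuleLifecycleSketch {

    // Hypothetical stand-in for SeqWare's real return-value object.
    static class ReturnValue {
        final int exitStatus;
        ReturnValue(int exitStatus) { this.exitStatus = exitStatus; }
    }

    // The lifecycle methods, in the order the runner calls them.
    public ReturnValue init()                 { System.out.println("init");                 return new ReturnValue(0); }
    public ReturnValue do_verify_parameters() { System.out.println("do_verify_parameters"); return new ReturnValue(0); }
    public ReturnValue do_verify_input()      { System.out.println("do_verify_input");      return new ReturnValue(0); }
    public ReturnValue do_test()              { System.out.println("do_test");              return new ReturnValue(0); }
    public ReturnValue do_run()               { System.out.println("do_run");               return new ReturnValue(0); }
    public ReturnValue do_verify_output()     { System.out.println("do_verify_output");     return new ReturnValue(0); }
    public ReturnValue clean_up()             { System.out.println("clean_up");             return new ReturnValue(0); }

    // A toy "runner": calls each lifecycle method in order and reports overall status.
    public static void main(String[] args) {
        ModuleLifecycleSketch module = new ModuleLifecycleSketch();
        List<ReturnValue> results = Arrays.asList(
            module.init(),
            module.do_verify_parameters(),
            module.do_verify_input(),
            module.do_test(),
            module.do_run(),
            module.do_verify_output(),
            module.clean_up());
        boolean allOk = results.stream().allMatch(r -> r.exitStatus == 0);
        System.out.println(allOk ? "success" : "failed");
    }
}
```

In the real system, the Runner also records each method's outcome (and the final success or failure status) back to the MetaDB, which is what the zero/non-zero exit status stands in for here.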
If you use the Maven archetype system for generating Modules then you can simply do the following:
mvn install
After you've written and compiled your new wrapper, test it on your cluster submit node or local machine. The key here is to make sure your wrapper works fully on its own before you try integrating it into a workflow. You should be able to fully test your module on your local or cluster machine:
java -jar seqware.jar --module Module -- [ModuleParameters]
One of the powerful features of SeqWare Pipeline is the ability to record metadata as modules are run. This metadata includes not only which modules were run but also their inputs, the versions of the modules/tools, and how they relate to each other in a hierarchy of processing events. This allows someone to trace back through a complex graph of analysis operations and piece together the individual steps performed, making it much easier to re-run an analysis in the future. For this to work there needs to be a metadatabase to record the information. If you haven't yet set up the SeqWare MetaDB see [[Setup SeqWare MetaDB]]. Assuming you followed those directions, we can pick up where we left off with the HelloWorld example from earlier.
You need to pass additional parameters to the Runner so it knows how to reach the MetaDB; if you look at the current options you'll see the ones related to the MetaDB:
Next, make sure there is a processing record in the database. If you followed the directions on [[Setup SeqWare MetaDB]] there should be one processing event already in that table:
seqware_meta_db=> select * from processing;
-[ RECORD 1 ]-------+--------------------------------------------------------------------------------------------------------------------------------------------------------
processing_id       | 1
workflow_run_id     |
algorithm           | seqware-qe
status              | processed
description         | This is a very small test database which includes SNVs, small indels, coverage, and consequences. It is based on the sample files from the backend seqware-queryengine sample files.
url                 |
url_label           |
version             | 0.4.0
parameters          | cache_size=52428800,lock_counts=1000
stdout              |
stderr              |
exit_status         |
process_exit_status |
task_group          | f
sw_accession        | 1
create_tstmp        | 2009-09-18 12:48:48.336709
update_tstmp        |
So you can see the processing_id is “1” here.
Now fill in the values as appropriate for your metadb setup:
java -jar seqware.jar --parent-accession 1 --metadata-processing-accession-file processing_IDs.txt --module net.sourceforge.seqware.pipeline.modules.examples.HelloWorld -- --greeting Hello --repeat 4 --output-file greeting.txt
If everything worked correctly you should now see something very similar to the following:
seqware_meta_db=> select * from processing;
...
-[ RECORD 2 ]-------+--------------------------------------------------------------------------------------------------------------------------------------------------------
processing_id       | 2
workflow_run_id     |
algorithm           | hello-world-module
status              | success
description         | This demonstrates how to write a simple module
url                 |
url_label           |
version             | 0.7.0
parameters          |
stdout              | null
                    : Output: greeting.txt
                    :
stderr              | MetaDB ProcessingID for this run is: 2
                    : net.sourceforge.seqware.pipeline.modules.examples.HelloWorld.init
                    : net.sourceforge.seqware.pipeline.modules.examples.HelloWorld.do_verify_parameters
                    : net.sourceforge.seqware.pipeline.modules.examples.HelloWorld.do_verify_input
                    : net.sourceforge.seqware.pipeline.modules.examples.HelloWorld.do_test
                    : net.sourceforge.seqware.pipeline.modules.examples.HelloWorld.do_run
                    : net.sourceforge.seqware.pipeline.modules.examples.HelloWorld.do_verify_output
                    : net.sourceforge.seqware.pipeline.modules.examples.HelloWorld.clean_up
                    :
exit_status         | 0
process_exit_status | 0
task_group          | f
sw_accession        | 2
create_tstmp        | 2010-05-10 17:57:28.119139
update_tstmp        | 2010-05-10 17:57:28.3592
In addition you should see “processing_IDs.txt” in the current directory. This contains the processing_id for the new record. So you can use this as an input for the next processing event and, in that way, build up a tree structure of processing events with parent-child relationships. Processing events are connected to each other via the processing_relationship table, for example, here is the record linking the parent and child events in the previous example:
seqware_meta_db=> select * from processing_relationship;
 processing_relationship_id | parent_id | child_id | relationship
----------------------------+-----------+----------+--------------
                          1 |         1 |        2 | parent-child
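The chaining step can be sketched in a few lines: read the accession written by the previous run from processing_IDs.txt and feed it to the next run as its --parent-accession. This is a hedged illustration only; the `nextCommand` helper and the `<NextModule>` placeholder are hypothetical, while the file name and the two flags match the HelloWorld invocation shown earlier.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ChainProcessingSketch {

    // Build the next Runner invocation using the processing accession(s)
    // recorded by the previous module run.
    static String nextCommand(Path accessionFile) throws IOException {
        List<String> ids = Files.readAllLines(accessionFile);
        String parent = ids.get(0).trim();
        return "java -jar seqware.jar --parent-accession " + parent
             + " --metadata-processing-accession-file processing_IDs.txt"
             + " --module <NextModule> -- <module args>";
    }

    public static void main(String[] args) throws IOException {
        // Simulate the file the previous run left behind.
        Path f = Files.createTempFile("processing_IDs", ".txt");
        Files.write(f, "2\n".getBytes());
        System.out.println(nextCommand(f));
    }
}
```

Repeating this pattern, each new processing event names the previous one as its parent, which is exactly how the parent-child rows in processing_relationship accumulate into a tree.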
As for file outputs, you can also see that the HelloWorld module registers an output file “greeting.txt” in the do_run() method:
seqware_meta_db=> select * from file;
-[ RECORD 1 ]+-------------------------------
file_id      | 1
file_path    | greeting.txt
url          | null
url_label    | null
type         | hello-world-text-output
meta_type    | text/plain
description  | A text output for hello world.
sw_accession | 5
Processing and file records are joined with the processing_files table:
seqware_meta_db=> select * from processing_files;
 processing_files_id | processing_id | file_id
---------------------+---------------+---------
                   1 |             2 |       1
One potential issue is the relative nature of the file_path. In this example the path is relative to the current working directory, but real jobs running in a cluster environment work in a temporary directory; files are then provisioned back to a final destination (using Pegasus). In this case greeting.txt may end up in /data/experiment12/analysis/hello/greeting.txt. '''For that reason, modules should accept a parameter that dictates the final destination for output files so the file_paths entered in the file table are correct. It is not necessarily the job of the module to move the file to that location (most often this is done by Pegasus), but it is the responsibility of the module to collect enough information to make the file_path correct in the database. This can be difficult for the module to validate, so users are encouraged to check the file_paths (via a cron job or at the end of a workflow) to ensure they are correct.'''
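The file_path bookkeeping above can be sketched as follows. This is a hypothetical illustration, assuming the module is given the final destination directory as a parameter (the parameter and helper names here are invented, not real SeqWare options): the module writes locally, but records the eventual path in the file table.

```java
import java.nio.file.Paths;

public class OutputPathSketch {

    // Combine the user-supplied final destination with the output file name so
    // the path recorded in the metadb is correct, even though the file itself
    // is provisioned there later (e.g. by Pegasus), not by the module.
    static String finalFilePath(String finalOutputDir, String fileName) {
        return Paths.get(finalOutputDir, fileName).toString();
    }

    public static void main(String[] args) {
        String localPath = "greeting.txt"; // where the module actually writes
        String recorded = finalFilePath("/data/experiment12/analysis/hello", "greeting.txt");
        System.out.println("write locally to:     " + localPath);
        System.out.println("record in file table: " + recorded);
    }
}
```

The point of the split is that the module never needs to know how the file gets moved; it only needs enough information (the destination directory) to record a file_path that will be true once provisioning finishes.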
Here are non-programmer's step-by-step directions for editing, testing, and committing new modules…
Repeat steps 2-6 (although you can skip step 4 on subsequent cycles) until your module works like you want it to.
A help page, [[SeqWare modules help overview]], illustrates the parameters used to run each module.
The [[Module Conventions]] page lists all the meta types for the file outputs from modules. It's important that SeqWare is consistent and specific in its use of these meta types, since they will eventually be the mechanism by which we specify the file input formats supported by a module or workflow.
This is a new feature as of version 0.10.0 of SeqWare Pipeline. In version 0.9.0 we introduced the concept of Plugins for SeqWare Pipeline, which are a simple way to write tools that can all be packaged in the SeqWare Pipeline jar. This is how we're implementing core features of SeqWare Pipeline going forward; many of the Perl scripts we currently use will be deprecated in favor of porting their functionality to Plugins. Plugins differ from Modules in that they have no concept of state, e.g. they don't write back to the MetaDB. This is convenient since it makes all of our tools accessible via a single command line call, for example:
java -jar dist/seqware-pipeline-0.10.0.jar --list
This shows all the Plugins currently in the classpath.
For more information on Plugins see the proposal at [[Unified SeqWare Pipeline Command Proposal]].
In 0.10.0 we’re exposing Modules via this Plugin system. This gives you a single command line interface to access all the Modules using the ModuleRunner Plugin. This will develop over time but currently a user would do the following:
java -jar dist/seqware-pipeline-0.10.0.jar -p net.sourceforge.seqware.pipeline.plugins.ModuleRunner
This lists all the Modules in the classpath.
java -jar dist/seqware-pipeline-0.10.0.jar -p net.sourceforge.seqware.pipeline.plugins.ModuleRunner -- --module net.sourceforge.seqware.pipeline.modules.utilities.ListFiles --no-metadata
This selects the ListFiles module for use and prints its help message since there are no arguments.
java -jar dist/seqware-pipeline-0.10.0.jar -p net.sourceforge.seqware.pipeline.plugins.ModuleRunner -- --module net.sourceforge.seqware.pipeline.modules.utilities.ListFiles --no-metadata -- --s3-url s3://
This actually triggers the module with an appropriate parameter.
Notice the "--" used here to delimit 1) the PluginRunner parameters ("-p net.sourceforge.seqware.pipeline.plugins.ModuleRunner"), 2) the parameters for the ModuleRunner ("--module net.sourceforge.seqware.pipeline.modules.utilities.ListFiles --no-metadata"), and finally 3) the parameters to the actual Module ("--s3-url s3://").
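The layering convention can be sketched as a simple split of the argument vector at each "--" token. This mimics the convention shown above for illustration; the real parsing lives inside SeqWare's plugin code, and this class is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ArgLayersSketch {

    // Partition argv into layers, starting a new layer at each "--" token:
    // layer 0 = PluginRunner args, layer 1 = ModuleRunner args, layer 2 = Module args.
    static List<List<String>> splitOnDoubleDash(String[] args) {
        List<List<String>> layers = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String a : args) {
            if ("--".equals(a)) {
                layers.add(current);
                current = new ArrayList<>();
            } else {
                current.add(a);
            }
        }
        layers.add(current);
        return layers;
    }

    public static void main(String[] args) {
        String[] argv = {
            "-p", "net.sourceforge.seqware.pipeline.plugins.ModuleRunner",
            "--",
            "--module", "net.sourceforge.seqware.pipeline.modules.utilities.ListFiles",
            "--no-metadata",
            "--",
            "--s3-url", "s3://"
        };
        List<List<String>> layers = splitOnDoubleDash(argv);
        System.out.println("plugin runner args: " + layers.get(0));
        System.out.println("module runner args: " + layers.get(1));
        System.out.println("module args:        " + layers.get(2));
    }
}
```

Each layer is consumed by a different component, which is why the same flag name could safely appear at more than one level without ambiguity.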