SeqWare Settings

Overview

The SeqWare jar file uses a simple configuration file that has already been set up for you on the VM. By default it is located at ~/.seqware/settings. You can change this location using an environment variable:

SEQWARE_SETTINGS=~/.seqware/settings
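
As a rough illustration of the lookup described above, a helper script could resolve the settings location as follows. This is a minimal sketch, not SeqWare's own code: the function name resolve_settings_path is hypothetical, and it assumes only what is stated here, namely that SEQWARE_SETTINGS overrides the default ~/.seqware/settings. Treating an empty variable as unset is a further assumption.

```python
import os

DEFAULT_SETTINGS = "~/.seqware/settings"  # default location stated above


def resolve_settings_path(environ=None):
    """Return the settings path, honoring SEQWARE_SETTINGS if it is set.

    Hypothetical helper for illustration; not part of SeqWare.
    """
    environ = os.environ if environ is None else environ
    # An unset or empty variable falls back to the default (assumption).
    raw = environ.get("SEQWARE_SETTINGS") or DEFAULT_SETTINGS
    # Expand a leading "~" to the current user's home directory.
    return os.path.expanduser(raw)
```

Calling resolve_settings_path() with no arguments reads the real process environment; passing a dictionary makes the behavior easy to try out without changing your shell.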

This file contains the web address of the RESTful web service, your username and password, and your Amazon public and private keys, which allow you to push and pull data files to and from the cloud. The example settings file from the VM is shown below and works as-is on the VM. This is also where you would change settings if, for example, you set up the Web Service and MetaDB on another server, or you launched a VM on the cloud and wanted to use the local VM command line jar to control the remote server. Another common task is using the ProvisionFiles module (described later) to push and pull data into and out of the cloud; this is the file where you would supply the access and secret keys that you received when signing up for Amazon (keep those safe!). For this tutorial, the config file available on the VM should be ready to go, so you will not need to modify it.

Note that the sections for the Oozie Workflow Engine, General Hadoop, Query Engine, and Amazon Cloud Settings are all optional; they need to be filled in only for deployments of SeqWare that use these tools. Also note that, for security reasons, the settings file must be readable and writable by its owner only. Our tools will abort and refuse to run if this is not the case.
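
The permission requirement above can be checked before running any tools. The following is a minimal sketch for POSIX systems, not the check that SeqWare itself performs: both function names are hypothetical, and it assumes that "read and write permissions for only the owner" means exactly mode 600.

```python
import os
import stat

OWNER_RW = stat.S_IRUSR | stat.S_IWUSR  # mode 600


def has_owner_only_rw(path):
    """Return True if the permission bits of `path` are exactly 600."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == OWNER_RW


def restrict_to_owner(path):
    """Set the file mode to 600: owner read/write, nothing for group or others."""
    os.chmod(path, OWNER_RW)
```

For example, has_owner_only_rw(os.path.expanduser("~/.seqware/settings")) reports whether the default settings file already satisfies the requirement, and restrict_to_owner on the same path tightens it if not.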

The format for the settings file is based on Java properties files.

# SEQWARE PIPELINE SETTINGS

# The settings in this file are tagged by when they are used.
# COMMON: Used by all components
# INSTALL: Used when installing a workflow bundle
# SCHEDULE: Used when a user wants to schedule a workflow run
# LAUNCH: Used when a workflow run is to be launched (or dry-run)
# DELETION: Used for the admin web service supporting deletion
#
# Remote users need COMMON and SCHEDULE.
# Workflow developers need COMMON and LAUNCH for testing.
# Administrators need COMMON, DELETION, and INSTALL.
# Cronjobs/daemon processes will need COMMON and LAUNCH.

# Keys that are required for a typical Oozie-sge installation with metadata via web service are marked as required.

# Note that this document was auto-generated using the UserSettingsPlugin


# COMMON
# Common Seqware settings

# required: SeqWare MetaDB communication method, can be 'database' or 'webservice' or 'inmemory' or 'none'
SW_METADATA_METHOD=webservice
# optional: Amazon cloud settings. Only used if reading and writing to S3 buckets.
AWS_ACCESS_KEY=FILLMEIN
# optional: Amazon cloud settings. Only used if reading and writing to S3 buckets.
AWS_SECRET_KEY=FILLMEIN

# COMMON_WS
# Seqware webservice settings. Only used if SW_METADATA_METHOD=webservice

# required: Specify the URL for the seqware-webservice
SW_REST_URL=http://localhost:8080/SeqWareWebService
# required: Specify the username for the seqware-webservice
SW_REST_USER=admin
# required: Specify the password for the seqware-webservice
SW_REST_PASS=admin@admin.com

# COMMON_DB
# Seqware database settings. Only used if SW_METADATA_METHOD=database and by the database check utility

# optional: JDBC user for the seqware metadb
SW_DB_USER=seqware
# optional: JDBC password for the seqware metadb
SW_DB_PASS=seqware
# optional: Host for the metadb
SW_DB_SERVER=localhost
# optional: database name
SW_DB=seqware_meta_db

# SCHEDULE_LAUNCH
# Settings used by scheduling and launching bundles

# required: the default engine to use if otherwise unspecified (one of: oozie, oozie-sge, whitestar, whitestar-parallel, whitestar-sge)
SW_DEFAULT_WORKFLOW_ENGINE=oozie-sge

# INSTALL_LAUNCH
# Settings used by both installing and launching bundles

# required: The directory containing bundle directories (into which bundle archives are unzipped)
SW_BUNDLE_DIR=/home/seqware/SeqWare/provisioned-bundles

# INSTALL
# Settings used to configure the installation of workflow bundles

# required: The directory containing bundle archives (into which a bundle archive is first copied during install)
SW_BUNDLE_REPO_DIR=/home/seqware/SeqWare/released-bundles
# optional: Default is to use compression, this can be set to OFF to disable compression
BUNDLE_COMPRESSION=ON

# LAUNCH
# Oozie engine settings. Only used for both 'oozie' and 'oozie-sge' engines.

# required: URL for the Oozie webservice
OOZIE_URL=http://localhost:11000/oozie
# required: HDFS directory for storing workflow xml
OOZIE_APP_ROOT=seqware_workflow
# required: Hadoop job tracker, used to schedule jobs for oozie-hadoop engine
OOZIE_JOBTRACKER=localhost:8021
# required: Hadoop name node, possibly redundant (should be refactored)
OOZIE_NAMENODE=hdfs://localhost:8020
# required: Hadoop queue onto which to schedule jobs
OOZIE_QUEUENAME=default
# required: Working directory where your workflow steps execute and where we store generated scripts and logs
OOZIE_WORK_DIR=/usr/tmp/seqware-oozie
# optional: Number of times that Oozie and Whitestar will retry user steps in workflows
OOZIE_RETRY_MAX=5
# optional: Minutes to wait before retry for user steps in workflows
OOZIE_RETRY_INTERVAL=5
# optional: Above this threshold, provision file events on the same job/workflow will be batched together
OOZIE_BATCH_THRESHOLD=10
# optional: Number of provision file events that should be batched together
OOZIE_BATCH_SIZE=100

# WHITESTAR
# WhiteStar engine settings. Only used for the 'whitestar' series of engines.

# optional: Restrict the number of parallel jobs invoked in WhiteStar to this amount of memory
WHITESTAR_MEMORY_LIMIT=2147483647

# LAUNCH
# Oozie engine settings. Only used for both 'oozie' and 'oozie-sge' engines.

# required: HDFS implementation class
FS.HDFS.IMPL=org.apache.hadoop.hdfs.DistributedFileSystem
# optional: Only used for 'oozie-sge' engine. Format of qsub flag for specifying number of threads. If present, ${threads} will be replaced with the job-specific value.
OOZIE_SGE_THREADS_PARAM_FORMAT=-pe serial ${threads}
# required: Format of qsub flag for specifying the max memory. If present, ${maxMemory} will be replaced with the job-specific value.
OOZIE_SGE_MAX_MEMORY_PARAM_FORMAT=-l h_vmem=${maxMemory}M

# ADMIN
# Settings used for administrators

# optional: In atypical environments, the default h_vmem constraint for SGE is too stringent. Override them using this (units in megabytes)
SW_CONTROL_NODE_MEMORY=3000
# optional: Location of the admin web service, currently used for deletion
SW_ADMIN_REST_URL=http://localhost:38080/seqware-admin-webservice
# optional: Used to override the JUnique lock used to ensure that utilities don't run concurrently
SW_LOCK_ID=seqware
# optional: Legacy key used to encrypt provisioned files
SW_ENCRYPT_KEY=seqware
# optional: Legacy key used to decrypt provisioned files
SW_DECRYPT_KEY=seqware

# LAUNCH
# Oozie engine settings. Only used for both 'oozie' and 'oozie-sge' engines.

# optional: Used to determine whether provisioned (out) files should be run through MD5 before and after provisioning
SW_PROVISION_FILES_MD5=true

# TESTING
# Used for regression testing

# optional: Used to designate a database for integration tests
BASIC_TEST_DB_HOST=localhost
# optional: Used to designate a database name for integration tests
BASIC_TEST_DB_NAME=seqware_meta_db
# optional: Used to designate a database username for integration tests
BASIC_TEST_DB_USER=seqware
# optional: Used to designate a database password for integration tests
BASIC_TEST_DB_PASSWORD=seqware
# optional: Used to designate a database for extended integration tests
EXTENDED_TEST_DB_HOST=localhost
# optional: Used to designate a database name for extended integration tests
EXTENDED_TEST_DB_NAME=seqware_meta_db
# optional: Used to designate a database username for extended integration tests
EXTENDED_TEST_DB_USER=seqware
# optional: Used to designate a database password for extended integration tests
EXTENDED_TEST_DB_PASSWORD=seqware
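
To make the structure of the file above concrete, here is a minimal sketch that reads KEY=VALUE settings, reports missing web service keys, and fills in the ${threads} and ${maxMemory} placeholders. It is an illustration only and not how SeqWare loads its settings: the function names are hypothetical, the reader handles only the simple subset of the Java properties format used in the example file, and the list of required keys is taken from the COMMON_WS comments above.

```python
def parse_settings(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments.

    Simplified reader for illustration: it ignores the escapes and
    line continuations that full Java properties files allow.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values such as
        # "-l h_vmem=${maxMemory}M" keep their own '=' characters.
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


# Keys marked "required" in the COMMON_WS section above.
WEBSERVICE_KEYS = ("SW_REST_URL", "SW_REST_USER", "SW_REST_PASS")


def missing_webservice_keys(settings):
    """List the COMMON_WS keys that are absent when the web service is used."""
    if settings.get("SW_METADATA_METHOD") != "webservice":
        return []
    return [key for key in WEBSERVICE_KEYS if not settings.get(key)]


def expand_format(template, **values):
    """Replace ${name} placeholders (e.g. ${threads}) with job-specific values."""
    for name, value in values.items():
        template = template.replace("${" + name + "}", str(value))
    return template
```

With the example file, expand_format applied to the OOZIE_SGE_THREADS_PARAM_FORMAT value with threads=4 would give the qsub flag "-pe serial 4".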

Oozie Workflow Engine Configuration

In addition to the user’s ~/.seqware/settings file, the only other configuration is that required for automatic retry. As with the Pegasus workflow engine, it is possible to control the number of attempts that should be made before a job in a workflow is considered failed.

Edit the Oozie site XML (typically oozie-site.xml) and add the following properties; if they are already present, add your error codes to those that are listed.

    <property>
        <name>oozie.service.LiteWorkflowStoreService.user.retry.error.code.ext</name>
        <value>SGE137</value>
    </property>
    <property>
        <name>oozie.service.LiteWorkflowStoreService.user.retry.max</name>
        <value>30</value>
    </property>

After restarting, Oozie will use the listed error codes in combination with the OOZIE_RETRY_MAX parameter to determine how many times steps will be retried for a specific error. For example, with the configuration above, jobs that return an SGE error code of SGE137 will automatically be retried 30 or OOZIE_RETRY_MAX times, whichever is higher. The actual error codes will likely depend on your site.
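
The retry rule described above can be sketched in a few lines. This is an illustration under stated assumptions rather than Oozie's or SeqWare's implementation: the function names are hypothetical, the two properties are wrapped in a configuration root element as the Oozie site XML is assumed to be laid out, the error code value is assumed to be a comma-separated list, and the retry limit simply follows the "whichever is higher" rule as written in this document.

```python
import xml.etree.ElementTree as ET

PREFIX = "oozie.service.LiteWorkflowStoreService.user.retry."

# The two properties shown above, wrapped in a <configuration> root
# (assumed here to match the usual layout of the Oozie site XML).
SITE_XML = """
<configuration>
    <property>
        <name>oozie.service.LiteWorkflowStoreService.user.retry.error.code.ext</name>
        <value>SGE137</value>
    </property>
    <property>
        <name>oozie.service.LiteWorkflowStoreService.user.retry.max</name>
        <value>30</value>
    </property>
</configuration>
"""


def read_properties(xml_text):
    """Return a dict mapping each property <name> to its <value>."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name", "").strip(): prop.findtext("value", "").strip()
        for prop in root.iter("property")
    }


def retry_codes(props):
    """Split the extra retry error codes, assuming a comma-separated value."""
    raw = props.get(PREFIX + "error.code.ext", "")
    return [code.strip() for code in raw.split(",") if code.strip()]


def retry_limit(props, oozie_retry_max):
    """Apply the rule as stated in this document: the higher limit is used."""
    return max(int(props.get(PREFIX + "max", 0)), oozie_retry_max)
```

A script like this can be handy for confirming that a site-specific code such as SGE137 really is listed before relying on automatic retries.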

For versions of the oozie-sge plugin from 1.0.3 onwards, two kinds of error codes are possible. Error codes of the form SGE[0-9]+ refer to the exit status of the actual Bash scripts that form steps in your workflows. Error codes of the form SGEF[0-9]+ refer to the failure code of the SGE infrastructure itself.
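
The two patterns above can be told apart with a pair of regular expressions. The sketch below is only an illustration; the function name classify_error_code and the labels "step" and "infrastructure" are hypothetical and do not come from the oozie-sge plugin.

```python
import re

# SGE<n>  : exit status of the Bash script behind a workflow step
# SGEF<n> : failure code reported by the SGE infrastructure itself
STEP_CODE = re.compile(r"SGE([0-9]+)")
INFRA_CODE = re.compile(r"SGEF([0-9]+)")


def classify_error_code(code):
    """Return (kind, number) for a recognized code, otherwise None."""
    match = INFRA_CODE.fullmatch(code)
    if match:
        return ("infrastructure", int(match.group(1)))
    match = STEP_CODE.fullmatch(code)
    if match:
        return ("step", int(match.group(1)))
    return None
```

For instance, SGE137 is classified as a step failure with exit status 137, while SGEF26 is classified as an infrastructure failure with code 26.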

For example, the following output from “qacct -j” refers to a workflow step which failed with an error code of 1 (which would correspond to SGE1 for the Oozie XML parameter above).

$ qacct -j 3702
==============================================================
qname        main.q              
hostname     master           
group        seqware               
owner        seqware               
project      NONE                
department   defaultdepartment   
jobname      annotate_5          
jobnumber    3702                
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Fri Aug 29 16:40:08 2014
start_time   Fri Aug 29 16:40:20 2014
end_time     Fri Aug 29 16:40:21 2014
granted_pe   NONE                
slots        1                   
failed       0    
exit_status  1                   
ru_wallclock 1            
ru_utime     1.468        
ru_stime     0.072        
ru_maxrss    112212              
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    42375               
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   168                 
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     726                 
ru_nivcsw    269                 
cpu          1.540        
mem          0.306             
io           0.006             
iow          0.000             
maxvmem      557.734M
arid         undefined

The following output from “qacct -j” refers to a workflow step where the qsub itself failed because a logging directory was unavailable (leading to an Eqw state). This corresponds to an Oozie error code of SGEF26.

$ qacct -j 3801
==============================================================
qname        main.q              
hostname     master           
group        seqware               
owner        seqware               
project      NONE                
department   defaultdepartment   
jobname      start_0             
jobnumber    3801                
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Fri Sep 12 15:03:02 2014
start_time   -/-
end_time     -/-
granted_pe   NONE                
slots        1                   
failed       26  : opening input/output file
exit_status  0                   
ru_wallclock 0            
ru_utime     0.000        
ru_stime     0.000        
ru_maxrss    0                   
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    0                   
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   0                   
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     0                   
ru_nivcsw    0                   
cpu          0.000        
mem          0.000             
io           0.000             
iow          0.000             
maxvmem      0.000
arid         undefined
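
The mapping from the two qacct listings above to their error codes can be sketched as follows. This is a minimal illustration, not the logic of the oozie-sge plugin: the function names are hypothetical, and giving a non-zero "failed" value precedence over "exit_status" is an assumption based on the two examples shown.

```python
def parse_qacct(text):
    """Parse `qacct -j` output into a dict of field name -> value string."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # field name, then the rest of the line
        if len(parts) == 2 and not parts[0].startswith("="):
            fields[parts[0]] = parts[1].strip()
    return fields


def oozie_error_code(fields):
    """Map qacct fields to an error code, or None if nothing failed.

    Assumption: a non-zero `failed` value takes precedence over `exit_status`.
    """
    # `failed` may carry a description, e.g. "26  : opening input/output file".
    failed = int(fields.get("failed", "0").split()[0])
    if failed != 0:
        return "SGEF%d" % failed
    exit_status = int(fields.get("exit_status", "0").split()[0])
    if exit_status != 0:
        return "SGE%d" % exit_status
    return None
```

Feeding the output for job 3702 through these functions yields SGE1, and the output for job 3801 yields SGEF26, matching the codes described above.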