This document describes how to connect a VM to an actual SGE cluster for use with the Pegasus Workflow Engine, or to a Hadoop cluster for use with the Oozie Workflow Engine. Since the Pegasus engine is more thoroughly tested and has been in use the longest, we will focus more on that one.
You might want to follow this process because setting up SeqWare from scratch (see Installing from Scratch) is time consuming and difficult. We therefore maintain VMs (both cloud VMs for Amazon and a local VM using VirtualBox) to get people started with SeqWare quickly. When it comes time to do “real work” with SeqWare, you need a cluster. Rather than installing SeqWare from scratch, you can simply connect the VM to an actual cluster. At OICR, for example, we followed this process for installing SeqWare at the institute (we use the Pegasus Workflow Engine).
These directions cover connecting the Oozie Workflow Engine to an actual Hadoop cluster. This engine is an alternative to the SGE engine above.
We use the Cloudera packages on the SeqWare VM to install and configure the Hadoop system. Please see the excellent documentation on Cloudera’s website, which walks you through the process of building a Hadoop cluster. You will want to match the Cloudera version on the SeqWare VM with that of your Hadoop cluster, and you may want to turn off the namenode, datanode, tasktracker, jobtracker, etc. on the SeqWare VM, since your real Hadoop cluster will provide these functions. Essentially you will use the SeqWare VM only as the Oozie host (so you will want to leave Oozie installed on the VM). See the Cloudera documentation for information on configuring Oozie on the VM to talk to your Hadoop cluster.
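As a rough aid for the version-matching step, the sketch below compares the CDH release reported on the VM with the one reported on the cluster. It assumes you have captured the text printed by `hadoop version` on each side and that the first line embeds a `cdh` tag (for example `Hadoop 2.0.0-cdh4.1.2`). The function names are hypothetical and not part of SeqWare or Cloudera.

```python
import re


def cdh_version(hadoop_version_output):
    """Return the CDH release (e.g. '4.1.2') embedded in `hadoop version` output, or None."""
    # Assumption: the release appears as 'cdh<major>.<minor>...' somewhere in the text.
    match = re.search(r"cdh(\d+(?:\.\d+)*)", hadoop_version_output, re.IGNORECASE)
    return match.group(1) if match else None


def versions_match(vm_output, cluster_output):
    """True only if both outputs carry a CDH tag and the tags are identical."""
    vm = cdh_version(vm_output)
    cluster = cdh_version(cluster_output)
    return vm is not None and vm == cluster
```

If the two outputs disagree, or either lacks a `cdh` tag, `versions_match` returns `False`, which is a cue to check the installed packages by hand.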
From the SeqWare perspective, you will need to tell SeqWare which HDFS/MapReduce cluster to talk to; see the Oozie and Hadoop sections of the SeqWare Configuration Guide.
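One quick sanity check after editing the configuration is to look for values that still point at the VM itself. The sketch below assumes the settings are stored as simple `KEY=VALUE` lines. The key names in the usage example are made up for illustration; the real names are listed in the SeqWare Configuration Guide.

```python
def parse_settings(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def local_endpoints(settings):
    """Return the sorted keys whose values still refer to the local machine."""
    markers = ("localhost", "127.0.0.1")
    return sorted(
        key for key, value in settings.items()
        if any(marker in value for marker in markers)
    )


# Hypothetical settings text; these key names are illustrative only.
EXAMPLE_SETTINGS = """
# cluster endpoints
EXAMPLE_NAMENODE=hdfs://localhost:8020
EXAMPLE_JOBTRACKER=hadoop-master.example.org:8021
"""
```

Calling `local_endpoints(parse_settings(EXAMPLE_SETTINGS))` flags `EXAMPLE_NAMENODE`, the one entry that still refers to the VM rather than the real cluster.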
How you connect the VM to SGE is really up to your local sysadmin. You will need to use a common version of SGE between the SeqWare VM and your real cluster. Typically this is done through a common NFS mount that includes the SGE software, config files, and logs. Consult the GridEngine wiki for more information about obtaining and configuring SGE.
Finally, you can submit and run a workflow just as you normally do, following the User Tutorial. Using a tool like qstat, you should see jobs running on the cluster rather than locally.
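To confirm where jobs land, a small script can summarize what `qstat` reports. The sketch below assumes the default plain-text `qstat` layout (a header line, a line of dashes, then one row per job with the state in the fifth column and, for running jobs, a `queue@host` entry). The sample text and job names are invented for illustration, and the helper names are hypothetical.

```python
import subprocess
from collections import Counter


def parse_qstat(text):
    """Parse default-format qstat output into a list of job dicts."""
    jobs = []
    in_body = False
    for line in text.splitlines():
        if line.startswith("-"):
            # The dashed separator marks the end of the header.
            in_body = True
            continue
        fields = line.split()
        if not in_body or len(fields) < 5:
            continue
        # Pending jobs have no queue column, so only accept a 'queue@host' value.
        queue = fields[7] if len(fields) > 7 and "@" in fields[7] else None
        jobs.append({
            "id": fields[0],
            "name": fields[2],
            "user": fields[3],
            "state": fields[4],
            "queue": queue,
        })
    return jobs


def count_states(jobs):
    """Count jobs per state code (e.g. 'r' running, 'qw' queued and waiting)."""
    return Counter(job["state"] for job in jobs)


def run_qstat():
    """Run qstat on a host where SGE is installed and return its output."""
    result = subprocess.run(["qstat"], capture_output=True, text=True, check=True)
    return result.stdout


# Invented sample output, used only to illustrate the parsing.
SAMPLE = """\
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
    101 0.55500 wf_step_1  seqware      r     01/15/2013 10:12:01 all.q@node01.example.org           1
    102 0.00000 wf_step_2  seqware      qw    01/15/2013 10:12:05                                    1
"""
```

On a submit host, `parse_qstat(run_qstat())` gives the live picture. A running job whose `queue` names a cluster node rather than the VM's own hostname suggests the workflow is executing on the real cluster.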