When I started working at the Hutch with Erick Matsen, I was introduced to a tool the team used for orchestrating analyses: SCons, a Python-based build tool. Like make, SCons was initially developed for compiling software. However, it's also quite useful for data science and computational biology.
Why you want this
Data science and compbio projects invariably require pushing data through multiple processing and computational steps. While for simple projects one can do this manually, executing one command at a time becomes untenable as things get more complicated:
- Reproducibility becomes tricky without a record of how you performed the computations, what parameters you used, etc.
- Running analyses for multiple parameter settings means entering all the commands multiple times, which is error-prone and not DRY.
- The process requires your continued attention throughout the computation; you can't just hit go and walk away.
- If you update one of the intermediate results, it can be difficult to keep track of what needs to be rerun downstream and to make sure things are run the same way.
Shell scripts are a simple solution to some of these problems. They are ubiquitous, familiar to many, and easy to construct given a series of shell commands. However, proper build tools have some advantages:
- Dependency tracking: if you change an intermediate result, only downstream results are rebuilt.
- Parallel execution of independent computations.
- [EDIT: added 2015-05-13] In some cases, protection against clobbering existing data.
These advantages are exactly why build tools were developed: to save time. This is especially crucial in computational biology, where jobs are often long running (sometimes taking days or weeks), and things are frequently built and tweaked iteratively.
SCons can be installed with pip install scons if you're using pip (older pip releases needed the --egg flag, which recent releases no longer accept). If you're on Ubuntu, you can also apt-get install scons. Homebrew may be an option for OS X users as well.
Taking an example from computational biology, let's say you have a set of HIV genetic sequences from several patients and want to determine the most likely evolutionary relationship between the viruses in these patients. A simple analysis might consist of:
- aligning the sequences
- building a phylogenetic tree to determine likely ancestry of the sequences
- printing the tree, with tips colored by patient
Each of these steps can be carried out by a different command line program (in this case muscle, fasttree and a custom script color_tree.py, respectively). Supposing we have sequence and patient data in sequences.fasta and metadata.csv files respectively, let's see how we'd put everything together with SCons.
SCons is designed to read SConstruct files which specify the flow of the computation. In this case, our SConstruct file would look like this:
    # SCons actually does this import for you, but I like the added clarity.
    from SCons.Script import Command

    # Assign our input filenames to some variables for convenience
    sequences = 'input/sequences.fasta'
    metadata = 'input/metadata.csv'

    # Our first step is to create the alignment file.
    align = Command('output/alignment.fasta',           # Path of the target (output) file
                    sequences,                          # Source (input)
                    'muscle -in $SOURCE -out $TARGET')  # Action executed; note the use of
                                                        # $SOURCE and $TARGET

    # We can now specify the `align` object as a source for other targets
    tree = Command('output/tree.tre',
                   align,
                   'fasttree -nt $SOURCE > $TARGET')

    # Now let's build our final target, a colored tree
    colored_tree = Command('output/colored_tree.svg',
                           [tree, metadata],
                           './bin/colored_tree.py -I -c patient $SOURCES $TARGET')
The only thing in all of this that isn't vanilla Python is the Command function. This function takes 3 arguments:
- target filename(s)
- source files, either filename strings or the results of other Command calls
- and an action which can either be a function or a command line string
Calling this function registers the given task and returns an object representing the file(s) built by that task. Based on the succession of tasks registered, SCons evaluates a dependency graph. As it executes the tasks in the graph, it stores a fingerprint for each file so that it knows what needs to be rebuilt on subsequent runs if something changes. Additionally, changing the action for a given target will trigger a rebuild of that target.
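To make the fingerprinting idea concrete, here is a minimal sketch in plain Python. It is not how SCons is implemented internally; the function names (fingerprint, needs_rebuild, record_build) and the JSON file used as a signature database are hypothetical and exist only for illustration. The idea is simply that a target is stale if it is missing, if the contents of any source changed, or if the action changed.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(path):
    """Return a hex digest of a file's contents (hypothetical helper)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def _signature(sources, action):
    """Everything that determines how a target is built."""
    return {
        'action': action,
        'sources': {str(s): fingerprint(s) for s in sources},
    }


def _load_db(db_path):
    db_file = Path(db_path)
    return json.loads(db_file.read_text()) if db_file.exists() else {}


def needs_rebuild(target, sources, action, db_path):
    """True if the target is missing or its recorded signature is out of date."""
    if not Path(target).exists():
        return True
    return _load_db(db_path).get(str(target)) != _signature(sources, action)


def record_build(target, sources, action, db_path):
    """Store the signature of a successful build for later comparison."""
    db = _load_db(db_path)
    db[str(target)] = _signature(sources, action)
    Path(db_path).write_text(json.dumps(db))
```

After a build, calling record_build and then needs_rebuild with unchanged inputs returns False; editing a source file or changing the action string flips it back to True.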
To run the pipeline, execute scons and you should see a number of files created in a new output directory. Try opening output/colored_tree.svg and enjoy the fruits of your labor.
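As mentioned in the list of arguments, the action doesn't have to be a command line string; it can also be a Python function. SCons calls such a function with three arguments (target, source, env), where target and source are lists of file objects whose str() is the file path. The sketch below is a hypothetical extra step, not part of the analysis above: count_sequences and the output path in the trailing comment are made up for illustration.

```python
def count_sequences(target, source, env):
    """Action function: write the number of FASTA records in source[0] to target[0]."""
    with open(str(source[0])) as infile:
        n_records = sum(1 for line in infile if line.startswith('>'))
    with open(str(target[0]), 'w') as outfile:
        outfile.write(f'{n_records}\n')
    return 0  # zero (or None) tells SCons the action succeeded


# In an SConstruct file, this could be registered like any other action
# (hypothetical target path):
# seq_count = Command('output/sequence_count.txt', sequences, count_sequences)
```

Because the function only relies on str() of its arguments, it can also be called directly with plain path strings, which makes it easy to try out outside of SCons.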
OK, that's pretty cool, but what about something more complicated?
If this was where SCons stopped, you might be wondering whether it's worth all the trouble. So let's take a look at a more intricate example.
Suppose we want to run the above analysis, but compare the trees obtained from three different alignment methods. All we'd have to do is add a little for loop that:
- performs roughly the same analysis for each of the different methods
- gathers the final results in a collection
- runs a final Command on said collection to produce the singular final output
To start, let's add from os import path to the beginning of our SConstruct file. Then, modify everything after the input files as follows:
    # Create a dictionary we can use to get the correct action string for each alignment method
    action_strings = dict(muscle='muscle -in $SOURCE -out $TARGET',
                          mafft='mafft $SOURCE > $TARGET',
                          clustal='clustalw -INFILE=$SOURCE -OUTFILE=$TARGET -OUTPUT=FASTA -TYPE=DNA')

    # Next we have to initialize the collection of colored trees we're going to join together
    colored_trees = []

    # Now we branch on the various alignment methods
    for program in ['muscle', 'mafft', 'clustal']:
        outdir = path.join('output', program)   # use a method-specific outdir
        align_action = action_strings[program]  # get correct action string

        # The rest of the for loop is almost the same as in the last example...
        align = Command(path.join(outdir, 'alignment.fasta'),
                        sequences,
                        align_action)
        tree = Command(path.join(outdir, 'tree.tre'),
                       align,
                       'fasttree -nt $SOURCE > $TARGET')
        colored_tree = Command(path.join(outdir, 'colored_tree.svg'),
                               [tree, metadata],
                               './bin/colored_tree.py -I -c patient $SOURCES $TARGET')
        colored_trees.append(colored_tree)      # add the colored tree to our collection

    # Final output: join together all the trees into a single SVG
    combined_trees = Command('output/combined_trees.svg',
                             colored_trees,
                             'svg_stack.py $SOURCES > $TARGET')
Run, and voilà! SCons builds trees for each of the alignment methods, and joins the resulting tree figures into a single figure.
Now just for fun, rm -r output and rerun with scons -j 3 (the -j flag sets the number of jobs to run at once) and observe the parallel awesomeness :-)
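To see why the three branches can run side by side, it helps to look at the dependency graph. The sketch below is not SCons code: it rebuilds the same graph by hand and uses the standard library's graphlib module (Python 3.9+) to group targets into batches whose members don't depend on each other. The parallel_batches helper and the shortened file names are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

programs = ['muscle', 'mafft', 'clustal']

# Map each target to the set of files it depends on
deps = {}
for program in programs:
    deps[f'{program}/alignment.fasta'] = {'sequences.fasta'}
    deps[f'{program}/tree.tre'] = {f'{program}/alignment.fasta'}
    deps[f'{program}/colored_tree.svg'] = {f'{program}/tree.tre', 'metadata.csv'}
deps['combined_trees.svg'] = {f'{p}/colored_tree.svg' for p in programs}


def parallel_batches(graph):
    """Group nodes into successive batches; nodes within a batch are independent."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


for batch in parallel_batches(deps):
    print(batch)
```

The first batch contains only the input files; the three alignments then land in a batch together, as do the three trees and the three colored trees, with the combined figure alone at the end. With three jobs available, each of those middle batches can be processed all at once.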
It's worth highlighting that all we needed to solve this second problem was a bit of extra Python logic. This is in contrast with make, which requires increasingly convoluted syntax as the logic becomes more complex. This feature was a major motivation behind the development of SCons: programming logic is the more scalable solution to complex build logic.
There are a number of other tools that take this approach as well, which I hope to take a look at in later posts. However, without getting into too many details, one of the things that stands out about SCons for this use case is that Command returns an object representing the file(s)/results produced by that step of the computation. Being able to pass this around in your programs becomes really valuable when doing data analysis.
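One way to take advantage of this is to wrap a whole branch of the analysis in an ordinary Python function that returns the final result object. The sketch below is one possible arrangement, not something prescribed by SCons: tree_pipeline is a hypothetical helper, and it takes the Command function as a parameter so that it can be exercised outside of SCons with a stand-in (recording_command, also hypothetical) that just logs what would be built.

```python
from os import path


def tree_pipeline(command, outdir, sequences, metadata, align_action):
    """Register align -> tree -> colored tree under outdir; return the final result."""
    align = command(path.join(outdir, 'alignment.fasta'), sequences, align_action)
    tree = command(path.join(outdir, 'tree.tre'), align,
                   'fasttree -nt $SOURCE > $TARGET')
    return command(path.join(outdir, 'colored_tree.svg'), [tree, metadata],
                   './bin/colored_tree.py -I -c patient $SOURCES $TARGET')


def recording_command(log):
    """Stand-in for Command: logs each call and returns the target path."""
    def command(target, source, action):
        log.append((target, source, action))
        return target
    return command


log = []
result = tree_pipeline(recording_command(log), path.join('output', 'muscle'),
                       'input/sequences.fasta', 'input/metadata.csv',
                       'muscle -in $SOURCE -out $TARGET')
print(result)
print(len(log), 'steps registered')
```

In an SConstruct file you would pass Command itself as the first argument, and the for loop from the second example would shrink to a single call per alignment method.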
I hope this tutorial has given you a general sense of the capabilities of SCons, and convinced you it's a useful tool for orchestrating data analysis pipelines. I'll soon be following this post up with more on how to get the most out of SCons. In the meantime, here are a few resources in case you're keen on digging in now:
- Running in parallel over a Slurm cluster: bioscons
- More declarative running of nested parameter/data sets: Nestly
PS: Thanks to Erick Matsen for his helpful feedback on this post.