Some tips on writing software for research

Posted on Wed, 17 July 2019 in misc

These are my notes for a presentation in our group at Saarland University. The presentation was mainly about software written as part of experiments in NLP, but most of the tips do not focus on NLP but rather on writing code for reproducible experiments that involve processing data sets. This kind of software is often only used by a small group of people (down to groups of one). I neither claim that this is the ultimate guide nor that I actually follow all my own advice (but I try!).

Why you should care

Our success is not measured by how nice our software looks; the important aspects are scientific insights and publications. So why should you spend additional time on your code, which is just a means to an end? There are several good reasons:

Others might want to build upon your results. This often means running your software. If that is not easy, they might not do it. I once spent a week trying to get a previously published system to run and failed. That wasted my time and reduced the impact of that publication: research that cannot be run is unlikely to be extended and cited.

You might want or need to extend your previous research. Maybe you did some experiments at the start of your PhD but need to change some parameters to make it compatible with other experiments. Do not assume that future-you will still know all intricacies you currently know!

Good software increases your confidence in results. You will publish a paper with your name on it. The more hacky the process of obtaining results, the less sure you can be that the results are actually correct. Manual steps are a source of error, minimize them!

Less work in the long run. Made an error somewhere? Just start over! If your experiments run with minimal manual work, restarting is easy, and CPU time is cheaper than your time. Once you run your experiment pipeline several times, the additional work invested in automation pays off.

You have an ethical obligation to document your experiments. A paper is usually not enough to understand all details! The code used to run the experiments can be seen as the definition of your experiments. Automation is documentation!

Good software enables teamwork. Automated experiments mean that all your collaborators will also be able to run the experiments. You don't want to be the one person knowing how to run the experiments (you will have to do all the work) and you don't want to be the one not knowing it (unable to do the work). Worst case: you need knowledge from several people to run the experiment, and no one can reproduce the results on their own.

The motto of this blog post
Do it twice or more
Write a script that works for you
Enjoy the summer

Some basics

Hopefully, I have convinced you by now. So let's start with some basics. No need to be perfect, small steps help.

Keep source data, in-progress data, and results separate. Having different directories enables you to just rm -rf in_progress results and start afresh. If you save intermediate results for long-running computations, make sure you can always distinguish finished from in-progress results. E.g., you might have a power outage and otherwise not know which computations are complete and which have to be started again.
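
A minimal bash sketch of this layout (compute-stuff is a placeholder for your actual experiment command): writing to in_progress and renaming only on success makes finished and unfinished results easy to tell apart.

#!/bin/bash
set -eu

# source/ holds read-only input; the other two can be deleted freely
mkdir -p in_progress results

compute-stuff source/data.txt > in_progress/run1.txt
# rename only after the computation finished successfully:
# anything still in in_progress/ after a crash is known to be incomplete
mv in_progress/run1.txt results/run1.txt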

Add a shebang to your scripts and make them executable. Bash: #! /bin/bash, Python 3: #! /usr/bin/env python3. This shows everyone which files are intended to be run and which are not. Also, the shebang makes sure that the right interpreter is used. You do not want a bash script to be run by /bin/sh and produce weird results.

Add a comment header with author, license, documentation. Others know who to contact, know how they are allowed to use and distribute your code and what it does. An example from my code:

#!/usr/bin/env python3
# author: Arne Köhn <arne@chark.eu>
# License: Apache 2.0

# reads an swc file given from the command line and prints the tokens
# one on each line, and with empty lines for sentence boundaries.

This gives a potential user at least a first idea of what to expect. Add a README to your project to give a high-level overview.

Use logging frameworks. Do not simply print to stdout or stderr; use a logging framework to distinguish between debug and serious log messages. Do not debug with print statements, use a logger for that! That way, you can keep the logging code and simply disable the specific sub-logger. Java: use log4j2; Python: the built-in logging module. Log to meaningfully named log files (e.g. Y-M-D-experimentname.log) so you can locate the logs if you notice problems with an experiment later on.
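
A minimal sketch in Python with the built-in logging module (the logger name and file name pattern are made up):

import datetime
import logging

# one log file per day and experiment, e.g. 2019-07-17-baseline.log
logfile = datetime.date.today().isoformat() + "-baseline.log"
logging.basicConfig(filename=logfile, level=logging.DEBUG)

log = logging.getLogger("preprocessing")
log.debug("tokenizer loaded")    # debug output instead of print()
log.warning("empty input file")  # serious messages stand out

# later: keep the code, but silence this sub-logger's debug messages
logging.getLogger("preprocessing").setLevel(logging.WARNING)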

Version control

Use a version control system! Commit early, commit often. Git commit messages: first line at most 50 characters, second line empty, detailed log message afterwards. Example:

Add script that handles installer building

That script will automatically fetch the correct versions of files
from the build.gradle file and run install4j if it is in the path.

Also don't use master-SNAPSHOT but a fixed version of each
dependency.  Otherwise we end up with non-reproducible installer
builds.

This way, tools will be able to display your commit message correctly.

Rewrite history. Committing often means you might commit incomplete code or find errors later and fix them. Merge those commits into one and reword the commit history to make it more understandable for others. This also lets you commit with nonsense commit messages locally without embarrassment; simply perform a git rebase -i to merge the commits later on. Note: don't rewrite history you have already pushed and others might rely on.
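
For example, to clean up the last few commits before pushing (the number is arbitrary):

# opens an editor listing the last five commits;
# change 'pick' to 'squash' or 'reword' as needed
git rebase -i HEAD~5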

Use rebase when pulling. git pull --rebase will move your local commits on top of the ones already in the upstream repository. A sequential history without merge commits is much easier to read.

Tag important commits. git tag -a [tagname] creates a new tag to find this specific revision. Ran experiments with a specific version? git tag -a acl2019 and people (including you!) will be able to find it.

Use GitHub issues and wiki. You can subscribe to a repository and will get mails for all issues, essentially using it as a mailing list with automatic archival. The wiki can be cloned with git and used as centralized offline documentation. This also works with other hosting systems such as GitLab.

Java

Use a build system. Eclipse projects do not count! Gradle is easy to understand and use. The build setup might get more complex as your project grows, but it will definitely save you time.

A Gradle file for a standard project looks like this; not much work for the time you save:

plugins { id 'pl.allegro.tech.build.axion-release' version '1.10.1' }
repositories { maven { url 'https://jitpack.io' } }
dependencies { api 'com.github.coli-saar:basics:2909d4b20d9a9cb47ef' }

group = 'com.github.coli-saar'
version = scmVersion.version
description = 'alto'

The plugin even takes the correct version number from your git tag when creating artifacts.

Make libraries available as a maven dependency. This is trivial with jitpack: It will package libraries for you to use on demand from git repositories. This is already shown in the example above: once jitpack is added as a repository, you can add dependencies from any git repository by simply depending on com.github.USER:REPO:VERSION with VERSION being a git revision or a git tag. jitpack will take care of the rest.

With the plugin described above, a release can be done by simply tagging a version. Only with this ease of use will you actually create releases more than once in a blue moon. In old projects, it sometimes took me half a day to do a proper release because I couldn’t remember all the steps involved. If it is hard to do a release, it will just be dropped because it is not a priority for your research.
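
Concretely, a release then boils down to two commands (the version number is made up):

git tag -a v2.3.0 -m "Release 2.3.0"
git push origin v2.3.0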

Python

Don't write top-level code. Use functions to subdivide your code, run your main function using this at the end:

if __name__ == "__main__":
    main()

This way, you can load your code into an interactive python shell (e.g. using ipython) and play with it. It also becomes possible to load the code as a library into another program.

Parse your arguments with argparse. This makes your program more robust and provides automatic documentation for people wanting to run your code. Alternative: docopt.
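
A minimal sketch (the arguments are invented for illustration); argparse generates the --help output for free:

#!/usr/bin/env python3
import argparse

def main():
    parser = argparse.ArgumentParser(description="tokenize a corpus file")
    parser.add_argument("corpus", help="input file to process")
    parser.add_argument("--lowercase", action="store_true",
                        help="lowercase all tokens")
    args = parser.parse_args()
    print(args.corpus, args.lowercase)

if __name__ == "__main__":
    main()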

Declare the packages your program requires. Write them in requirements.txt so people can run pip3 install -r requirements.txt to obtain all libraries.
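
A requirements.txt is just one package per line, optionally with version constraints (the packages and versions here are only examples):

numpy>=1.16
nltk==3.4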

Use proper indentation and comments. Indent with 4 spaces, NO TABS! (This is not just my opinion, it's in PEP 8, the official style guidelines.) Use docstrings for comments:

def frobnicate():
    """Computes the inner square sum of an infinite derived integral."""
    return 0

Your co-authors and all tools will thank you.

Instruct others how to use your libraries. Similar to jitpack for Java, you can use git URLs with pip: pip3 install git+https://example.com/usr/foo.git@TAG. Write this in your README so others can use your code (and use the same mechanism to include foreign code).

bash

Perform proper error checking. Exit on error and force all variables to be initialized. Add these two lines to the top of your bash scripts:

# exit on error
set -e
# exit on use of an uninitialized variable
set -u

Otherwise, the script will continue even when a command fails and interpret uninitialized variables as simply being empty. Not good if you have a typo in a variable name!

Avoid absolute directories. Want to run a program from your bash script? Either change the working directory to the directory your bash script is in or construct the path to the other program by hand:

# obtain the directory the bash script is stored in
# (the quotes guard against spaces in the path)
DIR=$(cd "$(dirname "$0")"; pwd)

# Now either run the command by constructing the path:
"$DIR"/frobnicate.py
# or change the directory if that is more convenient:
cd "$DIR"
./frobnicate.py

Create output directories if they don't exist. Running mkdir foo will fail if foo already exists, creating the directories by hand is burdensome. Use mkdir -p foo to create the directory, it will not fail if foo exists. Additional bonus: You can create several directories at once: mkdir -p foo/bar/baz will create all nested directories at once.

Use [[ instead of [ for if statements. [[ is part of bash's own syntax, while [ follows ordinary command rules (including word splitting), which can behave weirdly: if [ $foo == "" ] can never become true because an empty variable $foo leads to a syntax error(!) – simply use [[.
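
A small demonstration of the difference (run without set -e so the script survives the first test):

foo=""
# [ undergoes word splitting: this expands to `[ == bar ]` and
# aborts with an error like "unary operator expected"
if [ $foo == "bar" ]; then echo equal; fi
# [[ parses the expression itself, so the empty variable is fine
if [[ $foo == "bar" ]]; then echo equal; fi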

Data formats

Use a common data format. This saves you from writing your own parser for your data format. CSV is supported everywhere. XML can be read and generated by every programming language. JSON is well defined. “Excel” is not a good data format.
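
For example, Python reads both CSV and JSON with nothing but the standard library (file names and column names are made up):

import csv
import json

# CSV: one dict per row, keyed by the header line
with open("results.csv") as f:
    for row in csv.DictReader(f):
        print(row["system"], row["accuracy"])

# JSON: parsed directly into dicts and lists
with open("config.json") as f:
    config = json.load(f)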

Keep data as rich as (reasonably) possible. If you throw away information you don't need at the moment, you (or someone else) might regret it, because this additional information would have enabled interesting research. For example, the Spoken Wikipedia Corpora contain information about sub-word alignments, normalization, section structure (headings, nesting), etc. None of the research based on the SWC needs all of it, and it was a lot of work for us, but it enabled research we did not think of [1] when creating the resource. What does this have to do with data formats? Your data format has to handle all your annotations. Plain text with timestamps simply was not powerful enough for what we wanted to annotate, even though it is standard for speech recognition, for example.

XML is actually pretty good. Young people these days [2] like to dislike XML due to its verbosity. Yes, it is verbose. But it has redeeming qualities: it is rigorously specified, every programming language ships a parser for it, and schemas let you validate your documents.

Continuous integration

Continuous integration automatically builds and tests your software whenever you push new commits or create a pull request. It makes sure that the software actually still works in a clean environment and not just on your laptop. Integration with GitHub or GitLab is pretty easy; I will show an example for GitHub with TravisCI, the de-facto default for CI with GitHub.

All you have to do is add a .travis.yml file with the necessary information to the root of your project, then enable CI by logging into Travis with your GitHub credentials and selecting the repositories you want CI for.

The configuration is quite simple: If you use a standard build system (see above), you only need to state the language. For example, language: java is enough configuration to get going with a gradle project.

However, you can do more: Travis can create releases based on tags for you:

language: java
deploy:
  provider: releases
  api_key:
    secure: [omitted]
  skip_cleanup: true
  file_glob: true
  file: build/libs/alto-*-all.jar
  on:
    tags: true

will create a new release for each tag (last lines) by building the project as always, performing no cleanup (important!) and then pushing a specific jar into a new GitHub release (provider: releases means GitHub release). We build alto that way and all its releases are automatically created.

The only slightly non-trivial part is the API key, which is created by the travis command line tool.
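
To the best of my knowledge, the flow looks roughly like this (the travis client is a Ruby gem; check the Travis documentation for details):

gem install travis
travis login
# interactively encrypts a GitHub token into .travis.yml
travis setup releases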

Automate experiment evaluation

Automate table creation. Use python, sed, awk, bash (or whatever else) to automatically gather your results. Every manual step is a chance for copy-paste errors.
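
A minimal sketch: collect a number from every result file and print LaTeX table rows. The results/ layout and the "accuracy:" line are assumptions about your output format:

import glob
import re

for path in sorted(glob.glob("results/*.txt")):
    with open(path) as f:
        match = re.search(r"accuracy: ([0-9.]+)", f.read())
    if match:
        # one table row per experiment, ready to paste into the paper
        print(f"{path} & {float(match.group(1)):.2f} \\\\")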

Make your output compatible with a plotting toolkit. Same as above: if your output can be read by gnuplot (or matplotlib or ggplot), you can automate your pipeline. Export to png and tikz so you can have a quick look at your data (png) but also have high-quality vector graphics in your paper (tikz). tikz is also great because it enables manual post-processing, e.g. if you want to do something your plotting toolkit cannot, such as forcing a numeral font for the y-axis.
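
One way to get both outputs from a single script, using matplotlib together with the tikzplotlib package (formerly matplotlib2tikz; an assumption – any matplotlib-to-tikz exporter works), with made-up data:

import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import tikzplotlib  # assumption: installed via pip3 install tikzplotlib

epochs = [1, 2, 3, 4]
accuracy = [0.71, 0.78, 0.80, 0.83]
plt.plot(epochs, accuracy, marker="o")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.savefig("accuracy.png")        # for a quick look
tikzplotlib.save("accuracy.tikz")  # vector graphics for the paper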

When you are done

Clean up your pipeline. Often, an experiment grows and changes over time. Once it is done, make it download the needed data automatically and automate leftover manual steps. Remember: automation is documentation – people can read the code to see what you did.

Make your data and code available. Remember: Bad code is better than no code – do not wait because you want to clean up your code some time in the future.

That’s it

Thanks for reading! I hope some points were of interest to you; if there is something not described clearly enough (or plain wrong) or I missed something important: Do let me know! I will update this post if new points come up.

Updates

  • 2019-07-18: improvements based on feedback from my best critic, Christine.
  1. Unfortunately, also an example of papers not citing the resources they use
  2. We have students who were born after “Without Me” was released!