
How to make Research Software into Software as a Service: part 1


Author(s)

Robert Turner

Posted on 22 May 2024



Experience running projects as a Research Software Engineer leads me to believe that there are many potential software projects in academia and beyond that take this general form: researchers are developing software that performs some analysis. This analysis is thought to be valuable to other groups of people (e.g. other researchers, or private sector customers). However, these groups don’t have the expertise, or a sufficiently powerful computer, to run the software or to manage the output data. The apparent solution is to make the software available via a web interface – we would say, turn it into Software as a Service (SaaS).

We’re all familiar with SaaS from a user perspective – it’s how we consume most of our software. Where we used to buy office applications on CDs, we now use cloud-based alternatives. Facebook is SaaS. YouTube is SaaS. But how can we go from research software to SaaS? And how do we decide whether it’s even a good idea in a particular case? To some extent, the first question answers the second: I will describe what is needed to turn research software into SaaS, and you can then decide if it’s worth it.

In this first of a series of two posts, I’m going to talk about the software and infrastructure elements of SaaS; in the second, I will move on to regulatory and management considerations before concluding.

Research Code


It is fairly easy to package code within a Docker container, without thinking too much about the quality of the software within or the language it is written in. This container becomes a parcel of software that can be run on lots of different computers with very little setup. One can then test that the code works with a limited range of input types and produces the expected outputs in a few cases. Better still is to ensure the code within the container conforms to good software engineering practices – it uses version control, has automated testing, documentation and static analysis, has high quality dependencies, and so on. With these in place, the code can be maintained, and bugs can be fixed without going back to the original author. Without them, engineers will need to treat the research code as a black box. This becomes especially important if the research code needs to be updated over time to meet the needs of service users.

Perhaps the wise thing to do is to try to be objective about the quality of the research code (not how smart the researcher who made it is, or the quality of the research itself!). Once you’ve done that, you can better judge the risks of using the code in a SaaS context, and can educate and inform the stakeholders accordingly.
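As an illustration, here is a minimal sketch of a Dockerfile that packages a hypothetical Python analysis script – the file names, base image and the /data convention are assumptions for the example, not a recipe from a real project.

```dockerfile
# A minimal sketch: package a hypothetical analysis script so that it
# runs anywhere Docker runs. "analysis.py" and "requirements.txt" are
# placeholder names.
FROM python:3.12-slim
WORKDIR /app

# Install pinned dependencies first, so rebuilds are fast when only
# the analysis code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY analysis.py .

# Read input from /data and write results back to /data, so that
# callers only have to mount a single directory.
ENTRYPOINT ["python", "analysis.py", "--input", "/data/input.csv", "--output", "/data/results.csv"]
```

Built with `docker build`, this becomes the “parcel” described above: anyone with Docker installed can run it with a single `docker run` command, mounting a directory containing the input file.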

Pipelines


It is possible to run code in one container, pass the results to another, and so on. This creates a pipeline, the outputs of which would be valuable to the user. Such a pipeline might fork, or stop early for certain inputs. Nextflow is an example of a tool that can coordinate this activity. At this point, we should consider how many CPUs, and how much memory and disk space (and perhaps GPUs), each container needs. This essentially means asking “How powerful a computer is needed to run each container?” and, by extension, “How powerful a computer is needed to run the full pipeline?”. At the end of the day, this code is going to run on computers – even if the computers are configured by such jargony technologies as “Terraform”, “Cloud”, “Helm” and “Kubernetes”. There need to be enough resources available to run the pipeline. What the listed technologies allow is, rather than investing in one “big” computer that might not be in use all the time, renting more power when it is needed, and letting the SaaS ask for extra power when lots of people are using it.
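For a flavour of what this looks like, here is a minimal sketch of a single Nextflow process; the container image, the command it runs and the resource figures are hypothetical, but the `cpus`, `memory` and `disk` directives are where the answers to those “how powerful a computer?” questions get written down.

```nextflow
// A minimal sketch of one pipeline step. The image name, command
// and resource figures are hypothetical.
process runAnalysis {
    container 'research-analysis:latest'  // the packaged research code
    cpus 4
    memory '16 GB'
    disk '50 GB'

    input:
    path input_csv

    output:
    path 'results.csv'

    script:
    """
    analysis --input ${input_csv} --output results.csv
    """
}
```

Nextflow (or a similar coordinator) then takes care of starting each step when its inputs are ready and placing it on a machine with enough free resources.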

Infrastructure

The use of the Cloud in academic research justifies an article of its own. Briefly, moving from a “big” physical computer to the Cloud confers some advantages – there is no longer a need to buy and maintain physical computers and network hardware, or to employ people with that expertise. However, the Cloud versions of these need to be specified using code, most likely Terraform, and expertise in that becomes required instead. Terraform means that the “hardware” the SaaS is running on matches the Terraform code in your repository, and updating or upgrading the hardware is a change to the code, not a visit to PC World. But it also means that a typo (e.g. 160 GB of RAM rather than 16 GB) can result in a request for a much more expensive Cloud resource, and a shock when the bill arrives at the end of the month.
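To make that concrete, here is a minimal Terraform sketch of a single virtual machine on Google Cloud – the names, zone and sizes are all hypothetical – showing how the amount of memory ends up inside one easily mistyped string.

```hcl
# A minimal sketch: one Cloud virtual machine defined in code.
# Names, zone and sizes are hypothetical.
resource "google_compute_instance" "pipeline_worker" {
  name = "pipeline-worker"
  zone = "europe-west2-a"

  # Custom machine type: 4 vCPUs and 16384 MB (16 GB) of RAM.
  # The typo "n2-custom-4-163840" would ask for 160 GB instead,
  # at roughly ten times the cost.
  machine_type = "n2-custom-4-16384"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
  }
}
```

Running `terraform plan` before `terraform apply` shows what will change, and is a good habit for catching exactly this kind of mistake before it costs money.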

Job Scheduling

We now know how many pipelines we can run on our “big” computer at once, or we might have a setup where we rent more computing power depending on how many pipeline runs (jobs) users are asking for. If our computer can run 5 jobs and we have 6 users who have each submitted 3 jobs to our SaaS, it will fail if it is given all that work at once – much like what happens when running too much software, or too many browser tabs, on a desktop or laptop computer. We need a system that knows how many jobs are in progress at any moment, and which jobs are still waiting. When one job finishes, another should be taken from the to-do list (the queue) and started up. KEDA is an example of a technology that will keep track of how many jobs are going on at once.
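As a sketch of how this can be wired up, here is a minimal KEDA ScaledObject that scales a hypothetical worker Deployment with the length of a RabbitMQ job queue; the names, the choice of queue technology and the authentication reference are all assumptions for the example.

```yaml
# A minimal sketch: scale a hypothetical "pipeline-worker" Deployment
# with the length of a job queue, never beyond the 5 jobs our
# computer can handle at once.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: pipeline-worker-scaler
spec:
  scaleTargetRef:
    name: pipeline-worker        # hypothetical Deployment running jobs
  minReplicaCount: 0             # scale to zero when the queue is empty
  maxReplicaCount: 5             # the most jobs we can run at once
  triggers:
    - type: rabbitmq
      metadata:
        queueName: pipeline-jobs # hypothetical job queue
        mode: QueueLength
        value: "1"               # aim for one worker per queued job
      authenticationRef:
        name: rabbitmq-auth      # hypothetical TriggerAuthentication
```

Jobs beyond the fifth simply wait in the queue until a worker becomes free.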

Alongside these research software jobs, we also need to set aside some computing resources for the other components of the SaaS, such as the database, user interface and networking components.

Data Storage


We need to be able to store data uploaded by users, output data from the jobs, reference data that stays the same for each job, operational data (such as where a job is in the queue) and “administrative” data such as user accounts and permissions.

Cloud storage buckets offer a pay-as-you-go solution for storing very large amounts of data in a resilient way. Otherwise, physical disks with a conventional file system are required; space must be managed so that they don’t fill up unexpectedly, and they must be backed up.

Buckets and other file storage systems are unlikely to be appropriate for smaller pieces of information that need to be accessed quickly and combined to generate information in the user interface. A database (or similar) is needed here. Multiple databases might be used, for example to separate administrative and scientific data.
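Here is a minimal sketch of how the two kinds of storage might be used together, assuming hypothetical bucket, table and file names (boto3 talking to an S3-style bucket, with SQLite standing in for the database):

```python
# A minimal sketch: large job outputs go to a bucket, while small,
# quickly queried records go to a database. All names are hypothetical.
import sqlite3

import boto3

s3 = boto3.client("s3")

def store_job_result(job_id: str, result_path: str) -> None:
    # Large output file: resilient, pay-as-you-go bucket storage.
    s3.upload_file(result_path, "saas-job-outputs", f"{job_id}/results.csv")

    # Small operational record: fast to query when building the
    # user interface.
    with sqlite3.connect("operational.db") as db:
        db.execute(
            "INSERT INTO jobs (job_id, status, output_key) VALUES (?, ?, ?)",
            (job_id, "complete", f"{job_id}/results.csv"),
        )
```

In a production SaaS the SQLite file would more likely be a managed database service, but the division of labour is the same.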

User Interface


SaaS users expect a graphical user interface and a normal website experience. Research software, by contrast, might have a command line interface, where people have to type commands to control the software, or the code itself may need to be edited.

Tools like Shiny (or Dash) make setting up a web user interface fairly straightforward. However, adding user logins, data security and scalable computing resources for research code that needs a lot of CPUs and memory can be harder with this kind of user interface than with a more general purpose web framework. shinyapps.io, for example, will enable someone to get a basic user interface set up and on the web for free, but will charge for password authentication and additional computing resources.
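For a sense of how little code a first prototype needs, here is a minimal Dash sketch – the component names and messages are invented – with an upload control whose callback would, in a real service, put a job on the queue rather than do the work itself.

```python
# A minimal Dash sketch of a SaaS front end: upload a file, show a
# status message. All identifiers are hypothetical.
from dash import Dash, Input, Output, dcc, html

app = Dash(__name__)
app.layout = html.Div([
    html.H1("Analysis service"),
    dcc.Upload(id="upload-data", children=html.Button("Upload input file")),
    html.Div(id="status"),
])

@app.callback(Output("status", "children"), Input("upload-data", "contents"))
def handle_upload(contents):
    if contents is None:
        return "Waiting for an input file."
    # In a real SaaS this would submit a pipeline job to the queue,
    # not process the upload inside the web process.
    return "File received; job submitted."

if __name__ == "__main__":
    app.run(debug=True)
```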

For some applications, Shiny or similar will be an excellent choice, either for prototyping or for a fully usable SaaS. When using these tools, it’s important to think about what your product’s and users’ needs are, to make sure these are served by the tool, and to consider what might happen if those requirements grow beyond the tool’s limits. Needing an extra 8 GB of RAM per process, or needing to host user data in a different geographical location, might mean your product has to be completely rewritten if the tool and its services don’t support this.

Some partial conclusions

Before going on to part 2 of this series, there are some initial messages that I think are useful:

  • If you intend to get other people (e.g. research software engineers) to help turn the code into SaaS, make sure your research code can be worked on by other people by adopting good software engineering practices throughout the research phase. The only way around this is to have code that can be completely sandboxed in a container, and this will be, by its very nature, hard to maintain.
  • Consider making a prototype SaaS with something like Shiny so other people can quickly see the value in your work. Do this early in the project. But be prepared to have to change the technology you’re using if you hit limitations. And expect to have to pay for services like authentication.

Thanks for reading to the end. Thanks to Matthew Colpus, Kirsty Allen, Philip Fowler and Sarah Surrall for essential feedback on the text. This study is supported by the National Institute for Health Research (NIHR) Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance (NIHR200915), a partnership between the UK Health Security Agency (UKHSA) and the University of Oxford. The views expressed are those of the author(s) and not necessarily those of the NIHR, UKHSA or the Department of Health and Social Care.

 

Images from Pexels by Chokniti Khongchum, Azamat Esenaliev, and cottonbro studio. Image from Canva by chitsanupong.
