Research Software Engineers and Data Scientists: More in Common

Posted by s.aragon on 5 April 2018 - 1:25pm

By Matthew Archer, Stephen Dowsland, Rosa Filgueira, R. Stuart Geiger, Alejandra Gonzalez-Beltran, Robert Haines, James Hetherington, Christopher Holdgraf, Sanaz Jabbari Bayandor, David Mawdsley, Heiko Mueller, Tom Redfern, Martin O'Reilly, Valentina Staneva, Mark Turner, Jake VanderPlas, Kirstie Whitaker (authors in alphabetical order)

In our institutions, we employ multidisciplinary research staff who work with colleagues across many research fields to use and create software to understand and exploit research data. These researchers collaborate with others across the academy to create software and models to understand, predict and classify data. They do so not just as a service to advance the research of others, but also as scholars with opinions about computational research as a field, making supportive interventions to advance the practice of science.

In some of our institutions we use the term "data scientist" to refer to our team members, in others we use "research software engineer" (RSE), and in some we use both. Where both terms are used, the difference seems to be that data scientists in an academic context focus more on using software to understand data, while research software engineers more often make software libraries for others to use. However, in some places, one term or the other is used to cover both, according to local tradition.

What we have in common

Regardless of job title, we hold in common many of the skills involved and the goal of driving the use of open and reproducible research practices.

Shared skill focuses include:

  • Literate programming: writing code to be read by humans.
  • Performant programming: the time or memory used by the code really matters.
  • Algorithmic understanding: you need to know what the maths of the code you're working with actually does.
  • Coding for a product: software and scripts need to live beyond the author, being used by others.
  • Verification and testing: it's important that the script does what you think it does.
  • Scaling beyond the laptop: because performance matters, cloud and HPC skills are important.
  • Data wrangling: parsing, managing, linking and cleaning research data in an arcane variety of file formats.
  • Interactivity: the visual display of quantitative information.

Shared attitudes and approaches to work are also important commonalities:

  • Multidisciplinary agility: the ability to learn what you need from a new research domain as you begin a collaboration.
  • Navigating the research landscape: learning the techniques, languages, libraries and algorithms you need as you need them.
  • Managing impostor syndrome: as generalists, we know we don't know the detail of our methods quite as well as the focused specialists, and we know how to work with experts when we need to.

Our differences emerge from historical context

The close relationship between the two professional titles is not an accident. Different places have tried different tactics to resolve a common set of frustrations that arise as scholars struggle to make effective use of information technology.

In the UK, the RSE Groups have tried to move computational research forward by embracing a service culture while retaining participation in the academic community, sometimes described as being both a "craftsperson and a scholar", or science-as-a-service. We believe we make a real difference to computational research as a discipline by helping individual research groups use and create software more effectively for research, and that this creates genuine value for researchers, rather than building and publishing tools that researchers do not actually use.

The Moore-Sloan Data Science Environments (MSDSE) in the US are working to establish Data Science as a new academic interdisciplinary field, bringing together researchers from domain and methodology fields to collectively develop best practices and software for academic research. While these institutes also facilitate collaboration across academia, their funding models are based less on a service model than those of UK RSE groups and more on bringing graduate students, postdocs, research staff, and faculty from across academia together in a shared environment.

Although these approaches differ strongly, we nevertheless see that the skills, behaviours and attitudes used by the people struggling to make this work are very similar. Both movements are tackling similar issues, but in different institutional contexts. We took diverging paths from a common starting point, but now find ourselves envisaging a shared future.

The Alan Turing Institute in the UK straddles the two models: it has a Research Engineering Group that follows a science-as-a-service model and comprises both Data Scientists and RSEs, alongside a wider collaborative academic data science engagement across eleven partner universities.

Observing this convergence, we recommend:

  • Create adverts and job descriptions that are welcoming to people who identify with either title: the important thing is to attract and retain the right people.
  • Standardised nomenclature is important, but over-specification is harmful. Don't try too hard to delineate the exact differences in the responsibilities of the two roles: people can and will move between projects and focuses, and this is a good thing.
  • These roles, titles, groups, and fields are emerging and defined differently across institutions. It is important to have clear messaging to various stakeholders about the responsibilities and expectations of people in these roles.
  • Be open to evolving roles for team members, and ensure that stable, long-term career paths exist to support those who have taken the risk to work in emerging roles.
  • Don't restrict your recruitment drive to people who have worked under one of these titles: the skills you need could be found in someone whose earlier roles used the other term.
  • Don't be afraid to embrace service models to allow financial and institutional sustainability, but always maintain the genuine academic collaboration needed for research to flourish.