PaperStream: software that collects data from multiple-answer question documents

Posted by s.aragon on 2 August 2018 - 9:57am

By Julio Vega, University of Manchester.

As part of my PhD, in which we are researching whether smartphone data can be used to monitor the progression of Parkinson’s Disease, we found out that we had to go “Back to Analogue”, as a paper diary was the best tool for patients to self-report their symptoms. This was excellent for the study, but it gave us another thing to worry about: I would have to manually transcribe participants’ answers from paper into electronic files. We were aiming for ten participants who each needed to complete a diary with 365 pages over a year; if it had taken 45 seconds (being very optimistic) to transcribe each page, encoding all ten diaries would have taken ~114 hours, or ~19 days of work!

Being a computer scientist and wanting to save a good 114 hours for when I have to write my dissertation, I searched the Internet for a tool that would allow me to create and encode paper diaries automatically (maybe some sort of software to mark multiple-answer exams?). To my dismay, the few options available were software projects that were no longer supported, not well documented, not free or open source, or not very user-friendly. I then decided to take this as an opportunity to contribute back to the community, because many of the tools we use for data analysis in our lab are freely available thanks to the work of others. This is how PaperStream was born.

PaperStream is software that researchers and academics can use to create paper diaries, surveys, quizzes or any other document with multiple-answer questions that people can respond to using pen and paper, and that can then be encoded automatically into a CSV file. PaperStream is free, open source, available for Windows, Mac OS, and Linux, and fully documented, as it was designed to be used by anyone, including people without a technical background.

For the sake of this post, I will write about PaperStream using diaries and surveys as two examples to showcase its features. If you are working with quizzes or other types of documents, what I describe here shouldn't be too different from what you'd need to do. So, in our case, a diary is a questionnaire that one or more people must answer every day for several days, weeks, or months. Similarly, a survey is a questionnaire that one or more people have to answer only once.

For both a diary and a survey, you only need a one-page PDF document that will work as a template for every page. For diaries, PaperStream will label each page with a unique date, as in the next figure, and for surveys, PaperStream will number each copy with a unique ID. Once PaperStream has processed your templates, you will get a zip file with your diaries or surveys in both A4 and A5 sizes, ready to print and bind.
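
To make the labelling idea concrete, here is a minimal sketch of how one label per page could be generated: a run of consecutive dates for a diary and zero-padded IDs for surveys. The function names, the ISO date format and the `S` prefix are assumptions made for illustration; the post does not describe how PaperStream formats its labels internally.

```python
from datetime import date, timedelta


def diary_labels(start, days):
    """Return one date label per diary page, starting at `start`."""
    return [(start + timedelta(days=i)).isoformat() for i in range(days)]


def survey_labels(count, prefix="S"):
    """Return one zero-padded unique ID per survey copy (hypothetical format)."""
    width = len(str(count))
    return [f"{prefix}{i:0{width}d}" for i in range(1, count + 1)]


print(diary_labels(date(2018, 1, 1), 3))  # three consecutive daily labels
print(survey_labels(10)[:3])              # first three of ten survey IDs
```

Each label would then be stamped onto a copy of the one-page template before the pages are assembled into a booklet.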


After your participants have answered these printouts using a pen, you need to scan them as a multi-page TIF image, or as single PNG images compressed in a ZIP file. Once this is ready, you need to tell PaperStream where and what answers to look for through a marking rubric. A marking rubric is nothing more than a group of circles that indicate what areas of a page participants can mark with a pen and what those marks/answers mean, for example, the hour of the day or a point on a Likert scale. Since you used a single template to create your diaries or surveys, you only need to design a marking rubric once, and that’s it! When the rubric is ready, PaperStream will give you a zip file with a CSV file containing all the answers of each diary or survey that you wanted to process. PaperStream can also detect duplicates and missing data, and it is very forgiving: it will detect an answer when at least 15% of the answer area is filled in, and it has no problems when the pen goes outside that area. This means that your participants don’t have to worry about how to respond to the questions; it is as easy as using pen and paper.
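
As a rough illustration of how a rubric turns marks into CSV rows, the sketch below assumes that an earlier step has already measured what fraction of each circle is filled in. Only the 15% threshold comes from the description above; the `RUBRIC` layout, the column names and the `MISSING` and `MULTIPLE` flags are hypothetical and are not taken from PaperStream's code.

```python
import csv
import io

FILL_THRESHOLD = 0.15  # from the post: at least 15% of the answer area filled

# Hypothetical rubric: every circle belongs to a question and carries a meaning.
RUBRIC = [
    {"question": "pain", "value": "1"},
    {"question": "pain", "value": "2"},
    {"question": "pain", "value": "3"},
    {"question": "hour", "value": "09"},
    {"question": "hour", "value": "10"},
]


def encode_page(page_label, fill_fractions):
    """Turn per-circle fill fractions (0.0 to 1.0) into one CSV row."""
    marked = {}
    for circle, fraction in zip(RUBRIC, fill_fractions):
        values = marked.setdefault(circle["question"], [])
        if fraction >= FILL_THRESHOLD:
            values.append(circle["value"])
    row = {"page": page_label}
    for question, values in marked.items():
        if not values:
            row[question] = "MISSING"   # no circle reached the threshold
        elif len(values) > 1:
            row[question] = "MULTIPLE"  # more than one circle was marked
        else:
            row[question] = values[0]
    return row


def to_csv(rows):
    """Serialise the encoded rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    encode_page("2018-01-01", [0.02, 0.40, 0.05, 0.00, 0.16]),
    encode_page("2018-01-02", [0.00, 0.00, 0.00, 0.90, 0.00]),
]
print(to_csv(rows))
```

Because the rubric is designed once per template, the same list of circles can be reused for every scanned page.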

From a technical perspective, Python was the best language to develop PaperStream in. It has multi-platform support, and it can process PDF files thanks to the PyPDF2 library and images via OpenCV. The first prototype of PaperStream was a script that converted a PDF template into a PDF booklet, based on the booklet-maker project of Luke Plant. The second prototype, which needed to encode the answers from paper to an electronic file, was slightly more complicated. I wanted to maintain Python’s multi-platform capabilities while at the same time giving users a graphical interface that would not take too much time to develop. There are many options to create a GUI in Python. First, I tried Tkinter, but canvas support and the manipulation of geometric shapes (like drag and drop) were not straightforward. For this reason, I decided to think of PaperStream as a desktop web app, meaning that the GUI would be HTML/CSS/JavaScript based, taking advantage of HTML, SVG and the rich JavaScript ecosystem, while relying on a local web server to route all calls from the web GUI to the processing scripts. Falcon was my choice for the web server due to its light weight, extensive documentation, and simple implementation. On the front end, I used Fabric.js for the geometric manipulation, Async.js for asynchronous calls and Noty.js for notifications. Then, for the actual encoding logic, I adapted the work done by Raphael Baron that used OpenCV to extract parts of a page framed by markers, complementing it with the answer extraction functionality, which works by comparing black pixels between two paper sheets. All this open source software made PaperStream development faster and easier.
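
The idea of comparing black pixels between two sheets can be sketched without any imaging library. The example below is an illustration under stated assumptions, not PaperStream's implementation: pages are represented as nested lists where 1 means a dark pixel, the scanned page is assumed to be already aligned with the blank template, and every answer circle is assumed to lie fully inside the image. PaperStream itself works on real scans with OpenCV.

```python
def circle_pixels(cx, cy, radius):
    """Yield the (x, y) coordinates of every pixel inside a circle."""
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                yield x, y


def fill_fraction(template, scan, cx, cy, radius):
    """Share of the circle that is dark in the scan but not in the blank template."""
    inside = added = 0
    for x, y in circle_pixels(cx, cy, radius):
        inside += 1
        if scan[y][x] and not template[y][x]:
            added += 1
    return added / inside


# Toy 11x11 page: the template is blank and the scan has its lower rows inked.
size = 11
template = [[0] * size for _ in range(size)]
scan = [[1 if y >= 5 else 0 for _ in range(size)] for y in range(size)]

fraction = fill_fraction(template, scan, cx=5, cy=5, radius=4)
print(f"{fraction:.2f} of the answer area was filled in")
```

Counting only pixels that are dark in the scan and light in the template means that anything already printed on the page, such as the outline of the circle itself, is not mistaken for a pen mark.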

Finally, the last sprint for the first version of PaperStream covered its testing, distribution and documentation. I implemented a few unit tests using Python’s unittest library for the core functionality of the scripts that create and encode documents. Then, I considered Docker and other similar options to make PaperStream available, but in the end I went with PyInstaller, which allows developers to distribute a Python project (including firing up a Falcon server) as a single executable or a single zip file, with builds for all major operating systems. I also published PaperStream in a pip repository, so it could be installed with a single line by developers and other technical users. For the documentation, I decided to give Hugo a try for the first time; writing it in Markdown was simple, and automatically publishing the static website to Netlify with every GitHub commit was super convenient.
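
For readers who have not used unittest before, a test for this kind of core functionality could look roughly like the sketch below. The `is_marked` helper and the test names are hypothetical and stand in for PaperStream's real functions; only the 15% threshold comes from this post.

```python
import unittest


def is_marked(fill_fraction, threshold=0.15):
    """Hypothetical helper: an answer counts once the filled share reaches the threshold."""
    return fill_fraction >= threshold


class IsMarkedTest(unittest.TestCase):
    def test_below_threshold_is_not_marked(self):
        self.assertFalse(is_marked(0.10))

    def test_at_threshold_is_marked(self):
        self.assertTrue(is_marked(0.15))

    def test_fully_filled_circle_is_marked(self):
        self.assertTrue(is_marked(1.0))
```

Saved in a file such as test_marks.py (an assumed name), these tests can be run with `python -m unittest`.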

I learnt a lot during the development of this project and I’d love it if the community finds PaperStream useful and takes its development forward. Cool future functionality could include detecting different pen colours, shape marks, and even handwritten text. In the meantime, you can get PaperStream and its source code for free on GitHub, and how-to guides to create and encode documents on Netlify. Oh, and in case you were wondering, with PaperStream I encoded all ~3650 pages in about 5 minutes; a whopping ~1,300 times faster.