Crying out loud for better research artifacts
In the first semester of my Computer Science Master’s program at TU Dortmund University[^1], I participated in a seminar titled “Reproducibility of Research Artifacts”. The gist of it: Our professor studies the use of research artifacts in the Computer Science community. To gather data on their usefulness, he devised this seminar, where a bunch of undergraduates each read through six old conference papers and try to use the accompanying research artifacts to (among other checkboxes) roughly verify each paper’s claims. This seminar is offered multiple times by multiple professors around the world, slowly building up a pile of new metadata to be analyzed for some new paper about reproducibility.
I am now done with reading the introductory papers and checking my six research artifacts, leaving me only with the dreaded task of contributing a few pages of formal research talk. Paragraphs to be graded, stapled to the other participants’ texts and archived somewhere, only ever read by two people. Rephrasing what all the other students have contributed already, be it this year or in the years before and after. The perfect use case for “Hey ChatGPT, please rephrase this text from before but replace their six papers with mine”.
At times, my brain sucks. Especially when it is time to write in academic lingo. This post is my attempt to extract the gist of my »findings« from its mental claws.
Badges of honor
In the old times of 2010, the term »replication crisis« popped up in academia. Psychology people stumbled upon the fact that they could not reliably reproduce the original findings of older papers, effectively downgrading them from scientific research to hot garbage. Innocent until proven guilty, and scientifically worthless until correctly reproduced by a later study.
Watching from the sidelines of the spectacle, the computer science community took notes and decided “Yeah, maybe we should include the source code of our experiments with our papers instead of simply claiming things”. And because nerds do love to make up new standards, the ACM badges were born. I present to you the pinnacle of serious science:
- Artifacts Available v1.1
- Artifacts Evaluated – Functional v1.1
- Artifacts Evaluated – Reusable v1.1
- Results Reproduced v1.1
- Results Replicated v1.1
Hand in your paper to a conference, ask for your artifacts to be reviewed as well, and maybe you will be allowed to stick a few of these precious PNGs onto your first page:
Yeah, go bold colors!
The first badge is more or less free: ZIP your source code and let CERN store it forever, or at least for 10 years. Great, now your artifact is »available«.
Next up, »Functional«: Works-On-My-Machine is no longer enough; now other people need to be able to run your hacky data processing pipeline. Ugh. But okay, you can create a virtual machine image with all your code and dependencies, add a bash script and write a short README.pdf describing how to log in, open a terminal and execute said bash script. Hey CERN, can you please also store this 5 GB Ubuntu 8 VM image for us? Now some poor soul in rural Germany can download that image for a few days to see that your bash script executes without errors, printing out some ASCII symbols along the way, hopefully none of them in red. Hey, it works.
But functional is easy; how about »Reusable«? No problem at all: Just make sure that every possible developer on the planet can use your research-quality code for their next project without friction. Take a deep breath, spend a night or two documenting your code, removing unused functions and assumptions about your specific VM, and hope for the best. Depending on the mood of the reviewer, your documentation is enough. Luckily, they also use the macOS version you used, so they can tweak a few lines and the graphs are now pink instead of blue. Take that third, red badge and be proud of yourself. Good boy!
Now there are also badges relevant to the replication crisis: »Results Reproduced« and »Results Replicated«. Reproducing and replicating can mean either “other people re-did your research with your tools” or “other people re-did your research on their own, without your tools”. To avoid any clarity, the two terms can mean either one of these practices, and it is up to the current conference committee to decide how they want to define them. At least the ACM settled on reproducing = same result, original tools and replicating = same result, different tools in v1.1 of their badges.
These two blue badges are rarely found on papers. Who waits for an entire replication study before first publishing a paper?
Anyways, here is my professor explaining research on artifact evaluation and the expectations one might have for it, in video form:
Investigation procedure
To achieve some degree of comparability, the students investigating the reproducibility of research artifacts were given a lab report template, written in Markdown, to log efforts and note observations. It includes metadata (DOI, time it took to check everything, badges the paper previously got) and a section each for »Availability«, »Relatedness«, »Functionality« and »Reusability«. Note the absence of a »Reproducibility« section. For each of those properties, we were asked to make a boolean decision: Does the paper’s artifact earn the right to hold that title? How was that checked? Which interesting findings were observed? The whole template is rather free-form, leaving it as an exercise to our professor to normalize all lab reports at a later stage.
Click to view the full labnotes template
# Lab Notes on Research Artifact
* Date of examination:
* Start time:
* End time:
## Metadata
* Paper DOI (necessary):
* Artifact DOI (optional):
* Badges achieved in regular artifact evaluation:
* Time (in minutes) needed to read the paper:
## Availability
* [ ] Artifact is archived in a public archive with a long-term retention policy
* [ ] Artifact is available on a different website
* Time (in minutes) needed for the check:
### Checks performed
<!-- Note all checks you performed here to answer the above questions -->
1. Artifact is in a well-known archive (i.e. FigShare, Zenodo, or osf.io)
2. ...
### Observations
<!-- Note all observations (negative, neutral, or positive) made -->
## Relatedness
* [ ] Artifact is related to the paper
* [ ] Artifact is NOT related to the paper
* Time (in minutes) needed for the check:
### Checks performed
<!-- Note all checks you performed here to answer the above questions -->
### Observations
<!-- Note all observations (negative, neutral, or positive) made -->
## Functional
* [ ] Artifact is considered functional
* [ ] Artifact is NOT considered functional
* Time (in minutes) needed for the check:
### Checks performed
<!-- Note all checks you performed here to answer the above questions -->
### Observations
<!-- Note all observations (negative, neutral, or positive) made -->
## Reusable
* [ ] Artifact is considered reusable
* [ ] Artifact is NOT considered reusable
* Time (in minutes) needed for the check:
### Checks performed
<!-- Note all checks you performed here to answer the above questions -->
### Observations
<!-- Note all observations (negative, neutral, or positive) made -->
Each artifact investigation was supposed to last six hours at most. If we could not get the artifact to function in that time, we could label it as “not functional”. Inaccessible artifacts (either due to unavailability or due to impossible hardware requirements) were swapped for new ones.
The suspects
As I’ve hinted at earlier, I had to review six papers and their accompanying artifacts. So let me quickly present the suspects, assigned to me at random:
A) Horn-ICE learning for synthesizing invariants and contracts
The authors built a new verifier for programs, to crush other state-of-the-art verifiers at some program-verification competition for logic nerds. Its main contribution is a dual-role system: A learner component proposes hypotheses (candidate invariants), for which a teacher component then tries to generate counterexamples in the form of Horn clauses. Apparently it is fast, and they are righteously dunking on other researchers’ verifiers:
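The learner/teacher loop itself is easy to sketch. Here is my own toy illustration in Python, nothing from the authors’ artifact: the “invariant” is just a threshold predicate over a made-up transition system, and the hypothetical `teacher` answers with implication-style counterexamples, the simplest special case of the Horn clauses the real tool works with.

```python
# Toy ICE-style loop (my own sketch, not the paper's verifier).
# Transition system: x starts at 0 and increments until it reaches 5.
def teacher(candidate):
    """Hypothetical checker: returns None if the candidate invariant is
    inductive for the toy system, else a pair (pre, post) meaning
    "candidate(pre) holds but candidate(post) does not" -- an
    implication counterexample, i.e. a tiny Horn clause."""
    for pre in range(10):
        post = pre + 1 if pre < 5 else pre  # toy transition relation
        if candidate(pre) and not candidate(post):
            return (pre, post)
    return None

def learner():
    """Proposes ever weaker invariants of the form x <= bound until the
    teacher stops finding counterexamples."""
    bound = 0
    while True:
        candidate = lambda x, b=bound: x <= b
        counterexample = teacher(candidate)
        if counterexample is None:
            return bound                       # invariant is inductive
        bound = max(bound, counterexample[1])  # cover the rejected state

print("learned invariant: x <=", learner())  # prints: x <= 5
```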
B) Codebase-adaptive detection of security-relevant methods or »SWAN«
You have heard about taint analysis, haven’t you? No? Okay: We look at paths in code (in this paper’s case, Java code for Android apps) which take untrusted input and color (»taint«) the incoming data. The data keeps its taint while traveling through the code base, only to be removed when passing through sanitizer functions. If tainted data reaches an output function (where malicious inputs could actually attack the system or user), we have found a potential security vulnerability.
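In toy form, the three roles look roughly like this. A quick Python sketch of my own for brevity; SWAN itself classifies Java/Android methods into exactly these categories:

```python
# Toy source -> sanitizer -> sink example (my own sketch, not SWAN).
class Tainted(str):
    """A string that remembers it came from an untrusted source."""

def read_request_parameter():     # source: untrusted input enters here
    return Tainted("'; DROP TABLE users; --")

def escape_quotes(value):         # sanitizer: returns an untainted copy
    return str(value).replace("'", "''")

def run_query(name_fragment):     # sink: dangerous if it sees tainted data
    if isinstance(name_fragment, Tainted):
        raise RuntimeError("tainted data reached a sink: potential vulnerability")
    print("querying for:", name_fragment)

name = read_request_parameter()
run_query(escape_quotes(name))    # sanitized on the way: fine
try:
    run_query(name)               # still tainted: flagged
except RuntimeError as finding:
    print("analysis result:", finding)
```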
So these guys built SWAN and SWAN_Assist, an accompanying IntelliJ plugin. It does some machine learning magic on Android Java code to detect and annotate input, sanitizer and output functions. Maybe it can help Android folks write better stuff. I have some serious respect for actually building something usable (an IntelliJ plugin) from their research, instead of just a bare-bones proof of concept:
C) Prediction of atomic web services reliability based on k-means clustering
This one does not count, as the artifact could not be retrieved. It was replaced with a different paper.
D) Verifying concurrent search structure templates
Computer Scientists like to make a great deal out of formally verifying that their algorithms and data structures really do what they are supposed to do. While this is a great way to scare first-semester students, doing so for threaded, concurrent data structures makes a great way to keep PhD students busy. To my dismay, these authors from New York used the proof assistant coq, which I don’t really understand.
But: They claim to have managed to construct verified implementations of B-trees, hash tables, and linked lists, with which you could build safer-than-average file systems and databases. Great!
E) Fuzzi: a three-level logic for differential privacy
A catchy name, ain’t it? The authors chip away at »differential privacy«. DP says: We don’t want your personal data, we want statistics over all our customers. So let’s add some noise (read: random changes) to your data before sending it to our analytics server. They applied that to machine learning algorithms and used it as an excuse to write Fuzzi, a checker that calculates sensitivity bounds for the privacy attributes of the generated models. Somehow this is better and simpler than previous methods.
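The core trick is small enough to show. A minimal, textbook-DP sketch in Python of my own making, which has nothing to do with Fuzzi’s actual mechanism or API: report an aggregate plus Laplace noise whose scale is the query’s sensitivity divided by the privacy budget epsilon.

```python
# Textbook Laplace mechanism (illustration only, not Fuzzi).
import random

def laplace_sample(scale):
    """Draw from Laplace(0, scale): the difference of two independent
    exponential samples is Laplace-distributed."""
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def private_count(values, predicate, epsilon=0.5):
    """Noisy count. A counting query has sensitivity 1: one customer
    joining or leaving changes the true count by at most 1."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_sample(1.0 / epsilon)

ages = [23, 35, 41, 29, 52, 38, 61, 27]
noisy = private_count(ages, lambda age: age > 40)
print(f"noisy count of customers over 40: {noisy:.1f}")  # true answer is 3
```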
F) Developer Reading Behavior While Summarizing Java Methods: Size and Context Matters
Hey ethics board, can we strap some poor CS undergrads to an iris-scanner for science, please?
The most straightforward paper title: A study in which 18 participants of different skill levels were asked to summarize Java methods. An eye tracker took note of which classes of code elements they looked at the most. They come to the conclusion that (contrary to an older study) developers tend to focus on the control-flow terms in the method body and less on the function signatures. The bigger the code, the more they jump back and forth. Maybe these findings are useful for something practical down the line.
G) Lazy Product Discovery in Huge Configuration Spaces
The authors of this particular paper took on the challenge of speeding up Gentoo package installations. More scientifically, they used it as a case study for product configuration. While installing a set of packages, the package manager (in Gentoo’s case emerge) must determine a valid set of dependencies to actually install on the system. As always, this problem can be reduced to SAT, and the insanely huge number of possible combinations (the huge configuration space) makes any naive implementation quickly blow up in running time.
So this paper contributes a lazy-loading method that only looks at smaller subsets of the configuration space at a time, bringing the problem into the realm of the possible.
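To make the SAT connection concrete, here is a deliberately tiny Python sketch of my own with made-up package names, unrelated to the paper’s tooling: every package is a boolean variable (“install it or not”), dependencies and conflicts become clauses, and brute force over all 2^n assignments is exactly what stops scaling once n is in the tens of thousands.

```python
# Dependency resolution as a boolean satisfiability problem (toy example).
from itertools import product

packages = ["app", "libfoo", "libfoo-legacy", "libbar"]

def valid(assignment):
    a = dict(zip(packages, assignment))
    return all([
        a["app"],                                             # we asked for "app"
        (not a["app"]) or a["libfoo"] or a["libfoo-legacy"],  # app needs some libfoo
        (not a["app"]) or a["libbar"],                        # app also needs libbar
        not (a["libfoo"] and a["libfoo-legacy"]),             # the two versions conflict
    ])

# Brute force over 2^n assignments -- fine for 4 packages, hopeless for
# Gentoo-sized configuration spaces.
for bits in product([False, True], repeat=len(packages)):
    if valid(bits):
        print(sorted(name for name, installed in zip(packages, bits) if installed))
```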
Maybe unintuitively, the emerge package manager is way faster than this method, but it does cheat a bit and can’t find a solution (= no installation possible) for some configurations for which this paper’s method was able to find solutions:
The problems
Now that we have shown that we have actually read the papers at some point, let’s dunk on them. Have they done their homework and crafted useful, eye-pleasing software artifacts? As always: it depends.
- Studies are hard to evaluate. The eye-tracking paper, for example, provided a quite extensive collection of emails, forms and question lists, but these can’t be easily checked by compiling and running unit tests. For an artifact review, this sets them apart from papers with pure software artifacts. But if you ever want to repeat the study with different participants, you can copy most of the paperwork.
- While we are at this paper: Proprietary hardware and software are always tricky. I can’t easily source an eye tracker and quickly redo the experiment.
- Docker images as tar files are a great, lightweight alternative to VM images, which often ship an entire desktop environment with them. VM images are the safest and often mandatory route to artifact submission. But boy, oh boy, are those images big. I started the artifact evaluation on a slow internet connection, and those 4 gigabytes won’t download fast in today’s Germany. And then you start them, having to interact with an old Ubuntu 320x280 desktop using a different keyboard layout. Sure, you can set up an SSH server in that VM and connect to it from your own computer. But containers are a bit more terminal-accessible from the start and less resource-hungry, in my experience.
- On the other hand, solely providing Dockerfiles, the recipe to create a container image from scratch, is more error-prone, as dependencies might have gone missing by now. Still, they provide a great starting point for reusing the software, since they double as a step-by-step dependency and compilation setup guide.
- Code older than a few months on GitHub is rarely updated and often no longer builds on a modern system with up-to-date dependencies. For one paper, I had to dig a bit to find the historic dependencies they used to get it to run at all.
- I’m not an expert, unlike (hopefully) the artifact reviewers. I can’t say much about formal concurrent logic proofs, taint analysis or machine learning models. This is not a problem in itself, but it reduced my ability to grade an artifact’s reusability fairly.
These scandalous conclusions will shock you!
Other miscellaneous findings:
- Academics are humans as well. They use stackoverflow.com like the rest of us (https://ruhr.social/@jzohren/109773605284308171):
- Artifact badge reviews often felt like they were mostly based on vibes. Some aspects are fairly standardized, e.g. requiring virtual machine or Docker container images for software artifacts. But the question of “reusability” heavily depends on the reuser’s implicit knowledge of the research field at hand. Even judging the correctness of submitted code requires a deep understanding of the problem domain. How do I know that a specific logical transformation is correct and that the authors did not hide a cheat somewhere in there?
- While the respective conference’s artifact evaluation committee is often allowed to contact the authors for assistance in getting the software up and running, a few years later the world has moved on: Download links for files missing from the VM images have gone 404, authors have changed institutes or simply died. Software rots, and artifact maintenance is rare. Who pays for that anyway? Universities want more papers published!
Outlook
After a few more rounds of this seminar, all the lab reports will be analyzed. Each artifact is supposed to be reviewed independently by at least two students from different universities. Some form of new paper will be written about those results. At least one conference will be attended. Science done. Move on. Cite other papers. Be cited by friends and colleagues. Use your prowess to beg research associations for external funding.
As for us students: We will be cited somewhere in the fine print of some paper. Leaving a small, citable footprint for the 3% who are interested in an academic career.
[^1]: Useless Trivia: The very-official title of “TU Dortmund University” is an abbreviation for “Technical University Dortmund University”.