It’s only fair to start this essay by confessing my anti-sports bias. I was a klutzy, book-oriented child who grew up in a boring place where team sports were a default for most families with children. But probably due to my lack of interest, my parents never signed me up for team sports, and I barely understood what was going on during the obligatory physical education classes where these games were played. To this day, I despise playing and watching games that involve balls. I have too many memories of larger, stronger kids throwing balls (football, basketball, lacrosse, table tennis, any kind) at me to watch my spastic reactions.
But this essay is not, thankfully, about sports. It’s about scientific research, particularly the field I work in, which has become uncomfortably suffused with a sports frame.
Some Background on “AI”
The term AI, for artificial intelligence, has become a commonplace term even in non-technical conversation. What exactly it refers to is a matter of some debate; I’ll start off with a bit of (biased, as always) historical context. In the 1950s, in the afterglow of the Second World War, researchers in certain branches of mathematics relevant to the emerging technology we know as “computers” started questioning just how much these machines could do. Alan Turing wrote a now-famous philosophical paper asking whether machines could “think” and, if so, how we would be able to tell. Researchers coined terms like artificial intelligence and computational linguistics to refer to emerging fields that blended the mathematics of computation (now known as computer science) with the sciences of the mind and its capabilities (psychology, linguistics, and so on). Research communities formed that simultaneously explored scientific/mathematical questions and practical uses to which these ideas might be applied. When I started publishing in the early 2000s, these included communities devoted to natural language processing (NLP),1 where I attached myself, computer vision, machine learning, robotics, and a “general” AI community. In my circles, the scientific questions had to do with how humans use and acquire language, and the practical ones were around building things like automatic translators and search engines that understood our questions and found answers.
In the post-WWII United States, the pace of progress in a scientific or engineering field depends above all on the extent to which the federal government allocates funding for it.2 The history of government investment in these new fields is fraught. As different theories competed, and their proponents rose to positions of influence in the government (especially the US Defense Department), funding moved from one thing to another. Yesterday’s hot topic quickly became today’s shibboleth. When I started out, for example, neural networks, a family of techniques that had been popularized in various earlier phases, were considered a joke. Looking back, it seems that a large part of any idea’s success was its branding; researchers would give clever names to their ideas that set them apart from what came before and, more importantly, told a story about “intelligence” that flattered the decisionmakers and their perspectives.
To a first approximation, to the extent that we can find a meaningful philosophical dimension to all of this, it is probably the rationalist-empiricist distinction, which in the AI fields boils down to:
Rationalists emphasize the importance of (fore-)knowledge in an AI system. They have often sought to define, and then construct (often manually), computational objects that store knowledge. The hard work of rationalist AI is to turn the messy facts of the world into something formal enough to represent in a computer. In its most extreme instantiations, rationalist AI is built on logic alone.
Empiricists emphasize the acquisition of such knowledge; the stored form it takes is not necessarily important. They have often translated this into a problem of statistical generalization. The hard work of empiricist AI is to create systems that can rapidly consume large amounts of data and extract something useful from it. In its most extreme instantiations, empiricist AI looks (to me) a lot like behaviorist theories in psychology.
In my experience, very few AI researchers are 100% rationalist or 100% empiricist for very long. In the late 1990s, when I was getting started, NLP was turning increasingly toward empirical methods. My mentors had been trained in both perspectives, and anticipated a future that creatively blended them, an attitude that has served me well in my own career. I’ve often characterized the rationalist-empiricist dichotomy as a “tired old debate,” and I think there are others that lead more fruitfully to creative progress (e.g., contrasting rationalist theories used to design an empirical system, or different mathematical formulations of what it means to learn that have a big effect on the efficiency of learning).
Since that time, the landscape has changed. Neural networks — which are an empirical strategy for “learning” from data — re-emerged in the 2010s as a powerful way to tackle problems across AI, including in NLP. The difference from their previous incarnations? There was a lot more data by then, and more powerful computers. Neural networks’ new branding as “deep learning” has been touted widely as the best bet for “solving” AI, whatever that means.3 Because they’ve become pervasive across the various research communities, it’s become a lot easier to do work that brings ideas together, for example, models of language and vision.
Because these methods accelerated progress on some well-known problems (most notably in computer vision and speech recognition), and because there’s something a certain kind of (typically rich and not well-read) person finds compelling about “artificial brains,” they’ve led to explosive growth in all of the AI research communities. Huge numbers of students are signing up to study and do research in AI. The pace of publication has increased dramatically, the demand for AI experts in industry and academia is unprecedented, and some researchers have been launched to celebrity status.
I cannot emphasize enough how different it is to be an NLP researcher in 2021 compared to 2006, when I finished my PhD and became a professor. When people asked me what I did back then — even other computer science professors — the reaction was either “that’s crazy and won’t work” (from people with some technical expertise) or “scary” (from people who’d read too much science fiction). Today, everyone has already heard of NLP, and I regularly see familiar projects and faces being discussed in the news. AI, whatever it is, is intrinsically linked with people’s thinking about the tech industry, the future of business and work, our deepest anxieties about what it means to be human, and international affairs. I really just studied this topic out of a fascination with language and computation.
Talking about Research
Finding one’s way in all of this can be hard intellectual work. Since the empirical paradigm is on the rise, many turn to its crisp question to make sense of progress: “what works best?” Our guide to finding a path through the research field, then, needs to be some objective means of comparing all the ideas in the discourse.
Doubtless you’ve heard some instances of the kind of frame I am going to criticize in this essay, perhaps the notion that there’s a “race” between economic powers (US and China) or tech companies (Google and Facebook) to achieve some AI goal. You’ve likely heard about some of the big successes of AI, which often focus on games like Chess or Go, or even Jeopardy, with a major press event following a computer’s win over human champions. Indeed, there is a long history of framing AI problems through the lens of games.
While I have no objection to researchers trying to automate game-playing (it doesn’t interest me, but I’m a fan of diversity in problems and solutions and can respect an elegant mathematical abstraction when I see one), I think it’s unwise to use this frame when we talk about scientific work, especially in training new researchers and presenting our work to the public.
Leaderboards and Rankings
A standard unit of research progress in our world occurs when someone builds a new system to solve a particular task (say, translation from German to English). The conventional narrative goes like this: based on past efforts to build German-English translators, I constrain the resources that my system has access to, so that the comparison to those earlier attempts will be “fair.” Then I build my system using some new elements, and use an established measure of accuracy to see whether it performs better than the best-published system in the literature on a standard testing dataset. In the 1990s, a range of tasks each saw a series of papers, arriving one or a few per year, from different research groups, each improving over the accuracy of the last. These papers would include a table that listed the past systems and the scores they achieved.
Today, these tables have become dynamic web content where one can watch in near-real time as the numbers go up and up as teams from around the world contribute new systems’ results. We call these “leaderboards,” and the notion of attaining the top spot, the “state of the art” designation, has become so important that sota is an earnestly used verb.
For a die-hard empiricist, this all sounds great: we’re all moving in the same direction (up!) and comparing new ideas in a fair manner. Progress is inevitable, and nobody’s going to waste time publishing research papers that don’t “sota.” By extension, it’s easy to know who the “best” researchers are: the ones who attain the state of the art and stay at the top of the leaderboard for the longest.
For a scholar, this path to truth has a lot of problems:
It assumes that the definition of the task — the inputs and outputs, the resources one is allowed to use to build the system — is relevant to a scientific question or a real-world use-case (perhaps an imagined one in the distant future). Any proposed improvement or discussion of problems with a task is met with resistance, because sota winners have a stake in the task as it stands.
It assumes the dataset is representative of the data that might be processed in those real-world use cases. (Over time, the more we compare on the same test dataset, the less plausible this assumption becomes, because the community is subtly adapting to the idiosyncrasies of that particular dataset; the toy simulation after this list illustrates one facet of the problem.)
It assumes that we have a valid measure of the accuracy of a system.4
The end always justifies the means. Inscrutable solutions, or computationally expensive ones, are not penalized; if they win the sota position, their publication is justified. There is no direct value placed on advancing our understanding of the problem or the solutions.
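To make that test-reuse worry concrete, here is a toy simulation in Python. It is my own sketch with invented numbers, and it captures only the simplest mechanism (selecting the luckiest of many equally good submissions), not the subtler community-wide adaptation, but the inflation is already visible:

```python
import random

random.seed(0)

TEST_SIZE = 1000        # items in the fixed, endlessly reused test set
TRUE_ACCURACY = 0.70    # every submission is, in truth, equally good
NUM_SUBMISSIONS = 200   # leaderboard entries accumulated over the years

def observed_score() -> float:
    """Score one submission: each test item is right with prob. TRUE_ACCURACY."""
    correct = sum(random.random() < TRUE_ACCURACY for _ in range(TEST_SIZE))
    return correct / TEST_SIZE

scores = [observed_score() for _ in range(NUM_SUBMISSIONS)]
print(f"true accuracy of every system: {TRUE_ACCURACY:.3f}")
print(f"leaderboard's best ('sota'):   {max(scores):.3f}")
# The maximum over 200 equally able submissions reliably lands a few points
# above the true accuracy: the top of the leaderboard overstates progress
# even though no system here is better than any other.
```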
In my view, a “sota” result and a leaderboard comparison have a legitimate role to play in a scientific argument. But they are neither necessary nor sufficient for publication, and leaderboards used by researchers should not be mistaken as benchmarks relevant to decisions about adopting a technique in a practical setting.
The Rules
I think the appeal of leaderboards is, for many, the same as the appeal of sports. In a sport or game, the rules are entirely constructed. They seem to emerge through history, legal deliberation, and a general goal of “keeping things interesting.” During play, they are rigid and non-negotiable. The same kinds of processes have been at play in the world of AI research. Now, research is very challenging, and in order to learn new things, one must almost always narrow the question to what is answerable. This is one of the hardest things for student-scholars to learn: what constitutes a good question that we have hope of answering?
In a healthy research community, research teams pursue questions that are sufficiently related that one lab’s new findings will be useful and informative in another lab. But when we frame research as a sport, we eliminate all talk of questions. (Imagine fans at a football game discussing the relative merits of soccer.) Instead, we have the leaderboard, and we arrive at a place where everyone working on the leaderboard’s task is asking the same question. While comparability of results is important, a race where every group is trying to get to the same result before the others (because the bar is raised every time someone achieves a new sota) is a really inefficient system.
In earlier times, the leaderboards were normally set up for short-term exercises (sometimes called “bake-offs”) where the period of competition was fixed, the scores published once, and — most importantly — participants would meet in person to share lessons learned and talk about next steps. This strikes me as a more sensible approach, because a big part of the conversation at those meetings was always about the rules themselves. A new event would inevitably change the rules from past events, to make the problem both more challenging and more relevant to scientific or engineering questions that remained open. The iterative process of redefining the task (the “rules”) was valued as research and was more tightly integrated with the construction of systems. Ideally, anyone who was building a system to compete in the bake-off was intimately familiar not just with the rules but with the assumptions and goals that motivated them. I remember graduate students of my era holding very strong opinions about the design of these system evaluations — stronger than those they held about what kinds of solutions would work the best.
I fear that today, many people are working to top the leaderboard but have little understanding of the rules that define the task and its evaluation, little appreciation for the shortcomings of those rules, and no opinion about how we might do better at defining leaderboards in the future. And the sports frame enables this; a star basketball player doesn’t need to have any critical thoughts about the game of basketball. They just need to win it.
Winners and Losers
The end-state of a game is that, usually, someone wins, and someone loses. This idea doesn’t transfer usefully to research. Of course it’s the case that, in the world of research, success leads a researcher or a group to increased resources (including things like research funding, job offers, ease of recruiting colleagues to one’s group, and various markers of prestige). Measuring the value of a person’s or group’s research output is a challenging task that requires extensive judgment. The sports frame eliminates all that complexity: the best researcher is the one who ekes out a sota result, or who has raised the most money, or whose papers are cited the most often.
Of course, choosing the thing that’s easy to measure is a value judgment. Too often, I’ve seen researchers take this shortcut.5
Tribalism
I’ve never fully understood the emotions of sports fans. (Suggested musical interlude.)
That kind of emotion should be considered hazardous in research. Of course one is invested in one’s own work and one’s ideas and wants them to succeed, but part of our mission is to put the truth first. Mature researchers will have had countless disappointments where they ended a line of pursuit, or stepped back to watch someone else’s idea win out over their own. Adding to this a preference to see one’s team win creates new ways for our critical thinking to fail.
Who’s in the Game?
In sports, the sides are clearly defined; a player does not switch teams mid-game. There’s also a clear delineation between players and spectators. It’s understood to be extremely difficult to play sports at the level where lots of people want to watch (e.g., professionally). Most young enthusiasts have no hope of making it to that level, even the very good ones.
I think this is the most dangerous failure of the sports frame in research. In a leaderboard competition, one does one’s talking on the field, that is to say, by submitting a system into the competition. A nonparticipant’s commentary will be dismissed (“if you think you have a better idea about how to solve this problem, go ahead and build it and enter it into the competition!”). While I agree that arguments need to be backed up with evidence (a sota result being one kind of evidence) and logic, I object to the idea that only those who build competitive systems deserve a voice.
Science works best when many different perspectives are brought together. The skill sets and resources needed to sota are undoubtedly correlated with specific kinds of perspectives: those that implement, and those that have the resources to carry out extensive experimentation before submitting an entry to the leaderboard. The implementation perspective is a valuable one, to be sure,6 but it is not the only one. The best research teams, in my experience, include implementation and experimentation skills as well as analytical skills and many others. There’s likely much more to be said about whether, and if so how and why, the emphasis on implementation correlates with identity attributes (gender, race, and so on), but I think there’s a widely held intuition that it does, which would imply that leaderboard culture contributes to our field’s inclusion challenges.
The resource issue worries me the most of all. I understand that in some sports leagues, differences in team “wealth” affect performance and also attitudes of fans. The worry in research is a little different. To the extent that we define progress around leaderboards, the sota aspiration shapes the minds of trainees and affects who chooses to pursue research in our field in the first place. I’ll try to illustrate this point by returning to my childhood PE class.
One day in middle school I somehow ended up playing basketball one-on-one with a friend who was also not athletic. Because the expectations were so low and the stakes were so low (my friend’s teasing came from a different place from the jeers of the other kids, and he didn’t throw the ball at me to make me flinch), I was able to think about the task at hand (e.g., getting the ball into the basket). I wasn’t good at it, but I could start to think about how I might improve with practice. “You’re stupid,” he told me, “about the physics of the ball.” That diagnosis was wonderful, because it reframed the problem from one of my inherent value (e.g., to a team) to a fixable problem. The solution to being stupid about something was always easy for me: I just needed to learn.
That day was memorable because it was so exceptional. Every other day, I was an object of derision. The main thing I learned was to avoid PE class by strategically scheduling volunteer time in the library or extra practice in the band room. I checked out of sports, and while I’ve never felt like I’m missing out, the sports frame in research always makes me think of all the people who are outside the frame, off the court, cast forever as mere spectators or, worse, unfortunates who can’t afford a ticket to the game.
1. Computational linguistics is the term used by our main research society, and hints more at the scientific questions, while natural language processing suggests applications. For our purposes, there’s a single community.
2. Yes, research happens in industry, too, but industry tends to focus heavily on the near term (a year or two). Long-term bets are made by federal investment in academic research.
3. As annoying as the hype around deep learning has been, one has to respect the genius of the branding. An old objection to empirical methods was that they were “shallow,” i.e., they didn’t grapple with the full complexity of the phenomena they were used to model and make predictions about. While that’s still true, the use of the term deep seems to prime us away from that objection. Never mind that the term deep refers to the many repeated rounds of calculation required to use the neural network. Picture a Microsoft Excel spreadsheet with inputs at the left, outputs at the right, and a huge number of columns in the middle that carry out intermediate calculations. As anyone who’s used spreadsheets can tell you, a lot of cells between the beginning and the end of the calculation does not imply that the spreadsheet is “intelligent.”
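For readers who want the spreadsheet analogy in runnable form, here is a minimal sketch (my own, with arbitrary layer sizes and random, untrained weights) of what “deep” refers to mechanically: the same multiply-then-threshold calculation repeated many times between input and output.

```python
import numpy as np

rng = np.random.default_rng(0)

def deep_network(x: np.ndarray, num_layers: int = 8, width: int = 16) -> np.ndarray:
    """Many repeated rounds of calculation: multiply by a weight matrix, then
    zero out negatives (a ReLU). Real systems learn the weights from data;
    these are random placeholders."""
    for _ in range(num_layers):
        W = rng.normal(scale=0.3, size=(width, width))
        x = np.maximum(0.0, W @ x)  # one block of intermediate spreadsheet "columns"
    return x

outputs = deep_network(rng.normal(size=16))
print(outputs[:4])
# Eight rounds of arithmetic sit between the inputs and the outputs. "Depth"
# is the number of intermediate calculation steps, nothing more mystical.
```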
4. For the translation example I mentioned earlier, this is not the case. We have tools that measure how “close” an automatic translation is to a given human translation (or a set of them), but these are far from perfect.
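To show what I mean, here is a toy version (my own sketch, with invented sentences) of the simplest ingredient in such tools: clipped word-overlap precision, the idea underlying well-known metrics like BLEU. Its failure modes are easy to provoke:

```python
from collections import Counter

def unigram_precision(system: str, reference: str) -> float:
    """Fraction of the system's words that also appear in the reference,
    crediting repeated words at most as often as the reference uses them."""
    sys_counts = Counter(system.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(n, ref_counts[w]) for w, n in sys_counts.items())
    return overlap / max(1, sum(sys_counts.values()))

reference = "the cat sat on the mat"
print(unigram_precision("the cat sat on the mat", reference))      # 1.0
print(unigram_precision("mat the on sat cat the", reference))      # 1.0: word salad scores perfectly
print(unigram_precision("a feline rested upon a rug", reference))  # 0.0: a fine paraphrase scores zero
```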
5. Not just in research. Scientists and engineers love to believe that they are using “objective” tools to make decisions. What can be more objective than measurements? Just rank job applicants by their scores and go for the best ones! Now, what measurements do we have that can go into the score?
6. Whenever I talk to other critics of leaderboard culture, I feel the need to bring up the state of the field before we had any objective, comparative evaluations at all. The papers from that era read like philosophy; thought experiments and anecdotes abound. Often one can’t quite tell whether the AI system being described was fully implemented or vaporware. If our research communities today, at their current scale, operated like that, it would be chaos.