Evaluation is the worst form of HCI research except all those other forms that have been tried

Shumin Zhai

IBM Almaden Research Center

 

Abstract: Yes, all evaluation methods have limitations and deficiencies, just as democracy as a form of government is full of deficiencies. The fact is that we do not have a better alternative. We need better and deeper, not fewer and shallower, evaluations. HCI cannot be a faith-based enterprise.

 

Democracy is the worst form of government except all those other forms that have been tried. —  Winston Churchill

This essay is inspired — well, provoked — by Henry Lieberman’s critique — well, tirade — “Rant: The Tyranny of Evaluation”. It is intended in the same spirit as Lieberman’s paper, encouraged by the CHI Fringe format: not-so-serious (counter) provocations.

I sense a widespread frustration in the CHI community, not only from Lieberman’s essay but also from many other researchers and graduate students, with the need to do evaluation-based user-interface research and development. Unfortunately, just as democracy, a political system full of deficiencies, still serves as the foundation of modern societies, evaluation is and will be a necessary and basic tool of HCI research for many years to come, if not forever.

Granted, quantitative and controlled experiments or other forms of evaluation, such as field studies, can be burdensome and limited, and can often seem not worth the time and effort for the overall understanding of some systems. “All evaluation methods suck”, as Henry argued. But replacing evaluation with mere judgment, particularly the inventor’s or designer’s own, sucks even more.

Yes, user interface design is part art, but it is an art with practical consequences, much as architecture is. A user interface is not a purely personal artistic expression that the audience can either take or leave. Users of HCI designs have to live with them.

I argue that user interface research needs evaluation — and not because the field of HCI has physics-envy. To the contrary, it is the lack of strong theories, models and laws that forces us to do evaluative experiments that check our intuition and imagination. With well-established physical laws and models, modern engineering practice does not need comparative testing for every design; the confidence comes from calculations based on theory and experience. But lacking the ability to do calculations of this sort, we must resort to evaluation and testing if we do not want to turn HCI into a “faith-based” enterprise.

Much of the frustration and complaint about the need for evaluation stems from too little, rather than too much, understanding of evaluation principles and experimental design. Yes, the following is a very frequent comment in a negative review of a CHI submission: “There were only 12 subjects used in the experiment. The results cannot be trusted” — even though the submission clearly reported the experimental design and procedure, the degrees of freedom involved, the probability value, the magnitude of the difference, and the type of statistical test performed to support the level of confidence. Reviewers who make this kind of comment obviously have no knowledge of experimental design and statistical analysis. They contribute to the false impression that CHI submissions and reviews are all about evaluation. Reviewers uneducated in proper methods of evaluation are also the ones who accept papers they do not understand when overwhelmed by numbers and equations. The only way to rectify superficial reviews, unfounded rejections, and unworthy acceptances is to have more educated reviewers who truly understand the strength, the science, and the limitations of various evaluation methods[1]. Simply dismissing evaluation does not help.
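
As a minimal sketch of why a small sample is not automatically disqualifying, the code below runs a paired t-test on simulated completion times for 12 participants who use two interface variants in a within-subjects design. The variable names, random seed, and assumed effect size are invented for illustration only, not drawn from any actual study.

```python
# A hypothetical sketch: the data are simulated, and the point is only that a
# within-subjects design with 12 participants can support a confident
# conclusion when the test, degrees of freedom, and effect size are reported.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_participants = 12
# Simulated task-completion times (seconds) for the same 12 people on
# interface A and interface B; B is assumed roughly 10% faster on average.
times_a = rng.normal(loc=20.0, scale=3.0, size=n_participants)
times_b = times_a * rng.normal(loc=0.90, scale=0.05, size=n_participants)

# Paired t-test: each participant serves as their own control, which is what
# gives a small sample its statistical power.
t_stat, p_value = stats.ttest_rel(times_a, times_b)
mean_diff = np.mean(times_a - times_b)

print(f"t({n_participants - 1}) = {t_stat:.2f}, p = {p_value:.4f}, "
      f"mean difference = {mean_diff:.2f} s")
```

The point is not the particular numbers, but that the reported test, degrees of freedom, and probability value, rather than the raw subject count, are what carry the level of confidence.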

There are many other misunderstandings and poor practices of experimental study in HCI. Let’s consider a few.

  1. Experimental study is about mindless A-B comparisons.

    Good experimental study is often based on sound and deep analysis, with a well-founded rationale, motivation, hypotheses, and a quantitative and predictive model. Evaluation should not be narrowly focused on a conclusion (or, worse yet, a bet). The value of evaluation quite often comes from the process of conducting the experiment, which often deepens one’s understanding. It is not uncommon for a researcher or designer to realize the limitation of an idea or design in the process of setting up its evaluation. The insight gained by exercising a design idea on a real task is often enough to make one go back to the drawing board to improve the idea or come up with entirely new ideas.
  2. Evaluation has only one method.

    In fact, the first task of evaluation is to design and choose the method appropriate to the goal of the evaluation. For example, if a perceptual, motor or cognitive advantage is expected from a certain design dimension, a controlled laboratory experiment may be the most suitable method. If well-established models exist, a calculation and deduction may suffice (see the sketch after this list). In other cases, a field study, an in-situ observation or a survey may be more appropriate.
  3. Evaluative studies are over-generalized.

    Yes, readers often over-generalize the results of studies, and sometimes authors consciously or unconsciously mislead readers towards such over-generalization. Part of the beauty of a well-conducted and well-presented formal evaluation, in comparison to informal subjective judgment, is that the task, the method, the measurement and the procedure of the evaluation are precisely described, so the conclusion of the study is not a blank check.
  4. Because user behavior, performance, style, age, experience, gender, task, etc., vary from person to person (“people aren’t balls”), you can’t control them all, and therefore experimentation is not meaningful.

    Of course, there will be variation among human users along many dimensions. The question is how the author’s construct (a new interface, a new idea, etc.) relates to these dimensions. One needs to randomly or representatively sample along some of these dimensions, and control others, to show a compelling or useful difference (e.g., design A is more beneficial for type X users). A very basic point of experimental design and statistical analysis in the behavioral and cognitive sciences (unlike Newtonian physics) is that meaningful conclusions can be drawn despite the variation. Furthermore, powerful evaluations do not necessarily yield a simple-minded “A is better than B on average”, but rather reveal how users’ performance and preference interact and change along relevant dimensions and tasks.
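
To make the “calculation and deduction” option in point 2 concrete, here is a minimal sketch that uses Fitts’ law to predict pointing times for two hypothetical target sizes. The regression coefficients and target dimensions are invented placeholders standing in for values one would normally take from prior calibration studies, not measurements from any real device.

```python
# A hypothetical sketch of model-based deduction: with assumed (not measured)
# Fitts' law coefficients, we can predict pointing time for candidate target
# sizes without running a new comparative experiment for every design.
import math

# Placeholder regression coefficients for some pointing device.
a = 0.20  # intercept, in seconds
b = 0.15  # slope, in seconds per bit

def fitts_time(distance_mm: float, width_mm: float) -> float:
    """Predicted movement time using the Shannon formulation of Fitts' law."""
    index_of_difficulty = math.log2(distance_mm / width_mm + 1)  # in bits
    return a + b * index_of_difficulty

# Two candidate designs: a 6 mm and a 12 mm wide target, both 120 mm away.
for width in (6.0, 12.0):
    print(f"width {width:4.1f} mm -> predicted time {fitts_time(120.0, width):.3f} s")
```

Under such assumptions, a designer can compare candidate layouts by calculation and reserve a formal experiment for the cases the model does not cover.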

 

Surely, as the number of variables of interest increases, the experiment will be more complex, more difficult, and more expensive to conduct, but that does not mean the researcher or designer’s subjective opinion is a better alternative.

Ideally, the evaluation of new user interfaces and concepts would be conducted in a double-blind fashion by a third party, rather than by the originators themselves, as in medicine. This isn’t likely to happen in HCI research (though it does happen in some product development). Research is exploration, and researchers have the responsibility to convince first themselves and then the research community that a desirable path has been found, either by plausible reasoning (which is a form of evaluation) or by experimental study. Simple claims aren’t enough. In fact, at CHI conferences, novel and compelling ideas without evaluation are often favored over evaluation studies of known ideas, particularly when the results are negative or null. As a result, the CHI community beyond the paper committee sometimes learns only about the appealing initial ideas, and not about the actual effects of those ideas, whether measured by a third party or by the originators later.

Just as no society is one hundred percent democratic, evaluation is by no means the only method of HCI research. Innovation and creativity, which are also required in designing a good evaluation procedure, scenario, and paradigm, are the fundamental driving force of any research field. CHI in fact accepts many papers that do not have any formal evaluation. Innovations based on sound and well-articulated rationale, designs based on well-accepted models, principles and theories, and new conceptual contributions that are truly compelling should be and have been accepted. When rationale, theory, modeling, and principles are insufficient (which unfortunately is true most of the time), evaluation is what we have to rely on. What I do insist on — and the HCI community should as well — is that subjective opinion, authority, intimidation, fashion or fad can’t substitute for evaluation. As Henry Lieberman argued, we need judgment in the evaluation of user interfaces, but that should only mean the judicious application of evaluative methods for sound conclusions. It by no means implies that we can discard evaluation and simply accept inventors’ and designers’ claims.

 



[1] It is beyond the scope of this essay to discuss the type and quality of accepted and rejected CHI papers. An interesting phenomenon is that different people complain about CHI for opposite reasons. We have all heard these sorts of objections: “CHI is all about hype and unproven ideas,” “CHI is all about boring evaluations,” “CHI is all about technology,” “CHI offers no real technology.” In my view, the real problem is not the type of work that is accepted or rejected, but the uneven quality of reviewers, who are drawn from a largely open volunteer database. Qualified reviewers must contribute more time to reviewing CHI submissions if we are to raise the level of discourse at CHI.