It upsets me when papers make the effort to run proper usability studies and then misuse the results because the authors confuse qualitative and quantitative data.
As nicely explained in my HCI notes, quantitative data has a “structure of integers or real numbers, e.g. number of errors (discrete) or time to complete task (continuous)”, whilst qualitative data is essentially everything else, e.g. categories of actions, preferences or opinions. The confusion happens when qualitative data is given a rating scale with numbers on it. These numbers are automatically (and wrongly) assumed to be quantitative, so averages and other mathematical operations are applied to them, which makes no sense for qualitative data. Everyone rates things differently; someone’s “good” is another’s “indifferent”, so the ratings are not measured against a common criterion. It is also very difficult to evaluate the reliability of users’ opinions and the validity of their ratings. Participants may be in a good mood and rate things more highly, feel empathy for the system designer, or, through confirmation bias, give the designer the results they want. Equally, they may have a very high standard for what counts as good, or be in a bad mood, and give lower ratings.
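To illustrate why averaging rating labels is fragile, here is a minimal Python sketch. The responses and the second ("stretched") coding are invented purely for illustration; the only assumption is that any numbers attached to the labels need to preserve their order, which both codings do. Under one coding group A has the higher mean, under the other group B does, so the comparison depends on an arbitrary choice of numbers rather than on what the participants said.

```python
from statistics import mean

LABELS = ["very difficult", "difficult", "neutral", "easy", "very easy"]

# Hypothetical responses, invented for this example only.
group_a = ["easy", "easy", "easy", "easy"]
group_b = ["very easy", "very easy", "neutral", "difficult"]

# Two codings that both respect the order of the labels.
linear_codes = dict(zip(LABELS, [-2, -1, 0, 1, 2]))
stretched_codes = dict(zip(LABELS, [0, 1, 2, 3, 10]))

def mean_score(responses, codes):
    """Average the numeric codes assigned to a list of rating labels."""
    return mean(codes[r] for r in responses)

for name, codes in [("linear", linear_codes), ("stretched", stretched_codes)]:
    a = mean_score(group_a, codes)
    b = mean_score(group_b, codes)
    print(f"{name} coding: A={a}, B={b}, A ahead of B: {a > b}")
```

Nothing about the answers changed between the two runs, only the spacing assumed between the labels, which is exactly the information a rating scale does not provide.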
For a good example of a bad usability evaluation, take the paper “Computer forensic timeline visualization tool” by Jens Olsson and Martin Boldt. This is otherwise a very good paper, but it is let down by its usability study. The authors asked their usability participants to rate their tool’s ease of use against another tool’s. The participants had to pick their answers from “very difficult”, “difficult”, “neutral”, “easy” and “very easy”. The authors mistakenly tried to convert this into quantitative data by assigning the values -2 (very difficult) to +2 (very easy) and averaging the participants’ results. They then focused on the final averaged numbers instead of discussing the results as a whole. Two questions got the same average (-0.1667), yet the actual responses show quite different distributions:
| Question | Very difficult (-2) | Difficult (-1) | Neutral (0) | Easy (1) | Very easy (2) | Average |
In my opinion, the replies to question 1 leaned more towards the neutral and easy ratings, whilst question 3’s replies were more dispersed. It does not help that they only had six participants, so the results are quite inconclusive. Qualitative data is difficult to evaluate, but trying to force it into something it isn’t just makes the results nonsensical and often useless.
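The effect of a shared average hiding different answers can be sketched in a few lines of Python. The two response sets below are hypothetical, not the paper's data; they are chosen only so that six coded answers sum to -1 and therefore average -0.1667. Reporting the count per category, rather than the mean, keeps the difference visible.

```python
from collections import Counter
from statistics import mean

LABELS = ["very difficult", "difficult", "neutral", "easy", "very easy"]
CODES = dict(zip(LABELS, [-2, -1, 0, 1, 2]))

# Hypothetical answers from six participants (NOT taken from the paper).
clustered = ["very difficult", "neutral", "neutral", "neutral", "neutral", "easy"]
dispersed = ["very difficult", "very difficult", "difficult", "easy", "easy", "very easy"]

def summarise(responses):
    """Return the coded mean and the number of answers per label."""
    counts = Counter(responses)
    return {
        "mean": round(mean(CODES[r] for r in responses), 4),
        "counts": [counts[label] for label in LABELS],
    }

for name, responses in [("clustered", clustered), ("dispersed", dispersed)]:
    print(name, summarise(responses))
```

Both sets report the same mean, yet one is mostly "neutral" and the other has no "neutral" answers at all, so a table or bar chart of counts says far more than the single averaged figure.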