Personal explorations of qualitative research in testing

One book I read a while ago from the office’s library is about qualitative research. It’s called ‘Reliability and validity in qualitative research’, by Jerome Kirk and Marc L. Miller.
It sounds fancy and scientific, and it has not been an easy read for me, but I really enjoyed it.
I found it full of great ideas. It contains some very consistent examples, discusses ways in which qualitative research can be performed, and identifies some really interesting aspects of this approach in relation to the social sciences and anthropology. But I do not intend to review this book.

Instead, I’ll try to discuss how some ideas in the book apply to my testing activities. As I read, I did some thought exercises and tried to identify how my work relates to the ideas presented.

As the title of the book suggests, its framework is given by the reliability problem and the validity problem that arise when performing qualitative research. I instantly related these two problems to testing.

I found that the anthropological research process and testing fit on different levels under the qualitative approach:

The process is similar

“(…) the full qualitative effort depends upon the ordered sequence of invention, discovery, interpretation, and explanation.” (page 60)

When I test a product, I go through a sequence of different activities that focus on different aspects of the testing process.

I think this structure of the qualitative assessment would fit my testing process, as each stage focuses on a specific kind of activity.
‘Invention denotes a phase of preparation, or research design; this phase produces a plan of action.’
‘Discovery denotes a phase of observation and measurement, or data collection; this phase produces information.’
‘Interpretation denotes a phase of evaluation, or analysis; this phase produces understanding.’
‘Explanation denotes a phase of communication, or packaging; this phase produces a message.’

There’s more to say about each of these phases, but I will cover them in a following article.

Importance of objectivity

“The assumptions underlying the search for objectivity are simple. There is a world of empirical reality out there. The way we perceive and understand that world is largely up to us, but the world does not tolerate all understandings of it equally (…)” (page 11)

If I lack objectivity, I may not test for what stakeholders find important.

The kind of objectivity I am thinking about is one composed of multiple subjective views.
It is up to me how I understand my mission and what I decide is important to focus on. But failing to consider others’ views in relation to my information objective might affect the outcome of the product under test.
So I collect information on which the final view of the product might rely, in the context of my information objective. For instance:

- I talk with the PM to get information from the business side, and try to understand the expectations for the new build.

- I search for similar products and identify the key features they rely on.

- I compare those similar products with the product I am testing and try to identify the problem the product is meant to solve for users.

- I search for testimonials/complaints from users of previous versions of the product and try to understand them in relation to the areas affected by the new build (bug fixes, new features, etc.).

What I consider and select as relevant information in my context is still up to me, but getting a glimpse of the world through the eyes of ‘others’ who matter enriches my understanding up to a certain level. This more objective view is likely to meet more tolerance from the empirical reality.

Understandings of behavior

“To focus on the validity of an observation or an instrument is to care about whether measurements have currency (what do the observations buy?), and about whether phenomena are properly labeled (what are the right names for the variable?).” (page 21)

Identifying a bug implies understanding whether the behaviour I am noticing is relevant to anyone who matters, and whether I am able to define that behaviour.

The analogy I am making here is with finding a behaviour in the application and labeling it as a bug, or it could be the dual problem of finding a behaviour and labeling it as a feature. I see this as a problem of pattern recognition and definition of that pattern.

For example, let’s say I press on a link in an application and I notice it filters some of the content displayed. Is my observation relevant and could this be a bug?
If for other links in the application the content displayed follows a different rule, it could be relevant. But it also could be a functionality for that particular link.
My observation gains currency if the rule I identified is tied to the behaviour I am noticing when pressing the link. Defining the behaviour helps me establish what could be happening when I click on the link. How is the content filtered? Is it really a filter applied, or what else could be providing the same outcome?

I also found that the methods used in qualitative research are similar to those I use in testing:

Serendipity occurs

“In science, as in life, dramatic new discoveries must almost by definition be accidental (‘serendipitous’). Indeed, they occur only in consequence of some kind of mistakes.” (page 16)

I accept that I find some of the interesting bugs by accident.

I think the connection between this idea and finding bugs by accident comes from the fact that we cannot follow all variables when performing tests. I’m often focused on certain aspects of an application and am surprised by an unexpected outcome or behaviour in areas I wasn’t paying much attention to.

There are numerous examples of this kind of situation in the context of complex systems like software. These situations usually involve complex scenarios, but sometimes I’ve experienced serendipity in straightforward tests: from discovering that double-clicking on a field that was not designed to be double-clicked crashes the application, to finding that an obsolete (not meant to be used) feature would break the mapping of names and ids in a database table and would eventually stop an entire desired feature from working.

Variety in techniques is helpful

“The most fertile search for validity comes from a combined series of different measures, each with its idiosyncratic weaknesses, each pointed to a single hypothesis. When a hypothesis can survive the confrontation of a series of complementary methods of testing, it contains a degree of validity unattainable by one tested within the more constricted framework of a single method.” (page 30, quotation from Webb, E. J., D. T. Campbell, R. D. Schwartz, and L. Sechrest (1966) Unobtrusive Measures. Chicago: Rand McNally)

Using a variety of testing techniques allows me to test more thoroughly.

Each technique I use while testing allows me to understand the product from one particular perspective. Collecting results from looking at the product from different angles gives me more confidence in the validity of the information I have about it. I often use the general testing techniques in James Bach’s Heuristic Test Strategy Model as a starting point, apply what seems appropriate for my context, and adapt the techniques to that context.

For example, when testing a Facebook app, I started with testing based on the claims in the documents I had received that described the application – this would be Requirements Testing.
Then I explored the application to identify and test what it can do – this would map to Function Testing.
Then I thought about different flows and sequences in the application and tested them: the user can navigate to a certain path, change his/her mind and decide to take a different branch in the application’s paths, access some resources from one location, then from another, etc. – Flow Testing.
I then thought about how I could overwhelm the application by providing a huge amount of data as input in the text fields – Domain Testing.
And I will continue thinking about different strategies for as long as I have the luxury of time.
This approach won’t help me validate a hypothesis like ‘this application has no more bugs’, but it can help me validate a weaker conclusion: there probably were no more bugs that I could find in the given time frame with the techniques I could think of. I can make this assertion in this situation, but not when I use only one technique in my tests.

Oracles and their fallibility

“Reliability depends essentially on explicitly described observational procedures. It is useful to distinguish several kinds of reliability. These are quixotic reliability, diachronic reliability, and synchronic reliability.” (page 41)

When choosing oracles, I am trying to be aware of the way in which they are fallible.

When I test, I rely on a set of oracles that are basically heuristics that help me interpret test results and decide if a test has passed or failed. The reliability of test results obtained via this set of oracles is a problem related to the validity problem. For example, if I choose oracles that provide consistent results, but none is able to provide a valid result, they may lead me to believe a test has passed when in fact it has failed, or vice versa.
On the other hand, if I choose an oracle that gives me alternating results (one time it tells me that a test has passed, and the next time that it has failed), I will again not be able to determine the validity of the test results.
So I need to build sets of oracles that would help me determine whether a test result is valid or not, and also provide consistent test results.

Here are 3 types of reliability that may apply to oracles, and possible indications of lack of validity that may be uncovered when using them:

“<Quixotic reliability> refers to the circumstances in which a single method of observation continually yields an unvarying measurement.” (page 41)

In this case, I could be basing my interpretation of test results on a single oracle, whose reliability comes from the fact that it gives the same result every time, even as I vary the test variables. The book gives a broken thermometer as an example. This type of oracle could be the ‘broken thermometer’ of testing.
For example, to test whether form data is saved after the user hits ‘Save’, I could use the display of the ‘Your form has been saved’ message box as an oracle.
This message may be displayed each time I make a modification and hit ‘Save’. But it could be that the display of the message box is triggered by the action of clicking ‘Save’, and not by the data actually being saved.
Using this oracle would provide reliability in my tests – the message would always be displayed when I made modifications in my form and hit ‘Save’, but it might tell me nothing about the data in my form actually being saved.
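
The ‘broken thermometer’ situation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: FakeForm, its 10-character limit, and both oracle functions are hypothetical and do not come from the book or from any real application. The message oracle always agrees with itself, while reading the data back reveals what it hides.

```python
class FakeForm:
    """Hypothetical form with an assumed bug: the confirmation message is
    always shown, but values longer than 10 characters are never stored."""

    def __init__(self):
        self._storage = {}
        self.last_message = None

    def save(self, field, value):
        if len(value) <= 10:  # assumed bug: longer values are silently dropped
            self._storage[field] = value
        # The message is tied to the click, not to the data being stored.
        self.last_message = "Your form has been saved"

    def load(self, field):
        return self._storage.get(field)


def message_oracle(form, field, value):
    """Passes whenever the confirmation message appears."""
    form.save(field, value)
    return form.last_message == "Your form has been saved"


def round_trip_oracle(form, field, value):
    """Passes only if the value read back equals the value that was saved."""
    form.save(field, value)
    return form.load(field) == value


inputs = ["Ann", "Bob", "a much longer name"]
message_verdicts = [message_oracle(FakeForm(), "name", v) for v in inputs]
round_trip_verdicts = [round_trip_oracle(FakeForm(), "name", v) for v in inputs]

print(message_verdicts)     # [True, True, True]
print(round_trip_verdicts)  # [True, True, False]
```

The first oracle is perfectly reliable and says nothing about saving; the second one is still fallible, but at least it varies with the thing I care about.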

“<Diachronic reliability> refers to the stability of an observation through time.” (page 42)

The expectation that tests provide the same results over time via an oracle is valid only if the application I am testing and the environment it runs in remain unchanged. This is very unlikely to happen with software over long intervals of time.
Choosing oracles that are fixed in terms of many parameters and do not adapt to the variations introduced in the application may quickly make tests obsolete. This can be the case with regression tests.
If I choose my oracle to be a set of existing test ideas, in spite of the reliability they may provide during multiple test sessions, they can become quite irrelevant once new features are included in my application or modifications are made. Repeating the same tests over and over will at some point stop giving me new information.
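
One rough way to notice this drift is to compare what a fixed set of tests exercises with what the current build contains. The sketch below assumes I keep a list of features per build and a mapping from each regression test to the feature it covers; all names are hypothetical.

```python
# Hypothetical feature lists for two builds of the same application.
features_v1 = {"login", "search", "export"}
features_v2 = {"login", "search", "export", "share", "bulk_edit"}

# Hypothetical regression suite written against v1:
# test name -> feature the test exercises.
regression_suite = {
    "test_login_ok": "login",
    "test_search_by_name": "search",
    "test_export_csv": "export",
}


def untested_features(features, suite):
    """Return the features of a build that no test in the suite exercises."""
    return sorted(features - set(suite.values()))


print(untested_features(features_v1, regression_suite))  # []
print(untested_features(features_v2, regression_suite))  # ['bulk_edit', 'share']
```

The suite gives the same stable results on both builds, yet on the second build it is silent about two features.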

“<Synchronic reliability> refers to the similarity of observations within the same time period. Unlike quixotic reliability, synchronic reliability rarely involves identical observations, but rather observations that are consistent with respect to the particular features of interest to the observer.” (page 42)

An oracle that could be subject to synchronic reliability is: ‘two similar features/fields could work and be implemented in a similar way’. In the BBST course I encountered this as the ‘consistency within product’ oracle.
As the authors of the book observe, this case of reliability can be most useful when it fails.
When this type of reliability fails for two very similar fields, for example, it can reveal that certain validations are missing for one of them, and thus that SQL injection attacks may be possible on one field and not on the other.
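
A ‘consistency within product’ check of this kind might look like the sketch below. The two validators are hypothetical: I am assuming that first_name was given a character whitelist and that last_name only got a length check. A disagreement does not prove that an injection is possible; it only points at where to investigate.

```python
import re


# Hypothetical validators for two similar fields.
def validate_first_name(value):
    # Assumed whitelist: letters, apostrophe, space, hyphen, 1 to 50 characters.
    return re.fullmatch(r"[A-Za-z' -]{1,50}", value) is not None


def validate_last_name(value):
    # Assumed omission: only the length is checked.
    return 1 <= len(value) <= 50


def inconsistencies(validator_a, validator_b, probes):
    """Return the probes on which the two validators disagree."""
    return [p for p in probes if validator_a(p) != validator_b(p)]


probes = [
    "Smith",
    "O'Brien",
    "",
    "x' OR '1'='1",
    "Robert'); DROP TABLE users;--",
]
diff = inconsistencies(validate_first_name, validate_last_name, probes)
print(diff)
```

With these assumed validators, the ordinary names and the empty string are treated the same way by both fields, and only the two injection-style probes expose the difference.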

Recording gives insights

“For expert or student, therefore, the whole point of devoting time to recording is not merely to make sure he will have materials down in black and white upon which to base his final report, but also to insure that he has the opportunity, while in the field or fresh from it, to relate insightful experience to theoretical analysis, percepts to concept, back and forth, in a kind of weaving of the fabric of knowledge” (page 59, quotation from Junker, B. H. (1960) Field Work. Chicago: University of Chicago Press)

Recording my testing work helps me think through my strategy and tests.

I use a variety of ways to record my testing, ranging from scribbling test ideas and noting the bugs I find, the questions that arise while I test, and the issues I run into, to filling in dashboards or creating mindmaps.

I choose the particular way of recording based on my context. I also combine them sometimes, depending on my needs.
Let’s take an extensive test session as an example. For me, this is a session that covers several new features and bug fixes for a new build release. I may have about 2 weeks for testing in this scenario.
In an extensive test session, I may start with a high-level mindmap that gives me a plan of action, and keep notes on particular test ideas and questions in a Word document.

After reading this passage in the book, I thought about 3 uses for the recordings I make:

- generating the reports I need for stakeholders:

When I review the records I created, I often build any necessary reports from them.

- analyzing the tests I made and improving my thinking:

I can obtain information on how I applied a test strategy, and identify what did not go according to my initial plan. For example, I once looked at the records I made of test scenarios and realized that finding issues in a certain area had driven me to generate many test ideas for that area, so I stayed focused on one feature and came up with fewer and more ‘shallow’ test ideas for other areas.
The records can also help me identify aspects I did not consider for certain functionalities, by analogy with other functionalities I did consider, which makes them actual tools that enhance my thinking.

- analyzing my testing process:

I found that analyzing the records on a meta level can actually give me information on how I organize my testing process. For example, if my records contain many questions, maybe I should try to understand why at that particular stage in testing I find myself in the middle of all those questions. Is there a communication gap between me and the rest of the team in relation to recent changes? Am I using an approach that is not very effective, since perhaps it would have been useful to have some of those questions answered before getting to that point in the testing process? Am I missing some core detail?

I find that the records I make often generate questions in my head, which to me means that they trigger my thinking. So they become more than tools for communicating test results. They are tools I can use for thinking through my strategy and my tests.

Conclusion:

I was struck by how similarly these ideas apply to anthropological studies and to testing. It may be that the parallel comes from the fact that both activities are investigative. Realizing this gave me once again a sense of how broad an area testing is, and how vast the space of valuable things I can learn to improve myself in this job. I wanted to capture this realization through these ideas, so that I can make it concrete, keep it in the back of my mind, and reach for it as often as possible.