What criteria do you use for that?

For example, a lawyer's work can be evaluated by the number of cases won in court, and a doctor's by the number of lives saved.

But what about testers?
How can you tell that one tester is good and another isn't qualified enough?


Replies to This Discussion

There's always a difficulty with how tangible any type of measure is, and this can also lead to measurement dysfunction, but we've used the following on occasion:

- Quality of bug reports
- Bug success rates (how many resulted in fixes)
- Number of bug reports returned as duplicates, or lacking enough information
Good points, Simon!

I just thought... Couldn't "lacking enough information" be one of the aspects of bug report quality? And what about duplicate bugs? Maybe they're an indicator that bugs aren't being fixed?

Isn't it better to pay attention to the number of rejected bugs?

Good questions!

Please consider that lawyers and doctors can do a stellar job and miss their respective objectives.

It is my opinion that in addition to soft skills, to evaluate the work of test engineers one must look at:
1. How well they meet the job requirements - beyond defects.
2. Barring all delays that originated elsewhere, how timely are deliverables (including defects)?
3. What is the quality of deliverable work?
4. What do they do to improve processes and practices?
... etc...

Bear in mind...
1. that as apps mature, defect-discovery opportunities will likely decrease.
2. "bugs not fixed" may be an indicator of issues having nothing to do with test engineers.
3. "bug quality" is in the eyes of the beholder and therefore subjective. Granted, one can set a standard - but how many do and how robust is the standard?
4. "duplicate bugs", especially on teams of test engineers, can be a symptom of lack of coordination and not necessarily an omission on the part of the test engineers. Now if the same engineer submits the same defect report multiple times for the same version of the app, in a time-frame where forgetfulness is likely not a factor, AND the tools facilitate duplicate-detection, then one might end up wearing a face ablaze with wonderment.

Under this link is something to consider when using defects and attributes to evaluate test engineers.
Good point Jake - this article suggests that the very best surgeons may actually appear to be less successful than their peers, if you look only at the numbers - because they take on the riskiest cases, and are likely to have a higher mortality rate. Penalising your best surgeons would not seem to be good for the patients.

I've actually never been measured as a tester with metrics like bug count, rejected bugs, test run rate, or anything of the sort. I feel a real horror at the idea - in a previous life I was a callcentre trainer, so I know all too well how incredibly poor any kind of metrics are at evaluating intelligent individuals who have a real investment in learning how to game the system. I hated having to penalise and sometimes get rid of people who did a better job than their neighbours, but just didn't hit the numbers (sometimes because they wanted to do a better job). I hated having to reward people I knew were just playing the numbers, who couldn't care less about doing a good job and would cut customers off as soon as they knew the manager's eyes were off them.

My test managers have always looked at the items you suggested above to evaluate me: how well do I meet the job requirements, do I get the job done on time (and do I warn them if it's likely to run late so they can plan for that), do I get it done well, what work I do outside of running and writing tests to improve what we do and how we do it -- in short, what do I contribute to the team and to the projects I work on. That's not something you can pull out of Test Director in a nice bar chart. It takes time. It takes effort on their part to seek feedback from the people I work with, to discuss issues with me to find out what happened and why, to see what I was doing when I wasn't passing tests or raising defects but still doing real important work. It means they have to go out on a bit of a limb and exercise their professional judgement. But I value that so much more, and I don't think there's a good shortcut for that work.

Read the whole paper, but pay special attention to the last section.

Any reward system that is based on numbers--especially monodimensional numbers--can and will be gamed.

If you reward your testers based on the number of bugs, you provide an incentive for your testers to report duplicate, spurious, and shallow bugs, and you provide a disincentive for deep and thoughtful testing.

If you punish your testers for reporting duplicate bugs, you provide an incentive for them to search the database to make sure that they aren't reporting duplicates--which takes time away from testing.

If you punish your testers for rejected bugs, you provide a disincentive for them to report observations that might be bugs--at which point representativeness bias kicks in, and big problems that merely look like small problems will go unreported.

Instead of focusing on quantitative measures, focus on qualitative ones. Seek value and information, rather than numbers and data.

---Michael B.
"If you punish your testers for reporting duplicate bugs, you provide an incentive for them to search the database to make sure that they aren't reporting duplicates--which takes time away from testing."

This is an interesting point. I agree that searching for duplicate defects takes up time which could be spent testing; however, someone, somewhere on the project has to feel the pain, and cost, of the activity which hunts for similar bugs and classifies duplicates.

I feel that as a tester you have a duty to ensure you're not raising duplicate defects and increasing the cost to the project which can be caused by many duplicate defects being raised.

We might not measure an objective based purely on numbers, but we might challenge testers to only raise n % of duplicates and then spot the variances in the final numbers (what's the norm, who's above the average, and why?).
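As a rough sketch of the "spot the variances" idea: assuming you can export per-tester counts from your defect tracker, something like the Python below would show who sits above the team average. The tester names and numbers are made up, and the "norm" here is just a simple mean of per-tester rates. It only tells you where to start asking why; it says nothing about the answer.

```python
# Hypothetical per-tester counts: (reports raised, reports marked duplicate).
counts = {
    "A": (40, 4),
    "B": (80, 20),
    "C": (25, 0),
}


def duplicate_rates(counts):
    """Return each tester's duplicate percentage, skipping testers with no reports."""
    return {
        tester: 100.0 * dupes / total
        for tester, (total, dupes) in counts.items()
        if total
    }


rates = duplicate_rates(counts)

# "What's the norm?" -- taken here as the simple mean of the per-tester rates.
team_average = sum(rates.values()) / len(rates)

# "Who's above the average?" -- a prompt for a conversation, not a verdict.
above_average = sorted(t for t, r in rates.items() if r > team_average)

print(rates)
print(above_average)
```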
I agree that reducing the number of duplicates is a good thing, and personally, I would say that it's my job as a tester to ensure I take a look to see if someone else is already following this one before I invest the time in logging it or investigating it, but I'm not so sure that targeting testers on achieving a specific percentage is a great idea.

How would you define a good %age of duplicates? Does it vary from project to project? What other things might cause variation? How much effort do you want to put into tracking down the why and the what of each and every duplicate? Does the %age of duplicates raised always correspond to something you actually personally care about and can influence, or is it often due to variables beyond your control as a test manager?

The story that jumps to mind as a test is the following:

Let's say you have three testers, A, B, and C.

On your first project, A has a fairly low %age of duplicates, B a fairly high %age of duplicates, and C has zero duplicates.

On your second project, A has a high %age, B a low one, and C has zero again.

On your third project, A has a low %age, B a high one, and C again gets zero duplicates.

What's going on here? What happened? What could have caused that pattern?

Well, for the sake of *my* argument, I'm going to suggest the following story. I'm sure you can come up with many others:

Tester B raises a lot of defect reports, which tend to be somewhat skimpy on the detail. Skimpy enough, in fact, that it's a bit hard to work out what exactly they're about, without spending some time discussing the defect with B. Often the titles and description are a bit misleading. Tester B tends to fire off a defect report within a minute of discovering a defect, without much further investigation.

Tester A raises fewer defect reports. Tester A tends to take a bit longer to raise their reports, as they usually include supporting information, and A likes to try a few more things to ensure they've isolated the underlying cause of the defect. A has a quick look in the queue before raising the defect, but doesn't recognise B's defect as being the same, as it's so vague and ambiguous.

In the first project, Developer D picks up B's defect reports straightaway as they're first in the queue. D starts investigating, though it's a bit painful as it requires a few email exchanges with B to get the information D needs. When A's reports pop up in the queue, D recognises the defect immediately and marks them as duplicates.

In the second project, Developer D has changed tactics. Now D doesn't deal with the defects in the order they were raised in, but sifts through for well described defects first. D also knows A raises good defect reports, and tends to open those first. When D gets to B's defects, those get marked as duplicates.

In the third project, Developer D has moved on. Developer E comes in in the morning and picks the first defect off the queue to start investigating... and just like the first project, B's defects get read first, and A's get marked as duplicates.

Oh yeah - Tester C. Well, Tester C tends to specialise in a particular area of the system under test which nobody else knows much about. Nobody else ever raises defects in that area because only C tests it.
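To make the story concrete, here is a minimal Python sketch. All of the data is hypothetical, and it assumes the simplest possible rule: whichever report a developer reads first for a given defect is kept, and any later report of the same defect is marked as the duplicate. The same set of reports then produces opposite duplicate counts for A and B depending only on the order in which the queue is read.

```python
from collections import Counter

# Hypothetical reports in the order they were raised: (tester, underlying defect).
# A and B both hit defects 1 and 2 (B files first); only C tests defect 3.
reports = [
    ("B", 1), ("A", 1),
    ("B", 2), ("A", 2),
    ("C", 3),
]


def duplicates_by_tester(triage_order):
    """Count duplicates per tester for a given reading order.

    The first report read for a defect is kept; every later report of the
    same defect is attributed to its tester as a duplicate.
    """
    seen = set()
    dupes = Counter()
    for tester, defect in triage_order:
        if defect in seen:
            dupes[tester] += 1
        else:
            seen.add(defect)
    return dict(dupes)


# Projects one and three: the queue is read in the order reports were raised.
in_queue_order = duplicates_by_tester(reports)

# Project two: the developer opens A's reports first (stable sort keeps the rest in order).
a_first = duplicates_by_tester(sorted(reports, key=lambda r: r[0] != "A"))

print(in_queue_order)  # {'A': 2}
print(a_first)         # {'B': 2}
```

C never appears in either result, simply because nobody else reports in C's area.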

Now, okay - as I said, you could make up umpteen different stories to explain the metrics. But that's the point - you can make up umpteen different stories. And until you go and talk to people, get feedback from Developer D and Developer E and probably quite a few other different people, you just don't know which, if any, of your stories might be true.

But if you're going to all that effort to gather feedback, and in the end the feedback is what has the final say in how you evaluate testers A, B, and C: then why not cut out the metrics step? It's not saving you any time. You could be putting that time into coaching B to raise better defect reports, or maybe thinking about how C is a single point of failure and perhaps you'd better start buddying them up with someone else in the team so that if C is sick, or is hit by a winning lottery ticket on their way to work one morning, the test team will still be able to function without C.
"This is an interesting point. I agree that searching for duplicate defects takes up time which could be spent testing; however, someone, somewhere on the project has to feel the pain, and cost, of the activity which hunts for similar bugs and classifies duplicates."

Has to? What alternatives might exist any time anyone says that something has to be this way?

If your metrics are being used as control metrics ("only raise N % of duplicates"), I guarantee that you'll drive distortion.

If your metrics are used as inquiry metrics ("Why are we raising so many duplicates compared to last time?"), you decrease the likelihood of making the numbers your masters. They should be our servants.

---Michael B.
There's nothing we do where metrics are used as control. However, I think you can challenge testers to ensure they're raising as few duplicate defects as possible, and I think you can look at the numbers to help guide your questioning.

Why I said someone has to feel the pain is because if your testers aren't actively trying to reduce the number of duplicate defects they're raising, then the developers are going to spend more time than they might have to reviewing the defects, understanding they're duplicates, and setting the classification as such. Either way, you'll have testers spending time ensuring they aren't raising duplicates rather than finding defects, or you'll have developers doing the filtering and not spending time fixing important defects - unless you have a middle-man, but what project can afford that luxury?

Maybe I'm missing your point but as previously mentioned, we don't use metrics to judge tester performance.
I can't tell if you're missing the point or not, so I'll try again.

The kinds of numbers in software engineering are at best very rough first-order approximations whose sole purpose should be to prompt questions.

Two of the measures you cited near the top of the thread were

- Bug success rates (how many resulted in fixes)
- Number of bug reports returned as duplicates, or lacking enough information

I'd be worried about those being used to evaluate the quality of the tester's work, were the numbers used directly for the evaluation, rather than to trigger questions--and then, only maybe. I'd be much more comfortable with your first point

- Quality of bug reports (of which some dimensions might include high bug success rates or low duplication, maybe)

and to which I would add

- Quality of status reports and other work products
- Value to other members of the testing and development teams
- Application of unique or important skills
- Attitude towards colleagues and towards the work
- Capacity to learn and research quickly
- Capacity to teach effectively
- Capacity to manage time effectively
- Significance of problems found
- Ability to respond to changing priorities
- Capacity to observe unanticipated problems

and plenty of others.

Note that these are things that can be easily assessed, but not so easily measured.

---Michael B.
I agree with what you're saying and as I mentioned before, we use the numbers to prompt questions around performance.

Any confusion around this topic probably comes down to choice of language or a lack of information on my part. As an example, we used the numbers recorded for duplicates raised to enable us to spot higher than normal values and review the common reasons why the number seemed abnormal for a tester. We don't use absolute values to measure tester performance.

I also agree with the comment by Thomas that 360 feedback is an excellent way of understanding more about your testers. 360 feedback driven by the project team can give a good indication of the value they felt the tester added to that project.
How do you want your evaluation system to change the way your testers approach their work?

What things are they currently not doing, that you'd like them to do more? What things do they currently do, that you'd like them to do less of?


