Consensus
Consensus works by having more than one annotator annotate the same asset. Once the asset is annotated, a consensus score is calculated to measure the agreement level between the different annotations for a given asset. It is a key lever for controlling production quality.
Set consensus
To activate consensus, go to your project settings, open the "Quality" tab and tick the box to activate it. You will then need to set the following parameters:
Total Asset Coverage: the percentage of the dataset that will be annotated several times. If the dataset contains 10,000 assets and the Total Asset Coverage is 10%, then 1,000 assets will be annotated several times. The assets to be labeled multiple times are chosen randomly.
Minimum Consensus Size: the number of annotators that will have to annotate the same asset. It must be set to a number less than or equal to the total number of annotators. On standard projects, it is limited to 10 labelers for performance reasons. Please contact support@kili-technology.com if you need more.
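As a rough illustration of what these two parameters imply for labeling effort, here is a small sketch in plain Python (this is only arithmetic, not a Kili API call; the function name is made up):

```python
def consensus_workload(dataset_size: int, total_asset_coverage: float, min_consensus_size: int) -> dict:
    """Estimate how many assets are labeled several times and the total number of labels to produce."""
    consensus_assets = round(dataset_size * total_asset_coverage)         # e.g. 10,000 * 10% = 1,000 assets
    single_assets = dataset_size - consensus_assets
    total_labels = single_assets + consensus_assets * min_consensus_size  # each consensus asset needs several labels
    return {"consensus_assets": consensus_assets, "total_labels": total_labels}

print(consensus_workload(10_000, 0.10, 2))
# {'consensus_assets': 1000, 'total_labels': 11000}
```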
If you change the consensus settings in an ongoing project, only the unlabeled assets (with status TODO) will be distributed to labelers according to the new parameters.
All ONGOING and LABELED assets may see their status updated later in the day: a LABELED asset can become ONGOING if consensus was increased and the asset had already been chosen for consensus, and an ONGOING asset can become LABELED if consensus was decreased.
Example of a project with consensus 100% for 2 people
- An asset A is chosen for consensus. A has two labels, so A is LABELED.
- The project is updated to 100% for 3 people.
- A is now ONGOING.

Example of a project with consensus 100% for 3 people
- An asset B is chosen for consensus. B has two labels, so B is ONGOING.
- The project is updated to 100% for 2 people.
- B is now LABELED.
The assets that have been selected by the algorithm to participate in the consensus remain at the top of the stack, with an ONGOING status, until they have been annotated by Minimum Consensus Size annotators. Once labeled by the desired number of labelers, their status is updated to LABELED.
Consensus calculation
At the asset level
Depending on your task, the consensus will be computed differently at the asset level:
Single-class classification (for all asset types):
With only one classification possible, we want a metric that accurately describes the agreement and decreases as the number of different selected categories increases: consensus = 1 / number_of_different_selected_categories.
For a perfect consensus, the score is 1. In case of disagreement, it is interesting to note that the lowest possible score is max(1 / number_of_possible_classes, 1 / minimum_consensus_size).
In this example, two different classes have been selected (number_of_different_selected_categories = 2), so the consensus at the asset level is consensus = 1 / 2 = 50%. With only two labelers, this is the lowest score possible.
With a consensus between three labelers, the consensus will be either 33% in the case of complete disagreement, 67% in the case of partial agreement of two labelers, or 100% for perfect agreement.
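Here is a minimal, unofficial sketch of the basic 1 / number_of_different_selected_categories rule (category names are made up for illustration):

```python
def single_class_consensus(votes: list[str]) -> float:
    """Consensus for single-class classification: 1 / number of different selected categories."""
    return 1 / len(set(votes))

print(single_class_consensus(["Wire transfer", "Wire transfer"]))                 # 1.0  -> perfect agreement
print(single_class_consensus(["Wire transfer", "Credit card renewal"]))           # 0.5  -> lowest score with two labelers
print(single_class_consensus(["Wire transfer", "Credit card renewal", "Other"]))  # 0.33 -> complete disagreement, three labelers
```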
Multi-class classification (for all asset types):
In multi-class classification, we generalize this pattern as consensus = 1 / number_of_different_selected_categories * sum_on_selected_categories( number_of_annotators_that_selected_this_category / min_consensus_size ).
In this example, two different classes have been selected (number_of_different_selected_categories = 2) by two annotators (min_consensus_size = 2), so the consensus at the asset level is consensus = 1/2 * ( 2/2 + 0/2 + 1/2 ) = 1/2 * 3/2 = 75%.
If the two labelers both select an extra wrong category, say "Credit card renewal" for one and "Wire transfer" for the other, the score becomes consensus = 1/3 * ( 2/2 + 1/2 + 1/2 ) = 1/3 * 2 = 67%.
Again, the minimum consensus in this situation is max(1 / number_of_possible_classes, 1 / min_consensus_size), when there is a disagreement on every selected label.
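A minimal, unofficial Python sketch of this multi-class formula, reproducing the two figures above (the agreed-upon category name is made up):

```python
from collections import Counter

def multi_class_consensus(selections: list[set[str]], min_consensus_size: int) -> float:
    """1 / number_of_different_selected_categories
       * sum over selected categories of (annotators who selected it / min_consensus_size)."""
    votes = Counter(category for selection in selections for category in selection)
    if not votes:
        return 0.0
    return sum(count / min_consensus_size for count in votes.values()) / len(votes)

# Both labelers pick "Change of address"; one of them also ticks a second category:
print(multi_class_consensus([{"Change of address", "Wire transfer"},
                             {"Change of address"}], min_consensus_size=2))                    # 0.75

# Both labelers agree on "Change of address" but each adds a different wrong category:
print(multi_class_consensus([{"Change of address", "Credit card renewal"},
                             {"Change of address", "Wire transfer"}], min_consensus_size=2))   # ~0.67
```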
Object detection (implemented for bounding boxes, polygons and semantic segmentation):
Each image is considered as a set of pixels to be classified. A pixel can be classified into several non-exclusive categories representing different objects. The computation evaluates the intersection over the union of all annotations. Hence, two perfectly overlapping bounding boxes correspond to a consensus of 100%, and the ratio decreases as the common area shrinks, down to 0% for two completely disjoint shapes.
One benefit of this method is that it accounts for the size of the shape, through the union-area denominator, ensuring accuracy for shapes of all sizes.
However, as shown below, consensus quickly decreases with imprecise labeling, allowing you to closely monitor the quality of your data.
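The building block of this computation is the intersection over union (IoU) of two shapes. Here is a minimal sketch for axis-aligned bounding boxes (how the score is aggregated over several annotators and categories is not detailed here):

```python
def box_iou(box_a: tuple, box_b: tuple) -> float:
    """Intersection over union of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0

print(box_iou((0, 0, 10, 10), (0, 0, 10, 10)))    # 1.0   -> perfectly overlapping boxes
print(box_iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0   -> completely disjoint shapes
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))    # 0.333 -> a 5-pixel horizontal shift already drops the score to a third
```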
Transcription (for all asset types):
Consensus for transcription jobs is computed using the Levenshtein distance, which quantifies the number of single-character changes necessary to turn one string into the other.
For example, a task to translate the sentence "Bryan is in the kitchen" into French with two different translations, "Bryan est dans la cuisine" and "Bryan est dans la salle de bain", has a consensus of 76%.
With a completely irrelevant translation, the Levenshtein ratio can still be around 50%, so it is advisable to monitor consensus closely, especially in the case of short transcriptions.
Named Entity Recognition (NER):
For NER tasks, consensus is computed at the category level:
- for each category, the consensus is again the intersection over the union of all entities annotated by all annotators: its maximum is 1 if the chosen entities are the same, and its minimum is 0 if the entities do not overlap. It depends not only on the content but also on the offsets: two words are considered to overlap if their beginOffset and endOffset are intertwined.

At the asset level, the consensusMark is the average over all categories of entities.
For example, in the example below, if one annotator has labeled the yellow entity and another the two blue entities for the same category, the final consensus will be 33%:
consensus = number_of_characters_in_common / number_of_characters_in_union = 15/45 = 33%
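A minimal sketch of this character-level intersection over union for one category (the offsets are hypothetical, chosen to reproduce the 15/45 figure above):

```python
def ner_category_consensus(entities_a: list, entities_b: list) -> float:
    """Character-level intersection over union of the spans two annotators labeled for one category.
    Each entity is a (begin_offset, end_offset) pair; end_offset is exclusive here."""
    chars_a = {i for begin, end in entities_a for i in range(begin, end)}
    chars_b = {i for begin, end in entities_b for i in range(begin, end)}
    union = chars_a | chars_b
    return len(chars_a & chars_b) / len(union) if union else 1.0

# One annotator labels a single 45-character entity, the other labels two shorter
# entities (10 and 5 characters) inside it:
print(ner_category_consensus([(0, 45)], [(0, 10), (40, 45)]))  # 15/45 ≈ 0.33
```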
Named Entity Relationships:
For named entity relations, the computation is similar, except that it is made both at the category level and at the relation level: for two correctly classified entities, if one is included in a relation but not the other, this leads to a null consensus.
Character recognition (OCR):
We interpret the OCR task as the composition of an object detection task (selecting a box) and a text entry task (the text contained in a box). For now, consensus is computed only on the object detection part of the task.
At the annotator level
We take the average of the consensus scores of the assets annotated by the annotator and selected for consensus, as a simple and intuitive estimator of an annotator's consensus level.
At the project level
We take the average of the asset consensus scores (for the assets selected for consensus) as a simple and intuitive estimator of the project-level consensus.
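As a small sketch of both aggregations (asset names and scores below are made up):

```python
from statistics import mean

# Hypothetical per-asset consensus scores (only assets selected for consensus have one).
asset_consensus = {"asset_1": 1.0, "asset_2": 0.75, "asset_3": 0.5}
labeled_by = {"alice": ["asset_1", "asset_2"], "bob": ["asset_2", "asset_3", "asset_4"]}

# Annotator level: average over the consensus assets the annotator labeled.
annotator_consensus = {
    annotator: mean(asset_consensus[a] for a in assets if a in asset_consensus)
    for annotator, assets in labeled_by.items()
}
print(annotator_consensus)             # {'alice': 0.875, 'bob': 0.625}

# Project level: average over all assets that have a consensus score.
print(mean(asset_consensus.values()))  # 0.75
```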