Consensus

Consensus works by having more than one annotator annotate the same asset. Once the asset is annotated, a consensus score is calculated to measure the agreement level between the different annotations for a given asset. It is a key lever for controlling production quality.

Set consensus

To activate consensus, go to your project settings, open the "Quality" tab, and tick the box to activate it. You will then need to set the following parameters (they can also be set through the API, as sketched after the list):

  • Total Asset Coverage: the percentage of the dataset that will be annotated several times. If the dataset contains 10,000 assets and the Total Asset Coverage is 10%, then 1,000 assets will be annotated several times. The assets to be labelled multiple times are chosen randomly.

  • Minimum Consensus Size: the number of annotators who will have to annotate the same asset. It must be set to a number less than or equal to the total number of annotators. On standard projects, it is limited to 10 labelers for performance reasons; please contact support@kili-technology.com if you need more.
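
The sketch below shows how these parameters might be set programmatically. It is illustrative only: it assumes the Python SDK's update_properties_in_project method accepts consensus_tot_coverage and min_consensus_size parameters, so check the Python API reference for the exact names and signature.

```python
# Illustrative sketch only: the parameter names below are assumptions,
# check the Python API reference for the exact signature.
from kili.client import Kili

kili = Kili(api_key="YOUR_API_KEY")

kili.update_properties_in_project(
    project_id="YOUR_PROJECT_ID",
    consensus_tot_coverage=10,  # Total Asset Coverage, in %
    min_consensus_size=2,       # Minimum Consensus Size (number of labelers)
)
```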

If you change the consensus in an ongoing project, only the unlabeled assets (with status TODO) will be distributed to labelers according to the new parameters.

All ONGOING and LABELED assets may see their status updated later in the day. A LABELED asset can become ONGOING if consensus was increased and the asset had already been chosen for consensus. An ONGOING asset can become LABELED if consensus was decreased.

  • Example of a project with consensus 100% for 2 people

    • An asset A is chosen for consensus. A has two labels. So A is LABELED.
    • The project is updated to 100% for 3 people
    • A is now ONGOING
  • Example of a project with consensus 100% for 3 people

    • An asset B is chosen for consensus. B has two labels. So B is ONGOING.
    • The project is updated to 100% for 2 people
    • B is now LABELED

Assets selected by the algorithm to participate in the consensus remain at the top of the labeling queue, with an ONGOING status, until they have been annotated by Minimum Consensus Size annotators. Once labelled by the desired number of labelers, their status is updated to LABELED, as sketched below.
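
The status transitions described above can be summarized with a small illustrative sketch (not the production logic; statuses TODO, ONGOING and LABELED as used above):

```python
def asset_status(num_labels, is_consensus_asset, min_consensus_size):
    """Illustrative status logic for a consensus project (not the production code)."""
    required = min_consensus_size if is_consensus_asset else 1
    if num_labels == 0:
        return "TODO"
    return "LABELED" if num_labels >= required else "ONGOING"

# Asset A from the first example above: it has 2 labels, and raising the
# consensus from 2 to 3 people moves it back from LABELED to ONGOING.
print(asset_status(2, True, min_consensus_size=2))  # LABELED
print(asset_status(2, True, min_consensus_size=3))  # ONGOING
```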

Consensus calculation

At the asset level

Depending on your task, consensus is computed differently at the asset level:

Single-class classification (For all asset types):

With only one class to select, we want a metric that accurately describes agreement and decreases as the number of different selected categories increases, like 1 / number_of_different_selected_categories. For perfect agreement, the consensus evaluates to 1; in case of disagreement, note that the lowest possible score is max(1 / number_of_possible_classes, 1 / minimum_consensus_size).

In this example, two different classes have been selected (number_of_different_selected_categories = 2), so the consensus at the asset level is: consensus = 1 / 2 = 50%. With only two labelers, this is the lowest possible score.

With a consensus between three labelers, the consensus will be 33% in the case of complete disagreement, 67% if only two of the three labelers agree, or 100% for perfect agreement.

Multi-class classification (For all asset types):

In multi-class classification, we generalize this pattern as:

1 / number_of_different_selected_categories * sum_on_selected_categories( number_of_annotators_that_selected_this_category / min_consensus_size )

In this example, two different classes have been selected (number_of_different_selected_categories = 2) with two annotators (min_consensus_size = 2), so the consensus at the asset level is:

consensus = 1/2 * (2/2 + 0/2 + 1/2) = 1/2 * 3/2 = 75%

If the two labelers both select an extra wrong category, say "Credit card renewal" for one, and "Wire transfer" for the other, the score becomes

consensus = 1/3 * (2/2 + 1/2 + 1/2) = 1/3 * 2 = 67%

Again, the minimum consensus in this situation is max(1 / number_of_possible_classes, 1 / min_consensus_size), when there is a disagreement on every selected label.
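
To make the arithmetic concrete, here is a small, purely illustrative Python sketch of the formula above (not the production implementation). The agreed category name "Invoice" is hypothetical; the two extra categories come from the example.

```python
from collections import Counter

def multiclass_consensus(selections, min_consensus_size):
    """selections: one set of selected categories per labeler."""
    counts = Counter(cat for selection in selections for cat in selection)
    if not counts:
        return 1.0
    return sum(n / min_consensus_size for n in counts.values()) / len(counts)

# Both labelers agree on "Invoice"; the second also selects "Wire transfer".
print(multiclass_consensus([{"Invoice"}, {"Invoice", "Wire transfer"}], 2))  # 0.75

# Each labeler additionally selects a different extra category.
print(multiclass_consensus(
    [{"Invoice", "Credit card renewal"}, {"Invoice", "Wire transfer"}], 2))  # ~0.67
```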

Object detection (implemented for bounding boxes, polygons and semantic segmentation):

Each image is considered as a set of pixels to be classified. A pixel can be classified into several non-exclusive categories representing different objects. The computation evaluates the intersection over the union of all annotations. Hence, two perfectly overlapping bounding boxes correspond to a consensus of 100%, and the ratio decreases as the common area shrinks, down to 0% for two completely distinct shapes. One benefit of this method is that it scales with the size of the shape, through the union area in the denominator, ensuring accuracy for shapes of all sizes.

However, as shown below, consensus quickly decreases with imprecise labelling, allowing you to closely monitor the quality of your data.
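
As a minimal sketch of the intersection-over-union idea, here is how it plays out for two axis-aligned bounding boxes (the production computation also handles polygons and semantic masks, and aggregates over categories and annotators):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 1.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # 1.0: perfectly overlapping boxes
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))    # ~0.14: a slight shift already lowers the score
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0: completely distinct shapes
```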

Transcription (For all asset types):

Consensus for transcription jobs is computed using the Levenshtein distance, which quantifies the number of single-character edits needed to change one string into the other.

For example, a task to translate the sentence "Bryan is in the kitchen" to French with two different translations, "Bryan est dans la cuisine" and "Bryan est dans la salle de bain", has a consensus of 76%.

Even with a completely irrelevant translation, the Levenshtein ratio can still be around 50%, so it is advised to monitor these scores closely, especially in the case of short transcriptions.
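
Below is a short, illustrative implementation of the Levenshtein distance together with one possible similarity normalization; the exact normalization used to turn the distance into the consensus percentage may differ.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """One simple normalization of the distance into a 0-1 ratio."""
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

print(levenshtein("kitten", "sitting"))  # 3 edits
print(similarity("kitten", "sitting"))   # ~0.57
```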

Named Entity Recognition (NER):

For NER tasks, consensus is computed at the category level:

  • for each category, the consensus is again the intersection over the union of all entities annotated by all annotators: its maximum is 1 if the chosen entities are identical, and its minimum is zero if the entities do not overlap.

The score depends not only on the content but also on the offsets: two annotated entities are considered to overlap if their beginOffset and endOffset ranges intersect. At the asset level, the consensusMark is the average over all entity categories.

In the example below, if one annotator has labeled the yellow entity and another the two blue entities for the same category, the final consensus will be 33%.

[Figure: NER consensus example]

consensus = number_of_characters_in_common / number_of_characters_union = 15/45 = 33%
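
A minimal sketch of this character-level intersection over union for a single category; the offsets below are hypothetical but reproduce the 15/45 arithmetic of the example above.

```python
def ner_category_consensus(spans_a, spans_b):
    """spans: list of (begin_offset, end_offset) character ranges for one category."""
    chars_a = {i for begin, end in spans_a for i in range(begin, end)}
    chars_b = {i for begin, end in spans_b for i in range(begin, end)}
    union = chars_a | chars_b
    return len(chars_a & chars_b) / len(union) if union else 1.0

# One annotator labels a single 15-character entity; the other labels two
# entities covering 45 characters in total that contain those 15 characters.
print(ner_category_consensus([(0, 15)], [(0, 25), (30, 50)]))  # 15 / 45 ≈ 0.33
```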

Named Entity Relationships:

For named entity relations, the computation is similar, except that it is made both at the category level and at the relation level: for two correctly classified entities, if one is included in a relation but the other is not, this will lead to a null consensus.

Character recognition (OCR):

We interpret the OCR task as the composition of an object detection task (selecting a box) and a text entry task (transcribing the text contained in the box). For now, consensus is only computed on the object detection part of the task.

At the annotator level

We take the average of the consensus scores over the assets that the annotator labeled and that are part of the consensus, as a simple and intuitive estimator of the annotator's consensus level.

At the project level

We take the average of the asset consensus scores (for the assets with consensus) as a simple and intuitive estimator of the project-level consensus.
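
As a sketch of these two aggregations, assuming asset-level consensus scores are already available (the scores and identifiers below are hypothetical):

```python
def mean(scores):
    return sum(scores) / len(scores) if scores else None

# Hypothetical asset-level consensus scores, keyed by asset id.
asset_consensus = {"asset_1": 0.75, "asset_2": 1.0, "asset_3": 0.5}

# Annotator level: average over the consensus assets this annotator labeled.
annotator_assets = ["asset_1", "asset_3"]
annotator_score = mean([asset_consensus[a] for a in annotator_assets])  # 0.625

# Project level: average over all assets with consensus.
project_score = mean(list(asset_consensus.values()))  # 0.75

print(annotator_score, project_score)
```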
