
Evaluating Model Performance

Understanding AI for maritime data applications

We are continuously improving our methods of evaluating the performance of our AI models. There are many ways to measure how well they are doing, and people can have different definitions of what it means to do “well”. Because Skylight delivers to governments and enforcement/compliance users, a core principle is to aim for operations-grade reliability: high precision and human-level accuracy.

We also know that there can be a wide gap between models built for research and models built for “production”, like the Skylight product. Models must also be tested online and iteratively improved to deliver the most value to the users of the product.

There are many resources freely available online to learn more about these topics. Here we focus on high-level descriptions of the terminology used to describe model performance in Skylight.

Common Metrics

Most AI models in Skylight are classification models. Their purpose is to predict which class an input belongs to. One example is fishing detection: “Is this track showing fishing or some other behavior?”

These models will have four types of results: 

  • True positive: Fishing track that was correctly classified as Fishing
  • False positive: Non-fishing track that was incorrectly classified as Fishing
  • True negative: Non-fishing track that was correctly classified as Non-fishing
  • False negative: Fishing track that was incorrectly classified as Non-fishing

The most common metrics used to evaluate these models are Precision, Recall, and F1-score. Continuing with this example:

  • Precision describes: Of all the tracks classified as Positive for fishing, how many were actually correct? 
  • Recall describes: Of all the actual Positives of fishing tracks in the data, how many did the model correctly find? 
  • F1-score describes: How well balanced are precision and recall? (A small worked example follows this list.)
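
To make these definitions concrete, here is a minimal sketch in Python using made-up counts (the numbers are illustrative only and do not reflect any actual Skylight model):

```python
# Hypothetical counts from evaluating a fishing-detection classifier
# on a labeled set of tracks (illustrative numbers only).
true_positives = 80   # fishing tracks correctly classified as Fishing
false_positives = 20  # non-fishing tracks incorrectly classified as Fishing
false_negatives = 10  # fishing tracks incorrectly classified as Non-fishing

# Precision: of all tracks classified as Fishing, how many were correct?
precision = true_positives / (true_positives + false_positives)

# Recall: of all actual fishing tracks, how many did the model find?
recall = true_positives / (true_positives + false_negatives)

# F1-score: the harmonic mean of precision and recall.
f1_score = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1_score:.2f}")
# precision=0.80, recall=0.89, f1=0.84
```

Note that true negatives do not appear in these formulas: precision and recall focus on how the model handles the positive class.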

Another classification model example used in Skylight is detecting vessels in satellite imagery. This model classifies objects in an image as vessels or non-vessels. 

  • Precision: How many of the objects classified as vessels were actually vessels? 
  • Recall: Of all the actual vessels in the images, how many were found?  

Ideally a model has exceptionally high recall and precision, but this is not always possible. In fact, precision and recall are almost always in tension with each other. To improve recall, you often need to cast a wider net, but that risks more false positives, which lowers precision. To improve precision, you tighten your criteria, but that means you might miss true positives, which lowers recall. 
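
To illustrate this tension, the sketch below uses made-up confidence scores (not outputs of any Skylight model) and shows how raising the decision threshold typically trades recall for precision:

```python
# Made-up (confidence score, is actually fishing) pairs for six tracks.
scored_tracks = [
    (0.95, True), (0.85, True), (0.70, False),
    (0.60, True), (0.40, False), (0.30, True),
]

def precision_recall(threshold):
    """Classify a track as Fishing when its score meets the threshold."""
    tp = sum(1 for score, label in scored_tracks if score >= threshold and label)
    fp = sum(1 for score, label in scored_tracks if score >= threshold and not label)
    fn = sum(1 for score, label in scored_tracks if score < threshold and label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.5, 0.8):
    p, r = precision_recall(threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.5: precision=0.75, recall=0.75  (wider net: more found, more mistakes)
# threshold=0.8: precision=1.00, recall=0.50  (stricter: fewer mistakes, more missed)
```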

Other Metrics

Besides classification models, Skylight also uses regression models. These models predict continuous numbers rather than classes, for example estimating the length of a vessel detected in imagery as a specific numeric value. This kind of problem can also be turned into a classification problem (e.g. classifying vessels as “small”, “medium”, or “large”).

There are different metrics for regression models, such as Mean Absolute Error (MAE) and the Coefficient of Determination (R-squared Score). 
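
As a rough illustration with invented vessel lengths (not real Skylight outputs), both metrics compare predicted values against true values:

```python
# Invented true and predicted vessel lengths in meters (illustrative only).
true_lengths = [12.0, 30.0, 55.0, 80.0]
predicted_lengths = [15.0, 28.0, 50.0, 86.0]
n = len(true_lengths)

# Mean Absolute Error: the average size of the prediction error, in meters.
mae = sum(abs(t - p) for t, p in zip(true_lengths, predicted_lengths)) / n

# R-squared: how much of the variation in the true lengths the model explains
# (1.0 is perfect; 0.0 is no better than always predicting the mean length).
mean_true = sum(true_lengths) / n
ss_residual = sum((t - p) ** 2 for t, p in zip(true_lengths, predicted_lengths))
ss_total = sum((t - mean_true) ** 2 for t in true_lengths)
r_squared = 1 - ss_residual / ss_total

print(f"MAE={mae:.1f} m, R-squared={r_squared:.2f}")
# MAE=4.0 m, R-squared=0.97
```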

Since regression metrics can be more difficult to interpret and make operationally useful than classification metrics, it is possible to “bucket” regression outputs after the fact (i.e. put them into classes). For example, the Skylight vessel length estimation model regresses length directly. However, to understand how accurate this model is, we bucket the predicted lengths and build a confusion matrix after the model is run to get an idea of how well it is doing.
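
Below is a minimal sketch of this bucketing step, assuming hypothetical length cutoffs (neither the cutoffs nor the data reflect Skylight's actual class definitions or results):

```python
from collections import Counter

# Hypothetical length buckets in meters (illustrative cutoffs only).
def to_bucket(length_m):
    if length_m < 25:
        return "small"
    if length_m < 60:
        return "medium"
    return "large"

# Invented (true length, predicted length) pairs from a regression model.
pairs = [(12.0, 15.0), (30.0, 28.0), (55.0, 62.0), (80.0, 86.0)]

# Count (true bucket, predicted bucket) pairs; these counts are the cells
# of a confusion matrix over the bucketed classes, which can then be used
# to compute precision and recall per class.
confusion = Counter((to_bucket(t), to_bucket(p)) for t, p in pairs)

for (true_bucket, predicted_bucket), count in sorted(confusion.items()):
    print(f"true={true_bucket:<6} predicted={predicted_bucket:<6} count={count}")
# true=large  predicted=large  count=1
# true=medium predicted=large  count=1   <- the 55 m vessel predicted at 62 m
# true=medium predicted=medium count=1
# true=small  predicted=small  count=1
```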

How Does Skylight Evaluate its Models? 

Skylight evaluates all of its AI models during their development and after they have been pushed to production. There are a few ways to understand how well these models are doing: 

  • Industry-Best: What are the top metrics reported by the rest of the industry, and is the Skylight model performing at the same level or better?
  • Customer Feedback: What is the user-reported satisfaction? 
    • We have a “thumbs up/down” feature in our user interface that lets us capture this information 
  • Online Audit: When the system is actively processing incoming data and generating outputs, how well is it doing? 
    • We always do this before a new release or a major model update 
  • Offline Audit: With a controlled, labeled dataset, how well did the model do? 
    • We always do this during model development