
Evaluating Model Performance

Understanding AI for maritime data applications

We are continuously improving our methods of evaluating the performance of our AI models. There are many ways to measure how well they are doing, and people can have different definitions of what it means to do “well”. Because Skylight delivers to governments and enforcement/compliance users, a core principle is to aim for operations-grade reliability: high precision and human-level accuracy.

We also know that there can be a wide gap between models built for research and models built for “production”, like the Skylight product. Models must also be tested online and iteratively improved to deliver the most value to the users of the product.

There are many resources freely available online to learn more about these topics. Here we focus on high-level descriptions of the terminology used to describe model performance in Skylight.

Common Metrics

Most AI models in Skylight are classification models. Their purpose is to predict which class an input belongs to. One example is fishing detection: “Is this track showing fishing or some other behavior?”

These models will have four types of results: 

  • True positive: Fishing track that was correctly classified as Fishing
  • False positive: Non-fishing track that was incorrectly classified as Fishing
  • True negative: Non-fishing track that was correctly classified as Non-fishing
  • False negative: Fishing track that was incorrectly classified as Non-fishing

The most common metrics used to evaluate these models are Precision, Recall, and F1-score. Continuing with this example:

  • Precision describes: Of all the tracks classified as Positive for fishing, how many were actually correct? 
  • Recall describes: Of all the actual Positives of fishing tracks in the data, how many did the model correctly find? 
  • F1-score describes: How well balanced are precision and recall? (A small worked example follows this list.)
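
To make these definitions concrete, here is a minimal sketch in Python using made-up counts (the numbers are illustrative only and do not reflect any actual Skylight model):

```python
# Hypothetical counts from evaluating a fishing-detection classifier
# on a labeled set of tracks (illustrative numbers only).
true_positives = 80   # fishing tracks correctly classified as Fishing
false_positives = 20  # non-fishing tracks incorrectly classified as Fishing
false_negatives = 10  # fishing tracks incorrectly classified as Non-fishing

# Precision: of all tracks classified as Fishing, how many were correct?
precision = true_positives / (true_positives + false_positives)

# Recall: of all actual fishing tracks, how many did the model find?
recall = true_positives / (true_positives + false_negatives)

# F1-score: the harmonic mean of precision and recall.
f1_score = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1_score:.2f}")
# precision=0.80, recall=0.89, f1=0.84
```

Note that true negatives do not appear in these formulas: precision and recall focus on how the model handles the positive class.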

Another classification model example used in Skylight is detecting vessels in satellite imagery. This model classifies objects in an image as vessels or non-vessels. 

  • Precision: How many of the objects classified as vessels were actually vessels? 
  • Recall: Of all the actual vessels in the images, how many were found?  

Ideally a model has exceptionally high recall and precision, but this is not always possible. In fact, precision and recall are almost always in tension with each other. To improve recall, you often need to cast a wider net, but that risks more false positives, which lowers precision. To improve precision, you tighten your criteria, but that means you might miss true positives, which lowers recall. 
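
To illustrate this tension, the sketch below uses made-up confidence scores (not outputs of any Skylight model) and shows how raising the decision threshold typically trades recall for precision:

```python
# Made-up (confidence score, is actually fishing) pairs for six tracks.
scored_tracks = [
    (0.95, True), (0.85, True), (0.70, False),
    (0.60, True), (0.40, False), (0.30, True),
]

def precision_recall(threshold):
    """Classify a track as Fishing when its score meets the threshold."""
    tp = sum(1 for score, label in scored_tracks if score >= threshold and label)
    fp = sum(1 for score, label in scored_tracks if score >= threshold and not label)
    fn = sum(1 for score, label in scored_tracks if score < threshold and label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.5, 0.8):
    p, r = precision_recall(threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.5: precision=0.75, recall=0.75  (wider net: more found, more mistakes)
# threshold=0.8: precision=1.00, recall=0.50  (stricter: fewer mistakes, more missed)
```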

Other Metrics

Besides classification models, Skylight also uses regression models. These models predict continuous numbers rather than classes, for example estimating the length of a vessel detected in imagery as a specific numeric value. This kind of problem can also be turned into a classification problem (e.g. classifying vessels as “small”, “medium”, or “large”).

There are different metrics for regression models, such as Mean Absolute Error (MAE) and the Coefficient of Determination (R-squared Score). 
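
As a rough illustration with invented vessel lengths (not real Skylight outputs), both metrics compare predicted values against true values:

```python
# Invented true and predicted vessel lengths in meters (illustrative only).
true_lengths = [12.0, 30.0, 55.0, 80.0]
predicted_lengths = [15.0, 28.0, 50.0, 86.0]
n = len(true_lengths)

# Mean Absolute Error: the average size of the prediction error, in meters.
mae = sum(abs(t - p) for t, p in zip(true_lengths, predicted_lengths)) / n

# R-squared: how much of the variation in the true lengths the model explains
# (1.0 is perfect; 0.0 is no better than always predicting the mean length).
mean_true = sum(true_lengths) / n
ss_residual = sum((t - p) ** 2 for t, p in zip(true_lengths, predicted_lengths))
ss_total = sum((t - mean_true) ** 2 for t in true_lengths)
r_squared = 1 - ss_residual / ss_total

print(f"MAE={mae:.1f} m, R-squared={r_squared:.2f}")
# MAE=4.0 m, R-squared=0.97
```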

Since regression metrics can be more difficult to interpret and make operationally useful than classification metrics, it is possible to “bucket” regression outputs after the fact (i.e. put them into classes). For example, the Skylight vessel length estimation model regresses length directly. However, to understand how accurate this model is, we bucket the predicted lengths and build a confusion matrix after the model is run to get an idea of how well it is doing.
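
Below is a minimal sketch of this bucketing step, assuming hypothetical length cutoffs (neither the cutoffs nor the data reflect Skylight's actual class definitions or results):

```python
from collections import Counter

# Hypothetical length buckets in meters (illustrative cutoffs only).
def to_bucket(length_m):
    if length_m < 25:
        return "small"
    if length_m < 60:
        return "medium"
    return "large"

# Invented (true length, predicted length) pairs from a regression model.
pairs = [(12.0, 15.0), (30.0, 28.0), (55.0, 62.0), (80.0, 86.0)]

# Count (true bucket, predicted bucket) pairs; these counts are the cells
# of a confusion matrix over the bucketed classes, which can then be used
# to compute precision and recall per class.
confusion = Counter((to_bucket(t), to_bucket(p)) for t, p in pairs)

for (true_bucket, predicted_bucket), count in sorted(confusion.items()):
    print(f"true={true_bucket:<6} predicted={predicted_bucket:<6} count={count}")
# true=large  predicted=large  count=1
# true=medium predicted=large  count=1   <- the 55 m vessel predicted at 62 m
# true=medium predicted=medium count=1
# true=small  predicted=small  count=1
```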

How Does Skylight Evaluate its Models? 

Skylight evaluates all of its AI models during their development and after they have been pushed to production. There are a few ways to understand how well these models are doing: 

  • Industry-Best: What are the top metrics reported by the rest of the industry, and is the Skylight model performing at the same level or better?
  • Customer Feedback: What is the user-reported satisfaction? 
    • We have a “thumbs up/down” feature in our user interface that lets us capture this information 
  • Online Audit: When the system is actively processing incoming data and generating outputs, how well is it doing? 
    • We always do this before a new release or a major model update 
  • Offline Audit: With a controlled, labeled dataset, how well did the model do? 
    • We always do this during model development