
Artificial intelligence is transforming how we analyse satellite imagery and geospatial data. From crop classification to disaster monitoring, machine learning models trained on Earth observation datasets are being deployed across an expanding range of applications. But there is a growing problem that the industry has been slow to address: AI models frequently produce unreliable results when applied outside the conditions they were designed for.
The issue is straightforward. A machine learning model trained to classify agricultural crops using Sentinel-2 satellite imagery will perform well over farmland. But apply that same model to an area of open water, and it will generate nonsensical outputs — confidently labelling ocean pixels as wheat fields or vineyards. For an experienced GIS analyst, this kind of error is easy to spot and correct. But in automated workflows where no human is reviewing intermediate results, these failures can propagate silently through entire analysis pipelines.
This is not a theoretical concern. Research has shown that even models trained on well-regarded benchmarks such as BigEarthNet, which is built from Sentinel-1 and Sentinel-2 data, can swing from over 85 percent accuracy in optimal conditions to as low as 20 percent in unfavourable scenarios. The gap between best-case and worst-case performance is enormous, and most users have no way of knowing which end of that spectrum they are operating at for any given query.
The documentation problem
Compounding this reliability issue is a documentation gap. Models published on platforms like HuggingFace and Kaggle are often poorly documented, or at least not documented in a machine-readable format that a processing platform could use to automatically validate whether a model is appropriate for a given dataset and region. In practice, this means users need to manually inspect model specifications, preprocess input data with custom Python scripts, and make judgement calls about applicability. That effectively limits the use of these models to specialists with both domain expertise and programming skills.
For geospatial AI to scale beyond expert users, platforms need to handle this validation automatically.
Model fencing as a solution
A research collaboration between Constructor University and rasdaman GmbH in Bremen, Germany, funded by the EU’s EFRE programme (the European Regional Development Fund), is working on exactly this problem. The project, FAIRgeo, introduces a concept called “model fencing”: automatically restricting AI model inference to the spatial, temporal, and thematic contexts where reliable results can be expected.
The approach works by enriching model metadata with machine-readable information about where and when a model is valid. When a user submits a query, the platform checks parameters automatically before execution: correct satellite source, correct spectral bands, appropriate patch size, and geographic applicability. If the model is being asked to operate outside its validated comfort zone, the system can flag the issue or prevent execution entirely.
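To make the idea concrete, here is a minimal Python sketch of what such a pre-execution check might look like. It is an illustration only, not FAIRgeo's or rasdaman's actual implementation: the class names, field names, band list, patch size, and coordinates are all hypothetical, and a simple bounding box stands in for what would in practice be a richer description of spatial and thematic validity.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple

# (min_lon, min_lat, max_lon, max_lat) in degrees; a simplifying assumption.
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class ModelFence:
    """Hypothetical machine-readable applicability metadata for one model."""
    sensor: str
    bands: Tuple[str, ...]
    patch_size: int
    bbox: BBox
    valid_from: date
    valid_to: date


@dataclass(frozen=True)
class InferenceRequest:
    """Hypothetical description of what a user query asks the model to process."""
    sensor: str
    bands: Tuple[str, ...]
    patch_size: int
    bbox: BBox
    acquired: date


def bbox_within(inner: BBox, outer: BBox) -> bool:
    """Return True if `inner` lies entirely inside `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def check_fence(fence: ModelFence, request: InferenceRequest) -> List[str]:
    """Collect every way the request falls outside the fence (empty list = OK)."""
    violations = []
    if request.sensor != fence.sensor:
        violations.append(f"sensor: expected {fence.sensor}, got {request.sensor}")
    if request.bands != fence.bands:
        violations.append("bands: spectral bands differ from those the model expects")
    if request.patch_size != fence.patch_size:
        violations.append(f"patch size: expected {fence.patch_size}, got {request.patch_size}")
    if not bbox_within(request.bbox, fence.bbox):
        violations.append("region: request bbox lies outside the fenced area")
    if not (fence.valid_from <= request.acquired <= fence.valid_to):
        violations.append("time: acquisition date is outside the validated period")
    return violations


# Illustrative values only: a crop model fenced to a farmland region.
crop_fence = ModelFence(
    sensor="Sentinel-2",
    bands=("B02", "B03", "B04", "B08"),
    patch_size=120,
    bbox=(8.0, 52.5, 9.5, 53.5),
    valid_from=date(2020, 1, 1),
    valid_to=date(2023, 12, 31),
)

# A request over open ocean, far outside the fenced region.
ocean_request = InferenceRequest(
    sensor="Sentinel-2",
    bands=("B02", "B03", "B04", "B08"),
    patch_size=120,
    bbox=(-30.0, 40.0, -29.0, 41.0),
    acquired=date(2022, 6, 1),
)

for problem in check_fence(crop_fence, ocean_request):
    print("Blocked:", problem)
```

Returning a list of violations rather than a single yes or no lets a platform choose between the two responses described above: show the user a warning, or refuse to run the model at all.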
Early results are promising on the usability front as well. What typically requires over a hundred lines of Python code can be reduced to a two-line datacube query, with the platform handling data selection, preparation, and tiling automatically. Performance benchmarks also show the integrated approach running faster than traditional Python implementations in most cases.
Why this matters beyond research
As AI becomes embedded in operational geospatial workflows — from agricultural monitoring to urban planning to climate risk assessment — the consequences of unreliable model outputs grow more serious. Decisions about land use, disaster response, and infrastructure investment increasingly depend on automated analysis of satellite data.
The geospatial industry needs standardised approaches to model validation and applicability metadata. Efforts like FAIRgeo, which is contributing its findings to OGC working groups on data quality and coverage standards, point toward a future where AI on Earth observation data is not just more powerful, but meaningfully safer.
