Satellite ML
CNN-based opium poppy detection on Sentinel-2 imagery — a research POC built at Scanpoint Geomatics.
Scanpoint Geomatics Ltd. (SGL) — ISRO technology partner
Client logo used with permission.
Project Overview
A research proof-of-concept built during my time at Scanpoint Geomatics, an Indian geospatial company and ISRO technology development partner. The goal was to evaluate whether convolutional neural networks could detect opium poppy cultivation in Sentinel-2 satellite imagery — a workflow currently dependent on manual analyst review across very large geographies.
The project used labelled training data from Afghanistan, provided by SGL, and produced a working end-to-end pipeline: from raw Sentinel-2 scenes through to coordinate-aligned shapefile outputs compatible with QGIS and SGL's IGiS platform.
This was R&D, not a deployed system. Productization decisions sat with SGL.
The Problem
Manual analyst review does not scale to the geographies poppy cultivation actually covers. Sentinel-2 provides freely available imagery worldwide, at up to 10-metre resolution in its visible and near-infrared bands and with a revisit time of roughly five days, but raw imagery is just pixels. Turning it into actionable detections requires either an army of trained analysts or a model that can do the first pass automatically.
The question SGL wanted answered was: can a CNN trained on labelled Afghan imagery produce detections accurate enough that a human analyst's role shifts from search to verification?
What I Built
Sentinel-2 scenes parsed with embedded geospatial metadata preserved, so every prediction can be traced back to real-world coordinates.
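To make the coordinate traceability concrete, here is a minimal sketch of how a pixel position maps back to map coordinates through a six-parameter affine geotransform. It is an illustration, not the project's code: the `GeoTransform` type, the `pixel_to_world` helper and the origin values are all hypothetical, and in practice a raster library would read the transform from the scene metadata.

```python
from typing import NamedTuple, Tuple


class GeoTransform(NamedTuple):
    """Six-parameter affine transform mapping (col, row) to map (x, y)."""
    a: float  # pixel width in map units
    b: float  # x shift per row (0 for north-up imagery)
    c: float  # x coordinate of the upper-left corner of the raster
    d: float  # y shift per column (0 for north-up imagery)
    e: float  # pixel height, negative because rows run southwards
    f: float  # y coordinate of the upper-left corner of the raster


def pixel_to_world(t: GeoTransform, row: float, col: float) -> Tuple[float, float]:
    """Return the map coordinates of the upper-left corner of a pixel."""
    x = t.a * col + t.b * row + t.c
    y = t.d * col + t.e * row + t.f
    return x, y


# Hypothetical north-up 10 m grid in a UTM zone; the origin values are made up.
transform = GeoTransform(10.0, 0.0, 500000.0, 0.0, -10.0, 3600000.0)
x, y = pixel_to_world(transform, row=250, col=100)
```

Keeping this transform alongside every patch is what lets a tile-level prediction be written back out in real-world coordinates later.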
Large scenes split into overlapping patches with the geographic index preserved for downstream reassembly. Overlap matters because field boundaries near tile edges otherwise get cut in half.
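The tiling step can be sketched as follows. This is a simplified illustration under assumed values: the 256-pixel tile, the 64-pixel overlap, and the function names `tile_offsets` and `tile_windows` are hypothetical, and the scene is kept small for readability.

```python
from typing import Iterator, List, Tuple


def tile_offsets(length: int, tile: int, overlap: int) -> List[int]:
    """Start offsets along one axis so that tiles overlap and cover the axis."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than the tile size")
    if length <= tile:
        return [0]
    stride = tile - overlap
    offsets = list(range(0, length - tile + 1, stride))
    # Add a final tile flush with the far edge if the stride left a gap.
    if offsets[-1] + tile < length:
        offsets.append(length - tile)
    return offsets


def tile_windows(
    height: int, width: int, tile: int, overlap: int
) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (row, col, tile_height, tile_width) windows over a scene.

    The (row, col) offset acts as the index needed to place each
    tile's prediction back into the full scene.
    """
    for row in tile_offsets(height, tile, overlap):
        for col in tile_offsets(width, tile, overlap):
            yield row, col, min(tile, height - row), min(tile, width - col)


windows = list(tile_windows(height=1000, width=1000, tile=256, overlap=64))
```

With an overlap, a field that straddles the edge of one tile appears whole in a neighbouring tile, at the cost of predicting some pixels more than once.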
The model was trained on the SGL-provided labelled dataset from Afghanistan. Class imbalance, not model architecture, was the hardest problem: poppy fields are a tiny fraction of total agricultural area, so with naive sampling the model degenerates to predicting "not poppy" almost all the time. Solving this first with class-balanced sampling is what made the F1 0.87 result possible.
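One common way to balance sampling is to draw training patches with a probability inversely proportional to their class frequency. The sketch below shows that idea only; whether the project used exactly this scheme is an assumption, and the helper names and the 2% positive rate are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence


def balanced_weights(labels: Sequence[int]) -> List[float]:
    """Weight each sample by the inverse of its class frequency."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def sample_batch(labels: Sequence[int], batch_size: int, rng: random.Random) -> List[int]:
    """Draw sample indices with replacement so each class is equally likely."""
    weights = balanced_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=batch_size)


# Hypothetical patch labels: 2% poppy (1), 98% other (0). Not the project's real ratio.
labels = [1] * 20 + [0] * 980
rng = random.Random(0)
batch = sample_batch(labels, batch_size=2000, rng=rng)
poppy_share = sum(labels[i] for i in batch) / len(batch)
```

Under these weights each class contributes the same total probability mass, so roughly half of every batch is poppy even though poppy patches are rare in the dataset. The trade-off is that the few positive patches are repeated often, which is one reason augmentation usually accompanies this kind of sampling.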
Tile-level predictions merged, noise artefacts suppressed, and final outputs written as shapefile polygons that load directly into QGIS and SGL's IGiS platform without further processing.
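A minimal sketch of the merge step, assuming overlapping tiles are reconciled by averaging their per-pixel probabilities (the source does not state the merge rule, so averaging is an assumption). The `merge_tiles` helper, the tile values and the 0.6 threshold are hypothetical; polygonisation and shapefile writing would follow in a GIS library and are left out here.

```python
from typing import List, Sequence, Tuple

Grid = List[List[float]]
# (row offset, col offset, per-pixel probabilities for the tile)
Tile = Tuple[int, int, Sequence[Sequence[float]]]


def merge_tiles(height: int, width: int, tiles: Sequence[Tile]) -> Grid:
    """Average per-pixel probabilities wherever tiles overlap."""
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for row0, col0, probs in tiles:
        for r, line in enumerate(probs):
            for c, p in enumerate(line):
                total[row0 + r][col0 + c] += p
                count[row0 + r][col0 + c] += 1
    # Pixels that no tile covered stay at 0.0 rather than dividing by zero.
    return [
        [total[r][c] / count[r][c] if count[r][c] else 0.0 for c in range(width)]
        for r in range(height)
    ]


# Two hypothetical 2x3 tiles on a 2x4 scene, overlapping in columns 1 and 2.
tile_a = (0, 0, [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]])
tile_b = (0, 1, [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]])
merged = merge_tiles(2, 4, [tile_a, tile_b])

THRESHOLD = 0.6  # hypothetical cut-off, not the project's value
mask = [[p >= THRESHOLD for p in row] for row in merged]
```

Averaging smooths out disagreements between neighbouring tiles in the overlap zone, where each tile sees the same pixels with different surrounding context.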
System Architecture
Sentinel-2 scenes loaded with embedded geospatial metadata preserved for coordinate-accurate downstream outputs.
Scenes split into overlapping geographic patches with the index preserved, so tile-level predictions can be reassembled into full-scene polygons.
Model trained on the labelled Afghanistan dataset. Class-balanced sampling addresses the dominant training challenge: positives are a tiny fraction of total agricultural area.
Tile predictions merged, noise suppressed, and final outputs written as shapefile polygons compatible with QGIS and SGL's IGiS platform.
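As an illustration of the noise-suppression stage, the sketch below removes isolated detections smaller than a minimum size from a binary mask. The source does not say which filter was used, so connected-component filtering is an assumption, and `remove_small_components` and the `min_pixels` value are hypothetical.

```python
from collections import deque
from typing import List

Mask = List[List[bool]]


def remove_small_components(mask: Mask, min_pixels: int) -> Mask:
    """Drop 4-connected groups of positive pixels smaller than min_pixels."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    cleaned = [[False] * width for _ in range(height)]
    for r0 in range(height):
        for c0 in range(width):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first flood fill to collect one connected component.
            component = [(r0, c0)]
            seen[r0][c0] = True
            queue = deque(component)
            while queue:
                r, c = queue.popleft()
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    inside = 0 <= nr < height and 0 <= nc < width
                    if inside and mask[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        component.append((nr, nc))
                        queue.append((nr, nc))
            # Keep the component only if it is large enough to be a plausible field.
            if len(component) >= min_pixels:
                for r, c in component:
                    cleaned[r][c] = True
    return cleaned


T, F = True, False
raw_mask = [
    [T, T, F, F, T],
    [T, T, F, F, F],
    [F, F, F, T, F],
]
clean_mask = remove_small_components(raw_mask, min_pixels=3)
```

Here the 2x2 block survives and the two single-pixel detections are discarded. At 10 m resolution one pixel covers 100 square metres, so a minimum size in pixels translates directly into a minimum plausible field area.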
Technologies Used
Result
An F1 score of 0.87 on the held-out test set: strong enough to demonstrate that the approach works for the use case, while leaving meaningful room for productization improvements such as false-positive characterisation, generalisation to non-Afghan geographies, and integration with revisit-time workflows.
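For context on why F1 is the metric reported here instead of accuracy, the sketch below computes both on a heavily imbalanced example. The counts are illustrative only and are not the project's confusion matrix; the `f1_score` helper is a hypothetical stand-in for whatever evaluation code was actually used.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 1,000 samples of which 20 are poppy,
# scored by a degenerate model that never predicts poppy.
tp, fp, fn, tn = 0, 0, 20, 980
accuracy = (tp + tn) / (tp + fp + fn + tn)
f1_all_negative = f1_score(tp, fp, fn)

# A second illustrative case where precision and recall are both 0.8.
f1_example = f1_score(tp=8, fp=2, fn=2)
```

The always-negative model reaches 98% accuracy but an F1 of zero, which is why accuracy alone says little on data like this. F1 only rises when the model finds positives without flooding the analyst with false alarms.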
The pipeline ran end-to-end in Docker, was reproducible across runs, and produced GIS outputs that loaded cleanly into IGiS without manual intervention.
What I'd Approach Differently Today
Start with a pre-trained backbone.
In 2024 I built a custom CNN. Today I'd start with a pre-trained segmentation backbone, such as SegFormer or a Sentinel-2-specific foundation model, and fine-tune from there. Geospatial foundation models have made real progress: the labelling effort would be the same, but the architecture work would be smaller.
Data quality over architecture.
That is how I think about most ML projects now: time is better spent on data quality and the surrounding pipeline than on architecture, unless the architecture is genuinely the research question.