Using Tessera foundation model embeddings to predict where data-deficient plant species might occur
<aside> 💡
Disclaimer: Claude Code wrote the initial draft of this post using our conversation transcript and git history. I simply post-edited it and made minor adjustments.
</aside>

Before starting this project, I wanted to quantify exactly how biased GBIF's plant occurrence data is toward well-studied species. I queried their API and the numbers confirmed it starkly:
This creates a problem for conservation. If you're trying to assess whether a species is endangered, or plan where to survey for new populations, you need to know where it could occur, not just the handful of places where someone happened to document it. The African Baobab (Adansonia digitata) has 11,281 records; many equally important species have fewer than 10.
The goal: Given just a few GPS locations of a plant species, can we predict other locations where it might occur?
This approach was inspired by Gabriel Mahler’s work on brambles using Tessera, a geospatial foundation model. Tessera produces 128-dimensional embedding vectors for every 10 m × 10 m pixel on Earth, encoding land cover, terrain, climate, and other environmental features the model learned from satellite imagery.
The hypothesis: if we know where a species occurs, we can sample the Tessera embeddings at those locations to learn what habitat "looks like" to the model. Then we can score every other location by how similar its embedding is to the known occurrences.
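To make the hypothesis concrete, here is a minimal sketch of one way to score locations by embedding similarity. It is an illustration under assumptions, not necessarily the method used in this project: it assumes the embeddings have already been extracted as NumPy arrays, it uses cosine similarity to the mean occurrence embedding as the score, and the function name `habitat_similarity_scores` and the synthetic data are hypothetical stand-ins for real Tessera embeddings.

```python
import numpy as np


def habitat_similarity_scores(occurrence_emb, candidate_emb):
    """Score candidates by cosine similarity to the mean occurrence embedding.

    occurrence_emb: (n_occurrences, d) embeddings at known occurrence sites.
    candidate_emb:  (n_candidates, d) embeddings at locations to score.
    Returns an (n_candidates,) array of scores in [-1, 1].
    Assumes no embedding is an all-zero vector.
    """
    occurrence_emb = np.asarray(occurrence_emb, dtype=float)
    candidate_emb = np.asarray(candidate_emb, dtype=float)

    # L2-normalize each embedding so that dot products are cosine similarities.
    occ_unit = occurrence_emb / np.linalg.norm(occurrence_emb, axis=1, keepdims=True)
    cand_unit = candidate_emb / np.linalg.norm(candidate_emb, axis=1, keepdims=True)

    # Summarize the known habitat as a single unit-length "centroid" direction.
    centroid = occ_unit.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)

    return cand_unit @ centroid


# Synthetic stand-in for real embeddings (hypothetical data, 128 dimensions).
rng = np.random.default_rng(0)
habitat_direction = rng.normal(size=128)
occurrences = habitat_direction + 0.1 * rng.normal(size=(5, 128))
similar_sites = habitat_direction + 0.1 * rng.normal(size=(3, 128))
unrelated_sites = rng.normal(size=(3, 128))

scores = habitat_similarity_scores(
    occurrences, np.vstack([similar_sites, unrelated_sites])
)
```

In this toy setup the first three candidates, which share the occurrences' underlying direction, score far higher than the three unrelated ones. Averaging into a single centroid is a design choice that suits a species with one habitat type; for a species found in several distinct habitats, taking the maximum similarity over individual occurrences may be a better fit.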
My first approach was simple: