In this section, we will explore various spatial data formats, operations, and visualization methods. As before, there is a large landscape of tools and platforms designed to work with spatial data. We will be focusing on a subset of tools now emerging at the leading edge of cloud-native geospatial computing.
Geospatial computation can be divided into three components:
- Data I/O (input/output)
- Data manipulation / analysis
- Data visualization
Data I/O¶
We divide geospatial data[1] into two broad classes of data:
- raster data
- vector data
Raster data, also called “image” data or gridded data, are made up of pixels. Vector data is made up of spatial coordinates expressing shapes as points, lines, or polygons.
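As a minimal, purely illustrative sketch in plain Python (with made-up values), the two classes of data might look like this:

```python
# A raster is a regular grid of pixel values (here, a tiny 3x3 grid):
raster = [
    [0, 1, 1],
    [0, 1, 0],
    [0, 0, 0],
]

# Vector data stores spatial coordinates directly
# (here, as longitude/latitude pairs):
point = (-72.9, 41.3)                      # a single location
line = [(-72.9, 41.3), (-72.8, 41.4)]      # an ordered path
ring = [(-72.9, 41.3), (-72.8, 41.4),
        (-72.85, 41.2), (-72.9, 41.3)]     # a closed polygon boundary
```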
import ibis
from ibis import _
con = ibis.duckdb.connect()
Working with Vector data¶
As with tabular data, our first task is to read our data object from a file into python. Spatial vector data can be found in a bewildering array of formats or file types (or “serializations”). Because some common platforms can only understand a few of these formats, it is not uncommon to see spatial data providers offer the same data in multiple formats. Fortunately for us, they can more-or-less all be read in the same way.
Just as pandas has historically been the go-to for reading in tabular data, geopandas is the most common mechanism for reading spatial data into python. Built on top of pandas, it faces the same limitations, especially when it comes to very big geospatial data. Rather than teaching you different tools to work with ‘small’ and ‘big’ data, thanks to recent innovations we will be able to start off learning the faster, more scalable ‘big data’ solutions provided by duckdb and accessible to us in ibis.
Let’s begin by looking at the redlining data from the Mapping Inequality Project, https://dsl.richmond.edu/panorama/redlining/. ibis[2] can read this data directly from the URL and without downloading the entire file. Use a two-finger click on the download button to copy-paste the download URL into your notebook, like so:
%%time
redlines = con.read_geo("/vsicurl/https://dsl.richmond.edu/panorama/redlining/static/mappinginequality.gpkg")
CPU times: user 542 ms, sys: 110 ms, total: 653 ms
Wall time: 1.67 s
Note how we have added the prefix /vsicurl/ in the URL. This signifies to the underlying software (GDAL) that we wish to use the virtual file system interface to access this URL, i.e. rather than first copying the file to the local machine, we access just the bytes we need over ‘range requests’. Previously we have seen ibis do this ‘range request’ trick automatically on URLs for text-based data, and in future this explicit prefix may not be necessary here either.
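To make the idea of a range request concrete, here is a sketch using only the Python standard library. We build an HTTP request that asks the server for just the first 100 bytes of the file; GDAL’s /vsicurl/ driver issues requests like this internally. The snippet only constructs the request rather than sending it:

```python
import urllib.request

url = "https://dsl.richmond.edu/panorama/redlining/static/mappinginequality.gpkg"

# The 'Range' header asks the server for a byte slice instead of the whole file.
req = urllib.request.Request(url, headers={"Range": "bytes=0-99"})

# Calling urllib.request.urlopen(req).read() would now return at most 100 bytes,
# provided the server supports range requests (most static file servers do).
```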
#%%time
#without_vsi = con.read_geo("https://dsl.richmond.edu/panorama/redlining/static/mappinginequality.gpkg")
We can now use our ibis functions to explore this data. What columns are in this data?
redlines.columns
['area_id',
'city',
'state',
'city_survey',
'category',
'grade',
'label',
'residential',
'commercial',
'industrial',
'fill',
'geom']
Let’s sample a few city names from the data:
redlines.select(_.city).distinct().head(10).execute()
or select all the data for a specific city:
city = redlines.filter(_.city == "New Haven")
As before, we can use the .execute() method to materialize our data into memory. Only instead of getting a pandas object, because the data is spatial we get a geopandas object back.
city_gdf = city.head().execute()
city_gdf
Geometries and WKT¶
At first glance this looks like a familiar pandas table. But the last column, geom, looks a little different, with content like MULTIPOLYGON (((-72.9 41.36885, -72.90325 41.3…. These text strings follow a specific standardized format for representing shapes called “Well Known Text” (WKT). These shapes (points, lines, polygons, as well as disjoint versions such as multipoint, multiline, and multipolygon) are collectively referred to as “features”.
Fun fact: if asked for a single feature, geopandas will actually draw it:
city_gdf.geom[3]
We can coerce WKT into a plain string of text, and see that it is simply an ordered list of X Y pairs of points, like a connect-the-dots puzzle (in this case, in Longitude-Latitude pairs):
str(city_gdf.geom[3])
'MULTIPOLYGON (((-72.90284 41.35795, -72.8999 41.35712, -72.89788 41.36147, -72.89564 41.36069, -72.89401 41.36331, -72.90102 41.36493, -72.90073 41.36396, -72.90078 41.36333, -72.90062 41.36305, -72.90069 41.36266, -72.90132 41.36172, -72.90178 41.36076, -72.90284 41.35795)))'
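Because WKT is just text, we can pull the coordinate pairs out of it with ordinary string tools. Here is a rough sketch using a regular expression (not a full WKT parser; the helper name wkt_coords is ours, for illustration only):

```python
import re

def wkt_coords(wkt):
    """Extract (x, y) coordinate pairs from a WKT string (illustration only)."""
    return [(float(x), float(y))
            for x, y in re.findall(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)", wkt)]

wkt = "MULTIPOLYGON (((-72.90284 41.35795, -72.8999 41.35712, -72.90284 41.35795)))"
coords = wkt_coords(wkt)
# The first and last pair coincide: polygon rings are closed loops.
```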
Spatial operations¶
In addition to the standard operations ibis can perform on tabular data, the spatial extension provides an extensive collection of operations specific to spatial geometries. The ibis documentation provides a table of all available methods.[3] For instance, a common task is to compute the area inside each polygon.
city = (redlines
.filter(_.city == "New Haven")
.mutate(area = _.geom.area())
)
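For planar coordinates, polygon area boils down to the classic shoelace formula. Here is a pure-Python sketch of the idea (note: the real computation in duckdb works in the units of the data’s coordinate reference system, so for raw longitude-latitude data the result is in squared degrees, a subtlety we return to later):

```python
def shoelace_area(ring):
    """Planar area of a polygon given its ring of (x, y) vertices (shoelace formula)."""
    area = 0.0
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]   # wrap around to close the ring
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

# A 2x2 square has area 4:
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
shoelace_area(square)  # → 4.0
```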
[1] The same concepts exist in computer graphics, e.g. png vs svg. Unlike computer graphics formats, which typically represent images in a ‘flat’ (Euclidean) geometry, geospatial data must contend with the curvature of the earth using an explicit notion of coordinate reference systems. We will have more to say about this later.
[2] Good abstractions are at the core of good software design. ibis isn’t really doing the heavy lifting here; it is merely providing a user-friendly ‘pythonic’ interface to many different “backends” that perform the actual computation. Only one of those, duckdb, provides the spatial abilities we need. When it comes to reading different file formats, duckdb achieves this only by calling out to another piece of software, GDAL, which has been around a lot longer and can be found under the hood of a great many geospatial data tools, from python packages like geopandas to QGIS to Google Earth Engine.
[3] Some additional methods are possible in the underlying duckdb spatial extension but do not yet have bindings in ibis. ibis provides an extension syntax that allows users to invoke these methods.