ibis Table Joins - Data Science for Global Change Ecology

Learning Goals

use join() to combine two tables on a key column

import ibis
from ibis import _
import ibis.selectors as s

con = ibis.duckdb.connect()

Last time we started getting comfortable with lazy evaluation (head() and execute()) in ibis, and began to learn how to select() (subset columns) and filter() (subset rows), as well as looking at distinct values. Today we will continue to draw on these skills as we go deeper into the fisheries data in search of the evidence of the North Atlantic Cod collapse. In the process, we shall pick up some new methods as well.

As before, let’s start by reading in data. Rather than focus on the metrics table, this time we will connect to several tables at the same time. Note how we can reuse base_url to avoid extra typing, but take care that we are reading the right CSV file in each case! As before, we also explicitly set the nullstr value to ensure that missing value codes are interpreted correctly.

base_url = "https://huggingface.co/datasets/cboettig/ram_fisheries/resolve/main/v4.65/"

stock = con.read_csv(base_url + "stock.csv", nullstr="NA")
timeseries = con.read_csv(base_url + "timeseries.csv", nullstr="NA")
assessment = con.read_csv(base_url + "assessment.csv", nullstr="NA")

Fish ‘stocks’

Like most real-world data science problems, understanding these tables requires both a bit of background in fisheries science and a lot of spelunking into the data. For our purposes, one of the key things you should know is that fisheries are divided into “stocks”, which you can think of as a particular species of fish in a particular area of the ocean. Let’s use the stock table to explore this idea a bit more, beginning with a peek at the table itself:

stock

Ah! commonname looks like as good a place as any to go looking for Atlantic cod. Of course, if we knew (or looked up) the scientific name of the species, that might be even better -- after all, common names are not always as precise. Let’s see what we can find:

(stock
 .filter(_.commonname == "Atlantic cod")
 .select(_.stockid, _.scientificname, _.commonname, 
         _.areaid, _.region, _.primary_country, _.ISO3_code)
 .head()
 .execute()
)

Lots of stocks of Atlantic cod! Each row begins with a unique stockid. A column that uniquely identifies each row in a given table is often referred to as the “primary key” for that table (and is often, but not necessarily, listed first). The columns that follow give us some sense of what defines a “stock” as a species in an area: we see a few different identifiers for the species, commonname and scientificname. We also see information about the area the stock occurs in, such as areaid, region, and primary_country. (For display purposes we selected only a subset of columns.)
While we have found the Cod, we haven’t yet found any data about the cod catch over time! For that we will need to look in the timeseries data. Let’s see how it is organized:

timeseries.head().execute()

We again have a column called stockid. While we no longer have columns such as commonname or scientificname to tell us what species each row in the timeseries is measuring, we now know that we can look up that information in the stock table using the stockid. Such a column is often called a “foreign key”, because it matches the primary key of a separate table. (It appears the timeseries data has no “primary key” of its own: no single column has a unique value for each row.) Rather than having to switch back and forth between two tables, we can join the two tables on stockid:

(stock
 .filter(_.commonname == "Atlantic cod")
 .join(timeseries, "stockid")
 .head()
 .select(_.stockid, _.scientificname, _.tsid, _.tsyear, 
         _.tsvalue, _.stocklong, _.stocklong_right) # subset of columns to keep display narrow
 .execute()
)

Effectively, all this has done is take our timeseries table and, for each stockid, add extra columns explaining what the stock table tells us about that stockid: species names, areas, and so on. The join has made our data much wider than before -- we have all the columns from both tables. (Note that both tables happened to have one column with the same name, stocklong. A truly tidy database would not have done this -- we can easily see that this information belongs in the stock table. Because our database cannot assume these are the same when we join, it has renamed the one on the “right” (from timeseries) as “stocklong_right” to distinguish them.) Because each stockid was repeated in the timeseries table, now all this other information is repeated too. This is not as inefficient as it may sound, thanks to internal optimizations in the database.
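To make the mechanics concrete, here is a conceptual sketch of an inner join written in plain Python on lists of dictionaries. It is only meant to illustrate the behavior described above (matching on the key, widening the rows, renaming a clashing column with a "_right" suffix); it is not how ibis or DuckDB implement joins, and the rows are made up.

```python
def inner_join(left_rows, right_rows, key):
    """Conceptual inner join of two lists of dicts on a shared key column."""
    # Index the right-hand rows by key so that matches are quick to look up.
    right_by_key = {}
    for row in right_rows:
        right_by_key.setdefault(row[key], []).append(row)

    joined = []
    for left in left_rows:
        # Left rows without a match contribute nothing: the join is "inner".
        for right in right_by_key.get(left[key], []):
            combined = dict(left)
            for name, value in right.items():
                if name == key:
                    continue  # the key column appears only once
                # A clashing non-key column from the right gets a suffix.
                new_name = name + "_right" if name in left else name
                combined[new_name] = value
            joined.append(combined)
    return joined

stock_rows = [
    {"stockid": "COD1", "commonname": "Atlantic cod", "stocklong": "Cod area 1"},
    {"stockid": "COD2", "commonname": "Atlantic cod", "stocklong": "Cod area 2"},
]
timeseries_rows = [
    {"stockid": "COD1", "tsyear": 1990, "tsvalue": 10.0, "stocklong": "Cod area 1"},
    {"stockid": "COD1", "tsyear": 1991, "tsvalue": 8.0, "stocklong": "Cod area 1"},
    {"stockid": "HAD1", "tsyear": 1990, "tsvalue": 3.0, "stocklong": "Haddock area 1"},
]

for row in inner_join(stock_rows, timeseries_rows, "stockid"):
    print(row)
```

In this toy data, COD2 has no timeseries rows and HAD1 has no stock row, so neither appears in the output, while the stock information for COD1 is repeated once per matching timeseries row.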

While it is clear even from this head() preview that we have the columns from both tables, what about the rows? Our stock table was already filtered to a subset of rows containing only Cod stocks. This join (technically called an “inner join”) has kept only those stockids, so we now have timeseries only about Cod! In fact, we could have instead joined the full tables for all stock ids, and then applied the filter for commonname.

Exercise

Try further exploring this resulting table using select() and distinct() to get a better sense of what rows are here. You will notice additional “*id” columns, like assessid or areaid, matching other tables in the data. Explore filtering and joining with these tables as well.
