`ibis` Table Joins¶

Learning Goals¶

use join() to combine two tables on a key column

import ibis
from ibis import _
import ibis.selectors as s

con = ibis.duckdb.connect()

Last time we started getting comfortable with lazy evaluation (head() and execute()) in ibis, and began to learn how to select() (subset columns) and filter() (subset rows), as well as looking at distinct values. Today we will continue to draw on these skills as we go deeper into the fisheries data in search of the evidence of the North Atlantic Cod collapse. In the process, we shall pick up some new methods as well.

As before, let’s start with reading in data. Rather than focus on the metrics table, this time we will connect to several tables at the same time. Note how we can reuse the base_url to avoid extra typing, but take care that we reading the right CSV file in each case! As before, we explicitly set the nullstr value as well to ensure missing value codes are correctly interpreted.

base_url = "https://huggingface.co/datasets/cboettig/ram_fisheries/resolve/main/v4.65/"

stock = con.read_csv(base_url + "stock.csv", nullstr="NA")
timeseries = con.read_csv(base_url + "timeseries.csv", nullstr="NA")
assessment = con.read_csv(base_url + "assessment.csv", nullstr="NA")

Fish ‘stocks’¶

Like most real world data science problems, understanding these tables requires both a bit of background in fisheries science and a lot of splunking into the data. For our purposes, one of the key things you should know is that fisheries are divided into “stocks”, which you can think of as a particular species of fish in a particular area of the ocean. Let’s use the stock table to explore this idea a bit more. Let’s begin with a peek at the stock table:

stock

DatabaseTable: ibis_read_csv_v4fblezl6jahlpkem4wb2wqefq
  stockid         string
  tsn             int64
  scientificname  string
  commonname      string
  areaid          string
  stocklong       string
  region          string
  primary_country string
  primary_FAOarea int64
  ISO3_code       string
  GRSF_uuid       string
  GRSF_areaid     string
  inmyersdb       int64
  myersstockid    string
  state           string

Ah! commonname looks like a good place as any to go looking for Atlantic cod. Of course if we knew (or looked up) the scientific name of the species, that might be even better – after all, common names are not always as precise. Let’s see what we can find:

(stock
 .filter(_.commonname == "Atlantic cod")
 .select(_.stockid, _.scientificname, _.commonname, 
         _.areaid, _.region, _.primary_country, _.ISO3_code)
 .head()
 .execute()
)

	stockid	scientificname	commonname	areaid	region	primary_country	ISO3_code
0	COD1ABCDE	Gadus morhua	Atlantic cod	multinational-ICES-1ABCDE	Canada East Coast	Greenland	GRL
1	COD1F-XIV	Gadus morhua	Atlantic cod	multinational-ICES-1F-XIV	Europe non EU	Greenland	GRL
2	COD1IN	Gadus morhua	Atlantic cod	multinational-ICES-1IN	Canada East Coast	Greenland	GRL
3	COD2J3KL	Gadus morhua	Atlantic cod	Canada-DFO-2J3KL	Canada East Coast	Canada	CAN
4	COD3M	Gadus morhua	Atlantic cod	multinational-NAFO-3M	Canada East Coast	Portugal	PRT

Lots of stocks of Atlantic cod! Each row begins with a unique stockid. A column that uniquely identifies each row in a given table is often referred to as the “primary key” for that table (and is often but not necessarily listed first). The rows that follow give us some sense of what defines a “stock” as a species in an area: we see a few different identifiers for the species: commonname, scientificname. We also see information abot the area the stock occurs in – such as areaid, region, and primary country. (For display purposes we selected only a subset of columns).
While we have found the Cod, we haven’t yet found any data about the cod catch over time! For that we will need to look in the timeseries data. Let’s see how it is organized:

timeseries.head().execute()

	assessid	stockid	stocklong	tsid	tsyear	tsvalue
0	ABARES-BGRDRSE-1960-2011-CHING	BGRDRSE	Blue grenadier Southeast Australia	CdivMEANC-ratio	1960	NaN
1	ABARES-BGRDRSE-1960-2011-CHING	BGRDRSE	Blue grenadier Southeast Australia	CdivMEANC-ratio	1961	NaN
2	ABARES-BGRDRSE-1960-2011-CHING	BGRDRSE	Blue grenadier Southeast Australia	CdivMEANC-ratio	1962	NaN
3	ABARES-BGRDRSE-1960-2011-CHING	BGRDRSE	Blue grenadier Southeast Australia	CdivMEANC-ratio	1963	NaN
4	ABARES-BGRDRSE-1960-2011-CHING	BGRDRSE	Blue grenadier Southeast Australia	CdivMEANC-ratio	1964	NaN

We again have a column called stockid. While we no longer have columns such as commonname or scientificname to tell us what species each row in the timeseries is measuring, we now know that we can look up that information in the stock table using the stockid. Such a column is often called a “foreign key”, because it matches the primary key of a separate table. (it appears the timeseries data has no ‘primary key’ of it’s own – no column that has a unique value for each row.). Rather than have to switch back and forth between two tables, we can join the two tables on stockid:

(stock
 .filter(_.commonname == "Atlantic cod")
 .join(timeseries, "stockid")
 .head()
 .select(_.stockid, _.scientificname, _.tsid, _.tsyear, 
         _.tsvalue, _.stocklong, _.stocklong_right) # subset of columns to keep display narrow
 .execute()
)

	stockid	scientificname	tsid	tsyear	tsvalue	stocklong	stocklong_right
0	COD4VsW	Gadus morhua	CdivMEANC-ratio	1958	0.997804	Atlantic cod Eastern Scotian Shelf	Atlantic cod Eastern Scotian Shelf
1	COD4VsW	Gadus morhua	CdivMEANC-ratio	1959	1.706091	Atlantic cod Eastern Scotian Shelf	Atlantic cod Eastern Scotian Shelf
2	COD4VsW	Gadus morhua	CdivMEANC-ratio	1960	1.308003	Atlantic cod Eastern Scotian Shelf	Atlantic cod Eastern Scotian Shelf
3	COD4VsW	Gadus morhua	CdivMEANC-ratio	1961	1.713846	Atlantic cod Eastern Scotian Shelf	Atlantic cod Eastern Scotian Shelf
4	COD4VsW	Gadus morhua	CdivMEANC-ratio	1962	1.685411	Atlantic cod Eastern Scotian Shelf	Atlantic cod Eastern Scotian Shelf

Effectively all this has done is take our timeseries table and for each stockid, add extra columns explaining what the stock table tells us about the stockid - species names, areas, and so on. The join has made our data is much wider than before – we have all the columns from both tables. (Note that both tables happened to have one column with the same name, stocklong. A truly tidy database would not have done this – we can easily see that this information belongs in the stock table. Because our database cannot assume these are the same when we join, it has renamed the one on the “right” (from timeseries) as “stocklong_right” to distinguish them). Because each stockid was repeated in the timeseries table, now all this other information is repeated too. This is not as inefficient as it may sound, thanks to internal optimizations in the database.

While it is clear even from this head() preview that we have the columns from both tables, what about the rows? Our stock table was already filtered to a subset of rows containing only Cod stocks. This join (technically called an “inner join”) has kept only those stockids, so we now have timeseries only about Cod! In fact, we could have instead joined the full tables for all stock ids, and then applied the filter for commonname.

Exercise¶

Try further exploring this resulting table using select() and distinct() to get a better sense of what rows are here. You will notice additional “*id” columns, like asssesid or areaid matching other tables in the data. Explore filtering and joinging with these tables as well.

assessment

DatabaseTable: ibis_read_csv_52rlwfdjjbf4rffp5d34j4cuxa
  assessid       string
  assessorid     string
  stockid        string
  stocklong      string
  recorder       string
  daterecorded   timestamp(6)
  dateloaded     timestamp(6)
  assessyear     string
  assesssource   string
  contacts       string
  notes          string
  pdffile        string
  assess         int64
  refpoints      int64
  assessmethod   string
  assesscomments string
  xlsfilename    string
  mostrecent     int64

assessment.group_by(_.stockid).agg(n=_.count()).order_by(_.n.desc()).execute()

	stockid	n
0	CODIIIaW-IV-VIId	12
1	POLLIEG	11
2	HAKENRTN	11
3	HERRNIRS	11
4	HERRNS-IIIa-VIId	11
...	...	...
1507	SSALMLWASH	1
1508	STROUT22-32	1
1509	BLRAYVIIe	1
1510	SGRAYVI-VII	1
1511	PLAICIIIa	1

1512 rows × 2 columns

assessment.group_by(_.stockid).order_by(_.assessyear).agg(assessid=_.assessid.last()).filter(_.stockid == "CODIIIaW-IV-VIId").execute()

	stockid	assessid
0	CODIIIaW-IV-VIId	WGNSSK-CODIIIaW-IV-VIId-1963-2021-ICESIMP2021-2

timeseries

DatabaseTable: ibis_read_csv_qoxovyrtjfbifanq72wnhpfbau
  assessid  string
  stockid   string
  stocklong string
  tsid      string
  tsyear    int64
  tsvalue   float64

stock

DatabaseTable: ibis_read_csv_v4fblezl6jahlpkem4wb2wqefq
  stockid         string
  tsn             int64
  scientificname  string
  commonname      string
  areaid          string
  stocklong       string
  region          string
  primary_country string
  primary_FAOarea int64
  ISO3_code       string
  GRSF_uuid       string
  GRSF_areaid     string
  inmyersdb       int64
  myersstockid    string
  state           string

ibis Table Joins¶

Learning Goals¶

Fish ‘stocks’¶

Exercise¶

`ibis` Table Joins¶