`ibis` Single Table Verbs¶

We will be focusing on how to use the ibis package, a successor to the popular pandas package, for manipulating tabular data. We begin by importing the ibis package. (We include two additional imports from the package which are commonly referred to using their short names, the table placeholder _ instead of ibis._, and the selectors methods as s instead of the verbose ibis.selectors. We will see these in action later).

Learning Goals¶

establish a connection with duckdb.connect()
use head() and excute() to preview large data
use select(), distinct(), filter() to explore data.

Getting started¶

To use ibis, we must also select a backend. We will always be using the quite new and very powerful duckdb backend for all of our tasks. We select a backend by creating a “connection”. The details here are not important for us, we can treat this first block as “boilerplate” starting code.

import ibis
from ibis import _
import ibis.selectors as s

con = ibis.duckdb.connect()

We are now ready to read in our data. We will begin by reading the metrics table from the direct access link, as indicated in the URL below. con.read_csv() is quite similar to the pandas.read_csv() we saw in module 1, though the optional arguments get some different names and are not quite as flexibile. One important option for our purposes will be the how to indicate missing values. In the past, we’ve seen negative values like -99 be used to indicate missing values. That convention reflects limitations of early software, which had no natural concept of “missing”. More modern conventions indicating missing values as “NULL” or “NA”. We indicate the data has chosen the latter:

metrics_url = "https://huggingface.co/datasets/cboettig/ram_fisheries/resolve/main/v4.65/tsmetrics.csv"
tsmetrics = con.read_csv(metrics_url, nullstr="NA")

Previewing data: `head()` and `execute()`¶

Let’s take a look at our new table:

tsmetrics

DatabaseTable: ibis_read_csv_ienvx6pgrzgavp6a6pjm3si4mu
  tscategory   string
  tsshort      string
  tslong       string
  tsunitsshort string
  tsunitslong  string
  tsunique     string

This doesn’t look like a pretty pandas table! Where are the values? Actually, as we become more familiar with ibis we learn to appreciate the display choice here. ibis is designed for working with very big data. An important part of this is something called lazy evaluation. Even downloading a very large fle might take a long time, and trying to load a large dataset into python all at once can exceed available RAM and crash the kernel. Instead, ibis merely “peeks” at the data over the remote connection – without even downloading it! It tells us the names of each column and the data type (e.g. string, or numeric, etc) that the read_csv method has ‘guessed’ for the data. As we will see, this is often the most useful information anway.

If we we do want to see a few example rows, we can use the method head() on the table, tsmetrics.head(), to say we want only want to see the top of the data frame. Optionally we can specify how many rows we want to preview, e.g. tsmetrics.head(10) to see 10 (the default is 5). Let’s try it:

tsmetrics.head()

r0 := DatabaseTable: ibis_read_csv_ienvx6pgrzgavp6a6pjm3si4mu
  tscategory   string
  tsshort      string
  tslong       string
  tsunitsshort string
  tsunitslong  string
  tsunique     string

Limit[r0, n=5]

That’s not the top of 5 rows! Once again, ibis is being lazy. We see the same definition of the table as before, only this time it has a name r0, and we see a “plan of execution”, that ibis will return the first 5 rows Limit[r0, 5]. We can force it to execute this plan with execute() :

tsmetrics.head().execute()

	tscategory	tsshort	tslong	tsunitsshort	tsunitslong	tsunique
0	OTHER TIME SERIES DATA	AQ	Aquaculture	MT	metric tons	AQ-MT
1	OTHER TIME SERIES DATA	ASP	Annual surplus production	MT	Metric tons	ASP-MT
2	TOTAL BIOMASS	BdivBmgtpref	General biomass time series preferentially rel...	dimensionless	dimensionless	BdivBmgtpref-dimensionless
3	TOTAL BIOMASS	BdivBmgttouse	General biomass time series relative to manage...	dimensionless	dimensionless	BdivBmgttouse-dimensionless
4	TOTAL BIOMASS	BdivBmsypref	General biomass time series preferentially rel...	dimensionless	dimensionless	BdivBmsypref-dimensionless

At last, we are starting to see what the data really looks like. Data tables can quickly become much to large to explore by simply trying to eyeball every row. For instance, we notice the first column, tscategory, shows a few different possible categories for the various metrics in the database. So, how many distinct categories are there?

`select()` and `distinct()`¶

To answer this, we will introduce a few more methods of data table manipulation. select() selects one or more columns of a given table, while distinct() returns only distinct (unique) rows of the table. Note that both of these methods share a common pattern – they both apply to a table (not some piece of a table, like a row or column or cell), and they both return a new table as well that is some subset of the old table. table in, table out. This design is very intentional – by having methods designed specificially to operate on tables and return tables, we can easily stack or chain these together, (also true of head() and execute()`. So let’s try and see distinct categories:

(tsmetrics
 .select("tscategory")
 .distinct()
 .head(10)
 .execute()
)
 

	tscategory
0	CATCH or LANDINGS
1	FISHING MORTALITY
2	OTHER TIME SERIES DATA
3	SPAWNING STOCK BIOMASS or CPUE
4	PRODUCTION
5	TOTAL BIOMASS
6	RECRUITS (NOTE: RECRUITS ARE OFFSET IN TIME SE...
7	TIME UNITS

Note that we have stacked these methods together with each step on it’s own line by wrapping the whole thing inside () parentheses. This can make a long “chain” of commands easier to read. While we have asked for no more that 10 values, we have gotten back only 8 – so we now know there are only 8 categories.

`filter()`¶

What are the different unit types within, say, the “CATCH or LANDINGS” category? This requires a subset of rows (a filter), rather than a subset of columns (select()):

(tsmetrics
 .filter(_.tscategory == "CATCH or LANDINGS")
 .distinct()
 .head(14)
 .execute()
)

	tscategory	tsshort	tslong	tsunitsshort	tsunitslong	tsunique
0	CATCH or LANDINGS	CdivMSY-conv	Catch divided by MSY calculated from converted...	ratio	ratio	CdivMSY-conv-ratio
1	CATCH or LANDINGS	CdivMSY	Catch divided by MSY	ratio	ratio	CdivMSY-ratio
2	CATCH or LANDINGS	TAC-A2	Total allowable catch In subarea	MT	Metric tons	TAC-A2-MT
3	CATCH or LANDINGS	TC-3	Total catch. Use only when there is more than ...	E03	Thousands	TC-3-E03
4	CATCH or LANDINGS	TL-1	Total landings. Use only when there is more th...	E03MT	Thousands of metric tons	TL-1-E03MT
5	CATCH or LANDINGS	TL-A1	Total landings in subarea	MT	Metric tons	TL-A1-MT
6	CATCH or LANDINGS	TL-A3	Total landings in subarea	MT	Metric tons	TL-A3-MT
7	CATCH or LANDINGS	CdivMSY-dvmb	Catch divided by MSY calculated from reference...	ratio	ratio	CdivMSY-dvmb-ratio
8	CATCH or LANDINGS	Cpair-CMB	Catch or landings that is paired with TAC comb...	MT	Metric tons	Cpair-CMB-MT
9	CATCH or LANDINGS	Cpair	Catch or landings that is paired with TAC	MT	Metric tons	Cpair-MT
10	CATCH or LANDINGS	RecC	Recreational catch	E00	Individuals	RecC-E00
11	CATCH or LANDINGS	TAC-2	Total allowable catch	MT	Metric tons	TAC-2-MT
12	CATCH or LANDINGS	TAC-A3	Total allowable catch In subarea	MT	Metric tons	TAC-A3-MT
13	CATCH or LANDINGS	TAC-CMB	Total allowable catch combined beyond just thi...	MT	Metric tons	TAC-CMB-MT

This syntax to subset rows (filter) is more complicated than columns (select) – to find rows containing “CATCH or LANDINGS” we have to indicate which column to look for.

column selection and `.`¶

For python to know that we are looking for the column called “tscategory”, we use the column selection _.tscategory. This is actually a shorthand for the pattern tsmetrics.tscategory – the _ is a placeholder for “the current table” in our chain. Extracting a single column with . is itself something of a shorthand, it is equivalent to using the selector [, as tsmetrics["category"]. When a column name is also the name of a table method, we may need to fall back on the square bracket convention. So why use a dot at all? In addition to taking two less characters to write, the . method allows “tab completion” of the column name, which helps us avoid typos. Note that our select() method recognizes either syntax, you can do: tsmetrics.select(_.tscategory). This looks slightly more cryptic, but benefits from autocomplete and matches the sytnax of other functions.

`==` not `=`¶

Another common mistake is to use a single = sign rather than == in filter. Recall that = is used in variable assignment, a = 1 sets the value of a as 1. Double-equals is a “boolean operator”, that tests if the statement is True or False:

a = 1
a == 1

True

Other boolean operators include >, >=, != (not equal) and so forth. The important thing is to know that we can do boolean comparisons, this syntax is easy to look up.

Next steps¶

Explore the datasets in this collection using select(), distinct() and filter(). We will return to this list of Catch units after we become more aquainted with the remaining tables.

This syntax is harder than select() – we can’t just filter for “CATCH or LANDINGS” without indicating which column we are looking in. To signal that tscategory is a column name and not just a piece of text, we use the subsetting notation, _.tscategory. This is merely a shorthand for the more verbose:

ibis Single Table Verbs¶

Learning Goals¶

Getting started¶

Previewing data: head() and execute()¶

select() and distinct()¶

filter()¶

column selection and .¶

== not =¶