We will be focusing on how to use the `ibis` package, a successor to the popular `pandas` package, for manipulating tabular data. We begin by importing the `ibis` package. (We also include two additional imports from the package that are commonly referred to by their short names: the table placeholder `_` instead of `ibis._`, and the selectors module as `s` instead of the more verbose `ibis.selectors`. We will see these in action later.)
## Learning Goals

- establish a connection with `duckdb.connect()`
- use `head()` and `execute()` to preview large data
- use `select()`, `distinct()`, and `filter()` to explore data
## Getting started
To use `ibis`, we must also select a backend. We will always be using the quite new and very powerful `duckdb` backend for all of our tasks. We select a backend by creating a "connection". The details here are not important for us; we can treat this first block as "boilerplate" starting code.
```python
import ibis
from ibis import _
import ibis.selectors as s

con = ibis.duckdb.connect()
```
We are now ready to read in our data. We will begin by reading the metrics table from the direct access link indicated in the URL below. `con.read_csv()` is quite similar to the `pandas.read_csv()` we saw in module 1, though the optional arguments get some different names and are not quite as flexible. One important option for our purposes will be how to indicate missing values. In the past, we've seen negative values like `-99` used to indicate missing values. That convention reflects limitations of early software, which had no natural concept of "missing". More modern conventions indicate missing values as "NULL" or "NA". We indicate that this dataset has chosen the latter:
```python
metrics_url = "https://huggingface.co/datasets/cboettig/ram_fisheries/resolve/main/v4.65/tsmetrics.csv"
tsmetrics = con.read_csv(metrics_url, nullstr="NA")
```
## Previewing data: `head()` and `execute()`
Let’s take a look at our new table:
```python
tsmetrics
```
This doesn't look like a pretty `pandas` table! Where are the values? Actually, as we become more familiar with `ibis` we learn to appreciate the display choice here. `ibis` is designed for working with very big data. An important part of this is something called lazy evaluation. Even downloading a very large file might take a long time, and trying to load a large dataset into python all at once can exceed available RAM and crash the kernel. Instead, `ibis` merely "peeks" at the data over the remote connection -- without even downloading it! It tells us the names of each column and the data type (e.g. string, or numeric, etc.) that the `read_csv()` method has 'guessed' for the data. As we will see, this is often the most useful information anyway.
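If we only want the column names or the inferred types, we can also ask for them directly; a minimal sketch (neither call downloads any rows):

```python
# Peek at the table's structure without pulling any rows over the connection.
tsmetrics.columns    # the list of column names
tsmetrics.schema()   # column names together with their inferred data types
```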
If we do want to see a few example rows, we can use the method `head()` on the table, `tsmetrics.head()`, to say we only want to see the top of the data frame. Optionally we can specify how many rows we want to preview, e.g. `tsmetrics.head(10)` to see 10 (the default is 5). Let's try it:
```python
tsmetrics.head()
```
That's not the top 5 rows! Once again, `ibis` is being lazy. We see the same definition of the table as before, only this time it has a name, `r0`, and we see a "plan of execution": ibis will return the first 5 rows, `Limit[r0, 5]`. We can force it to execute this plan with `execute()`:
```python
tsmetrics.head().execute()
```
At last, we are starting to see what the data really looks like. Data tables can quickly become much too large to explore by simply trying to eyeball every row. For instance, we notice the first column, `tscategory`, shows a few different possible categories for the various metrics in the database. So, how many distinct categories are there?
## `select()` and `distinct()`
To answer this, we will introduce a few more methods of data table manipulation. `select()` selects one or more columns of a given table, while `distinct()` returns only the distinct (unique) rows of the table. Note that both of these methods share a common pattern -- they both apply to a table (not some piece of a table, like a row or column or cell), and they both return a new table that is some subset of the old table. Table in, table out. This design is very intentional -- by having methods designed specifically to operate on tables and return tables, we can easily stack or chain them together (this is also true of `head()` and `execute()`). So let's try it and see the distinct categories:
```python
(tsmetrics
  .select("tscategory")
  .distinct()
  .head(10)
  .execute()
)
```
Note that we have stacked these methods together, with each step on its own line, by wrapping the whole thing inside `()` parentheses. This can make a long "chain" of commands easier to read. While we asked for no more than 10 values, we got back only 8 -- so we now know there are only 8 categories.
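If we wanted the exact count rather than reading it off the preview, we could end the same chain with `count()` instead; a quick sketch of that variation:

```python
# Count the distinct categories directly; count() follows the same lazy,
# chainable pattern, and execute() returns a single number.
(tsmetrics
  .select("tscategory")
  .distinct()
  .count()
  .execute()
)
```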
## `filter()`
What are the different unit types within, say, the "CATCH or LANDINGS" category? This requires a subset of rows (a filter), rather than a subset of columns (`select()`):
```python
(tsmetrics
  .filter(_.tscategory == "CATCH or LANDINGS")
  .distinct()
  .head(14)
  .execute()
)
```
This syntax for subsetting rows (filter) is more complicated than for columns (select) -- to find rows containing "CATCH or LANDINGS" we have to indicate which column to look in.
## Column selection and `.`
For python to know that we are looking for the column called "tscategory", we use the column selection `_.tscategory`. This is actually a shorthand for the pattern `tsmetrics.tscategory` -- the `_` is a placeholder for "the current table" in our chain. Extracting a single column with `.` is itself something of a shorthand: it is equivalent to using square brackets, as in `tsmetrics["tscategory"]`. When a column name is also the name of a table method, we may need to fall back on the square-bracket convention. So why use a dot at all? In addition to taking two fewer characters to write, the `.` method allows "tab completion" of the column name, which helps us avoid typos. Note that our `select()` method recognizes either syntax; you can do `tsmetrics.select(_.tscategory)`. This looks slightly more cryptic, but benefits from autocomplete and matches the syntax of other functions.
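To make these equivalences concrete, here is a small illustrative sketch; each line builds the same lazy single-column expression:

```python
# Three equivalent ways to refer to the tscategory column inside select():
tsmetrics.select(_.tscategory)             # deferred placeholder (tab-completes)
tsmetrics.select(tsmetrics.tscategory)     # explicit table.column attribute
tsmetrics.select(tsmetrics["tscategory"])  # square brackets work for any column name
```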
## `==` not `=`
Another common mistake is to use a single `=` sign rather than `==` in a filter. Recall that `=` is used in variable assignment: `a = 1` sets the value of `a` to 1. Double-equals is a "boolean operator" that tests whether the statement is True or False:
```python
a = 1
a == 1
```
True
Other boolean operators include `>`, `>=`, `!=` (not equal), and so forth. The important thing is to know that we can do boolean comparisons; the exact syntax is easy to look up.
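For instance, here is a sketch (on the same `tsmetrics` table) using `!=` to list every category *except* "CATCH or LANDINGS":

```python
# filter() accepts any boolean expression, not just ==; this keeps the rows
# whose category is NOT "CATCH or LANDINGS", then lists the remaining categories.
(tsmetrics
  .filter(_.tscategory != "CATCH or LANDINGS")
  .select("tscategory")
  .distinct()
  .execute()
)
```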
## Next steps
Explore the datasets in this collection using `select()`, `distinct()`, and `filter()`. We will return to this list of Catch units after we become more acquainted with the remaining tables.
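As one possible starting point, here is an illustrative sketch using the selectors module we imported as `s` at the top; `s.of_type("string")` picks columns by data type rather than by name (which columns come back depends on the table you point it at):

```python
# Select every string-typed column at once using a selector, then preview it.
(tsmetrics
  .select(s.of_type("string"))
  .head()
  .execute()
)
```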