ibis mutates and aggregates

Learning Goals¶

use group_by() and aggregate() patterns to summarize data
use order_by() to arrange rows by one or more columns.

import ibis
from ibis import _
import ibis.selectors as s

con = ibis.duckdb.connect()

base_url = "https://huggingface.co/datasets/cboettig/ram_fisheries/resolve/main/v4.65/"

stock = con.read_csv(base_url + "stock.csv", nullstr="NA")
timeseries = con.read_csv(base_url + "timeseries.csv", nullstr="NA")
assessment = con.read_csv(base_url + "assessment.csv", nullstr="NA")

Stock “assessments”¶

The next thing we need to know is a stock assessment.

assessment

Are some stockids assessed multiple times? One intuitive idea is to filter() for a single stockid, and see if we get back multiple rows (multiple assessments). Let’s take a look at “COD2J3KL”:

(assessment
 .filter(_.stockid == "COD2J3KL")
 .select(_.assessid, _.assessorid, _.stockid,
         _.daterecorded, _.assessyear) # pick a subset of columns to focus on
 .execute()
)

Indeed, it looks like their are four assessments of this stock, each conducted in different years and spanning different periods in time! filter()ing for each possible stockid would be tedious though. These four assessments that correspond to this stockid

(assessment
 .group_by(_.stockid)
 .agg(n=_.count())
 .execute()
)

`order_by()`¶

Which stockids have the most assessments? We can re-order the rows by different columns using the order_by(). (Changing the row order does not alter any individual row itself -- that would mess up the data. Each row is moved as a unit). By default, order is always increasing, smallest to largest, A to Z. While that might be intuitive for dates or names, if we want to see which stocks have the most assessments, we need n to be in descending order. We indicate this by appending the .desc() method to the column:

(assessment
 .group_by(_.stockid)
 .agg(n=_.count())
 .order_by(_.n.desc())
 .execute()
)

Textbook

ibis Table Joins

Textbook

ibis + seaborn.objects: Data Exploration

Learning Goals¶

Stock “assessments”¶

order_by()¶

`order_by()`¶