Learning Goals¶
- use
group_by()
andaggregate()
patterns to summarize data - use
order_by()
to arrange rows by one or more columns.
import ibis
from ibis import _
import ibis.selectors as s
con = ibis.duckdb.connect()
base_url = "https://huggingface.co/datasets/cboettig/ram_fisheries/resolve/main/v4.65/"
stock = con.read_csv(base_url + "stock.csv", nullstr="NA")
timeseries = con.read_csv(base_url + "timeseries.csv", nullstr="NA")
assessment = con.read_csv(base_url + "assessment.csv", nullstr="NA")
Stock “assessments”¶
The next thing we need to know is a stock assessment.
assessment
Are some stockid
s assessed multiple times? One intuitive idea is to filter()
for a single stockid, and see if we get back multiple rows (multiple assessments). Let’s take a look at “COD2J3KL”:
(assessment
.filter(_.stockid == "COD2J3KL")
.select(_.assessid, _.assessorid, _.stockid,
_.daterecorded, _.assessyear) # pick a subset of columns to focus on
.execute()
)
Indeed, it looks like their are four assessments of this stock, each conducted in different years and spanning different periods in time! filter()
ing for each possible stockid would be tedious though. These four assessments that correspond to this stockid
(assessment
.group_by(_.stockid)
.agg(n=_.count())
.execute()
)
order_by()
¶
Which stockid
s have the most assessments? We can re-order the rows by different columns using the order_by()
. (Changing the row order does not alter any individual row itself -- that would mess up the data. Each row is moved as a unit). By default, order
is always increasing, smallest to largest, A to Z. While that might be intuitive for dates or names, if we want to see which stocks have the most assessments, we need n
to be in descending order. We indicate this by appending the .desc()
method to the column:
(assessment
.group_by(_.stockid)
.agg(n=_.count())
.order_by(_.n.desc())
.execute()
)