8  Bringing It All Together

Important

This book is a first draft, and I am actively collecting feedback to shape the final version. Let me know if you spot typos, errors in the code, or unclear explanations; your input would be greatly appreciated, and your suggestions will help make this book more accurate, readable, and useful for others. You can reach me at:
Email:
LinkedIn: www.linkedin.com/in/jorammutenge
Datasets: Download all datasets

Knowledge is potential power. Knowledge used is power.

Joram Mutenge


In the previous chapters, we explored how powerful and versatile Polars can be for data analysis. In this chapter, we bring those ideas together by applying what you have learned to answer ten questions based on the library_checkouts dataset. Learning a tool has limited value if you never put it into practice. Application is where understanding happens.

I strongly encourage you to attempt each question before reviewing the solutions. If you find yourself stuck, return to earlier chapters to refresh your memory. Revisiting concepts in context will help reinforce what you have learned and deepen your confidence.

8.1 The Dataset

The library_checkouts dataset contains records of items checked out by patrons at public libraries across Brisbane, Australia. Library names and item types are stored as acronyms. To make the results easier to interpret, we will introduce a second dataset, library_acronyms, which maps these acronyms to their full names. Working with full names allows us to more easily recognize libraries and items throughout the analysis.

8.1.1 Reading the Datasets

import polars as pl
import polars.selectors as cs

new_cols = ['Title', 'Author', 'Call_Num', 'ID', 'Item', 'Status',
 'Lang', 'Age', 'Library', 'Date']

libraries = (pl.scan_parquet('data/brisbane_library.parquet')
 .pipe(lambda lf: lf.rename(dict(zip(lf.collect_schema().names(), new_cols))))
 .select(pl.exclude('Status','Call_Num'))
 .filter(pl.col('ID') != "REFRESH")
 .collect()
 )

acronyms = (pl.read_csv('data/library_acronyms.csv',
1             columns=[0,1,4,5],
2             new_columns=['Library','Lib_Full','Item','Item_Full'])
 )

library_acronyms = (acronyms
 .filter(pl.col('Library').is_not_null())
 .select(pl.nth(0,1))
 )

item_acronyms = (acronyms
 .select(pl.nth(2,3))
 )
1
Choose which columns to read by index position.
2
Assign simpler, more descriptive names to the selected columns.

Next, we join the two acronym dataframes, library_acronyms and item_acronyms, with the main dataset. This produces the final dataframe that we will use to answer all ten questions.

library_checkouts = (libraries
 .join(library_acronyms, on='Library', how='left')
 .with_columns(Library=pl.col('Lib_Full').str.strip_chars())
 .join(item_acronyms, on='Item', how='left')
 .with_columns(Item=pl.col('Item_Full').str.strip_chars())
1 .select(~cs.ends_with('Full'))
 )
library_checkouts.head(3)
1
Exclude columns whose names end with the text “Full”.
shape: (3, 8)
Title Author ID Item Lang Age Library Date
str str str str str str str datetime[ms, Australia/Brisbane]
"Scooby-Doo, where are you?." null "34000104954037" "Junior Magazines" null "JUVENILE" "Coopers Plains" 2020-02-06 19:59:33 AEST
"Scooby-Doo, where are you?." null "34000107294902" "Junior Magazines" null "JUVENILE" "Coopers Plains" 2020-02-06 19:59:33 AEST
"Scooby-Doo's creepiest capers" null "34000106761927" "DVD" null "JUVENILE" "Toowong" 2020-02-06 20:14:31 AEST

8.2 The Ten Questions

The questions in this section draw on a wide range of concepts covered throughout the book. They are intentionally challenging. Do not be discouraged if you cannot answer all of them on your first attempt. Their purpose is to strengthen your understanding of Polars through practice.

For each question, I have included the expected output to give you a sense of what your solution should produce. Once again, try to answer the questions on your own before jumping directly to the solutions.

  1. What is the title of the item checked out in October 2020 at 23:46:47?

    ┌───────────────┐
    │ Title         │
    │ ---           │
    │ str           │
    ╞═══════════════╡
    │ Downton Abbey │
    └───────────────┘
  2. What proportion of DVD checkouts between Banyo and Hamilton does each library account for?

    ┌──────────┬──────────────┐
    │ Library  ┆ Pct_Checkout │
    │ ---      ┆ ---          │
    │ str      ┆ f64          │
    ╞══════════╪══════════════╡
    │ Banyo    ┆ 55.71        │
    │ Hamilton ┆ 44.29        │
    └──────────┴──────────────┘
  3. Create a bar chart showing the five most popular checked-out items.

  4. Who is the most popular non-fiction adult author in each of the top five libraries by checkouts?

    ┌───────────────┬──────────────────┐
    │ Library       ┆ Author           │
    │ ---           ┆ ---              │
    │ str           ┆ str              │
    ╞═══════════════╪══════════════════╡
    │ Ashgrove      ┆ Rick Steves      │
    │ Chermside     ┆ Patrick. Holford │
    │ Grange        ┆ Jamie Oliver     │
    │ Indooroopilly ┆ Noel Whittaker   │
    │ Wynnum        ┆ Jamie Oliver     │
    └───────────────┴──────────────────┘
  5. What are the five most checked-out adult fiction books in 2025?

    ┌───────────────────────────────────┐
    │ Title                             │
    │ ---                               │
    │ str                               │
    ╞═══════════════════════════════════╡
    │ Traitors gate                     │
    │ Tell me everything : a novel      │
    │ Mr Einstein's secretary : an epic │
    │ Turn a blind eye                  │
    │ Exiles                            │
    └───────────────────────────────────┘
  6. Create a multi-line chart showing the 7-day average and median checkouts for Toowong library.

  7. Use your local time zone and show the day and time when one of my favorite books, This Side of Paradise, was checked out.
    Note: The values will reflect your local time zone, but the shape of the dataframe should remain the same.

    ┌───────────────────────┬───────────────────────────────┬───────────┬──────────┐
    │ Title                 ┆ Date                          ┆ Day       ┆ Time     │
    │ ---                   ┆ ---                           ┆ ---       ┆ ---      │
    │ str                   ┆ datetime[ms, America/Chicago] ┆ str       ┆ time     │
    ╞═══════════════════════╪═══════════════════════════════╪═══════════╪══════════╡
    │ This side of paradise ┆ 2022-08-08 07:57:21 CDT       ┆ Monday    ┆ 07:57:21 │
    │ This side of paradise ┆ 2020-01-06 03:50:08 CST       ┆ Monday    ┆ 03:50:08 │
    │ This side of paradise ┆ 2020-03-06 05:20:32 CST       ┆ Friday    ┆ 05:20:32 │
    │ This side of paradise ┆ 2020-06-08 06:30:04 CDT       ┆ Monday    ┆ 06:30:04 │
    │ This side of paradise ┆ 2022-06-07 07:54:41 CDT       ┆ Tuesday   ┆ 07:54:41 │
    │ This side of paradise ┆ 2023-12-08 04:44:55 CST       ┆ Friday    ┆ 04:44:55 │
    │ This side of paradise ┆ 2024-05-08 21:35:33 CDT       ┆ Wednesday ┆ 21:35:33 │
    └───────────────────────┴───────────────────────────────┴───────────┴──────────┘
  8. Create a heatmap showing the frequency of checkouts, with days on the y-axis and months on the x-axis.

  9. Create a single-column dataframe that shows a numbered list of Cal Newport books. The first row should be the full author name, and book titles should appear in the subsequent rows.
    Tip: this involves using intermediate variables.

    ┌───────────────────────────────────────────────────────────────────────────────────┐
    │ Author                                                                            │
    │ ---                                                                               │
    │ str                                                                               │
    ╞═══════════════════════════════════════════════════════════════════════════════════╡
    │ Cal Newport                                                                       │
    │ 1. A world without email : reimagining work in the age of overload                │
    │ 2. Deep work : rules for focused success in a distracted world                    │
    │ 3. Digital minimalism : on living better with less technology                     │
    │ 4. Slow productivity : the lost art of accomplishment without burnout             │
    │ 5. So good they can't ignore you : why skills trump passion in the quest for wor… │
    └───────────────────────────────────────────────────────────────────────────────────┘
  10. Create a multi-year line chart showing biography checkouts over the years.

8.3 Solutions to the Questions

In this section, I present solution code for the ten questions asked in the previous section. I hope you took some time to work through them on your own. If not, that is perfectly fine as well.

It is important to note that your solution code does not need to match mine exactly. One of the advantages of Polars is that it allows you to express the same logic in multiple ways while producing identical results. As long as your output matches the expected result, your approach is valid. In fact, your solution may feel more intuitive since it reflects your own reasoning. At the very least, reviewing my solutions may offer a different perspective on how each question can be approached.

8.3.1 Question One

  1. What is the title of the item checked out in October 2020 at 23:46:47?

The key idea behind this question is that it requires filtering the dataframe using datetime components. In particular, it involves filtering on specific parts of a timestamp. The most challenging aspect is applying filter correctly when working with time values.

(library_checkouts
 .filter(pl.col('Date').dt.year() == 2020,
         pl.col('Date').dt.month() == 10,
1         pl.col('Date').dt.time() == pl.time(23,46,47))
2 .select(pl.col('Title').str.strip_chars('.'))
 )
1
When using pl.time, the colons in the time value become commas: 23:46:47 is written as pl.time(23, 46, 47).
2
The returned Title value includes a trailing period .. This line removes that character.
shape: (1, 1)
Title
str
"Downton Abbey"

8.3.2 Question Two

  2. What proportion of DVD checkouts between Banyo and Hamilton does each library account for?

The most subtle part of this question is remembering that filtering by equating the Item value to “DVD” does not capture all DVD-related items. When filtering text values, it is generally safer to use contains rather than == or eq. The following examples illustrate why this matters. First, consider the items that are exactly equal to “DVD”.

set(library_checkouts
 .filter(pl.col('Item').eq('DVD'))
 .get_column('Item').to_list()
)
{'DVD'}

Now compare that with the set of items that contain the string “DVD”, which also represent DVD items.

set(library_checkouts
 .filter(pl.col('Item').str.contains('DVD'))
 .get_column('Item').to_list()
)
{'Brisbane Square DVD',
 'DVD',
 'DVD Restricted - 18+',
 'Languages Other than English DVD',
 'Mt Coot-tha Botanical Gardens DVD'}

This comparison shows that filtering with eq misses several items that are still DVDs. With that in mind, the solution code can be written as follows.

(library_checkouts
 .filter(pl.col('Item').str.contains('DVD'),
1         pl.col('Library').is_in(['Banyo','Hamilton']))
 .group_by('Library').len()
 .with_columns(Pct_Checkout=pl.col('len').mul(100) / pl.sum('len'))
 .select('Library', pl.col('Pct_Checkout').round(2))
 )
1
A common mistake is to assume that percentage calculations should be based on the entire dataframe. Here, percentages are calculated only after filtering for the target libraries, Banyo and Hamilton.
shape: (2, 2)
Library Pct_Checkout
str f64
"Banyo" 55.71
"Hamilton" 44.29

8.3.3 Question Three

  3. Create a bar chart showing the five most popular checked-out items.

Some rows in the Item column contain null values. Since null values should not be counted among the most popular items, they must be filtered out before aggregation.

from bokeh.models import NumeralTickFormatter

(library_checkouts
.filter(pl.col('Item').is_not_null())
.group_by('Item').len()
1.top_k(5, by='len')
.hvplot.barh(x='Item', y='len',
            xlabel='', ylabel='')
.opts(title='Top 5 most popular items checked-out',
2    xformatter=NumeralTickFormatter(format='0a'))
)
1
This retrieves the top values directly without requiring an explicit sort of the dataset.
2
This ensures that numeric labels are not displayed in scientific notation, with k representing thousands and m representing millions.

8.3.4 Question Four

  4. Who is the most popular non-fiction adult author in each of the top five libraries by checkouts?

This question requires multiple filters and, most importantly, the removal of null values in the Author column.

(library_checkouts
 .filter(pl.col('Item').eq('Non Fiction'),
         pl.col('Age').eq('ADULT'),
         pl.col('Author').is_not_null())
 .with_columns(pl.col('Author').str.strip_chars(','))
 .group_by('Author','Library').len()
1 .sort('len', descending=True)
 .group_by('Library').first()
 .top_k(5, by='len')
 .select(pl.exclude('len'))
2 .with_columns(pl.col('Author').str.split(', ').list.reverse().list.join(' '))
 )
1
Sorting at this stage is essential to ensure that the most popular authors are selected for each library in the subsequent group_by. Without sorting, there is no guarantee that the top results are returned.
2
Author names are recorded in last-name-first format, such as “Steves, Rick”. This line reverses the order and removes the comma so the names appear in a more natural format.
shape: (5, 2)
Library Author
str str
"Ashgrove" "Rick Steves"
"Chermside" "Patrick. Holford"
"Grange" "Jamie Oliver"
"Indooroopilly" "Noel Whittaker"
"Wynnum" "Jamie Oliver"

8.3.5 Question Five

  5. What are the five most checked-out adult fiction books in 2025?

This question focuses only on the titles of the most checked-out books. The number of checkouts for each title does not need to be displayed.

(library_checkouts
 .filter(pl.col('Date').dt.year() == 2025,
1         pl.col('Item').str.contains('Fiction'),
         pl.col('Item').ne('Non Fiction'),
         pl.col('Age').eq('ADULT'),
         pl.col('Title').is_not_null())
2 .with_columns(pl.col('Title').str.strip_chars(' /'))
 .group_by('Title').len()
 .top_k(5, by='len')
 .select(pl.exclude('len'))
 )
1
The dataset does not include an item labeled simply “Fiction”. Instead, values such as “Adult Fiction” and “Junior Fiction” are used, which makes contains necessary. However, contains also matches “Non Fiction”, so the ne filter removes those rows.
2
This removes the trailing space and forward slash / from the end of the book titles.
shape: (5, 1)
Title
str
"Traitors gate"
"Tell me everything : a novel"
"Mr Einstein's secretary : an e…
"Turn a blind eye"
"Hidden in plain sight"

8.3.6 Question Six

  6. Create a multi-line chart showing the 7-day average and median checkouts for Toowong library.

Before computing rolling statistics, it is essential to filter the data for the Toowong library. If this step is skipped, the rolling average and median will be calculated across all libraries, producing incorrect results.

(library_checkouts
 .filter(pl.col('Library') == 'Toowong')
 .sort('Date')
1 .group_by_dynamic('Date', every='1d')
 .agg(Checkouts=pl.count('ID'))
2 .with_columns(Average=pl.col('Checkouts').rolling_mean(window_size=7, min_samples=1),
               Median=pl.col('Checkouts').rolling_median(window_size=7, min_samples=1))
 .hvplot.line(x='Date', y=['Average','Median'],
              legend='top_right', legend_opts={'title':''},
              legend_cols=2, xlabel='', ylabel='Checkouts')
 .opts(title='Toowong 7-day average and median checkouts')
)
1
Since the rolling calculations are based on days, the datetime granularity can be reduced from hours to days.
2
Setting min_samples=1 ensures that the calculation is performed as long as there is at least one non-null value in the window.

8.3.7 Question Seven

  7. Use your local time zone and show the day and time when one of my favorite books, This Side of Paradise, was checked out.

When most people hear the name F. Scott Fitzgerald, they immediately think of The Great Gatsby. However, I would argue that his lesser-known novel, This Side of Paradise, is his best work. If you have not read it yet, give it a try. I will be waiting for your thank you email.

(library_checkouts
1 .filter(pl.col('Title').str.contains(r'(?i)this side of paradise'))
 .with_columns(pl.col('Title').str.strip_chars('/ '))
2 .with_columns(pl.col('Date').dt.convert_time_zone('America/Chicago'))
3 .with_columns(Day=pl.col('Date').dt.strftime('%A'),
               Time=pl.col('Date').dt.time())
 .select('Title','Date','Day','Time')
 )
1
Book titles often appear in datasets with inconsistent capitalization. Unless you know the exact casing, it is best to use a case-insensitive filter.
2
My local time zone is “America/Chicago”. If your time zone differs, your datetime values will also differ, but the structure of the dataframe will remain the same.
3
The format string %A returns the full name of the weekday.
shape: (7, 4)
Title Date Day Time
str datetime[ms, America/Chicago] str time
"This side of paradise" 2022-08-08 07:57:21 CDT "Monday" 07:57:21
"This side of paradise" 2020-01-06 03:50:08 CST "Monday" 03:50:08
"This side of paradise" 2020-03-06 05:20:32 CST "Friday" 05:20:32
"This side of paradise" 2020-06-08 06:30:04 CDT "Monday" 06:30:04
"This side of paradise" 2022-06-07 07:54:41 CDT "Tuesday" 07:54:41
"This side of paradise" 2023-12-08 04:44:55 CST "Friday" 04:44:55
"This side of paradise" 2024-05-08 21:35:33 CDT "Wednesday" 21:35:33

8.3.8 Question Eight

  8. Create a heatmap showing the frequency of checkouts, with days on the y-axis and months on the x-axis.

A heatmap uses color intensity to represent the magnitude of values in a two-dimensional matrix, transforming numerical data into an intuitive visual form. Heatmaps are especially useful for quickly identifying patterns, trends, and areas of high or low activity. In the heatmap below, darker colors indicate a higher number of checkouts for a given day and month combination.

(library_checkouts
 .with_columns(Month=pl.col('Date').dt.strftime('%b'))
1 .with_columns(_Month=pl.col('Date').dt.month())
 .with_columns(Day=pl.col('Date').dt.strftime('%a'))
 .with_columns(_Day=pl.col('Date').dt.weekday())
 .group_by('Day','_Day','Month','_Month').len()
2 .sort(['_Day','_Month'], descending=True)
3 .select(~cs.starts_with('_'))
 .hvplot.heatmap(x='Month', y='Day', C='len',
                 xlabel='', ylabel='')
 .opts(title='Brisbane libraries checkout frequency')
 )
1
Numerical representations of months and days are necessary for correct sorting. Sorting by month names alone would place April before January. The underscore prefix makes these helper columns easy to identify and remove later.
2
Sorting is performed using the numerical day and month values.
3
This removes all columns whose names start with an underscore.

8.3.9 Question Nine

  9. Create a single-column dataframe that shows a numbered list of Cal Newport books. The first row should be the full author name, and book titles should appear in the subsequent rows.

This question relies on intermediate variables. First, we extract all books written by Cal Newport and store them in a variable named cn_books.

cn_books = (library_checkouts
1 .filter(pl.col('Title').is_not_null(),
         pl.col('Author').str.contains('Newport, Cal'))
 .with_columns(pl.col('Author').str.strip_chars(','),
               pl.col('Title').str.strip_chars(' /'))
2 .with_columns(pl.col('Title').str.replace('FASTBACK - ',''),
3               pl.col('Author').str.split(', ').list.reverse().list.join(' '))
 .select('Title','Author')
4 .with_columns(pl.col('Title').str.replace('sm on','sm : on'))
 .unique('Title')
 )
cn_books
1
Omitting this line would introduce a null value into the result set.
2
Some titles appear to be unique only because they include the text “FASTBACK”. Removing this text ensures that identical titles are treated as the same book.
3
This reverses the author name so it appears as first name followed by last name.
4
One title includes a colon before the subtitle while another does not. This standardization ensures both are treated as the same book.
shape: (5, 2)
Title Author
str str
"Deep work : rules for focused … "Cal Newport"
"Slow productivity : the lost a… "Cal Newport"
"A world without email : reimag… "Cal Newport"
"Digital minimalism : on living… "Cal Newport"
"So good they can't ignore you … "Cal Newport"

Now we construct the final output.

(cn_books
 .select('Author')
 .head(1)
 .vstack(cn_books
 .select(pl.col('Title').sort())
1 .rename(dict(Title='Author')))
2 .with_columns(pl.when(pl.int_range(pl.len()) > 0)
               .then(pl.format('{}. {}', pl.int_range(pl.len()), pl.col('Author')))
               .otherwise(pl.col('Author'))
               .alias('Author'))
 )
shape: (6, 1)
Author
str
"Cal Newport"
"1. A world without email : reimagining work in the age of overload"
"2. Deep work : rules for focused success in a distracted world"
"3. Digital minimalism : on living better with less technology"
"4. Slow productivity : the lost art of accomplishment without burnout"
"5. So good they can't ignore you : why skills trump passion in the quest for work you love"
1
The dataframes being stacked must have identical column names, which is why the column is renamed.
2
This ensures that the numbering format is applied only to the book titles and not to the first row containing the author name.

8.3.10 Question Ten

  10. Create a multi-year line chart showing biography checkouts over the years.

This question also requires numerical month values to ensure that the months are sorted correctly.

(library_checkouts
 .with_columns(Year=pl.col('Date').dt.year(),
               Month=pl.col('Date').dt.strftime('%b'),
               _Month=pl.col('Date').dt.month())
 .filter(pl.col('Item').str.contains('Biography'))
 .group_by('_Month','Month','Year').len()
 .sort('_Month')
 .select(pl.exclude('_Month'))
 .pivot(index='Month', on='Year')
 .hvplot.line(x='Month',
              y=['2020','2023','2024','2025'],
              xlabel='', ylabel='Checkouts',
              legend_opts={'title':''},
              legend='bottom_right', legend_cols=3)
 .opts(title="Biography checkouts trends over the years")
 )

The years 2021 and 2022 contain null values across the entire Item column, not just for biography records. This suggests that an error occurred during data collection for those years.

8.3.11 Bonus Question

Here is a bonus question to further test your Polars mastery. I will show you the output, but I will not provide the solution code. Instead, I am sharing my email address, so you can send me your solution.

I promise to reply with my own solution, especially if it differs from yours. This way, you can compare approaches and learn an alternative way to solve the problem.

  1. Return the book title(s) whose number of digits is exactly half of the maximum digit count found in any title, along with the digit count. Exclude titles that consist only of digits.
    ┌───────────────────────────────────────────────────────────────┬────────┐
    │ Title                                                         ┆ Digits │
    │ ---                                                           ┆ ---    │
    │ str                                                           ┆ u32    │
    ╞═══════════════════════════════════════════════════════════════╪════════╡
    │ Headwind : 14,251 kilometres. 37 days, 20 hours, 45 minutes.… ┆ 14     │
    └───────────────────────────────────────────────────────────────┴────────┘

8.4 Conclusion

The ten questions we explored were intentionally challenging, but they demonstrate that Polars is well suited for performing deep and expressive data analysis. Throughout these examples, we filtered data using datetime components, worked with text inconsistencies, computed rolling statistics, reshaped data for visualization, and combined multiple transformations into clear analytical pipelines.

One of Polars’ greatest strengths is its flexibility. The same problem can often be solved in several valid ways, each producing the same result. As you continue working with Polars, focus on identifying an approach that aligns with your own reasoning and feels natural to you. Avoid feeling overwhelmed by alternative solutions, especially early on.

The most important goal is to write code that produces the correct answer in a clear and reliable way. Once you are confident in the correctness of your results, you can revisit your solution to improve readability, performance, or expressiveness. By building this habit, you will develop both confidence and fluency in Polars, allowing you to tackle increasingly complex analytical questions with ease.