7  Creating Complex Datasets for Analysis

Important

This book is a first draft, and I am actively collecting feedback to shape the final version. Let me know if you spot typos, errors in the code, or unclear explanations; your input would be greatly appreciated, and your suggestions will help make this book more accurate, readable, and useful for others. You can reach me at:
Email:
LinkedIn: www.linkedin.com/in/jorammutenge
Datasets: Download all datasets

Generally, aggregations are what your boss wants you to report on. Say you work in a candy store and your boss asks you, “How are we doing today?” Don’t tell him that Sally came in and bought a tootsie roll and a lollipop, and Billy bought two candy bars… Your boss doesn’t care about the details.

Matt Harrison


In Chapters 3 through 6, we explored a variety of techniques for using Polars to analyze different kinds of datasets. Beyond these targeted examples, there are situations where the primary goal is to construct a dataset that is tailored to a specific purpose while remaining flexible enough to support multiple downstream analyses. The final output might be another CSV file, a text-based export, or a dataset consumed by a business intelligence platform. In some cases, the Polars code required for this work is straightforward, perhaps involving only a small number of filters or summary operations. More often, however, the transformations needed to shape the data become complex. As stakeholders request new fields, metrics, or business logic, the code naturally evolves. For these reasons, the structure, performance, and long-term maintainability of your Polars workflows matter far more here than they do for one-off analytical tasks.

In this chapter, I outline strategies for organizing your code so that it can be shared, modified, and extended more easily. I also examine how to decide whether complex logic should remain in Polars or be migrated into upstream ETL (extract, transform, load) processes. Next, I cover different ways to store intermediate results during computation. Finally, I discuss methods for reducing dataset size and approaches for protecting sensitive information, including the removal of personally identifiable information (PII).

7.1 When to Use Polars for Complex Datasets

Nearly every dataset prepared for deeper analysis incorporates some form of logic. This logic may be as simple as determining how datasets should be joined or which filters to apply, or as involved as performing multi-step calculations that aggregate, classify, parse, or apply window functions across groups of data. When assembling datasets intended for further use, deciding whether to keep this logic in Polars, shift it to an ETL pipeline, or delegate it to another system is rarely straightforward. Practical considerations, such as ease of development, performance requirements, and access to engineering support, play a major role. There is seldom a single correct choice, but your judgment will improve as you gain more experience with Polars.

7.1.1 Advantages of Using Polars

Polars offers a high degree of flexibility. If the earlier chapters were effective, you have already seen how many different data preparation and analytical tasks it can support. This adaptability is one of the strongest reasons to rely on Polars when constructing complex datasets.

When you first begin working with a dataset, experimentation is common. The process usually starts with profiling the data to understand its structure and irregularities. From there, you incrementally build a transformation pipeline, validating each step to ensure the results are sensible. Additional profiling is often required when the data behaves differently than expected. Complex datasets frequently emerge from chaining together many expressions, using joins, unions, or concats to answer specific analytical questions. The ability to run parts of the pipeline quickly and inspect intermediate outputs makes this iterative development process efficient.

Polars also minimizes external dependencies. Transformations run only when needed and do not require coordination with data engineering teams or waiting for scheduled releases. Because Polars expressions integrate cleanly into Python or Rust codebases, analysts and data scientists can modify logic independently. When stakeholders request new metrics, filters, or derived fields, the code can be updated quickly, and Polars’ lazy execution engine recomputes only what is necessary.

For new analyses or those that change frequently, keeping the logic directly in Polars is often the most practical approach. When execution times remain reasonable and stakeholders receive results promptly, there is often little incentive to move that logic elsewhere.

7.1.2 When to Build into ETL Instead

There are situations where shifting logic into an ETL pipeline is more effective than keeping everything inside Polars, especially in environments built around a data warehouse or data lake. The two primary motivations for choosing ETL are improved performance and increased transparency.

The speed of Polars code depends heavily on both data volume and transformation complexity. More advanced analytical workflows often require intricate expressions, large joins, or even Cartesian joins. These operations can push execution times into minutes or longer. While analysts and data scientists may tolerate slower runtimes during development, most data consumers expect near-instant responses similar to those provided by modern web applications. Waiting more than a few seconds is often perceived as unacceptably slow.

ETL processes, by contrast, run in the background on a fixed schedule and write their results into tables. Because these jobs execute out of sight, they can take 30 seconds, several minutes, or even an hour without disrupting users. Although daily schedules are common, many organizations run ETL jobs more frequently. Once the job completes, users can query the prepared table directly, with no joins or heavy logic required, which leads to much faster response times.

A common scenario where ETL is preferable is the creation of daily snapshot tables. Many organizations benefit from storing a daily record of customers, orders, or other key entities to support historical analysis. For example, customer-level metrics such as lifetime order counts, visit totals, or current sales pipeline stage may change over time or accumulate gradually. As discussed in Chapter 3, generating daily sequences, including days when an entity does not appear, is straightforward. However, repeatedly running these queries over long time spans can become slow. In addition, some attributes, such as a customer’s current status, may be overwritten in the source system, making daily snapshots the only reliable way to preserve historical truth. In these cases, building an ETL job and storing the results in a table is often the most sustainable solution.
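
To make this concrete, here is a minimal sketch of what such a snapshot job might look like once the logic is handed to ETL. The file paths and column names (Customer_ID, Order_ID, Dollars) are illustrative assumptions rather than part of the datasets used in this book.

import datetime as dt
import polars as pl

SOURCE_PATH = 'data/customer_orders.parquet'      # hypothetical raw source
SNAPSHOT_PATH = 'data/customer_snapshots.parquet' # hypothetical snapshot table

def append_daily_snapshot(run_date: dt.date) -> None:
    # Compute today's customer-level metrics from the raw source.
    snapshot = (pl.scan_parquet(SOURCE_PATH)
     .group_by('Customer_ID')
     .agg(Lifetime_Orders=pl.col('Order_ID').n_unique(),
          Lifetime_Dollars=pl.col('Dollars').sum())
     .with_columns(Snapshot_Date=pl.lit(run_date))
     .collect()
     )
    # Append to the snapshot table, creating it on the first run.
    try:
        existing = pl.read_parquet(SNAPSHOT_PATH)
        pl.concat([existing, snapshot]).write_parquet(SNAPSHOT_PATH)
    except FileNotFoundError:
        snapshot.write_parquet(SNAPSHOT_PATH)

append_daily_snapshot(dt.date.today())

A scheduler or orchestration tool would then call this function once per day, preserving history even when the source system overwrites attributes in place.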

Visibility is another strong reason to migrate logic into ETL. Polars scripts often live on an individual’s laptop, which makes it difficult for colleagues to find the code, review it, or validate its correctness. Moving the logic into an ETL pipeline and storing it in a shared repository, such as GitHub, allows others to inspect, collaborate on, and improve the transformations. Version control also provides a clear record of changes, making it easy to see when and why specific lines were added or modified.

Despite these advantages, ETL comes with trade-offs. One limitation is that updated results are only available after the scheduled ETL run completes, even if new data has already arrived. A common workaround is to apply Polars transformations directly to the most recent raw data, limiting the analysis to a small time window to keep execution fast. This approach can be combined with a lookup against the ETL-generated table, merging the two result sets when needed. Another drawback is that ETL logic is more difficult to change. Fixes or enhancements typically require involvement from a data engineer, followed by testing, code review, and deployment to production. For this reason, I usually wait until my Polars logic has stabilized, after rapid iteration has slowed and stakeholders have validated the outputs, before migrating it into ETL. Even so, the added rigor of code reviews and controlled releases often results in more reliable and consistent data.
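
One way to picture the hybrid workaround described above is sketched below, assuming the ETL job writes a daily_sales table and a raw_sales file holds whatever has arrived since the last run; the file names, columns, and cutoff date are all illustrative.

import datetime as dt
import polars as pl

cutoff = dt.date(2025, 6, 1)  # date of the most recent completed ETL run

etl_table = pl.scan_parquet('data/daily_sales.parquet')

recent = (pl.scan_parquet('data/raw_sales.parquet')
 # Transform only the newest slice of raw data so this pass stays fast.
 .filter(pl.col('Date').dt.date() >= cutoff)
 .group_by(pl.col('Date').dt.date().alias('Date'), 'Store')
 .agg(Dollars=pl.col('Dollars').sum())
 )

# Combine the pre-computed history with the freshly transformed tail.
(pl.concat([
    etl_table.filter(pl.col('Date') < cutoff).select('Date', 'Store', 'Dollars'),
    recent.select('Date', 'Store', 'Dollars'),
 ])
 .sort('Date')
 .collect()
 )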

7.1.3 When to Put Logic in Other Tools

Polars transformations and the results they produce in your development environment are often only one part of a broader analysis. Those results may be embedded in reports, visualized in tables and charts, or further manipulated in tools ranging from spreadsheets and BI platforms to environments that support statistical or machine-learning workflows. In addition to deciding when to move logic upstream into ETL, you also need to decide when it makes sense to move logic downstream into other tools. Performance requirements and specific use cases typically drive these choices.

Each tool comes with its own strengths and limitations. Spreadsheets are flexible and familiar, but they struggle with large datasets or complex operations across many rows. For this reason, it is often best to use ETL or Polars to perform heavier transformations, then pass a smaller and cleaner dataset into a spreadsheet.

BI tools vary widely in how they process data. Some cache data locally in optimized formats, which makes certain calculations fast. Others recompute results whenever a column is added to or removed from a report. Some calculations, such as unique counts or medians, require detailed row-level data. If you cannot anticipate all the ways users might explore the data, you may need to provide a more detailed dataset than would otherwise be ideal. Additionally, if the goal is broad exploration and slicing across many dimensions, more detail is usually preferable. Finding the right balance between ETL logic, Polars transformations, and BI-tool computation often requires iteration.

When the goal is statistical or machine-learning work with libraries such as SciPy and scikit-learn, detailed data is usually the better choice. These libraries can handle a wide range of transformations, modeling steps, and text manipulation tasks. It is often useful to push only stable, well-defined transformations into ETL or Polars, while keeping more experimental or iterative work in the analysis environment. Ultimately, the decision about where to perform a given transformation depends on workflow preferences, iteration speed, and the stability of the logic involved.

Tip

Although there are very few hard-and-fast rules about where logic should live, there is one guideline I strongly recommend: eliminate manual steps. Opening a dataset in a spreadsheet or text editor, tweaking a value, saving the file, and moving on may feel harmless. However, as soon as you need to repeat the process, or when new data arrives, it becomes easy to overlook that manual change or apply it inconsistently. In practice, so-called one-off tasks rarely remain one-off. Whenever possible, capture the logic in code instead of relying on ad hoc edits.
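
As an illustration, a fix that might otherwise be made by hand in a spreadsheet, such as correcting a vendor name that is known to be wrong, can be captured in code so it is reapplied every time the data is reloaded. The file, column, and values below are hypothetical.

import polars as pl

(pl.read_csv('data/vendors.csv')
 # Encode the correction instead of editing the file by hand.
 .with_columns(pl.col('Vendor_Name')
               .str.strip_chars()
               .replace({'ACME Corp (old)': 'ACME CORPORATION'}))
 )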

Polars is a powerful and flexible tool that operates within a broader analysis workflow, alongside ETL processes and a range of downstream tools. Deciding where each calculation should live often requires experimentation as you iterate on what is practical in Polars, what belongs in ETL, and what is better handled in reporting, BI, or modeling environments. As you become more familiar with the full set of options, it becomes easier to evaluate the trade-offs and steadily improve both the performance and adaptability of your work.

7.2 Code Organization

You may have noticed that this book follows a consistent style when presenting Polars code. Whenever a snippet spans more than a single line, the operations are enclosed in parentheses and written as a chain. This pattern, often called method chaining, helps the code read as a recipe: a clear, ordered set of instructions. Just as importantly, structuring Polars logic this way allows the query engine to optimize the entire chain, which often results in noticeably faster execution.
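
As a minimal illustration of this style, the hypothetical query below is wrapped in a single pair of parentheses and reads top to bottom; because it starts from a lazy scan, Polars can optimize the whole chain before executing anything.

import polars as pl

(pl.scan_csv('data/sales.csv')  # hypothetical file
 .filter(pl.col('Dollars') > 0)
 .group_by('Store')
 .agg(pl.col('Dollars').sum())
 .sort('Dollars', descending=True)
 .collect()
 )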

7.2.1 Commenting

Nearly every programming language provides a way to mark text as a comment so that it is ignored when the code runs. Because this book uses Python, we rely on the two commenting approaches that Python supports. The first uses the hash symbol (#), which turns the remainder of the line into a comment:

# This is a comment

The second approach uses triple quotes to create a multi-line comment block, which is closed by another set of triple quotes. Strictly speaking, this is an unassigned string literal rather than a true comment, but Python evaluates and discards it, so it serves the same purpose:

'''
This is a comment block
with multiple lines
'''
Note

You can also use triple double quotes (""" """) for comment blocks. I tend to prefer single quotes, but the choice is entirely stylistic.

Most editors visually distinguish comments, often by dimming or recoloring them, so they stand out from executable code.

Although commenting is a valuable habit, many people find it difficult to do consistently. Polars code is often written quickly, especially during early exploration or profiling, and we do not always expect those snippets to persist. Too many comments can clutter the code, while too few can make it difficult to understand later. We may also assume we will remember the reasoning behind our own code, but anyone who has returned to a long transformation pipeline after several months, or inherited one from a teammate, knows how challenging it can be to reconstruct the original intent.

To strike a balance between clarity and overhead, I follow a few simple guidelines. First, add a comment whenever a value’s meaning is not immediately obvious. Many systems store coded values as integers, and it is easy to forget what each number represents. A short note can prevent confusion later:

df.filter(pl.col('Status').is_in([1,2])) # 1 = Active, 2 = Pending

Second, document any transformations or calculations that are not self-explanatory, especially those that rely on quirks in the data, such as inconsistent entry patterns or known outliers:

(df
 .with_columns(pl.when(pl.col('Status') == 'Live')
               .then(pl.lit('Active'))
               .otherwise(pl.col('Status'))
               .alias('Status'))
 )

'''Customers were labeled "Live" until 2024,
when the terminology switched to "Active".'''

Finally, when a method chain spans several lines, it is helpful to leave brief notes describing what each major step accomplishes. This makes it easier to revisit the code later and quickly locate the section that needs review or adjustment:

(df
 # count distinct grocery items in list
 .with_columns(pl.col('Groceries').str.split(',').list.n_unique())
 # add fixed fare amounts
 .with_columns(Uber_Fare=pl.lit(25),
               Bus_Fare=pl.lit(3))
 # sample five reproducible rows
 .sample(5, seed=31)
)

Writing good comments takes practice and a bit of discipline, but it pays off for any code that extends beyond a few lines. Comments can also provide useful context, such as the purpose of the pipeline, who wrote it, or when it was created. Thoughtful comments are an act of kindness to your teammates and to your future self.

7.2.2 Storing Code

Once you have invested time in documenting and formatting your code, it makes sense to store it somewhere you can easily locate and reuse it later.

Many data scientists and analysts rely on Jupyter notebooks, either in a browser or within editors such as JupyterLab or VS Code. Notebooks are popular because they support rapid experimentation and provide immediate feedback during data exploration. They are saved with the .ipynb extension, and any editor that supports notebooks can open and modify them. VS Code does not handle notebooks natively, so the Jupyter extension is required. Notebook files can be stored locally or synchronized to cloud services such as GitHub or Google Drive.

Despite their convenience, notebooks are not well suited to traditional version control tools like Git. Because they are stored as JSON and combine code, outputs, and metadata, they often generate noisy diffs and frequent merge conflicts. For this reason, I typically use notebooks for exploratory work and then move stable code into a Python script once it is ready. A Python script is a plain text file with a .py extension.
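
For example, once the exploratory logic has settled, it might be moved into a small script along these lines; the file name and columns are placeholders, and the point is simply that the transformation now lives in a plain .py file that can be imported, reviewed, and version-controlled.

# build_sales_summary.py -- stable logic extracted from an exploratory notebook
import polars as pl

def build_sales_summary(path: str) -> pl.DataFrame:
    # Daily sales totals per store.
    return (pl.scan_csv(path, try_parse_dates=True)
     .group_by(pl.col('Date').cast(pl.Date), 'Store')
     .agg(Dollars=pl.col('Dollars').sum())
     .sort('Date', 'Store')
     .collect()
     )

if __name__ == '__main__':
    build_sales_summary('data/sales.csv').write_parquet('output/sales_summary.parquet')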

Python scripts are easy to manage in repositories such as GitHub. A repository also provides a reliable backup and simplifies collaboration. Version control systems maintain a complete history of changes, which is useful when tracing modifications or meeting audit requirements. Their main limitation is that they are not automatically integrated into the analysis workflow. You must remember to commit updates, and it is easy to miss these manual steps.

7.3 Organizing Computations

When building complex datasets, two common challenges emerge: ensuring the logic is correct and achieving acceptable performance. If the logic is flawed, the resulting data is unusable. Performance, by contrast, often has a wide tolerance in analytical work. Code that does not run at all is a serious issue, but the difference between a 30-second runtime and a 60-second runtime is usually insignificant. Polars often provides multiple valid ways to express a query, which gives you flexibility to confirm correctness and improve performance for longer-running operations.

One practical way to structure intermediate steps in Polars is to assign partial results, whether values or dataframes, to variables. You are free to choose any variable names, but descriptive names generally make the workflow easier to read and reason about.

7.3.1 Using Intermediate Variables

Intermediate variables allow you to capture the output of any Polars expression, whether that output is a dataframe, a list, a tuple, or even a dictionary. Once stored, the value can be passed into another Polars expression to produce a final result. For more involved transformations, you may rely on several such variables along the way. When a computation becomes too complex to express as a single chained expression, intermediate variables provide a practical alternative. They can also make long transformation sequences easier to read and reason about.

Although I generally prefer to avoid intermediate variables, they do serve an important role, and in some cases they are unavoidable. One way to think about them is through a simple analogy. Imagine you are sitting in the living room and ask your teenage daughter to bring you a spoon and a fork from the kitchen. If you ask for the spoon first and then send her back for the fork, that mirrors the use of intermediate variables, with each trip representing a separate step. Ideally, you would ask her to bring both items at once. However, when the number of things you need exceeds what she can carry, multiple trips become necessary. In the same way, intermediate variables become necessary when a transformation cannot reasonably be expressed as a single operation.

Below is an example where intermediate variables help us reach the desired output. Using the liquor_sales dataset introduced in Chapter 5, suppose we want to compute aggregated liquor sales by store, by category, and by vendor as independent aggregations, rather than only the combinations that appear in the raw data:

store_df = (liquor_sales
 .group_by('Store')
 .agg(pl.sum('Dollars'))
 .with_columns(Category=pl.lit(None, dtype=pl.String),
               Vendor=pl.lit(None, dtype=pl.String))
 .select('Store', 'Category', 'Vendor', 'Dollars')
)

category_df = (liquor_sales
 .group_by('Category')
 .agg(Dollars=pl.sum('Dollars'))
 .with_columns(Store=pl.lit(None, dtype=pl.String),
               Vendor=pl.lit(None, dtype=pl.String))
 .select('Store', 'Category', 'Vendor', 'Dollars')
)

vendor_df = (liquor_sales
 .group_by('Vendor')
 .agg(Dollars=pl.sum('Dollars'))
 .with_columns(Category=pl.lit(None, dtype=pl.String),
               Store=pl.lit(None, dtype=pl.String))
 .select('Store', 'Category', 'Vendor', 'Dollars')
)

pl.concat([store_df, category_df, vendor_df])
shape: (4_111, 4)
Store Category Vendor Dollars
str str str f16
"THE STATION" null null inf
"WEST SIDE LIQUOR" null null inf
"CITY LIQUOR AND TOBACCO / WATE… null null inf
"KWIK STOP C-STORES / DUBUQUE" null null inf
"LALO'S LIQUOR / AMES" null null 794.0
null null "A HARDY / U.S.A., LTD." inf
null null "KNEE-HIGH DISTILLING CO. LLC" inf
null null "CADRE NOIR LLC / CNI BRANDS" 3432.0
null null "KOBRAND CORPORATION" inf
null null "DARK ARTS WHISKEY HOUSE LLC" 15408.0

Here, each group_by result is stored in its own variable and then combined using concat. You could avoid naming these intermediate results by embedding each chain directly inside the concat operation, but doing so makes the code harder to follow:

(pl.concat([
    (liquor_sales
     .group_by('Store')
     .agg(pl.sum('Dollars'))
     .with_columns(Category=pl.lit(None, dtype=pl.String),
                   Vendor=pl.lit(None, dtype=pl.String))
     .select('Store', 'Category', 'Vendor', 'Dollars')
    ),
    (liquor_sales
     .group_by('Category')
     .agg(Dollars=pl.sum('Dollars'))
     .with_columns(Store=pl.lit(None, dtype=pl.String),
                   Vendor=pl.lit(None, dtype=pl.String))
     .select('Store', 'Category', 'Vendor', 'Dollars')
    ),
    (liquor_sales
     .group_by('Vendor')
     .agg(Dollars=pl.sum('Dollars'))
     .with_columns(Category=pl.lit(None, dtype=pl.String),
                   Store=pl.lit(None, dtype=pl.String))
     .select('Store', 'Category', 'Vendor', 'Dollars')
    )])
)
shape: (4_111, 4)
Store Category Vendor Dollars
str str str f16
"CASEY'S GENERAL STORE #1125 / … null null inf
"PRIME STAR" null null inf
"HY-VEE C-STORE / FAIRFEILD" null null 58560.0
"KWIK SHOP #520 / CARTER LAKE" null null inf
"LOCAL CRAFT CELLAR / CEDAR RAP… null null inf
null null "KOLOA RUM COMPANY" 42080.0
null null "PARK STREET IMPORTS- CABALLERO… 9168.0
null null "PENNSYLVANIA PURE DISTILLERIES… 306.0
null null "GOAMERICAGO BEVERAGES LLC" inf
null null "CEDAR RIDGE VINEYARDS LL" inf

Notice that this version is not truly a single chained expression. Instead, it consists of three separate chains placed inside a concat. Functionally, it behaves the same as the version that uses intermediate variables: each chain is still evaluated as its own eager query before the results are combined, so nesting them inside concat does not make anything run faster.

You could also express this pattern more compactly with a Python list comprehension (plus a dictionary comprehension to fill in the null columns). However, that approach is still not idiomatic Polars, since the loop that builds the three queries runs in Python rather than being expressed as a single query plan that Polars can optimize as a whole. As a result, it may look cleaner, but it is not fundamentally different from using intermediate variables.

(pl.concat([
    liquor_sales
    .group_by(col)
    .agg(Dollars=pl.sum('Dollars'))
    .with_columns(**{c: pl.lit(None, dtype=pl.String) 
                     for c in ['Store', 'Category', 'Vendor'] if c != col})
    .select('Store', 'Category', 'Vendor', 'Dollars')
    for col in ['Store', 'Category', 'Vendor']
])
)
shape: (4_111, 4)
Store Category Vendor Dollars
str str str f16
"GUPPY'S ON THE GO  /  ROBINS" null null 10384.0
"PIRILLO BEVERAGE" null null inf
"UNIVERSITY GROCERIES" null null inf
"AJ'S JACK & JILL / WEST BRANCH" null null inf
"AUDUBON FOOD PRIDE" null null inf
null null "DOGTOWN DISTILLING / WRIGHT & … 2984.0
null null "ELIXIR, INC." 5256.0
null null "KOENIG DISTILLERY INC" 2086.0
null null "VISION WINE & SPIRIT LLC" inf
null null "JOHN ERNEST DISTILLERY, INC." inf

Here is another situation where intermediate variables are essential. In this example, we work with the Brisbane City Council library book checkouts1, covering the years 2020 through 2025. For convenience, all yearly files have been merged into a single dataset, and several long column names have been shortened.

Suppose we want to identify which authors featured in the two libraries with the most checkouts in 2024. One effective approach is to first determine the Library values of those top two libraries and store them in an intermediate variable. That variable can then be used to filter the dataset down to only the relevant libraries from 2024.

We start by identifying the two libraries that recorded the highest total number of checkouts across the dataset.

import polars as pl

new_cols = ['Title', 'Author', 'Call_Num', 'ID', 'Item', 'Status',
 'Lang', 'Age', 'Library', 'Date']

library_checkouts = (pl.scan_parquet('data/brisbane_library.parquet')
 .pipe(lambda lf: lf.rename(dict(zip(lf.collect_schema().names(), new_cols))))
 .collect()
 )

top_2_checkouts_library = (library_checkouts
1 .filter(pl.col('ID') != "REFRESH")
 .group_by('Library').len()
 .sort('len')
 .get_column('Library')
 .tail(2)
 .to_list()
 )
top_2_checkouts_library
1
Exclude this value since it does not represent a valid ID.
['SBK', 'CDE']

With the list of libraries in hand, we can now identify the authors featured in those two libraries during 2024.

(library_checkouts
 .filter(pl.col('Date').dt.year() == 2024)
 .filter(pl.col('Library').is_in(top_2_checkouts_library))
 .select(pl.col('Author').unique())
 )
shape: (17_118, 1)
Author
str
"Sardi, Anna,"
"Lewis-Fitzgerald, Erin,"
"Servente, Paola,"
"Chernoff, Marc,"
"Auty, Kate,"
"Drews, C. G.,"
"Saintclare, Celine,"
"Pickford, Louise,"
"McCullough, Kathy,"
"McGovern, Kenny,"

This example highlights the flexibility that intermediate variables provide. They allow you to control the sequence of operations explicitly and to reuse intermediate results where needed. When a single chained expression becomes too long or difficult to reason about, intermediate variables can make the underlying logic clearer and easier to maintain.

7.4 Managing Dataset Size and Privacy Concerns

After refining the logic in our Polars workflows, organizing the code effectively, and optimizing performance, another challenge often comes into focus: the size of the resulting data. Even though storage costs continue to decline, encouraging organizations to retain increasingly large datasets, and computing power steadily improves, practical constraints remain. These limits often appear in downstream tools such as BI platforms or in the bandwidth required to transfer large datasets between systems. In addition, sensitive information must be handled responsibly, making data privacy an essential consideration. In this section, we explore strategies for reducing dataset size and highlight several privacy-related factors to keep in mind.

7.4.1 Sampling with Sample

One of the simplest ways to reduce a dataset is to work with a sample rather than the full set of records. Sampling involves selecting a subset of observations and is most effective when the original dataset is large enough that a smaller portion still reflects the overall population. For example, sampling web traffic data often preserves the majority of insights while significantly reducing volume.

When sampling, there are two key decisions to make. First, determine the appropriate sample size. The goal is to meaningfully reduce the data while retaining enough information to support analysis. Depending on the size of the original dataset, you might keep 10%, 1%, or even 0.1% of the rows. Second, decide what unit should be sampled. If the objective is to analyze user behavior on a website, sampling 1% of visitors is usually more appropriate than sampling 1% of visits, since the former preserves all activity associated with each selected user.
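
A sketch of this entity-level sampling, assuming hypothetical web-traffic data with one row per visit and a Visitor_ID column, might look like this: first draw a fraction of unique visitors, then keep every row belonging to the selected visitors.

import polars as pl

visits = pl.read_parquet('data/web_visits.parquet')  # hypothetical dataset

# Sample 1% of visitors rather than 1% of rows...
sampled_visitors = (visits
 .select(pl.col('Visitor_ID').unique())
 .sample(fraction=0.01, seed=42)
 .get_column('Visitor_ID')
 )

# ...then keep the full history of each sampled visitor.
(visits
 .filter(pl.col('Visitor_ID').is_in(sampled_visitors))
 )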

In Polars, sampling is typically performed using the sample method on a dataframe or the sample expression on individual columns. The most important parameters are listed below:

  • n: an integer specifying the exact number of rows to return. Default is None.
  • fraction: a float indicating the proportion of rows to include. Default is None.
  • with_replacement: a boolean that allows rows to be selected multiple times. Default is False.
  • shuffle: a boolean that determines whether the sampled rows should be randomly reordered. Default is False.
  • seed: an integer used to make the sampling reproducible. Default is None, meaning a new random seed is used each time.

To demonstrate how sample works, we will return to the clothing_store_sales dataset, which contains one thousand rows. The following example randomly selects 500 rows, representing half of the dataset:

clothing_store_sales = (pl.read_csv('data/sample_sales.csv', try_parse_dates=True)
      .select(pl.exclude('Account Name', 'sku'))
      .rename(lambda c: c.title().replace(' ','_'))
      .rename(dict(Account_Number='Customer_ID'))
      )

(clothing_store_sales
 .sample(500)
 )
shape: (500, 6)
Customer_ID Category Quantity Unit_Price Ext_Price Date
i64 str i64 f64 f64 datetime[μs]
962424 "Sweater" 2 51.42 102.84 2014-07-13 05:24:33
178061 "Sweater" 14 63.77 892.78 2013-11-29 05:41:41
947492 "Socks" 13 30.58 397.54 2014-07-28 04:57:13
824657 "Sweater" 18 97.45 1754.1 2013-11-21 08:08:36
459388 "Sweater" 20 61.35 1227.0 2014-04-30 13:02:15
613570 "Sweater" 3 26.14 78.42 2013-10-28 18:08:12
41295 "Socks" 15 84.31 1264.65 2013-12-04 08:15:19
68120 "Socks" 7 53.62 375.34 2014-01-25 21:26:04
957468 "Sweater" 7 29.24 204.68 2014-08-01 19:45:15
513158 "Socks" 10 88.08 880.8 2014-03-15 20:57:57

When the total number of rows is not known in advance, specifying a fixed count may be inconvenient. In these situations, the fraction argument provides a more flexible alternative:

(clothing_store_sales
 .sample(fraction=1/2)
 )
shape: (500, 6)
Customer_ID Category Quantity Unit_Price Ext_Price Date
i64 str i64 f64 f64 datetime[μs]
722921 "Hat" 6 90.73 544.38 2014-03-07 03:00:31
957468 "Sweater" 7 29.24 204.68 2014-08-01 19:45:15
140484 "Socks" 11 60.87 669.57 2014-06-27 21:35:51
838999 "Sweater" 2 79.45 158.9 2014-01-15 06:02:58
788202 "Sweater" 8 94.2 753.6 2014-08-17 09:57:38
54152 "Sweater" 3 66.53 199.59 2014-08-11 13:35:45
78536 "Socks" 6 87.92 527.52 2013-11-28 09:43:31
137335 "Sweater" 3 78.2 234.6 2013-11-23 01:39:14
40995 "Sweater" 19 51.38 976.22 2014-03-02 22:55:33
566221 "Sweater" 11 24.75 272.25 2013-10-14 16:05:43
Tip

Writing the fraction as 0.5 produces the same result.

Sampling can also be applied to individual columns rather than the entire dataframe:

(clothing_store_sales
 .select(pl.col('Customer_ID').sample(fraction=1/2))
 )
shape: (500, 1)
Customer_ID
i64
314575
48242
803666
349982
920238
513158
751711
837097
399553
507817

For reproducibility, it is generally best practice to set a seed so that others running the same code obtain identical results:

(clothing_store_sales
 .sample(fraction=1/2, seed=42)
 )
shape: (500, 6)
Customer_ID Category Quantity Unit_Price Ext_Price Date
i64 str i64 f64 f64 datetime[μs]
675333 "Socks" 10 59.68 596.8 2014-02-09 07:00:00
894083 "Sweater" 12 94.56 1134.72 2014-04-04 23:07:26
989350 "Socks" 9 56.88 511.92 2013-11-29 03:29:56
770435 "Sweater" 5 86.72 433.6 2014-07-14 11:42:38
314560 "Sweater" 14 98.85 1383.9 2014-04-30 05:49:59
989253 "Sweater" 3 85.92 257.76 2014-01-14 07:36:11
216227 "Socks" 14 26.23 367.22 2014-03-18 12:07:04
838999 "Hat" 2 51.25 102.5 2014-05-07 01:03:32
666731 "Socks" 19 38.97 740.43 2014-03-17 08:44:30
920517 "Sweater" 11 90.57 996.27 2014-08-03 03:37:02

Any integer can be used as the seed, but 42 is a common choice in many codebases. The number references Douglas Adams’ The Hitchhiker’s Guide to the Galaxy, where it is humorously described as the answer to life, the universe, and everything.

Warning

Because sample returns rows in random order, it is not appropriate for time series forecasting. Forecasting methods depend on chronological order, so a better approach is to filter the data up to a specific date rather than sampling it.
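
For instance, instead of sampling clothing_store_sales for a forecasting exercise, you might keep every row up to an arbitrary cutoff date, as in this minimal sketch:

import datetime as dt

(clothing_store_sales
 .filter(pl.col('Date') < dt.datetime(2014, 6, 1))
 .sort('Date')
 )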

Sampling can dramatically reduce dataset size, which speeds up computations in Polars and makes outputs easier to transfer between systems. However, sampling always involves a loss of detail. In cases where this trade-off is unacceptable, alternative approaches may be necessary.

7.4.2 Reducing Dimensionality

The number of unique attribute combinations, often referred to as dimensionality, has a major influence on how many rows appear in a dataset. A simple mental exercise makes this clear. Suppose a table contains a column with 10 unique categories. If we count records and group_by that column, the result contains 10 rows. Introducing a second column, also with 10 unique values, expands the grouped output to 100 rows. Adding a third column with 10 unique values increases the potential result to 1,000 rows. Even when a dataset does not contain every possible combination, it quickly becomes apparent that each additional column can cause the output size to grow rapidly.

During exploratory analysis, we usually have the flexibility to limit the number of columns or filter values so the resulting dataset remains manageable. When building datasets for downstream tools, however, the priority often shifts toward flexibility. This typically means including a wide range of attributes and calculated columns. To preserve detail without letting the dataset become unwieldy, it is important to apply deliberate grouping strategies.

Date and time columns are often the first place to look when trying to reduce dataset size. It is worth asking stakeholders whether daily granularity is truly necessary or whether weekly or monthly summaries would meet their needs. Grouping by month and day of week can also strike a useful balance, reducing volume while still revealing weekday versus weekend patterns. Limiting the time span is another option, though this can restrict the ability to analyze long-term trends. In practice, I have seen teams publish a long-range monthly dataset alongside a second dataset that provides daily or hourly detail for a shorter, more recent period.
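
The sketch below illustrates both options using clothing_store_sales: first rolling the data up to monthly totals, then grouping by month and day of week so weekday-versus-weekend patterns stay visible.

# Monthly totals instead of one row per transaction.
(clothing_store_sales
 .group_by(pl.col('Date').dt.truncate('1mo').alias('Month'))
 .agg(Ext_Price=pl.col('Ext_Price').sum())
 .sort('Month')
 )

# Month plus day of week preserves weekday-versus-weekend patterns.
(clothing_store_sales
 .group_by(pl.col('Date').dt.month().alias('Month'),
           pl.col('Date').dt.weekday().alias('Weekday'))
 .agg(Ext_Price=pl.col('Ext_Price').sum())
 .sort('Month', 'Weekday')
 )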

Text fields are another common source of unnecessary dimensionality. Minor differences in spelling or capitalization can create many distinct values that add little analytical value. Applying text-cleaning functions from Chapter 4, such as to_lowercase, to_uppercase, or strip_chars, helps standardize entries and often improves usability for stakeholders. More targeted changes, such as correcting misspellings or updating outdated names, can be handled with replace or when-then-otherwise expressions.
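
For example, a light standardization pass over the store names in liquor_sales might look like the sketch below, which trims whitespace, upper-cases the values, and corrects the “FAIRFEILD” misspelling seen earlier in the output; the exact fixes you need will depend on your own columns.

(liquor_sales
 .with_columns(pl.col('Store')
               .str.strip_chars()
               .str.to_uppercase()
               # Correct a known misspelling of Fairfield.
               .str.replace('FAIRFEILD', 'FAIRFIELD'))
 )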

In some cases, only a small subset of values is important for analysis, and grouping the remaining values together is both practical and effective. This approach is common with geographic data. Although nearly two hundred countries exist, only a handful may have enough observations to justify reporting them individually.

The liquor_sales dataset used in Chapter 5 includes 99 valid county names. Suppose we want to retain detailed records for the five most populous counties, currently Polk, Linn, Scott, Johnson, and Black Hawk, and categorize all others as “Other.” We can do this using a when-then-otherwise expression:

(liquor_sales
 .join(county_population, on='County', how='inner')
 .with_columns(pl.when(pl.col('County').is_in(['POLK','LINN','SCOTT','JOHNSON','BLACK HAWK']))
               .then(pl.col('County'))
               .otherwise(pl.lit('Other'))
               .alias('County_Group'))
 .group_by('County_Group')
 .agg(Count=pl.len())
 .sort('Count', descending=True)
 )
shape: (6, 2)
County_Group Count
str u32
"Other" 17267124
"POLK" 5719277
"LINN" 2559418
"SCOTT" 1838667
"BLACK HAWK" 1709832
"JOHNSON" 1505461

This reduces the output to just 6 rows instead of 99, which is a substantial improvement. To make the grouping more adaptive, we can first compute a rank based on the number of distinct Invoice values, which represent transaction IDs, per county. We then keep the top five counties and label all others as “Other”:

(liquor_sales
1 .filter(pl.col('County').ne('EL PASO'))
 .join(liquor_sales
       .filter(pl.col('County').ne('EL PASO'))
       .group_by('County')
       .agg(pl.col('Invoice').n_unique())
       .with_columns(Rank=pl.col('Invoice').rank(method='min', descending=True))
       ,on='County')
 .with_columns(pl.when(pl.col('Rank') <= 5)
               .then(pl.col('County'))
               .otherwise(pl.lit('Other'))
               .alias('County_Group'))
 .group_by('County_Group')
 .agg(Transactions=pl.col('Invoice').n_unique())
 .sort('Transactions', descending=True)
)
1
El Paso is not a valid Iowa county.
shape: (6, 2)
County_Group Transactions
str u32
"Other" 17267124
"POLK" 5719277
"LINN" 2559418
"SCOTT" 1838667
"BLACK HAWK" 1709832
"JOHNSON" 1505461

As new data is added, this dynamic approach ensures that the top counties are always determined using the most current information.

Dimensionality can also be reduced by converting detailed values into simple flag indicators. These flags typically have two possible states. They may be represented as Boolean values, numeric pairs such as 1 and 0, or text labels like “Yes” and “No,” as well as any other meaningful two-category scheme. Flags are especially useful when only a threshold matters and the finer-grained information is unnecessary. For example, we may care whether a website visitor made a purchase at all, while the exact number of purchases is not essential.

In the library_checkouts dataset, roughly 1.3 million unique ID values represent items checked out from Brisbane libraries. Rather than preserving the exact checkout count for each item, we may only need to record whether an item was checked out at least twice. This can be done by converting the numeric counts into a binary flag.

(library_checkouts
 .filter(pl.col('ID') != "REFRESH")
 .group_by('ID')
 .agg(Checkouts=pl.len())
 .with_columns(pl.when(pl.col('Checkouts') >= 2)
               .then(True)
               .otherwise(False)
               .alias('Two_Checkouts_Flag'))
 .group_by('Two_Checkouts_Flag')
 .agg(Items=pl.len())
)
shape: (2, 2)
Two_Checkouts_Flag Items
bool u32
false 542858
true 826199

The number of items checked out two or more times is more than one and a half times the number of items checked out only once. When combined with other columns, this kind of simplification can significantly reduce the size of the resulting dataset.

In some situations, however, a binary flag does not provide enough nuance. When more detail is needed, numeric values can be grouped into multiple categories instead of just two. This is often implemented using a when-then-otherwise expression, with the resulting categories represented by numeric or textual labels.

For example, we might want to distinguish not only items checked out at least twice, but also those checked out ten or more times.

(library_checkouts
 .filter(pl.col('ID') != "REFRESH")
 .group_by('ID')
 .agg(Checkouts=pl.len())
 .with_columns(pl.when(pl.col('Checkouts') >= 10)
               .then(pl.lit('10+'))
               .when(pl.col('Checkouts') >= 2)
               .then(pl.lit('2 - 9'))
               .otherwise(pl.lit('1'))
               .alias('Checkouts_Level'))
 .group_by('Checkouts_Level')
 .agg(Items=pl.len())
)
shape: (3, 2)
Checkouts_Level Items
str u32
"2 - 9" 818933
"10+" 7266
"1" 542858

This approach reduces approximately 1.3 million unique values to just three categories while preserving the distinction between items checked out once, items checked out a few times, and items checked out many times. Similar forms of categorization appear in many analytical contexts. As with other transformations, selecting meaningful breakpoints often requires experimentation. Striking the right balance between granularity and simplification can substantially reduce dataset size and often improves the performance and responsiveness of downstream systems.

7.4.3 PII and Data Privacy

Protecting personal data has become a central concern for anyone working with information today. Rich datasets with many columns can enable deeper insights and more accurate recommendations. However, when those datasets describe real people, it is essential to account for both ethical responsibilities and legal requirements related to how the data is collected and used. Regulations governing the privacy of patients, students, and banking customers have existed for decades. More recently, laws focused on consumer data rights have emerged as well. The European Union’s General Data Protection Regulation (GDPR) is the most widely known example. Other important regulations include Canada’s Personal Information Protection and Electronic Documents Act (PIPEDA), India’s Digital Personal Data Protection (DPDP) Act, and Japan’s Act on the Protection of Personal Information (APPI).

These and similar laws define how personally identifiable information (PII) must be handled, stored, and in some cases deleted. Some forms of PII are obvious, such as names, physical addresses, email accounts, birth dates, and Social Security numbers. PII also includes health-related measures like pulse rate, blood pressure, and medical conditions. Location data, including GPS coordinates, is also classified as PII because only a small number of points can uniquely identify a person. For example, a GPS reading at my home could identify whoever is present there, and another reading at my workplace could identify me specifically. Anyone working with data should understand the scope of these regulations and consult their organization’s privacy experts, who can provide the most up-to-date guidance.

A common best practice when working with datasets that contain PII is to ensure that sensitive columns do not appear in analytical outputs. This can be achieved through aggregation, value replacement, or the application of hashing techniques.

Most analytical tasks aim to reveal patterns or summarize behavior. Counting users or calculating averages, rather than exposing individual-level details, is usually sufficient. Aggregation often removes PII, but caution is still required. A combination of attributes that appears only once may still reveal an individual’s identity. Treating these cases as outliers and excluding them can help maintain stronger privacy protections.
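
A minimal sketch of this safeguard, using a hypothetical user-level dataframe and an illustrative threshold, is to drop any aggregated group that describes fewer than a handful of people:

MIN_GROUP_SIZE = 5  # illustrative threshold

(users_df  # hypothetical dataframe with one row per user
 .group_by('Country', 'Age_Band', 'Plan')
 .agg(Users=pl.len())
 # Rare combinations could identify individuals, so exclude them.
 .filter(pl.col('Users') >= MIN_GROUP_SIZE)
 )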

When individual-level data is required, for example to compute distinct user counts in a downstream system, sensitive values can be replaced with randomly generated alternatives that preserve uniqueness. The rank function with method='dense' can be used to assign each person a new identifier. Using the dense method ensures that each new rank increments directly after the previous group of tied values.

Suppose we want to anonymize the email addresses in the dataframe below.

emails_df = pl.DataFrame({
    'Email': ['eleonora@memail.com', 'mae@memail.com',
    'grace@memail.com', 'mae@memail.com', 'elsa@memail.com'],
})
emails_df
shape: (5, 1)
Email
str
"eleonora@memail.com"
"mae@memail.com"
"grace@memail.com"
"mae@memail.com"
"elsa@memail.com"

We can transform the emails to protect individual identities:

(emails_df
1 .sort('Email')
 .with_columns(Anonymized=pl.col('Email').rank(method='dense'))
)
1
Sorting is not strictly required for correctness: rank compares values, so repeated entries such as “mae@memail.com” receive the same rank wherever they appear in the column. Sorting simply makes the output easier to read, with anonymized IDs assigned in alphabetical order.
shape: (5, 2)
Email Anonymized
str u32
"eleonora@memail.com" 1
"elsa@memail.com" 2
"grace@memail.com" 3
"mae@memail.com" 4
"mae@memail.com" 4

Hashing is another useful technique. A hash function converts an input value into a new value using a deterministic algorithm. The same input always produces the same hash, which makes this approach useful for preserving uniqueness while hiding the original data. Hashed values can be generated using the hash expression:

(emails_df
 .with_columns(Anonymized=pl.col('Email').hash())
)
shape: (5, 2)
Email Anonymized
str u64
"eleonora@memail.com" 13560331844008400074
"mae@memail.com" 17484378495117224810
"grace@memail.com" 18179291770701544741
"mae@memail.com" 17484378495117224810
"elsa@memail.com" 12881059481224718104
Warning

The hash expression does not guarantee consistent results across Polars versions; hashed values are only stable within a single version.
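
If the hashed values must remain stable across library versions, one option (not used elsewhere in this book) is to compute a cryptographic hash in Python with hashlib. This runs row by row through map_elements, so it is slower than the native hash expression, and in practice a secret salt would usually be prepended so the hashes cannot be reversed with a dictionary of known emails. A minimal sketch:

import hashlib

(emails_df
 .with_columns(Anonymized=pl.col('Email').map_elements(
     lambda e: hashlib.sha256(e.encode()).hexdigest(),
     return_dtype=pl.String))
 )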

Whenever possible, avoid including PII in the outputs of your Polars workflows so that sensitive information does not propagate into additional systems or files. When that is not feasible, masking or replacing values is the next best option. You may also consider secure data sharing designs, such as building a protected pipeline between a data source, for example via a web API, and an email delivery system. This approach eliminates the need to write email addresses to disk. With thoughtful system design and close collaboration with legal and technical teams, it is possible to produce high quality analyses while still respecting individual privacy.

7.5 Conclusion

Every analysis involves decisions about code structure, complexity management, performance optimization, and the protection of private information in the results. This chapter introduced several techniques, patterns, and Polars-specific features that support these goals. There is no need to master every concept at once or to apply every method in every analysis. Many projects require only a subset of these tools, and there are often multiple valid ways to solve the same problem. As you continue working with Polars, you will naturally encounter situations where these strategies become relevant and valuable.


  1. The datasets can be downloaded from the council website: https://data.brisbane.qld.gov.au/explore/?q=library+checkout+dataset&sort=modified↩︎