Don’t Write Polars Code with a Pandas Accent

JORAM MUTENGE

2026-05-16

Why listen to me?

  • MSc Information Management from University of Illinois Urbana-Champaign
  • Used Polars in production since 2023
  • Author of new Polars book Buy the book!

Outline

  • Five pandas-accented mistakes
  • Show alternative idiomatic code
  • Recap
  • Questions

What is polars?

  • Yes, it’s a bear
  • Open-source library
  • For data manipulation
  • Like pandas, but with typed API that’s expression-based

Pandas-accented mistakes

1. Choosing columns with square brackets

import polars as pl

df = pl.read_parquet('../../datasets/liquor_sales_iowa.parquet')

(df
 [['Date','Store','Pack']]
 )
shape: (1_833_973, 3)
Date Store Pack
date str i8
2025-05-20 "CASEY'S GENERAL STORE #3055 / … 24
2025-05-20 "FAREWAY STORES #561 / WAVERLY" 12
2025-05-19 "7STAR LIQUOR & TOBACCO OUTLET" 12
2025-05-16 "RANDY'S NEIGHBORHOOD MARKET" 12
2025-05-19 "ALLCOOL LIQUORS / WDM" 12
2025-05-19 "A&S FOOD AND GAS / DAVENPORT" 12
2025-05-19 "SOUTHSIDE TOBACCO & LIQUOR" 3
2025-05-20 "ALI'S CORNER / WATERLOO" 12
2025-05-16 "ANOTHER ROUND / DEWITT" 6
2025-05-16 "QUIK TRIP #531 / GRIMES" 24

The idiomatic way

(df
 .select(pl.col('Date'), pl.col('Store'), pl.col('Pack'))
 .select('Date','Store','Pack')
 )
shape: (1_833_973, 3)
Date Store Pack
date str i8
2025-05-20 "CASEY'S GENERAL STORE #3055 / … 24
2025-05-20 "FAREWAY STORES #561 / WAVERLY" 12
2025-05-19 "7STAR LIQUOR & TOBACCO OUTLET" 12
2025-05-16 "RANDY'S NEIGHBORHOOD MARKET" 12
2025-05-19 "ALLCOOL LIQUORS / WDM" 12
2025-05-19 "A&S FOOD AND GAS / DAVENPORT" 12
2025-05-19 "SOUTHSIDE TOBACCO & LIQUOR" 3
2025-05-20 "ALI'S CORNER / WATERLOO" 12
2025-05-16 "ANOTHER ROUND / DEWITT" 6
2025-05-16 "QUIK TRIP #531 / GRIMES" 24

Select column as expression when modifying

(df
 .select(pl.col('Date').dt.strftime('%b').alias('Month'),
         'Store','Category')
 )
shape: (1_833_973, 3)
Month Store Category
str str str
"Apr" "HY-VEE DRUGSTORE / DAVENPORT" "BLENDED WHISKIES"
"Feb" "HY-VEE / WAUKEE" "CREAM LIQUEURS"
"Jul" "BENDER FOODS / GUTTENBERG" "FLAVORED RUM"
"Jan" "STAR LIQUOR TOBACCO AND VAPE /… "WHISKEY LIQUEUR"
"Jun" "380BP / SWISHER" "STRAIGHT BOURBON WHISKIES"
"Jun" "KUM & GO #1202 / WAUKEE" "AMERICAN VODKAS"
"Aug" "HY-VEE WINE AND SPIRITS / WDM" "COCKTAILS/RTD"
"Sep" "CVS PHARMACY #8547 / IOWA CITY" "CANADIAN WHISKIES"
"Feb" "HY-VEE #4 / DAVENPORT" "AMERICAN VODKAS"
"Jul" "BIG G FOOD STORE" "BLENDED WHISKIES"

Pandas-accented mistakes

2. Multiple intermediate variables

  • Q: Which category generates the most revenue, and does that ranking change when you look at volume (gallons) instead?

Pandas-accented mistakes

2. Multiple intermediate variables

Variable 1

revenue_ranking = (df
 .with_columns(Revenue=pl.col('Bottles_Sold') * pl.col('Retail_Price'))
 .group_by('Category')
 .agg(pl.col('Revenue').sum())
 .sort('Revenue', descending=True)
 .with_row_index('Revenue_Rank', offset=1)
 )
revenue_ranking
shape: (44, 3)
Revenue_Rank Category Revenue
u32 str f64
1 "AMERICAN VODKAS" 4.7688e7
2 "CANADIAN WHISKIES" 3.2473e7
3 "STRAIGHT BOURBON WHISKIES" 2.8392e7
4 "100% AGAVE TEQUILA" 2.5294e7
5 "WHISKEY LIQUEUR" 2.0025e7
40 "TRIPLE SEC" 610540.07
41 "MEZCAL" 347583.57
42 "NEUTRAL GRAIN SPIRITS" 327735.29
43 "CORN WHISKIES" 82421.91
44 "AMERICAN SLOE GINS" 15090.18

Variable 2

volume_ranking = (df
 .group_by('Category')
 .agg(pl.col('Vol_Sold_gal').sum().alias('Total_Vol_gal'))
 .sort('Total_Vol_gal', descending=True)
 .with_row_index('Volume_Rank', offset=1)
)
volume_ranking
shape: (44, 3)
Volume_Rank Category Total_Vol_gal
u32 str f64
1 "AMERICAN VODKAS" 1.0737e6
2 "CANADIAN WHISKIES" 498621.63
3 "STRAIGHT BOURBON WHISKIES" 254581.81
4 "SPICED RUM" 234828.75
5 "WHISKEY LIQUEUR" 197825.15
40 "AMERICAN DISTILLED SPIRITS SPE… 6360.33
41 "NEUTRAL GRAIN SPIRITS" 4683.66
42 "MEZCAL" 2261.04
43 "CORN WHISKIES" 695.88
44 "AMERICAN SLOE GINS" 465.34

Variable 3

# Join both rankings together and compare
combined = (revenue_ranking
 .join(volume_ranking, on='Category')
 .with_columns(pl.col('Revenue_Rank','Volume_Rank').cast(pl.Int32))
 .with_columns(Rank_Change=pl.col('Revenue_Rank') - pl.col('Volume_Rank'))
 .sort('Revenue_Rank')
 .select('Category','Revenue_Rank','Volume_Rank','Rank_Change')
 )
combined
shape: (44, 4)
Category Revenue_Rank Volume_Rank Rank_Change
str i32 i32 i32
"AMERICAN VODKAS" 1 1 0
"CANADIAN WHISKIES" 2 2 0
"STRAIGHT BOURBON WHISKIES" 3 3 0
"100% AGAVE TEQUILA" 4 7 -3
"WHISKEY LIQUEUR" 5 5 0
"TRIPLE SEC" 40 24 16
"MEZCAL" 41 42 -1
"NEUTRAL GRAIN SPIRITS" 42 41 1
"CORN WHISKIES" 43 43 0
"AMERICAN SLOE GINS" 44 44 0

Idiomatic - no intermediate variables

(df
 .with_columns(Revenue=pl.col('Bottles_Sold').mul('Retail_Price'))
 .group_by('Category')
 .agg(pl.col('Revenue','Vol_Sold_gal').sum())
 .with_columns(Revenue_Rank=pl.col('Revenue').rank(method='ordinal', descending=True),
               Volume_Rank=pl.col('Vol_Sold_gal').rank(method='ordinal', descending=True))
 .with_columns(Rank_Change=(pl.col('Revenue_Rank').cast(pl.Int32) - pl.col('Volume_Rank').cast(pl.Int32)))
 .sort('Revenue_Rank')
 .select('Category','Revenue_Rank','Volume_Rank','Rank_Change')
)
shape: (44, 4)
Category Revenue_Rank Volume_Rank Rank_Change
str u32 u32 i32
"AMERICAN VODKAS" 1 1 0
"CANADIAN WHISKIES" 2 2 0
"STRAIGHT BOURBON WHISKIES" 3 3 0
"100% AGAVE TEQUILA" 4 7 -3
"WHISKEY LIQUEUR" 5 5 0
"TRIPLE SEC" 40 24 16
"MEZCAL" 41 42 -1
"NEUTRAL GRAIN SPIRITS" 42 41 1
"CORN WHISKIES" 43 43 0
"AMERICAN SLOE GINS" 44 44 0

Your teenage daughter (a metaphor)

Performance comparison

  • With intermediate variables
31.3 ms ± 2.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


  • Idiomatic chained operation
17.3 ms ± 430 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pandas-accented mistakes

3. Overuse of eager execution

  • Eager mode - operations are executed immediately.
  • Lazy mode - operations are collected into a LazyFrame to build a query plan before being executed.
    • Optimization techniques include:
      • predicate pushdown (filtering rows early)
      • projection pushdown (selecting columns early)

Pandas-accented mistakes

3. Overuse of eager execution

Q: Show categories with a change in ranking greater than 10

shape: (44, 4)
Category Revenue_Rank Volume_Rank Rank_Change
str u32 u32 i32
"AMERICAN VODKAS" 1 1 0
"CANADIAN WHISKIES" 2 2 0
"STRAIGHT BOURBON WHISKIES" 3 3 0
"100% AGAVE TEQUILA" 4 7 3
"WHISKEY LIQUEUR" 5 5 0
"TRIPLE SEC" 40 24 -16
"MEZCAL" 41 42 1
"NEUTRAL GRAIN SPIRITS" 42 41 -1
"CORN WHISKIES" 43 43 0
"AMERICAN SLOE GINS" 44 44 0
(df
 .lazy()
 .with_columns(Revenue=pl.col('Bottles_Sold').mul('Retail_Price'))
 .group_by('Category')
 .agg(pl.col('Revenue','Vol_Sold_gal').sum())
 .with_columns(Revenue_Rank=pl.col('Revenue').rank(method='ordinal', descending=True),
               Volume_Rank=pl.col('Vol_Sold_gal').rank(method='ordinal', descending=True))
 .with_columns(Rank_Change=(pl.col('Volume_Rank').cast(pl.Int32) - pl.col('Revenue_Rank').cast(pl.Int32)))
 .sort('Revenue_Rank')
 .select('Category','Revenue_Rank','Volume_Rank','Rank_Change')
 .filter(pl.col('Rank_Change').abs() > 10)
 .collect()
)
shape: (3, 4)
Category Revenue_Rank Volume_Rank Rank_Change
str u32 u32 i32
"TEMPORARY & SPECIALTY PACKAGES" 9 20 11
"IMPORTED DISTILLED SPIRITS SPE… 30 17 -13
"TRIPLE SEC" 40 24 -16

Performance comparison

  • Eager mode
17.6 ms ± 662 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


  • Lazy mode
16 ms ± 372 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The lazy mode query plan

Read from bottom to top!

Pandas-accented mistakes

4. Adding across columns

shape: (9, 7)
Store Month Mon Tue Wed Thu Fri
str str f64 f64 f64 f64 f64
"CASEY'S GENERAL STORE" "Jan" 421779.11 496222.83 437798.84 470631.42 747257.54
"CASEY'S GENERAL STORE" "Feb" 405312.16 498864.1 493118.19 275825.19 644216.17
"CASEY'S GENERAL STORE" "Mar" 413580.02 581096.63 413073.32 361039.54 761948.53
"CASEY'S GENERAL STORE" "Apr" 343738.39 828913.47 630487.31 290815.02 723226.92
"CASEY'S GENERAL STORE" "May" 284811.77 661586.62 596480.34 441896.98 1005082.4
"CASEY'S GENERAL STORE" "Jun" 568432.08 642478.37 543539.05 388688.94 886570.59
"CASEY'S GENERAL STORE" "Jul" 368161.5 685091.68 598812.68 502196.72 578484.54
"CASEY'S GENERAL STORE" "Aug" 477613.7 541931.52 528699.39 345751.33 979360.23
"CASEY'S GENERAL STORE" "Sep" 428196.8 795148.85 610164.28 327594.13 796973.5
(casey_revenue
 .with_columns(Total=pl.col('Mon') + pl.col('Tue') + pl.col('Wed') + pl.col('Thu') + pl.col('Fri'),
               _Total=pl.sum_horizontal(pl.nth(2,3,4,5,6)),
               __Total=pl.sum_horizontal(pl.nth(range(2,7))),
               ___Total=pl.sum_horizontal(pl.all().exclude('Store','Month')),
               )
 .select('Store', 'Month', 'Total', '_Total', '__Total', '___Total')
 )
shape: (9, 6)
Store Month Total _Total __Total ___Total
str str f64 f64 f64 f64
"CASEY'S GENERAL STORE" "Jan" 2.5737e6 2.5737e6 2.5737e6 2.5737e6
"CASEY'S GENERAL STORE" "Feb" 2.3173e6 2.3173e6 2.3173e6 2.3173e6
"CASEY'S GENERAL STORE" "Mar" 2.5307e6 2.5307e6 2.5307e6 2.5307e6
"CASEY'S GENERAL STORE" "Apr" 2.8172e6 2.8172e6 2.8172e6 2.8172e6
"CASEY'S GENERAL STORE" "May" 2.9899e6 2.9899e6 2.9899e6 2.9899e6
"CASEY'S GENERAL STORE" "Jun" 3.0297e6 3.0297e6 3.0297e6 3.0297e6
"CASEY'S GENERAL STORE" "Jul" 2.7327e6 2.7327e6 2.7327e6 2.7327e6
"CASEY'S GENERAL STORE" "Aug" 2.8734e6 2.8734e6 2.8734e6 2.8734e6
"CASEY'S GENERAL STORE" "Sep" 2.9581e6 2.9581e6 2.9581e6 2.9581e6

Even more idiomatic

import polars.selectors as cs

(casey_revenue
 .with_columns(_Total=pl.sum_horizontal(cs.float()))
 .select('Store','Month', cs.starts_with('_'))
 )
shape: (9, 3)
Store Month _Total
str str f64
"CASEY'S GENERAL STORE" "Jan" 2.5737e6
"CASEY'S GENERAL STORE" "Feb" 2.3173e6
"CASEY'S GENERAL STORE" "Mar" 2.5307e6
"CASEY'S GENERAL STORE" "Apr" 2.8172e6
"CASEY'S GENERAL STORE" "May" 2.9899e6
"CASEY'S GENERAL STORE" "Jun" 3.0297e6
"CASEY'S GENERAL STORE" "Jul" 2.7327e6
"CASEY'S GENERAL STORE" "Aug" 2.8734e6
"CASEY'S GENERAL STORE" "Sep" 2.9581e6

Pandas-accented mistakes

5. Don’t use drop

shape: (9, 7)
Store Month Mon Tue Wed Thu Fri
str str f64 f64 f64 f64 f64
"CASEY'S GENERAL STORE" "Jan" 421779.11 496222.83 437798.84 470631.42 747257.54
"CASEY'S GENERAL STORE" "Feb" 405312.16 498864.1 493118.19 275825.19 644216.17
"CASEY'S GENERAL STORE" "Mar" 413580.02 581096.63 413073.32 361039.54 761948.53
"CASEY'S GENERAL STORE" "Apr" 343738.39 828913.47 630487.31 290815.02 723226.92
"CASEY'S GENERAL STORE" "May" 284811.77 661586.62 596480.34 441896.98 1005082.4
"CASEY'S GENERAL STORE" "Jun" 568432.08 642478.37 543539.05 388688.94 886570.59
"CASEY'S GENERAL STORE" "Jul" 368161.5 685091.68 598812.68 502196.72 578484.54
"CASEY'S GENERAL STORE" "Aug" 477613.7 541931.52 528699.39 345751.33 979360.23
"CASEY'S GENERAL STORE" "Sep" 428196.8 795148.85 610164.28 327594.13 796973.5

Remove the column “Store”

(casey_revenue
 .drop('Store')
 )
shape: (9, 6)
Month Mon Tue Wed Thu Fri
str f64 f64 f64 f64 f64
"Jan" 421779.11 496222.83 437798.84 470631.42 747257.54
"Feb" 405312.16 498864.1 493118.19 275825.19 644216.17
"Mar" 413580.02 581096.63 413073.32 361039.54 761948.53
"Apr" 343738.39 828913.47 630487.31 290815.02 723226.92
"May" 284811.77 661586.62 596480.34 441896.98 1005082.4
"Jun" 568432.08 642478.37 543539.05 388688.94 886570.59
"Jul" 368161.5 685091.68 598812.68 502196.72 578484.54
"Aug" 477613.7 541931.52 528699.39 345751.33 979360.23
"Sep" 428196.8 795148.85 610164.28 327594.13 796973.5

Dropping non-existent column fails

(casey_revenue
 .drop('Joram')
 )
ColumnNotFoundError: "Joram" not found

Use exclude

(casey_revenue
 .select(pl.exclude('Store'))
 .select(pl.exclude('Joram'))
 )
shape: (9, 6)
Month Mon Tue Wed Thu Fri
str f64 f64 f64 f64 f64
"Jan" 421779.11 496222.83 437798.84 470631.42 747257.54
"Feb" 405312.16 498864.1 493118.19 275825.19 644216.17
"Mar" 413580.02 581096.63 413073.32 361039.54 761948.53
"Apr" 343738.39 828913.47 630487.31 290815.02 723226.92
"May" 284811.77 661586.62 596480.34 441896.98 1005082.4
"Jun" 568432.08 642478.37 543539.05 388688.94 886570.59
"Jul" 368161.5 685091.68 598812.68 502196.72 578484.54
"Aug" 477613.7 541931.52 528699.39 345751.33 979360.23
"Sep" 428196.8 795148.85 610164.28 327594.13 796973.5

Recap

  • Think in Polars first, not pandas
  • Leverage the query optimizer with lazy mode
  • Avoid intermediate variables if you can
  • Avoid drop in favor of exclude

Pop quiz!

Is the code pandas-accented or idiomatic?

(df
 .filter(pl.col('Pack').mod(3) == 0)
 .shape[0]
 )
1685373

pandas-accented


The idiomatic way

(df
 .filter(pl.col('Pack').mod(3) == 0)
 .height
 )
1685373

Pop quiz!

Is the code pandas-accented or idiomatic?

(df
 ['Category']
 .value_counts()
 )
shape: (44, 2)
Category count
str u32
"IMPORTED FLAVORED VODKA" 19572
"IMPORTED CORDIALS & LIQUEURS" 41481
"SCOTCH WHISKIES" 20247
"SPICED RUM" 64597
"IMPORTED DRY GINS" 17288
"SINGLE BARREL BOURBON WHISKIES" 7852
"TENNESSEE WHISKIES" 53341
"CANADIAN WHISKIES" 168136
"COCKTAILS/RTD" 64715
"NEUTRAL GRAIN SPIRITS FLAVORED" 23451

pandas-accented

The idiomatic way

(df
 .get_column('Category')
 .value_counts()
 )
shape: (44, 2)
Category count
str u32
"AMERICAN SLOE GINS" 339
"IMPORTED VODKAS" 40479
"IMPORTED FLAVORED VODKA" 19572
"100% AGAVE TEQUILA" 105951
"SINGLE BARREL BOURBON WHISKIES" 7852
"BOTTLED IN BOND BOURBON" 9313
"CORN WHISKIES" 484
"SINGLE MALT SCOTCH" 9560
"IRISH WHISKIES" 21591
"WHITE RUM" 28136

Pop quiz!

Is the code pandas-accented or idiomatic?

(df
 .lazy()
 .group_by('Category')
 .agg(pl.sum('Vol_Sold_gal'))
 .top_k(10, by='Vol_Sold_gal')
 .sort('Vol_Sold_gal', descending=True)
 .with_columns(Cum_Sum=pl.col('Vol_Sold_gal').cum_sum())
 .with_columns(Cum_Pct=pl.col('Cum_Sum') / pl.col('Vol_Sold_gal').sum())
 .filter(pl.col('Cum_Pct').ge(.8))
 .select(['Category','Cum_Pct'])
 .collect()
 )
shape: (5, 2)
Category Cum_Pct
str f64
"COCKTAILS/RTD" 0.812933
"100% AGAVE TEQUILA" 0.866371
"IMPORTED VODKAS" 0.917989
"AMERICAN FLAVORED VODKA" 0.960417
"BLENDED WHISKIES" 1.0

pandas-accented

The idiomatic way

(df
 .lazy()
 .group_by('Category')
 .agg(pl.sum('Vol_Sold_gal'))
 .top_k(10, by='Vol_Sold_gal')
 .sort('Vol_Sold_gal', descending=True)
 .with_columns(Cum_Sum=pl.col('Vol_Sold_gal').cum_sum())
 .with_columns(Cum_Pct=pl.col('Cum_Sum') / pl.col('Vol_Sold_gal').sum())
 .filter(pl.col('Cum_Pct').ge(.8))
 .select('Category','Cum_Pct')
 .collect()
 )
shape: (5, 2)
Category Cum_Pct
str f64
"COCKTAILS/RTD" 0.812933
"100% AGAVE TEQUILA" 0.866371
"IMPORTED VODKAS" 0.917989
"AMERICAN FLAVORED VODKA" 0.960417
"BLENDED WHISKIES" 1.0

Questions


let’s connect

talk slides