Don’t Write Polars Code with a Pandas Accent

JORAM MUTENGE

2026-05-15

Why listen to me?

  • MSc Information Management from University of Illinois Urbana-Champaign
  • Used Polars in production since 2023
  • Author of new Polars book Buy the book!

Outline

  • Five pandas-accented mistakes
  • Show alternative idiomatic code
  • Recap
  • Questions

What is polars?

  • Yes, it’s a bear
  • Open-source library
  • For data manipulation
  • Like pandas, but with typed API that’s expression-based

Pandas-accented mistakes

1. Choosing columns with square brackets

import polars as pl

df = pl.read_parquet('../../datasets/liquor_sales_iowa.parquet')

(df
 [['Date','Store','Pack']]
 )
shape: (1_833_973, 3)
Date Store Pack
date str i8
2025-05-20 "CASEY'S GENERAL STORE #3055 / … 24
2025-05-20 "FAREWAY STORES #561 / WAVERLY" 12
2025-05-19 "7STAR LIQUOR & TOBACCO OUTLET" 12
2025-05-16 "RANDY'S NEIGHBORHOOD MARKET" 12
2025-05-19 "ALLCOOL LIQUORS / WDM" 12
2025-05-19 "A&S FOOD AND GAS / DAVENPORT" 12
2025-05-19 "SOUTHSIDE TOBACCO & LIQUOR" 3
2025-05-20 "ALI'S CORNER / WATERLOO" 12
2025-05-16 "ANOTHER ROUND / DEWITT" 6
2025-05-16 "QUIK TRIP #531 / GRIMES" 24

The idiomatic way

(df
 .select(pl.col('Date'), pl.col('Store'), pl.col('Pack'))
 .select('Date','Store','Pack')
 )
shape: (1_833_973, 3)
Date Store Pack
date str i8
2025-05-20 "CASEY'S GENERAL STORE #3055 / … 24
2025-05-20 "FAREWAY STORES #561 / WAVERLY" 12
2025-05-19 "7STAR LIQUOR & TOBACCO OUTLET" 12
2025-05-16 "RANDY'S NEIGHBORHOOD MARKET" 12
2025-05-19 "ALLCOOL LIQUORS / WDM" 12
2025-05-19 "A&S FOOD AND GAS / DAVENPORT" 12
2025-05-19 "SOUTHSIDE TOBACCO & LIQUOR" 3
2025-05-20 "ALI'S CORNER / WATERLOO" 12
2025-05-16 "ANOTHER ROUND / DEWITT" 6
2025-05-16 "QUIK TRIP #531 / GRIMES" 24

Select column as expression when modifying

(df
 .select(pl.col('Date').dt.strftime('%b').alias('Month'),
         'Store','Category')
 )
shape: (1_833_973, 3)
Month Store Category
str str str
"Jan" "HY-VEE FOOD STORE / IOWA CITY" "AMERICAN VODKAS"
"Mar" "STAR / DENISON" "WHISKEY LIQUEUR"
"Jul" "WILKIE LIQUORS" "FLAVORED RUM"
"Aug" "THE BOONEDOCKS" "SCOTCH WHISKIES"
"Sep" "HY-VEE WINE AND SPIRITS (1010)… "NEUTRAL GRAIN SPIRITS"
"Sep" "HY-VEE FOOD STORE #5 (1151) / … "COCKTAILS/RTD"
"Jul" "WAL-MART 2827 / CORALVILLE" "AMERICAN FLAVORED VODKA"
"Jan" "WAL-MART 0892 / ANKENY" "AMERICAN FLAVORED VODKA"
"Jul" "CASEY'S GENERAL STORE #1028 / … "AMERICAN VODKAS"
"Apr" "FAREWAY STORES #989 / WAUKEE" "IMPORTED DRY GINS"

Pandas-accented mistakes

2. Multiple intermediate variables

  • Q: Which category generates the most revenue, and does that ranking change when you look at volume (gallons) instead?

Pandas-accented mistakes

2. Multiple intermediate variables

Variable 1

revenue_ranking = (df
 .with_columns(Total_Revenue=pl.col('Bottles_Sold') * pl.col('Retail_Price'))
 .group_by('Category')
 .agg(pl.col('Total_Revenue').sum())
 .sort('Total_Revenue', descending=True)
 .with_row_index('Revenue_Rank', offset=1)
 )
revenue_ranking
shape: (44, 3)
Revenue_Rank Category Total_Revenue
u32 str f64
1 "AMERICAN VODKAS" 4.7688e7
2 "CANADIAN WHISKIES" 3.2473e7
3 "STRAIGHT BOURBON WHISKIES" 2.8392e7
4 "100% AGAVE TEQUILA" 2.5294e7
5 "WHISKEY LIQUEUR" 2.0025e7
40 "TRIPLE SEC" 610540.07
41 "MEZCAL" 347583.57
42 "NEUTRAL GRAIN SPIRITS" 327735.29
43 "CORN WHISKIES" 82421.91
44 "AMERICAN SLOE GINS" 15090.18

Variable 2

volume_ranking = (df
 .group_by('Category')
 .agg(pl.col('Vol_Sold_gal').sum().alias('Total_Vol_gal'))
 .sort('Total_Vol_gal', descending=True)
 .with_row_index('Volume_Rank', offset=1)
)
volume_ranking
shape: (44, 3)
Volume_Rank Category Total_Vol_gal
u32 str f64
1 "AMERICAN VODKAS" 1.0737e6
2 "CANADIAN WHISKIES" 498621.63
3 "STRAIGHT BOURBON WHISKIES" 254581.81
4 "SPICED RUM" 234828.75
5 "WHISKEY LIQUEUR" 197825.15
40 "AMERICAN DISTILLED SPIRITS SPE… 6360.33
41 "NEUTRAL GRAIN SPIRITS" 4683.66
42 "MEZCAL" 2261.04
43 "CORN WHISKIES" 695.88
44 "AMERICAN SLOE GINS" 465.34

Variable 3

# Join both rankings together and compare
combined = (revenue_ranking
 .join(volume_ranking, on='Category')
 .with_columns(pl.col('Revenue_Rank','Volume_Rank').cast(pl.Int32))
 .with_columns(Rank_Change=pl.col('Revenue_Rank') - pl.col('Volume_Rank'))
 .sort('Revenue_Rank')
 .select('Category','Revenue_Rank','Volume_Rank','Rank_Change')
 )
combined
shape: (44, 4)
Category Revenue_Rank Volume_Rank Rank_Change
str i32 i32 i32
"AMERICAN VODKAS" 1 1 0
"CANADIAN WHISKIES" 2 2 0
"STRAIGHT BOURBON WHISKIES" 3 3 0
"100% AGAVE TEQUILA" 4 7 -3
"WHISKEY LIQUEUR" 5 5 0
"TRIPLE SEC" 40 24 16
"MEZCAL" 41 42 -1
"NEUTRAL GRAIN SPIRITS" 42 41 1
"CORN WHISKIES" 43 43 0
"AMERICAN SLOE GINS" 44 44 0

Idiomatic - no intermediate variables

(df
 .with_columns(Revenue=pl.col('Bottles_Sold').mul('Retail_Price'))
 .group_by('Category')
 .agg(pl.col('Revenue','Vol_Sold_gal').sum())
 .with_columns(Revenue_Rank=pl.col('Revenue').rank(method='ordinal', descending=True),
               Volume_Rank=pl.col('Vol_Sold_gal').rank(method='ordinal', descending=True))
 .with_columns(Rank_Change=(pl.col('Revenue_Rank').cast(pl.Int32) - pl.col('Volume_Rank').cast(pl.Int32)))
 .sort('Revenue_Rank')
 .select('Category','Revenue_Rank','Volume_Rank','Rank_Change')
)
shape: (44, 4)
Category Revenue_Rank Volume_Rank Rank_Change
str u32 u32 i32
"AMERICAN VODKAS" 1 1 0
"CANADIAN WHISKIES" 2 2 0
"STRAIGHT BOURBON WHISKIES" 3 3 0
"100% AGAVE TEQUILA" 4 7 -3
"WHISKEY LIQUEUR" 5 5 0
"TRIPLE SEC" 40 24 16
"MEZCAL" 41 42 -1
"NEUTRAL GRAIN SPIRITS" 42 41 1
"CORN WHISKIES" 43 43 0
"AMERICAN SLOE GINS" 44 44 0

Your teenage daughter (a metaphor)

Performance comparison

  • With intermediate variables
33.2 ms ± 2.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


  • Idiomatic chained operation
18.1 ms ± 490 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pandas-accented mistakes

3. Overuse of eager execution

  • Eager mode - operations are executed immediately.
  • Lazy mode - operations are collected into a LazyFrame to build a query plan before being executed.
    • Optimization techniques include:
      • predicate pushdown (filtering rows early)
      • projection pushdown (selecting columns early)

Pandas-accented mistakes

3. Overuse of eager execution

Q: Show categories with a change in ranking greater than 10

shape: (44, 4)
Category Revenue_Rank Volume_Rank Rank_Change
str u32 u32 i32
"AMERICAN VODKAS" 1 1 0
"CANADIAN WHISKIES" 2 2 0
"STRAIGHT BOURBON WHISKIES" 3 3 0
"100% AGAVE TEQUILA" 4 7 3
"WHISKEY LIQUEUR" 5 5 0
"TRIPLE SEC" 40 24 -16
"MEZCAL" 41 42 1
"NEUTRAL GRAIN SPIRITS" 42 41 -1
"CORN WHISKIES" 43 43 0
"AMERICAN SLOE GINS" 44 44 0
(df
 .lazy()
 .with_columns(Revenue=pl.col('Bottles_Sold').mul('Retail_Price'))
 .group_by('Category')
 .agg(pl.col('Revenue','Vol_Sold_gal').sum())
 .with_columns(Revenue_Rank=pl.col('Revenue').rank(method='ordinal', descending=True),
               Volume_Rank=pl.col('Vol_Sold_gal').rank(method='ordinal', descending=True))
 .with_columns(Rank_Change=(pl.col('Volume_Rank').cast(pl.Int32) - pl.col('Revenue_Rank').cast(pl.Int32)))
 .sort('Revenue_Rank')
 .select('Category','Revenue_Rank','Volume_Rank','Rank_Change')
 .filter(pl.col('Rank_Change').abs() > 10)
 .collect()
)
shape: (3, 4)
Category Revenue_Rank Volume_Rank Rank_Change
str u32 u32 i32
"TEMPORARY & SPECIALTY PACKAGES" 9 20 11
"IMPORTED DISTILLED SPIRITS SPE… 30 17 -13
"TRIPLE SEC" 40 24 -16

Performance comparison

  • Eager mode
17.9 ms ± 318 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


  • Lazy mode
16.5 ms ± 592 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The lazy mode query plan

Read from bottom to top!

Pandas-accented mistakes

4. Adding across columns

shape: (9, 7)
Store Month Mon Tue Wed Thu Fri
str str f64 f64 f64 f64 f64
"CASEY'S GENERAL STORE" "Jan" 421779.11 496222.83 437798.84 470631.42 747257.54
"CASEY'S GENERAL STORE" "Feb" 405312.16 498864.1 493118.19 275825.19 644216.17
"CASEY'S GENERAL STORE" "Mar" 413580.02 581096.63 413073.32 361039.54 761948.53
"CASEY'S GENERAL STORE" "Apr" 343738.39 828913.47 630487.31 290815.02 723226.92
"CASEY'S GENERAL STORE" "May" 284811.77 661586.62 596480.34 441896.98 1005082.4
"CASEY'S GENERAL STORE" "Jun" 568432.08 642478.37 543539.05 388688.94 886570.59
"CASEY'S GENERAL STORE" "Jul" 368161.5 685091.68 598812.68 502196.72 578484.54
"CASEY'S GENERAL STORE" "Aug" 477613.7 541931.52 528699.39 345751.33 979360.23
"CASEY'S GENERAL STORE" "Sep" 428196.8 795148.85 610164.28 327594.13 796973.5
(casey_revenue
 .with_columns(Total=pl.col('Mon') + pl.col('Tue') + pl.col('Wed') + pl.col('Thu') + pl.col('Fri'),
               _Total=pl.sum_horizontal(pl.nth(2,3,4,5,6)),
               __Total=pl.sum_horizontal(pl.nth(range(2,7))),
               ___Total=pl.sum_horizontal(pl.all().exclude('Store','Month')),
               )
 .select('Store', 'Month', 'Total', '_Total', '__Total', '___Total')
 )
shape: (9, 6)
Store Month Total _Total __Total ___Total
str str f64 f64 f64 f64
"CASEY'S GENERAL STORE" "Jan" 2.5737e6 2.5737e6 2.5737e6 2.5737e6
"CASEY'S GENERAL STORE" "Feb" 2.3173e6 2.3173e6 2.3173e6 2.3173e6
"CASEY'S GENERAL STORE" "Mar" 2.5307e6 2.5307e6 2.5307e6 2.5307e6
"CASEY'S GENERAL STORE" "Apr" 2.8172e6 2.8172e6 2.8172e6 2.8172e6
"CASEY'S GENERAL STORE" "May" 2.9899e6 2.9899e6 2.9899e6 2.9899e6
"CASEY'S GENERAL STORE" "Jun" 3.0297e6 3.0297e6 3.0297e6 3.0297e6
"CASEY'S GENERAL STORE" "Jul" 2.7327e6 2.7327e6 2.7327e6 2.7327e6
"CASEY'S GENERAL STORE" "Aug" 2.8734e6 2.8734e6 2.8734e6 2.8734e6
"CASEY'S GENERAL STORE" "Sep" 2.9581e6 2.9581e6 2.9581e6 2.9581e6

Even more idiomatic

import polars.selectors as cs

(casey_revenue
 .with_columns(_Total=pl.sum_horizontal(cs.float()))
 .select('Store','Month', cs.starts_with('_'))
 )
shape: (9, 3)
Store Month _Total
str str f64
"CASEY'S GENERAL STORE" "Jan" 2.5737e6
"CASEY'S GENERAL STORE" "Feb" 2.3173e6
"CASEY'S GENERAL STORE" "Mar" 2.5307e6
"CASEY'S GENERAL STORE" "Apr" 2.8172e6
"CASEY'S GENERAL STORE" "May" 2.9899e6
"CASEY'S GENERAL STORE" "Jun" 3.0297e6
"CASEY'S GENERAL STORE" "Jul" 2.7327e6
"CASEY'S GENERAL STORE" "Aug" 2.8734e6
"CASEY'S GENERAL STORE" "Sep" 2.9581e6

Pandas-accented mistakes

5. Don’t use drop

shape: (9, 7)
Store Month Mon Tue Wed Thu Fri
str str f64 f64 f64 f64 f64
"CASEY'S GENERAL STORE" "Jan" 421779.11 496222.83 437798.84 470631.42 747257.54
"CASEY'S GENERAL STORE" "Feb" 405312.16 498864.1 493118.19 275825.19 644216.17
"CASEY'S GENERAL STORE" "Mar" 413580.02 581096.63 413073.32 361039.54 761948.53
"CASEY'S GENERAL STORE" "Apr" 343738.39 828913.47 630487.31 290815.02 723226.92
"CASEY'S GENERAL STORE" "May" 284811.77 661586.62 596480.34 441896.98 1005082.4
"CASEY'S GENERAL STORE" "Jun" 568432.08 642478.37 543539.05 388688.94 886570.59
"CASEY'S GENERAL STORE" "Jul" 368161.5 685091.68 598812.68 502196.72 578484.54
"CASEY'S GENERAL STORE" "Aug" 477613.7 541931.52 528699.39 345751.33 979360.23
"CASEY'S GENERAL STORE" "Sep" 428196.8 795148.85 610164.28 327594.13 796973.5

Remove the column “Store”

(casey_revenue
 .drop('Store')
 )
shape: (9, 6)
Month Mon Tue Wed Thu Fri
str f64 f64 f64 f64 f64
"Jan" 421779.11 496222.83 437798.84 470631.42 747257.54
"Feb" 405312.16 498864.1 493118.19 275825.19 644216.17
"Mar" 413580.02 581096.63 413073.32 361039.54 761948.53
"Apr" 343738.39 828913.47 630487.31 290815.02 723226.92
"May" 284811.77 661586.62 596480.34 441896.98 1005082.4
"Jun" 568432.08 642478.37 543539.05 388688.94 886570.59
"Jul" 368161.5 685091.68 598812.68 502196.72 578484.54
"Aug" 477613.7 541931.52 528699.39 345751.33 979360.23
"Sep" 428196.8 795148.85 610164.28 327594.13 796973.5

Dropping non-existent column fails

(casey_revenue
 .drop('Joram')
 )
ColumnNotFoundError: "Joram" not found

Use exclude

(casey_revenue
 .select(pl.exclude('Store'))
 .select(pl.exclude('Joram'))
 )
shape: (9, 6)
Month Mon Tue Wed Thu Fri
str f64 f64 f64 f64 f64
"Jan" 421779.11 496222.83 437798.84 470631.42 747257.54
"Feb" 405312.16 498864.1 493118.19 275825.19 644216.17
"Mar" 413580.02 581096.63 413073.32 361039.54 761948.53
"Apr" 343738.39 828913.47 630487.31 290815.02 723226.92
"May" 284811.77 661586.62 596480.34 441896.98 1005082.4
"Jun" 568432.08 642478.37 543539.05 388688.94 886570.59
"Jul" 368161.5 685091.68 598812.68 502196.72 578484.54
"Aug" 477613.7 541931.52 528699.39 345751.33 979360.23
"Sep" 428196.8 795148.85 610164.28 327594.13 796973.5

Recap

  • Think in Polars first, not pandas
  • Leverage the query optimizer with lazy mode
  • Avoid intermediate variables if you can
  • Avoid drop in favor of exclude

Pop quiz!

Is the code pandas-accented or idiomatic?

(df
 .filter(pl.col('Pack').mod(3) == 0)
 .shape[0]
 )
1685373

pandas-accented


The idiomatic way

(df
 .filter(pl.col('Pack').mod(3) == 0)
 .height
 )
1685373

Pop quiz!

Is the code pandas-accented or idiomatic?

(df
 ['Category']
 .value_counts()
 )
shape: (44, 2)
Category count
str u32
"IMPORTED SCHNAPPS" 22580
"MEZCAL" 2016
"IRISH WHISKIES" 21591
"STRAIGHT RYE WHISKIES" 13409
"MIXTO TEQUILA" 40405
"NEUTRAL GRAIN SPIRITS FLAVORED" 23451
"AMERICAN VODKAS" 293030
"AGED DARK RUM" 5295
"WHITE RUM" 28136
"IMPORTED FLAVORED VODKA" 19572

pandas-accented

The idiomatic way

(df
 .get_column('Category')
 .value_counts()
 )
shape: (44, 2)
Category count
str u32
"IRISH WHISKIES" 21591
"AMERICAN VODKAS" 293030
"BLENDED WHISKIES" 60516
"MEZCAL" 2016
"AMERICAN BRANDIES" 36799
"TENNESSEE WHISKIES" 53341
"BOTTLED IN BOND BOURBON" 9313
"NEUTRAL GRAIN SPIRITS FLAVORED" 23451
"IMPORTED VODKAS" 40479
"IMPORTED SCHNAPPS" 22580

Pop quiz!

Is the code pandas-accented or idiomatic?

(df
 .lazy()
 .group_by('Category')
 .agg(pl.sum('Vol_Sold_gal'))
 .top_k(10, by='Vol_Sold_gal')
 .sort('Vol_Sold_gal', descending=True)
 .with_columns(Cum_Sum=pl.col('Vol_Sold_gal').cum_sum())
 .with_columns(Cum_Pct=pl.col('Cum_Sum') / pl.col('Vol_Sold_gal').sum())
 .filter(pl.col('Cum_Pct').ge(.8))
 .select(['Category','Cum_Pct'])
 .collect()
 )
shape: (5, 2)
Category Cum_Pct
str f64
"COCKTAILS/RTD" 0.812933
"100% AGAVE TEQUILA" 0.866371
"IMPORTED VODKAS" 0.917989
"AMERICAN FLAVORED VODKA" 0.960417
"BLENDED WHISKIES" 1.0

pandas-accented

The idiomatic way

(df
 .lazy()
 .group_by('Category')
 .agg(pl.sum('Vol_Sold_gal'))
 .top_k(10, by='Vol_Sold_gal')
 .sort('Vol_Sold_gal', descending=True)
 .with_columns(Cum_Sum=pl.col('Vol_Sold_gal').cum_sum())
 .with_columns(Cum_Pct=pl.col('Cum_Sum') / pl.col('Vol_Sold_gal').sum())
 .filter(pl.col('Cum_Pct').ge(.8))
 .select('Category','Cum_Pct')
 .collect()
 )
shape: (5, 2)
Category Cum_Pct
str f64
"COCKTAILS/RTD" 0.812933
"100% AGAVE TEQUILA" 0.866371
"IMPORTED VODKAS" 0.917989
"AMERICAN FLAVORED VODKA" 0.960417
"BLENDED WHISKIES" 1.0

Questions