Don’t Write Polars Code with a Pandas Accent

JORAM MUTENGE

2026-05-15

Why listen to me?

  • MSc Information Management from University of Illinois Urbana-Champaign
  • Used Polars in production since 2023
  • Author of a new Polars book (buy the book!)

Outline

  • Five pandas-accented mistakes
  • Show alternative idiomatic code
  • Takeaways
  • Questions

What is Polars?

  • Yes, it’s a bear
  • Open-source library
  • For data manipulation
  • Like pandas, but with a typed, expression-based API

Pandas-accented mistakes

1. Using square brackets for column selection

import polars as pl

df = pl.read_parquet('../../datasets/liquor_sales_iowa.parquet')

(df
 [['Date','Store','Pack']]
 )
shape: (1_833_973, 3)
Date Store Pack
date str i8
2025-05-20 "CASEY'S GENERAL STORE #3055 / … 24
2025-05-20 "FAREWAY STORES #561 / WAVERLY" 12
2025-05-19 "7STAR LIQUOR & TOBACCO OUTLET" 12
2025-05-16 "RANDY'S NEIGHBORHOOD MARKET" 12
2025-05-19 "ALLCOOL LIQUORS / WDM" 12
2025-05-19 "A&S FOOD AND GAS / DAVENPORT" 12
2025-05-19 "SOUTHSIDE TOBACCO & LIQUOR" 3
2025-05-20 "ALI'S CORNER / WATERLOO" 12
2025-05-16 "ANOTHER ROUND / DEWITT" 6
2025-05-16 "QUIK TRIP #531 / GRIMES" 24

The idiomatic way

(df
 .select('Date', 'Store', 'Pack')
 # equivalent: .select(pl.col('Date'), pl.col('Store'), pl.col('Pack'))
 )
shape: (1_833_973, 3)
Date Store Pack
date str i8
2025-05-20 "CASEY'S GENERAL STORE #3055 / … 24
2025-05-20 "FAREWAY STORES #561 / WAVERLY" 12
2025-05-19 "7STAR LIQUOR & TOBACCO OUTLET" 12
2025-05-16 "RANDY'S NEIGHBORHOOD MARKET" 12
2025-05-19 "ALLCOOL LIQUORS / WDM" 12
2025-05-19 "A&S FOOD AND GAS / DAVENPORT" 12
2025-05-19 "SOUTHSIDE TOBACCO & LIQUOR" 3
2025-05-20 "ALI'S CORNER / WATERLOO" 12
2025-05-16 "ANOTHER ROUND / DEWITT" 6
2025-05-16 "QUIK TRIP #531 / GRIMES" 24

Select a column as an expression when modifying it

(df
 .select(pl.col('Date').dt.strftime('%b').alias('Month'),
         'Store','Category')
 )
shape: (1_833_973, 3)
Month Store Category
str str str
"Mar" "HY-VEE FOOD STORE #2 / STATE A… "COCKTAILS/RTD"
"Mar" "HY-VEE WINE AND SPIRITS / PELL… "STRAIGHT BOURBON WHISKIES"
"Feb" "J D SPIRITS LIQUOR" "IMPORTED DRY GINS"
"Jun" "CASEY'S GENERAL STORE #2544 / … "STRAIGHT BOURBON WHISKIES"
"Feb" "HY-VEE #3 FOOD & DRUGSTORE / D… "NEUTRAL GRAIN SPIRITS FLAVORED"
"Sep" "CENTRAL CITY 2" "AMERICAN CORDIALS & LIQUEURS"
"Apr" "HY-VEE WINE AND SPIRITS #1 (12… "100% AGAVE TEQUILA"
"Sep" "GOLDEN MART" "AMERICAN VODKAS"
"Jul" "IOWA LIQUOR & TOBACCO" "STRAIGHT BOURBON WHISKIES"
"Jan" "HY-VEE FOOD STORE / CORALVILLE" "TEMPORARY & SPECIALTY PACKAGES"

Pandas-accented mistakes

2. Multiple intermediate variables

  • Q: Which category generates the most revenue, and does that ranking change when you look at volume (gallons) instead?

Pandas-accented mistakes

2. Multiple intermediate variables

Variable 1

# Revenue ranking (Bottles_Sold × Retail_Price)
revenue_ranking = (df.with_columns((pl.col('Bottles_Sold') * pl.col('Retail_Price')).alias('Total_Revenue'))
 .group_by('Category')
 .agg(pl.col('Total_Revenue').sum())
 .sort('Total_Revenue', descending=True)
 )
revenue_ranking
shape: (44, 2)
Category Total_Revenue
str f64
"AMERICAN VODKAS" 4.7688e7
"CANADIAN WHISKIES" 3.2473e7
"STRAIGHT BOURBON WHISKIES" 2.8392e7
"100% AGAVE TEQUILA" 2.5294e7
"WHISKEY LIQUEUR" 2.0025e7
"TRIPLE SEC" 610540.07
"MEZCAL" 347583.57
"NEUTRAL GRAIN SPIRITS" 327735.29
"CORN WHISKIES" 82421.91
"AMERICAN SLOE GINS" 15090.18

Variable 2

# Volume ranking (sum of Vol_Sold_gal)
volume_ranking = (df.group_by('Category')
 .agg(pl.col('Vol_Sold_gal').sum().alias('Total_Vol_gal'))
 .sort('Total_Vol_gal', descending=True)
)
volume_ranking
shape: (44, 2)
Category Total_Vol_gal
str f64
"AMERICAN VODKAS" 1.0737e6
"CANADIAN WHISKIES" 498621.63
"STRAIGHT BOURBON WHISKIES" 254581.81
"SPICED RUM" 234828.75
"WHISKEY LIQUEUR" 197825.15
"AMERICAN DISTILLED SPIRITS SPE… 6360.33
"NEUTRAL GRAIN SPIRITS" 4683.66
"MEZCAL" 2261.04
"CORN WHISKIES" 695.88
"AMERICAN SLOE GINS" 465.34

Variable 3

# Join both rankings together and compare
combined = (revenue_ranking
 .with_row_index('Revenue_Rank', offset=1)
 .join(volume_ranking.with_row_index('Volume_Rank', offset=1), on='Category')
 .with_columns(Rank_Change=(pl.col('Volume_Rank').cast(pl.Int32) - pl.col('Revenue_Rank').cast(pl.Int32)))
 .sort('Revenue_Rank')
 )
combined
shape: (44, 6)
Revenue_Rank Category Total_Revenue Volume_Rank Total_Vol_gal Rank_Change
u32 str f64 u32 f64 i32
1 "AMERICAN VODKAS" 4.7688e7 1 1.0737e6 0
2 "CANADIAN WHISKIES" 3.2473e7 2 498621.63 0
3 "STRAIGHT BOURBON WHISKIES" 2.8392e7 3 254581.81 0
4 "100% AGAVE TEQUILA" 2.5294e7 7 161508.45 3
5 "WHISKEY LIQUEUR" 2.0025e7 5 197825.15 0
40 "TRIPLE SEC" 610540.07 24 38478.46 -16
41 "MEZCAL" 347583.57 42 2261.04 1
42 "NEUTRAL GRAIN SPIRITS" 327735.29 41 4683.66 -1
43 "CORN WHISKIES" 82421.91 43 695.88 0
44 "AMERICAN SLOE GINS" 15090.18 44 465.34 0

Idiomatic - no intermediate variables

(df
 .with_columns(Revenue=pl.col('Bottles_Sold').mul('Retail_Price'))
 .group_by('Category')
 .agg(pl.col('Revenue').sum(), pl.col('Vol_Sold_gal').sum())
 .with_columns(Revenue_Rank=pl.col('Revenue').rank(method='ordinal', descending=True),
               Volume_Rank=pl.col('Vol_Sold_gal').rank(method='ordinal', descending=True))
 .with_columns(Rank_Change=(pl.col('Volume_Rank').cast(pl.Int32) - pl.col('Revenue_Rank').cast(pl.Int32)))
 .sort('Revenue_Rank')
)
shape: (44, 6)
Category Revenue Vol_Sold_gal Revenue_Rank Volume_Rank Rank_Change
str f64 f64 u32 u32 i32
"AMERICAN VODKAS" 4.7688e7 1.0737e6 1 1 0
"CANADIAN WHISKIES" 3.2473e7 498621.63 2 2 0
"STRAIGHT BOURBON WHISKIES" 2.8392e7 254581.81 3 3 0
"100% AGAVE TEQUILA" 2.5294e7 161508.45 4 7 3
"WHISKEY LIQUEUR" 2.0025e7 197825.15 5 5 0
"TRIPLE SEC" 610540.07 38478.46 40 24 -16
"MEZCAL" 347583.57 2261.04 41 42 1
"NEUTRAL GRAIN SPIRITS" 327735.29 4683.66 42 41 -1
"CORN WHISKIES" 82421.91 695.88 43 43 0
"AMERICAN SLOE GINS" 15090.18 465.34 44 44 0

Your teenage daughter (a metaphor)

Performance comparison

  • With intermediate variables
32.8 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


  • Idiomatic chained operation
17.4 ms ± 230 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pandas-accented mistakes

3. Overuse of eager execution

  • Eager mode - operations are executed immediately.
  • Lazy mode - operations are collected into a LazyFrame to build a query plan before being executed.
    • Optimization techniques include:
      • predicate pushdown (filtering rows early)
      • projection pushdown (selecting columns early)

Pandas-accented mistakes

3. Overuse of eager execution

Q: Show categories with a change in ranking greater than 10

shape: (44, 6)
Category Revenue Vol_Sold_gal Revenue_Rank Volume_Rank Rank_Change
str f64 f64 u32 u32 i32
"AMERICAN VODKAS" 4.7688e7 1.0737e6 1 1 0
"CANADIAN WHISKIES" 3.2473e7 498621.63 2 2 0
"STRAIGHT BOURBON WHISKIES" 2.8392e7 254581.81 3 3 0
"100% AGAVE TEQUILA" 2.5294e7 161508.45 4 7 3
"WHISKEY LIQUEUR" 2.0025e7 197825.15 5 5 0
"TRIPLE SEC" 610540.07 38478.46 40 24 -16
"MEZCAL" 347583.57 2261.04 41 42 1
"NEUTRAL GRAIN SPIRITS" 327735.29 4683.66 42 41 -1
"CORN WHISKIES" 82421.91 695.88 43 43 0
"AMERICAN SLOE GINS" 15090.18 465.34 44 44 0
(df
 .lazy()
 .with_columns(Revenue=pl.col('Bottles_Sold').mul('Retail_Price'))
 .group_by('Category')
 .agg(pl.col('Revenue').sum(), pl.col('Vol_Sold_gal').sum())
 .with_columns(Revenue_Rank=pl.col('Revenue').rank(method='ordinal', descending=True),
               Volume_Rank=pl.col('Vol_Sold_gal').rank(method='ordinal', descending=True))
 .with_columns(Rank_Change=(pl.col('Volume_Rank').cast(pl.Int32) - pl.col('Revenue_Rank').cast(pl.Int32)))
 .sort('Revenue_Rank')
 .filter(pl.col('Rank_Change').abs() > 10)
 .collect()
)
shape: (3, 6)
Category Revenue Vol_Sold_gal Revenue_Rank Volume_Rank Rank_Change
str f64 f64 u32 u32 i32
"TEMPORARY & SPECIALTY PACKAGES" 8.5197e6 54637.97 9 20 11
"IMPORTED DISTILLED SPIRITS SPE… 2.1564e6 67022.9 30 17 -13
"TRIPLE SEC" 610540.07 38478.46 40 24 -16

Performance comparison

  • Eager mode
17.9 ms ± 490 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


  • Lazy mode
16.8 ms ± 370 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The lazy mode query plan

Read from bottom to top!

Pandas-accented mistakes

4. Adding across columns

shape: (9, 7)
Store Month Mon Tue Wed Thu Fri
str str f64 f64 f64 f64 f64
"CASEY'S GENERAL STORE" "Jan" 421779.11 496222.83 437798.84 470631.42 747257.54
"CASEY'S GENERAL STORE" "Feb" 405312.16 498864.1 493118.19 275825.19 644216.17
"CASEY'S GENERAL STORE" "Mar" 413580.02 581096.63 413073.32 361039.54 761948.53
"CASEY'S GENERAL STORE" "Apr" 343738.39 828913.47 630487.31 290815.02 723226.92
"CASEY'S GENERAL STORE" "May" 284811.77 661586.62 596480.34 441896.98 1005082.4
"CASEY'S GENERAL STORE" "Jun" 568432.08 642478.37 543539.05 388688.94 886570.59
"CASEY'S GENERAL STORE" "Jul" 368161.5 685091.68 598812.68 502196.72 578484.54
"CASEY'S GENERAL STORE" "Aug" 477613.7 541931.52 528699.39 345751.33 979360.23
"CASEY'S GENERAL STORE" "Sep" 428196.8 795148.85 610164.28 327594.13 796973.5
(casey_revenue
 .with_columns(Total=pl.col('Mon') + pl.col('Tue') + pl.col('Wed') + pl.col('Thu') + pl.col('Fri'),
               _Total=pl.sum_horizontal(pl.nth(2,3,4,5,6)),
               __Total=pl.sum_horizontal(pl.nth(range(2,7))),
               ___Total=pl.sum_horizontal(pl.all().exclude('Store','Month')),
               )
 .select('Store', 'Month', 'Total', '_Total', '__Total', '___Total')
 )
shape: (9, 6)
Store Month Total _Total __Total ___Total
str str f64 f64 f64 f64
"CASEY'S GENERAL STORE" "Jan" 2.5737e6 2.5737e6 2.5737e6 2.5737e6
"CASEY'S GENERAL STORE" "Feb" 2.3173e6 2.3173e6 2.3173e6 2.3173e6
"CASEY'S GENERAL STORE" "Mar" 2.5307e6 2.5307e6 2.5307e6 2.5307e6
"CASEY'S GENERAL STORE" "Apr" 2.8172e6 2.8172e6 2.8172e6 2.8172e6
"CASEY'S GENERAL STORE" "May" 2.9899e6 2.9899e6 2.9899e6 2.9899e6
"CASEY'S GENERAL STORE" "Jun" 3.0297e6 3.0297e6 3.0297e6 3.0297e6
"CASEY'S GENERAL STORE" "Jul" 2.7327e6 2.7327e6 2.7327e6 2.7327e6
"CASEY'S GENERAL STORE" "Aug" 2.8734e6 2.8734e6 2.8734e6 2.8734e6
"CASEY'S GENERAL STORE" "Sep" 2.9581e6 2.9581e6 2.9581e6 2.9581e6

Even more idiomatic

import polars.selectors as cs

(casey_revenue
 .with_columns(_Total=pl.sum_horizontal(cs.float()))
 .select('Store','Month', cs.starts_with('_'))
 )
shape: (9, 3)
Store Month _Total
str str f64
"CASEY'S GENERAL STORE" "Jan" 2.5737e6
"CASEY'S GENERAL STORE" "Feb" 2.3173e6
"CASEY'S GENERAL STORE" "Mar" 2.5307e6
"CASEY'S GENERAL STORE" "Apr" 2.8172e6
"CASEY'S GENERAL STORE" "May" 2.9899e6
"CASEY'S GENERAL STORE" "Jun" 3.0297e6
"CASEY'S GENERAL STORE" "Jul" 2.7327e6
"CASEY'S GENERAL STORE" "Aug" 2.8734e6
"CASEY'S GENERAL STORE" "Sep" 2.9581e6

Pandas-accented mistakes

5. Using drop

shape: (9, 7)
Store Month Mon Tue Wed Thu Fri
str str f64 f64 f64 f64 f64
"CASEY'S GENERAL STORE" "Jan" 421779.11 496222.83 437798.84 470631.42 747257.54
"CASEY'S GENERAL STORE" "Feb" 405312.16 498864.1 493118.19 275825.19 644216.17
"CASEY'S GENERAL STORE" "Mar" 413580.02 581096.63 413073.32 361039.54 761948.53
"CASEY'S GENERAL STORE" "Apr" 343738.39 828913.47 630487.31 290815.02 723226.92
"CASEY'S GENERAL STORE" "May" 284811.77 661586.62 596480.34 441896.98 1005082.4
"CASEY'S GENERAL STORE" "Jun" 568432.08 642478.37 543539.05 388688.94 886570.59
"CASEY'S GENERAL STORE" "Jul" 368161.5 685091.68 598812.68 502196.72 578484.54
"CASEY'S GENERAL STORE" "Aug" 477613.7 541931.52 528699.39 345751.33 979360.23
"CASEY'S GENERAL STORE" "Sep" 428196.8 795148.85 610164.28 327594.13 796973.5

Remove the column “Store”

(casey_revenue
 .drop('Store')
 )
shape: (9, 6)
Month Mon Tue Wed Thu Fri
str f64 f64 f64 f64 f64
"Jan" 421779.11 496222.83 437798.84 470631.42 747257.54
"Feb" 405312.16 498864.1 493118.19 275825.19 644216.17
"Mar" 413580.02 581096.63 413073.32 361039.54 761948.53
"Apr" 343738.39 828913.47 630487.31 290815.02 723226.92
"May" 284811.77 661586.62 596480.34 441896.98 1005082.4
"Jun" 568432.08 642478.37 543539.05 388688.94 886570.59
"Jul" 368161.5 685091.68 598812.68 502196.72 578484.54
"Aug" 477613.7 541931.52 528699.39 345751.33 979360.23
"Sep" 428196.8 795148.85 610164.28 327594.13 796973.5

Dropping a non-existent column fails

(casey_revenue
 .drop('Joram')
 )
ColumnNotFoundError: "Joram" not found

Use exclude

(casey_revenue
 .select(pl.exclude('Store'))
 .select(pl.exclude('Joram'))  # excluding a missing column is a no-op
 )
shape: (9, 6)
Month Mon Tue Wed Thu Fri
str f64 f64 f64 f64 f64
"Jan" 421779.11 496222.83 437798.84 470631.42 747257.54
"Feb" 405312.16 498864.1 493118.19 275825.19 644216.17
"Mar" 413580.02 581096.63 413073.32 361039.54 761948.53
"Apr" 343738.39 828913.47 630487.31 290815.02 723226.92
"May" 284811.77 661586.62 596480.34 441896.98 1005082.4
"Jun" 568432.08 642478.37 543539.05 388688.94 886570.59
"Jul" 368161.5 685091.68 598812.68 502196.72 578484.54
"Aug" 477613.7 541931.52 528699.39 345751.33 979360.23
"Sep" 428196.8 795148.85 610164.28 327594.13 796973.5

Takeaways

  • Think in Polars first, not pandas
  • Leverage the query optimizer with lazy mode
  • Avoid intermediate variables if you can
  • Avoid drop in favor of exclude

“The only difference(!) between Shakespeare and you was the size of his idiom list - not the size of his vocabulary.”
– Alan Perlis

Pop quiz!

Is the code pandas-accented or idiomatic?

(df
 .filter(pl.col('Pack').mod(3) == 0)
 .shape[0]
 )
1685373

pandas-accented


The idiomatic way

(df
 .filter(pl.col('Pack').mod(3) == 0)
 .height
 )
1685373

Pop quiz!

Is the code pandas-accented or idiomatic?

(df
 ['Category']
 .value_counts()
 )
shape: (44, 2)
Category count
str u32
"MEZCAL" 2016
"AMERICAN SCHNAPPS" 68257
"WHISKEY LIQUEUR" 128963
"STRAIGHT BOURBON WHISKIES" 144868
"IMPORTED DISTILLED SPIRITS SPE… 8509
"AMERICAN FLAVORED VODKA" 83611
"IMPORTED FLAVORED VODKA" 19572
"SINGLE BARREL BOURBON WHISKIES" 7852
"IRISH WHISKIES" 21591
"GOLD RUM" 6791

pandas-accented

The idiomatic way

(df
 .get_column('Category')
 .value_counts()
 )
shape: (44, 2)
Category count
str u32
"SINGLE BARREL BOURBON WHISKIES" 7852
"AMERICAN VODKAS" 293030
"IMPORTED VODKAS" 40479
"STRAIGHT RYE WHISKIES" 13409
"AMERICAN DISTILLED SPIRITS SPE… 4481
"SINGLE MALT SCOTCH" 9560
"AMERICAN SLOE GINS" 339
"SPICED RUM" 64597
"100% AGAVE TEQUILA" 105951
"IMPORTED DISTILLED SPIRITS SPE… 8509

Pop quiz!

Is the code pandas-accented or idiomatic?

(df
 .lazy()
 .group_by('Category')
 .agg(pl.sum('Vol_Sold_gal'))
 .top_k(10, by='Vol_Sold_gal')
 .sort('Vol_Sold_gal', descending=True)
 .with_columns(Cum_Sum=pl.col('Vol_Sold_gal').cum_sum())
 .with_columns(Cum_Pct=pl.col('Cum_Sum') / pl.col('Vol_Sold_gal').sum())
 .filter(pl.col('Cum_Pct').ge(.8))
 .select(['Category','Cum_Pct'])
 .collect()
 )
shape: (5, 2)
Category Cum_Pct
str f64
"COCKTAILS/RTD" 0.812933
"100% AGAVE TEQUILA" 0.866371
"IMPORTED VODKAS" 0.917989
"AMERICAN FLAVORED VODKA" 0.960417
"BLENDED WHISKIES" 1.0

pandas-accented

The idiomatic way

(df
 .lazy()
 .group_by('Category')
 .agg(pl.sum('Vol_Sold_gal'))
 .top_k(10, by='Vol_Sold_gal')
 .sort('Vol_Sold_gal', descending=True)
 .with_columns(Cum_Sum=pl.col('Vol_Sold_gal').cum_sum())
 .with_columns(Cum_Pct=pl.col('Cum_Sum') / pl.col('Vol_Sold_gal').sum())
 .filter(pl.col('Cum_Pct').ge(.8))
 .select('Category','Cum_Pct')
 .collect()
 )
shape: (5, 2)
Category Cum_Pct
str f64
"COCKTAILS/RTD" 0.812933
"100% AGAVE TEQUILA" 0.866371
"IMPORTED VODKAS" 0.917989
"AMERICAN FLAVORED VODKA" 0.960417
"BLENDED WHISKIES" 1.0

Questions