Don’t Write Polars Code with a Pandas Accent

Why listen to me?

MSc Information Management from University of Illinois Urbana-Champaign
Used Polars in production since 2023
Author of new Polars book Buy the book!

Outline

Five pandas-accented mistakes
Show alternative idiomatic code
Takeaways
Questions

What is polars?

Yes, it’s a bear
Open-source library
For data manipulation
Like pandas, but with typed API that’s expression-based

Pandas-accented mistakes

1. Using square brackets for column selection

import polars as pl

df = pl.read_parquet('../../datasets/liquor_sales_iowa.parquet')

(df
 [['Date','Store','Pack']]
 )

shape: (1_833_973, 3)

Date	Store	Pack
date	str	i8
2025-05-20	"CASEY'S GENERAL STORE #3055 / …	24
2025-05-20	"FAREWAY STORES #561 / WAVERLY"	12
2025-05-19	"7STAR LIQUOR & TOBACCO OUTLET"	12
2025-05-16	"RANDY'S NEIGHBORHOOD MARKET"	12
2025-05-19	"ALLCOOL LIQUORS / WDM"	12
…	…	…
2025-05-19	"A&S FOOD AND GAS / DAVENPORT"	12
2025-05-19	"SOUTHSIDE TOBACCO & LIQUOR"	3
2025-05-20	"ALI'S CORNER / WATERLOO"	12
2025-05-16	"ANOTHER ROUND / DEWITT"	6
2025-05-16	"QUIK TRIP #531 / GRIMES"	24

The idiomatic way

(df
 .select(pl.col('Date'), pl.col('Store'), pl.col('Pack'))
 .select('Date','Store','Pack')
 )

shape: (1_833_973, 3)

Date	Store	Pack
date	str	i8
2025-05-20	"CASEY'S GENERAL STORE #3055 / …	24
2025-05-20	"FAREWAY STORES #561 / WAVERLY"	12
2025-05-19	"7STAR LIQUOR & TOBACCO OUTLET"	12
2025-05-16	"RANDY'S NEIGHBORHOOD MARKET"	12
2025-05-19	"ALLCOOL LIQUORS / WDM"	12
…	…	…
2025-05-19	"A&S FOOD AND GAS / DAVENPORT"	12
2025-05-19	"SOUTHSIDE TOBACCO & LIQUOR"	3
2025-05-20	"ALI'S CORNER / WATERLOO"	12
2025-05-16	"ANOTHER ROUND / DEWITT"	6
2025-05-16	"QUIK TRIP #531 / GRIMES"	24

Select column as expression when modifying

(df
 .select(pl.col('Date').dt.strftime('%b').alias('Month'),
         'Store','Category')
 )

shape: (1_833_973, 3)

Month	Store	Category
str	str	str
"Mar"	"HY-VEE FOOD STORE #2 / STATE A…	"COCKTAILS/RTD"
"Mar"	"HY-VEE WINE AND SPIRITS / PELL…	"STRAIGHT BOURBON WHISKIES"
"Feb"	"J D SPIRITS LIQUOR"	"IMPORTED DRY GINS"
"Jun"	"CASEY'S GENERAL STORE #2544 / …	"STRAIGHT BOURBON WHISKIES"
"Feb"	"HY-VEE #3 FOOD & DRUGSTORE / D…	"NEUTRAL GRAIN SPIRITS FLAVORED"
…	…	…
"Sep"	"CENTRAL CITY 2"	"AMERICAN CORDIALS & LIQUEURS"
"Apr"	"HY-VEE WINE AND SPIRITS #1 (12…	"100% AGAVE TEQUILA"
"Sep"	"GOLDEN MART"	"AMERICAN VODKAS"
"Jul"	"IOWA LIQUOR & TOBACCO"	"STRAIGHT BOURBON WHISKIES"
"Jan"	"HY-VEE FOOD STORE / CORALVILLE"	"TEMPORARY & SPECIALTY PACKAGES"

Pandas-accented mistakes

2. Multiple intermediate variables

Q: Which category generates the most revenue, and does that ranking change when you look at volume (gallons) instead?

Pandas-accented mistakes

2. Multiple intermediate variables

Variable 1

# Revenue ranking (Bottles_Sold × Retail_Price)
revenue_ranking = (df.with_columns((pl.col('Bottles_Sold') * pl.col('Retail_Price')).alias('Total_Revenue'))
 .group_by('Category')
 .agg(pl.col('Total_Revenue').sum())
 .sort('Total_Revenue', descending=True)
 )
revenue_ranking

shape: (44, 2)

Category	Total_Revenue
str	f64
"AMERICAN VODKAS"	4.7688e7
"CANADIAN WHISKIES"	3.2473e7
"STRAIGHT BOURBON WHISKIES"	2.8392e7
"100% AGAVE TEQUILA"	2.5294e7
"WHISKEY LIQUEUR"	2.0025e7
…	…
"TRIPLE SEC"	610540.07
"MEZCAL"	347583.57
"NEUTRAL GRAIN SPIRITS"	327735.29
"CORN WHISKIES"	82421.91
"AMERICAN SLOE GINS"	15090.18

Variable 2

# Volume ranking (sum of Vol_Sold_gal)
volume_ranking = (df.group_by('Category')
 .agg(pl.col('Vol_Sold_gal').sum().alias('Total_Vol_gal'))
 .sort('Total_Vol_gal', descending=True)
)
volume_ranking

shape: (44, 2)

Category	Total_Vol_gal
str	f64
"AMERICAN VODKAS"	1.0737e6
"CANADIAN WHISKIES"	498621.63
"STRAIGHT BOURBON WHISKIES"	254581.81
"SPICED RUM"	234828.75
"WHISKEY LIQUEUR"	197825.15
…	…
"AMERICAN DISTILLED SPIRITS SPE…	6360.33
"NEUTRAL GRAIN SPIRITS"	4683.66
"MEZCAL"	2261.04
"CORN WHISKIES"	695.88
"AMERICAN SLOE GINS"	465.34

Variable 3

# Join both rankings together and compare
combined = (revenue_ranking
 .with_row_index('Revenue_Rank', offset=1)
 .join(volume_ranking.with_row_index('Volume_Rank', offset=1), on='Category')
 .with_columns(Rank_Change=(pl.col('Volume_Rank').cast(pl.Int32) - pl.col('Revenue_Rank').cast(pl.Int32)))
 .sort('Revenue_Rank')
 )
combined

shape: (44, 6)

Revenue_Rank	Category	Total_Revenue	Volume_Rank	Total_Vol_gal	Rank_Change
u32	str	f64	u32	f64	i32
1	"AMERICAN VODKAS"	4.7688e7	1	1.0737e6	0
2	"CANADIAN WHISKIES"	3.2473e7	2	498621.63	0
3	"STRAIGHT BOURBON WHISKIES"	2.8392e7	3	254581.81	0
4	"100% AGAVE TEQUILA"	2.5294e7	7	161508.45	3
5	"WHISKEY LIQUEUR"	2.0025e7	5	197825.15	0
…	…	…	…	…	…
40	"TRIPLE SEC"	610540.07	24	38478.46	-16
41	"MEZCAL"	347583.57	42	2261.04	1
42	"NEUTRAL GRAIN SPIRITS"	327735.29	41	4683.66	-1
43	"CORN WHISKIES"	82421.91	43	695.88	0
44	"AMERICAN SLOE GINS"	15090.18	44	465.34	0

Idiomatic - no intermediate variables

(df
 .with_columns(Revenue=pl.col('Bottles_Sold').mul('Retail_Price'))
 .group_by('Category')
 .agg(pl.col('Revenue').sum(), pl.col('Vol_Sold_gal').sum())
 .with_columns(Revenue_Rank=pl.col('Revenue').rank(method='ordinal', descending=True),
               Volume_Rank=pl.col('Vol_Sold_gal').rank(method='ordinal', descending=True))
 .with_columns(Rank_Change=(pl.col('Volume_Rank').cast(pl.Int32) - pl.col('Revenue_Rank').cast(pl.Int32)))
 .sort('Revenue_Rank')
)

shape: (44, 6)

Category	Revenue	Vol_Sold_gal	Revenue_Rank	Volume_Rank	Rank_Change
str	f64	f64	u32	u32	i32
"AMERICAN VODKAS"	4.7688e7	1.0737e6	1	1	0
"CANADIAN WHISKIES"	3.2473e7	498621.63	2	2	0
"STRAIGHT BOURBON WHISKIES"	2.8392e7	254581.81	3	3	0
"100% AGAVE TEQUILA"	2.5294e7	161508.45	4	7	3
"WHISKEY LIQUEUR"	2.0025e7	197825.15	5	5	0
…	…	…	…	…	…
"TRIPLE SEC"	610540.07	38478.46	40	24	-16
"MEZCAL"	347583.57	2261.04	41	42	1
"NEUTRAL GRAIN SPIRITS"	327735.29	4683.66	42	41	-1
"CORN WHISKIES"	82421.91	695.88	43	43	0
"AMERICAN SLOE GINS"	15090.18	465.34	44	44	0

Your teenage daughter (a metaphor)

Performance comparison

With intermediate variables

32.8 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Idiomatic chained operation

17.4 ms ± 230 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pandas-accented mistakes

3. Overuse of eager execution

Eager mode - operations are executed immediately.
Lazy mode - operations are collected into a LazyFrame to build a query plan before being executed.
- Optimization techniques include:
  - predicate pushdown (filtering rows early)
  - projection pushdown (selecting columns early)

Pandas-accented mistakes

3. Overuse of eager execution

Q: Show categories with a change in ranking greater than 10

shape: (44, 6)

Category	Revenue	Vol_Sold_gal	Revenue_Rank	Volume_Rank	Rank_Change
str	f64	f64	u32	u32	i32
"AMERICAN VODKAS"	4.7688e7	1.0737e6	1	1	0
"CANADIAN WHISKIES"	3.2473e7	498621.63	2	2	0
"STRAIGHT BOURBON WHISKIES"	2.8392e7	254581.81	3	3	0
"100% AGAVE TEQUILA"	2.5294e7	161508.45	4	7	3
"WHISKEY LIQUEUR"	2.0025e7	197825.15	5	5	0
…	…	…	…	…	…
"TRIPLE SEC"	610540.07	38478.46	40	24	-16
"MEZCAL"	347583.57	2261.04	41	42	1
"NEUTRAL GRAIN SPIRITS"	327735.29	4683.66	42	41	-1
"CORN WHISKIES"	82421.91	695.88	43	43	0
"AMERICAN SLOE GINS"	15090.18	465.34	44	44	0

(df
 .lazy()
 .with_columns(Revenue=pl.col('Bottles_Sold').mul('Retail_Price'))
 .group_by('Category')
 .agg(pl.col('Revenue').sum(), pl.col('Vol_Sold_gal').sum())
 .with_columns(Revenue_Rank=pl.col('Revenue').rank(method='ordinal', descending=True),
               Volume_Rank=pl.col('Vol_Sold_gal').rank(method='ordinal', descending=True))
 .with_columns(Rank_Change=(pl.col('Volume_Rank').cast(pl.Int32) - pl.col('Revenue_Rank').cast(pl.Int32)))
 .sort('Revenue_Rank')
 .filter(pl.col('Rank_Change').abs() > 10)
 .collect()
)

shape: (3, 6)

Category	Revenue	Vol_Sold_gal	Revenue_Rank	Volume_Rank	Rank_Change
str	f64	f64	u32	u32	i32
"TEMPORARY & SPECIALTY PACKAGES"	8.5197e6	54637.97	9	20	11
"IMPORTED DISTILLED SPIRITS SPE…	2.1564e6	67022.9	30	17	-13
"TRIPLE SEC"	610540.07	38478.46	40	24	-16

Performance comparison

Eager mode

17.9 ms ± 490 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Lazy mode

16.8 ms ± 370 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The lazy mode query plan

Read from bottom to top!

Pandas-accented mistakes

4. Adding across columns

shape: (9, 7)

Store	Month	Mon	Tue	Wed	Thu	Fri
str	str	f64	f64	f64	f64	f64
"CASEY'S GENERAL STORE"	"Jan"	421779.11	496222.83	437798.84	470631.42	747257.54
"CASEY'S GENERAL STORE"	"Feb"	405312.16	498864.1	493118.19	275825.19	644216.17
"CASEY'S GENERAL STORE"	"Mar"	413580.02	581096.63	413073.32	361039.54	761948.53
"CASEY'S GENERAL STORE"	"Apr"	343738.39	828913.47	630487.31	290815.02	723226.92
"CASEY'S GENERAL STORE"	"May"	284811.77	661586.62	596480.34	441896.98	1005082.4
"CASEY'S GENERAL STORE"	"Jun"	568432.08	642478.37	543539.05	388688.94	886570.59
"CASEY'S GENERAL STORE"	"Jul"	368161.5	685091.68	598812.68	502196.72	578484.54
"CASEY'S GENERAL STORE"	"Aug"	477613.7	541931.52	528699.39	345751.33	979360.23
"CASEY'S GENERAL STORE"	"Sep"	428196.8	795148.85	610164.28	327594.13	796973.5

(casey_revenue
 .with_columns(Total=pl.col('Mon') + pl.col('Tue') + pl.col('Wed') + pl.col('Thu') + pl.col('Fri'),
               _Total=pl.sum_horizontal(pl.nth(2,3,4,5,6)),
               __Total=pl.sum_horizontal(pl.nth(range(2,7))),
               ___Total=pl.sum_horizontal(pl.all().exclude('Store','Month')),
               )
 .select('Store', 'Month', 'Total', '_Total', '__Total', '___Total')
 )

shape: (9, 6)

Store	Month	Total	_Total	__Total	___Total
str	str	f64	f64	f64	f64
"CASEY'S GENERAL STORE"	"Jan"	2.5737e6	2.5737e6	2.5737e6	2.5737e6
"CASEY'S GENERAL STORE"	"Feb"	2.3173e6	2.3173e6	2.3173e6	2.3173e6
"CASEY'S GENERAL STORE"	"Mar"	2.5307e6	2.5307e6	2.5307e6	2.5307e6
"CASEY'S GENERAL STORE"	"Apr"	2.8172e6	2.8172e6	2.8172e6	2.8172e6
"CASEY'S GENERAL STORE"	"May"	2.9899e6	2.9899e6	2.9899e6	2.9899e6
"CASEY'S GENERAL STORE"	"Jun"	3.0297e6	3.0297e6	3.0297e6	3.0297e6
"CASEY'S GENERAL STORE"	"Jul"	2.7327e6	2.7327e6	2.7327e6	2.7327e6
"CASEY'S GENERAL STORE"	"Aug"	2.8734e6	2.8734e6	2.8734e6	2.8734e6
"CASEY'S GENERAL STORE"	"Sep"	2.9581e6	2.9581e6	2.9581e6	2.9581e6

Even more idiomatic

import polars.selectors as cs

(casey_revenue
 .with_columns(_Total=pl.sum_horizontal(cs.float()))
 .select('Store','Month', cs.starts_with('_'))
 )

shape: (9, 3)

Store	Month	_Total
str	str	f64
"CASEY'S GENERAL STORE"	"Jan"	2.5737e6
"CASEY'S GENERAL STORE"	"Feb"	2.3173e6
"CASEY'S GENERAL STORE"	"Mar"	2.5307e6
"CASEY'S GENERAL STORE"	"Apr"	2.8172e6
"CASEY'S GENERAL STORE"	"May"	2.9899e6
"CASEY'S GENERAL STORE"	"Jun"	3.0297e6
"CASEY'S GENERAL STORE"	"Jul"	2.7327e6
"CASEY'S GENERAL STORE"	"Aug"	2.8734e6
"CASEY'S GENERAL STORE"	"Sep"	2.9581e6

Pandas-accented mistakes

5. Don’t use drop

shape: (9, 7)

Store	Month	Mon	Tue	Wed	Thu	Fri
str	str	f64	f64	f64	f64	f64
"CASEY'S GENERAL STORE"	"Jan"	421779.11	496222.83	437798.84	470631.42	747257.54
"CASEY'S GENERAL STORE"	"Feb"	405312.16	498864.1	493118.19	275825.19	644216.17
"CASEY'S GENERAL STORE"	"Mar"	413580.02	581096.63	413073.32	361039.54	761948.53
"CASEY'S GENERAL STORE"	"Apr"	343738.39	828913.47	630487.31	290815.02	723226.92
"CASEY'S GENERAL STORE"	"May"	284811.77	661586.62	596480.34	441896.98	1005082.4
"CASEY'S GENERAL STORE"	"Jun"	568432.08	642478.37	543539.05	388688.94	886570.59
"CASEY'S GENERAL STORE"	"Jul"	368161.5	685091.68	598812.68	502196.72	578484.54
"CASEY'S GENERAL STORE"	"Aug"	477613.7	541931.52	528699.39	345751.33	979360.23
"CASEY'S GENERAL STORE"	"Sep"	428196.8	795148.85	610164.28	327594.13	796973.5

Remove the column “Store”

(casey_revenue
 .drop('Store')
 )

shape: (9, 6)

Month	Mon	Tue	Wed	Thu	Fri
str	f64	f64	f64	f64	f64
"Jan"	421779.11	496222.83	437798.84	470631.42	747257.54
"Feb"	405312.16	498864.1	493118.19	275825.19	644216.17
"Mar"	413580.02	581096.63	413073.32	361039.54	761948.53
"Apr"	343738.39	828913.47	630487.31	290815.02	723226.92
"May"	284811.77	661586.62	596480.34	441896.98	1005082.4
"Jun"	568432.08	642478.37	543539.05	388688.94	886570.59
"Jul"	368161.5	685091.68	598812.68	502196.72	578484.54
"Aug"	477613.7	541931.52	528699.39	345751.33	979360.23
"Sep"	428196.8	795148.85	610164.28	327594.13	796973.5

Dropping non-existent column fails

(casey_revenue
 .drop('Joram')
 )

ColumnNotFoundError: "Joram" not found

Use exclude

(casey_revenue
 .select(pl.exclude('Store'))
 .select(pl.exclude('Joram'))
 )

shape: (9, 6)

Month	Mon	Tue	Wed	Thu	Fri
str	f64	f64	f64	f64	f64
"Jan"	421779.11	496222.83	437798.84	470631.42	747257.54
"Feb"	405312.16	498864.1	493118.19	275825.19	644216.17
"Mar"	413580.02	581096.63	413073.32	361039.54	761948.53
"Apr"	343738.39	828913.47	630487.31	290815.02	723226.92
"May"	284811.77	661586.62	596480.34	441896.98	1005082.4
"Jun"	568432.08	642478.37	543539.05	388688.94	886570.59
"Jul"	368161.5	685091.68	598812.68	502196.72	578484.54
"Aug"	477613.7	541931.52	528699.39	345751.33	979360.23
"Sep"	428196.8	795148.85	610164.28	327594.13	796973.5

Takeaways

Think in Polars first, not pandas
Leverage the query optimizer with lazy mode
Avoid intermediate variables if you can
Avoid ~~drop~~ in favor of exclude

“The only difference(!) between Shakespeare and you was the size of his idiom list - not the size of his vocabulary.”
– Alan Perlis

Pop quiz!

Is the code pandas-accented or idiomatic?

(df
 .filter(pl.col('Pack').mod(3) == 0)
 .shape[0]
 )

pandas-accented

The idiomatic way

(df
 .filter(pl.col('Pack').mod(3) == 0)
 .height
 )

Pop quiz!

Is the code pandas-accented or idiomatic?

(df
 ['Category']
 .value_counts()
 )

shape: (44, 2)

Category	count
str	u32
"MEZCAL"	2016
"AMERICAN SCHNAPPS"	68257
"WHISKEY LIQUEUR"	128963
"STRAIGHT BOURBON WHISKIES"	144868
"IMPORTED DISTILLED SPIRITS SPE…	8509
…	…
"AMERICAN FLAVORED VODKA"	83611
"IMPORTED FLAVORED VODKA"	19572
"SINGLE BARREL BOURBON WHISKIES"	7852
"IRISH WHISKIES"	21591
"GOLD RUM"	6791

pandas-accented

The idiomatic way

(df
 .get_column('Category')
 .value_counts()
 )

shape: (44, 2)

Category	count
str	u32
"SINGLE BARREL BOURBON WHISKIES"	7852
"AMERICAN VODKAS"	293030
"IMPORTED VODKAS"	40479
"STRAIGHT RYE WHISKIES"	13409
"AMERICAN DISTILLED SPIRITS SPE…	4481
…	…
"SINGLE MALT SCOTCH"	9560
"AMERICAN SLOE GINS"	339
"SPICED RUM"	64597
"100% AGAVE TEQUILA"	105951
"IMPORTED DISTILLED SPIRITS SPE…	8509

Pop quiz!

Is the code pandas-accented or idiomatic?

(df
 .lazy()
 .group_by('Category')
 .agg(pl.sum('Vol_Sold_gal'))
 .top_k(10, by='Vol_Sold_gal')
 .sort('Vol_Sold_gal', descending=True)
 .with_columns(Cum_Sum=pl.col('Vol_Sold_gal').cum_sum())
 .with_columns(Cum_Pct=pl.col('Cum_Sum') / pl.col('Vol_Sold_gal').sum())
 .filter(pl.col('Cum_Pct').ge(.8))
 .select(['Category','Cum_Pct'])
 .collect()
 )

shape: (5, 2)

Category	Cum_Pct
str	f64
"COCKTAILS/RTD"	0.812933
"100% AGAVE TEQUILA"	0.866371
"IMPORTED VODKAS"	0.917989
"AMERICAN FLAVORED VODKA"	0.960417
"BLENDED WHISKIES"	1.0

pandas-accented

The idiomatic way

(df
 .lazy()
 .group_by('Category')
 .agg(pl.sum('Vol_Sold_gal'))
 .top_k(10, by='Vol_Sold_gal')
 .sort('Vol_Sold_gal', descending=True)
 .with_columns(Cum_Sum=pl.col('Vol_Sold_gal').cum_sum())
 .with_columns(Cum_Pct=pl.col('Cum_Sum') / pl.col('Vol_Sold_gal').sum())
 .filter(pl.col('Cum_Pct').ge(.8))
 .select('Category','Cum_Pct')
 .collect()
 )

shape: (5, 2)

Category	Cum_Pct
str	f64
"COCKTAILS/RTD"	0.812933
"100% AGAVE TEQUILA"	0.866371
"IMPORTED VODKAS"	0.917989
"AMERICAN FLAVORED VODKA"	0.960417
"BLENDED WHISKIES"	1.0

Questions

github.com/jorammutenge

linkedin.com/in/jorammutenge

www.conterval.com/blog/