Speed up analysis with lazyframes in Polars
scan_csv
100DaysOfPolars
Datasets are getting bigger, which means loading an entire dataset into memory can put a strain on your computer. Fortunately, Polars lets you scan the data into a lazyframe, which describes the query without reading the whole file, and only loads the result into memory as a dataframe when you ask for it.
Say we have this CSV file:
csv_file = 'https://raw.githubusercontent.com/jorammutenge/learn-rust/refs/heads/main/sample_sales.csv'
Create a lazyframe by scanning a CSV file
To read a CSV file as a lazyframe, use the Polars function scan_csv, like this:
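import polars as pl
(pl.scan_csv(csv_file)                    # build a lazyframe; no data is loaded yet
 .select('Account Name', 'ext price')     # keep only the two columns we need
 .group_by('Account Name')                # one group per customer
 .agg(pl.mean('ext price'))               # average ext price within each group
 .collect()                               # run the query and return a dataframe
)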
shape: (718, 2)
Account Name | ext price |
---|---|
str | f64 |
"Brekke PLC" | 1398.505 |
"Halvorson and Sons" | 594.81 |
"O'Conner Inc" | 313.2 |
"Miller PLC" | 293.88 |
"Conroy-Schaden" | 498.9 |
… | … |
"Batz Inc" | 1069.98 |
"Swift-Okuneva" | 320.295 |
"Medhurst and Sons" | 500.8 |
"Treutel, Muller and O'Kon" | 1513.34 |
"Nicolas-Emard" | 68.39 |
In the lazyframe, I selected the two columns I wanted and computed the average price for each customer. The resulting lazyframe was then loaded into memory as a dataframe.
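Doing computations on a lazyframe is often faster because nothing is executed or loaded into memory until you call collect. Polars sees the whole query first, so it can optimize it, for example by reading only the columns you actually use.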