Speed up analysis with lazyframes in Polars
scan_csv
100DaysOfPolars
Datasets are getting bigger, so loading an entire dataset into memory can strain your computer. Fortunately, Polars lets you scan data into a lazyframe and load only the result into memory as a dataframe when you ask for it.
Say we have this CSV file:
csv_file = 'https://raw.githubusercontent.com/jorammutenge/learn-rust/refs/heads/main/sample_sales.csv'
Create a lazyframe by scanning a CSV file
To read a CSV file as a lazyframe, you use the Polars function scan_csv, like this:
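import polars as pl
# Build the query lazily; no data is read until collect() is called
(pl.scan_csv(csv_file)
.select('Account Name', 'ext price')  # keep only the two columns we need
.group_by('Account Name')
.agg(pl.mean('ext price'))  # average ext price per account
.collect()  # run the query and return a dataframe
)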
shape: (718, 2)
| Account Name | ext price |
|---|---|
| str | f64 |
| "Kuphal, Flatley and Casper" | 929.46 |
| "Feil LLC" | 1383.9 |
| "Thiel-Volkman" | 279.52 |
| "McDermott, Gerlach and Bechtel… | 187.68 |
| "VonRueden, Wiza and Balistreri" | 338.52 |
| … | … |
| "O'Conner Inc" | 313.2 |
| "Herzog-Homenick" | 1051.2 |
| "Wilderman Group" | 1155.64 |
| "O'Kon, Braun and Corkery" | 290.2 |
| "Corwin, Nienow and Reichert" | 393.24 |
In the lazyframe, I selected the two columns I wanted and computed the average ext price for each customer. Calling collect() then ran the query and loaded the result into memory as a dataframe.
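Computations on a lazyframe can be faster because Polars sees the whole query before running it. It can optimize the plan, for example by reading only the columns you select, and no data is loaded into memory until you collect the results.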