Speed up analysis with lazyframes in Polars
scan_csv
100DaysOfPolars
Datasets are getting bigger, so loading an entire dataset into memory can strain your computer. Fortunately, Polars lets you scan the data into a lazyframe, which describes the query you want to run without loading the data, and only brings the result into memory as a dataframe when you ask for it.
Say we have this CSV file:
csv_file = 'https://raw.githubusercontent.com/jorammutenge/learn-rust/refs/heads/main/sample_sales.csv'
Create a lazyframe by scanning a CSV file
To read a CSV file as a lazyframe, use the Polars function scan_csv, like this:
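import polars as pl

(
    pl.scan_csv(csv_file)  # build a lazy query; no data is loaded yet
    .select('Account Name', 'ext price')  # keep only the two columns needed
    .group_by('Account Name')
    .agg(pl.mean('ext price'))  # average ext price per account
    .collect()  # run the query and return a dataframe
)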
shape: (718, 2)
| Account Name | ext price |
|---|---|
| str | f64 |
| "Beatty and Sons" | 521.46 |
| "Reilly-Leannon" | 574.135 |
| "Schimmel, Schaefer and Treutel" | 656.596667 |
| "Swift-Okuneva" | 320.295 |
| "Terry PLC" | 453.0 |
| … | … |
| "Shields-Boyer" | 933.2 |
| "Bode, Mohr and Bogan" | 438.76 |
| "Maggio Inc" | 383.48 |
| "Beatty-Dickinson" | 36.55 |
| "Tillman-Schowalter" | 430.54 |
In the lazyframe, I selected the two columns I wanted and computed the average price for each customer. The resulting lazyframe was then loaded into memory as a dataframe.
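Doing computations on a lazyframe is often faster because Polars sees the whole query before running it and can optimize it, for example by reading only the columns the query actually uses. Nothing is loaded into memory until you collect the results.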