Speed up analysis with lazyframes in Polars
scan_csv
100DaysOfPolars
Datasets are getting bigger, which means loading an entire dataset into memory can put a strain on your computer. Fortunately, Polars lets you scan the data into a lazyframe, which records the query you want to run and only loads the result into memory as a dataframe when you ask for it.
Say we have this CSV file:
csv_file = 'https://raw.githubusercontent.com/jorammutenge/learn-rust/refs/heads/main/sample_sales.csv'
Create a lazyframe by scanning a CSV file
To read a CSV file as a lazyframe, you use the Polars function scan_csv, like this:
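import polars as pl
# Build the query lazily; no data is read until collect() is called
(pl.scan_csv(csv_file)
    .select('Account Name','ext price')  # keep only the two columns we need
    .group_by('Account Name')  # one group per customer
    .agg(pl.mean('ext price'))  # average price within each group
    .collect()  # run the query and return a dataframe
)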
shape: (718, 2)
| Account Name | ext price |
|---|---|
| str | f64 |
| "Fritsch-Glover" | 407.453333 |
| "Larson-Huels" | 31.0 |
| "Murray, Herzog and Treutel" | 1399.66 |
| "Sporer, Hickle and Steuber" | 149.536667 |
| "Lockman, Fisher and Considine" | 25.18 |
| … | … |
| "Wuckert-Gulgowski" | 341.52 |
| "Stoltenberg, Berge and Roberts" | 626.32 |
| "Treutel, Muller and O'Kon" | 1513.34 |
| "Strosin, Nader and Zulauf" | 624.72 |
| "Marvin, Schroeder and Herman" | 1897.53 |
In the lazyframe, I selected the two columns I wanted and computed the average price for each customer. Calling collect() then ran the query and loaded the result into memory as a dataframe.
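Computations on a lazyframe are often faster because Polars does no work until you collect the results. That gives it the chance to optimize the whole query first, for example by reading only the columns the query actually uses instead of the entire file.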