Speed up analysis with lazyframes in Polars
scan_csv
100DaysOfPolars
Datasets keep getting bigger, so loading an entire dataset into memory can strain your computer. Fortunately, Polars lets you scan the data into a lazyframe first and load only the result into memory as a dataframe.
Say we have this CSV file:
csv_file = 'https://raw.githubusercontent.com/jorammutenge/learn-rust/refs/heads/main/sample_sales.csv'
Create a lazyframe by scanning a CSV file
To read a CSV file as a lazyframe, you use the Polars function scan_csv, like this:
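import polars as pl

(pl.scan_csv(csv_file)  # lazily scan the CSV; no data is read yet
 .select('Account Name', 'ext price')  # keep only the two columns we need
 .group_by('Account Name')  # one group per customer
 .agg(pl.mean('ext price'))  # average price per customer
 .collect()  # run the query and return a dataframe
)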
shape: (718, 2)
| Account Name | ext price |
|---|---|
| str | f64 |
| "Bergstrom, Medhurst and Zieme" | 198.56 |
| "Runolfsdottir, Rolfson and Pac… | 988.19 |
| "Weimann, Swift and Conroy" | 940.32 |
| "Watsica PLC" | 46.88 |
| "Mills Inc" | 570.538 |
| … | … |
| "Lemke, Kovacek and McClure" | 521.18 |
| "Jakubowski, Stark and Glover" | 126.3 |
| "Sawayn-Harris" | 450.48 |
| "Nicolas, Buckridge and Rowe" | 98.42 |
| "Hoppe PLC" | 275.76 |
In the lazyframe, I selected the two columns I wanted and computed the average price for each customer. The resulting lazyframe was then loaded into memory as a dataframe.
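Doing computations on a lazyframe is often faster because Polars sees the whole query before running it. It can optimize the plan, for example by reading only the two selected columns from the CSV, and nothing is loaded into memory until you collect the results.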