Checking dataset memory usage in Polars

It’s important to know how much memory the dataset you’re processing is consuming on your machine. Why? Because lower memory consumption leads to faster processing. Fortunately, there are ways to reduce that memory usage. In Polars, you can check memory usage with the `estimated_size` method. Below is a dataframe from a clothing store showing customer purchase details.

company | sku | category | quantity | price | ext price |
---|---|---|---|---|---|
str | str | str | i64 | f64 | f64 |
"Fritsch-Glover" | "HX-24728" | "Hat" | 1 | 98.98 | 98.98 |
"O'Conner Inc" | "LK-02338" | "Sweater" | 9 | 34.8 | 313.2 |
"Beatty and Sons" | "ZC-07383" | "Sweater" | 12 | 60.24 | 722.88 |
"Gleason, Bogisich and Franecki" | "QS-76400" | "Sweater" | 5 | 15.25 | 76.25 |
"Morissette-Heathcote" | "RU-25060" | "Sweater" | 19 | 51.83 | 984.77 |
… | … | … | … | … | … |
"Brekke and Sons" | "FT-50146" | "Sweater" | 2 | 46.48 | 92.96 |
"Lang-Wunsch" | "IC-59308" | "Socks" | 19 | 29.25 | 555.75 |
"Bogisich and Sons" | "IC-59308" | "Socks" | 18 | 54.79 | 986.22 |
"Kutch, Cormier and Harber" | "RU-25060" | "Sweater" | 15 | 62.53 | 937.95 |
"Roberts, Volkman and Batz" | "LK-02338" | "Sweater" | 11 | 86.4 | 950.4 |
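For reference, here’s a minimal setup sketch. The file name `sales.csv` is an assumption, since the post doesn’t show how the data was loaded:

```python
import polars as pl

# Hypothetical load step: the source file name is an assumption.
df = pl.read_csv('sales.csv')
```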
Check memory usage
Here’s how to check how much memory, in kilobytes (KB), this dataset is using on your machine.
```python
initial_usage = df.estimated_size('kb')
initial_usage
```
54.2451171875
To check memory usage in megabytes, replace `kb` with `mb`.
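For instance:

```python
# Same measurement in megabytes (54.245 KB / 1024 ≈ 0.053 MB).
df.estimated_size('mb')
```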
Techniques to reduce memory usage
You can reduce the memory usage of the dataframe by changing the data types of some columns. For instance, you can change numerical columns from `Int64` (which is often unnecessarily large) to a smaller type such as `Int32`, as long as the values in that column fit within the range that `Int32` can hold. But how do you determine the maximum and minimum values allowed for each data type? This is where NumPy comes into the picture.
Change data types of numerical columns
First, check the summary statistics of the numerical columns to see their maximum and minimum values.
```python
import polars.selectors as cs

(df
 .select(cs.numeric())
 .describe()
)
```
statistic | quantity | price | ext price |
---|---|---|---|
str | f64 | f64 | f64 |
"count" | 1000.0 | 1000.0 | 1000.0 |
"null_count" | 0.0 | 0.0 | 0.0 |
"mean" | 10.565 | 54.06643 | 570.17994 |
"std" | 5.887311 | 26.068011 | 443.949007 |
"min" | 1.0 | 10.01 | 11.13 |
"25%" | 5.0 | 31.25 | 204.0 |
"50%" | 11.0 | 53.27 | 456.64 |
"75%" | 16.0 | 75.1 | 848.55 |
"max" | 20.0 | 100.0 | 1958.6 |
Now, let’s see the minimum and maximum values accepted by the `Int64` and `Int8` data types.
```python
import numpy as np

print(f'{np.iinfo(np.int64)}\n\n{np.iinfo(np.int8)}')
```
```
Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------

Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------
```
Starting with the quantity column, you’ll see that its maximum value (20) is well within the bounds of the `Int8` data type. Therefore, we can convert the quantity column from `Int64` to `Int8`.
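If you want to guard against out-of-range values before committing to a cast, a quick sanity check along these lines works (a sketch, reusing `df` and the NumPy import from above):

```python
# Verify that quantity's observed range fits inside Int8 before casting.
bounds = np.iinfo(np.int8)
q_min, q_max = df['quantity'].min(), df['quantity'].max()
assert bounds.min <= q_min and q_max <= bounds.max, 'quantity does not fit in Int8'
```

Note that `cast` is strict by default in Polars, so an out-of-range value would raise an error rather than overflow silently.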
Let’s apply the same technique to columns with floating-point data types. Again, we’ll use NumPy to determine the appropriate bounds for these data types.
```python
print(f'{np.finfo(np.float64)}\n\n{np.finfo(np.float32)}')
```
```
Machine parameters for float64
---------------------------------------------------------------
precision =  15   resolution = 1.0000000000000001e-15
machep =    -52   eps =        2.2204460492503131e-16
negep =     -53   epsneg =     1.1102230246251565e-16
minexp =  -1022   tiny =       2.2250738585072014e-308
maxexp =   1024   max =        1.7976931348623157e+308
nexp =       11   min =        -max
smallest_normal = 2.2250738585072014e-308   smallest_subnormal = 4.9406564584124654e-324
---------------------------------------------------------------

Machine parameters for float32
---------------------------------------------------------------
precision =   6   resolution = 1.0000000e-06
machep =    -23   eps =        1.1920929e-07
negep =     -24   epsneg =     5.9604645e-08
minexp =   -126   tiny =       1.1754944e-38
maxexp =    128   max =        3.4028235e+38
nexp =        8   min =        -max
smallest_normal = 1.1754944e-38   smallest_subnormal = 1.4012985e-45
---------------------------------------------------------------
```
From the summary statistics table, you can see that the `Float64` data type is overkill for the price and ext price columns, so we can safely convert them to `Float32`, the smallest float type Polars supports. One caveat: `Float32` holds only about seven significant decimal digits, which is why a value like 98.98 displays as 98.980003 after the cast.
```python
(df
 .with_columns(pl.col('quantity').cast(pl.Int8),
               pl.col(pl.Float64).cast(pl.Float32))
)
```
company | sku | category | quantity | price | ext price |
---|---|---|---|---|---|
str | str | str | i8 | f32 | f32 |
"Fritsch-Glover" | "HX-24728" | "Hat" | 1 | 98.980003 | 98.980003 |
"O'Conner Inc" | "LK-02338" | "Sweater" | 9 | 34.799999 | 313.200012 |
"Beatty and Sons" | "ZC-07383" | "Sweater" | 12 | 60.240002 | 722.880005 |
"Gleason, Bogisich and Franecki" | "QS-76400" | "Sweater" | 5 | 15.25 | 76.25 |
"Morissette-Heathcote" | "RU-25060" | "Sweater" | 19 | 51.830002 | 984.77002 |
… | … | … | … | … | … |
"Brekke and Sons" | "FT-50146" | "Sweater" | 2 | 46.48 | 92.959999 |
"Lang-Wunsch" | "IC-59308" | "Socks" | 19 | 29.25 | 555.75 |
"Bogisich and Sons" | "IC-59308" | "Socks" | 18 | 54.790001 | 986.219971 |
"Kutch, Cormier and Harber" | "RU-25060" | "Sweater" | 15 | 62.529999 | 937.950012 |
"Roberts, Volkman and Batz" | "LK-02338" | "Sweater" | 11 | 86.400002 | 950.400024 |
Change data types of text columns
For text columns, you can reduce memory usage by converting them to categorical data types. I recommend applying this conversion only to columns with low cardinality (a small number of unique values).
Let’s check the number of unique values in each of the text columns.
```python
(df
 .select(~cs.numeric())
 .select(pl.all().n_unique())
)
```
company | sku | category |
---|---|---|
u32 | u32 | u32 |
718 | 10 | 3 |
We can convert the sku and category columns to categorical because they have low cardinality.
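To get a feel for what this buys you on its own, here’s a quick sketch comparing the estimated size of just those two columns before and after the cast (it only uses methods already shown above):

```python
# Estimated size of the two low-cardinality columns, as strings vs. categoricals.
as_str = df.select('sku', 'category').estimated_size('kb')
as_cat = df.select(pl.col('sku', 'category').cast(pl.Categorical)).estimated_size('kb')
print(as_str, as_cat)
```

Combining this cast with the numeric ones from earlier gives the full conversion: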
```python
(df
 .with_columns(pl.col('quantity').cast(pl.Int8),
               pl.col(pl.Float64).cast(pl.Float32),
               pl.col('sku', 'category').cast(pl.Categorical))
)
```
company | sku | category | quantity | price | ext price |
---|---|---|---|---|---|
str | cat | cat | i8 | f32 | f32 |
"Fritsch-Glover" | "HX-24728" | "Hat" | 1 | 98.980003 | 98.980003 |
"O'Conner Inc" | "LK-02338" | "Sweater" | 9 | 34.799999 | 313.200012 |
"Beatty and Sons" | "ZC-07383" | "Sweater" | 12 | 60.240002 | 722.880005 |
"Gleason, Bogisich and Franecki" | "QS-76400" | "Sweater" | 5 | 15.25 | 76.25 |
"Morissette-Heathcote" | "RU-25060" | "Sweater" | 19 | 51.830002 | 984.77002 |
… | … | … | … | … | … |
"Brekke and Sons" | "FT-50146" | "Sweater" | 2 | 46.48 | 92.959999 |
"Lang-Wunsch" | "IC-59308" | "Socks" | 19 | 29.25 | 555.75 |
"Bogisich and Sons" | "IC-59308" | "Socks" | 18 | 54.790001 | 986.219971 |
"Kutch, Cormier and Harber" | "RU-25060" | "Sweater" | 15 | 62.529999 | 937.950012 |
"Roberts, Volkman and Batz" | "LK-02338" | "Sweater" | 11 | 86.400002 | 950.400024 |
Memory usage after
Let’s now check the memory usage of the dataset after making these data type changes.
```python
new_usage = (df
 .with_columns(pl.col('quantity').cast(pl.Int8),
               pl.col(pl.Float64).cast(pl.Float32),
               pl.col('sku', 'category').cast(pl.Categorical))
 .estimated_size('kb')
)
new_usage
```
34.125
Now we have the same dataset using less memory. In fact, that’s a 37.09 percent reduction.
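As a quick check, the reduction figure falls straight out of the two measurements (assuming the `initial_usage` and `new_usage` variables from above):

```python
reduction = (initial_usage - new_usage) / initial_usage * 100
print(f'{reduction:.2f}% reduction')  # 37.09% reduction
```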
My Polars course is now open for enrollment, if you want to learn more!