Slicing and dicing pandas & polars dataframes

How to select subsets of rows and columns

data analysis
Author

Joram Mutenge

Published

2026-06-08

A few years ago, I read an article on pandas’ loc method by Matt Harrison, published on the now-defunct Ponder website. I loved reading it because it showed, in depth, the many ways to use loc. I want to revisit the article and perform in Polars the operations that Matt performed in pandas. Matt also created a chart at the end of the article using pandas’ built-in charting library, which relies on Matplotlib. I’ll recreate two versions of the same chart: one using HvPlot and the other using Glyphx, a new charting library I’m slowly falling in love with.

The dataset

The dataset we’ll be analyzing contains crime data from neighborhoods in Los Angeles. I’ll read the dataset with pandas and store it in a variable called df. The Polars dataframe will be stored in a variable called data.

import pandas as pd

url = "https://data.lacity.org/api/views/2nrs-mtv8/rows.csv?accessType=DOWNLOAD"
df = pd.read_csv(url)
df.head()
DR_NO Date Rptd DATE OCC TIME OCC AREA AREA NAME Rpt Dist No Part 1-2 Crm Cd Crm Cd Desc ... Status Status Desc Crm Cd 1 Crm Cd 2 Crm Cd 3 Crm Cd 4 LOCATION Cross Street LAT LON
0 211507896 04/11/2021 12:00:00 AM 11/07/2020 12:00:00 AM 845 15 N Hollywood 1502 2 354 THEFT OF IDENTITY ... IC Invest Cont 354.0 NaN NaN NaN 7800    BEEMAN                       AV NaN 34.2124 -118.4092
1 201516622 10/21/2020 12:00:00 AM 10/18/2020 12:00:00 AM 1845 15 N Hollywood 1521 1 230 ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT ... IC Invest Cont 230.0 NaN NaN NaN ATOLL                        AV N  GAULT 34.1993 -118.4203
2 240913563 12/10/2024 12:00:00 AM 10/30/2020 12:00:00 AM 1240 9 Van Nuys 933 2 354 THEFT OF IDENTITY ... IC Invest Cont 354.0 NaN NaN NaN 14600    SYLVAN                       ST NaN 34.1847 -118.4509
3 210704711 12/24/2020 12:00:00 AM 12/24/2020 12:00:00 AM 1310 7 Wilshire 782 1 331 THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND ... ... IC Invest Cont 331.0 NaN NaN NaN 6000    COMEY                        AV NaN 34.0339 -118.3747
4 201418201 10/03/2020 12:00:00 AM 09/29/2020 12:00:00 AM 1830 14 Pacific 1454 1 420 THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER) ... IC Invest Cont 420.0 NaN NaN NaN 4700    LA VILLA MARINA NaN 33.9813 -118.4350

5 rows × 28 columns

I’ll read the dataset again with Polars.

import polars as pl

data = pl.read_csv(url)
data.head()
shape: (5, 28)
DR_NO Date Rptd DATE OCC TIME OCC AREA AREA NAME Rpt Dist No Part 1-2 Crm Cd Crm Cd Desc Mocodes Vict Age Vict Sex Vict Descent Premis Cd Premis Desc Weapon Used Cd Weapon Desc Status Status Desc Crm Cd 1 Crm Cd 2 Crm Cd 3 Crm Cd 4 LOCATION Cross Street LAT LON
i64 str str i64 i64 str i64 i64 i64 str str i64 str str i64 str i64 str str str i64 i64 str str str str f64 f64
211507896 "04/11/2021 12:00:00 AM" "11/07/2020 12:00:00 AM" 845 15 "N Hollywood" 1502 2 354 "THEFT OF IDENTITY" "0377" 31 "M" "H" 501 "SINGLE FAMILY DWELLING" null null "IC" "Invest Cont" 354 null null null "7800    BEEMAN                … null 34.2124 -118.4092
201516622 "10/21/2020 12:00:00 AM" "10/18/2020 12:00:00 AM" 1845 15 "N Hollywood" 1521 1 230 "ASSAULT WITH DEADLY WEAPON, AG… "0416 0334 2004 1822 1414 0305 … 32 "M" "H" 102 "SIDEWALK" 200 "KNIFE WITH BLADE 6INCHES OR LE… "IC" "Invest Cont" 230 null null null "ATOLL                        A… "N  GAULT" 34.1993 -118.4203
240913563 "12/10/2024 12:00:00 AM" "10/30/2020 12:00:00 AM" 1240 9 "Van Nuys" 933 2 354 "THEFT OF IDENTITY" "0377" 30 "M" "W" 501 "SINGLE FAMILY DWELLING" null null "IC" "Invest Cont" 354 null null null "14600    SYLVAN               … null 34.1847 -118.4509
210704711 "12/24/2020 12:00:00 AM" "12/24/2020 12:00:00 AM" 1310 7 "Wilshire" 782 1 331 "THEFT FROM MOTOR VEHICLE - GRA… "0344" 47 "F" "A" 101 "STREET" null null "IC" "Invest Cont" 331 null null null "6000    COMEY                 … null 34.0339 -118.3747
201418201 "10/03/2020 12:00:00 AM" "09/29/2020 12:00:00 AM" 1830 14 "Pacific" 1454 1 420 "THEFT FROM MOTOR VEHICLE - PET… "1300 0344 1606 2032" 63 "M" "H" 103 "ALLEY" null null "IC" "Invest Cont" 420 null null null "4700    LA VILLA MARINA" null 33.9813 -118.435

Selecting a row

The loc method in pandas can be used to select specific rows and columns. I’ll select the fifth row in the dataframe.

df.loc[5]
DR_NO                                                240412063
Date Rptd                               12/11/2024 12:00:00 AM
DATE OCC                                11/11/2020 12:00:00 AM
TIME OCC                                                  1210
AREA                                                         4
AREA NAME                                           Hollenbeck
Rpt Dist No                                                429
Part 1-2                                                     2
Crm Cd                                                     354
Crm Cd Desc                                  THEFT OF IDENTITY
Mocodes                                                   0100
Vict Age                                                    35
Vict Sex                                                     M
Vict Descent                                                 B
Premis Cd                                                502.0
Premis Desc       MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)
Weapon Used Cd                                             NaN
Weapon Desc                                                NaN
Status                                                      IC
Status Desc                                        Invest Cont
Crm Cd 1                                                 354.0
Crm Cd 2                                                   NaN
Crm Cd 3                                                   NaN
Crm Cd 4                                                   NaN
LOCATION               5300    CRONUS                       ST
Cross Street                                               NaN
LAT                                                     34.083
LON                                                  -118.1678
Name: 5, dtype: object

Let’s do the same with Polars. I’ll show two ways to select a specific row.

Version 1

(data
 .slice(5, 1)
 )
shape: (1, 28)
DR_NO Date Rptd DATE OCC TIME OCC AREA AREA NAME Rpt Dist No Part 1-2 Crm Cd Crm Cd Desc Mocodes Vict Age Vict Sex Vict Descent Premis Cd Premis Desc Weapon Used Cd Weapon Desc Status Status Desc Crm Cd 1 Crm Cd 2 Crm Cd 3 Crm Cd 4 LOCATION Cross Street LAT LON
i64 str str i64 i64 str i64 i64 i64 str str i64 str str i64 str i64 str str str i64 i64 str str str str f64 f64
240412063 "12/11/2024 12:00:00 AM" "11/11/2020 12:00:00 AM" 1210 4 "Hollenbeck" 429 2 354 "THEFT OF IDENTITY" "0100" 35 "M" "B" 502 "MULTI-UNIT DWELLING (APARTMENT… null null "IC" "Invest Cont" 354 null null null "5300    CRONUS                … null 34.083 -118.1678

Version 2

(data
 .gather(5)
 )
shape: (1, 28)
DR_NO Date Rptd DATE OCC TIME OCC AREA AREA NAME Rpt Dist No Part 1-2 Crm Cd Crm Cd Desc Mocodes Vict Age Vict Sex Vict Descent Premis Cd Premis Desc Weapon Used Cd Weapon Desc Status Status Desc Crm Cd 1 Crm Cd 2 Crm Cd 3 Crm Cd 4 LOCATION Cross Street LAT LON
i64 str str i64 i64 str i64 i64 i64 str str i64 str str i64 str i64 str str str i64 i64 str str str str f64 f64
240412063 "12/11/2024 12:00:00 AM" "11/11/2020 12:00:00 AM" 1210 4 "Hollenbeck" 429 2 354 "THEFT OF IDENTITY" "0100" 35 "M" "B" 502 "MULTI-UNIT DWELLING (APARTMENT… null null "IC" "Invest Cont" 354 null null null "5300    CRONUS                … null 34.083 -118.1678

Personally, I prefer version 2, but you can use whichever one you like.

Note

You’ve probably noticed that the selected row is displayed differently by the two libraries. Polars displays it horizontally, while pandas displays it vertically.

If we wanted Polars to produce output closer to the pandas output, we could use to_dict to produce a list and then retrieve the first item in that list, which happens to be the dictionary we want.

(data
 .gather(5)
 .to_dicts()[0]
 )
{'DR_NO': 240412063,
 'Date Rptd': '12/11/2024 12:00:00 AM',
 'DATE OCC': '11/11/2020 12:00:00 AM',
 'TIME OCC': 1210,
 'AREA': 4,
 'AREA NAME': 'Hollenbeck',
 'Rpt Dist No': 429,
 'Part 1-2': 2,
 'Crm Cd': 354,
 'Crm Cd Desc': 'THEFT OF IDENTITY',
 'Mocodes': '0100',
 'Vict Age': 35,
 'Vict Sex': 'M',
 'Vict Descent': 'B',
 'Premis Cd': 502,
 'Premis Desc': 'MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)',
 'Weapon Used Cd': None,
 'Weapon Desc': None,
 'Status': 'IC',
 'Status Desc': 'Invest Cont',
 'Crm Cd 1': 354,
 'Crm Cd 2': None,
 'Crm Cd 3': None,
 'Crm Cd 4': None,
 'LOCATION': '5300    CRONUS                       ST',
 'Cross Street': None,
 'LAT': 34.083,
 'LON': -118.1678}

Although this is a dictionary, it is displayed vertically and therefore looks closer to the pandas output.

Selecting columns

Now let’s use loc to select a single column in df. The : means “return all the rows in the dataframe.”

df.loc[:, 'Crm Cd 1']
0          354.0
1          230.0
2          354.0
3          331.0
4          420.0
           ...  
1004889    510.0
1004890    745.0
1004891    341.0
1004892    230.0
1004893    510.0
Name: Crm Cd 1, Length: 1004894, dtype: float64

In Polars, we can display a single column by using the select method.

data.select('Crm Cd 1')
shape: (1_004_894, 1)
Crm Cd 1
i64
354
230
354
331
420
510
745
341
230
510

The first thing to note between these two outputs is that when a single column is selected in pandas, the output is a series, unlike in Polars, where the output is always a dataframe regardless of how many columns you select. Another difference is that the pandas output shows an index. Polars dataframes, on the other hand, do not have an index unless you create one.

If we really want the Polars output to be a series as well, we can do it with to_series.

(data
 .select('Crm Cd 1')
 .to_series()
 )
shape: (1_004_894,)
Crm Cd 1
i64
354
230
354
331
420
510
745
341
230
510

The two Polars outputs look the same, but they are different types. The way I distinguish between them is by looking at their shape at the top. The dataframe shows both the number of rows and columns: (1004894, 1), while the series only shows the number of rows: (1004894,).

Here’s how we can select two columns and all the rows in the pandas dataframe.

df.loc[:, ['AREA', 'TIME OCC']]
AREA TIME OCC
0 15 845
1 15 1845
2 9 1240
3 7 1310
4 14 1830
... ... ...
1004889 7 1400
1004890 1 100
1004891 4 2330
1004892 3 1500
1004893 9 2300

1004894 rows × 2 columns

And here’s how to do it the Polars way.

(data
 .select('AREA','TIME OCC')
 )
shape: (1_004_894, 2)
AREA TIME OCC
i64 i64
15 845
15 1845
9 1240
7 1310
14 1830
7 1400
1 100
4 2330
3 1500
9 2300

Combine with list of rows and column labels

We can use a list to select the rows we want to return in both pandas and Polars dataframes.

df.loc[[0,1,2], ['AREA', 'TIME OCC']]
AREA TIME OCC
0 15 845
1 15 1845
2 9 1240

In a Polars dataframe, it is not necessary to use a list to select columns.

(data
 .gather([0,1,2])
 .select('AREA', 'TIME OCC')
 )
shape: (3, 2)
AREA TIME OCC
i64 i64
15 845
15 1845
9 1240

Selecting rows and columns over a range

Instead of listing the rows or columns we want to select, we can select them over a range. Let’s get the rows from 0 to 2 and the columns from TIME OCC to Part 1-2 in both dataframes.

df.loc[0:2, 'TIME OCC':'Part 1-2']
TIME OCC AREA AREA NAME Rpt Dist No Part 1-2
0 845 15 N Hollywood 1502 2
1 1845 15 N Hollywood 1521 1
2 1240 9 Van Nuys 933 2
Note

In pandas, range selection using loc is inclusive. Thus, selecting rows from 0 to 2 includes the row at index position 2.

It turns out that in Polars, there is no direct way to select columns over a range by their names. However, we can achieve the same result as in the pandas example using two workarounds. Both require selecting columns by their index positions.

I’ll begin with the one that uses a list comprehension.

cols = data.columns
data.slice(0, 3).select(cols[cols.index("TIME OCC") : cols.index("Part 1-2") + 1])
shape: (3, 5)
TIME OCC AREA AREA NAME Rpt Dist No Part 1-2
i64 i64 str i64 i64
845 15 "N Hollywood" 1502 2
1845 15 "N Hollywood" 1521 1
1240 9 "Van Nuys" 933 2
Note

We added 1 because, in Polars, index selection is not inclusive. Without it, we would miss the Part 1-2 column in the resulting dataframe.

I would argue that the above method is not idiomatic. List comprehensions are not native to Polars; remember that the library is written in Rust. Here’s the idiomatic approach.

(data
 .select(pl.nth(range(data.get_column_index('TIME OCC'), data.get_column_index('Part 1-2') + 1)))
 )
shape: (1_004_894, 5)
TIME OCC AREA AREA NAME Rpt Dist No Part 1-2
i64 i64 str i64 i64
845 15 "N Hollywood" 1502 2
1845 15 "N Hollywood" 1521 1
1240 9 "Van Nuys" 933 2
1310 7 "Wilshire" 782 1
1830 14 "Pacific" 1454 1
1400 7 "Wilshire" 788 1
100 1 "Central" 101 2
2330 4 "Hollenbeck" 421 1
1500 3 "Southwest" 358 1
2300 9 "Van Nuys" 914 1

Because loc accepts list comprehensions, we can add logic to customize our row or column selection. The example below pulls out every third row and selects only the columns that contain ‘Crm’ in their names.

df.loc[[True if i%3 == 2 else False for i in range(len(df))],
       [True if 'Crm' in col else False for col in df.columns]] 
Crm Cd Crm Cd Desc Crm Cd 1 Crm Cd 2 Crm Cd 3 Crm Cd 4
2 354 THEFT OF IDENTITY 354.0 NaN NaN NaN
5 354 THEFT OF IDENTITY 354.0 NaN NaN NaN
8 354 THEFT OF IDENTITY 354.0 NaN NaN NaN
11 354 THEFT OF IDENTITY 354.0 NaN NaN NaN
14 330 BURGLARY FROM VEHICLE 330.0 NaN NaN NaN
... ... ... ... ... ... ...
1004879 354 THEFT OF IDENTITY 354.0 NaN NaN NaN
1004882 230 ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT 230.0 NaN NaN NaN
1004885 440 THEFT PLAIN - PETTY ($950 & UNDER) 440.0 NaN NaN NaN
1004888 341 THEFT-GRAND ($950.01 & OVER)EXCPT,GUNS,FOWL,LI... 341.0 NaN NaN NaN
1004891 341 THEFT-GRAND ($950.01 & OVER)EXCPT,GUNS,FOWL,LI... 341.0 NaN NaN NaN

334964 rows × 6 columns

We can achieve similar results in Polars by creating an index or by using gather_every. The example below creates an index on the dataframe and performs a modulo calculation on the generated index numbers to get the desired rows. Then we use a regex pattern to select the columns that contain the string ‘Crm’.

(data
 .with_row_index()
 .filter(pl.col('index') % 3 == 2)
 .select(pl.col('^.*Crm.*$'))
)
shape: (334_964, 6)
Crm Cd Crm Cd Desc Crm Cd 1 Crm Cd 2 Crm Cd 3 Crm Cd 4
i64 str i64 i64 str str
354 "THEFT OF IDENTITY" 354 null null null
354 "THEFT OF IDENTITY" 354 null null null
354 "THEFT OF IDENTITY" 354 null null null
354 "THEFT OF IDENTITY" 354 null null null
330 "BURGLARY FROM VEHICLE" 330 null null null
354 "THEFT OF IDENTITY" 354 null null null
230 "ASSAULT WITH DEADLY WEAPON, AG… 230 null null null
440 "THEFT PLAIN - PETTY ($950 & UN… 440 null null null
341 "THEFT-GRAND ($950.01 & OVER)EX… 341 null null null
341 "THEFT-GRAND ($950.01 & OVER)EX… 341 null null null

An even more idiomatic way to write the above code is by using gather_every.

(data
1 .gather_every(3, offset=2)
 .select(pl.col('^.*Crm.*$'))
)
1
Remember that the condition % 3 == 2 keeps rows where the index has a remainder of 2 when divided by 3, i.e., indices 2, 5, 8, 11, and so on. Therefore, the first row selected is index 2, which maps directly to offset=2.
shape: (334_964, 6)
Crm Cd Crm Cd Desc Crm Cd 1 Crm Cd 2 Crm Cd 3 Crm Cd 4
i64 str i64 i64 str str
354 "THEFT OF IDENTITY" 354 null null null
354 "THEFT OF IDENTITY" 354 null null null
354 "THEFT OF IDENTITY" 354 null null null
354 "THEFT OF IDENTITY" 354 null null null
330 "BURGLARY FROM VEHICLE" 330 null null null
354 "THEFT OF IDENTITY" 354 null null null
230 "ASSAULT WITH DEADLY WEAPON, AG… 230 null null null
440 "THEFT PLAIN - PETTY ($950 & UN… 440 null null null
341 "THEFT-GRAND ($950.01 & OVER)EX… 341 null null null
341 "THEFT-GRAND ($950.01 & OVER)EX… 341 null null null

Just to show off how versatile Polars is, I’ll demonstrate other interesting ways to write the code above. I’ll begin with the one that uses the arange function. This also eliminates the need for an index.

(data
 .filter((pl.arange(0, pl.len()) % 3) == 2)
 .select(pl.col('^.*Crm.*$'))
 )
shape: (334_964, 6)
Crm Cd Crm Cd Desc Crm Cd 1 Crm Cd 2 Crm Cd 3 Crm Cd 4
i64 str i64 i64 str str
354 "THEFT OF IDENTITY" 354 null null null
354 "THEFT OF IDENTITY" 354 null null null
354 "THEFT OF IDENTITY" 354 null null null
354 "THEFT OF IDENTITY" 354 null null null
330 "BURGLARY FROM VEHICLE" 330 null null null
354 "THEFT OF IDENTITY" 354 null null null
230 "ASSAULT WITH DEADLY WEAPON, AG… 230 null null null
440 "THEFT PLAIN - PETTY ($950 & UN… 440 null null null
341 "THEFT-GRAND ($950.01 & OVER)EX… 341 null null null
341 "THEFT-GRAND ($950.01 & OVER)EX… 341 null null null

We can also use the int_range function to get the same result.

(data
 .filter((pl.int_range(pl.len()) % 3) == 2)
 .select(pl.col('^.*Crm.*$'))
)
shape: (334_964, 6)
Crm Cd Crm Cd Desc Crm Cd 1 Crm Cd 2 Crm Cd 3 Crm Cd 4
i64 str i64 i64 str str
354 "THEFT OF IDENTITY" 354 null null null
354 "THEFT OF IDENTITY" 354 null null null
354 "THEFT OF IDENTITY" 354 null null null
354 "THEFT OF IDENTITY" 354 null null null
330 "BURGLARY FROM VEHICLE" 330 null null null
354 "THEFT OF IDENTITY" 354 null null null
230 "ASSAULT WITH DEADLY WEAPON, AG… 230 null null null
440 "THEFT PLAIN - PETTY ($950 & UN… 440 null null null
341 "THEFT-GRAND ($950.01 & OVER)EX… 341 null null null
341 "THEFT-GRAND ($950.01 & OVER)EX… 341 null null null

Regex patterns are notorious for being difficult to remember. We can avoid them altogether by using special selectors in Polars.

import polars.selectors as cs

(data
 .filter(pl.int_range(pl.len()).mod(3) == 2)
 .select(cs.contains('Crm'))
 )
shape: (334_964, 6)
Crm Cd Crm Cd Desc Crm Cd 1 Crm Cd 2 Crm Cd 3 Crm Cd 4
i64 str i64 i64 str str
354 "THEFT OF IDENTITY" 354 null null null
354 "THEFT OF IDENTITY" 354 null null null
354 "THEFT OF IDENTITY" 354 null null null
354 "THEFT OF IDENTITY" 354 null null null
330 "BURGLARY FROM VEHICLE" 330 null null null
354 "THEFT OF IDENTITY" 354 null null null
230 "ASSAULT WITH DEADLY WEAPON, AG… 230 null null null
440 "THEFT PLAIN - PETTY ($950 & UN… 440 null null null
341 "THEFT-GRAND ($950.01 & OVER)EX… 341 null null null
341 "THEFT-GRAND ($950.01 & OVER)EX… 341 null null null

Boolean arrays

Most row filtering in pandas uses boolean masks. Let’s begin with a simple example: filter for rows where AREA NAME is ‘Central’. The code below produces a boolean mask.

df['AREA NAME'] == 'Central'
0          False
1          False
2          False
3          False
4          False
           ...  
1004889    False
1004890     True
1004891    False
1004892    False
1004893    False
Name: AREA NAME, Length: 1004894, dtype: bool

We can use this boolean mask in loc to get the desired rows in the dataframe.

df.loc[df['AREA NAME'] == 'Central']
DR_NO Date Rptd DATE OCC TIME OCC AREA AREA NAME Rpt Dist No Part 1-2 Crm Cd Crm Cd Desc ... Status Status Desc Crm Cd 1 Crm Cd 2 Crm Cd 3 Crm Cd 4 LOCATION Cross Street LAT LON
107 230110144 04/04/2023 12:00:00 AM 07/03/2020 12:00:00 AM 900 1 Central 182 2 354 THEFT OF IDENTITY ... IC Invest Cont 354.0 NaN NaN NaN 1100 S  GRAND                        AV NaN 34.0415 -118.2620
563 210118909 10/19/2021 12:00:00 AM 06/02/2020 12:00:00 AM 1605 1 Central 134 1 480 BIKE - STOLEN ... IC Invest Cont 480.0 NaN NaN NaN 3RD                          ST BROADWAY 34.0510 -118.2480
572 210104305 01/08/2021 12:00:00 AM 10/28/2020 12:00:00 AM 1203 1 Central 182 2 668 EMBEZZLEMENT, GRAND THEFT ($950.01 & OVER) ... IC Invest Cont 668.0 NaN NaN NaN 1200 S  FIGUEROA                     ST NaN 34.0416 -118.2671
1007 210110076 04/27/2021 12:00:00 AM 07/01/2020 12:00:00 AM 1200 1 Central 155 2 812 CRM AGNST CHLD (13 OR UNDER) (14-15 & SUSP 10 ... ... AO Adult Other 812.0 860.0 NaN NaN 200 E  6TH                          ST NaN 34.0448 -118.2474
1708 210104086 01/03/2021 12:00:00 AM 01/03/2020 12:00:00 AM 1510 1 Central 101 2 624 BATTERY - SIMPLE ASSAULT ... IC Invest Cont 624.0 NaN NaN NaN 900    CENTENNIAL                   ST NaN 34.0669 -118.2458
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1004822 240119704 10/09/2024 12:00:00 AM 10/09/2024 12:00:00 AM 700 1 Central 119 2 888 TRESPASSING ... IC Invest Cont 888.0 NaN NaN NaN 800 N  ALAMEDA                      ST NaN 34.0561 -118.2375
1004847 240109835 04/01/2024 12:00:00 AM 04/01/2024 12:00:00 AM 1250 1 Central 152 2 624 BATTERY - SIMPLE ASSAULT ... IC Invest Cont 624.0 NaN NaN NaN 7TH                          ST FLOWER                       ST 34.0490 -118.2592
1004868 240112635 06/10/2024 12:00:00 AM 06/10/2024 12:00:00 AM 1855 1 Central 138 1 510 VEHICLE - STOLEN ... IC Invest Cont 510.0 NaN NaN NaN 300 S  ALAMEDA                      ST NaN 34.0468 -118.2415
1004885 240119644 10/04/2024 12:00:00 AM 09/27/2024 12:00:00 AM 2345 1 Central 142 1 440 THEFT PLAIN - PETTY ($950 & UNDER) ... IC Invest Cont 440.0 NaN NaN NaN W  3RD S  GRAND 34.0531 -118.2512
1004890 240104953 01/15/2024 12:00:00 AM 01/15/2024 12:00:00 AM 100 1 Central 101 2 745 VANDALISM - MISDEAMEANOR ($399 OR UNDER) ... IC Invest Cont 745.0 NaN NaN NaN 1300 W  SUNSET                       BL NaN 34.0685 -118.2460

69668 rows × 28 columns

By contrast, Polars does not require a boolean mask to perform this filtering operation.

(data
 .filter(pl.col('AREA NAME') == 'Central')
 )
shape: (69_668, 28)
DR_NO Date Rptd DATE OCC TIME OCC AREA AREA NAME Rpt Dist No Part 1-2 Crm Cd Crm Cd Desc Mocodes Vict Age Vict Sex Vict Descent Premis Cd Premis Desc Weapon Used Cd Weapon Desc Status Status Desc Crm Cd 1 Crm Cd 2 Crm Cd 3 Crm Cd 4 LOCATION Cross Street LAT LON
i64 str str i64 i64 str i64 i64 i64 str str i64 str str i64 str i64 str str str i64 i64 str str str str f64 f64
230110144 "04/04/2023 12:00:00 AM" "07/03/2020 12:00:00 AM" 900 1 "Central" 182 2 354 "THEFT OF IDENTITY" "0930 0929" 25 "M" "H" 502 "MULTI-UNIT DWELLING (APARTMENT… null null "IC" "Invest Cont" 354 null null null "1100 S  GRAND                 … null 34.0415 -118.262
210118909 "10/19/2021 12:00:00 AM" "06/02/2020 12:00:00 AM" 1605 1 "Central" 134 1 480 "BIKE - STOLEN" "0344 2032 1822" 35 "M" "W" 101 "STREET" null null "IC" "Invest Cont" 480 null null null "3RD                          S… "BROADWAY" 34.051 -118.248
210104305 "01/08/2021 12:00:00 AM" "10/28/2020 12:00:00 AM" 1203 1 "Central" 182 2 668 "EMBEZZLEMENT, GRAND THEFT ($95… "1501" 0 "M" "W" 203 "OTHER BUSINESS" null null "IC" "Invest Cont" 668 null null null "1200 S  FIGUEROA              … null 34.0416 -118.2671
210110076 "04/27/2021 12:00:00 AM" "07/01/2020 12:00:00 AM" 1200 1 "Central" 155 2 812 "CRM AGNST CHLD (13 OR UNDER) (… null 14 "M" "H" 501 "SINGLE FAMILY DWELLING" 400 "STRONG-ARM (HANDS, FIST, FEET … "AO" "Adult Other" 812 860 null null "200 E  6TH                    … null 34.0448 -118.2474
210104086 "01/03/2021 12:00:00 AM" "01/03/2020 12:00:00 AM" 1510 1 "Central" 101 2 624 "BATTERY - SIMPLE ASSAULT" "1822 0444 1309 0448" 58 "M" "H" 102 "SIDEWALK" 400 "STRONG-ARM (HANDS, FIST, FEET … "IC" "Invest Cont" 624 null null null "900    CENTENNIAL             … null 34.0669 -118.2458
240119704 "10/09/2024 12:00:00 AM" "10/09/2024 12:00:00 AM" 700 1 "Central" 119 2 888 "TRESPASSING" "0910 2004" 0 "M" "O" 900 "MTA - RED LINE - UNION STATION" null null "IC" "Invest Cont" 888 null null null "800 N  ALAMEDA                … null 34.0561 -118.2375
240109835 "04/01/2024 12:00:00 AM" "04/01/2024 12:00:00 AM" 1250 1 "Central" 152 2 624 "BATTERY - SIMPLE ASSAULT" "2021 2048 1822 0400 0910 0417" 60 "M" "B" 903 "MTA - RED LINE - 7TH AND METRO… 400 "STRONG-ARM (HANDS, FIST, FEET … "IC" "Invest Cont" 624 null null null "7TH                          S… "FLOWER                       S… 34.049 -118.2592
240112635 "06/10/2024 12:00:00 AM" "06/10/2024 12:00:00 AM" 1855 1 "Central" 138 1 510 "VEHICLE - STOLEN" null 0 null null 101 "STREET" null null "IC" "Invest Cont" 510 null null null "300 S  ALAMEDA                … null 34.0468 -118.2415
240119644 "10/04/2024 12:00:00 AM" "09/27/2024 12:00:00 AM" 2345 1 "Central" 142 1 440 "THEFT PLAIN - PETTY ($950 & UN… "0344 2014" 42 "F" "B" 122 "VEHICLE, PASSENGER/TRUCK" null null "IC" "Invest Cont" 440 null null null "W  3RD" "S  GRAND" 34.0531 -118.2512
240104953 "01/15/2024 12:00:00 AM" "01/15/2024 12:00:00 AM" 100 1 "Central" 101 2 745 "VANDALISM - MISDEAMEANOR ($399… "0329 0400 0416" 0 "X" "X" 503 "HOTEL" 500 "UNKNOWN WEAPON/OTHER WEAPON" "IC" "Invest Cont" 745 null null null "1300 W  SUNSET                … null 34.0685 -118.246

Now let’s move to a more interesting example. The goal is to select all the columns that have more than the median number of unique values. We’ll calculate the number of unique entries in each column and then find the median value for those counts.

Let’s show the number of unique values in each column of the pandas dataframe.

df.nunique()
DR_NO             1004894
Date Rptd            1870
DATE OCC             1826
TIME OCC             1439
AREA                   21
AREA NAME              21
Rpt Dist No          1210
Part 1-2                2
Crm Cd                140
Crm Cd Desc           140
Mocodes            310912
Vict Age              104
Vict Sex                5
Vict Descent           20
Premis Cd             314
Premis Desc           306
Weapon Used Cd         79
Weapon Desc            79
Status                  6
Status Desc             6
Crm Cd 1              142
Crm Cd 2              126
Crm Cd 3               38
Crm Cd 4                6
LOCATION            66566
Cross Street        10413
LAT                  5426
LON                  4982
dtype: int64

Because the output is displayed vertically, calculating the median is as simple as calling median.

df.nunique().median()
np.float64(140.0)

The calculation is somewhat more involved in Polars. To begin with, the result showing the number of unique values in each column is displayed horizontally as a dataframe.

data.select(pl.all().n_unique())
shape: (1, 28)
DR_NO Date Rptd DATE OCC TIME OCC AREA AREA NAME Rpt Dist No Part 1-2 Crm Cd Crm Cd Desc Mocodes Vict Age Vict Sex Vict Descent Premis Cd Premis Desc Weapon Used Cd Weapon Desc Status Status Desc Crm Cd 1 Crm Cd 2 Crm Cd 3 Crm Cd 4 LOCATION Cross Street LAT LON
u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32
1004894 1870 1826 1439 21 21 1210 2 140 140 310913 104 6 21 315 307 80 80 7 6 143 127 39 7 66566 10414 5426 4982

We can calculate the median value 140.0 by using the transpose method or by using the concat_list function.

(data
 .select(pl.all().n_unique())
 .transpose()
 .median()
 .item()
 )
140.0

Transposing flips the dataframe from horizontal to vertical, making it possible to perform the median calculation.

The second approach collects the unique counts for each column into a list and then performs the median calculation.

(data
 .select(pl.all().n_unique())                   
 .select(pl.concat_list(pl.all()).list.median())
 .item()                                        
)
140.0

Now that we know the median value is 140.0, we can use it to get the desired columns. Here’s how to do it in pandas.

df.loc[:, df.nunique() > 140]
DR_NO Date Rptd DATE OCC TIME OCC Rpt Dist No Mocodes Premis Cd Premis Desc Crm Cd 1 LOCATION Cross Street LAT LON
0 211507896 04/11/2021 12:00:00 AM 11/07/2020 12:00:00 AM 845 1502 0377 501.0 SINGLE FAMILY DWELLING 354.0 7800    BEEMAN                       AV NaN 34.2124 -118.4092
1 201516622 10/21/2020 12:00:00 AM 10/18/2020 12:00:00 AM 1845 1521 0416 0334 2004 1822 1414 0305 0319 0400 102.0 SIDEWALK 230.0 ATOLL                        AV N  GAULT 34.1993 -118.4203
2 240913563 12/10/2024 12:00:00 AM 10/30/2020 12:00:00 AM 1240 933 0377 501.0 SINGLE FAMILY DWELLING 354.0 14600    SYLVAN                       ST NaN 34.1847 -118.4509
3 210704711 12/24/2020 12:00:00 AM 12/24/2020 12:00:00 AM 1310 782 0344 101.0 STREET 331.0 6000    COMEY                        AV NaN 34.0339 -118.3747
4 201418201 10/03/2020 12:00:00 AM 09/29/2020 12:00:00 AM 1830 1454 1300 0344 1606 2032 103.0 ALLEY 420.0 4700    LA VILLA MARINA NaN 33.9813 -118.4350
... ... ... ... ... ... ... ... ... ... ... ... ... ...
1004889 240710284 07/24/2024 12:00:00 AM 07/23/2024 12:00:00 AM 1400 788 NaN 101.0 STREET 510.0 4000 W  23RD                         ST NaN 34.0362 -118.3284
1004890 240104953 01/15/2024 12:00:00 AM 01/15/2024 12:00:00 AM 100 101 0329 0400 0416 503.0 HOTEL 745.0 1300 W  SUNSET                       BL NaN 34.0685 -118.2460
1004891 240410786 10/14/2024 12:00:00 AM 10/11/2024 12:00:00 AM 2330 421 0344 210.0 RESTAURANT/FAST FOOD 341.0 1700    ALBION                       ST NaN 34.0675 -118.2240
1004892 240309674 04/24/2024 12:00:00 AM 04/24/2024 12:00:00 AM 1500 358 1822 0334 0416 0445 0449 1202 102.0 SIDEWALK 230.0 FLOWER                       ST JEFFERSON                    BL 34.0215 -118.2868
1004893 240910892 08/13/2024 12:00:00 AM 08/12/2024 12:00:00 AM 2300 914 NaN 108.0 PARKING LOT 510.0 6900    VESPER                       AV NaN 34.1961 -118.4510

1004894 rows × 13 columns

And here’s how to do it in Polars.

(data
 .select(col for col in data.columns if data.select(pl.col(col).n_unique()).item() > 140)
 )
shape: (1_004_894, 13)
DR_NO Date Rptd DATE OCC TIME OCC Rpt Dist No Mocodes Premis Cd Premis Desc Crm Cd 1 LOCATION Cross Street LAT LON
i64 str str i64 i64 str i64 str i64 str str f64 f64
211507896 "04/11/2021 12:00:00 AM" "11/07/2020 12:00:00 AM" 845 1502 "0377" 501 "SINGLE FAMILY DWELLING" 354 "7800    BEEMAN                … null 34.2124 -118.4092
201516622 "10/21/2020 12:00:00 AM" "10/18/2020 12:00:00 AM" 1845 1521 "0416 0334 2004 1822 1414 0305 … 102 "SIDEWALK" 230 "ATOLL                        A… "N  GAULT" 34.1993 -118.4203
240913563 "12/10/2024 12:00:00 AM" "10/30/2020 12:00:00 AM" 1240 933 "0377" 501 "SINGLE FAMILY DWELLING" 354 "14600    SYLVAN               … null 34.1847 -118.4509
210704711 "12/24/2020 12:00:00 AM" "12/24/2020 12:00:00 AM" 1310 782 "0344" 101 "STREET" 331 "6000    COMEY                 … null 34.0339 -118.3747
201418201 "10/03/2020 12:00:00 AM" "09/29/2020 12:00:00 AM" 1830 1454 "1300 0344 1606 2032" 103 "ALLEY" 420 "4700    LA VILLA MARINA" null 33.9813 -118.435
240710284 "07/24/2024 12:00:00 AM" "07/23/2024 12:00:00 AM" 1400 788 null 101 "STREET" 510 "4000 W  23RD                  … null 34.0362 -118.3284
240104953 "01/15/2024 12:00:00 AM" "01/15/2024 12:00:00 AM" 100 101 "0329 0400 0416" 503 "HOTEL" 745 "1300 W  SUNSET                … null 34.0685 -118.246
240410786 "10/14/2024 12:00:00 AM" "10/11/2024 12:00:00 AM" 2330 421 "0344" 210 "RESTAURANT/FAST FOOD" 341 "1700    ALBION                … null 34.0675 -118.224
240309674 "04/24/2024 12:00:00 AM" "04/24/2024 12:00:00 AM" 1500 358 "1822 0334 0416 0445 0449 1202" 102 "SIDEWALK" 230 "FLOWER                       S… "JEFFERSON                    B… 34.0215 -118.2868
240910892 "08/13/2024 12:00:00 AM" "08/12/2024 12:00:00 AM" 2300 914 null 108 "PARKING LOT" 510 "6900    VESPER                … null 34.1961 -118.451

Closing with a practical example

Let’s end with a visualization of the previous analysis. I’ll focus on the Burglary from Vehicle column and plot it over time for the different AREA NAME values. Before we begin creating the charts, I’ll show the dataframe that will be used to create the plot so we know what it looks like.

Here’s the pandas version.

(df
 .astype({'Date Rptd': 'datetime64[ns]',
          'DATE OCC': 'datetime64[ns]',
          })
 .groupby([pd.Grouper(key='DATE OCC', freq='ME'), 'AREA NAME', 'Crm Cd Desc'])
 .size()
 .unstack()
 .loc[:, 'BURGLARY FROM VEHICLE']
 .unstack()
1 .loc[:, lambda adf: sorted(adf.columns, key=lambda col: 1 if col == 'Hollywood' else -1)]
)
1
Move the Hollywood column to the end so that it appears on top of the other columns.
AREA NAME 77th Street Central Devonshire Foothill Harbor Hollenbeck Mission N Hollywood Newton Northeast ... Pacific Rampart Southeast Southwest Topanga Van Nuys West LA West Valley Wilshire Hollywood
DATE OCC
2020-01-31 75.0 144.0 67.0 37.0 29.0 32.0 53.0 108.0 72.0 80.0 ... 70.0 72.0 42.0 53.0 57.0 59.0 83.0 79.0 78.0 77.0
2020-02-29 45.0 148.0 58.0 40.0 34.0 31.0 47.0 78.0 60.0 73.0 ... 73.0 58.0 44.0 41.0 36.0 52.0 62.0 63.0 79.0 65.0
2020-03-31 55.0 114.0 52.0 41.0 38.0 40.0 43.0 64.0 37.0 74.0 ... 85.0 37.0 38.0 46.0 44.0 60.0 55.0 34.0 52.0 69.0
2020-04-30 61.0 96.0 56.0 28.0 53.0 24.0 30.0 78.0 51.0 74.0 ... 80.0 62.0 30.0 60.0 34.0 64.0 69.0 45.0 52.0 47.0
2020-05-31 41.0 78.0 46.0 40.0 39.0 23.0 40.0 77.0 44.0 58.0 ... 66.0 60.0 31.0 33.0 51.0 60.0 67.0 34.0 49.0 56.0
2020-06-30 32.0 80.0 74.0 34.0 30.0 34.0 38.0 56.0 29.0 59.0 ... 70.0 37.0 38.0 41.0 55.0 81.0 80.0 51.0 50.0 40.0
2020-07-31 51.0 93.0 46.0 34.0 28.0 20.0 39.0 53.0 32.0 50.0 ... 62.0 39.0 31.0 50.0 44.0 62.0 64.0 48.0 56.0 64.0
2020-08-31 44.0 95.0 59.0 18.0 14.0 15.0 27.0 53.0 36.0 25.0 ... 64.0 49.0 26.0 40.0 38.0 67.0 53.0 51.0 40.0 41.0
2020-09-30 27.0 75.0 35.0 20.0 26.0 25.0 20.0 50.0 23.0 44.0 ... 47.0 48.0 31.0 29.0 30.0 50.0 46.0 37.0 39.0 46.0
2020-10-31 24.0 89.0 34.0 23.0 24.0 20.0 33.0 58.0 23.0 29.0 ... 59.0 36.0 37.0 47.0 51.0 67.0 75.0 42.0 42.0 77.0
2020-11-30 42.0 114.0 48.0 32.0 32.0 16.0 34.0 58.0 29.0 37.0 ... 55.0 53.0 45.0 25.0 41.0 50.0 74.0 46.0 56.0 47.0
2020-12-31 36.0 67.0 41.0 32.0 24.0 32.0 46.0 89.0 35.0 54.0 ... 67.0 45.0 38.0 26.0 61.0 43.0 85.0 52.0 58.0 58.0
2021-01-31 30.0 95.0 44.0 42.0 51.0 31.0 28.0 71.0 45.0 59.0 ... 96.0 56.0 29.0 43.0 59.0 48.0 85.0 41.0 54.0 76.0
2021-02-28 52.0 100.0 31.0 34.0 37.0 24.0 32.0 63.0 34.0 42.0 ... 77.0 53.0 28.0 51.0 41.0 46.0 79.0 29.0 59.0 66.0
2021-03-31 30.0 75.0 43.0 27.0 38.0 24.0 33.0 67.0 29.0 56.0 ... 56.0 60.0 28.0 47.0 40.0 62.0 64.0 34.0 58.0 91.0
2021-04-30 28.0 90.0 47.0 32.0 28.0 17.0 28.0 82.0 28.0 42.0 ... 83.0 34.0 26.0 51.0 42.0 43.0 68.0 36.0 44.0 81.0
2021-05-31 37.0 118.0 44.0 36.0 35.0 25.0 30.0 67.0 45.0 55.0 ... 108.0 37.0 40.0 28.0 49.0 55.0 68.0 30.0 61.0 79.0
2021-06-30 37.0 112.0 62.0 33.0 22.0 12.0 32.0 85.0 34.0 43.0 ... 73.0 38.0 33.0 23.0 43.0 41.0 59.0 46.0 51.0 85.0
2021-07-31 25.0 146.0 64.0 16.0 30.0 26.0 34.0 61.0 79.0 50.0 ... 67.0 45.0 26.0 41.0 43.0 43.0 68.0 48.0 53.0 95.0
2021-08-31 36.0 138.0 50.0 32.0 26.0 29.0 39.0 71.0 51.0 48.0 ... 81.0 48.0 23.0 42.0 43.0 55.0 57.0 44.0 43.0 101.0
2021-09-30 18.0 165.0 52.0 28.0 25.0 30.0 42.0 80.0 64.0 43.0 ... 59.0 45.0 26.0 26.0 39.0 55.0 63.0 44.0 48.0 89.0
2021-10-31 37.0 196.0 67.0 21.0 39.0 29.0 54.0 102.0 48.0 55.0 ... 66.0 64.0 33.0 39.0 57.0 69.0 66.0 48.0 57.0 116.0
2021-11-30 30.0 178.0 58.0 25.0 38.0 28.0 37.0 87.0 43.0 66.0 ... 64.0 55.0 23.0 35.0 38.0 47.0 75.0 64.0 68.0 105.0
2021-12-31 22.0 247.0 51.0 25.0 25.0 33.0 43.0 72.0 53.0 56.0 ... 73.0 56.0 28.0 45.0 65.0 59.0 70.0 57.0 75.0 84.0
2022-01-31 22.0 230.0 51.0 30.0 36.0 30.0 52.0 71.0 43.0 70.0 ... 95.0 58.0 28.0 41.0 46.0 62.0 72.0 51.0 78.0 112.0
2022-02-28 30.0 178.0 43.0 37.0 32.0 22.0 46.0 66.0 52.0 51.0 ... 65.0 54.0 15.0 37.0 51.0 33.0 69.0 45.0 66.0 96.0
2022-03-31 24.0 264.0 26.0 24.0 35.0 29.0 38.0 72.0 42.0 62.0 ... 103.0 51.0 17.0 34.0 36.0 41.0 67.0 46.0 68.0 84.0
2022-04-30 32.0 272.0 62.0 25.0 29.0 15.0 35.0 70.0 58.0 64.0 ... 89.0 45.0 21.0 39.0 42.0 56.0 71.0 61.0 75.0 85.0
2022-05-31 28.0 234.0 43.0 24.0 30.0 25.0 21.0 58.0 50.0 68.0 ... 74.0 63.0 25.0 35.0 48.0 46.0 63.0 53.0 53.0 120.0
2022-06-30 33.0 242.0 32.0 27.0 34.0 26.0 27.0 58.0 59.0 70.0 ... 84.0 50.0 20.0 33.0 39.0 44.0 53.0 50.0 56.0 152.0
2022-07-31 32.0 275.0 31.0 25.0 22.0 15.0 26.0 49.0 56.0 60.0 ... 47.0 41.0 22.0 23.0 24.0 40.0 52.0 31.0 64.0 101.0
2022-08-31 26.0 278.0 40.0 16.0 29.0 17.0 28.0 51.0 55.0 58.0 ... 58.0 46.0 33.0 24.0 28.0 35.0 47.0 38.0 54.0 77.0
2022-09-30 34.0 218.0 37.0 22.0 23.0 26.0 39.0 53.0 46.0 58.0 ... 53.0 34.0 16.0 30.0 37.0 55.0 50.0 46.0 50.0 93.0
2022-10-31 20.0 228.0 46.0 21.0 30.0 21.0 11.0 49.0 63.0 50.0 ... 63.0 64.0 29.0 31.0 42.0 50.0 50.0 74.0 47.0 87.0
2022-11-30 27.0 205.0 37.0 26.0 26.0 28.0 27.0 59.0 58.0 41.0 ... 59.0 37.0 24.0 32.0 32.0 55.0 55.0 56.0 39.0 63.0
2022-12-31 28.0 317.0 39.0 36.0 29.0 25.0 28.0 64.0 50.0 66.0 ... 62.0 53.0 31.0 32.0 33.0 55.0 60.0 53.0 54.0 72.0
2023-01-31 31.0 292.0 35.0 44.0 25.0 21.0 38.0 57.0 54.0 75.0 ... 61.0 51.0 27.0 39.0 34.0 32.0 67.0 36.0 55.0 75.0
2023-02-28 18.0 236.0 36.0 15.0 29.0 24.0 30.0 40.0 39.0 78.0 ... 72.0 33.0 16.0 44.0 26.0 62.0 57.0 45.0 57.0 77.0
2023-03-31 31.0 141.0 38.0 18.0 27.0 24.0 28.0 68.0 42.0 53.0 ... 67.0 43.0 25.0 38.0 43.0 45.0 59.0 52.0 48.0 79.0
2023-04-30 19.0 140.0 26.0 20.0 23.0 22.0 15.0 54.0 42.0 52.0 ... 68.0 30.0 24.0 31.0 37.0 31.0 33.0 47.0 55.0 47.0
2023-05-31 26.0 192.0 34.0 21.0 24.0 20.0 20.0 49.0 63.0 53.0 ... 61.0 46.0 21.0 41.0 38.0 25.0 46.0 27.0 53.0 52.0
2023-06-30 26.0 175.0 30.0 19.0 28.0 26.0 14.0 55.0 55.0 58.0 ... 74.0 62.0 26.0 22.0 42.0 36.0 42.0 45.0 71.0 71.0
2023-07-31 42.0 186.0 40.0 21.0 35.0 22.0 32.0 52.0 68.0 61.0 ... 53.0 36.0 29.0 49.0 30.0 44.0 55.0 57.0 71.0 50.0
2023-08-31 34.0 237.0 22.0 21.0 29.0 17.0 28.0 50.0 68.0 64.0 ... 58.0 40.0 21.0 43.0 31.0 39.0 64.0 35.0 86.0 70.0
2023-09-30 36.0 199.0 39.0 19.0 32.0 26.0 27.0 57.0 59.0 67.0 ... 65.0 52.0 18.0 48.0 54.0 29.0 65.0 60.0 72.0 63.0
2023-10-31 33.0 226.0 43.0 23.0 29.0 34.0 34.0 68.0 76.0 55.0 ... 89.0 52.0 28.0 47.0 60.0 50.0 94.0 53.0 85.0 79.0
2023-11-30 42.0 221.0 45.0 26.0 30.0 32.0 33.0 71.0 72.0 53.0 ... 68.0 76.0 31.0 35.0 74.0 54.0 80.0 63.0 55.0 65.0
2023-12-31 37.0 205.0 52.0 27.0 40.0 24.0 37.0 59.0 86.0 71.0 ... 59.0 66.0 24.0 42.0 58.0 48.0 61.0 67.0 89.0 72.0
2024-01-31 34.0 143.0 74.0 24.0 34.0 18.0 29.0 74.0 64.0 51.0 ... 74.0 47.0 16.0 48.0 50.0 49.0 67.0 65.0 72.0 82.0
2024-02-29 30.0 132.0 45.0 20.0 25.0 31.0 25.0 51.0 54.0 48.0 ... 71.0 49.0 15.0 35.0 43.0 38.0 58.0 72.0 65.0 65.0
2024-03-31 31.0 106.0 56.0 23.0 26.0 25.0 24.0 41.0 62.0 47.0 ... 57.0 47.0 16.0 40.0 61.0 61.0 49.0 61.0 75.0 85.0
2024-04-30 20.0 88.0 52.0 29.0 19.0 31.0 30.0 43.0 37.0 36.0 ... 72.0 43.0 19.0 31.0 59.0 45.0 26.0 51.0 40.0 53.0
2024-05-31 22.0 87.0 29.0 23.0 21.0 27.0 21.0 38.0 44.0 46.0 ... 31.0 18.0 8.0 32.0 26.0 30.0 25.0 30.0 37.0 40.0
2024-06-30 21.0 104.0 27.0 19.0 18.0 21.0 21.0 33.0 34.0 27.0 ... 53.0 26.0 12.0 20.0 37.0 16.0 25.0 23.0 33.0 47.0
2024-07-31 11.0 103.0 22.0 11.0 13.0 20.0 13.0 38.0 22.0 45.0 ... 48.0 30.0 14.0 22.0 21.0 16.0 43.0 23.0 49.0 40.0
2024-08-31 22.0 125.0 25.0 15.0 21.0 14.0 26.0 28.0 38.0 38.0 ... 49.0 34.0 7.0 27.0 28.0 13.0 22.0 22.0 39.0 59.0
2024-09-30 26.0 121.0 20.0 17.0 20.0 18.0 22.0 31.0 25.0 34.0 ... 53.0 37.0 5.0 23.0 24.0 21.0 37.0 23.0 29.0 46.0
2024-10-31 16.0 145.0 33.0 15.0 10.0 11.0 26.0 35.0 23.0 42.0 ... 38.0 36.0 10.0 22.0 16.0 28.0 29.0 31.0 33.0 45.0
2024-11-30 15.0 152.0 15.0 7.0 20.0 7.0 17.0 37.0 29.0 52.0 ... 54.0 65.0 9.0 21.0 22.0 21.0 28.0 38.0 38.0 42.0
2024-12-31 16.0 145.0 19.0 5.0 35.0 14.0 15.0 35.0 26.0 26.0 ... 31.0 43.0 11.0 24.0 19.0 21.0 11.0 23.0 32.0 33.0

60 rows × 21 columns

And here’s the Polars version, which I’ll store in a variable called to_plot to use later.

to_plot = (data
 .with_columns(pl.col('Date Rptd', 'DATE OCC').str.strptime(pl.Datetime, '%m/%d/%Y %I:%M:%S %p'))
 .sort('DATE OCC')
 .group_by_dynamic('DATE OCC', every='1mo', group_by=['AREA NAME', 'Crm Cd Desc'])
 .agg(pl.len())
 .filter(pl.col('Crm Cd Desc') =='BURGLARY FROM VEHICLE')
 .select(pl.exclude('Crm Cd Desc'))
 .pivot(on='AREA NAME', index='DATE OCC', values='len')
1 .select(pl.all().exclude('Hollywood'), 'Hollywood')
)
to_plot
1
Move the Hollywood column to the end of the dataframe.
shape: (60, 22)
DATE OCC Harbor Newton Central Mission Olympic Pacific Rampart Topanga West LA Foothill Van Nuys Wilshire Northeast Southeast Southwest Devonshire Hollenbeck 77th Street N Hollywood West Valley Hollywood
datetime[μs] u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32 u32
2020-01-01 00:00:00 29 72 144 53 77 70 72 57 83 37 59 78 80 42 53 67 32 75 108 79 77
2020-02-01 00:00:00 34 60 148 47 48 73 58 36 62 40 52 79 73 44 41 58 31 45 78 63 65
2020-03-01 00:00:00 38 37 114 43 42 85 37 44 55 41 60 52 74 38 46 52 40 55 64 34 69
2020-04-01 00:00:00 53 51 96 30 57 80 62 34 69 28 64 52 74 30 60 56 24 61 78 45 47
2020-05-01 00:00:00 39 44 78 40 56 66 60 51 67 40 60 49 58 31 33 46 23 41 77 34 56
2024-08-01 00:00:00 21 38 125 26 29 49 34 28 22 15 13 39 38 7 27 25 14 22 28 22 59
2024-09-01 00:00:00 20 25 121 22 38 53 37 24 37 17 21 29 34 5 23 20 18 26 31 23 46
2024-10-01 00:00:00 10 23 145 26 40 38 36 16 29 15 28 33 42 10 22 33 11 16 35 31 45
2024-11-01 00:00:00 20 29 152 17 34 54 65 22 28 7 21 38 52 9 21 15 7 15 37 38 42
2024-12-01 00:00:00 35 26 145 15 36 31 43 19 11 5 21 32 26 11 24 19 14 16 35 23 33

Creating the charts

With pandas’ built-in plot

It turns out that pandas has built-in plotting capabilities, so we do not have to import a charting library.

def set_colors(df):
    global colors
    colors = ['#991111' if col == 'Hollywood' else '#999999' for col in df.columns]
    return df

ax = (df
 .astype({'Date Rptd': 'datetime64[ns]',
          'DATE OCC': 'datetime64[ns]',
          })
 .groupby([pd.Grouper(key='DATE OCC', freq='ME'), 'AREA NAME', 'Crm Cd Desc'])
 .size()
 .unstack()
 .loc[:, 'BURGLARY FROM VEHICLE']
 .unstack()
 .loc[:, lambda adf: sorted(adf.columns, key=lambda col: 1 if col == 'Hollywood' else -1)]
 .pipe(set_colors)
 .plot(color=colors, title='BURGLARY FROM VEHICLE in LA', figsize=(10,6))
)
ax.legend(bbox_to_anchor=(1,1), ncol=2);

With hvplot

HvPlot is one of the few charting libraries that does not require converting the data into a pandas dataframe or a Python list to create charts.

import hvplot.polars

(to_plot
 .pipe(lambda df_: df_.hvplot.line(
       x='DATE OCC',
       y=(cols := [c for c in df_.columns[1:] if c != 'Hollywood'] + ['Hollywood']),
       color=['#999999'] * (len(cols) - 1) + ['#991111'],
       title='BURGLARY FROM VEHICLE in LA',
       legend_opts={'title':'Neighborhood'},
       legend='top_left', legend_cols=3, height=575, width=900))
)

With glyphx

I’ve been experimenting with the Glyphx charting library after hearing about it from Christopher Bailey on The Real Python Podcast. It captured my interest because the charts created with this library are in SVG format. This is handy because you can export the chart and enhance it in a tool like Inkscape.

However, Glyphx currently does not support Polars dataframes. We have to pass a list to the x-axis or y-axis to create the chart.

from glyphx import Figure
from glyphx.series import LineSeries

non_hollywood = [col for col in to_plot.columns[1:] if col != 'Hollywood']

dates = to_plot['DATE OCC'].to_list()

fig = Figure(width=800, height=450).set_title('BURGLARY FROM VEHICLE in LA')

for col in non_hollywood:
    fig.add(LineSeries(dates, to_plot[col].to_list(), color='#bbbbbb', label=col))

fig.add(LineSeries(dates, to_plot['Hollywood'].to_list(), color='red', label='Hollywood'))

fig.set_legend('right').tight_layout()
fig.show();

Other limitations I encountered are:

  1. The circle markers on the lines cannot be removed. Because the data points are very close to each other in the lower part of the chart, the markers make the chart look more like a scatterplot than a line chart.
  2. It’s not possible to display the legend in a three-column layout like in the HvPlot chart. I do not like the long list of neighborhoods, so I ended up listing only the neighborhood of interest in the legend.
from glyphx import Figure
from glyphx.series import LineSeries

non_hollywood = [col for col in to_plot.columns[1:] if col != 'Hollywood']

dates = to_plot['DATE OCC'].to_list()

fig = Figure(width=800, height=450).set_title('BURGLARY FROM VEHICLE in LA')

for col in non_hollywood:
    fig.add(LineSeries(dates, to_plot[col].to_list(), color='#bbbbbb'))

fig.add(LineSeries(dates, to_plot['Hollywood'].to_list(), color='red', label='Hollywood'))

fig.set_legend('right').tight_layout()
fig.show();

Thanks for reading. To improve your data analysis skill using the Polars library, check out my book Deep Analysis with Polars.