Polars as an alternative to Pandas

Daniel Molina Cabrera

QR code

What is Polars?

https://polars.danimolina.net

An alternative to Pandas

https://pola.rs/

Implemented in Rust but usable from Python.
Focused on efficiency and a new API.

But Pandas has already improved

Version 2.0 is much faster thanks to pyarrow.

Polars uses Arrow too.

Polars has a more coherent interface.
Parallelism and query reordering (optimization).
Support for CSV, Excel, Parquet, SQLite, Postgres, …

Some examples

We are short on time, so I will show just a few examples to spark curiosity.

Creating a DataFrame

From a dictionary.

import polars as pl

df1 = pl.DataFrame({'Nombre': ['Daniel', 'Luis', 'Pablo'],
                   'Apellidos': ['Molina', 'Pepe', 'Pérez'],
                   'Puntuación': [8, 9, 7]})
print(df1)

shape: (3, 3)
┌────────┬───────────┬────────────┐
│ Nombre ┆ Apellidos ┆ Puntuación │
│ ---    ┆ ---       ┆ ---        │
│ str    ┆ str       ┆ i64        │
╞════════╪═══════════╪════════════╡
│ Daniel ┆ Molina    ┆ 8          │
│ Luis   ┆ Pepe      ┆ 9          │
│ Pablo  ┆ Pérez     ┆ 7          │
└────────┴───────────┴────────────┘

Creating a DataFrame

From a CSV file.

df = pl.read_csv("adult.csv")
print(df.columns)

['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

Filtering data

Working with data

print(df[['age', 'sex', 'education', 'income']].head())

shape: (5, 4)
┌─────┬─────────┬────────────┬────────┐
│ age ┆ sex     ┆ education  ┆ income │
│ --- ┆ ---     ┆ ---        ┆ ---    │
│ i64 ┆ str     ┆ str        ┆ str    │
╞═════╪═════════╪════════════╪════════╡
│ 39  ┆  Male   ┆  Bachelors ┆  <=50K │
│ 50  ┆  Male   ┆  Bachelors ┆  <=50K │
│ 38  ┆  Male   ┆  HS-grad   ┆  <=50K │
│ 53  ┆  Male   ┆  11th      ┆  <=50K │
│ 28  ┆  Female ┆  Bachelors ┆  <=50K │
└─────┴─────────┴────────────┴────────┘

The select and filter style

df.select('age', 'sex', 'education', 'income').head()

shape: (5, 4)
┌─────┬─────────┬────────────┬────────┐
│ age ┆ sex     ┆ education  ┆ income │
│ --- ┆ ---     ┆ ---        ┆ ---    │
│ i64 ┆ str     ┆ str        ┆ str    │
╞═════╪═════════╪════════════╪════════╡
│ 39  ┆  Male   ┆  Bachelors ┆  <=50K │
│ 50  ┆  Male   ┆  Bachelors ┆  <=50K │
│ 38  ┆  Male   ┆  HS-grad   ┆  <=50K │
│ 53  ┆  Male   ┆  11th      ┆  <=50K │
│ 28  ┆  Female ┆  Bachelors ┆  <=50K │
└─────┴─────────┴────────────┴────────┘

Excluding a column

With col we can use wildcards and regular expressions.

df1.select(pl.col('*').exclude('Nombre'))

shape: (3, 2)
┌───────────┬────────────┐
│ Apellidos ┆ Puntuación │
│ ---       ┆ ---        │
│ str       ┆ i64        │
╞═══════════╪════════════╡
│ Molina    ┆ 8          │
│ Pepe      ┆ 9          │
│ Pérez     ┆ 7          │
└───────────┴────────────┘

Using regular expressions

df.select(pl.col('^[ae].*$')).head()

shape: (5, 3)
┌─────┬────────────┬───────────────┐
│ age ┆ education  ┆ education-num │
│ --- ┆ ---        ┆ ---           │
│ i64 ┆ str        ┆ str           │
╞═════╪════════════╪═══════════════╡
│ 39  ┆  Bachelors ┆  13           │
│ 50  ┆  Bachelors ┆  13           │
│ 38  ┆  HS-grad   ┆  9            │
│ 53  ┆  11th      ┆  7            │
│ 28  ┆  Bachelors ┆  13           │
└─────┴────────────┴───────────────┘

Filtering

Filtering is efficient.

df.select('age', 'sex', 'education', 'income').filter(pl.col('age') > 50).head()

shape: (5, 4)
┌─────┬─────────┬────────────┬────────┐
│ age ┆ sex     ┆ education  ┆ income │
│ --- ┆ ---     ┆ ---        ┆ ---    │
│ i64 ┆ str     ┆ str        ┆ str    │
╞═════╪═════════╪════════════╪════════╡
│ 53  ┆  Male   ┆  11th      ┆  <=50K │
│ 52  ┆  Male   ┆  HS-grad   ┆  >50K  │
│ 54  ┆  Female ┆  HS-grad   ┆  <=50K │
│ 59  ┆  Female ┆  HS-grad   ┆  <=50K │
│ 56  ┆  Male   ┆  Bachelors ┆  >50K  │
└─────┴─────────┴────────────┴────────┘

Selecting columns by type

With selectors we can pick columns by their type.

import polars.selectors as cs

dfl = pl.DataFrame(
    {
        "w": ["xx", "yy", "xx", "yy", "xx"],
        "x": [1, 2, 1, 4, -2],
        "y": [3.0, 4.5, 1.0, 2.5, -2.0],
        "z": ["a", "b", "a", "b", "b"],
    },
)

Selecting columns by type

Selecting by type is easy.

dfl.select(cs.numeric())

shape: (5, 2)
┌─────┬──────┐
│ x   ┆ y    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 3.0  │
│ 2   ┆ 4.5  │
│ 1   ┆ 1.0  │
│ 4   ┆ 2.5  │
│ -2  ┆ -2.0 │
└─────┴──────┘

Grouping data

Computing means

df.group_by('sex').agg(pl.col('age').mean().alias('mean_age'))

shape: (3, 2)
┌─────────┬───────────┐
│ sex     ┆ mean_age  │
│ ---     ┆ ---       │
│ str     ┆ f64       │
╞═════════╪═══════════╡
│  Male   ┆ 39.433547 │
│ null    ┆ null      │
│  Female ┆ 36.85823  │
└─────────┴───────────┘

Grouping by type

With selectors we can group by type.

dfl.group_by(cs.string()).agg(cs.numeric().sum())

shape: (3, 4)
┌─────┬─────┬─────┬──────┐
│ w   ┆ z   ┆ x   ┆ y    │
│ --- ┆ --- ┆ --- ┆ ---  │
│ str ┆ str ┆ i64 ┆ f64  │
╞═════╪═════╪═════╪══════╡
│ xx  ┆ a   ┆ 2   ┆ 4.0  │
│ yy  ┆ b   ┆ 6   ┆ 7.0  │
│ xx  ┆ b   ┆ -2  ┆ -2.0 │
└─────┴─────┴─────┴──────┘

That's all

I hope I have sparked some curiosity about Polars.