pandas Wishes

pandas tidyverse Spark Python R

Features that I miss in pandas coming from the tidyverse.

Bruno Testaguzza Carlin https://twosidesdata.netlify.app/
2022-02-28

TLDR

Here are some points that I think pandas should work on to achieve its vision of being the best data manipulation tool out there.

Setup

Libraries in R

Libraries in Python

import pandas as pd
import numpy as np

The Data: Penguins!

Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/

I chose the Palmer Penguins dataset. Some call it the new iris, after the iris dataset was rightly criticized for its eugenics issues. You can read more about that here: Iris dataset retirement.

What does the dataset look like?

penguins <- palmerpenguins::penguins

head(penguins)
# A tibble: 6 x 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm
  <fct>   <fct>              <dbl>         <dbl>             <int>
1 Adelie  Torgersen           39.1          18.7               181
2 Adelie  Torgersen           39.5          17.4               186
3 Adelie  Torgersen           40.3          18                 195
4 Adelie  Torgersen           NA            NA                  NA
5 Adelie  Torgersen           36.7          19.3               193
6 Adelie  Torgersen           39.3          20.6               190
# ... with 3 more variables: body_mass_g <int>, sex <fct>, year <int>
glimpse(penguins, width = 50)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie~
$ island            <fct> Torgersen, Torgersen, ~
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, ~
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, ~
$ flipper_length_mm <int> 181, 186, 195, NA, 193~
$ body_mass_g       <int> 3750, 3800, 3250, NA, ~
$ sex               <fct> male, female, female, ~
$ year              <int> 2007, 2007, 2007, 2007~

The dataset consists of 344 samples of three species of Penguins.

What do they look like?

The three species have different characteristics.

Artwork by @allison_horst

And what the hell are bill length and depth?

Artwork by @allison_horst

Interaction among variables and species

Credit to Allison once again

Penguin mass vs. flipper length

As expected from the picture, the Gentoo species is a bit bigger, and we can see a strong correlation between flipper length and weight.

Bill length vs. depth

As we saw in the picture, the Adelie has shorter bills.
The Gentoo has much more elongated ones than the Adelie, but they look 'flatter,' which the dataset quantifies as a smaller bill depth.
The Chinstrap sits on the bigger side of both metrics.

Impact of sex on size

My experience

At this point, I have worked with both Python and R for more than 4 years. Python has been my language of choice when interacting with colleagues, simply because of culture and better support on platforms such as AWS, but it is relevant to know that R was my first data language (I dabbled in some Excel macros before it) and I have used it since graduating more than 7 years ago.

The tasks of the pandas framework

Firstly, I call it a framework because pandas comes bundled with NumPy and other friends like matplotlib or seaborn. It is fair to compare them as a bundle since the competition is also split among many packages. I will borrow a picture from the tidyverse to explain where pandas stands in the analysis workflow.

Credit to https://oliviergimenez.github.io/

pandas is responsible for the Import -> Tidy -> Transform part.

But it can sometimes reach a little bit into Visualize, delegating when necessary. If your communication task is a graph, you can use matplotlib or seaborn, and there are also some simple APIs inside pandas that can plot basic graphs.
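For instance, here is a minimal sketch of that built-in plotting API, using the penguins DataFrame that gets loaded in the pandas section later in the post (column names are the ones from the dataset):

penguins.plot.scatter(x='flipper_length_mm', y='body_mass_g')  # delegates to matplotlib under the hood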

If you are importing from parquet, pandas sometimes calls on the pyarrow package.

There are many other examples, but I think of pandas as the central player in the data workflow of the Python ecosystem. It is unmatched in its number of methods, and therefore it should be expected that some things could use improvement. That is the point of this post: I want pandas to change for the better, so I want to discuss why pandas is in its current state and what I would like to see instead.

The current landscape

In R, I have used three different frameworks for dealing with data, which some may consider a problem. Still, there is a crucial difference when comparing Python and R: in R you will usually find competition, and it happens in every sphere.

Modeling tidymodels vs. mlr.
Visualization plotly vs. ggplot2.
Statistical tests sometimes have tens of different implementations from all over the world.

There are many other examples, and as an economist, I look favorably on this competition. It has fostered fast, well-tested change on multiple occasions as packages compete.

The timeline

I use the dates from PyPI and CRAN, but most packages get released earlier on mailing lists or private repos.

First, Base R was created in the late 90s, but with many more years of baggage from the S era.
Then pandas started development around 2005 and got open-sourced in 2008. It is a game-changer and vastly more declarative and faster than Base R.
data.table comes along on the premise that speed matters a lot. Remember that this was 2006, when the single-big-machine paradigm was the be-all and end-all of big data analysis. Maybe it wasn't even called big data back then.
The tidyverse releases dplyr on the back of the massive success of ggplot2. It wasn’t even called the tidyverse back then.
And finally, Koalas got created in 2019 to help pandas scale with PySpark. It eventually gets incorporated back into PySpark as the pyspark.pandas API.

Base R

The oldest and most of the time slowest way, base R has scared many people away from R. You will see some nasty behaviors here, like the famous partial string matching or the very cryptic function names and arguments. Base R feels dated because it is dated: now more than 20 years old, this framework inspired pandas, but you need some patience if you use it. There is an excellent book about all of Base R's little details that I highly recommend, called The R Inferno.

Base R implements some design ideas that users of the pandas ecosystem will recognize, like the dreadful row names that pandas turned into indexes, and using square brackets for 2d manipulation of data.

Some details of R for the Python folks: R doesn't expect you to know what a pointer is, because R doesn't expect you to be a regular programmer. Base R envisions a statistician with some thousand lines of beautiful math that got turned into a package. This means that if you assign a copy of a data.frame to a new name, R initially creates a pointer, and only once you modify the object in a destructive way does R automatically copy it into a new one. There is no need to keep manually using the copy method as in pandas.
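To make the contrast concrete, here is a small pandas sketch (using the pandas penguins DataFrame that gets loaded later in the post): assigning into a filtered frame can trigger SettingWithCopyWarning, so the defensive .copy() is on you.

adelie = penguins[penguins.species == 'Adelie']               # may warn if you assign into it later
adelie_safe = penguins[penguins.species == 'Adelie'].copy()   # explicit copy, safe to modify
adelie_safe['body_mass_kg'] = adelie_safe['body_mass_g'] / 1000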

Data Prep

penguins_base <- penguins |> as.data.frame()
row.names(penguins_base) <- str_c("penguin_", row_number(penguins_base$species))

head(penguins_base) |> knitr::kable()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
penguin_1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
penguin_2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
penguin_3 Adelie Torgersen 40.3 18.0 195 3250 female 2007
penguin_4 Adelie Torgersen NA NA NA NA NA 2007
penguin_5 Adelie Torgersen 36.7 19.3 193 3450 female 2007
penguin_6 Adelie Torgersen 39.3 20.6 190 3650 male 2007

Slicing

Now let’s use some base R

You can filter in a 2d manner with df[x, y], just like pandas' .loc.

penguins_base["penguin_1", ]
          species    island bill_length_mm bill_depth_mm
penguin_1  Adelie Torgersen           39.1          18.7
          flipper_length_mm body_mass_g  sex year
penguin_1               181        3750 male 2007
penguins_base[c("penguin_1", "penguin_10"), c("species", "island")]
           species    island
penguin_1   Adelie Torgersen
penguin_10  Adelie Torgersen

It even feels like pandas all the way down to it randomly deciding to change my data types.

penguins_base[c("penguin_1", "penguin_10"), c("species")]
[1] Adelie Adelie
Levels: Adelie Chinstrap Gentoo

Yep, we just fell out of a data.frame straight into a vector of the factor class, fantastic.

Complex manipulation

Let's try to get the mean body mass in kg of the females by species.

Doing complex calculations in base R feels like a chore, but some functions that work like apply on steroids usually help.

You will also rely heavily on saving intermediate data frames unless you overwrite the original, or you cheat a little bit and use pipes (I don't even think of it as cheating anymore, since pipes come natively with R as of the 4.1 release). As I will explain, the tidyverse is a superset of base, meaning it can be used inside the Base R workflow, and it can borrow functions from base as well. Some may call this modern Base R code, as it would not run on versions of R earlier than 4.1, which also enables shorter anonymous functions using \(x) instead of function(x).

results_base <- penguins_base[penguins_base["sex"] == "female", ] |>
  with(aggregate(x = body_mass_g, by = list(species), FUN = \(x) mean(x) / 1000))

results_base
    Group.1        x
1    Adelie 3.368836
2 Chinstrap 3.527206
3    Gentoo 4.679741

Nice-looking table. It is hell to use multiple functions, but if what you are doing is simple, Base R gets the job done with no imports… if you still care about that.

data.table

The motto here is gotta go fast

It goes as far as naturally parallelizing execution across local cores when possible. Some love the syntax; I honestly think it is the worst of all the options, but when speed on a single machine is relevant (something I encounter less and less, as we will discuss later), data.table really shines, outperforming just about anything I have ever used in Python or R.

Data Prep

penguins_data_table <- as.data.table(penguins)

Complex manipulation

penguins_data_table[, list(species, body_mass_g)]
       species body_mass_g
  1:    Adelie        3750
  2:    Adelie        3800
  3:    Adelie        3250
  4:    Adelie          NA
  5:    Adelie        3450
 ---                      
340: Chinstrap        4000
341: Chinstrap        3400
342: Chinstrap        3775
343: Chinstrap        4100
344: Chinstrap        3775
penguins_data_table[species %in% c("Adelie", "Gentoo") & sex == "female", list(species, body_mass_g)][, lapply(.SD, mean, na.rm = TRUE), species]
   species body_mass_g
1:  Adelie    3368.836
2:  Gentoo    4679.741

It produces these magical one-liners with speed to spare. The problem is that I can barely glimpse what I did here, as almost all of the execution depends on you remembering this model behind the scenes.

DT[i, j, by]

i = where | order by
j = select | update
by = group by

And trust me when I say it gets complicated. data.table is Turing complete, as all the options here are, and it performs all of the functions of either dplyr or pandas with just three arguments! That produces some of the most confusing pieces of code you will ever read. At least the data.table team killed the idea of row names as well.

dplyr / the tidyverse

Storybench picture of the tidyverse

My clear favorite. In a perfect world, everyone would know the tidyverse for the power it brings to expressing ideas about data with straightforward declarative syntax. This is very much an empowered version of SQL. A nice thing I already showed in the base R part is that the tidyverse is only a part of the R ecosystem, meaning you can take your old statistical operations and just plug them into place. I will further detail how easy it is to develop an extension for the tidyverse, but first, let's see some syntax.

Simple Manipulation

penguins |>
  filter(
    species %in% c("Adelie", "Gentoo"),
    sex == "female"
  ) |>
  group_by(species) |>
  summarise(body_mass_g_to_kg = mean(body_mass_g) / 1000)
# A tibble: 2 x 2
  species body_mass_g_to_kg
  <fct>               <dbl>
1 Adelie               3.37
2 Gentoo               4.68

This is an example of what dplyr can do while remaining very similar to English. You can opt into named arguments that are very well thought out, some of which have even gone through Twitter polls; the team at RStudio clearly thinks about usage and is willing to redesign old parts of the system to reach new usability levels.

Complex Manipulation

result_tidyverse <- penguins |>
  select(-year) |>
  filter(
    species %in% c("Adelie", "Gentoo"),
    sex == "female"
  ) |>
  group_by(species) |>
  select(where(is.numeric)) |>
  summarise(across(
    .cols = where(\(x) mean(x) > 188),
    .fns = list(median = median, mean = mean),
    .names = "{.fn}-{.col}"
  )) |>
  mutate(across(
    .cols = ends_with("_g"),
    .fns = list(to_kg = \(x) x / 1000),
    .names = "{.col}-{.fn}"
  ))

result_tidyverse |> rmarkdown::paged_table()

This is incredibly similar to my real-world use of data manipulation. Some functions are simple, like converting a business metric that reads better in kg instead of g, while others empower you to write incredibly succinct syntax that feels like a superpower for your laziness. You start writing smaller and smaller code to deal with more and more complex problems. I realize that most of what is in here exists solely in the tidyverse (for now) and that newcomers may not understand somewhat complex functions like across the first time they try to use them. Still, it is such a game-changer that functions like across exist; the alternatives are writing out tens of column names by hand, or pre-computing a list, as I will show in pandas.

Also, you can see how the tidyverse easily interacts with custom functions. The anonymous function gets placed right into the middle of the pipeline without a custom method or any other fancy workaround, and it just builds upon what R offers.

One drawback is that your code feels 'old' pretty fast in the tidyverse ecosystem. In this example alone, the |> operator (called a pipe), the new anonymous function syntax, and the across and where functions from the package are all less than two years old.

The tidyverse can also take this code written in R and, with only a connection to a data source, compile it into code for another language. It is mainly SQL code, but this is super helpful, as I will show later when discussing what I wish pandas implemented.

pandas

pandas

I usually think of pandas as a project to copy into Python what worked in other languages, mainly what people call base R and some SQL, and it is hugely successful. Usage is most of the time a joyful experience; it is not the prettiest, but it gets the job done.

pandas is flexible enough that you can write the same code and make it feel like Base R or the tidyverse depending on which methods you choose; for example, going heavy into indexing gives you the Base R style, while method chaining feels closer to the tidyverse.

We can read its mission on the page

Mission
pandas aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open-source data analysis/manipulation tool available in any language.

That is quite a greedy statement, and I love it. pandas should strive for perfection, power, and flexibility, but let's look at some current limitations and quirks that I personally dislike. That doesn't mean you can't like them.

Set up data

We can easily import the penguins dataset by reading the CSV straight from the repository.

penguins = pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')

Slicing

Using loc, you get very close to the base R philosophy.

df_result = penguins.loc[penguins.species.isin(['Adelie','Chinstrap']),['species','sex','body_mass_g']]

df_result
       species     sex  body_mass_g
0       Adelie    male       3750.0
1       Adelie  female       3800.0
2       Adelie  female       3250.0
3       Adelie     NaN          NaN
4       Adelie  female       3450.0
..         ...     ...          ...
339  Chinstrap    male       4000.0
340  Chinstrap  female       3400.0
341  Chinstrap    male       3775.0
342  Chinstrap    male       4100.0
343  Chinstrap  female       3775.0

[220 rows x 3 columns]

But pandas gets ahead of itself and starts changing the data types depending on the parameters, so what started out as a DataFrame may sometimes come back as a Series… You can avoid this behavior by passing lists, as in the first example (see the small sketch after the output below). This is similar to the data.table way, and it baffles me why it is even a possibility. It overloads slicing with a second job as an object-type manipulator: common to both pandas and data.frame, slicing is this super powerful method that may return wildly different results depending on a very small change.

df_result = penguins.loc[penguins.species.isin(['Adelie','Chinstrap']),'species']

df_result
0         Adelie
1         Adelie
2         Adelie
3         Adelie
4         Adelie
         ...    
339    Chinstrap
340    Chinstrap
341    Chinstrap
342    Chinstrap
343    Chinstrap
Name: species, Length: 220, dtype: object
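For reference, a small sketch of the list workaround mentioned above: wrapping the single column name in a list keeps the result a DataFrame.

df_result = penguins.loc[penguins.species.isin(['Adelie','Chinstrap']), ['species']]  # still a DataFrame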

Simple Manipulation

pandas (with a bit of NumPy) has a syntax super similar to the tidyverse if you opt into method chaining.


penguins\
  .query("species in ('Adelie', 'Gentoo')")\
  .groupby('species')\
  .agg({'body_mass_g': lambda x: np.mean(x)/1000})


# The dplyr equivalent (without the sex filter) for comparison:
# penguins |>
#   filter(species %in% c('Adelie', 'Gentoo')) |>
#   group_by(species) |>
#   summarise(body_mass_g_to_kg = mean(body_mass_g)/1000)
         body_mass_g
species             
Adelie      3.700662
Gentoo      5.076016

Complex Manipulations

pandas has this tendency to create more and more indexes; reset_index will quickly become your go-to solution, and when we add hierarchical indexes to the mix, you are going to be copying and pasting answers from Stack Overflow to flatten the data you created, or you will need some sophisticated indexing operation to get specific results back.
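A typical example of that dance (a sketch): the grouped aggregation pushes species into the index, and reset_index brings it back as an ordinary column.

penguins.groupby('species')['body_mass_g'].mean()                 # species ends up in the index
penguins.groupby('species')['body_mass_g'].mean().reset_index()   # back to a plain column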

Another detail is that I need to manually drop columns before the groupby operation, and this sucks because no dtype-based selection can keep species (our grouping variable) while excluding island and sex, since all three are strings.

penguins\
  .drop(columns = 'year')\
  .query("species in ('Adelie', 'Gentoo')")\
  .groupby('species')\
  .select_dtypes('numeric')
Error in py_call_impl(callable, dots$args, dots$keywords): AttributeError: 'DataFrameGroupBy' object has no attribute 'select_dtypes'

Detailed traceback:
  File "<string>", line 1, in <module>
  File "C:\Users\bruno\AppData\Local\R-MINI~1\lib\site-packages\pandas\core\groupby\groupby.py", line 904, in __getattr__
    raise AttributeError(

So instead I need to pre-select my columns.

result_hierarchical = penguins\
  .drop(columns = ['year','island','sex'])\
  .query("species in ('Adelie', 'Gentoo')")\
  .groupby('species')\
  .agg([np.mean,np.median])

result_hierarchical
        bill_length_mm        bill_depth_mm  ... flipper_length_mm  body_mass_g        
                  mean median          mean  ...            median         mean  median
species                                      ...                                       
Adelie       38.791391   38.8     18.346358  ...             190.0  3700.662252  3700.0
Gentoo       47.504878   47.3     14.982114  ...             216.0  5076.016260  5000.0

[2 rows x 8 columns]

To access the body_mass_g columns and transform them to kg’s, I need to deal with the index system without using the loc method.

result_hierarchical["body_mass_g"] /1000
             mean  median
species                  
Adelie   3.700662     3.7
Gentoo   5.076016     5.0

So you might just think, OK, simple, I just need to assign it back.

result_hierarchical["body_mass_g"] = result_hierarchical["body_mass_g"] /1000

and it works, if you don't mind losing the original data. But try to create a new column name instead:

result_hierarchical["body_mass_g_back_to_g"] = result_hierarchical["body_mass_g"] * 1000
Error in py_call_impl(callable, dots$args, dots$keywords): ValueError: Expected a 1D array, got an array with shape (2, 2)

Detailed traceback:
  File "<string>", line 1, in <module>
  File "C:\Users\bruno\AppData\Local\R-MINI~1\lib\site-packages\pandas\core\frame.py", line 3645, in __setitem__
    self._set_item_frame_value(key, value)
  File "C:\Users\bruno\AppData\Local\R-MINI~1\lib\site-packages\pandas\core\frame.py", line 3788, in _set_item_frame_value
    self._set_item_mgr(key, arraylike)
  File "C:\Users\bruno\AppData\Local\R-MINI~1\lib\site-packages\pandas\core\frame.py", line 3802, in _set_item_mgr
    self._mgr.insert(len(self._info_axis), key, value)
  File "C:\Users\bruno\AppData\Local\R-MINI~1\lib\site-packages\pandas\core\internals\managers.py", line 1235, in insert
    raise ValueError(

Infuriating. You can go on an SO hunt to find the right approach to keep the indexes, but at this point, I am done with pandas indexing and just cheat my way to the result with a flattened data frame.
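For the record, one SO-style approach that does preserve the hierarchy is to address the new column by its full tuple key; a sketch, reusing the names from the output above:

result_hierarchical[('body_mass_g_back_to_g', 'mean')] = result_hierarchical[('body_mass_g', 'mean')] * 1000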

# Flatten MultiIndex columns
result_hierarchical.columns = ["_".join(col_name).rstrip('_') for col_name in result_hierarchical.columns.to_flat_index()]

result_hierarchical['body_mass_g_median_back_to_g'] = result_hierarchical['body_mass_g_median'] * 1000


result_hierarchical['body_mass_g_median']
species
Adelie    3.7
Gentoo    5.0
Name: body_mass_g_median, dtype: float64
result_hierarchical['body_mass_g_median_back_to_g']
species
Adelie    3700.0
Gentoo    5000.0
Name: body_mass_g_median_back_to_g, dtype: float64

PySpark/Koalas

This framework is delightful to work with, mainly because you can go back and forth between 4 APIs: SQL, Spark, Koalas, and pandas, and chances are one of them has an excellent approach to your problem. This post would deviate too much if I talked in depth about this framework, but it certainly has its place on the big data manipulation side, with APIs that are sometimes superior to what pandas can offer. I will touch on laziness when discussing what I wish pandas implemented. Also, PySpark really struggles with indexing, as it should, and Koalas allows for some crazy distributed index rules.
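As a taste, here is a minimal sketch of the pyspark.pandas API (available since Spark 3.2), assuming a running Spark session and the pandas penguins DataFrame from above:

import pyspark.pandas as ps

psdf = ps.from_pandas(penguins)                  # promote the local pandas DataFrame to Spark
psdf.groupby('species')['body_mass_g'].mean()    # same pandas-style syntax, executed by Spark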

You must understand that any speed comparison among packages at the individual personal computer level becomes irrelevant as soon as you have access to Spark clusters. This is how you can query billions of records with ease: not by having slightly faster performance on a single machine, but by leveraging the horizontal scaling capacity of big companies such as Amazon, Google, and Microsoft.

Problems with pandas

I classify the problems with the pandas framework into two kinds. The first, and honestly the simplest to explain, are things that it implements and I think it shouldn't. Many pandas problems come from what it copied from base R, which is why you needed to understand the timeline of the packages at the beginning of this post. The second are the features that recent packages in both the R and Python spheres have brought to the table in recent years, features that didn't exist when pandas was being created and that I hope the pandas team (or some other package) will eventually integrate into the pandas ecosystem.

What pandas has and shouldn't

Here are the things that I hope pandas makes more straightforward to get away from, or ditches altogether.

Indexing

This wasn't even a feature back in the SQL days. Indexing comes from languages that tried to facilitate the… indexing of data, so you could play a version of Battleship on your data. The idea sounds suitable, and it works for mostly simple indexes, data[x,y] if you can picture it, but even the row index, what R calls row.names, is, in my opinion, a mistake. It adds unnecessary clutter to what could easily have been one more column, so that you can define all datasets, or at least all structured datasets, as the tidy data diagram shows.

The id column in this example is treated as just one more column, and it doesn't need any special treatment, as we will discuss in the slicing section.
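A tiny sketch of the point, with hypothetical toy data: the id stays an ordinary column and needs no special machinery, while the index route forces you through set_index and loc.

df = pd.DataFrame({'id': ['penguin_1', 'penguin_2'], 'body_mass_g': [3750, 3800]})
df[df['id'] == 'penguin_1']            # filter id like any other column
df.set_index('id').loc['penguin_1']    # the index-based alternative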

MultiIndex

This is the worst of the bunch. If data[x,y] is debatable and maybe human-readable, or excel-friendly as I like to call it, MultiIndex feels like the epitome of unnecessary structure. MultiIndex burdens you with a new data type, more complex than the one you started with. It is hardly an easy feature from a usability standpoint, as multiple SO threads will show; it causes confusion, generates code whose only purpose is to get back to a flatter structure, and it only exists in pandas and its copycats. For a good reason: this format is impossible to translate into common file structures such as CSV or xlsx, and it can't be passed back to a SQL server or Spark. It is too much for what could have been a flatter structure with more names, or a nested return dataframe.

The Solution

Two types of functions suffer from a practical/usability standpoint because of the Index concept.

The first are the functions that don't work out of the box because of indexing. I am not talking about functions that have bad names or arguments because of indexing, like sort_values (not sort, because god forbid sort messed with the Index and you didn't realize it). The biggest issue comes from functions that almost always arrive accompanied by reset_index(), or that need some internal argument like ignore_index, as concat does.

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 2, 3,4])

df2 = pd.DataFrame({'A2': ['A4', 'A5', 'A6', 'A7'],
                    'D2': ['D4', 'D5', 'D6', 'D7']},
                    index=[ 5, 6, 7,3])

How is this the default behavior?

pd.concat([df1,df2],axis =1)
     A    D   A2   D2
0   A0   D0  NaN  NaN
2   A1   D1  NaN  NaN
3   A2   D2   A7   D7
4   A3   D3  NaN  NaN
5  NaN  NaN   A4   D4
6  NaN  NaN   A5   D5
7  NaN  NaN   A6   D6

instead of this

pd.concat([df1.reset_index(),df2.reset_index()],axis =1)
   index   A   D  index  A2  D2
0      0  A0  D0      5  A4  D4
1      2  A1  D1      6  A5  D5
2      3  A2  D2      7  A6  D6
3      4  A3  D3      3  A7  D7

Maybe I am expecting too much from concat, and there should be another method to do what the tidyverse calls bind_cols, an operation where you combine two dfs with the same number of rows. But still, where is this magical function? It can't possibly exist under the current Index-dominated philosophy, because the index would have to be ignored entirely.
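Here is a sketch of what such a bind_cols-style helper could look like (the name and signature are made up): bind columns by position and ignore both indexes.

def bind_cols(*dfs):
    # reset every index so rows are matched purely by position
    return pd.concat([df.reset_index(drop=True) for df in dfs], axis=1)

bind_cols(df1, df2)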

The second are the functions that give you back a deeper, more nested index than the one you started with. My solution is to provide functions that always return a dataframe with the same level of flatness, either with named returns, like how dplyr's across uses the placeholders .col and .fn, or with nested data instead of nested indexes. These functions can still rely on indexes on the back-end, but at a higher level, let the user decide how to handle naming or whether they want a nested return.

The Solution

Define a function to deal with the index.

def agg_no_index(df, arg_to_agg):
    # aggregate, then flatten the resulting MultiIndex columns into plain names
    df = df.agg(arg_to_agg)
    df.columns = ["_".join(col_name).rstrip('_') for col_name in df.columns.to_flat_index()]
    return df

Put it into a pipe.

penguins\
  .drop(columns = ['year','island','sex'])\
  .query("species in ('Adelie', 'Gentoo')")\
  .groupby('species')\
  .pipe(agg_no_index,arg_to_agg = [np.sum,np.mean])
         bill_length_mm_sum  ...  body_mass_g_mean
species                      ...                  
Adelie               5857.5  ...       3700.662252
Gentoo               5843.1  ...       5076.016260

[2 rows x 8 columns]

Create some functions like the one I just showed and make them available as methods, so that we don't have to use the subpar pipe method (also, why isn't pipe some symbol?).
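One way such helpers could be exposed today is through pandas' accessor registration API; here is a sketch where the flat namespace and the grouped_agg name are entirely made up:

@pd.api.extensions.register_dataframe_accessor('flat')
class FlatAccessor:
    def __init__(self, pandas_obj):
        self._df = pandas_obj

    def grouped_agg(self, by, funcs):
        # group, aggregate, then flatten the MultiIndex columns into plain names
        out = self._df.groupby(by).agg(funcs)
        out.columns = ['_'.join(col).rstrip('_') for col in out.columns.to_flat_index()]
        return out

penguins.drop(columns=['year', 'island', 'sex']).flat.grouped_agg('species', ['mean', 'median'])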

Slicing

Slicing suffers from the sins of indexing, and it feels like there are tens of different ways to do the same thing. Just look at how many ways pandas covers on its own site.

https://pandas.pydata.org/docs/user_guide/indexing.html#

And I am sure some people have created even more complicated ones out there.

The Solution

Assume that all data will be in a tidy format and just use two kinds of operations: select to get your columns and where/filter/query to get your rows.
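A sketch of that two-operation style in today's pandas, staying index-free:

penguins.filter(regex='mm$')             # 'select': columns whose names end in mm
penguins.query("species == 'Adelie'")    # 'where': rows that match a condition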

What pandas lacks and should have

Better Selectors

As shown in the select_dtypes fiasco with groupby, pandas needs to step up its selecting game.

dplyr supports all manner of straightforward ways to select data, like regular expressions.

penguins |>
  group_by(sex) |>
  select(matches("mm$"))
# A tibble: 344 x 4
# Groups:   sex [3]
   sex    bill_length_mm bill_depth_mm flipper_length_mm
   <fct>           <dbl>         <dbl>             <int>
 1 male             39.1          18.7               181
 2 female           39.5          17.4               186
 3 female           40.3          18                 195
 4 <NA>             NA            NA                  NA
 5 female           36.7          19.3               193
 6 male             39.3          20.6               190
 7 female           38.9          17.8               181
 8 male             39.2          19.6               195
 9 <NA>             34.1          18.1               193
10 <NA>             42            20.2               190
# ... with 334 more rows

Or data type inference that actually works post grouped operations.

penguins |>
  group_by(sex) |>
  select(where(is.numeric))
# A tibble: 344 x 6
# Groups:   sex [3]
   sex    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>           <dbl>         <dbl>             <int>       <int>
 1 male             39.1          18.7               181        3750
 2 female           39.5          17.4               186        3800
 3 female           40.3          18                 195        3250
 4 <NA>             NA            NA                  NA          NA
 5 female           36.7          19.3               193        3450
 6 male             39.3          20.6               190        3650
 7 female           38.9          17.8               181        3625
 8 male             39.2          19.6               195        4675
 9 <NA>             34.1          18.1               193        3475
10 <NA>             42            20.2               190        4250
# ... with 334 more rows, and 1 more variable: year <int>

It can even go as far as to accept custom functions that return Boolean values, although it may require some protection from function errors.

safe_sum <- \(original_col) safely(\(x) sum(x, na.rm = TRUE))(original_col)$result

penguins |>
  group_by(sex) |>
  select(where(\(x) isTRUE(safe_sum(x) > 100))) |>
  select(!year)
# A tibble: 344 x 5
# Groups:   sex [3]
   sex    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>           <dbl>         <dbl>             <int>       <int>
 1 male             39.1          18.7               181        3750
 2 female           39.5          17.4               186        3800
 3 female           40.3          18                 195        3250
 4 <NA>             NA            NA                  NA          NA
 5 female           36.7          19.3               193        3450
 6 male             39.3          20.6               190        3650
 7 female           38.9          17.8               181        3625
 8 male             39.2          19.6               195        4675
 9 <NA>             34.1          18.1               193        3475
10 <NA>             42            20.2               190        4250
# ... with 334 more rows

Yes, it may require some workarounds compared to safer functions like is.numeric, but the point is that it is possible to do it in the tidyverse.
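For comparison, a rough pandas counterpart (a sketch; it is not grouping-aware like the dplyr version): build a boolean mask over the columns and select with .loc.

keep = penguins.apply(lambda col: bool(pd.api.types.is_numeric_dtype(col) and col.sum() > 100))
penguins.loc[:, keep]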

True Connection to a data source

What I mean by a true connection is that there are packages out there, like siuba, with the same syntax as dplyr, that can be extended to write queries directly against your data source, without changing your regular in-memory code. This is also possible in tools such as Tableau and Power Query, and it drastically increases the potential of writing cross-platform code. I don't mean writing .sql() code in PySpark or similar; I mean something like this example using sparklyr + DBI on an in-memory Spark cluster.

Set up fake cluster

library(sparklyr)

sc <- spark_connect(master = "local")

Now I can send my penguins dataset to the fake cloud.

tbl_penguins <- copy_to(sc, penguins, "spark_penguins")

Writing dynamic calls to a data source

Let's quickly show what dplyr can do now that it has access to a Spark connection. It defaults to lazy evaluation, as R often does, so queries don't truly return unless you use collect().

return_lazy <- tbl_penguins |>
  mutate(bill_depth_mm**2)

class(return_lazy)
[1] "tbl_spark" "tbl_sql"   "tbl_lazy"  "tbl"      
return_lazy |> nrow()
[1] NA

As you can see, my table is no longer a dataframe, so nrow won’t work unless I bring it into local memory.

return_lazy |>
  collect() |>
  nrow()
[1] 344

What is impressive is that you can easily use the cluster's resources, since what dplyr is doing is writing SQL queries for a Spark cluster, in this case my own machine, but it could have been a cluster with hundreds of executors ready to quickly execute my code. It is like Google Translate for code!

return_lazy |>
  show_query()
<SQL>
SELECT `species`, `island`, `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `body_mass_g`, `sex`, `year`, POWER(`bill_depth_mm`, 2.0) AS `bill_depth_mm^2`
FROM `spark_penguins`

On a side note, writing SQL queries for the target data source is not the limit of the integration between sparklyr and dplyr. It can even go as far as using some of Spark's more powerful features, like its ML capabilities, or invoking Spark-specific functions.

return_lazy |>
  select(sex, where(is.numeric)) |>
  na.omit() |>
  ml_logistic_regression(sex ~ .)
Formula: sex ~ .

Coefficients:
      (Intercept)    bill_length_mm     bill_depth_mm 
     -0.018096739      -0.107405083      -1.043594480 
flipper_length_mm       body_mass_g              year 
      0.034393443      -0.005437584       0.023328984 
  bill_depth_mm^2 
     -0.028100181 

It has this painless integration between Spark and the tidyverse, which is better, IMHO, than even what PySpark has right now, and miles ahead of what pandas.api has been able to achieve for my own usage so far.

Conclusion

While pandas is the biggest dog on the block right now in terms of in-memory data manipulation, it still has some points where it can copy from the competition, which could make pandas even better or more flexible.

References

  1. Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/
  2. The R Inferno
  3. SO solution for indexes
  4. Querying SQL with siuba
  5. sparklyr guide

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/brunocarlin/carlin, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Carlin (2022, Feb. 28). Bruno Testaguzza Carlin blog: pandas Wishes. Retrieved from https://carlin-blog.netlify.app/posts/2022-02-28-pandaswishes/

BibTeX citation

@misc{carlin2022pandas,
  author = {Carlin, Bruno Testaguzza},
  title = {Bruno Testaguzza Carlin blog: pandas Wishes},
  url = {https://carlin-blog.netlify.app/posts/2022-02-28-pandaswishes/},
  year = {2022}
}