Features that I miss in pandas coming from the tidyverse.
Here are some points that I think pandas should work on to achieve its vision of being the best data manipulation tool out there.
import pandas as pd
import numpy as np
I chose the Palmer Penguins dataset. Some call it the new iris, after the iris dataset was rightly criticized for its eugenics ties. You can read more about it here: Iris dataset retirement.
# A tibble: 6 x 8
species island bill_length_mm bill_depth_mm flipper_length_mm
<fct> <fct> <dbl> <dbl> <int>
1 Adelie Torgersen 39.1 18.7 181
2 Adelie Torgersen 39.5 17.4 186
3 Adelie Torgersen 40.3 18 195
4 Adelie Torgersen NA NA NA
5 Adelie Torgersen 36.7 19.3 193
6 Adelie Torgersen 39.3 20.6 190
# ... with 3 more variables: body_mass_g <int>, sex <fct>, year <int>
glimpse(penguins, width = 50)
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie~
$ island <fct> Torgersen, Torgersen, ~
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, ~
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, ~
$ flipper_length_mm <int> 181, 186, 195, NA, 193~
$ body_mass_g <int> 3750, 3800, 3250, NA, ~
$ sex <fct> male, female, female, ~
$ year <int> 2007, 2007, 2007, 2007~
The dataset consists of 344 samples of three species of Penguins.
The three species have different characteristics.
And what the hell are bill length and depth?
As expected from the picture, the Gentoo species is a bit bigger, and we can see a strong correlation between flipper length and weight.
As we saw in the picture, the Adelie has shorter bills.
The Gentoo has much more elongated ones than the Adelie, but they look ‘flatter,’ which the dataset quantifies as a smaller bill depth.
The Chinstraps sit on the bigger side of both metrics.
At this point, I have worked for more than 4 years with both Python and R. Python has been my language of choice when collaborating with colleagues, simply because of culture or better support on platforms such as AWS, but it is relevant to know that R was my first data language (I dabbled in some Excel macros before), and I have used it since graduation more than 7 years ago.
Firstly, I call it a framework because pandas comes bundled with NumPy and other friends like matplotlib or seaborn. It is fair to compare them as a bundle since the competition is also split among many packages. I will borrow a picture from the tidyverse to explain where pandas stands in the analysis workflow.
pandas is responsible for the Import -> Tidy -> Transform part.
But it can sometimes reach a little bit into Visualize by delegating when necessary. If your communication task is a graph, you can use matplotlib or seaborn, and there are also some simple APIs inside pandas that can plot basic graphs.
If you are importing from parquet, pandas sometimes calls on the pyarrow package.
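A minimal sketch of that delegation (the parquet file name below is hypothetical):

import pandas as pd

penguins = pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')

# the built-in plot API hands the actual drawing off to matplotlib
penguins.plot.scatter(x='flipper_length_mm', y='body_mass_g')

# and read_parquet can delegate the parsing to pyarrow (file name is made up)
# pd.read_parquet('penguins.parquet', engine='pyarrow')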
There are many other examples, but I think of pandas as the central player in the data workflow of the Python ecosystem. It is unmatched in its number of methods, so it should be expected that some things could use improvement, and that is the point of this post: I want pandas to change for the better, and I want to discuss why pandas is in its current state and what I would like to see instead.
In R, I have used three different frameworks for dealing with data, which could be considered a problem. Still, there is a crucial difference between Python and R: inside R you will usually find competition, and it happens in every sphere.
Modeling: tidymodels vs. mlr.
Visualization: plotly vs. ggplot2.
Statistical tests sometimes have tens of different implementations from all over the world.
There are many other examples, and as an economist, I look favorably on this competition. It has fostered fast, battle-tested change on multiple occasions as the packages compete.
I use the release dates from pip and CRAN, but most packages appear earlier on mailing lists or in private repos.
First, Base R was created in the late 90s, but with many more years of baggage from the S era.
Then pandas started development around 2008 and was open-sourced in 2009. It was a game-changer: vastly more declarative and faster than Base R.
data.table comes along on the premise that speed matters a lot. Remember
that it is 2006, and the single big machine paradigm was the end-all for
big data analysis. Maybe it wasn’t even called big data back then.
The tidyverse releases dplyr on the back of the massive success of
ggplot2. It wasn’t even called the tidyverse back then.
And finally, Koalas was created in 2019 to help pandas scale with PySpark. It was eventually incorporated back into PySpark as the pyspark.pandas API.
The oldest and usually the slowest way, base R has scared many people away from R. You will see some nasty behaviors here, like the famous partial string matching or the very cryptic function names and arguments. Base R feels dated because it is dated: now more than 20 years old, this framework inspired pandas, but you need some patience to use it. There is an excellent book about all of Base R’s little details that I highly recommend, called The R Inferno.
Base R implements some design ideas that users of the pandas ecosystem will recognize, like the dreadful row names (which pandas turned into indexes) and the use of square brackets for 2D manipulation of data.
Some details of R for the Python folks: R doesn’t expect you to know what a pointer is, because R doesn’t expect you to be a regular programmer. Base R envisions a statistician with a few thousand lines of beautiful math that got turned into a package. This means that if you assign a data.frame to a new name, R initially just creates a reference, and only if you later modify it in a destructive way does R automatically copy it into a new object. There is no need to keep manually calling the copy method as in pandas.
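A minimal pandas sketch of the contrast, on a toy frame rather than the penguins data:

import pandas as pd

df = pd.DataFrame({'species': ['Adelie', 'Gentoo'], 'body_mass_g': [3750, 5000]})

alias = df            # just a second name for the same object
snapshot = df.copy()  # an explicit, independent copy

alias.loc[0, 'body_mass_g'] = 0
df.loc[0, 'body_mass_g']        # 0 -- the original changed through the alias
snapshot.loc[0, 'body_mass_g']  # 3750 -- only the explicit copy is untouched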
penguins_base <- penguins |> as.data.frame()
row.names(penguins_base) <- str_c("penguin_", row_number(penguins_base$species))
head(penguins_base) |> knitr::kable()
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
penguin_1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
penguin_2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
penguin_3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
penguin_4 | Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
penguin_5 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
penguin_6 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
Now let’s use some base R
You can filter in a 2D manner with df[x, y], just like pandas .loc, selecting rows by their names and columns by theirs.
penguins_base["penguin_1", ]
species island bill_length_mm bill_depth_mm
penguin_1 Adelie Torgersen 39.1 18.7
flipper_length_mm body_mass_g sex year
penguin_1 181 3750 male 2007
species island
penguin_1 Adelie Torgersen
penguin_10 Adelie Torgersen
It even feels like pandas all the way down to it randomly deciding to change my data types.
[1] Adelie Adelie
Levels: Adelie Chinstrap Gentoo
Yep, we just fell out of a data.frame straight into a vector of the factor class, fantastic.
Let’s try to get the mean weight in kg of the females by species.
Doing complex calculations in base R feels like a chore, but some functions that work like apply on steroids usually help.
You will also rely heavily on saving intermediate data frames unless you overwrite the original, or you cheat a little and use pipes (I don’t even think of it as cheating anymore, as pipes come natively with R since the 4.1 release). As I will explain, the tidyverse is a superset of base, meaning it can be used inside the Base R workflow and can borrow functions from base as well. Some may call this modern Base R code, as it would not run on versions of R earlier than 4.1, which also enable shorter anonymous functions using \(x) instead of function(x).
results_base <- penguins_base[penguins_base["sex"] == "female", ] |>
with(aggregate(x = body_mass_g, by = list(species), FUN = \(x) mean(x) / 1000))
results_base
Group.1 x
1 Adelie 3.368836
2 Chinstrap 3.527206
3 Gentoo 4.679741
Nice-looking table. It is hell to use multiple functions, but if what you are doing is simple, Base R gets the job done with no imports… if you still care about that.
The motto here is gotta go fast
It goes as far as naturally parallelizing execution across local cores when possible. Some love the syntax; I honestly think it is the worst of all the options, but when speed on a single machine is relevant (something I encounter less and less, as we will discuss later), data.table really shines, outperforming just about anything I have ever used in Python or R.
penguins_data_table <- as.data.table(penguins)
penguins_data_table[, list(species, body_mass_g)]
species body_mass_g
1: Adelie 3750
2: Adelie 3800
3: Adelie 3250
4: Adelie NA
5: Adelie 3450
---
340: Chinstrap 4000
341: Chinstrap 3400
342: Chinstrap 3775
343: Chinstrap 4100
344: Chinstrap 3775
penguins_data_table[species %in% c("Adelie", "Gentoo") & sex == "female", list(species, body_mass_g)][, lapply(.SD, mean, na.rm = TRUE), species]
species body_mass_g
1: Adelie 3368.836
2: Gentoo 4679.741
It produces these magical one-liners with speed to spare. The problem is that I can barely glimpse what I did here, as almost all of the execution depends on you remembering this model behind the scenes.
DT[i, j, by]
i = on which rows (WHERE | ORDER BY)
j = what to compute (SELECT | UPDATE)
by = grouped by what (GROUP BY)
And trust me when I say it gets complicated. data.table is Turing complete, as all the options here are, and it is out there performing all the functions of either dplyr or pandas with just three arguments! That produces some of the most confusing pieces of code you will ever read. At least the data.table team killed the idea of row names as well.
My clear favorite. In a perfect world, everyone would know the tidyverse for the power it brings to expressing ideas about data with straightforward declarative syntax. This is very much the empowered version of SQL. A nice thing that I already showed in the base R part is that the tidyverse is only a part of the R ecosystem, meaning you can take your old statistical operations and just plug them into place. I will further detail how easy it is to develop an extension for the tidyverse, but first, let’s see some syntax.
penguins |>
filter(
species %in% c("Adelie", "Gentoo"),
sex == "female"
) |>
group_by(species) |>
summarise(body_mass_g_to_kg = mean(body_mass_g) / 1000)
# A tibble: 2 x 2
species body_mass_g_to_kg
<fct> <dbl>
1 Adelie 3.37
2 Gentoo 4.68
This is an example of what dplyr can do while remaining very close to English. You can opt into named arguments that are very well thought out, some of which have been decided in Twitter polls; the team at RStudio clearly thinks about usage and is willing to redesign old parts of the system to reach new usability levels.
result_tidyverse <- penguins |>
select(-year) |>
filter(
species %in% c("Adelie", "Gentoo"),
sex == "female"
) |>
group_by(species) |>
select(where(is.numeric)) |>
summarise(across(
.cols = where(\(x) mean(x) > 188),
.fns = list(median = median, mean = mean),
.names = "{.fn}-{.col}"
)) |>
mutate(across(
.cols = ends_with("_g"),
.fns = list(to_kg = \(x) x / 1000),
.names = "{.col}-{.fn}"
))
result_tidyverse |> rmarkdown::paged_table()
This is incredibly similar to how I manipulate data in the real world. Some steps are simple, like converting a business metric from g to kg, while others empower you to write incredibly succinct syntax that feels like a superpower for your laziness. You start to write smaller and smaller code to deal with more and more complex problems. I realize that most of what is in here exists solely in the tidyverse (for now) and that newcomers may not understand somewhat complex functions like across the first time they try to use them. Still, it is such a game-changer that functions like across exist, the alternatives being writing out tens of column names, or pre-computing a list as I will show in pandas.
Also, you can see how the tidyverse easily interacts with custom functions. The anonymous function gets placed right into the middle of the pipeline without a custom method or any other fancy workaround, and it just builds upon what R offers.
One drawback is that your code feels ‘old’ pretty fast in the tidyverse ecosystem. In this example alone, the |> operator (called a pipe), the new anonymous function syntax, and the across and where functions are all less than two years old.
The tidyverse can also take this code written in R and, with only a connection to a data source, compile it into code for another language, mainly SQL. This is super helpful, as I will show later when discussing what I wish pandas implemented.
I usually think of pandas as a project to copy into Python what worked in other languages, mainly what people call base R plus some SQL. It is hugely successful, and using it is most of the time a joyful experience; it is not the prettiest, but it gets the job done.
pandas is flexible to the point where you can write the same code and make it feel like Base R or the tidyverse depending on which methods you choose, for example by going heavy into indexing for the Base R style.
We can read its mission on the page
Mission
pandas aims to be the fundamental high-level building block for doing
practical, real-world data analysis in Python. Additionally, it has the
broader goal of becoming the most powerful and flexible open-source data
analysis/manipulation tool available in any language.
That is quite a greedy statement, and I love it. pandas should strive for perfection, power, and flexibility, but let’s look at some current limitations and quirks that I personally dislike. That doesn’t mean you can’t like them.
We can easily import the penguins dataset by reading the repository csv.
penguins = pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')
Using loc, you get very close to base R philosophy.
df_result = penguins.loc[penguins.species.isin(['Adelie','Chinstrap']), ['species','sex','body_mass_g']]

df_result
species sex body_mass_g
0 Adelie male 3750.0
1 Adelie female 3800.0
2 Adelie female 3250.0
3 Adelie NaN NaN
4 Adelie female 3450.0
.. ... ... ...
339 Chinstrap male 4000.0
340 Chinstrap female 3400.0
341 Chinstrap male 3775.0
342 Chinstrap male 4100.0
343 Chinstrap female 3775.0
[220 rows x 3 columns]
But pandas gets ahead of itself and starts changing the data type depending on the parameters, so what started out as a DataFrame may sometimes come back as a Series… In this example, you can avoid that behavior by passing lists. This is similar to the data.table way, and it baffles me why it is even a possibility. It overloads slicing with a second role as an object-type manipulator; common to both pandas and data.frame, slicing is this super powerful method that may return wildly different results depending on very small changes.
df_result = penguins.loc[penguins.species.isin(['Adelie','Chinstrap']), 'species']

df_result
0 Adelie
1 Adelie
2 Adelie
3 Adelie
4 Adelie
...
339 Chinstrap
340 Chinstrap
341 Chinstrap
342 Chinstrap
343 Chinstrap
Name: species, Length: 220, dtype: object
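A quick sketch of the list-based workaround mentioned above, reusing the penguins frame loaded earlier:

# wrapping the single column in a list keeps the result a DataFrame
df_result = penguins.loc[penguins.species.isin(['Adelie', 'Chinstrap']), ['species']]
type(df_result)  # pandas.core.frame.DataFrame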
pandas method chaining has a super similar syntax to the tidyverse if you opt into it.
penguins\
    .query("species in ('Adelie', 'Gentoo')")\
    .groupby('species')\
    .agg({'body_mass_g': lambda x: np.mean(x)/1000})

# penguins |>
#   filter(species %in% c('Adelie', 'Gentoo')) |>
#   group_by(species) |>
#   summarise(body_mass_g_to_kg = mean(body_mass_g)/1000)
body_mass_g
species
Adelie 3.700662
Gentoo 5.076016
pandas has this tendency to create more and more indexes; reset_index will quickly become your go-to solution, and when we add hierarchical indexes to the mix, you are going to be copying and pasting answers from Stack Overflow to flatten the data you created, or you will need some sophisticated indexing operation to get specific results back.
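A tiny sketch of that reflex, again on the penguins frame:

# the grouping variable comes back as an index...
penguins.groupby('species')['body_mass_g'].mean()

# ...so reset_index() turns it back into a regular column
penguins.groupby('species')['body_mass_g'].mean().reset_index()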
Another detail is that I need to manually drop columns before the groupby operation, and this sucks because there is no dtype-based selection that keeps species (our grouping variable) while excluding island and sex.
penguins\
    .drop(columns = 'year')\
    .query("species in ('Adelie', 'Gentoo')")\
    .groupby('species')\
    .select_dtypes('numeric')
Error in py_call_impl(callable, dots$args, dots$keywords): AttributeError: 'DataFrameGroupBy' object has no attribute 'select_dtypes'
Detailed traceback:
File "<string>", line 1, in <module>
File "C:\Users\bruno\AppData\Local\R-MINI~1\lib\site-packages\pandas\core\groupby\groupby.py", line 904, in __getattr__
raise AttributeError(
So instead I need to pre-select my columns.
result_hierarchical = penguins\
    .drop(columns = ['year','island','sex'])\
    .query("species in ('Adelie', 'Gentoo')")\
    .groupby('species')\
    .agg([np.mean, np.median])
result_hierarchical
bill_length_mm bill_depth_mm ... flipper_length_mm body_mass_g
mean median mean ... median mean median
species ...
Adelie 38.791391 38.8 18.346358 ... 190.0 3700.662252 3700.0
Gentoo 47.504878 47.3 14.982114 ... 216.0 5076.016260 5000.0
[2 rows x 8 columns]
To access the body_mass_g columns and transform them to kg, I need to deal with the index system without using the loc method.
"body_mass_g"] /1000 result_hierarchical[
mean median
species
Adelie 3.700662 3.7
Gentoo 5.076016 5.0
So you might just think, OK, simple, I just need to assign it back.
"body_mass_g"] = result_hierarchical["body_mass_g"] /1000 result_hierarchical[
and it works, if you don’t mind losing the original data. But if you want to create some new name…
"body_mass_g_back_to_g"] = result_hierarchical["body_mass_g"] * 1000 result_hierarchical[
Error in py_call_impl(callable, dots$args, dots$keywords): ValueError: Expected a 1D array, got an array with shape (2, 2)
Detailed traceback:
File "<string>", line 1, in <module>
File "C:\Users\bruno\AppData\Local\R-MINI~1\lib\site-packages\pandas\core\frame.py", line 3645, in __setitem__
self._set_item_frame_value(key, value)
File "C:\Users\bruno\AppData\Local\R-MINI~1\lib\site-packages\pandas\core\frame.py", line 3788, in _set_item_frame_value
self._set_item_mgr(key, arraylike)
File "C:\Users\bruno\AppData\Local\R-MINI~1\lib\site-packages\pandas\core\frame.py", line 3802, in _set_item_mgr
self._mgr.insert(len(self._info_axis), key, value)
File "C:\Users\bruno\AppData\Local\R-MINI~1\lib\site-packages\pandas\core\internals\managers.py", line 1235, in insert
raise ValueError(
Infuriating. You can go on an SO hunt for the right approach to keep the indexes, but at this point, I am done with pandas indexing and just cheat my way to the result with a flattened data frame.
# Flatten MultiIndex columns
result_hierarchical.columns = ["_".join(col_name).rstrip('_') for col_name in result_hierarchical.columns.to_flat_index()]

result_hierarchical['body_mass_g_median_back_to_g'] = result_hierarchical['body_mass_g_median'] * 1000

result_hierarchical['body_mass_g_median']
species
Adelie 3.7
Gentoo 5.0
Name: body_mass_g_median, dtype: float64
result_hierarchical['body_mass_g_median_back_to_g']
species
Adelie 3700.0
Gentoo 5000.0
Name: body_mass_g_median_back_to_g, dtype: float64
This framework is delightful to work with, mainly because you can go back and forth between 4 APIs: SQL, Spark, Koalas, and pandas, and chances are one of them has an excellent approach to your problem. This post would deviate too much if I talked in depth about it, but it certainly has its place on the big data manipulation side, with some APIs that are sometimes superior to what pandas can offer; I will touch on the issue of laziness when discussing what I wish pandas implemented. Also, PySpark really struggles with indexing, as it should, and Koalas allows for some crazy distributed index rules.
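A hedged sketch of that back and forth, assuming PySpark >= 3.2 with the pandas API on Spark (formerly Koalas) available:

import pandas as pd
import pyspark.pandas as ps  # requires a working Spark installation

penguins_pd = pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')

# the same pandas-flavoured syntax, but Spark executes it
penguins_ps = ps.from_pandas(penguins_pd)
penguins_ps[penguins_ps.species.isin(['Adelie', 'Gentoo'])] \
    .groupby('species')['body_mass_g'].mean()

# and you can drop down to the native Spark DataFrame API at any point
sdf = penguins_ps.to_spark()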
You must understand that any speed comparison among packages at the individual-computer level becomes irrelevant as soon as you have access to Spark clusters. This is how you query billions of records with ease: not by having slightly faster performance on a single machine, but by leveraging the horizontal scaling capacity of big companies such as Amazon, Google, and Microsoft.
I see two kinds of problems in the pandas framework. The first, and honestly the simplest to explain, are things that it implements and I think it shouldn’t. Many pandas problems come from what it copied from base R, which is why you needed to understand the timeline of the packages at the beginning of this post. The second are the features that recent packages in both the R and Python spheres have put on the table in recent years, features that didn’t exist when pandas was being created and that I hope the pandas team, or some other package teams, will eventually be able to integrate into the pandas ecosystem.
Here are the things that I hope pandas makes more straightforward to get away from, or ditches altogether.
This wasn’t even a feature back in the SQL days. Indexing comes from languages that tried to facilitate the… indexing of data, so you could play a version of Battleship on your data. The idea sounds suitable, and it works for mostly simple indexes, data[x, y] if you can picture it, but even the row index, what R calls row.names, is, in my opinion, a mistake. It adds unnecessary clutter to what could easily have been one more column, so that you could define all datasets, or at least all structured datasets, as the tidy diagram shows.
The id column in this example is treated as just one more column, and it doesn’t need any special treatment, as we will discuss when we get to slicing.
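A small sketch of the difference, with made-up penguin ids:

df = pd.DataFrame({'id': ['penguin_1', 'penguin_2', 'penguin_3'],
                   'species': ['Adelie', 'Adelie', 'Gentoo']})

indexed = df.set_index('id')   # id is promoted to an index, with its own rules
flat = indexed.reset_index()   # back to being just one more column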
This is the worst of the bunch. If data[x, y] is debatable and maybe human-readable, or Excel-friendly as I like to call it, MultiIndex feels like the epitome of unnecessary structure. MultiIndex burdens you with a new, more complex data type than the one you started with. It is hardly an easy feature from a usability standpoint, as multiple SO threads will show: it causes confusion, generates code just to get back to a flatter structure, and only exists in pandas and its copycats. For a good reason: this format doesn’t translate into common file structures such as CSV or xlsx, and it can’t be passed back into a SQL server or Spark. It is too much for what could have been a flatter structure with more names, or a nested return dataframe.
The Solution
Two types of functions suffer from a practical/usability standpoint because of the Index concept.
The first are the functions that don’t work out of the box because of indexing. I am not talking about functions that have bad names or arguments because of indexing, like sort_values instead of sort, because god forbid sort messed with the Index and you didn’t realize it. The biggest issue comes from functions that almost always arrive accompanied by reset_index(), or that carry an internal ignore_index argument, like concat.
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 2, 3, 4])

df2 = pd.DataFrame({'A2': ['A4', 'A5', 'A6', 'A7'],
                    'D2': ['D4', 'D5', 'D6', 'D7']},
                   index=[5, 6, 7, 3])
How is this the default behavior?
pd.concat([df1, df2], axis=1)
A D A2 D2
0 A0 D0 NaN NaN
2 A1 D1 NaN NaN
3 A2 D2 A7 D7
4 A3 D3 NaN NaN
5 NaN NaN A4 D4
6 NaN NaN A5 D5
7 NaN NaN A6 D6
instead of this
pd.concat([df1.reset_index(), df2.reset_index()], axis=1)
index A D index A2 D2
0 0 A0 D0 5 A4 D4
1 2 A1 D1 6 A5 D5
2 3 A2 D2 7 A6 D6
3 4 A3 D3 3 A7 D7
Maybe I am expecting too much from concat, and there should be another method to do what the tidyverse calls bind_cols, an operation where you combine two dfs with the same number of rows. But still, where is this magical function? It can’t possibly exist under the current Index-dominated philosophy, because the index would have to be ignored entirely.
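A sketch of what such a helper could look like; the name bind_cols and its position-based behavior are borrowed from the tidyverse, not an existing pandas method:

def bind_cols(*dfs):
    # column-bind by position, ignoring whatever index each frame carries
    return pd.concat([df.reset_index(drop=True) for df in dfs], axis=1)

bind_cols(df1, df2)
#     A   D  A2  D2
# 0  A0  D0  A4  D4
# 1  A1  D1  A5  D5
# 2  A2  D2  A6  D6
# 3  A3  D3  A7  D7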
The second are the ones that give you back a deeper, more nested index than the one you started with. My solution is to provide functions that always return a dataframe with the same flatness level, either with named returns, like how dplyr’s across uses the placeholders .col and .fn, or with nested data instead of nested indexes. These functions can still rely on Indexes on the back end, but at a higher level, let the user decide how to handle naming or whether they want a nested return.
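For the groupby/agg case specifically, pandas’ named aggregation already gets part of the way there by keeping the columns flat; a quick sketch (the result names on the left are mine):

penguins \
    .query("species in ('Adelie', 'Gentoo')") \
    .groupby('species') \
    .agg(body_mass_g_mean=('body_mass_g', 'mean'),
         body_mass_g_median=('body_mass_g', 'median'))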
The Solution
Define a function to deal with the index.
def agg_no_index(df, arg_to_agg):
    df = df\
        .agg(arg_to_agg)
    df.columns = ["_".join(col_name).rstrip('_') for col_name in df.columns.to_flat_index()]
    return df
Put it into a pipe.
penguins\
    .drop(columns = ['year','island','sex'])\
    .query("species in ('Adelie', 'Gentoo')")\
    .groupby('species')\
    .pipe(agg_no_index, arg_to_agg = [np.sum, np.mean])
bill_length_mm_sum ... body_mass_g_mean
species ...
Adelie 5857.5 ... 3700.662252
Gentoo 5843.1 ... 5076.016260
[2 rows x 8 columns]
Create some functions like the one I just showed and make them available as methods, so that we don’t have to use the subpar pipe method (also, why isn’t pipe a symbol?).
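One hedged sketch of how that could look today, using pandas’ accessor registration; the accessor name tidy and the method grouped_agg are made up for illustration:

import pandas as pd

@pd.api.extensions.register_dataframe_accessor('tidy')
class TidyAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def grouped_agg(self, by, funcs):
        # group, aggregate, then flatten the MultiIndex columns before returning
        out = self._obj.groupby(by).agg(funcs)
        out.columns = ['_'.join(col).rstrip('_') for col in out.columns.to_flat_index()]
        return out

penguins \
    .drop(columns=['year', 'island', 'sex']) \
    .query("species in ('Adelie', 'Gentoo')") \
    .tidy.grouped_agg(by='species', funcs=['sum', 'mean'])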
Slicing suffers from the sins of indexing, and it feels like there are tens of different ways to do the same thing. Just look at how many ways pandas covers on its site.
And I am sure some people have created even more complicated ones out there.
The Solution
Assume that all data will be in a tidy format and just use two kinds of operations: select to get your columns and where/filter/query to get your rows.
As shown in the select_dtypes fiasco with groupby, pandas needs to step up its selection game.
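For reference, the closest tools pandas offers today, none of which really composes after a groupby the way dplyr’s select does:

penguins.filter(regex='_mm$')      # columns whose names match a regular expression
penguins.select_dtypes('number')   # columns by dtype
penguins.query("sex == 'female'")  # rows by a query expression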
dplyr supports all manner of straightforward ways to select data, like regular expressions.
# A tibble: 344 x 4
# Groups: sex [3]
sex bill_length_mm bill_depth_mm flipper_length_mm
<fct> <dbl> <dbl> <int>
1 male 39.1 18.7 181
2 female 39.5 17.4 186
3 female 40.3 18 195
4 <NA> NA NA NA
5 female 36.7 19.3 193
6 male 39.3 20.6 190
7 female 38.9 17.8 181
8 male 39.2 19.6 195
9 <NA> 34.1 18.1 193
10 <NA> 42 20.2 190
# ... with 334 more rows
Or data type inference that actually works post grouped operations.
# A tibble: 344 x 6
# Groups: sex [3]
sex bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <dbl> <dbl> <int> <int>
1 male 39.1 18.7 181 3750
2 female 39.5 17.4 186 3800
3 female 40.3 18 195 3250
4 <NA> NA NA NA NA
5 female 36.7 19.3 193 3450
6 male 39.3 20.6 190 3650
7 female 38.9 17.8 181 3625
8 male 39.2 19.6 195 4675
9 <NA> 34.1 18.1 193 3475
10 <NA> 42 20.2 190 4250
# ... with 334 more rows, and 1 more variable: year <int>
It can even go as far as to accept custom functions that return Boolean values, although it may require some protection from function errors.
safe_sum <- \(original_col) safely(\(x) sum(x, na.rm = TRUE))(original_col)$result
penguins |>
group_by(sex) |>
select(where(\(x) isTRUE(safe_sum(x) > 100))) |>
select(!year)
# A tibble: 344 x 5
# Groups: sex [3]
sex bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <dbl> <dbl> <int> <int>
1 male 39.1 18.7 181 3750
2 female 39.5 17.4 186 3800
3 female 40.3 18 195 3250
4 <NA> NA NA NA NA
5 female 36.7 19.3 193 3450
6 male 39.3 20.6 190 3650
7 female 38.9 17.8 181 3625
8 male 39.2 19.6 195 4675
9 <NA> 34.1 18.1 193 3475
10 <NA> 42 20.2 190 4250
# ... with 334 more rows
Yes, it may require some workarounds compared to safer functions like is.numeric, but the point is that it is possible to do it in the tidyverse.
What I mean by a true connection is that there are packages out there, like siuba, with the same syntax as dplyr, that can be extended to write queries directly against your data source without changing your regular in-memory code. This is also possible in tools such as Tableau and Power Query, and it drastically increases the potential for writing cross-platform code. I don’t mean writing .sql() code in PySpark or similar; I mean something like this example using sparklyr + DBI on a local Spark cluster.
library(sparklyr)
sc <- spark_connect(master = "local")
Now I can send my penguins dataset to the fake cloud.
tbl_penguins <- copy_to(sc, penguins, "spark_penguins")
Let’s quickly show what dplyr can do now that it has access to a Spark connection. It defaults to lazy evaluation, as R tends to, so queries don’t truly return unless you use collect().
As you can see, my table is no longer a dataframe, so nrow won’t work unless I bring it into local memory.
What is impressive is that you can easily use the cluster’s resources, since what dplyr is doing is writing SQL queries for a Spark cluster. In this case it is my own machine, but it could have been a cluster with hundreds of executors, ready to quickly finish executing my code. It is like Google Translate for code!
return_lazy |>
show_query()
<SQL>
SELECT `species`, `island`, `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `body_mass_g`, `sex`, `year`, POWER(`bill_depth_mm`, 2.0) AS `bill_depth_mm^2`
FROM `spark_penguins`
On a side note, writing SQL queries to the target data source is not the limit of the integration between sparklyr and dplyr. It can even go as far as to use some of Spark’s more powerful functions like its ml capabilities or invoke spark-specific functions.
return_lazy |>
select(sex, where(is.numeric)) |>
na.omit() |>
ml_logistic_regression(sex ~ .)
Formula: sex ~ .
Coefficients:
(Intercept) bill_length_mm bill_depth_mm
-0.018096739 -0.107405083 -1.043594480
flipper_length_mm body_mass_g year
0.034393443 -0.005437584 0.023328984
bill_depth_mm^2
-0.028100181
It has this painless integration between Spark and the tidyverse, which is better, IMHO, than even what PySpark has right now, and miles ahead of what the pyspark.pandas API has been able to achieve for my own usage so far.
While pandas is the biggest dog on the block right now in terms of in-memory data manipulation, it still has some points it can copy from the competition that could make it even better or more flexible.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/brunocarlin/carlin, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Carlin (2022, Feb. 28). Bruno Testaguzza Carlin blog: pandas Wishes. Retrieved from https://carlin-blog.netlify.app/posts/2022-02-28-pandaswishes/
BibTeX citation
@misc{carlin2022pandas,
  author = {Carlin, Bruno Testaguzza},
  title = {Bruno Testaguzza Carlin blog: pandas Wishes},
  url = {https://carlin-blog.netlify.app/posts/2022-02-28-pandaswishes/},
  year = {2022}
}