Enable polars users to easily access to package datasets #91
Comments
IMO a nice approach would be to create a tiny class,
This should just involve implementing concretes in |
@machow, I've come up with two ideas inspired by your content.

Approach 1

# data/__init__.py
class DataFrameProxy1:
    def __init__(self, fname):
        self._fname = fname
        self._pandas = None
        self._polars = None

    @property
    def pandas(self):
        if self._pandas is None:
            import pandas as pd

            self._pandas = pd.read_csv(self._fname)
        return self._pandas

    @property
    def polars(self):
        if self._polars is None:
            import polars as pl

            # or using `pl.read_csv` directly, but need to
            # be careful of setting `dtypes`
            self._polars = pl.from_pandas(self.pandas)
        return self._polars


air: DataFrameProxy1 = DataFrameProxy1(_airquality_fname)  # type: ignore

Approach 2

# data/__init__.py
class DataFrameProxy2(DataFrameProxy1):
    def __getattr__(self, name):
        return getattr(self.pandas, name)


air: DataFrameProxy2 = DataFrameProxy2(_airquality_fname)  # type: ignore

>>> from great_tables.data import air
>>> air.head()
   Ozone  Solar_R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5
>>> air.assign(NewDay=lambda df_: df_.Day.add(1))
     Ozone  Solar_R  Wind  Temp  Month  Day  NewDay
0     41.0    190.0   7.4    67      5    1       2
1     36.0    118.0   8.0    72      5    2       3
2     12.0    149.0  12.6    74      5    3       4
3     18.0    313.0  11.5    62      5    4       5
4      NaN      NaN  14.3    56      5    5       6
..     ...      ...   ...   ...    ...  ...     ...
148   30.0    193.0   6.9    70      9   26      27
149    NaN    145.0  13.2    77      9   27      28
150   14.0    191.0  14.3    75      9   28      29
151   18.0    131.0   8.0    76      9   29      30
152   20.0    223.0  11.5    68      9   30      31

[153 rows x 7 columns]

However, this will cause |
These two approaches appear to be related to issue #8.
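
The lazy-loading and delegation ideas in the two proxy classes above can be sketched without either DataFrame library. This is only an illustration under stated assumptions: `LazyProxy`, `loader`, `target`, and `load_rows` are hypothetical names (not part of great_tables), a plain list stands in for a DataFrame, and `functools.cached_property` is swapped in for the hand-written `None` checks. The guard on underscore-prefixed names is an added precaution, so that `__getattr__` cannot recurse if an instance is inspected before `__init__` has run (for example while being copied).

```python
from functools import cached_property


class LazyProxy:
    """Load an expensive object on first use, then forward attribute access to it."""

    def __init__(self, loader):
        # `loader` is a hypothetical zero-argument callable, e.g. a CSV reader.
        self._loader = loader

    @cached_property
    def target(self):
        # Runs once; the result is then cached in the instance __dict__.
        return self._loader()

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails. Refuse private names
        # so a half-initialised instance cannot trigger infinite recursion.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self.target, name)


calls = []  # records how many times the loader actually ran


def load_rows():
    calls.append("loaded")
    return [{"Ozone": 41.0, "Day": 1}, {"Ozone": 36.0, "Day": 2}]


rows = LazyProxy(load_rows)
print(calls)                                   # the loader has not run yet
print(rows.count({"Ozone": 41.0, "Day": 1}))   # forwarded to list.count
print(calls)                                   # the loader ran exactly once
```

One caveat with any `__getattr__`-based proxy: implicit special-method lookups such as `len(proxy)` or `proxy[0]` bypass `__getattr__`, so the proxy does not behave like the wrapped object everywhere unless those methods are defined explicitly.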
From discussion here: #525 (comment) Let's plan on pulling SimpleFrame from reactable-py into here. Once we get everything working nicely here, we can always pull SimpleFrame into its own package. The advantages of the simple frame approach are...
Currently, Great Tables includes over a dozen datasets in its `.data` submodule:
However, these datasets are pandas DataFrames, so polars users need to convert them:
This isn't too bad. But maybe it could be better? This issue will discuss various ways we could approach loading data for both pandas and polars users.
This is mostly me thinking out loud about different options, without a strong opinion on an approach yet 😅.
Possible approaches
- Use `pl.from_pandas()` to convert.
- `set_options(data_frame = pl.DataFrame)`.
- `set_options(data_frame="polars")`.
- `exibble(pl.DataFrame)`, or `exibble("polars")`, or `exibble()  # uses set_options() to get DataFrame`
- `from great_tables.data.polars import airquality`, OR `from great_tables.data_pl import airquality`, OR `from some_data_package import airquality`
Desirable outcomes
Easy to perform
For example, if data is simply imported and `set_options()` is used, then people will need to run some code in between imports. This feels a bit kludgy. Here's an example:
At the same time, calling functions gets kind of annoying:
Helpful DataFrame completions in IDE
I'm not sure how to implement something like `set_options()` and a data fetcher like `airquality()`. Is there a way to type it, so tools like pyright know that when an option is set to a specific value, `airquality()` returns a specific type of DataFrame?
Last thoughts
I like the idea of us having a `great_tables.data_pl` submodule, or even a separate data package for datasets, but am curious what seems most useful to folks!
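
On the typing question raised under "Helpful DataFrame completions in IDE": one shape that static checkers can follow is an explicit argument typed with `typing.overload` and `Literal`, similar to the `exibble("polars")` style listed among the possible approaches. A global `set_options()` call is runtime state, so as far as I know a checker such as pyright cannot use it to narrow a return type. The sketch below is hypothetical throughout: the `backend` parameter, the inline CSV bytes, and this `airquality()` function are assumptions for illustration, not the great_tables API.

```python
import io
from typing import TYPE_CHECKING, Literal, overload

if TYPE_CHECKING:
    # Needed only by the type checker; neither library is imported at module load.
    import pandas as pd
    import polars as pl

# Hypothetical inline stand-in for a packaged CSV file.
_CSV = b"Ozone,Temp\n41,67\n36,72\n"


@overload
def airquality(backend: Literal["pandas"] = ...) -> "pd.DataFrame": ...
@overload
def airquality(backend: Literal["polars"]) -> "pl.DataFrame": ...
def airquality(backend: str = "pandas"):
    """Return the dataset in the requested DataFrame flavour."""
    if backend == "pandas":
        import pandas  # imported lazily, so polars-only users never need it

        return pandas.read_csv(io.BytesIO(_CSV))
    if backend == "polars":
        import polars  # imported lazily, so pandas-only users never need it

        return polars.read_csv(io.BytesIO(_CSV))
    raise ValueError(f"unknown backend: {backend!r}")
```

With this shape, a checker should be able to infer `pl.DataFrame` for `airquality("polars")` and `pd.DataFrame` for `airquality()`, which is what drives IDE completions. The trade-off is the one already noted above: every dataset becomes a function call rather than a plain import.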