Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: R poly compatibility #92

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

thequackdaddy
Copy link
Contributor

Hello,

I've added the function poly which attempts to reproduce R's poly function.

See here: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/poly.html

Thanks.

@coveralls
Copy link

coveralls commented Sep 15, 2016

Coverage Status

Coverage increased (+0.007%) to 98.667% when pulling d290dd3 on thequackdaddy:poly_qr into 8b6c712 on pydata:master.

@codecov-io
Copy link

codecov-io commented Sep 15, 2016

Codecov Report

Merging #92 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #92      +/-   ##
==========================================
+ Coverage   98.96%   98.98%   +0.02%     
==========================================
  Files          30       32       +2     
  Lines        5585     5703     +118     
  Branches      775      791      +16     
==========================================
+ Hits         5527     5645     +118     
  Misses         35       35              
  Partials       23       23
Impacted Files Coverage Δ
patsy/__init__.py 96.66% <100%> (+0.11%) ⬆️
patsy/polynomials.py 100% <100%> (ø)
patsy/contrasts.py 100% <100%> (ø) ⬆️
patsy/test_poly_data.py 100% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4c613d0...4d107c6. Read the comment docs.

@thequackdaddy
Copy link
Contributor Author

I realized after writing that this may relate to #20 .

But 20 points out some good questions...

  • The code could be simplified to use the different vander methods that numpy provides.
  • Josef brings up a point that qr might be an unnecessary complication. I think simply scaling the values might get you to the same end.
  • My real need is for a method to provide some non-linear relationships between exogenous and endogenous data with the ability to extrapolate values not in the training data. So changing the scope of this PR to still accommodate for that might be a good initial step.

Thanks.

@njsmith
Copy link
Member

njsmith commented Oct 25, 2016

Is it possible for this to share the core implementation with patsy.contrasts.Poly? That was cribbed from R as well, and in principle at least they should be doing essentially the same thing...

@thequackdaddy
Copy link
Contributor Author

Is it possible for this to share the core implementation with patsy.contrasts.Poly?

I would think so. Honestly I wrote this before I noticed the above noted, Issue that was already filed. Maybe I can incorporate those thoughts into this too.

@thequackdaddy
Copy link
Contributor Author

@njsmith 3 changes...

  1. patsy.contrast.Poly uses the same code as patsy.polynomials to generate the ContrastMatrix.
  2. Per some notes in the PR, I made it so that you can use any of the XXXvander methods from numpy.
  3. In addition to QR-orthogonalization to "scale" the data, you can just use a straight standardizer

@thequackdaddy
Copy link
Contributor Author

Also dumb question...

Does patsy require pandas now? I noticed that my codecov score is not the best, the reason appears to be that I'm skipping the have_pandas == False scenario. Not sure if that's even something we should be considering? Why write exception handling if the exception isn't supported?

@njsmith
Copy link
Member

njsmith commented Oct 26, 2016

Hmm, that's a ton of options there, between raw versus not, different polynomial bases, standardizing... I guess my first question is, are any of these... useful? Besides the boring basic R-compatible orthonormalized polynomials? Is there some compelling argument for using them in some particular situation, or an existing audience who expects them?

Regarding pandas: nominally at least we still don't depend on pandas. You're right that there ought to be tests for this, though, oops. Possibly this doesn't make sense anymore; it's a different world now than it was back in 2012 or whatever when this decision was made... OTOH I dunno if people using patsy with scikit-learn for example necessarily use/want pandas.

I've just tried adding a no-pandas test to the travis matrix -- I guess in ~10 minutes we'll know if (a) I'm any good at convincing travis to do what I want, (b) if so, whether the no-pandas branches actually work!

@njsmith
Copy link
Member

njsmith commented Oct 26, 2016

master branch is now testing the no-pandas configuration too, so the next time you push it then codecov should start calculating your stats more accurately :-)

@thequackdaddy
Copy link
Contributor Author

Yeah I may have gone overboard there... :-/

I think QR vs. raw is necessary. Standardizing is kind of dumb... Not sure why you'd use it ever. AFAIU, the point of scaling in this way is just so that 1. the columns of data in the design matrix are relatively orthogonal and 2. when columns are of wildly different scale, the np.linalg.matrix_rank function won't have sufficient precision to correctly rank the matrix.

I don't really know the reasons for the differences in poly vs. chebyshev vs legende vs. laguerre. Josef poster mentioned that we should point to a generic XXXvander, I just assumed we let the user pick. I may have misinterpreted some of what Josef was trying to say.

You think I should drop all the other vanders other than polyvander and standardizing? I doubt I would use anything but that. Plus, my tests are vs. R, which is just QR-orthogonalized polyvander so the rest of this we are just assuming...

cc @josef-pkt - I know you're probably busy, but I'd appreciate your thoughts if there's any value in all this. Happy to remove this complexity if its not worthwhile.

@thequackdaddy
Copy link
Contributor Author

I've simplified this back down so that it mimics R's poly function. Removed all the other vanders. When raw=False (the default) it will use QR-to orthogonalize a polyvander.

@thequackdaddy
Copy link
Contributor Author

thequackdaddy commented Feb 28, 2018

@njsmith - Its been over a year, but I think this is close to being good enough. Any comments? Thanks!

@has2k1
Copy link
Contributor

has2k1 commented Sep 10, 2019

Any update on this PR?

@jonathan-taylor
Copy link

Is this merged yet?

@matthewwardrop
Copy link
Collaborator

Hi @thequackdaddy ! I took a quick look through, and this looks good to me. Let me know if this is still good to merge as is. If so, I'll merge it in :).

@matthewwardrop
Copy link
Collaborator

Support for poly has now been merged into formulaic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants