
Some questions about the API #29

Open

willtebbutt opened this issue Aug 9, 2021 · 6 comments

Comments

@willtebbutt
Member

willtebbutt commented Aug 9, 2021

The JuliaGPs org is trying to figure out how best to provide a high-level front-end for our GPs -- currently they're useful for researchers and people who know a bit more about GPs, but we've not built functionality which lets people just call "fit" and expect something sensible to happen.

We're investigating all of the ML frameworks that we can find in the Julia ecosystem to figure out which ones are likely to work for us (see e.g. https://github.com/willtebbutt/MLJAbstractGPsGlue.jl/). We might pick one, or we might pick a couple if there's a good reason to do so.

To that end, I have a few API-related questions, to try to establish where there is and is not flexibility in the current Models.jl API:

  1. Input and output types. JuliaGPs leans heavily on the idea of collections of inputs being represented as AbstractVectors of the same length as the number of outputs. For an explanation of this, see our API docs and design discussion docs. The Models.jl API presently requires that inputs are AbstractMatrixes. Would it be possible to generalise this to AbstractVectors?
  2. The same question applies to outputs: JuliaGPs requires a single output per input (including in the multi-output case -- we transform multi-output problems into single-output problems on an extended input space).
  3. Joint predictions. The current API requires that distributional predictions are marginal, in the sense that the output of predict must be a vector of distributions. In JuliaGPs we're often interested in joint predictions, so it would be nice for predict to be able to return a single distribution object representing the joint distribution over the predictions at all locations requested by the user.
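
For concreteness, a rough sketch of the shapes involved in points 1 and 3, using Distributions.jl types. The values are placeholders rather than Models.jl or JuliaGPs code; the point is only the difference between a vector of marginals and a single joint distribution.

using Distributions, LinearAlgebra

# Hypothetical collection of 5 inputs, each a 3-dimensional feature vector.
x = [randn(3) for _ in 1:5]

# Marginal-style predictions (the current convention): one univariate distribution per input.
marginal_preds = [Normal(0.0, 1.0) for _ in x]

# Joint-style prediction (what JuliaGPs would like to be able to return): a single
# multivariate distribution over all 5 prediction locations, which can also capture
# correlations between them.
joint_pred = MvNormal(zeros(length(x)), Matrix{Float64}(I, length(x), length(x)))

length(marginal_preds) == length(x)  # one marginal per input
length(joint_pred) == length(x)      # a single joint distribution of matching dimension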

None of these are show-stoppers for us, but it would be good to know how set-in-stone they are.

@nickrobinson251
Contributor

I've not had much to do with the package, but these all sound like good generalisations.

The Models.jl API presently requires that inputs are AbstractMatrixes. Would it be possible to generalise this to AbstractVectors?

tbh I thought this one was already the case (it seems it might be for predict, but maybe not fit?):

predict(model::Model, inputs::AbstractVector{<:AbstractVector})

@willtebbutt
Member Author

willtebbutt commented Aug 9, 2021

Ah interesting. I wonder how that happened! Though that method is still slightly more restrictive than we would like. Concretely, we would be after something like the following being permissible:

fit(::Template, x::AbstractVector, y::AbstractVector)
predict(::Model, x::AbstractVector)
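
As a purely illustrative sketch of what those looser signatures could allow (MeanOnlyTemplate and MeanOnlyModel are made-up names, not part of Models.jl or JuliaGPs):

using Distributions, Statistics

struct MeanOnlyTemplate end

struct MeanOnlyModel
    m::Float64
    s::Float64
end

# fit accepts any AbstractVector of inputs and an equal-length AbstractVector of outputs.
fit(::MeanOnlyTemplate, x::AbstractVector, y::AbstractVector) = MeanOnlyModel(mean(y), std(y))

# predict accepts any AbstractVector of inputs and returns one marginal per input.
predict(m::MeanOnlyModel, x::AbstractVector) = [Normal(m.m, m.s) for _ in x]

# The elements of x can be feature vectors, reals, or anything else; only the length matters here.
model = fit(MeanOnlyTemplate(), [randn(3) for _ in 1:10], randn(10))
preds = predict(model, [randn(3) for _ in 1:4])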

@wytbella
Member

I like the move from Matrix to Vector, which seems to be more aligned with many other ML packages. Just to double check that my understanding is correct:

  • For inputs, if they used to be of dimension num_features x num_observations, will the AbstractVector be of length num_observations? Are the elements in this AbstractVector also Vectors (of length num_features)?
  • For outputs, could you explain a bit more what will happen in the multi-output case? (do we treat each dimension in the output independently?) I think our current outputs from predict are Vector{MultivariateDistribution}. Do you propose to change it to a single MatrixDistribution? If that's the case, how do we ensure the inputs and outputs are vectors of the same length? What will the output format be for multidimensional outputs, depending on whether or not we model the correlation between different observations in the outputs?

Ah interesting. I wonder how that happened!

That happened in some early model development, when the Models.jl API was not very well-defined/restrictive.

@willtebbutt
Member Author

willtebbutt commented Aug 12, 2021

Sorry for the slow response @wytbella -- had to think a bit about the output stuff.

I like the move from Matrix to Vector, which seems to be more aligned with many other ML packages.

To be clear, I'm not suggesting dispensing with the current API -- presumably that would break lots of existing code unnecessarily. I would prefer non-breaking extensions where possible. I think that extending the API to officially allow for AbstractVector inputs would be non-breaking, but changes to outputs would probably be breaking. I guess it's a question of what kinds of changes Invenia is interested in making anyway.

For inputs, if they used to be of dimension num_features x num_observations, will the AbstractVector be of length num_observations?

num_observations indeed.

Are the elements in this AbstractVector also Vectors (of length num_features)?

In this particular instance, probably. In general, the approach we take in JuliaGPs is to avoid stating what types each element of the inputs has to be. So Reals, AbstractVectors, and YourFavouriteDataType are all permissible under our API. Our thoughts on the properties we think collections of inputs need to satisfy can be found here.
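
For example, under that convention each of the following would be a valid collection of three inputs (the Molecule type is made up, purely for illustration):

# A made-up custom input type, just for illustration.
struct Molecule
    atoms::Vector{Symbol}
end

x_reals   = [0.1, 0.5, 0.9]                  # AbstractVector{<:Real}
x_vectors = [randn(4), randn(4), randn(4)]   # AbstractVector{<:AbstractVector}
x_custom  = [Molecule([:C, :H]), Molecule([:O, :H]), Molecule([:N])]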

For outputs, could you explain a bit more what will happen in the multi-output case? (do we treat each dimension in the output independently?) I think our current outputs from predict are Vector{MultivariateDistribution}. Do you propose to change it to a single MatrixDistribution? If that's the case, how do we ensure the inputs and outputs are vectors of the same length? What will the output format be for multidimensional outputs, depending on whether or not we model the correlation between different observations in the outputs?

I've been thinking about this a bit, and I think it might just be easier for now to leave this aspect of the API as it is.

I think our current outputs from predict are Vector{MultivariateDistribution}

Yes, this is also my understanding. We could certainly make the things that JuliaGPs provides implement this requirement for multi-output GPs, it's just a bit restricting.

Do you propose to change it to a single MatrixDistribution?

This is certainly one option. I think it's probably a good one. IIRC the need for this has been discussed before...

If that's the case, how do we ensure the inputs and outputs are vectors of the same length?

I think this is the crux of the reason that I'm happy not to try to get the JuliaGPs way of doing multi-output things adopted here (although I would of course be happy to do so if it's something that people like the idea of!). The way that we handle this in JuliaGPs is explained here -- essentially we convert all multi-output GPs into single-output GPs by extending their inputs to also contain an integer saying which output a given input corresponds to. It seems to work remarkably well for internal stuff, because people building things on top of the API don't immediately have to care whether they're dealing with a single-output or multi-output GP, and it means that we don't have to do anything special (in the API) to handle situations where you only get one observation per output at each "input" (the term we use for this is heterotopic, because it seems to be used elsewhere). Whether it's what you want for user-facing stuff is less clear to me: it would certainly work, but it may not be the most intuitive thing if you're working with data that is essentially always vector-valued. I think people tend to think in terms of vector-valued outputs in that context, which is presumably the reason for the current API.
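
To illustrate the extended-input idea with hand-rolled tuples (JuliaGPs has its own types for this, so the sketch below is only meant to show the shape of the data):

# 4 original inputs (e.g. hours) and 3 outputs per input.
hours = 1:4
num_outputs = 3

# Extend each input with an integer saying which output it corresponds to.
extended_inputs = [(h, p) for p in 1:num_outputs for h in hours]
length(extended_inputs) == length(hours) * num_outputs   # 12 single-output "inputs"

# Heterotopic data, where not every output is observed at every hour, is just a subset:
observed_inputs = [(1, 1), (1, 3), (2, 2), (4, 1)]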

That happened in some early model development, when the Models.jl API was not very well-defined/restrictive.

Ahh, I see. Good to know the history.

@wytbella
Member

Thanks for the explanation @willtebbutt !

I think I'm on board with the input API extension.

For output, I'm less sure. I guess we need to consider what people find most intuitive and how much downstream code we would need to change as a result.

And just to double check that my understanding is correct: if we predict some multi-output variables for each, e.g. hour, independently, the lengths of the inputs/outputs are both num_hours? If we want to predict multiple hours jointly, we have to edit the inputs as well to ensure that the lengths of the inputs and outputs match, right?

@willtebbutt
Member Author

willtebbutt commented Aug 20, 2021

I think I'm on board with the input API extension.

Excellent -- I'll open a PR to update docs and test utils.

And just to double check that my understanding is correct: if we predict some multi-output variables for each, e.g. hour, independently, the lengths of the inputs/outputs are both num_hours? If we want to predict multiple hours jointly, we have to edit the inputs as well to ensure that the lengths of the inputs and outputs match, right?

Not quite. The proposal would be that the length of the inputs is always equal to num_outputs x num_hours -- essentially you treat each (output, hour) pair as a single input, hence turning a multi-output problem into a single-output one. Then the prediction would always be a vector-valued distribution (MultivariateDistribution? I always forget the type names...).
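
Concretely, a made-up example just to show the sizes involved (the MvNormal is a placeholder for whatever joint distribution a model would actually produce):

using Distributions, LinearAlgebra

num_hours, num_outputs = 24, 3

# One input per (output, hour) pair.
inputs = [(hour, output) for output in 1:num_outputs for hour in 1:num_hours]
n = length(inputs)   # num_outputs * num_hours = 72

# The prediction is then a single vector-valued distribution of matching dimension.
prediction = MvNormal(zeros(n), Matrix{Float64}(I, n, n))
length(prediction) == n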

I think it would be best to hold off doing this for now though, I agree. It's one of those APIs that seems to work really well for internals, and lets you express all of the things you might want to express, but isn't the most immediately intuitive thing.
