RHEL AI / InstructLab CPT support #117
Conversation
GET localhost:8000/api/v1/ilab/runs?benchmark=ilab will query the ilab.crucible OpenSearch instance and return a list of ilab benchmark runs.
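For example, from JavaScript (the exact response shape is an assumption here, not taken from the PR):

```js
// Query the new ilab runs endpoint and log the returned run list
// (response assumed to be a JSON list of run summaries).
(async () => {
  const response = await fetch("http://localhost:8000/api/v1/ilab/runs?benchmark=ilab");
  const runs = await response.json();
  console.log(runs);
})();
```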
Add Crucible readme file. Cleanups and refactoring
Also added the option to override the default graph title generator using the new `Graph.title` field.
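A hypothetical illustration of the override (the surrounding graph-spec schema is assumed for illustration, not taken from the PR):

```js
// A graph spec may now carry an optional "title" that overrides the default
// generated graph title; the run id and metric name here are placeholders.
const graphSpec = {
  run: "example-run-id",
  metric: "example::metric",
  title: "My custom graph title", // new Graph.title override
};
```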
Initial comments on my own re-re-re-review. 🤣
The backend comments are already fixed, but I'm posting them "for posterity" (whoever Mr Posterity may be); I'm not quite ready to push yet. I haven't done anything about the UI comments ...
export const ILABS_JOBS_API_V1 = "/api/v1/ilab/runs";
export const ILAB_GRAPH_API_V1 = "/api/v1/ilab/runs/";
Why are these identical? Are both used? Could they be combined?
After loading the page many times, I'm starting to think that we should open the graph accordion with just the primary metric(s) graphed. That is, we shouldn't wait until someone makes a selection from the secondary metric pulldown -- but when they do, we can add the second graph.
This cleans up my direct API call to get the run's periods for graphing, to use a separate action and a reducer. I also experimented with trying to improve error diagnosis by looking at some of the error responses to "toast" instead of just saying something went wrong.
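A minimal sketch of that action/reducer split (the endpoint path, action type, and state shape here are assumptions for illustration, not the PR's exact code):

```js
import axios from "axios";

const SET_ILAB_PERIODS = "SET_ILAB_PERIODS";

// Thunk action: fetch a run's periods, then hand them to the reducer.
export const fetchPeriods = (runId) => async (dispatch) => {
  const response = await axios.get(`/api/v1/ilab/runs/${runId}/periods`);
  dispatch({ type: SET_ILAB_PERIODS, payload: { runId, periods: response.data } });
};

// Reducer: store periods keyed by run id so several runs can be expanded at once.
export const ilabReducer = (state = { periods: {} }, action) => {
  switch (action.type) {
    case SET_ILAB_PERIODS:
      return {
        ...state,
        periods: { ...state.periods, [action.payload.runId]: action.payload.periods },
      };
    default:
      return state;
  }
};
```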
Add a Crucible `close` method, and use a FastAPI yield dependency to ensure every API connection is closed cleanly.
This is what I have right now. I'm working on `ilab.py` and `crucible_svc.py`.
@@ -1,5 +1,5 @@
-# Openshift Performance Dashbaord
+# Openshift Performance Dashboard
This can be factored out into a tidying PR.
For a one letter swap? That seems ... excessive? 😦
That's why you bundle the tidyings together, but yeah, I'll let it slide.
The real question comes from the other side: at what point, while finding small incidental fixes and helpful refactorings during a development task, does it become necessary (or even "advisable") to stop "making progress" in order to pull all those changes out into a separate PR?
Reformatting an entire codebase (which I'd love to do to cpt-dashboard, by the way, because it's wildly inconsistent and often ugly), for example, is clearly not something to even attempt in the middle of something else.
I backed out the fallback exception handler for you, which I kinda regret, as I did some refactoring and had to guess at where a problem was because there's no traceback. (I thought I might try to investigate where FastAPI or some other part of the framework is dropping the ball, but it's awkward to even figure out where to start, and I've not had time.) This stuff could go in a separate PR, sure; but one needs it when writing significant new code, and we don't have a quick pipeline in cpt-dashboard, so separating it means it might as well not be there at all. (And possibly never will be.) 🤷🏻
Minor stuff like fixing a one-letter typo isn't worth tracking separately, and putting it off inevitably means it never gets done. Fixed in passing, it's isolated, it's trivial to review, and it gets done.
And I realized later that I hadn't backed out the fallback handler ... but it somehow stopped working for me. After some investigation I realized that it was because I added the yield "dependency" mechanism to manage the Crucible connection, and my `except` block there was absconding with my "unhandled" exception. I think I fixed it. Again, while this is technically separate from the "InstructLab/Crucible support", it's an important development tool I can't believe people were living without before (though I still don't understand why the built-in FastAPI unhandled exception handler isn't logging a traceback, even with `FastAPI(debug=True, ...)`, and some effort spent trying to figure that out was ultimately a complete waste of time... 😦)
const onSelect = (_event, value) => {
  console.log("selected", value);
  const run = value.split("*");
  //setSelected(run[1].trim());
is this dead code?
frontend/src/actions/ilabActions.js (Outdated)

export const setSelectedMetrics = (id, metrics) => (dispatch, getState) => {
  const metrics_selected = cloneDeep(getState().ilab.metrics_selected);
  // if (id in metrics_selected) {
is this dead code?
frontend/src/actions/ilabActions.js (Outdated)

const { start_date, end_date, size, offset } = getState().ilab;
const response = await API.get(API_ROUTES.ILABS_JOBS_API_V1, {
  params: {
    ...(start_date && { start_date }),
what does this line mean?
Hmm, interesting. `...` is the spread operator, which unrolls an array or object. I think `start_date && { start_date }` should result in `...{ start_date }` (which will be spread to `start_date: start_date` in the surrounding object as an API query parameter) iff `start_date` is non-null.
But when I try a simplified version of this interactively in the Chrome console, it's unhappy about the paren, so I wonder if this'll actually work:
Uncaught SyntaxError: Unexpected token '('
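For what it's worth, the pattern does appear to work in context: spreading a falsy value in an object literal is a no-op, and the console error shows up only because spread syntax isn't legal outside an object or array literal (or a call). A minimal sketch:

```js
// Conditional spread: include a key only when its value is truthy.
const start_date = "2024-06-01";
const end_date = null;

const params = {
  ...(start_date && { start_date }), // truthy: spreads { start_date } into params
  ...(end_date && { end_date }),     // falsy: spreads null, which is a no-op
};

console.log(params); // { start_date: "2024-06-01" }

// Outside an object literal the same expression is a syntax error,
// which is why the bare console experiment failed:
//   ...(start_date && { start_date })   // SyntaxError: Unexpected token '('
```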
@MVarshini ??
These lines are still pretty confusing. I think a comment in the code may be worth it.
Some comments for the backend.
+ other review feedback
I don't see any blockers. I'll leave the final approval up to the developers that maintain this repository.
const checkAndFetch = (_evt, newPage) => {
  if (props.type === "ilab") {
    dispatch(fetchNextJobs(newPage));
  }
};
Is this because it's only fetched per page for iLab, and the others do one fetch for all pages?
data={getGraphData(item.id)[0]?.data}
layout={getGraphData(item.id)[0]?.layout}
What's the reason it's a list? And why the first item?
import { Title } from "@patternfly/react-core";
import { uid } from "@/utils/helper";

const MetaRow = (props) => {
It may be helpful to add a brief comment stating what a meta row is for.
return hasData;
};
/* Metrics select */
Is there a reason a block comment is used for a single line?
status: "Status",
};

return (
Is it possible for this thing to be split up? If so, do you think it would be worth it?
I suppose it could be refactored into subcomponents; whether that'd be worthwhile is a different question. There isn't a lot of logic here beyond the external components it's already using.
+ add some method documentation + misc review feedback
I'll split this up along functional lines. That'll be a bit neater after #118 is resolved, as it factors out the Python dependency changes.
Type of change
Description
Support CPT needs for the RHEL AI / InstructLab performance team.
This PR adds three components: a backend `crucible_svc` service layer for querying a Crucible OpenSearch data store, an `ilab` API built on it (e.g., `GET /api/v1/ilab/runs`), and frontend UI support for browsing and graphing InstructLab runs.
The `crucible_svc` is intended to be "reasonably general purpose" to support use by additional projects over time.

Related Tickets & Documents
Various Jira stories under Epic PANDA-496.
Checklist before requesting a review
Testing
InstructLab CPT is using a persistent Crucible controller system in RDU3, tied to a 4-way L40S test system. The data store (a private OpenSearch instance) contains a set of Crucible runs capturing both training and SDG runs.