refactor(api, robot-server): CSV parameter improvements #15907

Merged

merged 11 commits on Aug 7, 2024

Conversation

jbleon95
Contributor

@jbleon95 jbleon95 commented Aug 6, 2024

Overview

This PR follows up on #15855 with various small refactors/improvements.

The biggest change in this PR is refactoring how we open the CSV parameter files. Previously we opened them when the parameter was first initialized. This proved to be a potential issue: we only closed the files when we reached the execute step of running the protocol (i.e. in a finally clause within run_protocol), yet there was a small but real chance of never reaching that step (for instance, if the initial home command failed unexpectedly). To solve this, we now pass the Path object to the CSVParameter interface and only open the file when it is first accessed in the protocol, meaning the open will always occur within that try/except block.
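Roughly, the deferred-open pattern described above looks like the following sketch. The class and method names (`CSVParameter`, `parse_as_csv`) come from the PR, but the body here is an illustrative assumption, not the actual Opentrons implementation:

```python
import csv
from pathlib import Path
from typing import List, Optional, TextIO


class CSVParameter:
    """Illustrative sketch: hold only the Path; open the file lazily."""

    def __init__(self, csv_path: Path) -> None:
        self._path = csv_path
        self._file: Optional[TextIO] = None

    @property
    def file(self) -> TextIO:
        # The file is opened only when the protocol first touches it,
        # so the open always happens inside the run's try/finally.
        if self._file is None:
            self._file = self._path.open("r")
        return self._file

    def parse_as_csv(self) -> List[List[str]]:
        self.file.seek(0)
        return list(csv.reader(self.file))

    def close(self) -> None:
        # Called from the run's cleanup; safe even if never opened.
        if self._file is not None:
            self._file.close()
            self._file = None
```

Because the handle is created on first access rather than at parameter initialization, a run that fails before the protocol body executes never leaves a dangling open file.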

Another change introduced is properly raising a 400 error for POSTs to the /runs, /protocols and /protocols/{protocolId}/analyses endpoints when referencing data files (for runtime parameter files) that don't exist.

API version is now also enforced for add_csv_file, which requires your protocol to be at version 2.20. The API version is also now passed down to the CSVParameter interface, though it is not yet used to gate anything.

Test Plan and Hands on Testing

Tested end to end on hardware with the same protocol/csv files as #15855.

Changelog

  • Pass Path to CSVParameter interface and open files within that.
  • Raise a 400 error when referencing non-existent data file IDs for /runs, /protocols and /protocols/{protocolId}/analyses
  • Require API version 2.20 for add_csv_file
  • Fixes/improvements for testing CSV files/file opening in protocol api

Review requests

Risk assessment

Low.

@jbleon95 jbleon95 requested a review from a team as a code owner August 6, 2024 20:44
Contributor

@SyntaxColoring SyntaxColoring left a comment

Cool, thanks!


# Open a new file handler for the temporary file with read-only permissions and close the other
read_only_temp_file = open(temporary_file.name, "r")
temporary_file.close()
Contributor

I'm not sure if this is the cause of your Windows testing woes, but: NamedTemporaryFile will, by default, delete temporary_file when it's .close()d. So we're opening a new handle to the file and then deleting the underlying file. That might work, I dunno, but it seems iffy.

Contributor

Hypothetically, what would we lose by just opening the original file in read-only mode and directly passing it to the protocol? I guess a user could access file.name and manually muck with things that way, and that would be bad. I guess there could also be some robot-server bug that causes the original file to be modified concurrently, and that would also be bad. Maybe those are reasons enough to avoid passing the file directly, I dunno.

Contributor

Another thing that you might consider here is to use shutil.copyfile() instead of writing into a NamedTemporaryFile. That will probably be faster because it can happen entirely in the OS, or even entirely in the filesystem, without bytes having to go through Python.
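For reference, the suggested shutil.copyfile() approach might look something like this sketch (the function name and paths are hypothetical, not from the PR):

```python
import shutil
import tempfile
from pathlib import Path
from typing import TextIO


def copy_to_readonly_temp(source: Path) -> TextIO:
    """Copy the CSV into a temp directory and return a read-only handle."""
    temp_dir = Path(tempfile.mkdtemp())
    temp_path = temp_dir / source.name
    # shutil.copyfile streams the bytes at the OS level, avoiding the
    # read-into-Python / write-back round trip of a manual copy loop.
    shutil.copyfile(source, temp_path)
    return temp_path.open("r")
```

This also sidesteps the NamedTemporaryFile reopen-by-name question entirely, since the copy has a stable path of its own.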

Contributor Author

The NamedTemporaryFile issue was related exclusively to Windows. From the Python docs (https://docs.python.org/3.9/library/tempfile.html#tempfile.NamedTemporaryFile):

Whether the name can be used to open the file a second time, while the named temporary file is still open, varies across platforms (it can be so used on Unix; it cannot on Windows).

So it seems safe to do and unrelated to the file deletion on close. Since I want to get this in for the next alpha, I'm keeping this method, but I'll look into shutil.copyfile() in case there are bugs related to this or it proves to be a more stable approach.
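For the record, the Unix-only reopen-by-name behavior quoted from the docs can be demonstrated standalone (this is a demo, not code from the PR, and per those same docs it will not work on Windows):

```python
import os
import tempfile

# Write through a NamedTemporaryFile, then open a second read-only
# handle by name BEFORE closing (and thus deleting) the original.
temporary_file = tempfile.NamedTemporaryFile("w", suffix=".csv")
temporary_file.write("a,b\n1,2\n")
temporary_file.flush()

read_only_temp_file = open(temporary_file.name, "r")
temporary_file.close()  # unlinks the underlying file

# On Unix, the already-open read handle keeps working even though
# the name has been removed from the filesystem.
contents = read_only_temp_file.read()
read_only_temp_file.close()
```

After `close()`, the path no longer exists, but the second handle's file descriptor remains valid until it too is closed.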

for name, file_id in rtp_files.items()
}
except FileIdNotFoundError as e:
raise FileIdNotFound(detail=str(e)).as_error(status.HTTP_400_BAD_REQUEST)
Contributor

I don't have a strong opinion either way, but when a run is created with a bad protocol, we return a 404, not a 400. Should we do that here, too?

Member

You mean a 422?
Ya I guess that works too.

Contributor

Nope, I mean a 404:

except ProtocolNotFoundError as e:
raise ProtocolNotFound(detail=str(e)).as_error(status.HTTP_404_NOT_FOUND) from e

Member

Oh sorry, I just realized you were talking about the runs endpoint and not protocols endpoint. I still think 400/422 is more appropriate for the reason below

Contributor Author

I think 404 would have also worked, but after discussion with Nick and Shlok we all came to an agreement on 400, which is not an incorrect response code for the situation.

Comment on lines +332 to +333
except FileIdNotFoundError as e:
raise FileIdNotFound(detail=str(e)).as_error(status.HTTP_400_BAD_REQUEST)
Contributor

Same comment: 404 instead of 400, maybe?

Member

I think either a 400 or 422 is more appropriate here because 404 indicates that the URL resource is not found. I wouldn't consider a csv file as the URL's primary resource.

and parameter.value is not None
):
parameter.value.close()
parameter.value = file_path
Member

Hmm.. is it necessary to assign file path to parameter value? I think the value of the parameter is represented more closely by either the file handler or the file contents. So here's what my mental map is for CSV params:

  • name, id, path properties carry info about the file
  • value is the actual file; could be either the handler or the contents

What do you think? Also we're kinda close to the deadline so is it an easy thing to change if we want to?

Contributor Author

I made this change to help facilitate opening the file handler in the parameter itself rather than outside of it (to better couple it with the closing of files done in run_protocol), but you're right that this exposes the path, however indirectly and privately. Changing it to be the contents is a non-trivial change, and I want to get these important changes in for the next release, so that can come during future refactor work.

Contributor

@SyntaxColoring SyntaxColoring Aug 7, 2024

Is there any special significance to value here? Like is this .value directly exposed to PAPI users as a public .value property or anything? Oops, didn't see your reply, thanks!

Member

@sanni-t sanni-t left a comment

Talked offline about whether value should be the path or not, but since it's not a first-class API property, hopefully there's not much harm in exposing the path in this way. We will, however, keep an eye out for possible misuse, and also get feedback from science/FAS and change if needed.

Everything else looks good!

@jbleon95 jbleon95 merged commit b7a008b into chore_release-8.0.0 Aug 7, 2024
42 of 43 checks passed
@jbleon95 jbleon95 deleted the csv_param_improvements branch August 7, 2024 19:44
sanni-t pushed a commit that referenced this pull request Aug 12, 2024
…nd less reliance on file reading/creating (#15956)

# Overview

This PR follows up both #15855 and #15907 with further refinements and
small changes to defining and setting CSV parameters, along with behind
the scenes refactors.

The two main user/client-facing changes of this PR are changing the HTTP
status code returned when invalid data file IDs are sent to the `POST` `/runs`,
`/protocols` and `/protocols/{protocolId}/analyses` endpoints from 400
to 422, and raising a new `IncompatibleParameterError` when multiple CSV
file parameters are added to a single protocol. The change in status
code reflects that the error is not a malformed request, but rather a
request containing something that could not be processed (in this case,
the data file ID). The new error was added to align with the original
PRD and the current UI. While there's no inherent technical limitation
on multiple CSV parameters (other than theoretical limits on memory,
number of file handlers, etc.), there are unexpected UI bugs when
multiple ones are defined, and this change enforces only one per protocol.
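A minimal sketch of such a one-CSV-per-protocol guard might look like this (the error and method names follow the PR description, but the body is an assumption, not the actual implementation):

```python
class IncompatibleParameterError(ValueError):
    """Raised when a second CSV file parameter is added to one protocol."""


class ParameterContext:
    """Illustrative sketch of enforcing a single CSV parameter."""

    def __init__(self) -> None:
        self._csv_parameter_names = []

    def add_csv_file(self, variable_name: str, display_name: str) -> None:
        # The UI currently supports only one CSV parameter, so reject
        # any attempt to define a second one.
        if self._csv_parameter_names:
            raise IncompatibleParameterError(
                "Only one CSV file parameter may be defined per protocol."
            )
        self._csv_parameter_names.append(variable_name)
```

The check is purely a definition-time guard; it doesn't touch files at all.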

The other major change of this PR is a non-user-facing refactor of how
we set the `CSVParameter` interface. Rather than passing the `Path` of
the file to the interface and then using a `TemporaryFile` as the basis
for all further access of the CSV file and its contents, the contents
of the file (passed as `bytes`) are now what everything else is based
on, including CSV parsing in `parse_as_csv`. With this change, we only
ever create a temporary file handler when the user accesses `.file`.
Since this reduces the chance of there being an open file handler, a
new `file_opened` property was added to the public interface to avoid
needless file opening. This is technically user-facing code meant more
for internal usage, but I see no danger in exposing it, though if
needed we can tag it as a private, non-guaranteed method.
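The contents-first design described above could be sketched as follows (again an illustrative assumption, not the actual `CSVParameter` code):

```python
import csv
import io
import tempfile
from typing import List


class CSVParameter:
    """Sketch: bytes-first design; temp file created only on .file access."""

    def __init__(self, contents: bytes) -> None:
        self._contents = contents
        self._file = None

    @property
    def file_opened(self) -> bool:
        # Lets cleanup code skip closing when no handle was ever created.
        return self._file is not None

    @property
    def file(self):
        # The temporary file exists only if the user explicitly asks for it.
        if self._file is None:
            temp = tempfile.NamedTemporaryFile("w+", suffix=".csv")
            temp.write(self._contents.decode("utf-8"))
            temp.flush()
            temp.seek(0)
            self._file = temp
        return self._file

    def parse_as_csv(self) -> List[List[str]]:
        # Parsing works directly off the bytes; no file handle needed.
        return list(csv.reader(io.StringIO(self._contents.decode("utf-8"))))
```

Note that `parse_as_csv` never touches the filesystem, so a protocol that only reads parsed rows leaves `file_opened` false for the whole run.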

## Changelog

- Raise `IncompatibleParameterError` when `add_csv_file` in the
`ParameterContext` is used more than once
- Change status code of invalid data file IDs posted to runs and
protocols endpoints from `400` to `422`
- Refactor `CSVParameter` to only open temporary file handler when user
requests one

## Review requests

Should we label `file_opened` as private/use another way of preventing
spurious temporary file creation just to close them?

## Risk assessment

Low.