Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an API to fetch repositories and list their branches. #12

Merged
merged 3 commits into from
Jan 14, 2025
Merged

Conversation

plietar
Copy link
Member

@plietar plietar commented Jan 9, 2025

This API is a replacement for the one currently provided by outpack_server.

A single instance of orderly.runner API is able to operate over an arbitrary number of Git repositories. The first time a URL is fetched, a git clone is performed. Subsequent fetches are incremental using git fetch.

Repositories are each stored in their own directory, named using the hash of the URL. Using the hash makes it easy to support arbitrary upstreams, without having to worry about encoding the URL characters as file names.

The existing endpoints still operate on the "root" repository, which is assumed to exist before the API is launched. Eventually all those APIs will be migrated to have a url argument.

@plietar plietar force-pushed the mrc-6120 branch 7 times, most recently from f0d9b3a to 9da5588 Compare January 10, 2025 09:22
@plietar plietar force-pushed the mrc-6120 branch 3 times, most recently from d601784 to fa82fc8 Compare January 10, 2025 09:41
docker/test/run-test Outdated Show resolved Hide resolved
@plietar plietar requested a review from M-Kusumgar January 10, 2025 09:43
@plietar plietar marked this pull request as ready for review January 10, 2025 09:43
Copy link
Contributor

@M-Kusumgar M-Kusumgar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got a couple of comments but on the whole looks good!

R/api.R Outdated Show resolved Hide resolved
#' is raised.
#' @return the path to the repository.
repository_path <- function(base, url, check = TRUE) {
hash <- openssl::sha1(url)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont mind this, but also makes it less easy for us to poke around, you said its to help with weird characters and stuff in the url, whats a character that would break it that github allows?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I do get the less pokeable argument and was a bit worried about it. It is still possible to get the URL from a cloned repo though by doing git remote get-url origin, although of course it's not as obvious at a glance than just looking at directory names.

The "weird character" comes in large part from the fact that I did not want to assume anything about GitHub here, so we need to encode the full URL, not just org name and repo name. That include : and @, eg. [email protected]:org/repo (technically Linux allows any characters but NUL and / in filenames, but not really a game I want to play). We would need to deal with slashes (creat nested directories?), including leading slashes if the URL is actually a file path (which the tests do a lot). You also have to consider what the effect of . and .. in the URL has on the filesystem (should http://foo/bar/.. and http://foo map to the same repository in the cache?).

I suppose the alternative is to do a percent encoding of the URL, which would work to avoid all the special characters (but it would do nothing about . and ..), but you still have a problem if the URL is longer than 256 characters, which is the limit for filenames on Linux. Not a likely scenario, but a possible one.

Using a hash of the URL was a very easy to workaround all these problems by guaranteeing a fixed length, unique name using only safe characters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah okay i was just thinking a very simple directory structure like org/repo wasnt thinking about the whole url although is there any reason we even want the whole url? and github repo are pretty clean, they only allow the following:
image
and also . and .. are reserved names so i think thats chill too

but yh youre right about it being us relying on github

i do think in practice however, none of these issues will ever affect us with the org/repo directory structure but the hash with the whole url doesnt care about where the repo actually is so more general i guess

happy to leave it to your judgement!

if we do it with the hashes though, i think it would be useful to do a tiny script in another pr that prints a mapping of hashes to origin url by doing a simple dir walk so that we can easily debug

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any reason we even want the whole url?

Just being able to use a file path as URL in unit tests is a pretty good use case, instead of having to mock GitHub or have some tedious fake setup

if we do it with the hashes though, i think it would be useful to do a tiny script in another pr that prints a mapping of hashes to origin url by doing a simple dir walk so that we can easily debug

Sure can do

Comment on lines +79 to +86
branches = data.frame(
name = branches$name,
commit_hash = branches$commit,
time = as.numeric(branches$updated),
message = message,
row.names = NULL
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a suggestion, just curious as to why you used data.frame instead of list, is it because it fills in the data frame and serialises each row and you dont have to wrap each branch name in scalar for example?

also are we not worrying about the new line characters? rich was worrying about them for some reason in the past, i cant remember what we decided on that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a suggestion, just curious as to why you used data.frame instead of list, is it because it fills in the data frame and serialises each row and you dont have to wrap each branch name in scalar for example?

Err, I don't know. It just works. I don't even know how what the easy way to turn this into a list would be.

If I just replace data.frame by list and do no other transformation then the result is radically different. It would be one object with fields name, commit_hash, ... where each entry is an array of strings, which is not idiomatic JSON at all. Using data.frame gives the expected result because jsonlite has special support for it. Orderly2 does this all the time when serializing, eg. the list of files in a packet.

The other way would be to do the pivot myself and generate a list of lists, where each nested list contains a bunch of scalars. Doable but way more tedious and verbose.

also are we not worrying about the new line characters? rich was worrying about them for some reason in the past, i cant remember what we decided on that

I never understood this concern. JSON can serialize newline characters just fine. But yeah you are right this breaks with the outpack_server API. I can go back to using a list of strings if you prefer, but it makes no sense to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yh i mean feel free to include newline characters, we can always fix it pretty easily if needed, do you think its worth having the type as an array of 1 though? just thinking about the fact that we may have to migrate the database to be just string and then if we implement a newline char change in the future itll have to be migrated again to string arrays, idm either way

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just thinking about the fact that we may have to migrate the database to be just string and then if we implement a newline char change in the future itll have to be migrated again to string arrays

This doesn't show up in the database does it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe not, im not 100% sure but it shows up in the data class for the response, should be easy enough to change but might wanna make a quick ticket for it so we dont forget!

inst/schema/repository_branches.json Show resolved Hide resolved
This API is a replacement for the one currently provided by
outpack_server.

A single instance of orderly.runner API is able to operate over an
arbitrary number of Git repositories. The first time a URL is fetched, a
`git clone` is performed. Subsequent fetches are incremental using `git
fetch`.

Repositories are each stored in their own directory, named using the
hash of the URL. Using the hash makes it easy to support arbitrary
upstreams, without having to worry about encoding the URL characters as
file names.

The existing endpoints still operate on the "root" repository, which is
assumed to exist before the API is launched. Eventually all those APIs
will be migrated to have a `url` argument.
@plietar plietar requested a review from M-Kusumgar January 13, 2025 17:02
Copy link
Contributor

@M-Kusumgar M-Kusumgar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve

@@ -1,6 +1,6 @@
parse_main <- function(args = commandArgs(TRUE)) {
usage <- "Usage:
orderly.runner.server [options] <path>
orderly.runner.server [options] <path> <repositories>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here too right?

Suggested change
orderly.runner.server [options] <path> <repositories>
orderly.runner.server [options] <path> <repositories-base-path>

@plietar plietar merged commit b467256 into main Jan 14, 2025
7 checks passed
@plietar plietar deleted the mrc-6120 branch January 15, 2025 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants