Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support reading additional files with references #156

Closed
wants to merge 2 commits into from

Conversation

lmolkova
Copy link
Contributor

@lmolkova lmolkova commented Mar 9, 2023

Fixes #155

Adds --references-only flag to semconvgen which allows to specify additional sources for referenced (extended, included) semconv.

E.g. browser resource semconv references general user_agent.original attribute. When code for resources is generated, user_agent.original needs to be resolved, even though user_agent semconv is not generated.

Usage (tested on Java):

All semconv except resources

docker run --rm \
  -v ${SCRIPT_DIR}/opentelemetry-specification/semantic_conventions:/source \
  -v ${SCRIPT_DIR}/templates:/templates \
  -v ${ROOT_DIR}/semconv/src/main/java/io/opentelemetry/semconv/trace/attributes/:/output \
  otel/semconvgen:$GENERATOR_VERSION \
-  --exclude resource/** \
+  --reference-only resource/** \
  -f /source code \
 ...

New flag excludes semconvs from the output, but allows references to them.

Resources:

docker run --rm \
- -v ${SCRIPT_DIR}/opentelemetry-specification/semantic_conventions/resource/:/source \
+ -v ${SCRIPT_DIR}/opentelemetry-specification/semantic_conventions:/source \
  -v ${SCRIPT_DIR}/templates:/templates \
  -v ${ROOT_DIR}/semconv/src/main/java/io/opentelemetry/semconv/resource/attributes/:/output \
  semconvgen:$GENERATOR_VERSION \
+ --only resources \
+ --reference-only *.yaml \
  -f /source code \
....

I.e. from all semconv output resource conventions, and use yaml files from root folder semantic_conventions to resolve references.

@lmolkova lmolkova requested a review from a team March 9, 2023 08:16
@lmolkova lmolkova requested a review from Oberon00 as a code owner March 9, 2023 08:16
@lmolkova
Copy link
Contributor Author

lmolkova commented Mar 9, 2023

@Oberon00 @arminru can you please take a look? Apparently, codegen is broken for 1.19.0 release (open-telemetry/opentelemetry-specification#3299).

Oberon00
Oberon00 previously approved these changes Mar 9, 2023
parser.add_argument(
"--reference-only",
"-r",
help="Additional files to resolve references only (and extends) using GLOB syntax",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this feature warrants a bit more description. This suggestion is also a check if I understood this PR correctly 😃

Suggested change
help="Additional files to resolve references only (and extends) using GLOB syntax",
help="""Additional files that will be loaded and available for referencing, but will not be part of the output.
As semantic conventions can, in various ways, e.g. using `ref` or `include`, reference other conventions you
sometimes need to add files/groups to your input but you don't want the groups in there be processed by
the code generator template or markdown generator. In this case, add only the subset of files you want to
process with your template / markdown generator to `--yaml-root`, and add any additional required
(transitively depended-upon) files with `--reference-only`.
""",

@@ -223,6 +244,12 @@ def setup_parser():
parser.add_argument(
"--exclude", "-e", help="Exclude the matching files using GLOB syntax", type=str
)
parser.add_argument(
"--reference-only",
"-r",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we actually want a short-name for this flag.

errors: bool = False

def parse(self, file):
def parse(self, file, ref_only=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe instead add parse_into(self, file, models)?

If you prefer keeping the boolean argument, please add * before to enforce using named arguments for it:

Suggested change
def parse(self, file, ref_only=False):
def parse(self, file, *, ref_only=False):

semconv = SemanticConventionSet(debug=False)
semconv.parse(self.load_file("yaml/http.yaml"))
self.assertEqual(len(semconv.models), 3)
semconv.parse(self.load_file("yaml/general.yaml"), True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
semconv.parse(self.load_file("yaml/general.yaml"), True)
semconv.parse(self.load_file("yaml/general.yaml"), ref_only=True)

etc
(or see comment before for maybe using an API that does not rely on a boolean)

@tsloughter
Copy link
Member

tsloughter commented Mar 9, 2023

I hit this with Go and Python so was also looking at making this patch. My 2 cents, I expected it to work without an additional argument. As in, if I give a path to where the yaml schema files are to -f and that includes all the necessary references within it then there should be no need to add additional files through a new option.

If the user doesn't want to include all the schema files in the path they give to -f there would still be --exclude.

--

And --only would still used for limiting the generated code to specific types.

@Oberon00
Copy link
Member

Oberon00 commented Mar 9, 2023

As in, if I give a path to where the yaml schema files are to -f and that includes all the necessary references within it

I think this is already the case. IIUC, this PR is actually about adding a way to exclude files from output generation but not from internal model construction (see #156 (comment)).

--exclude I think completely excludes the files from being loaded. --only should be applied after parsing I believe.

I wonder if this PR is maybe going into the wrong direction, as it adds more reliance on the directory structure. Maybe globbing on the group ID would be better. Or maybe --only would already solve the problem, or if not, could be adjusted a bit?

@Oberon00 Oberon00 dismissed their stale review March 9, 2023 12:51

Noticed some room for improvement

@Oberon00
Copy link
Member

Oberon00 commented Mar 9, 2023

Actually I now looked at how --only is implemented, see https://github.com/open-telemetry/build-tools/blob/main/semantic-conventions/src/opentelemetry/semconv/main.py#L68 and called function. I think that approach would also work for this PR and would be much simpler. At the moment, the source file path of a convention is not part of the model, but it could be added, or alternatively, the approach could be changed to filter by group ID (regex/glob) instead of filepath.

If going for the ID-based filtering, this could mean that some group IDs in the semantic conventions might have to be renamed in order to improve "filterability". This would not change the actual attribute keys, as the "prefix" is independent of the group ID.

If going for adding the filepath to the model, the existing --exclude could be changed to filter after a complete parse of the YAML root. I think this would be a mostly backwards-compatible change as it would not change behavior in cases that already worked (except for making parsing a bit slower, but it's quite fast anyway) but allow additional cases where references are going into the excluded code. The only problem would be if you have any template/example/duplicate files in your YAML root that would actually break parsing

@Oberon00
Copy link
Member

Oberon00 commented Mar 9, 2023

Another option would be: instead of adding additional paths, you would add the actual root as --yaml-root, and you would add another option --only-paths that then restricts the output. So instead of --yaml-root some-subdirectory-I-want-to-generate-output-for --reference-only common-directory (which breaks as soon as there is one than more common directory, or would become cumbersome), you would do --yaml-root actual-toplevel-yaml-root --only-paths some-subdirectory-I-want-to-generate-output-for

I think all these options work, including the one currently in the PR, it's more of an UX decision.

@marcalff
Copy link
Member

marcalff commented Mar 9, 2023

Thanks for looking into this.

I don't have specific comments about the implementation itself, and which options to pass to generate code.

From an end-user point of view, and based on my (very limited) understanding of semconv code generation, expectations are as follows:

  • (1) Example documentation that works, see issue Semconv code generator example does not work for opentelemetry-specifications 1.14.0 #122
  • (2) Code generation separate for traces / resources / metrics (when available), so that semconv per signal can go into different files, which contains only the applicable constants (for C++, trace semconv in trace.h, resource semconv in resource.h, etc), hence the use of -only span, ...
  • (3) A stable docker run command, that does not change every spec release.

A docker command like:

docker run --rm \
  -v ${SCRIPT_DIR}/opentelemetry-specification/semantic_conventions:/source \
  --only span \
 ...

seems to meet these expectations, assuming it can be supported.

If there is a need to:

  • add more -v mount options to add some *.yaml files
  • add more --exclude options to hide some other *.yaml files

every time the spec releases a different set of *.yaml files, due to changes in internal dependencies, this becomes more difficult to deal with, and is more subject to errors.

Changing the docker run command once this is resolved is not an issue, but having to change it all the time (3) is a concern here.

If somehow these expectations are not realistic, please clarify what we (users in languages) are supposed to do then (see (1)).

Regards.

@Oberon00
Copy link
Member

Oberon00 commented Mar 9, 2023

@marcalff

Thanks for your comment. I was of the opinion that the docker command you mention (with some syntax changes such as moving the -v before run adding the image name, but that's not the point here) should work even today (even before this PR). If not, I'd like to know what the problem is.

Based on #122 (comment), it might be enough to add support for multiple types to --only, e.g. support --only span,event to include both span+event. CC @jkwatson

@lmolkova
Copy link
Contributor Author

lmolkova commented Mar 9, 2023

to address @marcalff concern, I think

  • we need to add a sample codegen dry run to -spec repo workflow so that authors of PR should see if the change breaks codegen
  • we should reasonably unify how codegen is used (I see significant differences in Java and c++)

It's also the first time for me looking into codegen and here I'd really like to reasonably unblock ourselves for 1.19.0 release and take steps to prevent it. I wonder if these steps are good enough.

Regarding reusing current functionality without a new flag and allowing multiple --only specs:
I don't think it's a good idea to allow referring to metrics attributes from spans and vice versa. We can solve it in several ways:

  1. explicit inclusion of files I do here gives the most control
  2. hardcoding what's allowed in build-tools (i.e. traces can refer to resource and attribute_group attributes, but not metric ones)

The opt2 is somewhat controversial and I'm not ready to drive it forward.

@Oberon00
Copy link
Member

Oberon00 commented Mar 9, 2023

Regarding reusing current functionality without a new flag and allowing multiple --only specs: I don't think it's a good idea to allow referring to metrics attributes from spans and vice versa.

I think that has very little to do with the issue here. I think this should be either enforced by the tool (option 2) or by spec reviewers. Option 1 would be "enforcement by difficult UX".

In general, I believe adding reliance on the directory structure instead of explicit group kinds/types is going into the wrong direction.

@marcalff
Copy link
Member

marcalff commented Mar 9, 2023

Thanks @lmolkova

to address @marcalff concern, I think

* we need to add a sample codegen dry run to -spec repo workflow so that authors of PR should see if the change breaks codegen

* we should reasonably unify how codegen is used (I see significant differences in Java and c++)

+1 on both points.

@lmolkova
Copy link
Contributor Author

lmolkova commented Mar 9, 2023

I think that has very little to do with the issue here. I think this should be either enforced by the tool (option 2) or by spec reviewers.

makes sense. Options 2 can happen any time in the future as well and not blocking.

In general, I believe adding reliance on the directory structure instead of explicit group kinds/types is going into the wrong direction.

ok, so I can do the following in this PR instead of reference-only:

  • allow multiple values for --only
  • maybe print a warning if attributes from metrics are reused on traces and vice versa

sounds good @Oberon00 ?

@lmolkova
Copy link
Contributor Author

lmolkova commented Mar 9, 2023

Thanks for your comment. I was of the opinion that the docker command you mention (with some syntax changes such as moving the -v before run adding the image name, but that's not the point here) should work even today (even before this PR). If not, I'd like to know what the problem is.

great point that only should already work (it did, but didn't produce a lot of stuff)! It should with --only span,attribute_group,event once I implement it here.

@tsloughter
Copy link
Member

At the moment, the source file path of a convention is not part of the model

Right, I think that is what I was trying to suggest.

I'm trying to catch up. I don't think having multiple values to --only will work because of how the script is used to generate a specific file. Like in Go:

$(SEMCONVGEN) -i "$(OTEL_SPEC_REPO)/semantic_conventions/." --only=span -p conventionType=trace -f trace.go -t "$(SEMCONVPKG)/template.j2" -s "$(TAG)"

And Python:

docker run --rm \
  -v ${SCRIPT_DIR}/opentelemetry-specification/semantic_conventions/trace:/source \
  -v ${SCRIPT_DIR}/templates:/templates \
  -v ${ROOT_DIR}/opentelemetry-semantic-conventions/src/opentelemetry/semconv/trace/:/output \
  otel/semconvgen:$OTEL_SEMCONV_GEN_IMG_VERSION \
  -f /source code \
  --template /templates/semantic_attributes.j2 \
  --output /output/__init__.py \
  -Dclass=SpanAttributes

@lmolkova
Copy link
Contributor Author

lmolkova commented Mar 9, 2023

I don't think having multiple values to --only will work because of how the script is used to generate a specific file. Like in Go:

we'll have to update generate scripts anyway and the common pattern can be:

-v ${SCRIPT_DIR}/opentelemetry-specification/semantic_conventions:/source \
--only  span,attribute_group,event

this will preserve current attribute locations in files.

an alternative could be to run script per semconv type and move attributes into individual files

-v ${SCRIPT_DIR}/opentelemetry-specification/semantic_conventions:/source \
--only  span
-v ${SCRIPT_DIR}/opentelemetry-specification/semantic_conventions:/source \
--only  attribute_group
-v ${SCRIPT_DIR}/opentelemetry-specification/semantic_conventions:/source \
--only  event

@lmolkova
Copy link
Contributor Author

Did multiple --only here - #157, tried on Java, works just fine and much simpler.

@Oberon00
Copy link
Member

Should attribute_group even be required to be mentioned? I think for referencing, you don't need to include it in --only

@lmolkova
Copy link
Contributor Author

Should attribute_group even be required to be mentioned? I think for referencing, you don't need to include it in --only

good point. thought about it and I think it's totally valid to generate md or template for an attribute group.
E.g. span-general attributes are an attribute_group.

@lmolkova
Copy link
Contributor Author

closing in favor of #157

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support referencing attributes across signals in codegen
4 participants