
Automate maintenance of generic_COBALT diagnostics #22

Closed

Conversation

kakearney

This addresses some of the discussion in #15 by automating the generation of source code related to the generic_COBALT_register_diag subroutine.

This update adds a few new files:

  • generic_COBALT_diag.yaml: I chose a YAML format because it allows table-like display, is both human- and machine-readable, and allows for commenting. In this format, it's relatively easy to scan through the many (798 currently!) diagnostics to check for consistency and to add new entries as they become necessary. I preserved most comments from the existing source code here (although I dropped a few version-control-type ones indicating when certain diagnostics had been added or updated and by whom; we can re-add those if necessary).
  • generic_COBALT_diagyaml2register_diag_body.py: A simple python script to reformat the info in the .yaml file to the appropriate Fortran source code. It can run with two different options for formatting: a) "original", which replicates the existing code, and b) "novardesc", which strips out the vardesc commands as was suggested in the linked issue.
  • generic_COBALT_register_diag_body.f90: The Fortran code auto-generated by the above python script, run with the "original" formatting option. The corresponding code in the generic_COBALT.F90 file was deleted and replaced with an include statement pointing to this file.

The idea here is that all maintenance of diagnostic registration is done by editing the .yaml file and rerunning the .py script, with no human ever directly editing generic_COBALT_register_diag_body.f90. (Down the road, we could get fancy and automate the re-running of the .py script any time an update was pushed to the .yaml file)
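To make the workflow concrete, here is a minimal, hypothetical sketch (not the actual script) of how one yaml entry could be rendered as a register_diag_field call in the "novardesc" style; the dictionary keys follow the fields used in generic_COBALT_diag.yaml:

```python
# Hypothetical sketch of the yaml -> Fortran mapping in the "novardesc"
# style; the real logic lives in generic_COBALT_diagyaml2register_diag_body.py.
def render_register_call(entry):
    """Format one diagnostic entry as a register_diag_field call."""
    return (
        "     {var_to_register} = register_diag_field(package_name, {name}, {axes}, &\n"
        "          init_time, {longname}, {units})\n"
    ).format(**entry)

entry = {
    "name": "'\"P_C_max_Di\"'",
    "longname": "'\"Diaz. Maximum Growth Rate\"'",
    "units": "'sec-1'",
    "var_to_register": "phyto(DIAZO)%id_P_C_max",
    "axes": "axes(1:3)",
}
print(render_register_call(entry))
```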

A few points for discussion:

  • What are the guidelines for including non-source code (like the .yaml and .py files) within this repo?
  • Does the MOM6 code style allow for include statements? Adding the code in this manner moves the roughly 800 lines related to this process (lines 3196-3994) out of the main module, which makes the main code more readable and reduces the chance of accidental copy-paste errors. But it does add an extra file, and possibly has some impact on the building of the source code.
  • The existing code used an interesting mix of quote styles (single vs double quotes), which I preserved in the .yaml file. Was this done for a purpose or just an artifact of different coding styles? If the latter, it may make sense to edit for consistency.

Warning: I don't yet have my gaea account, so I have not tested the Fortran code for compilation! Someone please check my work, haha.

To simplify maintenance of registering diagnostics (in the
generic_COBALT_register_diag subroutine), I've moved the info for the
COBALT diagnostics to a .yaml file.  I also added a small python script
to generate the necessary source code from that yaml file and dump
it to a separate file, which can be regenerated as needed when new
diagnostics are added or diagnostic-registering syntax is changed.
@andrew-c-ross
Contributor

This is amazing!

@andrew-c-ross
Contributor

I got this to compile and run a single-column test case on my GFDL workstation. The only change I had to make was renaming the generated Fortran file from .f90 to something else (.inc) so that the make script doesn't try to compile it on its own. (The file extension in the include line in generic_COBALT.F90 then has to be changed to match.)


# Parse and check input

if len(sys.argv) < 1:
Collaborator


Should this be if len(sys.argv) < 2: if we want to keep original as the default when users do not provide an input argument?
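For reference, a small sketch of the suggested fix (hypothetical helper name, not part of the PR), where argv[0] is the script name and a user-supplied option sits at argv[1]:

```python
# Hypothetical helper illustrating the fix suggested above: argv[0] is the
# script name, so a user-supplied option appears at argv[1], which is why
# the length check needs to be `< 2` rather than `< 1`.
def choose_format(argv):
    """Return the formatting option, defaulting to "original"."""
    return argv[1] if len(argv) > 1 else "original"

# In the script itself this would be called as: formatopt = choose_format(sys.argv)
```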


# Open file for writing

f = open("generic_COBALT_register_diag_body.f90", "w")
Collaborator


As @andrew-c-ross pointed out, the current make script (as well as the GFDL FRE workflow) will try to compile files with extensions like *.F90 and *.f90. For now, I suggest we simply change the file extension to .inc, so it becomes generic_COBALT_register_diag_body.inc. I also quickly checked the MOM6 side, and it looks like they have mostly abandoned the INCLUDE statement except for some header files. I do not think our changes here will have any impact on them, though.

Contributor


I agree, the only INCLUDE statement I am used to seeing is for MOM_memory.

However, MOM6 does use modules and has many statements like
use cobalt_register_diag_body, only : register_diag_body
I am not sure what the motivation for having all of cobalt as one module is, but this seems like a case where having multiple .f90 files with separate modules could benefit the main code. Breaking up the generic_COBALT code into multiple modules would be part of a bigger conversation.

Collaborator


Upon further consideration, I think using a separate module, as @theresa-morrison suggested, would be a better idea than using an INCLUDE statement.


# This format eliminates the creation of an intermediate vardesc variable

mystr = ( " {var_to_register} = register_diag_field(package_name, {name} {axes}, &\n"
Collaborator


Missing a comma after {name}. It should be {name}, {axes},

cmor_standard_name="sea_water_alkalinity_expressed_as_mole_equivalent", &
cmor_long_name="Surface Total Alkalinity")

vardesc_temp = vardesc("talknatos_raw","Surface Natural Total Alkalinity
Collaborator


I suggest we change the file extension to inc. Please see my comments in the Python script for the reason.

#
# Limitation Diagnostics
#
- {name: '"P_C_max_Di"' , longname: '"Diaz. Maximum Growth Rate"' , hor_grid: "'h'", z_grid: "'L'", t_grid: "'s'", units: "'sec-1'" , mem_size: "'f'", var_to_register: "phyto(DIAZO )%id_P_C_max" , axes: "axes(1:3)", missing_value: "missing_value1"}
Collaborator


By checking the MOM6 code, I can confirm that a mix of quote styles (single vs. double quotes) exists on their side as well. @theresa-morrison, do you have any ideas for that?

Contributor


I think we should choose whatever is most readable here. If using only single or only double quotes means we can avoid having both '"x"' and "'x'", that would be good -- I can see that tripping up a lot of users.

Contributor

@andrew-c-ross Mar 28, 2024


Can we remove most of the quotes in the yaml file and have them inserted by the python code? I think we would just need to keep single quotes in the yaml file for strings with special characters like %. EDIT: I think the %s are okay here; it's mostly just the axes(#:#) entries that need to be quoted.
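A sketch of this suggestion (hypothetical helper, not part of the PR): store bare strings in the yaml and let the python script add the Fortran quotes, passing code-valued fields through untouched. The split into "string" vs "code" fields is an assumption for illustration:

```python
# Hypothetical helper sketching the suggestion above. Field names follow
# generic_COBALT_diag.yaml; which fields count as string-valued is an
# assumption for illustration.
STRING_FIELDS = {"name", "longname", "units", "hor_grid", "z_grid", "t_grid", "mem_size"}

def add_fortran_quotes(entry):
    """Quote string-valued fields; pass code-valued fields (axes,
    var_to_register, missing_value) through unchanged."""
    return {key: ("'{}'".format(val) if key in STRING_FIELDS else val)
            for key, val in entry.items()}
```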

@andrew-c-ross
Contributor

andrew-c-ross commented Mar 28, 2024

RE: include statements, one idea for an alternative would be to generate a complete module named something like generic_COBALT_utils that has a subroutine with all of the diagnostic registration code. Longer-term I think we can also automatically generate the diagnostic sending calls with just one addition to the yaml file, so it might be helpful to be able to put everything auto generated into one module file.

EDIT: I guess this could create some ugly circular dependencies though, and not sure how cobalt's use of module variables will impact that.

@theresa-cordero
Contributor

I think there are ways that we could add auto generated modules and manually generated modules to cobalt without introducing circular dependencies. It would definitely take a bit of work and involve moving things around within the ocean_BGC directory, but I think there are real advantages to not having all of cobalt in one 16,000 line .f90 file.

# register_diag_field that were passed by position (rather than name/value) in the original
# generic_COBALT code.

primary = ["name", "longname", "hor_grid", "z_grid", "t_grid", "units", "mem_size",
Contributor


Since some of these appear to be always the same (hor_grid, t_grid, mem_size, missing_value), could we put defaults for those in the python file that can still be overwritten by the yaml if ever necessary?

It also seems like technically we only need one of z_grid and axes, since 2d variables have z_grid=1 and axes=axes(1:2) and 3D variables have z_grid=L and axes=axes(1:3).
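A sketch of the defaults idea (hypothetical helper, not part of the PR); the default values mirror the ones that repeat throughout generic_COBALT_diag.yaml, and a yaml-supplied value would still win when present:

```python
# Hypothetical sketch of the defaults idea; the values mirror the ones that
# appear repeatedly in generic_COBALT_diag.yaml, but the helper itself is
# not part of the PR.
DEFAULTS = {
    "hor_grid": "'h'",                  # horizontal grid
    "t_grid": "'s'",                    # time grid
    "mem_size": "'f'",                  # memory size
    "missing_value": "missing_value1",
}

def with_defaults(entry):
    """Merge a yaml entry over the defaults; yaml-supplied values win."""
    return {**DEFAULTS, **entry}
```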

Author


There are some variables on the "1" z-axis (i.e. neither 2D nor 3D) where the axes variable can vary (either axes(1:2) or axesTi(1:1))... these are things like runoff and bottom fluxes.

But there is also a comment there indicating that the value of the axes variable is never really used, other than for a length check, so there is possibly some simplification that could be done there.

Collaborator


I like the idea of automating code generation with Python. Here are some additional thoughts if we want to pursue this path. The pro is that Python is straightforward and easy for new-generation users to follow. The con is that if we decide to mix different languages, we will have to maintain both of them. For instance, we will probably need unit tests to ensure that the functions in the Python scripts work properly, which means enforcing stricter rules for those scripts. Taking this Python script as an example, we should refactor the main code as a function:

import yaml

def build_register_diag_body(diag, formatopt="original"):
    primary = ["name", "longname", "hor_grid", "z_grid", "t_grid", "units", "mem_size",
               "var_to_register", "axes"]

    output = []
    for d in diag:
        nvstr = ""
        for key, value in d.items():
            if key not in primary:
                nvstr = ", &\n          ".join([nvstr, "{} = {}".format(key, value)])

        if formatopt == "original":
            mystr = ("     vardesc_temp = vardesc({name}, {longname}, {hor_grid}, {z_grid}, {t_grid}, {units}, {mem_size})\n"
                     "     {var_to_register} = register_diag_field(package_name, vardesc_temp%name, {axes}, &\n"
                     "          init_time, vardesc_temp%longname, vardesc_temp%units"
                    ).format(**d)
            mystr = mystr + nvstr + ")\n"
        elif formatopt == "novardesc":
            mystr = ("     {var_to_register} = register_diag_field(package_name, {name}, {axes}, &\n"
                     "          init_time, {longname}, {units}"
                    ).format(**d)
            mystr = mystr + nvstr + ")\n"
        
        output.append(mystr)

        if d['name'] == 'b_di14c': # begin radiocarbon block
            output.append("     if (do_14c) then\n")
        elif d['name'] == 'jdo14c': # end radiocarbon block
            output.append("     endif\n")

    return '\n'.join(output)

if __name__ == "__main__":
    import sys

    if len(sys.argv) < 2:
        formatopt = "original"
    else:
        formatopt = sys.argv[1]

    with open('generic_COBALT_diag.yaml', 'r') as f:
        diag = yaml.safe_load(f)

    output = build_register_diag_body(diag, formatopt)

    with open('generic_COBALT_register_diag_body.f90', 'w') as f:
        f.write(output)

By doing this we can write a unit test for this specific function.

We don't have to address this issue in this PR, but it's something we'll have to think about. I'm open to any further thoughts.

Author


I'm in favor of this refactoring. I'll leave the code as is for now because my python fluency is a little shaky when it comes to "doing things the right way" syntax.

A comment on this toy example: the 'b_di14c' tracer is the first one in the radiocarbon block, so the if statement needs to be appended before the mystr addition in that case. I added a note in the YAML file advising developers to be cautious about inserting tracers within that block, but it might be worth considering a more robust check for any specific diagnostics that include special handling, like the radiocarbon ones.
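A sketch of what that ordering fix could look like (illustrative only; the registration line is reduced to a comment placeholder): open the guard before the first radiocarbon tracer and close it after the last:

```python
# Illustrative sketch of the ordering fix described above: the if (do_14c)
# guard must be emitted *before* the entry for 'b_di14c' (the first
# radiocarbon tracer) and the endif *after* 'jdo14c' (the last one).
# The registration line is a placeholder comment, not the real output.
def emit_with_radiocarbon_guard(names):
    """Emit one placeholder line per diagnostic, wrapping the
    radiocarbon block in an if (do_14c) guard."""
    out = []
    for name in names:
        if name == "b_di14c":
            out.append("     if (do_14c) then\n")
        out.append("     ! register {} here\n".format(name))
        if name == "jdo14c":
            out.append("     endif\n")
    return out
```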

Collaborator

@yichengt900 left a comment


@kakearney, thank you for your prompt action. This is truly amazing work (as you can see, everyone is encouraged by your work and cannot wait to share their thoughts 💯)! Overall, I love the whole idea. I only have a few comments on the code and how we handle the include statement (include vs modules).

@kakearney
Author

Great to see all the feedback so soon! And I agree that a full discussion of how the code might be refactored would be good. There are many other places in the code (the g_tracer_add_param, g_tracer_add, g_tracer_get_values, g_send_data, and allocation and deallocation calls) that are similarly repetitive and might benefit from the same table -> autogenerated code treatment.

I updated my branch to fix the typos pointed out in the .py script and to change the extension of the autogenerated file to .inc.

@yichengt900
Collaborator

@kakearney, thanks! The updates look good to me. I will start the CI test. @charliestock, just keeping you in the loop. @kakearney opened a great PR as a starting point. I recommend we spend some time in next week's cobaltV3 meeting to further discuss how we are going to handle code refactoring.

@yichengt900
Collaborator

yichengt900 commented Mar 28, 2024

@kakearney, oops! It looks like the CI is complaining that the head commit for this pull request is not ahead of the base commit (dev/cefi repo). Could you please check and make sure your feature branch is up-to-date with the current main branch?

@kakearney
Author

Oops, overlooked the overnight (to me) update... should be synced now.

@yichengt900
Collaborator

Great job, @kakearney 🥇 ! The PR passed the CI test. I'll approve it for now, but let's hold off on merging until we discuss code refactoring in our next COBALTv3 meeting.

@yichengt900
Collaborator

We have initiated the diagnostics cleanup process under #109. Once that is completed, we can revisit the option for automated diagnostic code generation. I will close this PR for now.

@yichengt900 yichengt900 closed this Jan 7, 2025
Labels: CodeCleanUp, enhancement