Initial stab at actx.trace_call #210
Conversation
# {{{ utilities

def _ary_container_key_stringifier(keys: Tuple[Any, ...]) -> str:
Don't these functions already exist in impl.pytato.compile? What has changed?
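For reference, a minimal sketch of what a key stringifier of this kind might do; the joining scheme here is an assumption for illustration, not the implementation from impl.pytato.compile:

from typing import Any, Tuple

def _ary_container_key_stringifier(keys: Tuple[Any, ...]) -> str:
    # Hypothetical sketch: flatten a tuple of (possibly nested) container
    # keys into a single string usable as part of a placeholder name.
    # A real implementation must also ensure the result is a valid
    # Python identifier.
    return "_".join(str(key) for key in keys)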
@@ -520,6 +520,12 @@ def compile(self, f: Callable[..., Any]) -> Callable[..., Any]:
        """
        return f

    # Supporting interface for function/call tracing in actx implementations
    def trace_call(self, f: Callable[..., Any],
                   *args, identifier=None, **kwargs):
Missing return type annotation. More generally, trace_call really should permit the user to do two things:
- Use the result of the traced call.
- Call the trace result using new data.

I'm not sure we'll be able to do both with just a single return value.
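A minimal sketch of one return shape that could support both, built around a hypothetical TracedCall wrapper (an assumption for illustration, not what the PR implements):

from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class TracedCall:
    # Hypothetical: result of evaluating f on the originally traced arguments.
    result: Any
    # Hypothetical: handle that re-invokes the trace on fresh inputs.
    retrace: Callable[..., Any]

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        # Reuse the traced function on new data.
        return self.retrace(*args, **kwargs)

# Usage (hypothetical):
#   tc = actx.trace_call(f, x)
#   out = tc.result    # use the result of the traced call
#   out2 = tc(y)       # call the trace again with new data

Returning a callable wrapper would keep the common case (just use the result) cheap while still exposing the trace for reuse.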
@@ -520,6 +520,12 @@ def compile(self, f: Callable[..., Any]) -> Callable[..., Any]:
        """
        return f

    # Supporting interface for function/call tracing in actx implementations
    def trace_call(self, f: Callable[..., Any],
                   *args, identifier=None, **kwargs):
I think you also want the interface to apply tags to the call site.
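For instance, a hedged sketch of such a signature; the tags keyword is an assumption about what the interface could look like, using pytools.tag.Tag as the tag type:

from typing import Any, Callable, FrozenSet, Optional

from pytools.tag import Tag

def trace_call(self, f: Callable[..., Any], *args: Any,
               identifier: Optional[str] = None,
               tags: FrozenSet[Tag] = frozenset(),
               **kwargs: Any) -> Any:
    # Hypothetical: attach 'tags' to the call-site node created during
    # tracing, so that later transformations (e.g. outlining decisions)
    # can find and treat this call site specially.
    ...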
Is this something that is still being worked on? If yes, I can help with some stuff here; I have some bandwidth for it this week. If not, then what's the plan for addressing illinois-ceesd/mirgecom#777? (I don't think compilation times of those orders should be tolerated.)
I pushed and updated my latest on this. My apologies for the state of it; it is very much WIP. I had a fork/branch of meshmode with a simple test of using the actx-provided trace_call.
Thanks! If you want, I can take over and maybe it will be faster that way? My ETA for this is the weekend. However, if you think you want to get accustomed to these parts of the codebase, I'm happy to step away.
The compilation times seem well under control for our current prediction, I believe. We need to run some tests at scale to make sure that this is the case, but currently the build times are hovering around less than an hour for high-order 3D with all physics and features enabled on 128 ranks. That is well into the prediction-enabling realm, and also not a giant bottleneck. The build time for serial is around 600-1000s, and 3000+s for many ranks.

I understand the current situation is brought about mostly by mitigation and not a permanent solution. One-dimensional domain decomposition currently limits our max number of MPI neighbors to 2, minimizing the impact of the splatted flux DAGs for partition boundaries. It does seem like, for our problem and geometry, we could stick with that strategy indefinitely. Solving the DAG-splat issue will, I think, help us avoid splatting the flux calculation for each unique domain boundary. That would be useful.
I would be delighted for you to take over and am very interested in following what happens here. I would get back to this after the review, but won't be anywhere near as efficient at getting it going. I believe it is close to functioning as a demo, but not close to final form.
Compilation times of 1 hour (even 20 minutes, tbh) are very bad and something that must be fixed ASAP. A related problem is that this also leads to poorly generated device code: each sub-array from a neighboring rank is assigned a small-ish buffer, and launching GPU kernels for such small problem sizes brings down the throughput.
Yep, let's do this then.
See #221.
In #221, I added a test under examples/:

(py311_env) [line@line examples]$ python how_to_outline.py
[Pre-concatenation] Number of nodes = 32701
[Post-concatenation] Number of nodes = 3710

I think we can close this PR.
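The node-count reduction comes from outlining: repeated invocations share a single traced function instead of each inlining a full copy of its expression graph. A hedged sketch of the pattern (names are illustrative, not taken from how_to_outline.py, and the re-callable return shape is the assumption sketched earlier in this thread):

# Without outlining, each call splats a copy of f's expression graph
# into the overall DAG:
results = [f(state) for state in states]

# With outlining via trace_call, the DAG holds one definition of f plus
# a small call node per invocation:
traced = actx.trace_call(f, states[0], identifier="f")
results = [traced.result] + [traced(s) for s in states[1:]]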
Augments inducer/pytato#364 by adding an array context version of trace_call that handles array container types.

cc: @lukeolson