Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] Remove code to reinitialise random numbers on fork() #2409

Closed
wants to merge 2 commits into from

Conversation

ghalliday
Copy link

Fixes #2408

Changes

Remove the AtFork hook in the random number generator which potentially causes memory corruptions (as well as being inefficient and unnecessary).

I realise the regression test case will need deleting, but have left that until the change is agreed in principle.

  • CHANGELOG.md updated for non-trivial changes
  • Unit tests have been added
  • Changes in public API reviewed

@ghalliday ghalliday requested a review from a team November 16, 2023 17:34
Copy link

linux-foundation-easycla bot commented Nov 16, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: ghalliday / name: Gavin Halliday (ad3bb56)
  • ✅ login: ThomsonTan / name: Tom Tan (3c140a4)

Copy link

codecov bot commented Nov 16, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (064fef0) 87.06% compared to head (3c140a4) 87.07%.
Report is 48 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2409      +/-   ##
==========================================
+ Coverage   87.06%   87.07%   +0.02%     
==========================================
  Files         199      198       -1     
  Lines        6079     6077       -2     
==========================================
- Hits         5292     5291       -1     
+ Misses        787      786       -1     
Files Coverage Δ
sdk/src/common/random.cc 100.00% <100.00%> (+4.17%) ⬆️

... and 2 files with indirect coverage changes

@lalitb
Copy link
Member

lalitb commented Nov 17, 2023

I am not sure if we should remove code to reseed the random number on fork(). The random number generator is used for generating trace_id/span_id. If the RNG state is duplicated across forked processes, it could affect the randomness of the sampling process, potentially leading to biased sampling results.

Thinking out loud, should we have a thread local flag to indicate if a fork has occurred, and set that in fork handler. Something like this:

class TlsRandomNumberGenerator {
  ....
  public: static FastRandomNumberGenerator & engine() noexcept {
    if (forked_) {
      Seed();
      forked_ = false;
    }
    return engine_;
  }

  private: 

  static thread_local bool forked_;

  static void OnFork() noexcept {
    forked_ = true;
  }
}

This will add an extra conditional check for each span creation, so that need to be considered too, specifically when most of the applications mayn't be doing any fork.

@lalitb
Copy link
Member

lalitb commented Nov 17, 2023

Not directly related to this PR, but it is also possible to plugin a custom random-id generator if the existing one doesn't suffice -

std::unique_ptr<IdGenerator>(new RandomIdGenerator())) noexcept;

@ghalliday
Copy link
Author

Once a process has been forked it is legal to do very little until it calls exec() or similar to switch to a different process. I don't think it would be legal to call any of the open telemetry functions. See the documentation in https://man7.org/linux/man-pages/man3/pthread_atfork.3.htmlin the linked issue.

I know this change was made a while back in response to an issue, but I could not work out from the report what the application was aiming to do in the forked() child process. I suspect the code in the application should have been changed rather than making this change here.

@lalitb
Copy link
Member

lalitb commented Nov 17, 2023

Once a process has been forked it is legal to do very little until it calls exec() or similar to switch to a different process.

The code I shared is only setting a flag in the atfork hook, and actual reseeding is deferred. We can change it to an atomic flag to be thread-safe. Do you think this is not safe/legal?

I suspect the code in the application should have been changed rather than making this change here.

Do you mean changes done for this issue - opentracing-contrib/nginx-opentracing#52?

@ghalliday
Copy link
Author

ghalliday commented Nov 20, 2023

Once a process has been forked it is legal to do very little until it calls exec() or similar to switch to a different process.

The code I shared is only setting a flag in the atfork hook, and actual reseeding is deferred. We can change it to an atomic flag to be thread-safe. Do you think this is not safe/legal?

That change would also avoid the problem - both the time taken and the code - but I cannot think of any reason why any processing is required for the fork. It also adds an impact as you say to each request.

I suspect the code in the application should have been changed rather than making this change here.

Do you mean changes done for this issue - opentracing-contrib/nginx-opentracing#52?
Yes. This was the change. The documentation is clear that very little can be done inside after a fork() inside the child process.

I have looked at ingress-nginx, and can't find the code that calls fork - so I can't easily see the issue the fix was trying to address. @rnburn are you able to clarify what the issue was, and how this was causing problems with a fork?

@ThomsonTan
Copy link
Contributor

If exec is not called after the forked, would sharing the random seed from fork causing issue?

That change would also avoid the problem - both the time taken and the code - but I cannot think of any reason why any processing is required for the fork. It also adds an impact as you say to each request.

@owent
Copy link
Member

owent commented Nov 30, 2023

If exec is not called after the forked, would sharing the random seed from fork causing issue?

That change would also avoid the problem - both the time taken and the code - but I cannot think of any reason why any processing is required for the fork. It also adds an impact as you say to each request.

It will make generator to generate same random sequence between processes.

@ghalliday
Copy link
Author

If exec is not called after the forked, would sharing the random seed from fork causing issue?

That change would also avoid the problem - both the time taken and the code - but I cannot think of any reason why any processing is required for the fork. It also adds an impact as you say to each request.

It will make generator to generate same random sequence between processes.

That is true, but what is that process calling? My point is that the child process is highly restricted in what it can do at that point (see the linux references) - calling into the open telemetry libraries is almost certainly too much (calling any heap management functions can cause memory corruptions). If the child process needs an id, then it should be allocated before calling fork().

As soon as exec() is called the new child process will reinitialise the random number generators and the numbers will be different.

@github-actions github-actions bot added Stale and removed Stale labels Feb 3, 2024
@lalitb
Copy link
Member

lalitb commented Feb 13, 2024

I am not sure if we should remove code to reseed the random number on fork(). The random number generator is used for generating trace_id/span_id. If the RNG state is duplicated across forked processes, it could affect the randomness of the sampling process, potentially leading to biased sampling results.

I believe this PR isn't ready for merging due to the above mentioned concerns. Additionally, the updates suggested in this PR could be implemented through a custom random number generator. Therefore, closing this PR, but feel free to reopen it with the required modifications.

@lalitb lalitb closed this Feb 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TlsRandomNumberGenerator causes memory corruption when the process is forked.
4 participants