Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stabilize execution of test.sh on dev Linux environments: certifier_tests_wrap.cc:178:11: fatal error: Python.h: No such file or directory #215

Open
gapisback opened this issue Oct 25, 2023 · 7 comments
Assignees
Labels
bug Something isn't working code-cleanup Minor code cleanup changes, to simplify, improve code. No logic changes.

Comments

@gapisback
Copy link
Collaborator

The newly added test.sh script is not working correctly on engineers' Linux dev env.
There are some hard-coded dependencies in various build scripts that need to be corrected.

There are at least a couple of diff issues to resolve under this issue:

  • Dir-path for PY_INCLUDE = -I /usr/include/python3.10/ needs to be fixed in all make files.
  • Support passing down LOCAL_LIB=/usr/local/lib64 from test.sh to sub-scripts / makefiles.

Here are some examples of failures:

483 + make -f certifier_tests.mak --always-make -j8 sharedlib LOCAL_LIB=/usr/local/lib64
484 protoc --cpp_out=. --proto_path ../certifier_service/certprotos ../certifier_service/certprotos/certif    ier.proto
485
486 compiling support.cc
[...]

550 compiling x509_tests.cc
551 g++ -I ../include -I/usr/local/opt/[email protected]/include/ -I ./sev-snp -I ./gramine -g -std=c++17 -D X64     -Wall -Wno-unused-variable -Wno-deprecated-declarations -O3 -fpic -o ./x509_tests.o -c x509_tests.cc
552 compiling certifier_tests_wrap.cc
553 g++ -I ../include -I/usr/local/opt/[email protected]/include/ -I ./sev-snp -I ./gramine -g -std=c++17 -D X64     -Wall -Wno-unused-variable -Wno-deprecated-declarations -O3 -fpic -I /usr/include/python3.10/ -o ./ce    rtifier_tests_wrap.o -c certifier_tests_wrap.cc
554 certifier_tests_wrap.cc:178:11: fatal error: Python.h: No such file or directory
555   178 | # include <Python.h>
556       |           ^~~~~~~~~~
557 compilation terminated.
558 make: *** [certifier_tests.mak:195: certifier_tests_wrap.o] Error 1
559 make: *** Waiting for unfinished jobs....
560 ++ cleanup
561 ++ set +x
562
563 test.sh: **** Error **** Failed command, make -f certifier_tests.mak --always-make -j${NumMakeThreads}     sharedlib LOCAL_LIB=/usr/local/lib64, at line 130 ./CI/scripts/test.sh while executing function test-core-certifier-programs
@gapisback gapisback self-assigned this Oct 25, 2023
@gapisback gapisback added bug Something isn't working code-cleanup Minor code cleanup changes, to simplify, improve code. No logic changes. labels Oct 25, 2023
rgerganov added a commit that referenced this issue Oct 26, 2023
@rgerganov
Copy link
Collaborator

I tried running test.sh on my Linux dev machine and I hit the following issues:

  • compilation failures due to hardcoded Python include path (submitted PR Use pkg-config for finding Python include dir #216)
  • missing dependencies like pytest and protobuf
  • incompatible openssl options; I have openssl 1.1.1 and it doesn't support noenc; this one is minor, not sure if we should care for openssl v1.x at all
  • waiting 120sec. for sockets to be closed; we should be able to fix this with SO_REUSEADDR?
  • test script deleting non-git files; this is not ok -- it's common practice to keep all kind of files in the project dir like virtual environments, IDE configurations, etc. and deleting those is quite unexpected and frustrating
  • using sudo during tests; I understand we need this to emulate SEV but I won't give root permissions to a test script running on my machine

I guess the best way to go here is to provide a docker container with all of the dependencies that can be used for running tests. The Dockerfile can be used as a reference for what packages and configurations are needed if you want to run this on your physical machine.

@gapisback
Copy link
Collaborator Author

The Docker container is a good approach. Let's discuss this next week to put this work item in its priority queue.

gapisback pushed a commit that referenced this issue Oct 26, 2023
…envs.

This commit does some clean-up of scripts and Makefiles so that
the newly-added test.sh can be used by other dev engineers on
diff Linux physical machines.

- Fix few Makefiles to allow re-specifying LOCAL_LIB, to point
  to, say, /usr/local/lib (for SSL libs, e.g.)
- Hacky support to allow use of OpenSSL V1.x versions
- Skip a few pytests that seem to not run stably. Needs further
  investigation.
@gapisback
Copy link
Collaborator Author

@rgerganov -- I investigated one of the points you raised above:

waiting 120sec. for sockets to be closed; we should be able to fix this with SO_REUSEADDR?

The sleep is coming in the pytest-case test_certifier_framework.py in-between channel.init_server_ssl() and cfm.server_dispatch().

In server_dispatch() -> open_server_socket(), we are using this re-use-address semantic, as seen below:

2283       // Reuse addresses and ports
2284 #define REUSE_SOCKETS_AND_PORTS
2285 #ifdef REUSE_SOCKETS_AND_PORTS
2286     int reuse = 1;
2287     if (setsockopt(sfd,
2288                    SOL_SOCKET,
2289                    SO_REUSEADDR,      <<<<<
2290                    (const char *)&reuse,
2291                    sizeof(reuse))
2292         < 0) {
2293       fprintf(stderr, "Can't reuse socket %s\n", __func__);
2294       return false;
2295     }

Empirically, testing evidence shows that even though this flag has been used, on dev-machine, if you re-run the test-case repeatedly, the socket is not being closed and re-used fast enough. Hence, this sleep has been added.

This test file test_certifier_framework.py has several test cases each of which establishes a SSL connect, does some work and then disconnects. So, when run back-to-back in CI, I need a mechanism to ensure that the socket re-use property kicks-in "fast enough".

I know this is irritating, but I don't think this is functionally broken or use of this sleep is masking some broken functionality.

I can investigate further, but debugging this is a bit difficult on CI machines.

@gapisback
Copy link
Collaborator Author

@yelvmw -- Here's the trace evidence showing why some of the Python tests in test_certifier_framework.py are hanging when run on the SGX machine.

I think it's because in the shared library, the code for ENABLE_SEV=1 is automatically enabled for all hardware platforms. (I had raised this 'issue' in our previous visits to certifier.mak. At that time, it was felt that it's better to have all code #include'ed by default for all TEEs.)

That I suspect is causing an issue as seen in the execution flow trace instrumentation:

test_certifier_framework.py::test_run_app_as_a_client_init_client_ssl_with_trust_manager
*** cc_trust_manager_get_certified() ...

init_policy_key() ... Starting ...
PublicKeyFromCert() Starting ...
PublicKeyFromCert() ... Starting ...
GetX509FromCert() succeeded.
X509_get_pubkey() succeeded.
X509_NAME_get_text_by_NID() succeeded, len=15.
X509_NAME_get_text_by_NID() succeeded.
PublicKeyFromCert():425 succeeded.
PublicKeyFromCert():432 succeeded.
PublicKeyFromCert():435 Executing ...
get_tcb_version_from_vcek():270 Executing ...
vcek_ext_byte_value():230 Executing ...

Basically, the Python method invokes init_policy_key(), which goes through till:

get_tcb_version_from_vcek() -> vcek_ext_byte_value().

I instrumented the last function, and it's hanging on L232 below:

 230   printf("%s():%d Executing ...\n", __func__, __LINE__);
 231   // Use OID for both lname and sname so OBJ_create does not fail
 232   nid = OBJ_create(oid, oid, oid);
 233   printf("%s():%d Executing ...\n", __func__, __LINE__);

OBJ_create() seems to be a SEV_SNP interface of some sort. I can't locate its sources.


To validate this theory that running SEV_SNP-specific code in this simulated-enclave, I conditionalized the definition of ENABLE_SEV in certifier.mak and re-built the shared lib:

  4 ifndef NO_ENABLE_SEV
  5     ENABLE_SEV=1
  6 endif

This allows the tests to run cleanly without running into this hanging issue.

gapisback pushed a commit that referenced this issue Oct 28, 2023
…envs.

This commit does some clean-up of scripts and Makefiles so that
the newly-added test.sh can be used by other dev engineers on
diff Linux physical machines.

- Fix few Makefiles to allow re-specifying LOCAL_LIB, to point
  to, say, /usr/local/lib (for SSL libs, e.g.)
- Hacky support to allow use of OpenSSL V1.x versions
- Add NO_ENABLE_SEV=1 flag in certifier.mak, so that we can build
  shared library without #include'ing SEV_SNP-specific code. This
  seems to cause pytests which to "hang" as they end up calling
  SEV_SNP code-paths.

  - Use this new flag when building shared-libraries in the
    test-run_example-simple_app() test-case, in test.sh .
    This is used to execute pytests that need Certifier Service.
gapisback pushed a commit that referenced this issue Oct 28, 2023
…envs.

This commit does some clean-up of scripts and Makefiles so that
the newly-added test.sh can be used by other dev engineers on
diff Linux physical machines.

- Fix few Makefiles to allow re-specifying LOCAL_LIB, to point
  to, say, /usr/local/lib (for SSL libs, e.g.)
- Hacky support to allow use of OpenSSL V1.x versions
- Add NO_ENABLE_SEV=1 flag in certifier.mak, so that we can build
  shared library without #include'ing SEV_SNP-specific code. This
  seems to cause pytests which to "hang" as they end up calling
  SEV_SNP code-paths.

  - Use this new flag when building shared-libraries in the
    test-run_example-simple_app() test-case, in test.sh .
    This is used to execute pytests that need Certifier Service.

cleanup.
@yelvmw
Copy link
Contributor

yelvmw commented Oct 30, 2023

It's surprising to see OBJ_create hanging. It's an OpenSSL function. I checked the bugs section of the manual and only see the caveat of concurrent invocation. Is this the case?

gapisback pushed a commit that referenced this issue Oct 30, 2023
…nvs.

This commit does some clean-up of scripts and Makefiles so that
the newly-added test.sh can be used by other dev engineers on
diff Linux physical machines.

- Fix few Makefiles to allow re-specifying LOCAL_LIB, to point
  to, say, /usr/local/lib (for SSL libs, e.g.)
- Hacky support to allow use of OpenSSL V1.x versions
- Add NO_ENABLE_SEV=1 flag in certifier.mak, so that we can build
  shared library without #include'ing SEV_SNP-specific code.
  Including SEV_SNP-specific code seems to cause pytests to
  "hang", when running pytests on simulated-enclave, as the
  tests end up calling SEV_SNP code-paths.

  - test.sh: Use this new flag when building shared-libraries in
    the test-run_example-simple_app() test-case. This is used to
    execute pytests that need Certifier Service on SEV-enabled h/w.
@gapisback
Copy link
Collaborator Author

@yelvmw - About the concurrent use of OBJ_create(), the server & client code are being executed concurrently, although there isn't any sort of race condition.

However in the affected test case(s) [e.g., test_run_app_as_a_client_init_client_ssl_with_trust_manager()], we just start the client-app portion on its own, talking to the Certifier Service. So, no concurrent invocation of said SSL interface should come into play.

gapisback pushed a commit that referenced this issue Oct 30, 2023
…nvs.

This commit does some clean-up of scripts and Makefiles so that
the newly-added test.sh can be used by other dev engineers on
diff Linux physical machines.

- Fix few Makefiles to allow re-specifying LOCAL_LIB, to point
  to, say, /usr/local/lib (for SSL libs, e.g.)
- Hacky support to allow use of OpenSSL V1.x versions
- Add NO_ENABLE_SEV=1 flag in certifier.mak, so that we can build
  shared library without #include'ing SEV_SNP-specific code.
  Including SEV_SNP-specific code seems to cause pytests to
  "hang", when running pytests on simulated-enclave, as the
  tests end up calling SEV_SNP code-paths.

  - test.sh: Use this new flag when building shared-libraries in
    the test-run_example-simple_app() test-case. This is used to
    execute pytests that need Certifier Service on SEV-enabled h/w.
@gapisback
Copy link
Collaborator Author

gapisback commented Nov 3, 2023

@rgerganov : I'm back to cleaning up some items you logged here: "test script deleting non-git files; this is not ok -"C

Firstly, I do need rm_non_git_files or some hook like that, to clean-up cruft between run-to-run of this run_example.sh.

Otherwise, I've found that some workflow steps for some sample-apps may interfere causing instability. I just need a way to "cleanup the entire dev env" at the start of each run of this script.

Second, the command does git ls from certifier-root dir. Why would a user's config / env-files be living inside the certifier-framework-for-confidential-computing/ sub-dir? Wouldn't we expect these to usually live under $HOME?
Are you worried about IDE config files for this Certifier Framework to be living inside this projects's root sub-dir?

What if we exclude dot-files; i.e. filter-out \.* files from the rm? Would that suffice to address your concern?

Alternate Solution: Let me know if you think this is good enough:

I took a look at the files that do need to be cleaned-up by this command. Seems like if we can filter out the output from git ls-files to only delete the files in app_data*/ and provisioning sub-dirs, and pick-up build output files (e.g. *.o, *.exe, *wrap* etc.) that may be sufficient.

This kind of filtering is prone to being incomplete ... but it may be the only reliable way to do this clean-up w/o impacting the user's config / env-files.

--
[Update] I just discovered that the list of files reported by git ls-files . --exclude-standard --others --ignored is probably sufficient. It includes all objects, exes and SWIG-generated files. Would you be willing to go with this approach?

The idea is that we only rm files that are in our project's .gitignore, ... which is not a bad thing, right?

Do you expect user project's config/env-files to live in our project's .gitignore?


Let me know your thoughts on this approach ... and whether we are really to be worried-about config / env-files in the certifier root-dir.

gapisback added a commit that referenced this issue Nov 3, 2023
This commit attempts to improve the behaviour of the rm_non_git_files
sub-command of run_example.sh . The objective of this command is to
clean-up build-outputs and other [stale] artifacts that may have been
produced by a prior execution of this script. We may be, inadvertently,
also deleting user's config-/env-files that reside in the project's
sub-tree. This commit changes the behaviour of rm_non_git_files to
now only delete the files registered in the .gitignore list.
gapisback added a commit that referenced this issue Nov 3, 2023
This commit attempts to improve the behaviour of the rm_non_git_files
sub-command of run_example.sh . The objective of this command is to
clean-up build-outputs and other [stale] artifacts that may have been
produced by a prior execution of this script. We may be, inadvertently,
also deleting user's config-/env-files that reside in the project's
sub-tree. This commit changes the behaviour of rm_non_git_files to
now only delete the files registered in the .gitignore list.

Also, now "--dry-run rm_non_git_files" will only print an info message
without actually deleting all affected files.
gapisback added a commit that referenced this issue Nov 6, 2023
This commit attempts to improve the behaviour of the rm_non_git_files
sub-command of run_example.sh . The objective of this command is to
clean-up build-outputs and other [stale] artifacts that may have been
produced by a prior execution of this script. We may be, inadvertently,
also deleting user's config-/env-files that reside in the project's
sub-tree. This commit changes the behaviour of rm_non_git_files to
now only delete the files registered in the .gitignore list.

Also, now "--dry-run rm_non_git_files" will only print an info message
without actually deleting all affected files.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working code-cleanup Minor code cleanup changes, to simplify, improve code. No logic changes.
Projects
None yet
Development

No branches or pull requests

3 participants