Stabilize execution of test.sh on dev Linux environments: certifier_tests_wrap.cc:178:11: fatal error: Python.h: No such file or directory #215

gapisback · 2023-10-25T15:41:36Z

The newly added test.sh script is not working correctly on engineers' Linux dev env.
There are some hard-coded dependencies in various build scripts that need to be corrected.

There are at least a couple of diff issues to resolve under this issue:

Dir-path for PY_INCLUDE = -I /usr/include/python3.10/ needs to be fixed in all make files.
Support passing down LOCAL_LIB=/usr/local/lib64 from test.sh to sub-scripts / makefiles.

Here are some examples of failures:

483 + make -f certifier_tests.mak --always-make -j8 sharedlib LOCAL_LIB=/usr/local/lib64
484 protoc --cpp_out=. --proto_path ../certifier_service/certprotos ../certifier_service/certprotos/certif    ier.proto
485
486 compiling support.cc
[...]

550 compiling x509_tests.cc
551 g++ -I ../include -I/usr/local/opt/[email protected]/include/ -I ./sev-snp -I ./gramine -g -std=c++17 -D X64     -Wall -Wno-unused-variable -Wno-deprecated-declarations -O3 -fpic -o ./x509_tests.o -c x509_tests.cc
552 compiling certifier_tests_wrap.cc
553 g++ -I ../include -I/usr/local/opt/[email protected]/include/ -I ./sev-snp -I ./gramine -g -std=c++17 -D X64     -Wall -Wno-unused-variable -Wno-deprecated-declarations -O3 -fpic -I /usr/include/python3.10/ -o ./ce    rtifier_tests_wrap.o -c certifier_tests_wrap.cc
554 certifier_tests_wrap.cc:178:11: fatal error: Python.h: No such file or directory
555   178 | # include <Python.h>
556       |           ^~~~~~~~~~
557 compilation terminated.
558 make: *** [certifier_tests.mak:195: certifier_tests_wrap.o] Error 1
559 make: *** Waiting for unfinished jobs....
560 ++ cleanup
561 ++ set +x
562
563 test.sh: **** Error **** Failed command, make -f certifier_tests.mak --always-make -j${NumMakeThreads}     sharedlib LOCAL_LIB=/usr/local/lib64, at line 130 ./CI/scripts/test.sh while executing function test-core-certifier-programs

The text was updated successfully, but these errors were encountered:

Issue: #215

rgerganov · 2023-10-26T06:47:23Z

I tried running test.sh on my Linux dev machine and I hit the following issues:

compilation failures due to hardcoded Python include path (submitted PR Use pkg-config for finding Python include dir #216)
missing dependencies like pytest and protobuf
incompatible openssl options; I have openssl 1.1.1 and it doesn't support noenc; this one is minor, not sure if we should care for openssl v1.x at all
waiting 120sec. for sockets to be closed; we should be able to fix this with SO_REUSEADDR?
test script deleting non-git files; this is not ok -- it's common practice to keep all kind of files in the project dir like virtual environments, IDE configurations, etc. and deleting those is quite unexpected and frustrating
using sudo during tests; I understand we need this to emulate SEV but I won't give root permissions to a test script running on my machine

I guess the best way to go here is to provide a docker container with all of the dependencies that can be used for running tests. The Dockerfile can be used as a reference for what packages and configurations are needed if you want to run this on your physical machine.

gapisback · 2023-10-26T11:43:17Z

The Docker container is a good approach. Let's discuss this next week to put this work item in its priority queue.

Issue: #215

…envs. This commit does some clean-up of scripts and Makefiles so that the newly-added test.sh can be used by other dev engineers on diff Linux physical machines. - Fix few Makefiles to allow re-specifying LOCAL_LIB, to point to, say, /usr/local/lib (for SSL libs, e.g.) - Hacky support to allow use of OpenSSL V1.x versions - Skip a few pytests that seem to not run stably. Needs further investigation.

gapisback · 2023-10-27T23:33:26Z

@rgerganov -- I investigated one of the points you raised above:

waiting 120sec. for sockets to be closed; we should be able to fix this with SO_REUSEADDR?

The sleep is coming in the pytest-case test_certifier_framework.py in-between channel.init_server_ssl() and cfm.server_dispatch().

In server_dispatch() -> open_server_socket(), we are using this re-use-address semantic, as seen below:

2283       // Reuse addresses and ports
2284 #define REUSE_SOCKETS_AND_PORTS
2285 #ifdef REUSE_SOCKETS_AND_PORTS
2286     int reuse = 1;
2287     if (setsockopt(sfd,
2288                    SOL_SOCKET,
2289                    SO_REUSEADDR,      <<<<<
2290                    (const char *)&reuse,
2291                    sizeof(reuse))
2292         < 0) {
2293       fprintf(stderr, "Can't reuse socket %s\n", __func__);
2294       return false;
2295     }

Empirically, testing evidence shows that even though this flag has been used, on dev-machine, if you re-run the test-case repeatedly, the socket is not being closed and re-used fast enough. Hence, this sleep has been added.

This test file test_certifier_framework.py has several test cases each of which establishes a SSL connect, does some work and then disconnects. So, when run back-to-back in CI, I need a mechanism to ensure that the socket re-use property kicks-in "fast enough".

I know this is irritating, but I don't think this is functionally broken or use of this sleep is masking some broken functionality.

I can investigate further, but debugging this is a bit difficult on CI machines.

gapisback · 2023-10-27T23:53:43Z

@yelvmw -- Here's the trace evidence showing why some of the Python tests in test_certifier_framework.py are hanging when run on the SGX machine.

I think it's because in the shared library, the code for ENABLE_SEV=1 is automatically enabled for all hardware platforms. (I had raised this 'issue' in our previous visits to certifier.mak. At that time, it was felt that it's better to have all code #include'ed by default for all TEEs.)

That I suspect is causing an issue as seen in the execution flow trace instrumentation:

test_certifier_framework.py::test_run_app_as_a_client_init_client_ssl_with_trust_manager
*** cc_trust_manager_get_certified() ...

init_policy_key() ... Starting ...
PublicKeyFromCert() Starting ...
PublicKeyFromCert() ... Starting ...
GetX509FromCert() succeeded.
X509_get_pubkey() succeeded.
X509_NAME_get_text_by_NID() succeeded, len=15.
X509_NAME_get_text_by_NID() succeeded.
PublicKeyFromCert():425 succeeded.
PublicKeyFromCert():432 succeeded.
PublicKeyFromCert():435 Executing ...
get_tcb_version_from_vcek():270 Executing ...
vcek_ext_byte_value():230 Executing ...

Basically, the Python method invokes init_policy_key(), which goes through till:

get_tcb_version_from_vcek() -> vcek_ext_byte_value().

I instrumented the last function, and it's hanging on L232 below:

 230   printf("%s():%d Executing ...\n", __func__, __LINE__);
 231   // Use OID for both lname and sname so OBJ_create does not fail
 232   nid = OBJ_create(oid, oid, oid);
 233   printf("%s():%d Executing ...\n", __func__, __LINE__);

OBJ_create() seems to be a SEV_SNP interface of some sort. I can't locate its sources.

To validate this theory that running SEV_SNP-specific code in this simulated-enclave, I conditionalized the definition of ENABLE_SEV in certifier.mak and re-built the shared lib:

  4 ifndef NO_ENABLE_SEV
  5     ENABLE_SEV=1
  6 endif

This allows the tests to run cleanly without running into this hanging issue.

…envs. This commit does some clean-up of scripts and Makefiles so that the newly-added test.sh can be used by other dev engineers on diff Linux physical machines. - Fix few Makefiles to allow re-specifying LOCAL_LIB, to point to, say, /usr/local/lib (for SSL libs, e.g.) - Hacky support to allow use of OpenSSL V1.x versions - Add NO_ENABLE_SEV=1 flag in certifier.mak, so that we can build shared library without #include'ing SEV_SNP-specific code. This seems to cause pytests which to "hang" as they end up calling SEV_SNP code-paths. - Use this new flag when building shared-libraries in the test-run_example-simple_app() test-case, in test.sh . This is used to execute pytests that need Certifier Service.

…envs. This commit does some clean-up of scripts and Makefiles so that the newly-added test.sh can be used by other dev engineers on diff Linux physical machines. - Fix few Makefiles to allow re-specifying LOCAL_LIB, to point to, say, /usr/local/lib (for SSL libs, e.g.) - Hacky support to allow use of OpenSSL V1.x versions - Add NO_ENABLE_SEV=1 flag in certifier.mak, so that we can build shared library without #include'ing SEV_SNP-specific code. This seems to cause pytests which to "hang" as they end up calling SEV_SNP code-paths. - Use this new flag when building shared-libraries in the test-run_example-simple_app() test-case, in test.sh . This is used to execute pytests that need Certifier Service. cleanup.

yelvmw · 2023-10-30T03:05:14Z

It's surprising to see OBJ_create hanging. It's an OpenSSL function. I checked the bugs section of the manual and only see the caveat of concurrent invocation. Is this the case?

…nvs. This commit does some clean-up of scripts and Makefiles so that the newly-added test.sh can be used by other dev engineers on diff Linux physical machines. - Fix few Makefiles to allow re-specifying LOCAL_LIB, to point to, say, /usr/local/lib (for SSL libs, e.g.) - Hacky support to allow use of OpenSSL V1.x versions - Add NO_ENABLE_SEV=1 flag in certifier.mak, so that we can build shared library without #include'ing SEV_SNP-specific code. Including SEV_SNP-specific code seems to cause pytests to "hang", when running pytests on simulated-enclave, as the tests end up calling SEV_SNP code-paths. - test.sh: Use this new flag when building shared-libraries in the test-run_example-simple_app() test-case. This is used to execute pytests that need Certifier Service on SEV-enabled h/w.

gapisback · 2023-10-30T15:59:03Z

@yelvmw - About the concurrent use of OBJ_create(), the server & client code are being executed concurrently, although there isn't any sort of race condition.

However in the affected test case(s) [e.g., test_run_app_as_a_client_init_client_ssl_with_trust_manager()], we just start the client-app portion on its own, talking to the Certifier Service. So, no concurrent invocation of said SSL interface should come into play.

…nvs. This commit does some clean-up of scripts and Makefiles so that the newly-added test.sh can be used by other dev engineers on diff Linux physical machines. - Fix few Makefiles to allow re-specifying LOCAL_LIB, to point to, say, /usr/local/lib (for SSL libs, e.g.) - Hacky support to allow use of OpenSSL V1.x versions - Add NO_ENABLE_SEV=1 flag in certifier.mak, so that we can build shared library without #include'ing SEV_SNP-specific code. Including SEV_SNP-specific code seems to cause pytests to "hang", when running pytests on simulated-enclave, as the tests end up calling SEV_SNP code-paths. - test.sh: Use this new flag when building shared-libraries in the test-run_example-simple_app() test-case. This is used to execute pytests that need Certifier Service on SEV-enabled h/w.

gapisback · 2023-11-03T15:59:26Z

@rgerganov : I'm back to cleaning up some items you logged here: "test script deleting non-git files; this is not ok -"C

Firstly, I do need rm_non_git_files or some hook like that, to clean-up cruft between run-to-run of this run_example.sh.

Otherwise, I've found that some workflow steps for some sample-apps may interfere causing instability. I just need a way to "cleanup the entire dev env" at the start of each run of this script.

Second, the command does git ls from certifier-root dir. Why would a user's config / env-files be living inside the certifier-framework-for-confidential-computing/ sub-dir? Wouldn't we expect these to usually live under $HOME?
Are you worried about IDE config files for this Certifier Framework to be living inside this projects's root sub-dir?

What if we exclude dot-files; i.e. filter-out \.* files from the rm? Would that suffice to address your concern?

Alternate Solution: Let me know if you think this is good enough:

I took a look at the files that do need to be cleaned-up by this command. Seems like if we can filter out the output from git ls-files to only delete the files in app_data*/ and provisioning sub-dirs, and pick-up build output files (e.g. *.o, *.exe, *wrap* etc.) that may be sufficient.

This kind of filtering is prone to being incomplete ... but it may be the only reliable way to do this clean-up w/o impacting the user's config / env-files.

--
[Update] I just discovered that the list of files reported by git ls-files . --exclude-standard --others --ignored is probably sufficient. It includes all objects, exes and SWIG-generated files. Would you be willing to go with this approach?

The idea is that we only rm files that are in our project's .gitignore, ... which is not a bad thing, right?

Do you expect user project's config/env-files to live in our project's .gitignore?

Let me know your thoughts on this approach ... and whether we are really to be worried-about config / env-files in the certifier root-dir.

This commit attempts to improve the behaviour of the rm_non_git_files sub-command of run_example.sh . The objective of this command is to clean-up build-outputs and other [stale] artifacts that may have been produced by a prior execution of this script. We may be, inadvertently, also deleting user's config-/env-files that reside in the project's sub-tree. This commit changes the behaviour of rm_non_git_files to now only delete the files registered in the .gitignore list.

This commit attempts to improve the behaviour of the rm_non_git_files sub-command of run_example.sh . The objective of this command is to clean-up build-outputs and other [stale] artifacts that may have been produced by a prior execution of this script. We may be, inadvertently, also deleting user's config-/env-files that reside in the project's sub-tree. This commit changes the behaviour of rm_non_git_files to now only delete the files registered in the .gitignore list. Also, now "--dry-run rm_non_git_files" will only print an info message without actually deleting all affected files.

gapisback self-assigned this Oct 25, 2023

gapisback added bug Something isn't working code-cleanup Minor code cleanup changes, to simplify, improve code. No logic changes. labels Oct 25, 2023

rgerganov added a commit that referenced this issue Oct 26, 2023

Use pkg-config for finding Python include dir

2a63464

Issue: #215

rgerganov mentioned this issue Oct 26, 2023

Use pkg-config for finding Python include dir #216

Merged

rgerganov added a commit that referenced this issue Oct 26, 2023

Use pkg-config for finding Python include dir (#216)

aff114d

Issue: #215

gapisback mentioned this issue Oct 28, 2023

(#215) Fix scripts, Makefiles to have test.sh run on other Linux dev boxes. #217

Merged

gapisback mentioned this issue Nov 3, 2023

(#215) Prevent rm_non_git_files from rm'ing user's config/env-files #222

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stabilize execution of test.sh on dev Linux environments: certifier_tests_wrap.cc:178:11: fatal error: Python.h: No such file or directory #215

Stabilize execution of test.sh on dev Linux environments: certifier_tests_wrap.cc:178:11: fatal error: Python.h: No such file or directory #215

gapisback commented Oct 25, 2023

rgerganov commented Oct 26, 2023

gapisback commented Oct 26, 2023

gapisback commented Oct 27, 2023

gapisback commented Oct 27, 2023

yelvmw commented Oct 30, 2023

gapisback commented Oct 30, 2023

gapisback commented Nov 3, 2023 •

edited

Loading

Stabilize execution of test.sh on dev Linux environments: certifier_tests_wrap.cc:178:11: fatal error: Python.h: No such file or directory #215

Stabilize execution of test.sh on dev Linux environments: certifier_tests_wrap.cc:178:11: fatal error: Python.h: No such file or directory #215

Comments

gapisback commented Oct 25, 2023

rgerganov commented Oct 26, 2023

gapisback commented Oct 26, 2023

gapisback commented Oct 27, 2023

gapisback commented Oct 27, 2023

yelvmw commented Oct 30, 2023

gapisback commented Oct 30, 2023

gapisback commented Nov 3, 2023 • edited Loading

gapisback commented Nov 3, 2023 •

edited

Loading