Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of Publish Subscribe Message Exchange interface #1676

Merged
merged 97 commits into from
Sep 20, 2023

Conversation

dpotman
Copy link
Contributor

@dpotman dpotman commented Apr 26, 2023

As a replacement for the current Iceoryx integration in Cyclone, this introduces the Publish Subscribe Message Exchange (PSMX) interface that allows implementing additional methods of pub-sub data-exchange as an alternative to Cyclone's networked transport. The PSMX interface allows to load implementations as plugins, so there is no need for compile-time linking.

One of the PSMX implementations is the Iceoryx plugin, which allows using Iceoryx as shared memory transport for Cyclone. For testing, a Cyclone based PSMX implementation is included. This also includes some rework of the loan mechanism, so that it better integrates with the PSMX mechanism.

I've created this as a draft PR, because some things still need to be improved and fixed. But in the current state the mechanism is fully working using the Cyclone plugin implementation (for testing purposes), I think it is useful to get some feedback at this point. For the Iceoryx implementation I need to do some work, I'll add a commit with that plugin included later this week.

src/core/ddsc/include/dds/ddsc/dds_psmx.h Fixed Show fixed Hide fixed
src/core/ddsi/include/dds/ddsi/ddsi_psmx.h Fixed Show fixed Hide fixed
src/core/ddsc/include/dds/ddsc/dds_loan.h Fixed Show fixed Hide fixed
src/core/ddsc/src/dds__psmx.h Fixed Show fixed Hide fixed
src/core/ddsc/src/dds_heap_loan.c Fixed Show fixed Hide fixed
src/core/ddsc/src/dds_loan.c Fixed Show fixed Hide fixed
src/core/ddsc/src/dds_psmx.c Fixed Show fixed Hide fixed
src/core/ddsc/src/dds__heap_loan.h Fixed Show fixed Hide fixed
@dpotman dpotman force-pushed the pubsub-msg-exchange branch from 581dc14 to eaabeb3 Compare July 18, 2023 08:07
@dpotman dpotman force-pushed the pubsub-msg-exchange branch from eaabeb3 to 6fd8322 Compare July 27, 2023 13:51
@dpotman dpotman marked this pull request as ready for review July 27, 2023 13:53
Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CodeQL found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.

@dpotman dpotman force-pushed the pubsub-msg-exchange branch from 6fd8322 to 7f86b8c Compare August 1, 2023 10:33
@reicheratwork
Copy link
Contributor

I am having issues with the following sequence of events:

  • a sample with kind = SDK_KEY is written with serdata_default_from_loaned_sample and serialization does not occur
    • a ddsi_serdata is created with kind = SDK_KEY
    • the sample_state field in the metadata is set to DDS_LOANED_SAMPLE_STATE_RAW because no serialization
  • it is received with serdata_default_from_psmx
    • this sees that the metadata sample_state says DDS_LOANED_SAMPLE_STATE_RAW
    • therefore the loaned_sample_state_to_serdata_kind sets the kind for the ddsi_serdata to be created to be SDK_DATA

I see an asymmetry in this flow, but I do not know whether this may cause real issues with readers/history caches/etc?
@eboasson , @dpotman could you please check?

@eboasson
Copy link
Contributor

eboasson commented Aug 8, 2023

I am having issues with the following sequence of events:

  • a sample with kind = SDK_KEY is written with serdata_default_from_loaned_sample and serialization does not occur

    • a ddsi_serdata is created with kind = SDK_KEY
    • the sample_state field in the metadata is set to DDS_LOANED_SAMPLE_STATE_RAW because no serialization
  • it is received with serdata_default_from_psmx

    • this sees that the metadata sample_state says DDS_LOANED_SAMPLE_STATE_RAW
    • therefore the loaned_sample_state_to_serdata_kind sets the kind for the ddsi_serdata to be created to be SDK_DATA

I see an asymmetry in this flow, but I do not know whether this may cause real issues with readers/history caches/etc? @eboasson , @dpotman could you please check?

It seems to me that that would be a problem. For example, the difference between a dispose and a writedispose here is whether the serdata is SDK_KEY or SDK_DATA and that in turn determines whether data can be pushed out of the reader history cache by it.

An obvious way to address it would be to add yet another state but that may well add bloat (I haven't checked the code to see whether I actually expect it to do so in this case, it is more of a general statement on what happens when you add yet another case). The alternative I see is to always serialize keys, which I think would be a reasonable option as they are typically small and sent only rarely.

@reicheratwork
Copy link
Contributor

Ran into the following issue:

Using the Helloworld examples, and adding the following config file in both HelloWorldPublisher and HelloWorldSubscriber's paths:

<CycloneDDS>
<General>
<Interfaces>
<PubSubMessageExchange name'"iox" library="psmx_iox" priority="1000000" />
</Interfaces>
</General>
</CycloneDDS>

I get the following error on the process which is opened first:
src/core/ddsi/src/ddsi_plist.c:2844 is_valid_psmx_instance_id: Assertion gv->interfaces[i].loc.kind == DDSI_LOCATOR_KIND_PSMX failed.

Running the process in gdb I found out that while ddsi_network_interface::is_psmx flag is set to true, ddsi_network_interface::loc::kind is 1 which is the value for DDSI_LOCATOR_KIND_UDPv4

Something is inconsistent here, probably in the discovery/matching part of DDS.

@reicheratwork
Copy link
Contributor

Ran into another "issue":
By compiling CycloneDDS with the iox wrapper and supplying it with a config which indicates that it should use it, the local delivery of data (so in the same process) no longer uses the fastpath delivery. This is caused by get_num_fast_path_readers returning 0 on dds_write.c:588
Writing on a writer with a matching reader in the same process now makes the exchange using PSMX.

While in most cases this will not have a major effect on the behaviour experienced by end-users, the extra delay incurred by having the loaned sample go from program - psmx - program, and it not being immediately available on the reader after the call to dds_write finishes may cause some programs to behave differently.

@eboasson
Copy link
Contributor

Ran into the following issue:

Using the Helloworld examples, and adding the following config file in both HelloWorldPublisher and HelloWorldSubscriber's paths:

<CycloneDDS>
<General>
<Interfaces>
<PubSubMessageExchange name'"iox" library="psmx_iox" priority="1000000" />
</Interfaces>
</General>
</CycloneDDS>

I get the following error on the process which is opened first: src/core/ddsi/src/ddsi_plist.c:2844 is_valid_psmx_instance_id: Assertion gv->interfaces[i].loc.kind == DDSI_LOCATOR_KIND_PSMX failed.

Running the process in gdb I found out that while ddsi_network_interface::is_psmx flag is set to true, ddsi_network_interface::loc::kind is 1 which is the value for DDSI_LOCATOR_KIND_UDPv4

Something is inconsistent here, probably in the discovery/matching part of DDS.

Is this reproducible for you? Because I did the "obvious" thing and that went fine. Of course there are also tons of subtle and less-subtle differences between my platform and yours ... If it is reproducible, a pair of traces would probably be interesting.

@eboasson
Copy link
Contributor

Ran into another "issue": By compiling CycloneDDS with the iox wrapper and supplying it with a config which indicates that it should use it, the local delivery of data (so in the same process) no longer uses the fastpath delivery. This is caused by get_num_fast_path_readers returning 0 on dds_write.c:588 Writing on a writer with a matching reader in the same process now makes the exchange using PSMX.

While in most cases this will not have a major effect on the behaviour experienced by end-users, the extra delay incurred by having the loaned sample go from program - psmx - program, and it not being immediately available on the reader after the call to dds_write finishes may cause some programs to behave differently.

With Iceoryx this is the correct behaviour, despite it being surprising and exhibiting worse performance. The problem is that Iceoryx doesn't give us the ability to prevent it from delivering data to subscribers in the same process as the publisher. So rather than two everything twice, we use the unavoidable path.

The test cases are definitely sensitive to this sort of thing, because there are many tests that depend on the synchronous updating of the RHC, but they are not typical. Applications are less likely to run into this, firstly because of the way DDS applications tend to be structured, and secondly because there is not a formal promise that the RHC update is synchronous.

@eboasson
Copy link
Contributor

Ran into the following issue:
Using the Helloworld examples, and adding the following config file in both HelloWorldPublisher and HelloWorldSubscriber's paths:

<CycloneDDS>
<General>
<Interfaces>
<PubSubMessageExchange name'"iox" library="psmx_iox" priority="1000000" />
</Interfaces>
</General>
</CycloneDDS>

I get the following error on the process which is opened first: src/core/ddsi/src/ddsi_plist.c:2844 is_valid_psmx_instance_id: Assertion gv->interfaces[i].loc.kind == DDSI_LOCATOR_KIND_PSMX failed.
Running the process in gdb I found out that while ddsi_network_interface::is_psmx flag is set to true, ddsi_network_interface::loc::kind is 1 which is the value for DDSI_LOCATOR_KIND_UDPv4
Something is inconsistent here, probably in the discovery/matching part of DDS.

Is this reproducible for you? Because I did the "obvious" thing and that went fine. Of course there are also tons of subtle and less-subtle differences between my platform and yours ... If it is reproducible, a pair of traces would probably be interesting.

Apologies ... I should remember to always have a look before trying ...

Does this fix it:

diff --git a/src/core/ddsi/src/ddsi_nwinterfaces.c b/src/core/ddsi/src/ddsi_nwinterfaces.c
index 169a734fd..3a9cfaba3 100644
--- a/src/core/ddsi/src/ddsi_nwinterfaces.c
+++ b/src/core/ddsi/src/ddsi_nwinterfaces.c
@@ -220,6 +220,7 @@ static enum maybe_add_interface_result maybe_add_interface (struct ddsi_domaingv
   dst->loopback = loopback ? 1 : 0;
   dst->link_local = link_local ? 1 : 0;
   dst->if_index = ifa->index;
+  dst->is_psmx = false;
   if ((dst->name = ddsrt_strdup (ifa->name)) == NULL)
     return MAI_OUT_OF_MEMORY;
   dst->priority = loopback ? 2 : 0;

It fits the bill perfectly. Memory is malloc'd and so the missing initializations means the status of the flag was undefined. Then you're safely in platform-specific territory. Address sanitizer initializes it with 0xbe, it is quite possible that on my device that means it ends up in the correct state; it is also quite likely that you have different build settings or a slightly different memory layout and that it does something else for you.

@reicheratwork
Copy link
Contributor

Ran into another "issue": By compiling CycloneDDS with the iox wrapper and supplying it with a config which indicates that it should use it, the local delivery of data (so in the same process) no longer uses the fastpath delivery. This is caused by get_num_fast_path_readers returning 0 on dds_write.c:588 Writing on a writer with a matching reader in the same process now makes the exchange using PSMX.
While in most cases this will not have a major effect on the behaviour experienced by end-users, the extra delay incurred by having the loaned sample go from program - psmx - program, and it not being immediately available on the reader after the call to dds_write finishes may cause some programs to behave differently.

With Iceoryx this is the correct behaviour, despite it being surprising and exhibiting worse performance. The problem is that Iceoryx doesn't give us the ability to prevent it from delivering data to subscribers in the same process as the publisher. So rather than two everything twice, we use the unavoidable path.

The test cases are definitely sensitive to this sort of thing, because there are many tests that depend on the synchronous updating of the RHC, but they are not typical. Applications are less likely to run into this, firstly because of the way DDS applications tend to be structured, and secondly because there is not a formal promise that the RHC update is synchronous.

I ran into this when having my "custom" config file still being used even when running the tests, because CYCLONEDDS_URI was set.
Maybe we should have the tests which do not touch on the PSMX functionality (so almost all of them) explicitly load a "correct" config, and not rely on what the end-user's config is?

@reicheratwork
Copy link
Contributor

Does this fix it:

I will see what this does on my end.

@eboasson
Copy link
Contributor

Ran into another "issue": By compiling CycloneDDS with the iox wrapper and supplying it with a config which indicates that it should use it, the local delivery of data (so in the same process) no longer uses the fastpath delivery. This is caused by get_num_fast_path_readers returning 0 on dds_write.c:588 Writing on a writer with a matching reader in the same process now makes the exchange using PSMX.
While in most cases this will not have a major effect on the behaviour experienced by end-users, the extra delay incurred by having the loaned sample go from program - psmx - program, and it not being immediately available on the reader after the call to dds_write finishes may cause some programs to behave differently.

With Iceoryx this is the correct behaviour, despite it being surprising and exhibiting worse performance. The problem is that Iceoryx doesn't give us the ability to prevent it from delivering data to subscribers in the same process as the publisher. So rather than two everything twice, we use the unavoidable path.
The test cases are definitely sensitive to this sort of thing, because there are many tests that depend on the synchronous updating of the RHC, but they are not typical. Applications are less likely to run into this, firstly because of the way DDS applications tend to be structured, and secondly because there is not a formal promise that the RHC update is synchronous.

I ran into this when having my "custom" config file still being used even when running the tests, because CYCLONEDDS_URI was set. Maybe we should have the tests which do not touch on the PSMX functionality (so almost all of them) explicitly load a "correct" config, and not rely on what the end-user's config is?

Overriding the configuration has it is own drawbacks: I just stuff <Gen><Interfaces><Netw n="lo0"/></></><Int><LivelinessM>true</></><Tr><Out>cdds.log</></> in CYCLONEDDS_URI in my shell initialisation script so the firewall doesn't get in my way as badly and the thread liveliness monitoring is enabled always, and I also use it to enable tracing in test runs. So simply overriding is a no-go as far as I am concerned.

I'd rather modify the tests that are affected by this to avoid PSMX by setting a QoS.

@reicheratwork
Copy link
Contributor

I'd rather modify the tests that are affected by this to avoid PSMX by setting a QoS.

Let me have look into this.

src/core/ddsi/src/ddsi_endpoint.c Dismissed Show dismissed Hide dismissed
src/core/ddsi/src/ddsi__endpoint.h Dismissed Show dismissed Hide dismissed
src/core/ddsc/include/dds/ddsc/dds_public_qos.h Dismissed Show dismissed Hide dismissed
src/core/ddsc/include/dds/ddsc/dds_public_qos.h Dismissed Show dismissed Hide dismissed
src/core/ddsc/src/dds_qos.c Dismissed Show dismissed Hide dismissed
src/core/ddsc/src/dds_qos.c Dismissed Show dismissed Hide dismissed
src/core/ddsi/src/ddsi_config.c Dismissed Show dismissed Hide dismissed
src/tools/ddsperf/ddsperf.c Dismissed Show dismissed Hide dismissed
@dpotman dpotman force-pushed the pubsub-msg-exchange branch from a043eb8 to e55db74 Compare September 19, 2023 07:41
dpotman and others added 6 commits September 19, 2023 15:55
Signed-off-by: Dennis Potman <[email protected]>
Co-authored-by: Martijn Reicher <[email protected]>
Co-authored-by: Erik Boasson <[email protected]>
Co-authored-by: Thijs Miedema <[email protected]>
…igin member in struct loaned_sample to enum with kind and psmx-instance ptr

Signed-off-by: Dennis Potman <[email protected]>
dpotman and others added 26 commits September 19, 2023 15:55
with newline (to get proper output on CI for failing Windows builds)

Signed-off-by: Dennis Potman <[email protected]>
When we switch from a dds to a psmx writer, we start with just one local
reader and expect that there are no locators in the writer's address
set, but:

- if there is no unacknowledged data in the writer at the time it gets
  deleted, and

- the discovery of the demise of the old readers may not have taken
  place yet at the time of creation of the writer, and

- consequently the writer was matched with the old readers, and then

- the discovery of said demise happens just prior to fetching the list
  of matched subscriptions, and

- also prior to the GC performing the actual unmatching of the proxy
  readers

and the waitset in "allmatched" is spinning/happens to wake up at the
right time, then the test code will see the correct set of readers, but
there are still some ports lingering around.  It is very unlikely, but
it can happen.

Signed-off-by: Erik Boasson <[email protected]>
* Track current value for a few seconds before freezing it as the
  reference value

* Add a custom allocator wrapper that tracks exact memory usage, because
  RSS turns out to be too fragile (RSS is also still available)

* Tweak sanity check script to use a tight live memory bound and a
  generous RSS bound

Signed-off-by: Erik Boasson <[email protected]>
* Add ddsi_sertype_init_props
* Clean up sertype flags
* Rename IS_FIXED_SIZE to IS_MEMCPY_SAFE
* Delete useless "psmx endpoint serialization required"

Signed-off-by: Erik Boasson <[email protected]>
Signed-off-by: Erik Boasson <[email protected]>
Integrating (for example) application code using Iceoryx with Cyclone
DDS + PSMX + Iceoryx requires that the application knows the type name.
So we must pass it to the plug-in.

It seems to be best to use the actual type name by default for this
reason.

Signed-off-by: Erik Boasson <[email protected]>
* ENABLE_SHM => ENABLE_ICEORYX

* No code referencing Iceoryx remaining in the core, except in
  code for supporting backwards compatibility

* No more references to iceoryx_binding_c

* CI builds use "iceoryx" instead of "psmx_iox"

Signed-off-by: Erik Boasson <[email protected]>
For keyless topics, a serialized key has length 0, for which dds_take
leaves the _buffer field a null pointer.

Signed-off-by: Erik Boasson <[email protected]>
Update (and rename) dds_loan_shared_memory_buffer so that it
returns a dds_loaned_sample_t and not just a raw loan pointer
(that can't be used with dds_write because the buffer is not
registered in the loan administration). With this change, the
loaned sample (with custom size) can be used with the regular
write function; only difference with the regular request_loan
is that the user can specify the size in this case.

The request_raw_loan fn prototype is removed from the PSMX
endpoint ops; request_loan with the user-specified size
is used instead.

Signed-off-by: Dennis Potman <[email protected]>
... by adding SIGINT handler that terminates the process cleanly.

Signed-off-by: Erik Boasson <[email protected]>
@dpotman dpotman force-pushed the pubsub-msg-exchange branch from 1fa1a23 to 5da81b8 Compare September 19, 2023 13:55
@eboasson
Copy link
Contributor

I can't think of something else to check, the test runs have been good for a while, the Python binding is unaffected and the C++ has its companion PR ready, so I think it is time to merge this.

For the record, I am not pressing "approve" because that would suggest I am in the position of an impartial reviewer, which I am not. This PR is a joint effort where all work has been reviewed prior to updating this branch.

Also, I am expecting some unknown unknown causing trouble, because no matter how hard I try, it always happens. But I really do not know what it is that will cause said trouble.

@eboasson eboasson merged commit 175a024 into eclipse-cyclonedds:master Sep 20, 2023
19 of 22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants