EQ oracles should not use hypothesis state iterators! #148
Replies: 1 comment 1 reply
-
To me, this sounds like they are not the same algorithm, then. IMO, as you correctly pointed out, the core issue is the fact that "equivalence" means "equivalence modulo isomorphism". The two algorithms return different state permutations of the original model. They are equivalent (isomorphic) but not equal. However, I think this is a necessary degree of freedom. Otherwise, you couldn't use different counterexample analysis strategies (e.g., forward-oriented and backward-oriented) which may discover new hypothesis states at different times.
Mainly in order to allow for mapping from/to indices. You have the same problem here, since you are free to provide a […]. (To be fair, this point is often neglected in benchmarks as well, i.e., you can easily choose the order in which your algorithm performs best compared to others.)
-
I just discovered a very bizarre phenomenon. I tested two different implementations of the same Mealy learning algorithm (I cannot publish the code yet). When I ran some benchmarks to check whether they behave the same, I noticed that they differed in query and symbol performance. Naturally, I thought I had a bug that causes them to pose different queries. I then logged all of their interactions with the membership oracle and the counterexamples they received. It turned out that they posed exactly the same queries up until some point at which they received different counterexamples. I thought: how could that be? If they pose the same queries, they must also construct the same hypotheses, since the algorithm is fully deterministic. So I exported the two hypotheses that caused the first divergence in counterexamples, re-imported them into LearnLib, and verified that they were equivalent. How could they possibly lead to different counterexamples? I am using RandomWp, which, despite its name, behaves completely deterministically given a fixed seed.
Turns out: My two learners constructed equivalent hypotheses, but created their respective states in different orders, because their internal data structures used different iterators. This then led to different iteration orders in hypothesis.getStates(). I believe that this line of code in the RandomWp implementation then leads to different test words being generated.
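To illustrate the mechanism with a deliberately trivial, hypothetical sketch (this is not the actual RandomWp code; the state names and the seed are made up): any oracle that seeds a `Random` and then indexes into the order produced by `hypothesis.getStates()` will, for two hypotheses that are isomorphic but enumerate their states differently, pick semantically different states and therefore build different test words.

```java
import java.util.List;
import java.util.Random;

// Hypothetical, heavily simplified illustration -- not the actual RandomWp code.
// Both "hypotheses" contain the same four states; only the iteration order of
// getStates() differs. With an identical seed, the oracle draws the same index
// but (in general) lands on a different state, so the test words built from
// that state, and hence the counterexamples, diverge.
public final class StateOrderDemo {

    public static void main(String[] args) {
        List<String> statesA = List.of("q0", "q1", "q2", "q3"); // iteration order of learner A
        List<String> statesB = List.of("q0", "q2", "q3", "q1"); // same states, learner B's order

        Random randomA = new Random(42);
        Random randomB = new Random(42);

        int indexA = randomA.nextInt(statesA.size()); // same seed ...
        int indexB = randomB.nextInt(statesB.size()); // ... same index ...

        // ... but not necessarily the same state.
        System.out.println(statesA.get(indexA) + " vs. " + statesB.get(indexB));
    }
}
```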
This may sound like a minor issue, but I do believe it is significant. In my benchmarks, I have noticed that learning performance for some SULs is highly sensitive to the counterexamples passed to the learner (across many algorithms, not just mine). Thus, different implementations of the very same algorithm may show drastically different performance for the exact same configuration. And since this interference is not controlled by the oracle seed, it is not clear whether it will balance out over many runs, which poses a threat to the reliability of benchmark results.
I found an easy fix, exploiting a semantic gap between automata learning theory and LearnLib's implementation: From my understanding, many AAL algorithms are not really fully deterministic in theory. They are technically only deterministic if there is a well-defined order over the input alphabet, which is not an assumption the MAT framework makes. Without such an order, there is no deterministic way of iterating over transitions and identifying target states, which can change the order in which new states are discovered. In practice, however, at least in LearnLib, the alphabet always has a deterministic iteration order. This way, learners are able to explore transitions in a deterministic order, and learning performance is exactly reproducible.
Instead of passing the learner's hypothesis as-is into the EQ oracle, I now always construct a canonical copy of it first. This copy is guaranteed to store its states in a deterministic order, independent of the order in which they are stored in the hypothesis instance received from the learner. I do this by visiting the states of the original hypothesis with a BFS starting from its initial state, expanding transitions in the iteration order of the alphabet.
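For reference, here is a minimal sketch of such a canonization step, written against AutomataLib's `MealyMachine`/`CompactMealy` API. The import paths follow recent AutomataLib versions and differ for older releases, and the helper name `Canonizer` is mine; treat this as an illustration of the BFS idea rather than a drop-in utility:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

import net.automatalib.alphabet.Alphabet;
import net.automatalib.automaton.transducer.MealyMachine;
import net.automatalib.automaton.transducer.impl.CompactMealy;

// Sketch of a canonizing copy: visit the hypothesis states with a BFS from the
// initial state, expanding transitions strictly in alphabet iteration order.
// Two isomorphic hypotheses therefore map to identical copies, independent of
// how their own state iterators happen to enumerate states.
public final class Canonizer {

    public static <S, I, O> CompactMealy<I, O> canonize(MealyMachine<S, I, ?, O> hyp,
                                                        Alphabet<I> alphabet) {
        CompactMealy<I, O> copy = new CompactMealy<>(alphabet);
        Map<S, Integer> stateMap = new HashMap<>();
        Queue<S> queue = new ArrayDeque<>();

        S init = hyp.getInitialState();
        stateMap.put(init, copy.addInitialState());
        queue.add(init);

        while (!queue.isEmpty()) {
            S src = queue.poll();
            int copySrc = stateMap.get(src);
            for (I input : alphabet) { // deterministic order given by the alphabet
                S succ = hyp.getSuccessor(src, input);
                if (succ == null) {
                    continue; // partial hypothesis: skip undefined transitions
                }
                Integer copySucc = stateMap.get(succ);
                if (copySucc == null) {
                    copySucc = copy.addState();
                    stateMap.put(succ, copySucc);
                    queue.add(succ);
                }
                copy.addTransition(copySrc, input, copySucc, hyp.getOutput(src, input));
            }
        }
        return copy;
    }
}
```

Since the BFS numbers states in the order they are first reached via the alphabet, the resulting `CompactMealy` depends only on the structure of the hypothesis, which is exactly the property a seeded EQ oracle needs.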
I am not sure how relevant this phenomenon is when comparing different learning algorithms. But at least for some models (like this), I have observed for multiple different learners that average performance across 100 runs can vary by double-digit percentages when switching this canonization step on and off! I think this is a noise factor that should be eliminated when testing algorithms.
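If you want this to be transparent to the rest of the setup, the canonization can also be packaged as a decorating EQ oracle. Again, this is only a sketch: `CanonizingEQOracle` is not part of LearnLib, it reuses the hypothetical `Canonizer` from above, and the exact import paths of `EquivalenceOracle` and `DefaultQuery` depend on the LearnLib version:

```java
import java.util.Collection;

import de.learnlib.oracle.EquivalenceOracle;
import de.learnlib.query.DefaultQuery;
import net.automatalib.alphabet.Alphabet;
import net.automatalib.automaton.transducer.MealyMachine;
import net.automatalib.word.Word;

// Hypothetical decorator (not part of LearnLib): canonizes every hypothesis
// before delegating to the wrapped EQ oracle, so the oracle's seeded
// randomness only ever sees the canonical state order.
public final class CanonizingEQOracle<I, O>
        implements EquivalenceOracle.MealyEquivalenceOracle<I, O> {

    private final EquivalenceOracle.MealyEquivalenceOracle<I, O> delegate;
    private final Alphabet<I> alphabet;

    public CanonizingEQOracle(EquivalenceOracle.MealyEquivalenceOracle<I, O> delegate,
                              Alphabet<I> alphabet) {
        this.delegate = delegate;
        this.alphabet = alphabet;
    }

    @Override
    public DefaultQuery<I, Word<O>> findCounterExample(MealyMachine<?, I, ?, O> hypothesis,
                                                       Collection<? extends I> inputs) {
        // Canonizer.canonize is the BFS copy from the sketch above.
        return delegate.findCounterExample(Canonizer.canonize(hypothesis, alphabet), inputs);
    }
}
```

Wrapping, e.g., the random Wp oracle in such a decorator makes its seeded behavior independent of the learner's internal state order without touching the learner or the oracle itself.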