Rules Discussion #1
Thank you to those who were able to join the meeting; we will make the recording available soon. After the introduction and overview, there were two primary discussion items.
Anyone may propose benchmarks (tool participants and anyone else); so far, some benchmarks have been submitted via the registration form and are listed in the benchmark proposal issue #2. There was a proposal (thanks @KaidiXu!) to create two tracks for scoring: a regular track and an extended track. We will set up a voting mechanism (probably a Google form similar to registration) to select benchmarks for scoring in the different tracks, with the tracks determined by the number of votes from tool participants. Any tool participant may vote for any number of the proposed benchmarks. This is a change from past iterations, where tools nominated 2 benchmarks and a benchmark needed results from at least 2 tools to be scored.

- Regular track: any benchmark with at least X votes from participating tools will be scored in the regular track. X is to be decided; we discussed different alternatives (e.g., at least 3 votes, 1/3 of all participating tools' votes, 50%, etc.). I would suggest 50% of tools, so that essentially a majority of tools plan to support a benchmark for it to be in the regular track.
- Extended track: any benchmark with at least 2 votes from participating tools that is not included in the regular track will be scored in the extended track.
- All other benchmarks (0-1 votes): these will be unscored, but can be included in the report.

There was also a question of which votes count for the benchmarks: probably only votes from participating tools (those who submit results for at least some benchmark during the evaluation phase) should count, as otherwise non-participants would be voting, which could be unfair. This means the tracks could change after the benchmark proposal phase and only be completely known at the evaluation phase. This likely would not significantly impact the regular track, but could change the extended track. We would then have rankings/winners in both tracks per the typical scoring we have been using the past couple of iterations.

Otherwise, there were no proposals to substantially change the scoring or the rules around benchmark selection, other than to ensure things are available as early as possible to give everyone enough time to look at them and add support if needed. If there are further comments/concerns/discussions, please comment below.
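To make the proposed track assignment concrete, here is a minimal illustrative sketch (not part of the rules document itself) of how per-benchmark vote counts could map to tracks. The function name, the example benchmark names, and the vote-count dictionary are placeholders, and the thresholds are parameters since the exact cutoffs (e.g., 50% for the regular track, 2 votes for the extended track) were still under discussion at this point.

```python
# Illustrative sketch only: maps per-benchmark vote counts to the proposed
# regular/extended/unscored tracks. Thresholds are parameters because the
# exact cutoffs were still being decided.
def assign_tracks(votes, num_participating_tools,
                  regular_fraction=0.5, extended_min_votes=2):
    """votes: dict of benchmark name -> number of participating tools voting for it.
    Returns a dict of benchmark name -> track label."""
    regular_threshold = regular_fraction * num_participating_tools
    tracks = {}
    for benchmark, count in votes.items():
        if count >= regular_threshold:
            tracks[benchmark] = "regular"    # scored in the regular track
        elif count >= extended_min_votes:
            tracks[benchmark] = "extended"   # scored in the extended track
        else:
            tracks[benchmark] = "unscored"   # included in the report, not ranked
    return tracks

# Example with 10 participating tools: >= 5 votes -> regular, >= 2 -> extended.
print(assign_tracks({"acasxu": 7, "new_cnn": 3, "niche": 1}, 10))
# {'acasxu': 'regular', 'new_cnn': 'extended', 'niche': 'unscored'}
```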
This was discussed and has come up in most prior VNN-COMP rules discussions. Some tools may use portions of other tools or libraries developed by other teams; auto_LiRPA and CORA both came up as instances where participating tools may use other teams' tools/libraries (e.g., NNV uses CORA in some cases, and some other tools may be building upon auto_LiRPA, etc.). It was proposed to mostly eliminate the meta-solver and percent-duplicate-code aspects of the rules, as these have not come up in practice in any meaningful way in any prior VNN-COMP, and instead to ensure that any libraries/tools used by participating tools are properly acknowledged (e.g., via citations to the dependencies in the report). If any strange edge case arises (e.g., submission of multiple tools that differ only in heuristics, a team directly reusing another team's tool, etc.), the organizers will handle these cases as they arise and in communication with the participants.

Reasoning: it is too hard to define "substantially similar" code bases, using percent lines of code is complicated (and probably infeasible), and further, many tools have many dependencies, so where would we draw the line (e.g., probably everyone uses ONNX libraries, ML frameworks, etc., and these dependencies may dominate the share of the code base that is the same)? Further, it is common (at least in formal methods research) for subsequent tools to build upon prior tools. Another example would be an SMT-based tool that uses Z3: if there are 2 such tools, should they be excluded from the competition (as Z3 would likely dominate both code bases)? Probably not, as the higher-level aspects may be important for doing things efficiently, and there likely is significant effort in developing on top of these existing libraries.

So, the general rule of thumb for dependencies, substantially similar tools, and meta-solvers: acknowledge all dependencies appropriately, and if anything strange or seemingly unfair arises, we will deal with it when it does (this has not mattered significantly in any prior iteration, so it is somewhat a moot point aside from ensuring appropriate credit/citation/attribution). If there are further comments or concerns on this, please also raise them below.
The April 3, 2024 rules telecon recording is available here:
We are a little behind the proposed timeline for finalizing the rules, but will do so by the end of this week (May 10; sorry, have been swamped), and will also extend the new benchmark submission deadline a bit (probably to May 17). We want to give as much time as possible for benchmarks to be considered, so we need to finalize soon. There has not been any further discussion on the rules, so we will finalize as recommended above unless there are further concerns, suggestions, or comments.

For the 2 tracks: as suggested above, let's plan to go with 50% of tools voting for the regular track, and >= 2 tools voting for the extended track, with anything with 0-1 tools unscored for rankings but included in the report. We will set up a voting/proposal mechanism for the benchmarks to be scored so we all know which ones are under consideration for the competition. As in recent prior VNN-COMP iterations, any prior VNN-COMP benchmark can be nominated again, so we will work on adding these to the form as well, as most of the benchmark discussion in the other issue #2 so far consists of proposals of new benchmarks. We will want to finalize all benchmarks under consideration by late May at the very latest, so everyone has time to consider them.
@ttj Thank you for leading the discussions about the new rules! The points you made are great. Here are my two cents:
Thanks @huanzhang12 for the suggestions and comments! It would be great if others could share opinions on these suggestions ASAP. We do need to finalize the rules shortly (ideally tomorrow so we don't have to extend again) and have agreement among participants on what they are agreeing to in terms of rules.
In the interest of time, as we need to finalize, I am going to mention some people whose GitHub IDs I could find to try to garner any final feedback on the rules, in case they have not seen this discussion. Please comment if you have any opinion on some of the proposals. @hocdot
It seems there are no further opinions, so let's finalize as follows, and I will try to get the rules doc updated shortly.

1. After voting, the benchmarks and tracks will be fixed for scoring (so around the end of May), which gives about a month, probably a bit more, to consider everything, so that we have time to assemble everything for presentation at SAIV/CAV on July 22/23. We will not modify the tracks based on which tools submit results. However, we will ensure the process is fair (e.g., if some strange edge case arises with tool participants voting on things and then not participating, hence influencing the competition without active involvement, we will handle it based on discussion with the tool participants).

2. For the regular track, let's go with 50% of the votes of the tool participants (handling rounding, etc. in a reasonable way; we will see how it looks when the votes come in). There are some pros and cons, but one viewpoint is that we understand performance aspects through the competition on a reasonable set of benchmarks supported by most tools, even if the models use standard architectures on ML tasks that have been considered in neural network verification up to now. There are already significant barriers to entry in the competition (supporting VNN-LIB/ONNX, etc.), and most new tools need to start with relatively standard architectures (ReLUs, etc.). The two tracks mitigate this issue, handling the AI/ML community versus FM community differences (e.g., challenge vs. competition), and with point 1 above, the benchmarks under consideration will be set a priori. Broadening architectures is of course a great goal for the community, but this is more likely to happen slowly over time as new ML tasks and architectures are considered (e.g., in papers rather than in this competition). If any benchmark proposer is disappointed by the outcome of the track assignment, we can still describe the benchmark in the competition report, and I will again advertise AISOLA as being very receptive to papers describing benchmarks (see the VNSAI track: https://2024-isola.isola-conference.org/aisola-tracks/ ), where several were presented last year in the SVDNN track: https://link.springer.com/book/10.1007/978-3-031-46002-9#toc The deadline is flexible enough that if any proposed benchmark does not end up being scored in either the regular or extended track, we could consider a paper describing it for AISOLA.

3. To be included in the extended track for scoring, let's go with any tool voting for the benchmark (so 1 vote to be scored). This also mitigates potential issues with point 2 above and incentivizes supporting new benchmarks.

4. Tool participants should exercise standard academic/scholarly courtesy and cite the relevant libraries they build upon in the report. We will add a field in the voting form to indicate what significant dependencies tools have, beyond standard things like ML libraries, ONNX, etc. We can cross-check this against the installation scripts, etc. when results are submitted.

5. A final aspect that arose in discussion, similar to point 4: some tool participants actively or previously collaborate with some benchmark proposers, which can give a potential advantage to those tool participants (e.g., they already support the benchmark and do not need to add support for new layer types/tasks, etc.). We will require disclosure of collaborations between tool participants and benchmark proposers, as is standard in handling conflicts of interest (COI) in scholarly publishing/activity, and will collect these in the voting process as well. As this one is new and had not come up in the discussion above, if there are any concerns, please speak up quickly.
We have updated the rules doc with these changes, available at the link above and here again; if there are any final comments or suggestions, please post soon: https://docs.google.com/document/d/1bgkx5lnugrwlNzQ2MPRSd47MAkZGJfR9v2jo7oRskd0/edit?usp=sharing
We have shared a draft Overleaf report and are also fixing some execution issues identified by some participants. We emailed the listservs so far; please let us know if you did not get it and we will share it.
Please discuss rules here, the draft Google docs rules document is available here:
https://docs.google.com/document/d/1bgkx5lnugrwlNzQ2MPRSd47MAkZGJfR9v2jo7oRskd0/edit?usp=sharing
In particular, we need to settle on the benchmark proposal, selection, and scoring processes for this year.