Unneeded patterns/rules influence the result of the parsing #29

Open
robin-xyzt-ai opened this issue Feb 17, 2023 · 0 comments

This might be an issue with retree rather than with the dateparser though.

The following test (which you cannot execute via the public API) fails:

    @Test
    public void parserWithLimitedPatterns(){
        List<String> rules = Arrays.asList(
          "(?<year>\\d{4})\\W{1}(?<month>\\d{1,2})\\W{1}(?<day>\\d{1,2})[^\\d]?",
          "\\W*(?:at )?(?<hour>\\d{1,2}):(?<minute>\\d{1,2})(?::(?<second>\\d{1,2}))?(?:[.,](?<ns>\\d{1,9}))?(?<zero>z)?",
          " ?(?<zoneOffset>[-+]\\d{1,2}:?(?:\\d{2})?)"
        );

        DateParser dateParser = new DateParser(rules, new HashSet<>(rules), Collections.emptyMap(), true, false);
        String input = "2022-08-09 19:04:31.600000+00:00";
        Date date = dateParser.parseDate(input);
        assertEquals(parser.parseDate(input), date);
    }

Note that these three rules should be sufficient to parse the date:

  • There is a rule for the year-month-day part
  • There is a rule for the hours:minutes:seconds.ns part
  • There is a rule for the zone offset part
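To make the claim concrete, here is a minimal sketch that uses only `java.util.regex` and does not touch the dateparser or retree at all. It applies the three rules once each, in list order, with every rule anchored where the previous match ended. The class name `RuleChainSketch` and the helper `chain` are hypothetical, and the "one rule after another" strategy is an assumption made for illustration, not a description of how the library picks rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RuleChainSketch {
    // The three rules from the failing test, copied verbatim.
    static final List<String> RULES = Arrays.asList(
        "(?<year>\\d{4})\\W{1}(?<month>\\d{1,2})\\W{1}(?<day>\\d{1,2})[^\\d]?",
        "\\W*(?:at )?(?<hour>\\d{1,2}):(?<minute>\\d{1,2})(?::(?<second>\\d{1,2}))?(?:[.,](?<ns>\\d{1,9}))?(?<zero>z)?",
        " ?(?<zoneOffset>[-+]\\d{1,2}:?(?:\\d{2})?)"
    );

    // Hypothetical helper: applies each rule once, in order, anchored at the
    // offset where the previous rule stopped, and collects the matched text.
    static List<String> chain(String input) {
        List<String> pieces = new ArrayList<>();
        int pos = 0;
        for (String rule : RULES) {
            Matcher m = Pattern.compile(rule).matcher(input);
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalStateException("no match at offset " + pos + " for rule: " + rule);
            }
            pieces.add(m.group());
            pos = m.end();
        }
        if (pos != input.length()) {
            throw new IllegalStateException("unconsumed input starting at offset " + pos);
        }
        return pieces;
    }

    public static void main(String[] args) {
        // Prints the three pieces the rules consume, in order.
        System.out.println(chain("2022-08-09 19:04:31.600000+00:00"));
    }
}
```

With plain regex matching, the three rules consume the whole input without leftovers, so the rule set itself looks complete.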

However, during parsing the zoneOffset rule is never used. Instead, the hour rule is applied twice.
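One regex-level observation that may be relevant: at the offset where the zone offset starts, the hour rule also matches, because its leading `\W*` can consume the `+` sign. The sketch below checks this with plain `java.util.regex`. The class name `OffsetAmbiguitySketch` and the helper `matchAt` are hypothetical, and the sketch says nothing about how retree actually resolves such a tie.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetAmbiguitySketch {
    // The hour rule and the zoneOffset rule from the test, copied verbatim.
    static final String TIME_RULE =
        "\\W*(?:at )?(?<hour>\\d{1,2}):(?<minute>\\d{1,2})(?::(?<second>\\d{1,2}))?(?:[.,](?<ns>\\d{1,9}))?(?<zero>z)?";
    static final String ZONE_RULE =
        " ?(?<zoneOffset>[-+]\\d{1,2}:?(?:\\d{2})?)";

    // Hypothetical helper: returns the text the rule matches when anchored
    // at the given offset, or null if the rule does not match there.
    static String matchAt(String rule, String input, int offset) {
        Matcher m = Pattern.compile(rule).matcher(input);
        m.region(offset, input.length());
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "2022-08-09 19:04:31.600000+00:00";
        int offset = input.indexOf('+'); // start of the zone offset part
        System.out.println("hour rule:       " + matchAt(TIME_RULE, input, offset));
        System.out.println("zoneOffset rule: " + matchAt(ZONE_RULE, input, offset));
    }
}
```

Both rules match the same six characters at that offset, so the input alone does not disambiguate them. Whichever rule the matcher prefers in that situation decides whether `+00:00` is read as a zone offset or as a second hour:minute pair.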

The weird thing is that when I add a rule that should not be used (`" ?(?<year>\d{4})$"`), the test suddenly succeeds:

    @Test
    public void parserWithLimitedPatterns(){
        List<String> rules = Arrays.asList(
          "(?<year>\\d{4})\\W{1}(?<month>\\d{1,2})\\W{1}(?<day>\\d{1,2})[^\\d]?",
          " ?(?<year>\\\\d{4})$",
          "\\W*(?:at )?(?<hour>\\d{1,2}):(?<minute>\\d{1,2})(?::(?<second>\\d{1,2}))?(?:[.,](?<ns>\\d{1,9}))?(?<zero>z)?",
          " ?(?<zoneOffset>[-+]\\d{1,2}:?(?:\\d{2})?)"
        );

        DateParser dateParser = new DateParser(rules, new HashSet<>(rules), Collections.emptyMap(), true, false);
        String input = "2022-08-09 19:04:31.600000+00:00";
        Date date = dateParser.parseDate(input);
        assertEquals(parser.parseDate(input), date);
    }

The position where I add that additional rule matters. For example, adding it at the end of the list instead of at index 1 makes the test fail again.

I bumped into this issue while working on PR #28, where I try to reduce the number of rules used for parsing to improve performance.
