Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comment multi-line parsing support issues #97

Open
pelva opened this issue Nov 11, 2024 · 5 comments
Open

Comment multi-line parsing support issues #97

pelva opened this issue Nov 11, 2024 · 5 comments

Comments

@pelva
Copy link

pelva commented Nov 11, 2024

The following is a signal multi-line comment, which will cause an error during parsing.
Unkonwn syntax at line 48
image

VERSION ""


NS_ : 
	NS_DESC_
	CM_
	BA_DEF_
	BA_
	VAL_
	CAT_DEF_
	CAT_
	FILTER
	BA_DEF_DEF_
	EV_DATA_
	ENVVAR_DATA_
	SGTYPE_
	SGTYPE_VAL_
	BA_DEF_SGTYPE_
	BA_SGTYPE_
	SIG_TYPE_REF_
	VAL_TABLE_
	SIG_GROUP_
	SIG_VALTYPE_
	SIGTYPE_VALTYPE_
	BO_TX_BU_
	BA_DEF_REL_
	BA_REL_
	BA_DEF_DEF_REL_
	BU_SG_REL_
	BU_EV_REL_
	BU_BO_REL_
	SG_MUL_VAL_

BS_:

BU_:


BO_ 256 New_Message_1: 8 Vector__XXX
 SG_ New_Signal_2 : 8|8@1- (1,0) [0|0] "" Vector__XXX
 SG_ New_Signal_1 : 0|8@1- (1,0) [0|0] "" Vector__XXX



CM_ SG_ 256 New_Signal_1 "Flag to indicate newly created object in CAN bus.
Object history flag
0x0: new object in the cycle;
0x1: object existed in previous cycle";
BA_DEF_  "MultiplexExtEnabled" ENUM  "No","Yes";
BA_DEF_ BO_  "CANFD_BRS" ENUM  "0","1";
BA_DEF_  "DBName" STRING ;
BA_DEF_  "BusType" STRING ;
BA_DEF_ BU_  "NodeLayerModules" STRING ;
BA_DEF_ BU_  "ECU" STRING ;
BA_DEF_ BU_  "CANoeJitterMax" INT 0 0;
BA_DEF_ BU_  "CANoeJitterMin" INT 0 0;
BA_DEF_ BU_  "CANoeDrift" INT 0 0;
BA_DEF_ BU_  "CANoeStartDelay" INT 0 0;
BA_DEF_ BO_  "VFrameFormat" ENUM  "StandardCAN","ExtendedCAN","reserved","reserved","reserved","reserved","reserved","reserved","reserved","reserved","reserved","reserved","reserved","reserved","StandardCAN_FD","ExtendedCAN_FD";
BA_DEF_DEF_  "MultiplexExtEnabled" "No";
BA_DEF_DEF_  "CANFD_BRS" "1";
BA_DEF_DEF_  "DBName" "";
BA_DEF_DEF_  "BusType" "";
BA_DEF_DEF_  "NodeLayerModules" "";
BA_DEF_DEF_  "ECU" "";
BA_DEF_DEF_  "CANoeJitterMax" 0;
BA_DEF_DEF_  "CANoeJitterMin" 0;
BA_DEF_DEF_  "CANoeDrift" 0;
BA_DEF_DEF_  "CANoeStartDelay" 0;
BA_DEF_DEF_  "VFrameFormat" "StandardCAN";
BA_ "BusType" "CAN FD";
BA_ "DBName" "canfd";
BA_ "VFrameFormat" BO_ 256 14;


@Whitehouse112
Copy link
Contributor

Whitehouse112 commented Nov 12, 2024

Hi,
the error is caused by the ";" inside the comment and at the same time at the end of the line ("0x0: new object in the cycle;"). It's a bug and we will fix it, thanks for spot the issue.

In the meanwhile I would suggest to remove this char and everything will work fine.
Thanks

@Uight
Copy link
Contributor

Uight commented Dec 19, 2024

@Whitehouse112
im have a branch to improve multiline support based on the current regex parser, but i currently have one problem with comments parsing for multiline because of the following single line problem.

BO_ 1160 DAS_steeringControl: 4 NEO
 SG_ DAS_steeringControlType : 23|2@0+ (1,0) [0|0] ""  EPAS
 SG_ DAS_steeringAngleRequest : 0|16@0+ (1,0) [0|0] ""  EPAS

CM_ BO_ 1160 "This is a very fine comment; Right?";

This does work in vector candb++ (free version) aswell as in the dbc parser but:

BO_ 1160 DAS_steeringControl: 4 NEO
 SG_ DAS_steeringControlType : 23|2@0+ (1,0) [0|0] ""  EPAS
 SG_ DAS_steeringAngleRequest : 0|16@0+ (1,0) [0|0] ""  EPAS

CM_ BO_ 1160 "This is a very fine comment; Right?"

Would be an error in parsing for vector candb++ but not for us. We would parse this comment as:
"This is a very fine comment"

For the current parser that tries to handle comments in the comment parser thats omehow ok. But in my version im using the nextLineProvider to parse multiline and i would force the ; to be at the end of a comment with the $ in the regex.

Other options?

@Whitehouse112
Copy link
Contributor

Hi @Uight
I would suggest to fix comment parsing regex to force line to have exactly one " at the beginning and one " at the end of the comment string. The regex is currently incorrectly constructed like (e.g. message comment):

  • CM_ BO_ \s+ [\d] "* [^"]* "* \s*;

Don't know/remember why this regex is currently allowing comment line to have no quotation marks, there are also two test (MalformedLineIsAcceptedWithoutInteraction and AnotherMalformedLineIsAcceptedWithoutInteraction) testing this behavior that in my opinion is incorrect since " are mandatory. These tests can be removed and the comment regex updated.

Indeed, having this syntax, zero or multiple " are allowed (* matches the previous token between zero and unlimited times) and this is what we need to fix to correctly read both your and pelva's comment (basically just remove the *). In this way you are forcing ; to be implicitly at the end of the line and at the same time after a ", rather than checking only the end of the line.

Obviously this solution works only in your PR since the CommentLineParser comes already with all the lines available, it doesn't work in the current dbcParser version since multiple lines, if any, are retrieved inside the parser itself looking for the ; a the end of the line. I tried to modify regex in your PR and it works fine.

Any concerns about this solution?

@Uight
Copy link
Contributor

Uight commented Jan 7, 2025

Hi @Whitehouse112 thanks for the answer.

i would be fine with that but im not to sure about the pullrequest. Basically i was just needing a correct regex to test something.
I started a little test project with code like this:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string filePath = "textfile.txt";

        string fileContent = File.ReadAllText(filePath);

        // Split the content into lines for tracking line numbers
        string[] lines = File.ReadAllLines(filePath);

        // Define your regular expressions (anchored to the beginning of the text with optional spaces)
        List<Regex> regexList = new List<Regex>
        {
            new Regex(@"^\s*[A-Z]{2}\s*[a-z]+", RegexOptions.Singleline), // Match a multi-line pattern at the start
            new Regex(@"^\s*\d{4}", RegexOptions.Singleline),                  // Match 4-digit numbers at the start
            new Regex(@"^\s*[A-Z]{3}", RegexOptions.Singleline)               // Match 3 uppercase letters at the start
        };

        Console.WriteLine("Parsing the file:");

        int currentIndex = 0;
        while (!string.IsNullOrEmpty(fileContent))
        {
            bool matchFound = false;

            foreach (Regex regex in regexList)
            {
                Match match = regex.Match(fileContent);

                if (match.Success)
                {
                    matchFound = true;
                    currentIndex += match.Length - match.Index;

                    // Determine the line number of the match
                    int lineNumber = GetLineNumber(currentIndex, lines);

                    Console.WriteLine($"Match for regex '{regex}' at line {lineNumber + 1}: {match.Value.TrimStart()}");

                    // Remove matched text
                    fileContent = fileContent.Remove(match.Index, match.Length); // Trim leading spaces
                    break; // Restart checking from the remaining text
                }
            }

            if (!matchFound)
            {
                // No matches found: Identify the line where it stopped
                int lineNumber = GetLineNumber(currentIndex, lines);
                Console.WriteLine($"No matches found starting at line {lineNumber + 2}:");
                Console.WriteLine(lines[lineNumber + 1]);
                break;
            }
        }
    }

    static int GetLineNumber(int position, string[] lines)
    {
        int charCount = 0;
        for (int i = 0; i < lines.Length; i++)
        {
            charCount += lines[i].Length + Environment.NewLine.Length;
            if (position < charCount)
                return i;
        }
        return lines.Length - 1; // Fallback: last line
    }
}

In this case parsing a file like this:

1234
ABC DEF
AB
f GHI
NoMatchHere
5678
XYZ ghi

This would just take the whole file and read it trying to parse the regexes. This has the potential to be a massive performance killer (slow and memory intensiv) for big files. Escpecially since the GetLineNumber function is super ineffcient but actually only needed if parsing is not successfull. To improve the regex performance i used the ^ at the start so it matches only the start and aborts early if the start is different.
I wanted to try this further but didnt find the time yet but for it i needed a good regex for the comments to check viability

(This is ment as a compromise to the 1.8 milestone branch that uses a fluent parser (where we would need to redefine all parsers in the new syntax. This is ment to do the same but with the current regexes)

I stopped since i head problem with the stuff currently in the ignore line parser but i recently checked some conditions with candb++ and it seems that it could be parsed by matching all strings at until you find BS: (BS: is required i guess exactly for that reason. While Version and NS are optional BS must be present and Version and NS mus appear before it)

Basically: I dont feel like the PR is a really good solution but it might be slightly better then the current solution so if you wanna bring your changes to the PR feel free to do so as it will definitly improve the PR and i get to use the Regexes from there in my current tests.

@Uight
Copy link
Contributor

Uight commented Jan 9, 2025

Finished a example parsed with the method from the last comment.
RegexParsing.zip
It has not all parsers but can parse a small little dbc file with some special stuff in it.
For errors its pretty much behaves as candb++ with the line of error sometime being one of the line vector reports but everything else should be same. (The line logging is sometimes since the parsing of a new definition always starts rigth after the last.

Some strange stuff that i didnt find a good solution too within a regex parser are:

  1. Ignoreing the new symbols in a good way. I found that vector needs the Bit timing definition after the new symbols and so i parse until then with a look ahead.
    (2. a full Bit timing wouldnt be parsed. (dont know what that would even look like))
  2. Stopping the parsing of nodes (There is no clear finisher. So i decided on the lookahead approach again and stop if the next thing is a definition
  3. Same as 3 but this times for signals. The problem is the same tough. Since signals can have multiple receivers theres no clear stop to parsing.

One Problem with lookahead is that it would currently search for "BO_" among some others but BO_Test would also be a valid node name...

@Whitehouse112 you could check that out and maybe have some idea for improved regex...

@Adhara3 do you know how you would write a parsed in the fluent tokenizer branch for a line like this:
BU_: HELLO WORLD BO_ 1160 DAS_steeringControl: 4 NEO

This is parsed correctly by candb++ but im not sure how to do this in a nice way either in my current try or the fluent tokenizer. Because for all i know the BO_ would also be a valid node name if it wasnt a keyword.

Whitehouse112 added a commit to Whitehouse112/DbcParser that referenced this issue Jan 10, 2025
…nt string (single and multiline comment). Added some tests. Removed wrong tests allowing comment to be written without quotation marks.
Adhara3 added a commit that referenced this issue Jan 10, 2025
#97 Updated comment parser regex
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants