-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comment multi-line parsing support issues #97
Comments
Hi, In the meanwhile I would suggest to remove this char and everything will work fine. |
@Whitehouse112
This does work in vector candb++ (free version) aswell as in the dbc parser but:
Would be an error in parsing for vector candb++ but not for us. We would parse this comment as: For the current parser that tries to handle comments in the comment parser thats omehow ok. But in my version im using the nextLineProvider to parse multiline and i would force the ; to be at the end of a comment with the $ in the regex. Other options? |
Hi @Uight
Don't know/remember why this regex is currently allowing comment line to have no quotation marks, there are also two test (MalformedLineIsAcceptedWithoutInteraction and AnotherMalformedLineIsAcceptedWithoutInteraction) testing this behavior that in my opinion is incorrect since " are mandatory. These tests can be removed and the comment regex updated. Indeed, having this syntax, zero or multiple " are allowed (* matches the previous token between zero and unlimited times) and this is what we need to fix to correctly read both your and pelva's comment (basically just remove the *). In this way you are forcing ; to be implicitly at the end of the line and at the same time after a ", rather than checking only the end of the line. Obviously this solution works only in your PR since the CommentLineParser comes already with all the lines available, it doesn't work in the current dbcParser version since multiple lines, if any, are retrieved inside the parser itself looking for the ; a the end of the line. I tried to modify regex in your PR and it works fine. Any concerns about this solution? |
Hi @Whitehouse112 thanks for the answer. i would be fine with that but im not to sure about the pullrequest. Basically i was just needing a correct regex to test something. using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
class Program
{
static void Main(string[] args)
{
string filePath = "textfile.txt";
string fileContent = File.ReadAllText(filePath);
// Split the content into lines for tracking line numbers
string[] lines = File.ReadAllLines(filePath);
// Define your regular expressions (anchored to the beginning of the text with optional spaces)
List<Regex> regexList = new List<Regex>
{
new Regex(@"^\s*[A-Z]{2}\s*[a-z]+", RegexOptions.Singleline), // Match a multi-line pattern at the start
new Regex(@"^\s*\d{4}", RegexOptions.Singleline), // Match 4-digit numbers at the start
new Regex(@"^\s*[A-Z]{3}", RegexOptions.Singleline) // Match 3 uppercase letters at the start
};
Console.WriteLine("Parsing the file:");
int currentIndex = 0;
while (!string.IsNullOrEmpty(fileContent))
{
bool matchFound = false;
foreach (Regex regex in regexList)
{
Match match = regex.Match(fileContent);
if (match.Success)
{
matchFound = true;
currentIndex += match.Length - match.Index;
// Determine the line number of the match
int lineNumber = GetLineNumber(currentIndex, lines);
Console.WriteLine($"Match for regex '{regex}' at line {lineNumber + 1}: {match.Value.TrimStart()}");
// Remove matched text
fileContent = fileContent.Remove(match.Index, match.Length); // Trim leading spaces
break; // Restart checking from the remaining text
}
}
if (!matchFound)
{
// No matches found: Identify the line where it stopped
int lineNumber = GetLineNumber(currentIndex, lines);
Console.WriteLine($"No matches found starting at line {lineNumber + 2}:");
Console.WriteLine(lines[lineNumber + 1]);
break;
}
}
}
static int GetLineNumber(int position, string[] lines)
{
int charCount = 0;
for (int i = 0; i < lines.Length; i++)
{
charCount += lines[i].Length + Environment.NewLine.Length;
if (position < charCount)
return i;
}
return lines.Length - 1; // Fallback: last line
}
} In this case parsing a file like this: 1234 This would just take the whole file and read it trying to parse the regexes. This has the potential to be a massive performance killer (slow and memory intensiv) for big files. Escpecially since the GetLineNumber function is super ineffcient but actually only needed if parsing is not successfull. To improve the regex performance i used the ^ at the start so it matches only the start and aborts early if the start is different. (This is ment as a compromise to the 1.8 milestone branch that uses a fluent parser (where we would need to redefine all parsers in the new syntax. This is ment to do the same but with the current regexes) I stopped since i head problem with the stuff currently in the ignore line parser but i recently checked some conditions with candb++ and it seems that it could be parsed by matching all strings at until you find BS: (BS: is required i guess exactly for that reason. While Version and NS are optional BS must be present and Version and NS mus appear before it) Basically: I dont feel like the PR is a really good solution but it might be slightly better then the current solution so if you wanna bring your changes to the PR feel free to do so as it will definitly improve the PR and i get to use the Regexes from there in my current tests. |
Finished a example parsed with the method from the last comment. Some strange stuff that i didnt find a good solution too within a regex parser are:
One Problem with lookahead is that it would currently search for "BO_" among some others but BO_Test would also be a valid node name... @Whitehouse112 you could check that out and maybe have some idea for improved regex... @Adhara3 do you know how you would write a parsed in the fluent tokenizer branch for a line like this: This is parsed correctly by candb++ but im not sure how to do this in a nice way either in my current try or the fluent tokenizer. Because for all i know the BO_ would also be a valid node name if it wasnt a keyword. |
…nt string (single and multiline comment). Added some tests. Removed wrong tests allowing comment to be written without quotation marks.
The following is a signal multi-line comment, which will cause an error during parsing.
Unkonwn syntax at line 48
The text was updated successfully, but these errors were encountered: