-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support bash's ANSI-C quoted strings #366
Conversation
Except for numbers and upper/lower case ASCII letters, I actually haven't implemented control code escapes the way bash implements them Examples of how bash implements control codes
If someone finds it in the bash source code and tells me what it's doing, I can implement it correctly. Also, ANSI-C quoted strings are a bash thing, not a |
d61116c
to
c1420f7
Compare
3000ecb
to
94fd708
Compare
There will be a mismatch in parsing because in bash you can go past the last code point $ echo -n $'\U10FFFF' | xxd
00000000: f48f bfbf ....
$ echo -n $'\U110000' | xxd
00000000: f490 8080 ....
$ node
Welcome to Node.js v14.14.0.
Type ".help" for more information.
> String.fromCodePoint(0x10FFFF)
'�'
> String.fromCodePoint(0x110000)
Uncaught RangeError: Invalid code point 1114112
at Function.fromCodePoint (<anonymous>)
>
$ python3
Python 3.9.0+ (default, Oct 19 2020, 09:51:18)
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> "\U0010FFFF"
'\U0010ffff'
>>> "\U00110000"
File "<stdin>", line 1
"\U00110000"
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-9: illegal Unicode character |
Control characters are proving to be a real pain. It's complicated by the fact that my bash uses UTF-8 (and other encodings are supported) but JS uses UTF-16. I'm thinking to only do the control characters that make sense and throwing an error for the rest. I tried reverse engineering what bash does with function toBinaryString (buf) {
let result = []
for (let b of buf) {
result.push(b.toString(2).padStart(8, '0'))
}
return result.join(' ')
}
async function check (ansiString, encoding = 'utf8') {
const result = parser(['--foo', ansiString]).foo + '' // convert integers back to str
const resultBuffer = Buffer.from(result, encoding)
const { stdout } = await exec(
'printf "%s" ' + ansiString,
{ shell: '/bin/bash', encoding: 'buffer' }
)
if (!resultBuffer.equals(stdout)) {
const binaryResult = toBinaryString(resultBuffer);
const binaryStdout = stdout.length ? toBinaryString(stdout) : '00000000 (null)';
if (binaryResult !== binaryStdout) {
console.log(JSON.stringify(ansiString), ansiString[4], result.length, resultBuffer, resultBuffer.length, stdout, stdout.length)
console.log(toBinaryString(Buffer.from(ansiString[4], 'utf8')))
console.log(binaryStdout)
// con sole.log(binaryResult)
console.log()
}
if (stdout.length) {
// throw 'aoeu'
}
} else {
// console.log(JSON.stringify(ansiString), 'works correctly')
}
// resultBuffer.should.deep.equal(stdout)
}
it('parses all control codes in ANSI-C quoted strings like bash', async () => {
for (let i = 1; i <= 0x10FFFF; i++) {
let chr = String.fromCodePoint(i);
if (chr === "'" || chr === "\\") { // TODO: escape
continue
}
await check("$'\\c" + chr + "'")
// await check("$'\\c" + chr + " '")
}
// }).timeout(5 * 60 * 60 * 1000)
}).timeout(60 * 1000) |
@bcoe could we get this merged into upstream? |
@jonluca I still haven't finished control codes, at a minimum it needs to raise an error for the control codes that aren't handled. If you'd like, feel free to copy paste parts of this code to curlconverter. Take a look at the source of where Chrome generates the curl command, you don't need to handle everything (like control codes) https://github.com/ChromeDevTools/devtools-frontend/blob/90e5470c94875d4417e5d31e2ab84f9b0682f33c/front_end/network/NetworkLogView.js#L2297-L2326 |
@verhovsky @jonluca I am supportive of this feature, but since it's breaking and fairly specialized, I'd like for it to be behind the configuration option |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd expect the throughput of the string parsing to be 1000s or 100000s of operations a second, so I don't see why we should need the timeouts added to test cases.
Added a couple additional comments, along with the note I added about putting this behind a feature flag.
test/yargs-parser.cjs
Outdated
} | ||
await checkAll('', i.toString(8), 3, 'ascii') | ||
} | ||
}).timeout(5000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised by the need to extend the timeout, the parsing of strings should be quite fast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test runs a bash command through exec
511 times. On my desktop it completes in slightly under the default 2 second timeout, but I added some headroom.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed these tests, they were more to help me understand how bash does the parsing while I was working on this.
lib/yargs-parser.ts
Outdated
@@ -629,6 +632,59 @@ export class YargsParser { | |||
return value | |||
} | |||
|
|||
// ANSI-C quoted strings are a bash-only feature and have the form $'some text' | |||
// https://www.gnu.org/software/bash/manual/html_node/ANSI_002dC-Quoting.html | |||
function parseAnsiCQuote (str: string): string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's pul lib/string-utils.ts
.
test/yargs-parser.cjs
Outdated
|
||
// TODO: echo -n $'\U110000' produces f4 90 80 80 but String.fromCodePoint(0x110000) | ||
// raises an error in node: Uncaught RangeError: Invalid code point 1114112 | ||
}).timeout(6000 * 1000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise, let's move these tests into test/string-utils.ts
.
c698d28
to
2cfd1bc
Compare
I named it |
06a1a2c
to
65701b9
Compare
README.md
Outdated
|
||
```console | ||
> example.js $'hello\\nworld' | ||
{ _: [ "$'hello\\nworld'" ] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote this example by hand. how do I generate it properly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Easiest thing to do might be to clone yargs
and yargs-parser
, then do this:
cd yargs-parser
npm link .
cd yargs
npm link yargs-parser
// write your example in a file that includes the local copy of yargs.
test/string-utils.cjs
Outdated
} | ||
|
||
// Skip exhaustive testing because it takes a long time | ||
xit('parses all octal codes in ANSI-C quoted strings like bash', async () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you run these tests? Perhaps you could add an additional test-exhaustive
script to package.json
, and we could run them just when we merge to the main branch.
@verhovsky I'm getting back to open source after a break earlier this year. Could I bother you for a rebase 👏 Working on driving yargs-parser down to zero open PRs. |
0b1d9ff
to
ffefd57
Compare
ab6d8b6
to
732964a
Compare
@verhovsky this feature looks quite solid to me, except for a couple a broken test in Thank you for the contribution 👍 |
@verhovsky friendly nudge, need any help getting this over finish line? |
@bcoe sorry for the inactivity. I realized I put this feature in the wrong place. It needs to go in the tokenizer and I think I have to improve the tokenizer to parse $-prefixed strings first. yargs has 2 modes of operation, either someone (or a script) calls a JS script through bash, which means bash takes the user input command string and parses it (this is the step that parses ANSI-C quoted strings), splitting it into words which then get passed to Node and it exposes that as an array of strings which then get passed to yargs. The other mode is when yargs gets passed a string, presumably this string is a raw bash command that needs to be parsed using bash's rules to build the array of arguments first. I added my code to work after the first step (i.e. to operate on the array), which is not the right place for this to go. It needs to happen in
(Edit: I ended up using tree-sitter-bash instead of yargs for parsing Bash and implementing option parsing myself) You already had basically this conversation in #302 |
@verhovsky thanks for keeping me in the loop 👍 One thought, perhaps we could add an approach for providing an extension to the tokenizer, so that you could inject a parser like I'm going to go ahead and close this, since we've stalled. But please feel free to open the conversation back up at any point. |
Closes #346