Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support bash's ANSI-C quoted strings #366

Closed
wants to merge 2 commits into from
Closed

Conversation

verhovsky
Copy link

@verhovsky verhovsky commented Mar 15, 2021

Closes #346

@verhovsky
Copy link
Author

verhovsky commented Mar 15, 2021

Except for numbers and upper/lower case ASCII letters, I actually haven't implemented control code escapes the way bash implements them

Examples of how bash implements control codes
$ echo -n $'ф' | xxd
00000000: d184                                     ..
$ echo -n $'\cф' | xxd
00000000: 1184                                     ..
$ echo -n $'🟥' | xxd
00000000: f09f 9fa5                                ....
$ echo -n $'\c🟥' | xxd
00000000: 109f 9fa5                                ....
$ echo -n $'^M' | xxd
00000000: 0d                                       .
$ echo -n $'\c^M' | xxd
00000000: 0d                                       .
$ echo -n $'1' | xxd
00000000: 31                                       1
$ echo -n $'\c1' | xxd
00000000: 11                                       .
$ echo -n ' ' | xxd -b
00000000: 00100000                                                
$ echo -n $'\c ' | xxd -b  # null
$ echo -n '!' | xxd -b
00000000: 00100001                                               !
$ echo -n $'\c!' | xxd -b
00000000: 00000001                                               .
$ echo -n '/' | xxd -b
00000000: 00101111                                               /
$ echo -n $'\c/' | xxd -b
00000000: 00001111                                               .
$ echo -n '0' | xxd -b
00000000: 00110000                                               0
$ echo -n $'\c0' | xxd -b
00000000: 00010000                                               .
$ echo -n '?' | xxd -b
00000000: 00111111                                               ?
$ echo -n $'\c?' | xxd -b
00000000: 01111111                                               .
$ echo -n '@' | xxd -b
00000000: 01000000                                               @
$ echo -n $'\c@' | xxd -b  # null value
$ echo -n 'A' | xxd -b
00000000: 01000001                                               A
$ echo -n $'\cA' | xxd -b
00000000: 00000001                                               .
$ echo -n 'O' | xxd -b
00000000: 01001111                                               O
$ echo -n $'\cO' | xxd -b
00000000: 00001111                                               .
$ echo -n 'P' | xxd -b
00000000: 01010000                                               P
$ echo -n $'\cP' | xxd -b
00000000: 00010000                                               .
$ echo -n '_' | xxd -b
00000000: 01011111                                               _
$ echo -n $'\c_' | xxd -b
00000000: 00011111                                               .
$ echo -n '`' | xxd -b
00000000: 01100000                                               `
$ echo -n $'\c`' | xxd -b
$ echo -n 'a' | xxd -b
00000000: 01100001                                               a
$ echo -n $'\ca' | xxd -b
00000000: 00000001                                               .
$ echo -n 'o' | xxd -b
00000000: 01101111                                               o
$ echo -n $'\co' | xxd -b
00000000: 00001111                                               .
$ echo -n 'p' | xxd -b
00000000: 01110000                                               p
$ echo -n $'\cp' | xxd -b
00000000: 00010000                                               .
$ echo -n '~' | xxd -b
00000000: 01111110                                               ~
$ echo -n $'\c~' | xxd -b
00000000: 00011110                                               .
$ echo -n ' ' | xxd -b
00000000: 00100000                                                
$ echo -n $'\c ' | xxd -b  # null char
$ echo -n '!' | xxd -b
00000000: 00100001                                               !
$ echo -n $'\c!' | xxd -b
00000000: 00000001                                               .
$ echo -n 罿 | xxd -b
00000000: 11100111 10111101 10111111                             ...
$ echo -n $'\c罿' | xxd -b
00000000: 00000111 10111101 10111111                             ...

If someone finds it in the bash source code and tells me what it's doing, I can implement it correctly.

Also, ANSI-C quoted strings are a bash thing, not a sh thing, we may want to have a flag to let users toggle this behavior.

@verhovsky verhovsky force-pushed the master branch 5 times, most recently from d61116c to c1420f7 Compare March 15, 2021 13:48
@verhovsky verhovsky changed the title suport bash's ANSI-C quoted strings support bash's ANSI-C quoted strings Mar 15, 2021
@verhovsky verhovsky force-pushed the master branch 7 times, most recently from 3000ecb to 94fd708 Compare March 17, 2021 22:50
@verhovsky
Copy link
Author

verhovsky commented Mar 17, 2021

There will be a mismatch in parsing because in bash you can go past the last code point 0x10FFFF, but you can't do that in node (or Python):

$ echo -n $'\U10FFFF'  | xxd
00000000: f48f bfbf                                ....
$ echo -n $'\U110000'  | xxd
00000000: f490 8080                                ....
$ node
Welcome to Node.js v14.14.0.
Type ".help" for more information.
> String.fromCodePoint(0x10FFFF)
''
> String.fromCodePoint(0x110000)
Uncaught RangeError: Invalid code point 1114112
    at Function.fromCodePoint (<anonymous>)
> 
$ python3
Python 3.9.0+ (default, Oct 19 2020, 09:51:18) 
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> "\U0010FFFF"
'\U0010ffff'
>>> "\U00110000"
  File "<stdin>", line 1
    "\U00110000"
                ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-9: illegal Unicode character

@verhovsky
Copy link
Author

Control characters are proving to be a real pain. It's complicated by the fact that my bash uses UTF-8 (and other encodings are supported) but JS uses UTF-16. I'm thinking to only do the control characters that make sense and throwing an error for the rest.

I tried reverse engineering what bash does with \c codes by adding this code in test/yargs-parser.cjs:

    function toBinaryString (buf) {
      let result = []
      for (let b of buf) {
        result.push(b.toString(2).padStart(8, '0'))
      }
      return result.join(' ')
    }

    async function check (ansiString, encoding = 'utf8') {
      const result = parser(['--foo', ansiString]).foo + '' // convert integers back to str
      const resultBuffer = Buffer.from(result, encoding)

      const { stdout } = await exec(
        'printf "%s" ' + ansiString,
        { shell: '/bin/bash', encoding: 'buffer' }
      )
      if (!resultBuffer.equals(stdout)) {
        const binaryResult = toBinaryString(resultBuffer);
        const binaryStdout = stdout.length ? toBinaryString(stdout) : '00000000 (null)';
        if (binaryResult !== binaryStdout) {
          console.log(JSON.stringify(ansiString), ansiString[4], result.length, resultBuffer, resultBuffer.length, stdout, stdout.length)
          console.log(toBinaryString(Buffer.from(ansiString[4], 'utf8')))
          console.log(binaryStdout)
          // con sole.log(binaryResult)
          console.log()
        }
        if (stdout.length) {
          // throw 'aoeu'
        }
      } else {
        // console.log(JSON.stringify(ansiString), 'works correctly')
      }
      // resultBuffer.should.deep.equal(stdout)
    }

    it('parses all control codes in ANSI-C quoted strings like bash', async () => {
      for (let i = 1; i <= 0x10FFFF; i++) {
        let chr =  String.fromCodePoint(i);
        if (chr === "'" || chr === "\\") {  // TODO: escape
          continue
        }
        await check("$'\\c" + chr + "'")
        // await check("$'\\c" + chr + " '")
      }
    // }).timeout(5 * 60 * 60 * 1000)
    }).timeout(60 * 1000)

@jonluca
Copy link

jonluca commented Mar 30, 2021

@bcoe could we get this merged into upstream?

@verhovsky
Copy link
Author

@jonluca I still haven't finished control codes, at a minimum it needs to raise an error for the control codes that aren't handled.

If you'd like, feel free to copy paste parts of this code to curlconverter. Take a look at the source of where Chrome generates the curl command, you don't need to handle everything (like control codes) https://github.com/ChromeDevTools/devtools-frontend/blob/90e5470c94875d4417e5d31e2ab84f9b0682f33c/front_end/network/NetworkLogView.js#L2297-L2326

@bcoe
Copy link
Member

bcoe commented Mar 31, 2021

@verhovsky @jonluca I am supportive of this feature, but since it's breaking and fairly specialized, I'd like for it to be behind the configuration option --bash-ansi-strings, then folks who want the behavior for their library just need to turn the feature on in their package.json, by adding it to a yargs stanza.

Copy link
Member

@bcoe bcoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd expect the throughput of the string parsing to be 1000s or 100000s of operations a second, so I don't see why we should need the timeouts added to test cases.

Added a couple additional comments, along with the note I added about putting this behind a feature flag.

}
await checkAll('', i.toString(8), 3, 'ascii')
}
}).timeout(5000)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised by the need to extend the timeout, the parsing of strings should be quite fast.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test runs a bash command through exec 511 times. On my desktop it completes in slightly under the default 2 second timeout, but I added some headroom.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed these tests, they were more to help me understand how bash does the parsing while I was working on this.

@@ -629,6 +632,59 @@ export class YargsParser {
return value
}

// ANSI-C quoted strings are a bash-only feature and have the form $'some text'
// https://www.gnu.org/software/bash/manual/html_node/ANSI_002dC-Quoting.html
function parseAnsiCQuote (str: string): string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's pul lib/string-utils.ts.


// TODO: echo -n $'\U110000' produces f4 90 80 80 but String.fromCodePoint(0x110000)
// raises an error in node: Uncaught RangeError: Invalid code point 1114112
}).timeout(6000 * 1000)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, let's move these tests into test/string-utils.ts.

@verhovsky verhovsky force-pushed the master branch 2 times, most recently from c698d28 to 2cfd1bc Compare April 7, 2021 10:00
@verhovsky
Copy link
Author

verhovsky commented Apr 7, 2021

I named it --parse-bash-ansi-c-strings instead of --bash-ansi-strings. It's more verbose but it's more correct, there's no such thing as "ANSI strings". We could name it --bash-c-strings or --bash-c-quoted-strings. The point is that they are bash-specific string syntax that uses C's quoting rules.

@verhovsky verhovsky force-pushed the master branch 3 times, most recently from 06a1a2c to 65701b9 Compare April 7, 2021 10:30
README.md Outdated

```console
> example.js $'hello\\nworld'
{ _: [ "$'hello\\nworld'" ] }
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote this example by hand. how do I generate it properly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Easiest thing to do might be to clone yargs and yargs-parser, then do this:

cd yargs-parser
npm link .
cd yargs
npm link yargs-parser
// write your example in a file that includes the local copy of yargs.

}

// Skip exhaustive testing because it takes a long time
xit('parses all octal codes in ANSI-C quoted strings like bash', async () => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you run these tests? Perhaps you could add an additional test-exhaustive script to package.json, and we could run them just when we merge to the main branch.

@bcoe
Copy link
Member

bcoe commented Jun 20, 2021

@verhovsky I'm getting back to open source after a break earlier this year. Could I bother you for a rebase 👏

Working on driving yargs-parser down to zero open PRs.

@verhovsky verhovsky force-pushed the master branch 3 times, most recently from 0b1d9ff to ffefd57 Compare July 4, 2021 18:06
@verhovsky verhovsky requested a review from bcoe July 4, 2021 18:07
@verhovsky verhovsky force-pushed the master branch 2 times, most recently from ab6d8b6 to 732964a Compare July 4, 2021 18:11
@verhovsky verhovsky marked this pull request as draft July 11, 2021 04:38
@bcoe
Copy link
Member

bcoe commented Jul 12, 2021

@verhovsky this feature looks quite solid to me, except for a couple a broken test in optimize, is this ready to move out of draft?

Thank you for the contribution 👍

@bcoe
Copy link
Member

bcoe commented Jul 22, 2021

@verhovsky friendly nudge, need any help getting this over finish line?

@verhovsky
Copy link
Author

verhovsky commented Jul 30, 2021

@bcoe sorry for the inactivity. I realized I put this feature in the wrong place. It needs to go in the tokenizer and I think I have to improve the tokenizer to parse $-prefixed strings first.

yargs has 2 modes of operation, either someone (or a script) calls a JS script through bash, which means bash takes the user input command string and parses it (this is the step that parses ANSI-C quoted strings), splitting it into words which then get passed to Node and it exposes that as an array of strings which then get passed to yargs. The other mode is when yargs gets passed a string, presumably this string is a raw bash command that needs to be parsed using bash's rules to build the array of arguments first.

I added my code to work after the first step (i.e. to operate on the array), which is not the right place for this to go. It needs to happen in lib/tokenize-arg-string.ts but I noticed the bash parsing there is quite rudimentary, I thought it would be good to do real parsing. So I marked the PR as draft and started looking into how to parse bash in JavaScript. I had these ideas:

  1. bash's parsing is implemented with yacc, here https://git.savannah.gnu.org/cgit/bash.git/tree/parse.y . I thought we could run it with a JS yacc implementation (i.e. jison), but if I understand it correctly, the yacc grammar is used to generate a C program, so you'd still need to rewrite that stuff in JavaScript
  2. use someone else's JS bash parser. I only see js-shell-parse and bash-parser, both are kinda unmaintained and neither seem to handle ANSI-C quoted strings. What else do they not handle?
  3. convert the C code that does shell parsing in a simpler shell such as dash https://git.kernel.org/pub/scm/utils/dash/dash.git/tree/src/parser.c
  4. compile parsing logic from either bash or dash to Wasm
  5. build on the shell parsing from curl-to-go, which seems a bit better

(Edit: I ended up using tree-sitter-bash instead of yargs for parsing Bash and implementing option parsing myself)

You already had basically this conversation in #302

@bcoe
Copy link
Member

bcoe commented Oct 23, 2021

@verhovsky thanks for keeping me in the loop 👍

One thought, perhaps we could add an approach for providing an extension to the tokenizer, so that you could inject a parser like tree-sitter-bash?


I'm going to go ahead and close this, since we've stalled. But please feel free to open the conversation back up at any point.

@bcoe bcoe closed this Oct 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Convert ANSI-C quoted strings to their value
3 participants