Skip to content

Commit

Permalink
plumb through utf8 paths in parser
Browse files Browse the repository at this point in the history
This change makes UTF-8 (and other encodings) paths properly round-trip through the
parser, including when showing paths in command descriptions.  I expect Unicode in
other places (like literal text in command descriptions) still won't work.  I updated
doc/development.md with a longer discussion of the Unicode situation in general.

Fixes #55
  • Loading branch information
evmar committed Dec 21, 2022
1 parent 693d2ef commit 03de3da
Show file tree
Hide file tree
Showing 4 changed files with 109 additions and 17 deletions.
56 changes: 46 additions & 10 deletions doc/development.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,24 +8,60 @@ $ ln -s ../../git-pre-commit .git/hooks/pre-commit

## Path handling and Unicode safety

Currently we use Rust `String` for all paths and file contents, but
internally interpret them as as bytes (not UTF8) including using "unsafe"
Ninja
[was intentionally "encoding agnostic"](https://ninja-build.org/manual.html#ref_lexer),
which is to say it treated input build files as any byte stream that is ASCII
compatible. In other words, any string of bytes found in a `build.ninja` is
passed verbatim through printing to stdout and to the OS for path operations,
which meant Ninja was compatible with both UTF-8 and other encodings. The intent
is that those other encodings occur on Linuxes, especially in East Asia, and
also it means Ninja doesn't need any specific knowledge of Unicode.

It looks like since my time,
[Ninja on Windows changed its input files to require UTF-8](https://github.com/ninja-build/ninja/pull/1915).
As mentioned there, this was actually a breaking change, and it looks like there
was a decent amount of fallout from the change.

Windows is particularly fiddly in this area because Ninja needs to parse the
`/showIncludes` output from the MSVC compiler, which is localized. See the
`msvc_deps_prefix` variable in the Ninja docs; there have been lots of bug
reports over the years from people with Chinese output that is failing to parse
right due to Windows code page mess.

In any case, n2 doesn't support any of this for now, and instead just follows
Ninja in treating paths as bytes. (n2 doesn't support `/showIncludes` or MSVC at
all yet.)

Further, we currently use Rust `String` for all paths and file contents, but
internally interpret them as as bytes (not UTF-8) including using `unsafe`
sometimes to convert.

Based on my superficial understanding of how safety relates to UTF8 in Rust
strings, it's probably harmless given that we never treat strings as Unicode,
but it's also possible some code outside of our control relies on this.
but it's also possible some code outside of our control relies on this. But it
does mean there's a bunch of kind of needless `unsafe`s in the code, and some of
them are possibly actually doing something bad.

The proper fix is to switch to a bag of bytes type, like
https://crates.io/crates/bstr. I attempted this initially with my own custom
string type but ran into trouble making it compatible with hash tables.
Some better fixes are:

- Maybe switch to using a bag of bytes type, like https://crates.io/crates/bstr
? But it is pretty invasive. We would need to use that not only for paths but
also console output, error messages, etc.
- Another possible fix is to require input files to be UTF-8, though I think I'd
want to better understand the `/showIncludes` situation above. Possibly we
could mandate "input files are UTF-8, and if you need something other than
UTF-8 in your `msvc_deps_prefix` it's on you to escape the exact byte sequence
you desire".

Handling Windows properly is kind of exhausting; see also the bugs about long
files names linked from the above.

## Profiling

### gperftools

I played with a few profilers, but I think the gperftools profiler turned out
to be significantly better than the others. To install:
I played with a few profilers, but I think the gperftools profiler turned out to
be significantly better than the others. To install:

```
$ apt install libgoogle-perftools-dev
Expand All @@ -40,8 +76,8 @@ $ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libprofiler.so CPUPROFILE=p ./target/rele
$ pprof -http=:8080 ./target/release/n2 p
```

The web server it brings up shows an interactive graph, top functions,
annotated code, disassembly...
The web server it brings up shows an interactive graph, top functions, annotated
code, disassembly...

### Other options

Expand Down
2 changes: 1 addition & 1 deletion doc/missing.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
# Missing features from Ninja

- Windows is only partially implemented.
- `deps = msvc` (parsing of `/showincludes` output) isn't implemented at all,
- `deps = msvc` (parsing of `/showIncludes` output) isn't implemented at all,
which means n2 currently gets header dependencies wrong when you use the
MSVC compiler.
- Dynamic dependencies.
Expand Down
41 changes: 35 additions & 6 deletions src/parse.rs
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,7 @@ pub struct Parser<'text> {
pub vars: Vars<'text>,
/// Reading paths is very hot when parsing, so we always read into this buffer
/// and then immediately pass in to Loader::path() to canonicalize it in-place.
path_buf: String,
path_buf: Vec<u8>,
}

fn is_ident_char(c: u8) -> bool {
Expand All @@ -60,8 +60,16 @@ fn is_path_char(c: u8) -> bool {
!matches!(c as char, '\0' | ' ' | '\n' | '\r' | ':' | '|' | '$')
}

/// Loader maps path strings (as found in build.ninja files) into an arbitrary
/// "Path" type. This allows us to canonicalize and convert path strings to
/// more efficient integer identifiers while we parse, rather than needing to
/// buffer up many intermediate strings; in fact, parsing uses a single buffer
/// for all of these.
pub trait Loader {
type Path;
/// Convert a path string to a Self::Path type. Note there are safety
/// related restrictions on what this function may do; see notes at the call
/// site.
fn path(&mut self, path: &mut String) -> Self::Path;
}

Expand All @@ -70,7 +78,7 @@ impl<'text> Parser<'text> {
Parser {
scanner: Scanner::new(buf),
vars: Vars::new(),
path_buf: String::with_capacity(64),
path_buf: Vec::with_capacity(64),
}
}

Expand Down Expand Up @@ -313,7 +321,7 @@ impl<'text> Parser<'text> {
loop {
let c = self.scanner.read();
if is_path_char(c as u8) {
self.path_buf.push(c);
self.path_buf.push(c as u8);
} else {
match c {
'\0' => {
Expand All @@ -323,10 +331,10 @@ impl<'text> Parser<'text> {
'$' => {
let part = self.read_escape()?;
match part {
EvalPart::Literal(l) => self.path_buf.push_str(l),
EvalPart::Literal(l) => self.path_buf.extend_from_slice(l.as_bytes()),
EvalPart::VarRef(v) => {
if let Some(v) = self.vars.get(v) {
self.path_buf.push_str(v);
self.path_buf.extend_from_slice(v.as_bytes());
}
}
}
Expand All @@ -347,7 +355,28 @@ impl<'text> Parser<'text> {
if self.path_buf.is_empty() {
return Ok(None);
}
Ok(Some(loader.path(&mut self.path_buf)))
// Performance: we want to pass self.path_buf directly to loader to
// have it canonicalize the path in-place, without allocating any
// additional buffers. This is some of the hottest code in n2 so
// we cut some corners to achieve this.
// Safety: see discussion of unicode safety in doc/development.md.
// I looked into switching this to BStr but it would require changing
// a lot of other code to BStr too.
// Safety: this assumes loader.path will never attempt to grow the
// passed-in string (causing a reallocation), and instead only will
// monkey with the contents within the passed-in buffer. We also know
// that this buffer will not be used immediately after loader.path() is
// called so it's fine for loader.path to scribble on it.
let mut path_str = unsafe {
String::from_raw_parts(
self.path_buf.as_mut_ptr(),
self.path_buf.len(),
self.path_buf.capacity(),
)
};
let path = loader.path(&mut path_str);
std::mem::forget(path_str); // path_buf owns it.
Ok(Some(path))
}

fn read_escape(&mut self) -> ParseResult<EvalPart<&'text str>> {
Expand Down
27 changes: 27 additions & 0 deletions tests/basic.rs
Original file line number Diff line number Diff line change
Expand Up @@ -230,3 +230,30 @@ fn repeated_out() -> anyhow::Result<()> {

Ok(())
}

/// Regression test for https://github.com/evmar/n2/issues/55
/// UTF-8 filename.
#[cfg(unix)]
#[test]
fn utf8_filename() -> anyhow::Result<()> {
let space = TestSpace::new()?;
space.write(
"build.ninja",
&[
"
rule echo
description = unicode variable: $in
command = echo unicode command line: $in && touch $out
",
"build out: echo reykjavík.md",
"",
]
.join("\n"),
)?;
space.write("reykjavík.md", "")?;
let out = space.run_expect(&mut n2_command(vec!["out"]))?;
assert_output_contains(&out, "unicode variable: reykjavík.md");
assert_output_contains(&out, "unicode command line: reykjavík.md");

Ok(())
}

0 comments on commit 03de3da

Please sign in to comment.