plumb through utf8 paths in parser

This change makes UTF-8 (and other encodings) paths properly round-trip through the parser, including when showing paths in command descriptions. I expect Unicode in other places (like literal text in command descriptions) still won't work. I updated doc/development.md with a longer discussion of the Unicode situation in general. Fixes #55
evmar · Dec 21, 2022 · 03de3da · 03de3da
1 parent 693d2ef
commit 03de3da
Show file tree

Hide file tree

Showing 4 changed files with 109 additions and 17 deletions.
diff --git a/doc/development.md b/doc/development.md
@@ -8,24 +8,60 @@ $ ln -s ../../git-pre-commit .git/hooks/pre-commit
 
 ## Path handling and Unicode safety
 
-Currently we use Rust `String` for all paths and file contents, but
-internally interpret them as as bytes (not UTF8) including using "unsafe"
+Ninja
+[was intentionally "encoding agnostic"](https://ninja-build.org/manual.html#ref_lexer),
+which is to say it treated input build files as any byte stream that is ASCII
+compatible. In other words, any string of bytes found in a `build.ninja` is
+passed verbatim through printing to stdout and to the OS for path operations,
+which meant Ninja was compatible with both UTF-8 and other encodings. The intent
+is that those other encodings occur on Linuxes, especially in East Asia, and
+also it means Ninja doesn't need any specific knowledge of Unicode.
+
+It looks like since my time,
+[Ninja on Windows changed its input files to require UTF-8](https://github.com/ninja-build/ninja/pull/1915).
+As mentioned there, this was actually a breaking change, and it looks like there
+was a decent amount of fallout from the change.
+
+Windows is particularly fiddly in this area because Ninja needs to parse the
+`/showIncludes` output from the MSVC compiler, which is localized. See the
+`msvc_deps_prefix` variable in the Ninja docs; there have been lots of bug
+reports over the years from people with Chinese output that is failing to parse
+right due to Windows code page mess.
+
+In any case, n2 doesn't support any of this for now, and instead just follows
+Ninja in treating paths as bytes. (n2 doesn't support `/showIncludes` or MSVC at
+all yet.)
+
+Further, we currently use Rust `String` for all paths and file contents, but
+internally interpret them as as bytes (not UTF-8) including using `unsafe`
 sometimes to convert.
 
 Based on my superficial understanding of how safety relates to UTF8 in Rust
 strings, it's probably harmless given that we never treat strings as Unicode,
-but it's also possible some code outside of our control relies on this.
+but it's also possible some code outside of our control relies on this. But it
+does mean there's a bunch of kind of needless `unsafe`s in the code, and some of
+them are possibly actually doing something bad.
 
-The proper fix is to switch to a bag of bytes type, like
-https://crates.io/crates/bstr.  I attempted this initially with my own custom
-string type but ran into trouble making it compatible with hash tables.
+Some better fixes are:
+
+- Maybe switch to using a bag of bytes type, like https://crates.io/crates/bstr
+  ? But it is pretty invasive. We would need to use that not only for paths but
+  also console output, error messages, etc.
+- Another possible fix is to require input files to be UTF-8, though I think I'd
+  want to better understand the `/showIncludes` situation above. Possibly we
+  could mandate "input files are UTF-8, and if you need something other than
+  UTF-8 in your `msvc_deps_prefix` it's on you to escape the exact byte sequence
+  you desire".
+
+Handling Windows properly is kind of exhausting; see also the bugs about long
+files names linked from the above.
 
 ## Profiling
 
 ### gperftools
 
-I played with a few profilers, but I think the gperftools profiler turned out
-to be significantly better than the others.  To install:
+I played with a few profilers, but I think the gperftools profiler turned out to
+be significantly better than the others. To install:
 
 ```
 $ apt install libgoogle-perftools-dev
@@ -40,8 +76,8 @@ $ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libprofiler.so CPUPROFILE=p ./target/rele
 $ pprof -http=:8080 ./target/release/n2 p
 ```
 
-The web server it brings up shows an interactive graph, top functions,
-annotated code, disassembly...
+The web server it brings up shows an interactive graph, top functions, annotated
+code, disassembly...
 
 ### Other options
 

diff --git a/doc/missing.md b/doc/missing.md
@@ -1,7 +1,7 @@
 # Missing features from Ninja
 
 - Windows is only partially implemented.
-  - `deps = msvc` (parsing of `/showincludes` output) isn't implemented at all,
+  - `deps = msvc` (parsing of `/showIncludes` output) isn't implemented at all,
     which means n2 currently gets header dependencies wrong when you use the
     MSVC compiler.
 - Dynamic dependencies.

diff --git a/src/parse.rs b/src/parse.rs
@@ -45,7 +45,7 @@ pub struct Parser<'text> {
     pub vars: Vars<'text>,
     /// Reading paths is very hot when parsing, so we always read into this buffer
     /// and then immediately pass in to Loader::path() to canonicalize it in-place.
-    path_buf: String,
+    path_buf: Vec<u8>,
 }
 
 fn is_ident_char(c: u8) -> bool {
@@ -60,8 +60,16 @@ fn is_path_char(c: u8) -> bool {
     !matches!(c as char, '\0' | ' ' | '\n' | '\r' | ':' | '|' | '$')
 }
 
+/// Loader maps path strings (as found in build.ninja files) into an arbitrary
+/// "Path" type.  This allows us to canonicalize and convert path strings to
+/// more efficient integer identifiers while we parse, rather than needing to
+/// buffer up many intermediate strings; in fact, parsing uses a single buffer
+/// for all of these.
 pub trait Loader {
     type Path;
+    /// Convert a path string to a Self::Path type.  Note there are safety
+    /// related restrictions on what this function may do; see notes at the call
+    /// site.
     fn path(&mut self, path: &mut String) -> Self::Path;
 }
 
@@ -70,7 +78,7 @@ impl<'text> Parser<'text> {
         Parser {
             scanner: Scanner::new(buf),
             vars: Vars::new(),
-            path_buf: String::with_capacity(64),
+            path_buf: Vec::with_capacity(64),
         }
     }
 
@@ -313,7 +321,7 @@ impl<'text> Parser<'text> {
         loop {
             let c = self.scanner.read();
             if is_path_char(c as u8) {
-                self.path_buf.push(c);
+                self.path_buf.push(c as u8);
             } else {
                 match c {
                     '\0' => {
@@ -323,10 +331,10 @@ impl<'text> Parser<'text> {
                     '$' => {
                         let part = self.read_escape()?;
                         match part {
-                            EvalPart::Literal(l) => self.path_buf.push_str(l),
+                            EvalPart::Literal(l) => self.path_buf.extend_from_slice(l.as_bytes()),
                             EvalPart::VarRef(v) => {
                                 if let Some(v) = self.vars.get(v) {
-                                    self.path_buf.push_str(v);
+                                    self.path_buf.extend_from_slice(v.as_bytes());
                                 }
                             }
                         }
@@ -347,7 +355,28 @@ impl<'text> Parser<'text> {
         if self.path_buf.is_empty() {
             return Ok(None);
         }
-        Ok(Some(loader.path(&mut self.path_buf)))
+        // Performance: we want to pass self.path_buf directly to loader to
+        // have it canonicalize the path in-place, without allocating any
+        // additional buffers.  This is some of the hottest code in n2 so
+        // we cut some corners to achieve this.
+        // Safety: see discussion of unicode safety in doc/development.md.
+        // I looked into switching this to BStr but it would require changing
+        // a lot of other code to BStr too.
+        // Safety: this assumes loader.path will never attempt to grow the
+        // passed-in string (causing a reallocation), and instead only will
+        // monkey with the contents within the passed-in buffer.  We also know
+        // that this buffer will not be used immediately after loader.path() is
+        // called so it's fine for loader.path to scribble on it.
+        let mut path_str = unsafe {
+            String::from_raw_parts(
+                self.path_buf.as_mut_ptr(),
+                self.path_buf.len(),
+                self.path_buf.capacity(),
+            )
+        };
+        let path = loader.path(&mut path_str);
+        std::mem::forget(path_str); // path_buf owns it.
+        Ok(Some(path))
     }
 
     fn read_escape(&mut self) -> ParseResult<EvalPart<&'text str>> {

diff --git a/tests/basic.rs b/tests/basic.rs
@@ -230,3 +230,30 @@ fn repeated_out() -> anyhow::Result<()> {
 
     Ok(())
 }
+
+/// Regression test for https://github.com/evmar/n2/issues/55
+/// UTF-8 filename.
+#[cfg(unix)]
+#[test]
+fn utf8_filename() -> anyhow::Result<()> {
+    let space = TestSpace::new()?;
+    space.write(
+        "build.ninja",
+        &[
+            "
+rule echo
+  description = unicode variable: $in
+  command = echo unicode command line: $in && touch $out
+",
+            "build out: echo reykjavík.md",
+            "",
+        ]
+        .join("\n"),
+    )?;
+    space.write("reykjavík.md", "")?;
+    let out = space.run_expect(&mut n2_command(vec!["out"]))?;
+    assert_output_contains(&out, "unicode variable: reykjavík.md");
+    assert_output_contains(&out, "unicode command line: reykjavík.md");
+
+    Ok(())
+}