intro

intro2

Slides at https://github.com/rbtcollins/reveal.js

Robert Collins ; He/him

[email protected]

@rbtcollins (Twitter)

@robert.collins (Slack)

Rustup working group member Note: Thank you for coming to listen to my tale today. I hope you will enjoy it. We're going to talk about a 10x performance improvement in rustup on Windows, well, mostly on Windows.

On a road trip, far from my chunky workstation

Note: I was using my Surface, on horrid motel WiFi and worse 4G. Rust installed very, very slowly. What was up with that?

Bug 1540 Note: Ah good, there's a bug report

Bug 1540 av Note: Correctly blaming anti-virus! Problem Solved! Why is the bug still open?

Bug 1540 notav Note: Oh. It's still slow even if you turn off Defender :(.

..techsplaining mostly elided.. Bug 1540 techsplain

What is common between these things?

    rustup (Windows) 3m43s
    7-zip (Windows) 2m35s
    FastCopy (Windows) 2m53s
    tar (Linux) <1.5s

Note: None of them are able to make NTFS perform well. Or, for that matter, NFS, GlusterFS, CephFS and a whole range of other file systems. Ext4 on an SSD, great.

The bug had concluded the problem was not solvable within Rustup. And then I "helped".

  • "NTFS is what it is and can't be fixed"
  • Lets not ship docs (but users want them on flights)
  • Lets ship them in a zip or .ISO
  • Lets only unpack them when the user asks for it

Bug 904 Note: Fortunately, I ran across this broader bug for all file systems.

Around about here I submitted my first patch - let's call this v1

Date:   Tue Apr 9 12:53:15 2019 +1200

    Less copying during dist installation

    Per #904 copying the contents of dists out of the extracted staging
    directory and then later deleting that same staging directory consumes
    60s out of a total of 200s on Windows.

    Wall clock testing shows this patch reduces
    `rustup toolchain install nightly` from 3m45 to 2m23 for me - including
    download times etc.

    I'm sure there is more that can be done, thus I'm not marking this as
    a closing merge for #904.

Note: 1 minute on the total runtime isn't too bad: ~30%.

Note: The Rust project I had been working on is long forgotten at this point. I'm just sitting there installing and uninstalling Rust, looking at Process Monitor (a Sysinternals tool, now from Microsoft - think a GUI strace) traces. rustup isn't super easy to do perf work on: the different stages it goes through are not well isolated, and I was doing hacky things to avoid downloading Rust from the internet each time (copying files while they were in use, and arrgh).

v2: terminfo

v3: parent dir stat caching: 1-2 seconds

v4: larger untar read buffer: 0-12 seconds

v5: larger hash checking read buffer

Note: My v2 patch fixed a bug where terminfo was opened way too many times. Global state needs to be managed as global state; rustup isn't /good rust/ in this regard - not a reflection on Rust itself.

My v3 patch improved tar unpacking so that the parent directory of a path was only statted once per directory name, rather than once per child of that directory. This saved 20k stats.

Sorry about the wall of text! I've skipped half the commit as-is. Another 12 seconds - are you keeping track? Rust has a very small built-in buffer size (8K), which is never going to be optimal for disk or network I/O anywhere. I looked at submitting a patch, but because this ends up in in-stack buffers in various sys functions, it was not going to go anywhere.
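To make the buffer-size point concrete, here is a minimal sketch of the v4-style change (not rustup's actual code; the 16 MiB capacity is illustrative): wrap the tarball reader in a much larger buffer before handing it to tar-rs.

```rust
use std::fs::File;
use std::io::BufReader;
use tar::Archive; // the tar-rs crate

fn unpack_buffered(tarball: &str, dest: &str) -> std::io::Result<()> {
    let file = File::open(tarball)?;
    // BufReader defaults to an 8 KiB buffer; a much larger one cuts the
    // number of read syscalls dramatically on slow disks and filesystems.
    let reader = BufReader::with_capacity(16 * 1024 * 1024, file);
    Archive::new(reader).unpack(dest)
}
```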

Bug 1540 detail Note: Remember that very first comment

Bug 1540 detail Note:

  1. stat
  2. create, write data
  3. set mtime and atime
  4. stat
  5. set permissions

And then there is another later walk over the same file. And a complete copy of everything.

v6: 1 line, 4-6 seconds

We don't care about mtime. Turning it off saves an open, a read, a write, and a close per file. Note: NTFS stores the mtime in the Master File Table - the NT kernel could offer a syscall to set file times without opening the file, but doesn't. That probably keeps the security surface cleaner. 4 kernel-mode transitions * 20K files = 4 to 6 seconds. Docs is 20k files; src is 7k? The toolchain is many fewer files, but larger ones. Also note that if these lower layers had benchmark suites using e.g. dm-delay, it's likely the death-by-a-thousand-cuts effect would have been mitigated. One thing experienced folk will tend to tell you is that it really matters what is happening under the hood - and you may be getting a sense that that is what's involved here.
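In tar-rs terms, the one-liner plausibly looks something like this (a sketch using the set_preserve_mtime knob, not necessarily rustup's exact change):

```rust
use tar::Archive;

fn unpack_without_mtime<R: std::io::Read>(reader: R, dest: &str) -> std::io::Result<()> {
    let mut archive = Archive::new(reader);
    // Skip setting mtime on each extracted file: on NTFS that is an extra
    // open/set/close round trip for every one of the ~20k entries.
    archive.set_preserve_mtime(false);
    archive.unpack(dest)
}
```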

fn chmod(fname: &Path, mode: u32) -> Result<()> {
    let file = open(fname)?;  // open the file just to get a handle
    fchmod(file, mode)?;      // set the mode via the handle
    close(file)               // and close it again: three syscalls for one chmod
}

Note: To make it more concrete than me just talking to it, this is what an emulated chmod inside something like Cygwin or WSL ends up looking like.

Here I ran off for a bit and worked on tar-rs

  • Time-of-check/time-of-use bugs (O_EXCL etc.)
  • Do the mode set while the file is open (fchmod vs chmod)
  • Set mtime and atime on file handles
  • Optimise out read-only mode setting on Windows

Note: All of these things are better for Linux and ext4 as well. But Linux and ext4 are remarkably tolerant of things being done less than ideally.
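As a sketch of the handle-based timestamp setting (using the filetime crate, which tar-rs builds on; the helper name is mine):

```rust
use filetime::{set_file_handle_times, FileTime};
use std::fs::File;

// Set atime/mtime on a file we still have open, instead of re-opening it
// (or doing a fresh path lookup) just to touch the timestamps.
fn set_times(file: &File, unix_secs: i64) -> std::io::Result<()> {
    let t = FileTime::from_unix_time(unix_secs, 0);
    set_file_handle_times(file, Some(t), Some(t))
}
```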

v7 - call CloseHandle() from threads - 11s total

    Avoid blocking on CloseHandle

    On Windows closing a file involves CloseHandle, which can be quite
    slow (6+ms) for various reasons, including circumstances outside our
    control such as virus scanners, content indexing, device driver cache
    write-back synchronisation and so forth. Rather than having a super
    long list of all the things users need to do to optimise the performance
    of CloseHandle, use a small threadpool to avoid blocking package
    extraction when closing file handles.

    This does run a risk of resource exhaustion, but as we only have 20k
    files in the largest package today that should not be a problem in
    practice.

    https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-January/078404.html
    provided inspiration for this.

    My benchmark system went from 21/22s to 11s with this change with both
    4 or 8 threads.

Note: I specifically want to acknowledge the Mercurial developers, who had identified that close latency was a performance issue for hg - see the link in my commit message. Talk here about the IO completion model: on Windows, drivers don't own the pages - the app still owns them. Versus Linux, where the data is moved into the write-cache pathway and control returns immediately.
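A minimal single-worker sketch of the idea (rustup's real change uses a small threadpool, per the commit message above):

```rust
use std::fs::File;
use std::sync::mpsc;
use std::thread;

// Hand finished files off to another thread, so the extraction loop never
// blocks on CloseHandle (virus scans, indexing, cache write-back, etc.).
fn spawn_closer() -> mpsc::Sender<File> {
    let (tx, rx) = mpsc::channel::<File>();
    thread::spawn(move || {
        for file in rx {
            drop(file); // CloseHandle happens here, off the hot path
        }
    });
    tx
}
```

After writing each file, the extractor just sends the handle down the channel and moves on; dropping the Sender shuts the worker down.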

Note: whew. Time for a pause. We're down to 11 seconds now - is NTFS still in the picture? Is Defender? Not to mention McAfee/Norton/Kaspersky.

How did just eliminating a directory copy save a whole minute?

Is I/O on NTFS that bad? Note: mention burning temps on tablet back

rust-docs has ~20k files

each file was being scanned each time it was written Note: I don't know about you, but I can't read 20k documents in 11 seconds

Twitter-help

Twitter-help

Twitter-help Note: This provoked some jealousy. Scott watched an earlier version of this talk; he fields dozens of these queries daily. Why did I get access? My request was clear and well articulated. I had done my prep work, and that enabled both the introduction and the subsequent response I got from the Defender group.

Note: The discussion with Microsoft staff was excellent. They asked for traces via the feedback hub, shared how defender works with me and included me in discussions about the solution.

CloseHandle -> Scan -> Ok/Error Note: This is how Defender works by default, and I have traces with the CloseHandle call taking many tens of milliseconds. If we dispatch work to Defender - to the scan routine - faster than Defender can process it, it backs up.

Open -> Scan -> Ok/Error Note: This is what Defender has to do after a reboot. It is also an opt-in mode for individual processes at runtime - but not one the process itself can control.

Defender was updated within a couple of weeks Note: to recognise rustup and default to scan-on-first-use rather than scan-on-close for the files that rustup writes.

I suggested a userspace API to let all open source projects opt in... but that didn't go anywhere.

Think git/hg/bzr, etc.

v8 - make it slower when it fails

    Fix #1870: Race condition with some virus scanners

    Some virus scanners take out handles on files we are going to
    rename, causing access errors and preventing the rename. Typically
    this is short lived. Retry rather than erroring.

Note: McAfee
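The shape of the fix is a simple retry loop. A hedged sketch (the retry budget and delay are illustrative, not rustup's exact values):

```rust
use std::{fs, io, path::Path, thread, time::Duration};

// Retry a rename that a virus scanner may be briefly blocking by holding
// a handle on the source file. Real code would check the OS error code
// (e.g. ERROR_SHARING_VIOLATION on Windows) rather than retrying blindly.
fn rename_with_retry(from: &Path, to: &Path) -> io::Result<()> {
    let mut remaining = 20; // illustrative retry budget
    loop {
        match fs::rename(from, to) {
            Ok(()) => return Ok(()),
            Err(e) if remaining == 0 => return Err(e),
            Err(_) => {
                remaining -= 1;
                thread::sleep(Duration::from_millis(50));
            }
        }
    }
}
```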

Bug 1540 notwindows Note: From the big issue about Windows doc installs being slow. Perhaps it really isn't just Windows. Explain the KB/s install metric. Also Reddit.

Bug 1867 notlustre Note: So this bug was filed about Lustre being slow, and it turned out that NFS was the slow part. This leads to

v9 - close() in a thread Note: Narrator: this did not solve the problem

NFS

openat(): ~3ms
fchmod(): ~2ms 
write(): ~10ns
close(): ~10ns
chmod(): ~1ms

Note: 20k items, so 1ms == 20 seconds. That's 60 seconds in open(), 40 seconds in fchmod(), and 20 seconds in the final chmods = 120 seconds total. So, the things we need to do to make it fast for NFS filesystems are:

  • don't block on inode creation
  • don't apply chmod's twice
  • and do the final chmod threaded or in the initial IO

v10 - thread all file IO Note: A version of this logic is included in Rust 1.37 so that rustdoc (and cargo doc etc.) generates documentation in a threaded fashion, for exactly the same reason. The winapi crate's docs, for instance, really benefit.
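A sketch of the shape of threaded file IO (illustrative only - a real dispatcher also has to handle ordering, directories, and error propagation):

```rust
use std::path::PathBuf;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

struct WriteItem {
    path: PathBuf,
    data: Vec<u8>,
}

// Fan file writes out to n worker threads so that the multi-millisecond
// openat()/fchmod() latencies on NFS overlap instead of serialising.
fn spawn_writers(n: usize) -> mpsc::Sender<WriteItem> {
    let (tx, rx) = mpsc::channel::<WriteItem>();
    let rx = Arc::new(Mutex::new(rx));
    for _ in 0..n {
        let rx = Arc::clone(&rx);
        thread::spawn(move || loop {
            // Hold the lock only long enough to dequeue one item.
            let item = match rx.lock().unwrap().recv() {
                Ok(item) => item,
                Err(_) => break, // all senders dropped: shut down
            };
            let _ = std::fs::write(&item.path, &item.data);
        });
    }
    tx
}
```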

v11 - thread directory operations as well Note: This required cross-thread coordination because directories are effectively global state

v12 - set permissions during extraction Note: because duh. Otherwise we just blow the disk cache.
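In sketch form (Unix-flavoured, and the helper is mine, not rustup's API): set the permissions on the handle we already have open, rather than in a later pass over cold inodes.

```rust
use std::fs::{File, Permissions};
use std::io::Write;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

fn write_with_mode(path: &Path, data: &[u8], mode: u32) -> std::io::Result<()> {
    let mut file = File::create(path)?;
    file.write_all(data)?;
    // fchmod on the open handle: no extra path lookup, and the inode is
    // still hot in cache.
    file.set_permissions(Permissions::from_mode(mode))
}
```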

I also added a tracing system in there that outputs Catapult trace files - that's the Chromium in-browser profiler.
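For reference, Catapult consumes the Chrome Trace Event Format: a JSON array of event objects. A minimal sketch of emitting one complete ("X") duration event - not the actual rustup tracing code:

```rust
use std::fs::File;
use std::io::Write;

// Emit one "complete" event (ph = "X", timestamps in microseconds) in the
// Trace Event Format that Catapult / chrome://tracing reads. The viewer
// tolerates a trailing comma and an unterminated array, so events can be
// appended as they happen.
fn write_trace_event(out: &mut File, name: &str, ts_us: u64, dur_us: u64) -> std::io::Result<()> {
    writeln!(
        out,
        r#"{{"name":"{}","ph":"X","ts":{},"dur":{},"pid":1,"tid":1}},"#,
        name, ts_us, dur_us
    )
}
```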

profit Note: 120s to 15s

Syscalls are not free

Anywhere

Don't blame your serial code's serial performance on what you are running on top of (like NTFS)

Collaborate with the owners of components to get the best results

There are often many causes of a problem

Don't stop at the first cause

There's a whole other chapter on WSLv1, WSLv2 and more, but ETIME.

These slides: https://github.com/rbtcollins/reveal.js/tree/master/cognite-devcon-installing-rust.html

vim: sts=2 sw=2 et