From:  Steve Fink <sfink@mozilla.com>
Date:  25 Jan 2017 13:47:07 Hong Kong Time
Newsgroup:  news.mozilla.org/mozilla.dev.tech.js-engine.internals
Subject:  Report on Google Suites work week in Taipei
NNTP-Posting-Host:  63.245.214.181

Hello #jsapi.

I wanted to report on some of the JS-centric results of the Google 
Suites work week in Taipei, and hopefully get more eyes on these problems.

First, some background: Sean and I met up with about two dozen other 
people involved in the Google Suites performance improvement effort. We 
had the Hasal people from Taipei, a bunch of project manager type folks, 
and a scattering of developers from across Gecko. The GSuites effort is 
considered extremely high priority, for good reasons that I won't go 
into here. In Taipei, we looked at lots of profiles, worked on internal 
tools, argued over what to test and how to measure, and ate stinky tofu.

High level findings:

  - we have a cluster of test cases that run around 20% slower than 
Chrome, with some quite a bit worse than that (think 40-60%, with 
outliers at 2x)

  - most of the existing test cases appear to be JS bound

  - some other areas showed up here and there, eg layout and GC

  - most of our current test cases boil down to document load times

  - we don't have much visibility into what actual users are hitting

  - we still don't have a great way to say exactly what parts of an 
application are slower in Firefox than in Chrome

  - results derived only from the Gecko Profiler / Cleopatra are not 
always trustworthy or useful

  - the more profilers, the better; we seem to find slightly different 
things in each

Details:

[most tests measure load time]

We have test cases that do things like load up a Google Slides document, 
go to the end, and add a slide. Almost all of the time (4-6 sec on 
typical machines) is in loading the document. This is true for the 
majority of the test cases we are looking at. We are not yet splitting 
out the time for each operation (to the extent that that is even 
possible; GC, for example, will leak into later operations, and cached 
results "leak" time into earlier operations because the initial 
operation will be expensive even though the amortized cost may be low). 
This is problematic because actual users' complaints are probably more 
about interacting with these apps than about the initial load.

There are exceptions. In bug 1269695, the Hasal test is heavily 
dependent on document load, but when I profile it I do not time the 
initial load. Once it is loaded and the spinner has stopped, I time how 
long it takes to Ctrl-End to the end of the document. Bug 1330539 is 
looking at I-cache misses when scrolling through a document. Bug 1330252 
is from piling up key events when scrolling through slides via the keyboard.

[actual user problems]

Our Hasal test cases indeed show slowdowns relative to Chrome, but we 
don't know whether the slowdowns for those test cases are related to the 
actual issues people are running into and complaining about. We are 
sending out surveys to try to figure this out, and adding additional 
telemetry to infer it from the field. The most promising pointers I've 
come across are multiple users sharing a document, and gradual perf 
degradation over time. Neither scenario is in our automated tests.

[what is slower in Fx vs Chrome]

In general with these pages (apps), we only know that it takes some 
number like 38% longer in Firefox than in Chrome. We still do not have a 
good way of breaking that down into eg "JS runs for 750ms more, layout 
is a little faster, GC is about the same". We would like to get to that 
level; we have a way to categorize time, and so does Chrome, but both 
have issues and are not directly comparable. Going further, we would 
like to be able to say that a specific script took X ms longer in Fx 
than in Chrome, but we don't have tooling for that either. We discussed 
using a proxy to instrument the JS for this purpose, and I may work on 
that. We can also look at other aggregate metrics for the whole load, eg 
cycles per instruction or cache miss rates for the whole test, but 
that'll never be more than a clue as to what specifically to work on.
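
The proxy idea is still just hand-waving; a minimal sketch of what I 
have in mind (plain HTTP only, invented mark names, none of this exists 
yet) would be something like:

    // instrument-proxy.js -- sketch of a rewriting proxy that wraps every
    // JavaScript response in performance.mark/measure calls keyed by URL,
    // so per-script top-level execution time shows up under the same name
    // in both Firefox's and Chrome's timelines.  Plain HTTP only; the real
    // Google apps are HTTPS and would need CONNECT handling plus a MITM
    // certificate.  Prepending code also defeats a top-level "use strict".
    const http = require("http");

    http.createServer((clientReq, clientRes) => {
      // Ask the origin for an uncompressed response so we can rewrite it.
      const headers = Object.assign({}, clientReq.headers);
      delete headers["accept-encoding"];

      const upstream = http.request(
        clientReq.url, // absolute URL, since the browser talks to us as a proxy
        { method: clientReq.method, headers },
        (res) => {
          const type = res.headers["content-type"] || "";
          if (!/javascript/.test(type)) {
            // Pass everything that isn't JS straight through.
            clientRes.writeHead(res.statusCode, res.headers);
            res.pipe(clientRes);
            return;
          }
          // Buffer the script and bracket it with timing calls.
          let body = "";
          res.setEncoding("utf8");
          res.on("data", (chunk) => { body += chunk; });
          res.on("end", () => {
            const name = JSON.stringify(clientReq.url);
            const wrapped =
              `performance.mark("start:" + ${name});\n` +
              body +
              `\n;performance.measure(${name}, "start:" + ${name});\n`;
            const outHeaders = Object.assign({}, res.headers, {
              "content-length": Buffer.byteLength(wrapped),
            });
            delete outHeaders["transfer-encoding"];
            clientRes.writeHead(res.statusCode, outHeaders);
            clientRes.end(wrapped);
          });
        }
      );
      clientReq.pipe(upstream);
    }).listen(8080, () => console.log("instrumenting proxy on localhost:8080"));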

[Unreliability/features of profiler]

We identified a number of ways in which the Gecko Profiler can mislead 
or omit useful information:
  - symbolication is unreliable and racy
  - symbolication is buggy (meta bug 1307215 for this and the above)
  - we cannot walk the native stacks of many of the builds in users' hands
  - the sampling overhead is dependent on what is being sampled, which 
biases the results (eg bug 1332489, bug 1330532)
  - similarly, some activities during profiling take way too long (bug 
1330576 and dependencies)
  - bailouts seem to be annotated with script names that do not match 
script names in the samples (bug 1329921)
  - the profiler associates some things, like scripts, using non-unique 
keys that conflate multiple things (eg bug 1329924)

Also, native profilers have various features missing from 
Cleopatra/Gecko Profiler that have revealed differing results. (And note 
that I'm not much of an expert here; I don't know enough about the 
various profilers on the various platforms, nor exactly what turned out 
to be useful for the things we looked at.)
  - perf counter data, eg cycles per retired instruction
  - data for the kernel or other processes
  - butterfly view, a la Zoom on linux (or gprof, for that matter -- 
it's where you see all immediate callers and all immediate callees of a 
given function, rather than a tree view based on one or the other)
  - charge-to-caller (sometimes profiles get obscured by having little 
pieces of time whittled away into leaf functions like memset or 
whatever. It is handy to be able to merge this time into the caller. At 
least, I *think* this is what I've heard a few people request.)
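
To be concrete about that last item, here is a toy version (the sample 
format is made up, not any real profiler's) of charging leaf time to 
the caller:

    // Toy "charge to caller": re-attribute samples whose leaf frame is an
    // uninteresting helper (memset, memcpy, ...) to whatever called it, so
    // the callers stop being obscured.  Sample format invented for the demo.
    const CHARGE_TO_CALLER = new Set(["memset", "memcpy"]);

    function chargeToCaller(samples) {
      return samples.map((stack) => {
        // Each stack is outermost-to-innermost, e.g. ["main", "Paint", "memset"].
        while (stack.length > 1 && CHARGE_TO_CALLER.has(stack[stack.length - 1])) {
          stack = stack.slice(0, -1); // drop the leaf, charging its time upward
        }
        return stack;
      });
    }

    // Three 1ms samples: after the transform, the 2ms that showed up as
    // memset self time is attributed to Paint instead.
    console.log(chargeToCaller([
      ["main", "Paint", "memset"],
      ["main", "Paint", "memset"],
      ["main", "Layout"],
    ]));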

The other profilers people used, or that I've heard recent mention of 
people using, include:
  - perf
  - Zoom (perf frontend)
  - Instruments
  - VTune
  - Windows Concurrency Visualizer (I don't think anyone has used this 
seriously yet)

Oh, and for looking into Chrome performance (to compare with Firefox), I 
am a little suspicious of their categorizations of time because you only 
get those with the Timeline tool, which has an overhead of something 
like 5-7x on the test case I was looking at. (So a 6 second document 
load took longer than 30 seconds.) *Maybe* they're able to adjust for 
the tracing overhead.

[JS performance]

The general impression is that we are spending more time executing JS 
code than Chrome is. For the most part, this is not determined directly, 
but rather we look at tests where we are substantially slower than 
Chrome, then examine a profile to see where most of our time is going. 
Sometimes I'll look deeper at the percentage of time broken down by 
various categories (script execution, GC, layout, etc.), to compare the 
percentage of our time spent running scripts vs Chrome's, but see above 
for why I'm not willing to fully trust Chrome's percentage breakdowns. I 
have also used our and Chrome's devtools to narrow in on specific 
executions of scripts (just by eyeballing the basic pattern and 
correlating the timelines), and from that I have seen instances where we 
spend much more time executing a script (and everything under it) than 
Chrome does.

We've really only come up with two general high-level explanations for 
why we might be slower.

(1) We are getting bad IPC counts for at least one test. IIRC we're at 
something like 10, and Chrome is around 2. Except the latest info says 
that that's mostly due to interference from graphics code.

(2) Google Suites code is the output of the Closure Compiler, which produces 
functions with poor type locality, if you'll forgive me for coining a 
term. (As in, lots of different code with lots of different types flows 
into the leaf functions, so compiled code for leaves ends up 
megamorphic, but wouldn't be if we magically inlined everything.)
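
A contrived example of what I mean by that (this is not actual Closure 
Compiler output, just an illustration):

    // Many unrelated object shapes funnel into one shared leaf helper, so
    // the compiled code for `get` goes megamorphic, even though each call
    // site only ever sees a single shape and would stay monomorphic if the
    // helper were inlined into its callers.
    function get(obj, key) {
      return obj[key]; // sees {text}, {cells}, {rows}, ... shapes here
    }

    function renderCell(cell)   { return get(cell,  "text");  } // always {text: ...}
    function renderRow(row)     { return get(row,   "cells"); } // always {cells: [...]}
    function renderSheet(sheet) { return get(sheet, "rows");  } // always {rows: [...]}

    // Each call site is shape-stable, but `get` itself sees all three shapes.
    renderCell({ text: "hi" });
    renderRow({ cells: [] });
    renderSheet({ rows: [] });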

I don't know how to prove or disprove either of these hypotheses. And 
even if they do end up explaining a lot, I'm guessing there will be a 
lot of smaller reasons that'll add up as well.

Typical examples:

https://clptr.io/2jqJ0BD is a non-e10s run for the document from bug 
1269695. It has 50% of the total time in stuff related to script. 19% is 
running ion code, 14% running baseline code, which is actually pretty 
good for the things I have looked at. (Contrast with an Octane run 
https://clptr.io/2jqRUPk where 40% of the total time is in ion, 16% in 
baseline, 18% in GC!)

Bug 1326346 is a gslides example that ehsan has analyzed, finding some 
reflow but mostly script.

https://clptr.io/2jqGWJX shows 31% of the time spent in script: 17% 
executing baseline, 5% executing ion, and to my eye, not very many total 
bailouts. (This is from loading 
https://docs.google.com/presentation/d/10QdoQTau98IeEp862ChQ7vG0eR3_mJZABsuduXGK1VE/edit#slide=id.g115bc93893_11_5 
).

What the JS team could provide:

- Understand what various patterns mean. What are tell-tale signs that 
we are doing something dumb/fixable? And which can we detect 
automatically? nbp has some ideas in bug 1184569. Additional 
perspectives would be helpful.

- In general, a better way to triage and diagnose performance problems 
in single page apps, a way that does not require having an analogous 
shell test case.

- Revive VTune integration (sstangl is working on it)

- Push column numbers through to the profiler to be able to identify 
specific scripts in minified code (bug 785922?)

- Annotate more things with markers, eg bug 1329923, so we can get some 
of the advantages of tracing with our sampling. (eg generate a timeline 
of samples for a given script, with all compiles/bailouts/collections 
annotated in the timeline)

- Produce bailout markers in a structured format (bug 1329921)

- A good description of the overall flow of the engine, as it relates to 
performance. Every time we go over this with someone new, the same ideas 
and suggestions arise.
   - compile sooner
   - compile later
   - specialize the warmup thresholds to specific URLs (or a similar key)
   - compile while idle and keep multiple differently-specialized copies
   - cache compiled code
   - cache type information
   - suppress GC
   - GC when idle
   - eliminate all or most bailouts

I can come up with reasons against every one of these, but I also feel 
like there's a kernel of a good idea in each. It would be nice to have 
ammunition in the form of a clear description of how the engine works 
*in practice*, and preferably specific numbers we used to tune various 
things. This isn't for shooting down these ideas, but rather for making 
them more nuanced and increasing the chance of someone coming up with 
something that *would* help, without explaining the "why nots" over and 
over again.

Some of these things are doable, would help some, but just aren't worth 
the cost. Like say we saved initial type information + steady state type 
information separately, and unioned each with what we observe at 
runtime. So the first compile would take into account the first batch of 
type info, the second would switch to the second batch. Lots of work, 
more memory, more complexity -- could it be worth those costs? Seems 
unlikely to me, but I'm no expert.

I would imagine the engine description looking something like "x% of 
code never even runs, so we don't want to waste time doing anything 
eager with it. We have to at least syntax parse before we can fetch all 
of the code. x% of code only runs once, and the interpreter will execute 
it faster than we could baseline compile & run it. Baseline is typically 
k x faster than the interpreter, but has to generate ICs before it gets 
up to that speed. It also has to see accurate enough type info before 
ion compiling, so that...." or whatever. Including subtleties such as 
startup type info vs steady state type info. Essentially, "these are the 
patterns of code the engine expects to see, and this is how it makes 
them go fast."
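
To make the same point in code form, here is a deliberately toy model 
of the warmup-driven tiering sketched above (the thresholds and helpers 
are invented for illustration and are not SpiderMonkey's actual values 
or structure):

    // Toy model of warmup-driven tiering; all numbers and helpers are made up.
    const BASELINE_WARMUP = 10;   // hypothetical: interpret this many runs first
    const ION_WARMUP = 1000;      // hypothetical: runs needed before trying ion

    function makeScript(name) {
      return { name, warmUpCount: 0, baselineCode: null, ionCode: null };
    }

    // Stubs standing in for the real compilers and type-stability heuristics.
    function baselineCompile(script) { return `baseline(${script.name})`; }
    function ionCompile(script) { return `ion(${script.name})`; }
    function typeInfoLooksStable(script) { return script.warmUpCount >= ION_WARMUP; }

    function run(script) {
      script.warmUpCount++;
      if (!script.baselineCode && script.warmUpCount >= BASELINE_WARMUP) {
        // Cheap compile; from here on ICs start gathering type information.
        script.baselineCode = baselineCompile(script);
      }
      if (!script.ionCode && typeInfoLooksStable(script)) {
        // Expensive, speculative compile; wrong guesses later mean bailouts.
        script.ionCode = ionCompile(script);
      }
    }

    // Code that runs once stays in the interpreter; hot code tiers up.
    const coldScript = makeScript("runsOnce");
    run(coldScript);                                 // stays interpreted
    const hotScript = makeScript("hotLoopBody");
    for (let i = 0; i < 2000; i++) run(hotScript);   // baseline, then ion
    console.log(coldScript, hotScript);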

I have a meta bug, bug 1299643, where I try to list the Google Suites 
cases that seem to be JS bound. The latest round of Hasal results is at 
; 
note that I haven't gone through them to sort out the latest set of 
significantly slower, JS-bound cases, because they're using the old 
profiler and I don't have any good way of categorizing the time there.