Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AArch64] Adjust cache location for PC relative branches #7838

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 4 additions & 3 deletions hphp/runtime/base/runtime-option.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1225,18 +1225,19 @@ void RuntimeOption::Load(
using jit::CodeCache;
Config::Bind(CodeCache::AHotSize, ini, config, "Eval.JitAHotSize",
ahotDefault());
Config::Bind(CodeCache::ASize, ini, config, "Eval.JitASize", 60 << 20);
Config::Bind(CodeCache::ASize, ini, config, "Eval.JitASize",
(arch() == Arch::ARM) ? (50 << 20) : (60 << 20));

if (RuntimeOption::EvalJitPGO) {
Config::Bind(CodeCache::AProfSize, ini, config, "Eval.JitAProfSize",
64 << 20);
(arch() == Arch::ARM) ? (54 << 20) : (64 << 20));
} else {
// Avoid "Possible bad confg node" warning for unused keys.
config["Eval.JitAProfSize"].configGetUInt64();
CodeCache::AProfSize = 0;
}
Config::Bind(CodeCache::AColdSize, ini, config, "Eval.JitAColdSize",
24 << 20);
(arch() == Arch::ARM) ? (20 << 20) : (24 << 20));
Config::Bind(CodeCache::AFrozenSize, ini, config, "Eval.JitAFrozenSize",
40 << 20);
Config::Bind(CodeCache::GlobalDataSize, ini, config,
Expand Down
10 changes: 9 additions & 1 deletion hphp/runtime/vm/jit/code-cache.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -80,6 +80,7 @@ CodeCache::CodeCache()
m_totalSize = kAHotSize + kASize + kAColdSize + kAProfSize +
kAFrozenSize + kGDataSize + thread_local_size;
m_codeSize = m_totalSize - kGDataSize;
auto kGCodeExeSize = kAHotSize + kASize + kAProfSize + kAColdSize;

if ((kASize < (10 << 20)) ||
(kAColdSize < (4 << 20)) ||
Expand All @@ -97,7 +98,14 @@ CodeCache::CodeCache()
exit(1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mxw I tried to copy to functionality from here. Do you recommend printing the warning and not calling exit(1), or just removing this entire block?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That code quits if the code size is too small.

More generally, I don't think we should be altering these defaults on ARM, since your choice of platform isn't going to alter the reality of your codebase's size. We should be looking for ways to optimize in spite of the direct jump size restrictions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I can accept that.

I've found that instead of shrinking the sizes of the various caches to fit within 128MB, I can change the memory mapping to place ACold before AProf. This allows direct jumps from AHot to reach ACold, and keeps the significant gains for OSS-mediawiki (in conjunction with #7661). Would an ARM specific change to the cache's layout be acceptable?

Copy link
Contributor

@mxw mxw May 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoa, sorry @jim-saxman; I must have missed the notification (or maybe it got dropped) for this comment.

That seems fine to me, although if the user configures HHVM to have larger code sizes than the defaults (which I suspect we may do internally), then you'll still run into the performance degradation with far jumps. There might be reasons not do to it that I'm not aware of, but if there are, someone will raise them once we see the PR.

cc-ing @dave-estes, @cmuellner, @swalk-cavium also—On the subject of avoiding long jumps: I think the most important thing is for hot and main to be within near jump distance of one another. Jumps to cold and frozen are supposed to be rare, so the cost penalty should be relatively insignificant in practice—though, of course, the far jump is a much bigger instruction sequence, which does still have a passive impact on icache behavior.

One idea we bounced around internally when discussing this issue on ARM was to dynamically allocate blocks of far jump trampolines which were always within a near jump of hot and main. Then, in the code we emit, we can just always emit a near jump, and then proxy through the trampoline if necessary. The static relocator could probably handle this, and it would obviate the dependency on dynamic relocation (though smashable jumps would still require dynamic relocation to shorten them, I'd expect). (Are calls PC-relative and limited in immediate size, also? Even if main itself grew too big, we could partition main into chunks with trampoline areas in between, and the same solution would work if we called across chunks in main.)

This is an area with substantial potential for perf improvement, so I'm interested in hearing other thoughts or ideas on the subject you all might have.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mxw - Hi Max, et.al, In my previous life we called those trampolines branch islands. The linker would create a little table between object files for the case you describe above. Is there an easy way to instrument this an capture some data? It sure seems like most of the jumps are nearby, but that could be an artifact of just looking at small-ish tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mxw Calls can be PC-relative with the BL instruction in which case the range is also +/-128MB, or through a Register using the BLR instructions.

We also have a few branch instructions with smaller immediates:

  • CBZ/CBNZ <Xt>, <label> Compare and branch if register Xt is zero/non-zero to label. They have a +/- 1MB range.

  • TBZ/TBNZ <Xt>, #<imm>, <label> Test bit numbered by imm in register Xt with and branch to label. They have a +/- 32KB range.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That code also exit()'s if code size is > 2GiB due to PC-relatives. Perhaps I marked the wrong exit(1)???

Anyway, I've also found that moving aHot around works well, but I'm afraid to do that since I don't know if __hot_start and __hot_end will be enabled, which could gain from having backward(to lower addresses) PC-relative branches from aHot into the __hot_start--__hot_end native functions.

}

auto enhugen = [&](void* base, int numMB) {
if (arch() == Arch::ARM && kGCodeExeSize > (128 << 20)) {
fprintf(stderr,"Combined size of AHotSize, ASize, AProfSize, and "
"AColdSize must be <= 128MB to support PC relative "
"branches \n");
exit(1);
}

auto enhugen = [&](void* base, int numMB) {
if (CodeCache::MapTCHuge) {
assert((uintptr_t(base) & (kRoundUp - 1)) == 0);
hintHugeDeleteData((char*)base, numMB << 20,
Expand Down