Quick Tips for Fast Code on the JVM (gist.github.com)
236 points by wskinner on Dec 31, 2017 | 76 comments



I'd recommend a process rather than a handful of implementation-specific rules of thumb that are, IMO, at risk of being applied too generally. I'd hate for someone to read this article and then go ahead and avoid 'new' wherever possible because they think that's how you make code fast, or avoid boxing and polymorphic call sites everywhere, etc.

My process for writing fast code on the JVM:

1) Measure. Set up a benchmark so you know whether you're on the happy path or not, or whether you've fallen off a performance cliff, for whatever reason. Make this part of your testing suite with some kind of notification for regressions.

2) Start small, ideally with a do-nothing loop over the input. This gives you a baseline; you can't go faster than a do-nothing loop (presuming you can't skip part of your input, which is an algorithm problem, not specific to JVM optimization). A minimal sketch of such a baseline follows step 3.

3) Incrementally build up your algorithm and pay attention to when it falls off the performance cliff, using your benchmark from (1). If and when you fall off the performance cliff, that's when you start bringing in tricks like avoiding new, ensuring call sites are monomorphic / bimorphic, avoiding boxing, reducing pointer indirections and other cache friendly code, etc.
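
To make that baseline concrete, here's a minimal sketch (illustrative only; in practice you'd wrap it in a harness like JMH, which comes up downthread):

  // Baseline: a do-nothing pass over the input. The real algorithm can
  // only be slower, so this anchors the benchmark. The checksum is a
  // cheap read that keeps the JIT from eliminating the loop entirely.
  static long baseline(int[] input) {
      long checksum = 0;
      for (int i = 0; i < input.length; i++) {
          checksum += input[i];
      }
      return checksum;
  }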

Another trick to consider is playing around with inlining, but not in the way you might think: try pushing infrequently executed code (conditionals) one level deeper in the call stack (i.e. making the body of an if-block into a method and calling it). The idea here is to encourage inlining of the method doing the calling. Inlining is where the JVM gets to specialize your code to the specific situation at hand, but the JVM is reluctant to inline big methods because it has a time budget. So you need to help it.
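
A sketch of that shape (hypothetical names, assuming the unexpected-byte path is rare):

  static final byte EXPECTED = 0x7F;

  // The rare branch body lives one level down, so this caller stays
  // small and remains an attractive inlining candidate.
  static int decode(byte[] buf, int pos) {
      if (buf[pos] != EXPECTED) {
          return handleUnexpected(buf, pos);  // cold path, rarely taken
      }
      return buf[pos + 1] & 0xFF;             // hot path
  }

  static int handleUnexpected(byte[] buf, int pos) {
      throw new IllegalStateException("unexpected byte at " + pos);
  }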


I'm a Java programmer who has tried and failed to do some optimizations this year following your general guidelines. I'm extremely happy to finally see a piece of writing that expands my mental model of what is actually happening on the JVM. If we just keep repeating the same general truisms about measurement and inlining, then nobody will ever advance beyond my own low level of understanding.


I'm not really a fan of your "best practices". Java often has unique problems where this generic advice breaks down (I do a lot of low latency and high performance work), and sometimes you need mechanical sympathy instead of relying on tools so much.

1- The author does tell you to understand your hot path. However, in Java it is often difficult to pin down GC-related issues, and just easier to prevent them in the first place. I've tried to go through code bases others have written to clean up allocation problems, and they become so embedded in the code that it is easier to just rewrite. Some issues, like finalizer/weak ref issues, are very difficult to find in your measurements.

2 and 3- I've never seen this work. This is a great way to get a bunch of faulty analysis or not even know you're going off track to begin with.

This article and many of the tips are written for when you are trying to be "as fast as possible" and where each microsecond counts. I've seen a lot of projects go down in flames where they treated performance as something secondary that can be written into the code in a second pass. When you are having to deal with issues like the article describes, you should take a more first-principles approach and keep your eye on the proper idioms in the first place.


sometimes you need mechanical sympathy instead of relying on tools so much

If you need mechanical sympathy throughout, you still need solid measurements, you still need to know when you fall off performance cliffs, but you should be experienced enough not to be misled by articles like this one. That last bit was my biggest concern. Because very very few Java apps have performance bottlenecks throughout. That's more typically seen in game engines, where the whole app is a hot loop, and they're not often written in Java.

As to whether it works or not: I've seen it work repeatedly for scanning and parsing. I don't know your situation and can't say why it hasn't worked for you. But again, measurement is the most important bit, and you still need to notice the cliff dropping.


This after-the-fact, measurement-heavy method of optimization often doesn't work in these systems, because it is more often death by a thousand paper cuts than a single line of code.

You won't ever notice yourself going off the track, and even after you're done you likely won't notice.

I do a lot of latency- and throughput-sensitive work, and far more often the bigger problem is losing a few tenths of a percent on each poor decision in your hot path. It becomes very very difficult to fix.


Not really disagreeing with your sentiment, but he laid it out nicely in the "Defining the Hot Path" section. That is honestly the most valuable lesson a newer programmer could take away from this gist.


Some bits and pieces that come to mind:

Make use of JVM and CPU performance counters. They're amazing for discovery and understanding of performance matters.

Quick googling found these:

https://zeroproductionincidents.wordpress.com/tag/jvm-perfor...

http://www.peternier.com/projects/overseer/overseer.php
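
On the HotSpot side, a few diagnostic flags plus Linux perf are a cheap way to start (flag names per HotSpot; MyApp is a placeholder):

  # What the JIT compiles, and what it decided to inline:
  java -XX:+PrintCompilation MyApp
  java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining MyApp

  # Hardware counters (cache misses, branch mispredicts) for the whole run:
  perf stat -e cache-misses,branch-misses java MyApp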

Make sure you also test in the actual application context. A microbenchmark can freely consume shared CPU resources (such as the L1 cache) that get exhausted when the code runs inside an actual application.

In other words, something that is fast in a microbenchmark can perform badly in an actual application.

Other than that, one thing no one mentioned is to arrange data in cache-friendly format. Sometimes it might mean ugly things like arrays of primitive types instead of arrays of objects...
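
A sketch of that ugliness, with a hypothetical Particle type:

  // Array-of-objects: every element is its own heap object, so a sweep
  // over the x coordinates chases one pointer per element.
  class Particle { double x, y; }

  // Structure-of-arrays: each field is a contiguous primitive array, so
  // a sweep over xs is sequential, cache-friendly memory traffic.
  class Particles {
      final double[] xs, ys;
      Particles(int n) { xs = new double[n]; ys = new double[n]; }
  }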

False sharing is another real performance killer in parallel code.

IMHO, writing high performance Java code is a lot harder than high performance C/C++.


> one thing no one mentioned is to arrange data in cache-friendly format

I said: reducing pointer indirections and other cache friendly code (just saying!)

I agree that it's harder. It's harder because the platform is unpredictable and capricious, it's liable to change from one release to the next, and if you have C-oriented ideas around mechanical sympathy, they can lead you astray. For example, manual loop unrolling is seldom a win in Java (often, the array access checks kill you). What you want is to make your code simple enough for the JVM to unroll it (it'll be able to prove to itself that you don't exceed the bounds and elide checks, especially if you help out with an explicit comparison up front). You're in a dialogue with a reasonably smart compiler, and it's a bit of a guessing game hunting for the happy path. That's why my steps 2 and 3 involve incrementally building up your algorithm: you want to stay on the happy path, because if you stray off it, you need to get back on to it again before you change more stuff.
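
A sketch of that "explicit comparison up front" (whether the checks are actually elided depends on the HotSpot version, so measure):

  // The up-front range check lets HotSpot prove every a[i] below is in
  // bounds, so it can elide the per-access checks and unroll the loop.
  static long sum(int[] a, int from, int to) {
      if (from < 0 || to > a.length || from > to) {
          throw new IndexOutOfBoundsException();
      }
      long s = 0;
      for (int i = from; i < to; i++) {
          s += a[i];
      }
      return s;
  }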

Usually the way to go is to write straightforward code (common idioms get more optimization effort from the JVM team) and keep it simple enough to stay in the optimizer time budget (splitting it up where necessary).


> For example, manual loop unrolling is seldom a win in Java

Just like practically everywhere else.

Besides, it increases code size, and can decrease performance for that reason alone: L1 instruction cache spills.

Intel CPUs have increasingly improved loop buffers as well. If the loop qualifies, unrolling has no positive effect at all.

Low end microcontrollers are another matter, that's where unrolling can still sometimes yield significant wins.

I think the biggest issue I had with high performance Java is lack of memory alignment control and (at least last time I checked) vector intrinsics. JNI time...


I reliably see gains from unrolling relatively simple loops on very recent cpus. Going to 8+ iterations is usually overkill, but it can help. Sometimes one can split a dependency chain by unrolling as well.


> I reliably see gains from unrolling relatively simple loops on very recent cpus.

Maybe you didn't have a multiple of 4 uops in the loop? Or hit some other (?) loop buffer limitation.

Dependency chain splitting is also indeed a good reason to "unroll". Although it's a bit more complex transformation than simple unrolling.


Loop unrolling can definitely still be a win if it's in the form of vectorising or otherwise processing multiple elements at a time.


> IMHO, writing high performance Java code is a lot harder than high performance C/C++.

I haven't had to write high performance (numerical) code in Java, but my instinct is to doubt that it's even possible. There are too many levels of abstraction and complexity, which are constantly changing. Does some JIT heuristic get triggered to specialize a method? Does array bounds-check elimination work in this particular case? Does the compiler allocate this array so that SIMD instructions will work? Will the next minor version bump change everything? With C/C++/Fortran you can look at the generated assembly and either beat the compiler into submission, or write the asm yourself. With Java, I don't see a viable path to success.

This is Lisp's "sufficiently smart compiler" problem with a 2000s makeover.


C/C++ developers rely on knowledge of compiler optimisations all the time, the most famous case being the NRV optimization in C++. But exploiting knowledge of where data structures will be allocated is another example.

Java can be used for numerical work but it's not ideal for the reasons you cite. However this isn't fundamental to the technology or language. It's more a problem of incentives. Java isn't used for such work, so the applications that people care about and try to optimise at the JVM level don't tend to benefit from them, so the cost/benefit ratio of such optimisations seem low, so focus goes elsewhere.

C2 can do quite sophisticated auto-vectorisations but I suspect those optimisations rarely trigger because work that benefits from vectorisation was moved into C or assembly a long time ago. It's rather forward looking work. Graal doesn't even focus on it. They put all their effort into optimising large 'business java' type apps, along with language interpreters (Graal's party trick is turning interpreters into fast JIT compilers).


Using Java for numerical performance seems not a good idea to me either; the Java JITs seem too general for my taste.

Anyway, this is how vectorization works in HotSpot: http://cr.openjdk.java.net/~vlivanov/talks/2017_Vectorizatio...


Be careful with 2) because of inlining. Can’t count the number of times I wrote benchmarks that got completely inlined away.


If you are not using JMH (http://openjdk.java.net/projects/code-tools/jmh/) for micro-benchmarking, you are very likely doing it wrong. Inlining is just the tip of the gotchas iceberg.


Agreed. The vast majority of the time you should be using JMH for microbenchmarks on the JVM. If you read the samples and understand them and apply those ideas, it takes care of the inlining problem.
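
For reference, a minimal JMH benchmark looks something like this (shape per the JMH samples; Blackhole.consume is what defeats dead-code elimination):

  import org.openjdk.jmh.annotations.*;
  import org.openjdk.jmh.infra.Blackhole;

  @State(Scope.Thread)
  public class SumBenchmark {
      int[] data;

      @Setup
      public void setup() {
          data = new java.util.Random(42).ints(1_000).toArray();
      }

      @Benchmark
      public void sum(Blackhole bh) {
          long s = 0;
          for (int v : data) s += v;
          bh.consume(s);  // sink the result so the JIT can't remove the loop
      }
  }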


Also, for(Foo x : y) will allocate iterators on some JVM implementations, which will nail you on the GC.


It's not just "some JVM implementations": to be able to remove the Iterator creation, the VM requires the call to y.iterator() to be either monomorphic or bimorphic.

Hence the advice "Keep Things Monomorphic or Bimorphic" !


Even if it inlines the iterator creation and calls, then hoists and removes the allocation, you're likely to see suboptimal instructions.

On hot paths, I often find it better to just avoid iterators in the first place than trying to clean up the mess later.
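
For example, on a hot path a plain indexed loop over an ArrayList sidesteps the Iterator entirely (at some cost in generality):

  // for (String s : list) allocates an Iterator unless escape analysis
  // removes it; an indexed loop over a RandomAccess list never does.
  static int totalLength(java.util.ArrayList<String> list) {
      int sum = 0;
      for (int i = 0, n = list.size(); i < n; i++) {
          sum += list.get(i).length();
      }
      return sum;
  }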


Yes, the Iterator protocol of Java is sub-optimal: hasNext() has to be called twice, so you hope the VM will kick in common subexpression elimination, which I have found works well only if you do not do any side effects in hasNext(). And most Iterators are fail-fast, so the code has to check that the modCount of the collection has not changed.

But from my own experience, you usually do not have to optimize that far; removing that 'synchronized' block is far more common.


Is this really a consideration outside of cases where someone is tuning their system so that it never does garbage collection?


Very much so, I see this all the time in production Android code. Someone throws in an iterator like this in an onDraw() and now your super-fast code is hitting a GC pause every hundred ms or so.


Android does not use a real JIT (no inlining), so it's mostly AOT, and removing the Iterator allocation requires knowing the real class of the iterator.

I'm not an Android pro, so I wonder if the AOT system of Android (ART) has an option to provide a profile like you can do with gcc. It seems a better idea than having to rewrite all the code, third-party code included.


Escape analysis should handle this [1], but I don't know if it does all the time. Reliably profiling this is hard, I guess.

[1] http://psy-lob-saw.blogspot.com.tr/2014/12/the-escape-of-arr...


^^^ this. I would also mention the "performance tuning mindmap for Java": https://shipilev.net/talks/devoxx-Nov2012-perfMethodology-mi...


Most of performance comes down to profiling. You cannot improve performance without knowing the bottlenecks, and profiling is the best way to find them from real data/usage patterns.


Regarding 1): I went down the road of optimization, and the code was so fast that we had to switch to nanoseconds to see anything. I was quite impressed by what the JIT could do.


You can tune the maximum size of methods the JVM will inline via JVM arguments, as opposed to manually inlining.
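
On HotSpot the relevant knobs are roughly these (defaults vary by version, and they are sharp tools, so measure before and after):

  # Max bytecode size of a method inlined at an ordinary call site:
  java -XX:MaxInlineSize=35 ...
  # Max bytecode size for a frequently invoked ("hot") inline candidate:
  java -XX:FreqInlineSize=325 ...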

Also, I think the first rule of microbenchmarks on the JVM is don't do them. The second rule is to use a proper setup for these benchmarks such as JMH due to the complexities introduced by the JIT and GC.


One point I’d like to add which a lot of people disagree with is that we should all focus on writing simple and easy to understand code. Engineers shouldn’t worry too much about optimizing code unless there is a good reason to do so.

I’d say about 1% of my code has ever required me to return and optimize it (I usually write APIs that are utilized by end-user APIs). Most of the time my code is more than fast enough, and user concerns are more around bugs related to requirements.

In the few cases I have had to optimize, it’s usually been a super obvious problem. For instance, there was a Scala API that looped over a list of lists and called the Scala built-in find function. That function is linear and incredibly slow; just converting the first list into a hashmap took a call that took 30 seconds down to under 3 seconds. Still slow, but orders of magnitude faster. It’s usually been silly stuff like that.
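
In Java terms the fix was essentially this shape (hypothetical sketch, not the actual Scala code):

  import java.util.HashSet;
  import java.util.List;
  import java.util.Set;

  // O(n*m): one linear scan of haystack per needle.
  static int slowCount(List<String> needles, List<String> haystack) {
      int hits = 0;
      for (String n : needles) if (haystack.contains(n)) hits++;
      return hits;
  }

  // O(n+m): build the hash set once, then constant-time lookups.
  static int fastCount(List<String> needles, List<String> haystack) {
      Set<String> lookup = new HashSet<>(haystack);
      int hits = 0;
      for (String n : needles) if (lookup.contains(n)) hits++;
      return hits;
  }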

All of this of course depends on the domain you are working in; some applications require you to squeeze out every millisecond, in which case most of this won’t apply to you.


One point I’d like to add which a lot of people disagree with is that we should all focus on writing simple and easy to understand code. Engineers shouldn’t worry too much about optimizing code unless there is a good reason to do so.

A lot of people don't disagree with this. It's the basis of one of the most-repeated programming adages ever, with its own 40+ year literature of talmudic interpretations, guided meditations and contrarian-yet-ultimately-supportive takes.


I want to write a bot that analyzes all of the Stack Overflow posts by Brian Goetz (https://stackoverflow.com/users/3553087/brian-goetz) and counts every time he's responded to "Should I...." / "Which is better/faster..." questions with "just write the code that is simplest and easiest to read".

It must be in the hundreds by now.


we should all focus on writing simple and easy to understand code

That is an optimisation. I've found that the simplest solution is usually the most efficient.


This is a good-sounding statement, so it "feels true". But, if you stop to really think about it... It's probably false in most circumstances [1]

[1] If by "efficient" you mean time-efficient. If you pick your own efficiency metric, then yes, sure, I guess so...


"I've found that the simplest solution is usually the most efficient."

If this was true, we'd all be using bubble sort and searching databases without indexes, row by row. I can think of very few algorithms or even situations where this is true.


> (I usually write APIs that are utilized by end-user APIs)

When that is the case you need to think long and hard about performance, because it needs to be fast enough not just for you, on your machine, but on any conceivable end user's machine, for any conceivable end user's idea of what responsive-enough performance is, for any conceivable end user's workload, which might be vastly different or larger than anything you intended.

Further, I don't think it is always the case that faster code is less simple. Often the faster option is simpler. Avoiding unnecessary complexity is often a win for performance too, and understanding memory locality and how the garbage collector works allows you to make smarter default choices, which often have no net impact on simplicity.


If you're getting to this level of optimization, you could just consider going native. Making a shared lib in Rust that uses JNI and does everything you could do in Java is really really easy.


Careful, though. It costs about a millisecond to do a return trip across the JNI boundary, so don't do that anywhere near an inner loop. Accessing Java objects from native code is also a pain, with potential performance gotchas (pinning, mostly).

An approach I've seen work is to have dedicated Java and native threads, communicating via shared memory. We have some funky in-house code for this, but I think Aeron [1] should be a good open source option.

[1] https://github.com/real-logic/aeron


JNI roundtrips definitely do not cost 1ms per trip.


True. I did some benchmarking a while ago, and had that number in my head. Revisiting it now, it seems to be way off - more like 5 nanoseconds with -Xcheck:jni on. Code:

https://github.com/tomwhoiscontrary/jni-bench


True of course. I just feel like many people had long since taken native off the table because of the burdens of cross-platform C/C++ development. But I found Rust+JNI so easy that I think it should be a reasonable tool in every JVM optimization toolbox.


"It costs about a millisecond to do a return trip across the JNI boundary" - Do you have a source on this? Seems unlikely since most system calls would be way too slow if this were true?


> Accessing Java objects from native code is also a pain, with potential performance gotchas (pinning, mostly).

What do you mean by pinning? If you mean pinning threads to cores, how does that relate to JNI?


Probably pinning the object in memory, so that you can keep a stable reference to it without the danger of the GC moving it away under your pointer.


Do some research on GC locker.


Not necessarily true that native would be faster than JIT-optimized code, though.


True, it's just an option with possible downsides. Just seems to go unmentioned often.


The example of bad allocations

  def first[A](left: A, right: A): A = left

  first(12, 42)   // => 12
actually allocates nothing, because both numbers are in the Integer box cache range (-128 to 127 inclusive by default, but configurable to be greater). The general argument is good though, so maybe first(128, 129) is a better example :)
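
The cache is easy to observe from plain Java (the upper bound is tunable via -XX:AutoBoxCacheMax):

  Integer a = 127, b = 127;   // boxed via Integer.valueOf -> cached
  Integer c = 128, d = 128;   // outside the cache -> fresh allocations
  System.out.println(a == b); // true: same cached instance
  System.out.println(c == d); // false: two distinct objects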


Actually, those are Scala code examples, so I believe it will end up using ints for this case, not Integers.

Edit: I'm wrong, the example still needs a @specialized version.


Seems odd that the compiler can't inline this pattern.


It probably can. Graal certainly could.


Hmmmm. Fast code for my particular application, in my area of specialization. But not actually very helpful for 99.9% of applications that are actually deployed, which don't really suffer from the type of problem here, and where far higher-level optimisation would have greater benefit...

E.g. prefer final fields; don't share between threads unless you need to; there is little point micro-optimising a poor algorithm; don't do the same work repeatedly...

Having said that, in a specialised context, very interesting....


Can't have a performance oriented article here (or on any dev site) without a chorus on how irrelevant/useless/harmful the information is. Or half the comments being about profiling.

Also, final doesn't help HotSpot nearly as much as the pointers the article gives; final doesn't help HotSpot much, if at all.


Usual disclaimer of profile your code to identify areas for improvement and try to quantify changes as far as possible -- it's no good doing things just because you think it'll make the program faster.

Specifically on the Java side of things, while not quite as low-level as what the post is talking about, I've seen situations where proper use of logging libraries has given perf increases. It's as simple as log.info("Foo: {}", "bar") instead of log.info(String.format("Foo: %s", "bar")) -- reducing the number of string operations dependent on your chosen logging level. Simple, but I've seen it happen.

Shoutout to Brendan Gregg's blog [0] with some interesting stuff on performance, some of which is Java-specific.

[0] http://www.brendangregg.com/


Yes, String.valueOf() is dog slow.

We recently moved all logging stuff to use a constant lambda as the first parameter, log.info(s -> "Foo: " + s, "bar"), so if the logger is not enabled, the constant lambda is discarded by the VM, and otherwise the string concatenation is faster than any string interpolation.
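
A sketch of what such a logger method might look like (hypothetical API; isInfoEnabled and write are stand-ins, not the poster's actual code):

  // The template lambda is non-capturing, so the VM reuses one instance;
  // the concatenation only happens when the level is enabled.
  void info(java.util.function.UnaryOperator<String> template, String arg) {
      if (isInfoEnabled()) {
          write(template.apply(arg));
      }
  }

  // Usage: info(s -> "Foo: " + s, "bar");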


Huh. I'll have to try that. Thanks.

My solution was to flip the problem, making a separate logger instance for each log level. So something like

  Logger warn = Logger.factory(Level.WARN); 
  warn.log( "message" );
If warn is disabled, factory returns a NullObject (do nothing) logger.

(My actual implementation had one level of indirection, so logs levels could be updated at runtime.)

Then I got nutty optimizing printf, doing JIT class generation per template... But that's another story.

TL;DR: Logging in Java drives me nuts, so I made my own.


Did you take a look at java.lang.invoke.StringConcatFactory? It does more or less what you want, using method handles instead of bytecode generation.


Thanks for the tip. Will check it out. I'll be happy if it moots my hackery.


Have you measured this? I would expect the allocation of Callable<String> with closures to be slightly worse than common types of string concatenation. Not to mention the awkwardness of only logging effectively final vars--this would be a red flag on organizational coding practices--just use slf4j or better yet Scala + LazyLogging.


If you use a closure that captures some local variables, rather than a constant lambda, then you're right, the allocation will kill you. But we are only using lambdas that do not capture anything, so the lambda is allocated once by the VM.

And we are using the new shiny System.Logger [1] so the VM logs and the application logs are interleaved. We still have some wrappers to redirect log4j logs to the system logger, because few third-party libs are compatible with it.

[1] https://docs.oracle.com/javase/9/docs/api/java/lang/System.L...


Ah, makes sense. On the other hand, this would seem to take a StringBuilder-based approach off the table (cf. logback). I wonder how this stacks up to simply passing varargs to the logger. There's also the question of applying any strategy other than toString to the non-string args.


Wait, your string literals are still going to get intern'd, regardless of whether they are in a Lambda or not.


Well, since Java always evaluates method arguments eagerly, if you have a computationally heavy argument, then it is better to check the log level first:

  if(log.isInfoEnabled()) {
    log.info("Foo: {}", someHeavyOperationThatReturnsString());
  }
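
Since Java 8/9 the JDK loggers also accept a Supplier, which avoids both the guard and the eager evaluation (java.util.logging since 8; System.Logger, mentioned elsewhere in the thread, since 9):

  System.Logger logger = System.getLogger("app");
  logger.log(System.Logger.Level.INFO,
             () -> "Foo: " + someHeavyOperationThatReturnsString());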


String formatting in general will be the death of all performance... and humanity.


> It occurred to me that, really, I had more or less picked up all of it by word of mouth and experience, and there just aren't any good reference sources on the topic. So… here's my word of mouth.

Can someone provide a reference on how JVM optimisation works, or maybe on the JVM generally?


The Java Performance book by Oaks has some sections on internals. Cliff Click has a talk called “A JVM Does What?” that has some interesting material (very compressed though).

There’s also some old Oracle or Sun white papers on Hotspot, but I’ve never actually read them.

One nitpick to keep in mind: the major JVMs do many things in similar ways but have important differences. In fact, there is a spec that’s much less prescriptive, and JITs are entirely an implementation detail. Often when you read “the JVM does X” it means that the author is talking about Hotspot.


At a certain point, it's simpler, more robust, and more predictable to write some JNI and put the hot path in C++ or Rust or something.


About the monomorphic and bimorphic virtual methods, Aleksey Shipilev has some great articles on the topic, including low-level benchmarks: https://shipilev.net/blog/2015/black-magic-method-dispatch/
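
The effect is easy to reproduce (a sketch; the exact thresholds are HotSpot implementation details):

  interface Shape { double area(); }
  final class Circle implements Shape {
      final double r; Circle(double r) { this.r = r; }
      public double area() { return Math.PI * r * r; }
  }
  final class Square implements Shape {
      final double s; Square(double s) { this.s = s; }
      public double area() { return s * s; }
  }

  // If only Circle (monomorphic) or Circle and Square (bimorphic) ever
  // reach this call site, HotSpot can inline area(). Let a third type
  // reach it and the site goes megamorphic: a full virtual dispatch per call.
  static double total(Shape[] shapes) {
      double sum = 0;
      for (Shape s : shapes) sum += s.area();
      return sum;
  }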


The link from the submission seems to 404. This link seems to work: https://gist.github.com/djspiewak/464c11307cabc80171c90397d4...


I'd actually like to add "Don't touch synchronized functions" to the list. If need be, try to use ThreadLocals instead.


Foremost, one should look at what java.util.concurrent [0] has to offer before reinventing concurrency patterns from scratch.

[0] https://docs.oracle.com/javase/9/docs/api/java/util/concurre...
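
For example, a contended counter: LongAdder is built for exactly this, versus hand-rolling a synchronized one.

  import java.util.concurrent.atomic.LongAdder;

  // Internally striped, so concurrent increments rarely contend.
  LongAdder hits = new LongAdder();
  hits.increment();        // call from many threads
  long total = hits.sum(); // aggregate when you need the value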


I’ve seen statements like this before, and they are often based on misunderstandings of what either monitors (synchronisation) or ThreadLocal give you. Could you maybe expand your comment to describe what problem with synchronised you would suggest fixing using ThreadLocal?


Please avoid TLS at all costs unless you really know what you are doing and why. It can blow up GC due to the reference graph becoming highly interconnected via the TLS arrays. This can happen without you being aware because dependent libraries are busy being oh so clever with TLS.


I don't think this is good advice. If you need synchronization, use synchronization primitives, java.util.concurrent, etc., not thread locals.


Don't think ThreadLocals help when you actually need to share state between threads.



