Quick Tips for Fast Code on the JVM (gist.github.com)
236 points by wskinner on Dec 31, 2017 | 76 comments



I'd recommend a process rather than a handful of implementation-specific rules of thumb that are, IMO, at risk of being applied too generally. I'd hate for someone to read this article and then go ahead and avoid 'new' wherever possible because they think that's how you make code fast, or avoid boxing and polymorphic call sites everywhere, etc.

My process for writing fast code on the JVM:

1) Measure. Set up a benchmark so you know whether you're on the happy path or not, or whether you've fallen off a performance cliff, for whatever reason. Make this part of your testing suite with some kind of notification for regressions.

2) Start small, ideally with a do-nothing loop over the input. This gives you a baseline; you can't go faster than a do-nothing loop (presuming you can't skip part of your input, which is an algorithm problem, not specific to JVM optimization). A minimal sketch of such a baseline follows step 3.

3) Incrementally build up your algorithm and pay attention to when it falls off the performance cliff, using your benchmark from (1). If and when you fall off the performance cliff, that's when you start bringing in tricks like avoiding new, ensuring call sites are monomorphic / bimorphic, avoiding boxing, reducing pointer indirections and other cache friendly code, etc.
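
To make that baseline concrete, here's a minimal sketch (illustrative only; in practice you'd wrap it in a harness like JMH, which comes up downthread):

  // Baseline: a do-nothing pass over the input. The real algorithm can
  // only be slower, so this anchors the benchmark. The checksum is a
  // cheap read that keeps the JIT from eliminating the loop entirely.
  static long baseline(int[] input) {
      long checksum = 0;
      for (int i = 0; i < input.length; i++) {
          checksum += input[i];
      }
      return checksum;
  }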

Another trick to consider is playing around with inlining, but not in the way you might think: try pushing infrequently executed code (conditionals) one level deeper in the call stack (i.e. making the body of an if-block into a method and calling it). The idea here is to encourage inlining of the method doing the calling. Inlining is where the JVM gets to specialize your code to the specific situation at hand, but the JVM is reluctant to inline big methods because it has a time budget. So you need to help it.
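
A sketch of that shape (hypothetical names, assuming the unexpected-byte path is rare):

  static final byte EXPECTED = 0x7F;

  // The rare branch body lives one level down, so this caller stays
  // small and remains an attractive inlining candidate.
  static int decode(byte[] buf, int pos) {
      if (buf[pos] != EXPECTED) {
          return handleUnexpected(buf, pos);  // cold path, rarely taken
      }
      return buf[pos + 1] & 0xFF;             // hot path
  }

  static int handleUnexpected(byte[] buf, int pos) {
      throw new IllegalStateException("unexpected byte at " + pos);
  }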


I'm a Java programmer who has tried and failed to do some optimizations this year following your general guidelines. I'm extremely happy to finally see a piece of writing that expands my mental model of what is actually happening on the JVM. If we just keep repeating the same general truisms about measurement and inlining, then nobody will ever advance beyond my own low level of understanding.


I'm not really a fan of your "best practices". Java often has unique problems where this generic advice breaks down (I do a lot of low latency and high performance work), and sometimes you need mechanical sympathy instead of relying on tools so much.

1- The author does tell you to understand your hot path. However, in Java it is often difficult to pin down GC-related issues, and just easier to prevent them in the first place. I've tried to go through code bases others have written to clean up allocation problems, and they become so embedded in the code that it is easier to just rewrite. Some issues, like finalizer/weak ref issues, are very difficult to find in your measurements.

2 and 3- I've never seen this work. This is a great way to get a bunch of faulty analysis or not even know you're going off track to begin with.

This article and many of the tips are written for when you are trying to be "as fast as possible" and where each microsecond counts. I've seen a lot of projects go down in flames where they treated performance as something secondary that can be written into the code in a second pass. When you are having to deal with issues like the article describes, you should take a more first-principles approach and keep your eye on the proper idioms in the first place.


sometimes you need mechanical sympathy instead of relying on tools so much

If you need mechanical sympathy throughout, you still need solid measurements, you still need to know when you fall off performance cliffs, but you should be experienced enough not to be misled by articles like this one. That last bit was my biggest concern. Because very very few Java apps have performance bottlenecks throughout. That's more typically seen in game engines, where the whole app is a hot loop, and they're not often written in Java.

As to whether it works or not: I've seen it work repeatedly for scanning and parsing. I don't know your situation and can't say why it hasn't worked for you. But again, measurement is the most important bit, and you still need to notice the cliff dropping.


This after-the-fact, measurement-heavy method of optimization often doesn't work in these systems, because it is more often death by a thousand paper cuts than a single line of code.

You won't ever notice yourself going off the track, and even after you're done you likely won't notice.

I do a lot of latency- and throughput-sensitive work, and far more often the bigger problem is losing a few tenths of a percent on each poor decision in your hot path. It becomes very very difficult to fix.


Not really disagreeing with your sentiment, but he laid it out nicely in the "Defining the Hot Path" section. That is honestly the most valuable lesson a newer programmer could take away from this gist.


Some bits and pieces that come to mind:

Make use of JVM and CPU performance counters. They're amazing for discovery and understanding of performance matters.

Quick googling found these:

https://zeroproductionincidents.wordpress.com/tag/jvm-perfor...

http://www.peternier.com/projects/overseer/overseer.php
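
On the HotSpot side, a few diagnostic flags plus Linux perf are a cheap way to start (flag names per HotSpot; MyApp is a placeholder):

  # What the JIT compiles, and what it decided to inline:
  java -XX:+PrintCompilation MyApp
  java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining MyApp

  # Hardware counters (cache misses, branch mispredicts) for the whole run:
  perf stat -e cache-misses,branch-misses java MyApp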

Make sure you also test in the actual application context. A microbenchmark can freely consume shared CPU resources (such as the L1 cache) that get exhausted when the code runs inside an actual application.

In other words, something that is fast in a microbenchmark can perform badly in an actual application.

Other than that, one thing no one mentioned is to arrange data in cache-friendly format. Sometimes it might mean ugly things like arrays of primitive types instead of arrays of objects...
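
A sketch of that ugliness, with a hypothetical Particle type:

  // Array-of-objects: every element is its own heap object, so a sweep
  // over the x coordinates chases one pointer per element.
  class Particle { double x, y; }

  // Structure-of-arrays: each field is a contiguous primitive array, so
  // a sweep over xs is sequential, cache-friendly memory traffic.
  class Particles {
      final double[] xs, ys;
      Particles(int n) { xs = new double[n]; ys = new double[n]; }
  }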

False sharing is another real performance killer in parallel code.

IMHO, writing high performance Java code is a lot harder than high performance C/C++.


> one thing no one mentioned is to arrange data in cache-friendly format

I said: reducing pointer indirections and other cache friendly code (just saying!)

I agree that it's harder. It's harder because the platform is unpredictable and capricious, it's liable to change from one release to the next, and if you have C-oriented ideas around mechanical sympathy, they can lead you astray. For example, manual loop unrolling is seldom a win in Java (often, the array access checks kill you). What you want is to make your code simple enough for the JVM to unroll it (it'll be able to prove to itself that you don't exceed the bounds and elide checks, especially if you help out with an explicit comparison up front). You're in a dialogue with a reasonably smart compiler, and it's a bit of a guessing game hunting for the happy path. That's why my steps 2 and 3 involve incrementally building up your algorithm: you want to stay on the happy path, because if you stray off it, you need to get back on to it again before you change more stuff.
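
A sketch of that "explicit comparison up front" (whether the checks are actually elided depends on the HotSpot version, so measure):

  // The up-front range check lets HotSpot prove every a[i] below is in
  // bounds, so it can elide the per-access checks and unroll the loop.
  static long sum(int[] a, int from, int to) {
      if (from < 0 || to > a.length || from > to) {
          throw new IndexOutOfBoundsException();
      }
      long s = 0;
      for (int i = from; i < to; i++) {
          s += a[i];
      }
      return s;
  }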

Usually the way to go is to write straightforward code (common idioms get more optimization effort from the JVM team) and keep it simple enough to stay in the optimizer time budget (splitting it up where necessary).


> For example, manual loop unrolling is seldom a win in Java

Just like practically everywhere else.

Besides, it increases code size, and can decrease performance for that reason alone: L1 instruction cache spills.

Intel CPUs have increasingly improved loop buffers as well. If the loop qualifies, unrolling has no positive effect at all.

Low end microcontrollers are another matter, that's where unrolling can still sometimes yield significant wins.

I think the biggest issue I had with high performance Java is lack of memory alignment control and (at least last time I checked) vector intrinsics. JNI time...


I reliably see gains from unrolling relatively simple loops on very recent cpus. Going to 8+ iterations is usually overkill, but it can help. Sometimes one can split a dependency chain by unrolling as well.


> I reliably see gains from unrolling relatively simple loops on very recent cpus.

Maybe you didn't have a multiple of 4 uops in the loop? Or hit some other (?) loop buffer limitation.

Dependency chain splitting is also indeed a good reason to "unroll". Although it's a bit more complex transformation than simple unrolling.


Loop unrolling can definitely still be a win if it's in the form of vectorising or otherwise processing multiple elements at a time.


> IMHO, writing high performance Java code is a lot harder than high performance C/C++.

I haven't had to write high performance (numerical) code in Java, but my instinct is to doubt that it's even possible. There are too many levels of abstraction and complexity, which are constantly changing. Does some JIT heuristic get triggered to specialize a method? Does array bounds-check elimination work in this particular case? Does the compiler allocate this array so that SIMD instructions will work? Will the next minor version bump change everything? With C/C++/Fortran you can look at the generated assembly and either beat the compiler into submission, or write the asm yourself. With Java, I don't see a viable path to success.

This is Lisp's "sufficiently smart compiler" problem with a 2000s makeover.


C/C++ developers rely on knowledge of compiler optimisations all the time, the most famous case being the NRV optimization in C++. But exploiting knowledge of where data structures will be allocated is another example.

Java can be used for numerical work but it's not ideal for the reasons you cite. However this isn't fundamental to the technology or language. It's more a problem of incentives. Java isn't used for such work, so the applications that people care about and try to optimise at the JVM level don't tend to benefit from them, so the cost/benefit ratio of such optimisations seem low, so focus goes elsewhere.

C2 can do quite sophisticated auto-vectorisations but I suspect those optimisations rarely trigger because work that benefits from vectorisation was moved into C or assembly a long time ago. It's rather forward looking work. Graal doesn't even focus on it. They put all their effort into optimising large 'business java' type apps, along with language interpreters (Graal's party trick is turning interpreters into fast JIT compilers).


Using Java for numerical performance seems not a good idea to me either; the Java JITs seem too general for my taste.

Anyway, this is how vectorization works in HotSpot: http://cr.openjdk.java.net/~vlivanov/talks/2017_Vectorizatio...


Be careful with 2) because of inlining. Can’t count the number of times I wrote benchmarks that got completely inlined away.


If you are not using JMH (http://openjdk.java.net/projects/code-tools/jmh/) for micro-benchmarking, you are very likely doing it wrong. Inlining is just the tip of the gotchas iceberg.


Agreed. The vast majority of the time you should be using JMH for microbenchmarks on the JVM. If you read the samples and understand them and apply those ideas, it takes care of the inlining problem.
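
For reference, a minimal JMH benchmark looks something like this (shape per the JMH samples; Blackhole.consume is what defeats dead-code elimination):

  import org.openjdk.jmh.annotations.*;
  import org.openjdk.jmh.infra.Blackhole;

  @State(Scope.Thread)
  public class SumBenchmark {
      int[] data;

      @Setup
      public void setup() {
          data = new java.util.Random(42).ints(1_000).toArray();
      }

      @Benchmark
      public void sum(Blackhole bh) {
          long s = 0;
          for (int v : data) s += v;
          bh.consume(s);  // sink the result so the JIT can't remove the loop
      }
  }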


Also, for(Foo x : y) will allocate iterators on some JVM implementations, which will nail you on the GC.


It's not just "some JVM implementations": to be able to remove the Iterator creation, the VM requires the call to y.iterator() to be either monomorphic or bimorphic.

Hence the advice "Keep Things Monomorphic or Bimorphic" !


Even if it inlines the iterator creation and calls, then hoists and removes the allocation, you're likely to see suboptimal instructions.

On hot paths, I often find it better to just avoid iterators in the first place than trying to clean up the mess later.
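
For example, on a hot path a plain indexed loop over an ArrayList sidesteps the Iterator entirely (at some cost in generality):

  // for (String s : list) allocates an Iterator unless escape analysis
  // removes it; an indexed loop over a RandomAccess list never does.
  static int totalLength(java.util.ArrayList<String> list) {
      int sum = 0;
      for (int i = 0, n = list.size(); i < n; i++) {
          sum += list.get(i).length();
      }
      return sum;
  }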


Yes, the Iterator protocol of Java is sub-optimal: hasNext() has to be called twice, so you hope the VM will kick in common subexpression elimination, which I have found works well only if you do not do any side effects in hasNext(). And most Iterators are fail-fast, so the code has to check that the modCount of the collection has not changed.

But from my own experience, you usually do not have to optimize that far; removing that 'synchronized' block is far more common.


Is this really a consideration outside of cases where someone is tuning their system so that it never does garbage collection?


Very much so, I see this all the time in production Android code. Someone throws in an iterator like this in an onDraw() and now your super-fast code is hitting a GC pause every hundred ms or so.


Android does not use a real JIT (no inlining), so it's mostly AOT, and removing the Iterator allocation requires knowing the real class of the iterator.

I'm not an Android pro, so I wonder if the AOT system of Android (ART) has an option to provide a profile like you can do with gcc. It seems a better idea than having to rewrite all the code, third-party code included.


Escape analysis should handle this [1], but I don't know if it does all the time. Reliably profiling this is hard, I guess.

[1] http://psy-lob-saw.blogspot.com.tr/2014/12/the-escape-of-arr...


^^^ this. I would also mention the "performance tuning mindmap for Java": https://shipilev.net/talks/devoxx-Nov2012-perfMethodology-mi...


Most of performance comes down to profiling. You cannot improve performance without knowing the bottlenecks, and profiling is the best way to find them from real data/usage patterns.


Regarding 1): I went down the road of optimization, and the code was so fast that we had to switch to nanoseconds to see anything. I was quite impressed by what the JIT could do.


You can tune the maximum size of methods the JVM will inline via JVM arguments, as opposed to manually inlining.
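
On HotSpot the relevant knobs are roughly these (defaults vary by version, and they are sharp tools, so measure before and after):

  # Max bytecode size of a method inlined at an ordinary call site:
  java -XX:MaxInlineSize=35 ...
  # Max bytecode size for a frequently invoked ("hot") inline candidate:
  java -XX:FreqInlineSize=325 ...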

Also, I think the first rule of microbenchmarks on the JVM is don't do them. The second rule is to use a proper setup for these benchmarks such as JMH due to the complexities introduced by the JIT and GC.


One point I’d like to add which a lot of people disagree with is that we should all focus on writing simple and easy to understand code. Engineers shouldn’t worry too much about optimizing code unless there is a good reason to do so.

I’d say about 1% of my code has ever required me to return and optimize it (I usually write APIs that are utilized by end-user APIs). Most of the time my code is more than fast enough, and user concerns are more around bugs related to requirements.

In the few cases I have had to optimize, it’s usually been a super obvious problem. For instance, there was a Scala API that looped over a list of lists and called the Scala built-in find function. That function is linear and incredibly slow; just converting the first list into a hashmap took a call that took 30 seconds down to under 3 seconds. Still slow, but orders of magnitude faster. It’s usually been silly stuff like that.
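
In Java terms the fix was essentially this shape (hypothetical sketch, not the actual Scala code):

  import java.util.HashSet;
  import java.util.List;
  import java.util.Set;

  // O(n*m): one linear scan of haystack per needle.
  static int slowCount(List<String> needles, List<String> haystack) {
      int hits = 0;
      for (String n : needles) if (haystack.contains(n)) hits++;
      return hits;
  }

  // O(n+m): build the hash set once, then constant-time lookups.
  static int fastCount(List<String> needles, List<String> haystack) {
      Set<String> lookup = new HashSet<>(haystack);
      int hits = 0;
      for (String n : needles) if (lookup.contains(n)) hits++;
      return hits;
  }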

All of this of course depends on the domain you are working in; some applications require you to squeeze out every millisecond, in which case most of this won’t apply to you.


One point I’d like to add which a lot of people disagree with is that we should all focus on writing simple and easy to understand code. Engineers shouldn’t worry too much about optimizing code unless there is a good reason to do so.

A lot of people don't disagree with this. It's the basis of one of the most-repeated programming adages ever, with its own 40+ year literature of talmudic interpretations, guided meditations and contrarian-yet-ultimately-supportive takes.


I want to write a bot that analyzes all of the Stack Overflow posts by Brian Goetz (https://stackoverflow.com/users/3553087/brian-goetz) and counts every time he's responded to "Should I...." / "Which is better/faster..." questions with "just write the code that is simplest and easiest to read".

It must be in the hundreds by now.


we should all focus on writing simple and easy to understand code

That is an optimisation. I've found that the simplest solution is usually the most efficient.


This is a good-sounding statement, so it "feels true". But, if you stop to really think about it... It's probably false in most circumstances [1]

[1] If by "efficient" you mean time-efficient. If you pick your own efficiency metric, then yes, sure, I guess so...


"I've found that the simplest solution is usually the most efficient."

If this was true, we'd all be using bubble sort and searching databases without indexes, row by row. I can think of very few algorithms or even situations where this is true.


> (I usually write APIs that are utilized by end-user APIs)

When that is the case you need to think long and hard about performance, because it needs to be fast enough not just for you, on your machine, but on any conceivable end user's machine, for any conceivable end user's idea of what responsive-enough performance is, for any conceivable end user's workload, which might be vastly different or larger than anything you intended.

Further, I don't think it is always the case that faster code is less simple. Often the faster option is simpler. Avoiding unnecessary complexity is often a win for performance too, and understanding memory locality and how the garbage collector works allows you to make smarter default choices, which often have no net impact on simplicity.


If you're getting to this level of optimization, you could just consider going native. Making a shared lib in Rust that uses JNI and does everything you could do in Java is really really easy.


Careful, though. It costs about a millisecond to do a return trip across the JNI boundary, so don't do that anywhere near an inner loop. Accessing Java objects from native code is also a pain, with potential performance gotchas (pinning, mostly).

An approach I've seen work is to have dedicated Java and native threads, communicating via shared memory. We have some funky in-house code for this, but I think Aeron [1] should be a good open source option.

[1] https://github.com/real-logic/aeron


JNI roundtrips definitely do not cost 1ms per trip.


True. I did some benchmarking a while ago, and had that number in my head. Revisiting it now, it seems to be way off - more like 5 nanoseconds with -Xcheck:jni on. Code:

https://github.com/tomwhoiscontrary/jni-bench


True of course. I just feel like many people had long since taken native off the table because of the burdens of cross-platform C/C++ development. But I found Rust+JNI so easy that I think it should be a reasonable tool in every JVM optimization toolbox.


"It costs about a millisecond to do a return trip across the JNI boundary" - Do you have a source on this? Seems unlikely since most system calls would be way too slow if this were true?


> Accessing Java objects from native code is also a pain, with potential performance gotchas (pinning, mostly).

What do you mean by pinning? If you mean pinning threads to cores, how does that relate to JNI?


Probably pinning the object in memory, so that you can keep a stable reference to it without the danger of the GC moving it away under your pointer.


Do some research on GC locker.


Not necessarily true that native would be faster than JIT-optimized code, though.


True, it's just an option with possible downsides. Just seems to go unmentioned often.


The example of bad allocations

  def first[A](left: A, right: A): A = left

  first(12, 42)   // => 12
actually allocates nothing, because both numbers are in the Integer box cache range (-128 to 127 inclusive by default, but configurable to be greater). The general argument is good though, so maybe first(128, 129) is a better example :)
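
The cache is easy to observe from plain Java (the upper bound is tunable via -XX:AutoBoxCacheMax):

  Integer a = 127, b = 127;   // boxed via Integer.valueOf -> cached
  Integer c = 128, d = 128;   // outside the cache -> fresh allocations
  System.out.println(a == b); // true: same cached instance
  System.out.println(c == d); // false: two distinct objects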


Actually, those are Scala code examples, so I believe it will end up using ints for this case, not Integers.

Edit: I'm wrong, the example still needs a @specialized version.


Seems odd that the compiler can't inline this pattern.


It probably can. Graal certainly could.


Hmmmm. Fast code for my particular application, in my area of specialization. But not actually very helpful for 99.9% of applications that are actually deployed, which don't really suffer from the type of problem here, and where far higher-level optimisation would have greater benefit...

E.g. prefer final fields; don't share between threads unless you need to; there is little point micro-optimising a poor algorithm; don't do the same work repeatedly...

Having said that, in a specialised context, very interesting....


Can't have a performance oriented article here (or on any dev site) without a chorus on how irrelevant/useless/harmful the information is. Or half the comments being about profiling.

Also, final doesn't help HotSpot nearly as much as the pointers the article gives; final doesn't help HotSpot much, if at all.


Usual disclaimer of profile your code to identify areas for improvement and try to quantify changes as far as possible -- it's no good doing things just because you think it'll make the program faster.

Specifically on the Java side of things, while not quite as low-level as what the post is talking about, I've seen situations where proper use of logging libraries has given perf increases. It's as simple as log.info("Foo: {}", "bar") instead of log.info(String.format("Foo: %s", "bar")) -- reducing the number of string operations dependent on your chosen logging level. Simple, but I've seen it happen.

Shoutout to Brendan Gregg's blog [0] with some interesting stuff on performance, some of which is Java-specific.

[0] http://www.brendangregg.com/


Yes, String.valueOf() is dog slow.

We recently moved all logging stuff to use a constant lambda as the first parameter, log.info(s -> "Foo: " + s, "bar"), so if the logger is not enabled, the constant lambda is discarded by the VM, and otherwise the string concatenation is faster than any string interpolation.
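
A sketch of what such a logger method might look like (hypothetical API; isInfoEnabled and write are stand-ins, not the poster's actual code):

  // The template lambda is non-capturing, so the VM reuses one instance;
  // the concatenation only happens when the level is enabled.
  void info(java.util.function.UnaryOperator<String> template, String arg) {
      if (isInfoEnabled()) {
          write(template.apply(arg));
      }
  }

  // Usage: info(s -> "Foo: " + s, "bar");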


Huh. I'll have to try that. Thanks.

My solution was to flip the problem, making a separate logger instance for each log level. So something like

  Logger warn = Logger.factory(Level.WARN); 
  warn.log( "message" );
If warn is disabled, factory returns a NullObject (do nothing) logger.

(My actual implementation had one level of indirection, so logs levels could be updated at runtime.)

Then I got nutty optimizing printf, doing JIT class generation per template... But that's another story.

TL;DR: Logging in Java drives me nuts, so I made my own.


Did you take a look at java.lang.invoke.StringConcatFactory? It does more or less what you want, using method handles instead of bytecode generation.


Thanks for the tip. Will check it out. I'll be happy if it moots my hackery.


Have you measured this? I would expect the allocation of Callable<String> with closures to be slightly worse than common types of string concatenation. Not to mention the awkwardness of only logging effectively final vars--this would be a red flag on organizational coding practices--just use slf4j or better yet Scala + LazyLogging.


If you use a closure that captures some local variables, rather than a constant lambda, then you're right, the allocation will kill you. But we are only using lambdas that do not capture anything, so the lambda is allocated once by the VM.

And we are using the new shiny System.Logger [1] so the VM logs and the application logs are interleaved. We still have some wrappers to redirect log4j logs to the system logger, because few third-party libs are compatible with it.

[1] https://docs.oracle.com/javase/9/docs/api/java/lang/System.L...


Ah, makes sense. On the other hand, this would seem to take a StringBuilder-based approach off the table (cf. logback). I wonder how this stacks up to simply passing varargs to the logger. There's also the question of applying any strategy other than toString to the non-string args.


Wait, your string literals are still going to get intern'd, regardless of whether they are in a Lambda or not.


Well, since Java always evaluates method arguments eagerly, if you have a computationally heavy argument, then it is better to check the log level first:

  if(log.isInfoEnabled()) {
    log.info("Foo: {}", someHeavyOperationThatReturnsString());
  }
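
Since Java 8/9 the JDK loggers also accept a Supplier, which avoids both the guard and the eager evaluation (java.util.logging since 8; System.Logger, mentioned elsewhere in the thread, since 9):

  System.Logger logger = System.getLogger("app");
  logger.log(System.Logger.Level.INFO,
             () -> "Foo: " + someHeavyOperationThatReturnsString());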


String formatting in general will be the death of all performance... and humanity.


> It occurred to me that, really, I had more or less picked up all of it by word of mouth and experience, and there just aren't any good reference sources on the topic. So… here's my word of mouth.

Can someone provide a reference on how JVM optimisation works, or maybe on the JVM generally?


The Java Performance book by Oaks has some sections on internals. Cliff Click has a talk called “A JVM Does What?” that has some interesting material (very compressed though).

There’s also some old Oracle or Sun white papers on Hotspot, but I’ve never actually read them.

One nitpick to keep in mind: the major JVMs do many things in similar ways but have important differences. In fact, there is a spec that’s much less prescriptive, and JITs are entirely an implementation detail. Often when you read “the JVM does X” it means that the author is talking about Hotspot.


At a certain point, it's simpler, more robust, and more predictable to write some JNI and put the hot path in C++ or Rust or something.


About the monomorphic and bimorphic virtual methods, Aleksey Shipilev has some great articles on the topic, including low-level benchmarks: https://shipilev.net/blog/2015/black-magic-method-dispatch/
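
The effect is easy to reproduce (a sketch; the exact thresholds are HotSpot implementation details):

  interface Shape { double area(); }
  final class Circle implements Shape {
      final double r; Circle(double r) { this.r = r; }
      public double area() { return Math.PI * r * r; }
  }
  final class Square implements Shape {
      final double s; Square(double s) { this.s = s; }
      public double area() { return s * s; }
  }

  // If only Circle (monomorphic) or Circle and Square (bimorphic) ever
  // reach this call site, HotSpot can inline area(). Let a third type
  // reach it and the site goes megamorphic: a full virtual dispatch per call.
  static double total(Shape[] shapes) {
      double sum = 0;
      for (Shape s : shapes) sum += s.area();
      return sum;
  }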


The link from the submission seems to 404. This link seems to work: https://gist.github.com/djspiewak/464c11307cabc80171c90397d4...


I'd actually like to add "Don't touch synchronized functions" to the list. If need be, try to use ThreadLocals instead.


Foremost, one should look at what java.util.concurrent [0] has to offer before reinventing concurrency patterns from scratch.

[0] https://docs.oracle.com/javase/9/docs/api/java/util/concurre...
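
For example, a contended counter: LongAdder is built for exactly this, versus hand-rolling a synchronized one.

  import java.util.concurrent.atomic.LongAdder;

  // Internally striped, so concurrent increments rarely contend.
  LongAdder hits = new LongAdder();
  hits.increment();        // call from many threads
  long total = hits.sum(); // aggregate when you need the value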


I’ve seen statements like this before, and they are often based on misunderstandings of what either monitors (synchronisation) or ThreadLocal give you. Could you maybe expand your comment to describe what problem with synchronised you would suggest fixing using ThreadLocal?


Please avoid TLS at all costs unless you really know what you are doing and why. It can blow up GC due to the reference graph becoming highly interconnected via the TLS arrays. This can happen without you being aware because dependent libraries are busy being oh so clever with TLS.


I don't think this is good advice. If you need synchronization, use synchronization primitives, java.util.concurrent, etc., not thread locals.


Don't think ThreadLocals help when you actually need to share state between threads.



