Sunday, 11 September 2016

Scaling Enterprise Java on 64-bit Multi-Core X86-Based Servers

Multi-core and 64-bit CPUs are the hottest commodities in the enterprise server market these days. In recent years, as the cost and power requirement of faster CPU clock speeds has increased, the growth in raw clock speed (usually measured in megahertz) of single CPUs has slowed down. Hardware manufacturers continue to improve X86-based server performance by increasing both the multitasking capability and internal data bandwidth. Both Intel and Advanced Micro Devices are shipping 64-bit processors with two internal CPU cores, and quad core processors are soon to follow. Ninth-generation servers from Dell exploit this new generation of chips. The PowerEdge 1955 blade server, for example, supports up to two 64-bit dual core processors in a blade configuration, with up to ten such blades in a 7-rack unit (12.25") chassis.

However, those new generations of servers also pose new challenges for the software. For instance, to take advantage of the multi-core CPUs, the software application must be able to execute tasks in parallel across the CPUs; to take advantage of the 64-bit memory bandwidth, the application must also be able to manage a large amount of memory efficiently. As a key software platform on enterprise servers, Java Enterprise Edition (Java EE) is on the forefront of this multi-core, 64-bit revolution. Java EE developers must adapt to those challenges to make the most out of hardware investment.

When Java first came out in 1997, the state-of-the-art PC had a single CPU with a less than 300MHz clock speed and less than 64MB of RAM. The first Java applications were mostly on the client side. High-performance multitasking and large memory handling were clearly not the priority for Java designers at that time. But as Java became widely adopted for server-side applications, things started to change. Web applications are inherently multithreaded since each web request can be handled in a separate thread, parallel to other requests. The latest Java platform has greatly improved performance on modern server hardware.

In this article, we look at the current state of enterprise Java and analyze the challenges it faces with the new generation of servers. Based on our experience working on Java EE applications running on the JBoss Application Server in the Dell Scalable Enterprise Technology Center, we provide solutions and tips to scale your Java EE applications to the latest server hardware.

Tune the JVM


The core of the Java platform is the Java Virtual Machine (JVM). The entire Java application server runs inside a JVM. The JVM takes many startup parameters as command line flags, and some of them have great implications on the application performance. So, let's examine some of the important JVM parameters for server applications.

First, you should allocate as much memory as possible to the JVM using the -Xms<size> (minimum memory) and -Xmx<size> (maximum memory) flags. For instance, the -Xms1g -Xmx1g tag allocates 1GB of RAM to the JVM. If you don't specify a memory size in the JVM startup flags, the JVM would limit the heap memory to 64MB (512MB on Linux), no matter how much physical memory you have on the server! More memory allows the application to handle more concurrent web sessions, and to cache more data to improve the slow I/O and database operations. We typically specify the same amount of memory for both flags to force the server to use all the allocated memory from startup. This way, the JVM wouldn't need to dynamically change the heap size at runtime, which is a leading cause of JVM instability. For 64-bit servers, make sure that you run a 64-bit JVM on top of a 64-bit operating system to take advantage of all RAM on the server. Otherwise, the JVM would only be able to utilize 2GB or less of memory space. 64-bit JVMs are typically only available for JDK 5.0.

With a large heap memory, the garbage collection (GC) operation could become a major performance bottleneck. It could take more than ten seconds for the GC to sweep through a multiple gigabyte heap. In JDK 1.3 and earlier, GC is a single threaded operation, which stops all other tasks in the JVM. That not only causes long and unpredictable pauses in the application, but it also results in very poor performance on multi-CPU computers since all other CPUs must wait in idle while one CPU is running at 100% to free up the heap memory space. It is crucial that we select a JDK 1.4+ JVM that supports parallel and concurrent GC operations. Actually, the concurrent GC implementation in the JDK 1.4 series of JVMs is not very stable. So, we strongly recommend you upgrade to JDK 5.0. Using the command line flags, you can choose from the following two GC algorithms. Both of them are optimized for multi-CPU computers.
  • If your priority is to increase the total throughput of the application and you can tolerate occasional GC pauses, you should use the -XX:UseParallelGC and -XX:UseParallelOldGC (the latter is only available in JDK 5.0) flags to turn on parallel GC. The parallel GC uses all available CPUs to perform the GC operation, and hence it is much faster than the default single thread GC. It still pauses all other activities in the JVM during GC, however.
  • If you need to minimize the GC pause, you can use the -XX:+UseConcMarkSweepGC flag to turn on the concurrent GC. The concurrent GC still pauses the JVM and uses parallel GC to clean up short-lived objects. However, it cleans up long-lived objects from the heap using a background thread running in parallel with other JVM threads. The concurrent GC drastically reduces the GC pause, but managing the background thread does add to the overhead of the system and reduces the total throughput.
Furthermore, there are a few more JVM parameters you can tune to optimize the GC operations.
  • On 64-bit systems, the call stack for each thread is allocated 1MB of memory space. Most threads do not use that much space. Using the -XX:ThreadStackSize=256k flag, you can decrease the stack size to 256k to allow more threads.
  • Use the -XX:+DisableExplicitGC flag to ignore explicit application calls to System.gc(). If the application calls this method frequently, then we could be doing a lot of unnecessary GCs.
  • The -Xmn<size> flag lets you manually set the size of the "young generation" memory space for short-lived objects. If your application generates lots of new objects, you might improve GCs dramatically by increasing this value. The "young generation" size should almost never be more than 50% of heap.
Since the GC has a big impact on performance, the JVM provides several flags to help you fine-tune the GC algorithm for your specific server and application. It's beyond the scope of this article to discuss GC algorithms and tuning tips in detail, but we'd like to point out that the JDK 5.0 JVM comes with an adaptive GC-tuning feature called ergonomics. It can automatically optimize GC algorithm parameters based on the underlying hardware, the application itself, and desired goals specified by the user (e.g., the max pause time and desired throughput). That saves you time trying different GC parameter combinations yourself. Ergonomics is yet another compelling reason to upgrade to JDK 5.0. Interested readers can refer to Tuning Garbage Collection with the 5.0 Java Virtual Machine. If the GC algorithm is misconfigured, it is relatively easy to spot the problems during the testing phase of your application. In a later section, we will discuss several ways to diagnose GC problems in the JVM.

Finally, make sure that you start the JVM with the -server flag. It optimizes the Just-In-Time (JIT) compiler to trade slower startup time for faster runtime performance. There are more JVM flags we have not discussed; for details on these, please check out the JVM options documentation page.

Use New Platform APIs


Besides the JVM, the Java platform libraries have also gone through extensive changes to accommodate the newer server hardware. We strongly recommend you upgrade your application to JDK 5.0+ in order to take advantage of all the performance enhancements built into the platform. Three new library APIs introduced in the last two major versions of the JDK are of particular importance for multi-CPU computers.
  • The concurrency utility library in JDK 5.0 is very important for multithread applications. It simplifies the Java thread API and provides a thread-safe set of Collection implementations. For instance, the new ConcurrentHashMap is a thread-safe HashMap and you can read/write it without a synchronized block. We will get to this in more detail later in this article.
  • The NIO (New I/O) library was introduced in JDK 1.4. It allows multiple threads to share one physical connection (e.g., a socket) to the hard disk or network. A thread no longer needs to block the I/O socket to read or write data. Using NIO, we can greatly reduce the thread waiting time caused by a limited number of blocked sockets. The NIO is especially useful in multi-CPU computers where CPUs often wait on the I/O, and where there are many threads.
  • The logging library introduced in JDK 1.4 provides a convenient API to log information from the application to the console, logfiles, or network destinations. The important performance feature of the logging library is that you can configure the logging output by changing the logging level at runtime via configuration files. This helps us to reduce logging--which involves slow I/O operation and is a major cause for CPU waiting--at runtime, without recompiling the application code.
You should write new applications and upgrade older applications to use the concurrency, NIO, and logging APIs whenever possible. If you cannot upgrade, you should use alternative open source libraries that provides similar features. For instance, Doug Lea's util.concurrent library has many of the same features as the JDK 5.0 concurrency API; furthermore, the Apache Log4j library is comparable to the JDK 1.4+ logging library.

Optimize Your Code


In the previous sections, we discussed the general guidelines to build and run high-performance Java EE applications for multiple CPU and large memory servers. However, each application is unique with its own performance requirements and bottlenecks. The only way to make sure that your application is optimized for your hardware is through extensive performance testing. In this section, we cover some of the basic techniques to diagnose performance problems in your application.

It's beyond the scope of this article to cover performance-testing tools and frameworks. In our tests, we used Grinder, an open source performance-testing framework in Java. It can simulate hundreds of thousands of concurrent users across multiple testing computers and gather statistics on a central console. It provides a utility for you to record your test scripts by going through your web application in a browser. The generated script is in Jython, and you can easily modify it to suit your own needs.

As we discussed before, tuning GC operations in the JVM is crucial for performance. The easiest way to see the effects of various GC algorithm parameters is to monitor the time the application spends on GC throughout the load testing. There are two simple ways to do it.
  • You can add a -verbose:gc startup flag to the JVM. The JVM would print out every GC operation and its duration in the console. If the server pauses due to a long full GC operation, you can optimize the GC parameters accordingly. If the system runs lengthy full GC very frequently, you probably have a memory leak somewhere in your application.
  • If you use the JDK 5.0 JVM, you can also use the JConsole utility to monitor the server usage of resources. The JConsole GUI shows how various regions of the memory are utilized and how much time is spent on GC (see Figure 1). To use the JConsole, you need to start the JVM with the -Dcom.sun.management.jmxremote flag and then run the jconsole command. JConsole can connect to a JVM running on the local computer, or to any JVM running on the network via RMI. You can use one JConsole instance to monitor multiple servers.

Figure 1. JConsole in JDK 5.0 (click for full-size image)

To pinpoint the exact location of a memory leak, you can use an application profiler. The JBoss Profiler is an open source profiler for applications inside the JBoss Application Server.

When the application is fully loaded, the CPU should run between 80% and 100% of its capacity. If the CPU usage is substantially lower, you should look for other bottlenecks, such as whether the network or disk I/O is saturated. However, an underutilized CPU could also indicate contention points inside the application. For instance, as we mentioned before, if there is a synchronized block on the critical path of multiple threads (e.g., a code block frequently accessed by most requests), the multiple CPUs would not be fully utilized. To find those contention points, you can do a thread dump when the server is fully loaded:
  • On a windows machine, type Ctrl-Break in the DOS terminal window where the server is started (i.e., the server console) to create a thread dump.
  • On a Linux/Unix system, run the kill -QUIT process_id command, where process_id is the ID of the server JVM process, to create a thread dump.
The thread dump prints out detailed information (stack trace with source code line numbers) about all current threads in the server. If all the request-handling threads are waiting at the same point, it would indicate a contention point, and you can go back to the code to fix it.

Sometimes, the contention point is not in the application but in the application server itself. Most Java EE application servers have not completely evolved their code base to take advantage of JDK 5.0 APIs, especially the concurrent utility libraries. In this case, it is crucial to choose an open source application server, such as the JBoss Application Server, where you can make changes to the server code.

Collapse the Tiers


Traditionally, Java EE had been designed for the multitiered architecture. This architecture envisions that the web server, servlet container, EJB server, and database server each runs on its own physical computer, and those computers are tied together through remote call protocols on the local network.

But with the new generation of more powerful server hardware, a single computer is powerful enough to run all those components for a medium-sized website. Running everything on the same physical machine is much more efficient than the distributed architecture described above. All communications are now interthread communications that can be handled efficiently by the same operating system or even inside the same JVM in many cases. It eliminates the expensive object serialization requirements and high network latency associated with remote calls. Furthermore, since different components tend to use different kind of server resources (e.g., the database is heavy on disk usage while Java EE is CPU-intensive), the integrated stack helps us to balance the server usage and reduce overall contention points.

Figure 2. Choose between call-by-reference and call-by-value in the JBoss AS installer (click for full-size image)

The JBoss Application Server has built-in optimizations to support the single JVM deployment. For instance, by default, JBoss AS makes call-by-reference method calls from the servlet to the EJB objects. Call-by-reference can be up to ten times faster than the standard Java EE call-by-value approach, because call-by-value requires object serialization and is primarily for remote calls across JVMs. Figure 2 shows that you can choose from the two call-isolation methods.

The JBoss Web Server project goes one step further and builds native Apache web server functionalities directly into the Java EE servlet container. It allows much tighter integration between components when deployed on the same server, and hence could deliver much better performance than older Java EE servers.

With the entire middleware stack running on the same physical server, we also drastically simplify deployment and management. When you need to scale the application up, you simply add a load balancer, move the shared database server to a different computer, and then add any number of server nodes with the integrated middleware stack (see Figure 3).

Figure 3. The load-balanced architecture

All web requests are made against the load balancer, which then forwards the requests to application servers in a manner that ensures all nodes have similar numbers of request per unit time. The load balancer should be configured to forward all requests from the same user session to the same node (i.e., use sticky session). Most Java EE application servers also support automatic state replication between the nodes to avoid application state loss when a node fails. For instance, the JBoss AS supports several state-replication strategies including buddy replication, in which each node can choose a "buddy" as its failover. Such load balancing and state replication would be very difficult to deploy and manage if we had three or four tiers of servers.

Virtualize the Hardware


The new multi-core 64-bit servers are capable of running heavy-load web applications. But for small web applications, they can be overkill. To fully utilize server capabilities, we sometimes run multiple small websites on the same physical server. Of course, you can deploy multiple applications on the same Java EE container. But to achieve the optimal performance, stability, and manageability, we often wish to run each application in its own Java EE container. How do we do that?

Technically, the primary challenge to run multiple Java EE server instances on the same physical server is to avoid the port conflicts. Like any other server application, the Java EE application server listens on TCP/IP ports to provide services. For instance, the HTTP service listens for web requests on the 80 or 8080 port; the RMI service listens for RMI invocation requests on the 4444 port; the naming service listens on the 1099 port; etc. For the server to run properly, it must obtain exclusive control over the port it listens to. So, when you start multiple server instances on the same computer, you are likely to get port conflict errors. There are three ways to avoid port conflicts.
  • Using virtualization software, such as VMware and Xen, you can run multiple operating systems on the same physical server. You can run a Java EE server instance in each of those guest OSes. The benefit of this approach is its flexibility in architecture and management. Each OS virtual machine can be independently provisioned, managed, and backed up. You can also choose to run the database server, the load balancer, or other server components in their own virtual OSes. However, the drawback of this approach is that there is relatively heavy overhead to virtualize the entire OS just to run the JVM.
  • Many Java EE servers allow you to start a server instance bound to a specific IP address. For instance, in the case of JBoss AS, you can start the server with the run.bat -b 10.10.20.111 command to start a server instance bound to IP address 10.10.20.111. You can assign multiple IP addresses to your server and then start a Java EE server instance on each of those IP address. Each server instance listens for the same port numbers on different IP addresses, and hence there is no conflict. This approach provides a good balance between server manageability and performance overhead. We recommend it if your application supports IP address binding.
  • In the unlikely case when you cannot assign multiple IP addresses to a physical server, an alternative is to reconfigure the port number used by each server instance so that they do not conflict. That would typically require you to go through reams of tedious configuration files, and you must be intimately familiar with those files to look for port numbers. In the JBoss AS, there is a simpler way: JBoss AS has an MBean service called jboss.system:service=ServiceBindingManager that can automatically shift port numbers used in the current server instance (e.g., increase all port numbers by 100 from their default values). In general, we do not recommend messing with port numbers since the application server is not regularly tested with nonstandard port numbers, and a number of complications and side effects could arise from such use.
To achieve optimal results, we should run no more server instances than the number of physical CPUs in the server; otherwise, the server instances would wait on one another to use the CPUs, creating more contention points. We should also keep the memory allocation for each server instance at around 1GB for optimal GC results.