Ultimate weapon to kill Java memory leaks



Introduction

A lot of programmers think that memory management is a relic of the C programming era. They believe that no memory control is required since the Garbage Collector arrived in our lives. That is almost true; Java is a good language that takes care of this uncomfortable detail. But… always? NO.

It is very common to see large installations at big customers that need to be rebooted every night just to avoid this problem, but that is only a grandmother's remedy. Memory leaks are usually discovered after a java.lang.OutOfMemoryError, yet the error itself is rarely the real problem: the main problem is the performance degradation the leak inflicts on the affected platform for a long time before the OutOfMemoryError finally shows up.

A typical answer is post-mortem analysis. Although it sounds like a reasonable approach, it has a problem: a post-mortem requires a corpse. You need to wait until your system crashes in production, grab a heap dump of 3 or 4 GB, load that information into an even bigger server, and take your time… no caffeinated drinks recommended.

Lucierna distributes a small and smooth solution: the Antorcha Memory Plumber. It is an elegant solution because the Memory Plumber only looks where Java memory leaks usually hide: it monitors Collections and Maps, detects which ones are growing for no good reason, captures the full stack trace of that growth, and reports it to you.

The installation process takes only two steps:

- Add the "-javaagent:aspectjweaver.jar -cp AntorchaMemoryPlumber.jar" options to your JVM or application server startup script.

- Run your application under load for 30 minutes and read the results from stdout.
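For a plain standalone application, the resulting command line might look roughly like the following sketch (the application jar and main class are illustrative placeholders, the jar paths depend on where you unpacked the download, and on Windows the path separator is ';' instead of ':'):

java -javaagent:aspectjweaver.jar -cp AntorchaMemoryPlumber.jar:myapp.jar com.example.MyApp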

 

 

Installation examples

Standalone Demo App

Using Antorcha Memory Plumber with a standalone Java application is easy: just download the zip file and uncompress it; there you will find everything required to run it properly, including third-party libraries as well as licenses and some examples.
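Before running the bundled example, it helps to picture the classic pattern Antorcha hunts for: a collection that only ever grows. Here is a minimal, hypothetical sketch (illustrative names only, not the bundled source):

// Hypothetical leaking class: a static collection that is never cleared,
// so every add() is retained until the heap is exhausted.
import java.util.ArrayList;
import java.util.List;

public class LeakyCache {

    private static final List<byte[]> CACHE = new ArrayList<>();

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            CACHE.add(new byte[1024]);   // grows without bound
            Thread.sleep(10);            // keeps the app running "under load"
        }
    }
}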

To give you a quick-start kit, a leaking class called "abr" is included as well. The next snapshot shows the standard output you get when running the "abr" leaking class:


 

WebApp running under App Server

Installation for any Java application server is easy and basically requires the same steps: add the -javaagent option with the AspectJ weaver, and put Antorcha Memory Plumber on the classpath, either with the '-cp' option or by adding it to the CLASSPATH environment variable. Depending on your application server the steps can differ slightly.

For the Tomcat application server, here are the details:

1) Copy aspectjweaver.jar and antorchaMemoryPlumber.jar to any folder the Tomcat user can access, for instance /opt/Tomcat/antorcha.

2) Modify the startup script /opt/Tomcat/bin/startup.sh so that it includes:

PRGDIR=`dirname "$PRG"`
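# Attach the AspectJ load-time weaver agent (append it to JAVA_OPTS if you already set other options there)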
export JAVA_OPTS="-javaagent:/opt/Tomcat/antorcha/aspectjweaver.jar"
EXECUTABLE=catalina.sh

Then make sure the file /opt/Tomcat/antorcha/antorchaMemoryPlumber.jar is on the classpath; you can do this by modifying the setclasspath.sh script or, as a quick approach, by simply copying it to the TOMCAT_HOME/lib folder.
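For instance, the quick approach is just a copy of the jar into Tomcat's lib folder (using the example paths above; adapt them to your installation):

cp /opt/Tomcat/antorcha/antorchaMemoryPlumber.jar /opt/Tomcat/lib/

Alternatively, appending a line such as the following near the end of setclasspath.sh achieves the same result (the exact place may vary between Tomcat versions):

CLASSPATH="$CLASSPATH":/opt/Tomcat/antorcha/antorchaMemoryPlumber.jar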
 

How does Antorcha Memory Plumber work?

Antorcha Memory Plumber works through Java bytecode instrumentation, based on AspectJ, modifying every method that calls collection and map operations. It tracks each addition and then samples the collections or maps that keep growing. After a few stages it verifies the leaking objects and gives you not only the leaking collection and the last method where the collection/map was increased, but also the full stack trace, so you can easily identify leaks even in third-party libraries, HttpSession leaks and so on.
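As a rough idea of what this AspectJ-based approach looks like, the sketch below shows the kind of advice such instrumentation relies on; it is a conceptual illustration only, not Antorcha's actual aspects:

// Conceptual sketch: advice that fires after application code adds to a Collection,
// recording the collection's identity, size and the calling stack for later sampling.
import org.aspectj.lang.JoinPoint;
import org.aspectj.lang.annotation.AfterReturning;
import org.aspectj.lang.annotation.Aspect;

@Aspect
public class CollectionGrowthTracker {

    @AfterReturning("call(* java.util.Collection.add(..)) && target(col)")
    public void trackAdd(JoinPoint jp, java.util.Collection col) {
        int size = col.size();
        StackTraceElement[] stack = Thread.currentThread().getStackTrace();
        // In a real tool, (identity, size, stack) would be stored and periodically
        // sampled to spot collections that keep growing.
    }
}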

As you can imagine, for Antorcha Memory Plumber to work you will need to stress-test your application during the time the tool is running (approx. 27 minutes). The full version of Antorcha Memory Plumber includes auto-tune capabilities to set proper heuristic thresholds such as the hunting time; with this "Light" version you should still be able to find any leaking class as long as it implements a Collection/Map interface (the most common cause of memory leaks) and that collection/map reaches a total of at least 500 elements while Antorcha Memory Plumber is up.

 

Advanced configuration

Configuration parameters:

* amp.period.startup.delay: Minutes the memory plumber waits before it starts searching for leaks. If not set, the default value is 2.
* amp.collectionelements.minimum: Minimum number of elements a collection must reach before it starts being considered a leak. If not set, the default value is 500.
* amp.period.suspect: Time in minutes during which a collection is treated as a suspect; after that time, the collection is removed from the suspect list if it has decreased. If not set, the default value is 10.
* amp.iterations.numberof: Number of iterations in which the memory plumber will look for potential leaks. If not set, the default value is 3.
* amp.iterations.period: Wait time in seconds between iterations. If not set, the default value is 5.

Usage example:

Setting a delay of 4 minutes before starting to look for leaks and a suspect period of 20 minutes, leaving the rest of the options at their default values:

java ….. -Damp.period.startup.delay=4 -Damp.period.suspect=20 …..

Download

Downloading is easy: if you are not a registered user you will be asked to register, which is a very straightforward process: just enter your e-mail, name, surname, country and company, and that's it. Once you are registered, log in and you will be able to download Antorcha Memory Plumber.

If you find it useful or interesting, or if you have any doubts, please post a comment; your questions and feedback are always welcome.


CEP and APM

Complex Event Processing (CEP) should be a key technology for any monitoring solution that wants to provide advanced diagnostics or near-predictive behavior. As usual, though, the technology is an important point but not the only one.

Another really key aspect is how you model the system or platform you are trying to monitor, and how you map and adapt the underlying technology to match that modeled universe. For instance, two monitoring solutions could both use CEP to correlate events and monitor a 3-tier application: one modeling primary events as those related to Infrastructure Management (IM), such as CPU, memory and device availability, and the other modeling events as single-transaction metrics, for instance the number of handled exceptions within a particular application server, or the average response time per transaction and node.

With the first approach you get a fuzzy view of your platform and stay stuck with classical IM information, full of unconnected silos. With the second you get a detailed view of how your business transactions flow across your infrastructure, and of the impact and relevance of every element involved. From the point of view of the technologies used, both solutions are identical; in functional advantages, business benefits and triaging capabilities they are totally different.

Combining CEP and APM by modeling the main events from transaction metrics, captured whenever a call crosses a physical or logical border, is an approach that brings a lot of benefits to the monitoring scene. Let's take a simple example: imagine that every single click made in the web layer were automatically monitored and tracked across your whole platform. Every time a system involved in your transaction responds to a calling system, an event is generated containing the called system's name, the response time of that particular piece of the call, its status and some other information… Then imagine a central place where all these events are aggregated and processed in real time, computing aggregate functions such as average response time and standard deviation over a temporal sliding window; this is where CEP plays its key role, correlating those events and computing key performance indicators. As you can imagine, you can easily obtain metrics such as the average response time of any particular node in your network, not as a silo but as a piece within higher-level concepts such as Applications or Services.
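As a rough illustration of that aggregation step (a hypothetical sketch, not any particular CEP engine's API), a sliding-window average over such transaction events could look like this:

// Illustrative sketch of sliding-window aggregation over transaction events.
import java.util.ArrayDeque;
import java.util.Deque;

public class ResponseTimeWindow {

    // One transaction event, emitted when a call crosses a node boundary.
    public record Event(long timestampMillis, String node, double responseMillis) {}

    private static final long WINDOW_MILLIS = 60_000;   // 1-minute sliding window
    private final Deque<Event> window = new ArrayDeque<>();

    // Add a new event and drop everything that fell out of the window.
    public synchronized void onEvent(Event e) {
        window.addLast(e);
        evictOlderThan(e.timestampMillis() - WINDOW_MILLIS);
    }

    // Average response time of the events currently inside the window.
    public synchronized double averageResponseMillis(long nowMillis) {
        evictOlderThan(nowMillis - WINDOW_MILLIS);
        return window.stream().mapToDouble(Event::responseMillis).average().orElse(0.0);
    }

    private void evictOlderThan(long cutoffMillis) {
        while (!window.isEmpty() && window.peekFirst().timestampMillis() < cutoffMillis) {
            window.removeFirst();
        }
    }
}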

The ability to correlate those events within complex streams boosts the system's diagnostic capabilities and turns a monitoring solution that works this way into a near real-time diagnostic tool, with near-predictive capabilities when used together with a dynamic, tight thresholding strategy such as SUS.

TeX – Telemeasuring customer experience (almost nothing new)

There are several approaches out there to measuring end-user experience. The trendy one is passive monitoring: people seem to go crazy about deploying a new gadget on their network to process huge amounts of traffic. Then, like a magic spell, they claim that what the probe monitors is the end-user experience, because of TCP stream continuity…

Another approach consists of using a robot that emulates users and extrapolating the robot's experience as the end-user experience.

Neither of these solutions is able to monitor true end-user experience. The first one does not monitor network elements outside your data center (imagine all the other devices out there: content-filtering solutions, transparent caches and proxies), and of course it knows nothing about the load and performance of the user's laptop, or even the browser rendering times.

The second one does not measure any real user experience at all; it creates extra load in your environment, and only when there is already a fire inside your environment does the robot notice the smoke…

I wonder why simple, elegant, effective and costless solutions have been abandoned: web applications and web environments allow you to monitor user experience from the user's point of view, with no need to extrapolate and nothing to deploy. JavaScript was designed for that purpose.

Google Analytics and plenty of other tracking solutions demonstrate this approach. So, what went wrong with using JavaScript to monitor end-user experience? Why haven't tools such as Yahoo Boomerang been deployed massively instead of other complications?

IMHO the answer is deployability: until now, deploying a JavaScript snippet such as Google Analytics for tracking, or such as Yahoo Boomerang, was not affordable if you had to modify all of your JSPs and servlets one by one. That may be the reason why this kind of solution has not proliferated.

What if your APM solution could automatically embed this kind of JavaScript, such as Boomerang or Analytics, by means of bytecode instrumentation, transparently for you, with no noticeable extra overhead, no HTTP filter deployment, and so on? What if this kind of solution could even tell you the current end-to-end bandwidth from the customer's PC up to your application server?

TeX follows this approach in order to monitor "True Enduser eXperience", and it can also be used to inject the Analytics script with no configuration at all!

SUS – Straight lines are not enough

Are you annoyed with your APM solution when it decides it is important to warn you at 03:00 am because the backup is slowing down your computer? When you are watching the latest film at the cinema on Friday night and you receive an SMS notifying you that the platform is showing too many errors, do you turn it off?

Life is not a straight line, and information systems cannot be modeled with straight lines. If your monitoring solution only allows you to set one or two thresholds, one for the rush hour and another for the nights, you won't notice when something is really going wrong; your tool will alert you only once some customers are already being impacted. Or, even more important, it will bother you frequently when most of the time it means nothing to the business, so you will close your eyes and wait for Monday morning.

Thanks to different usage profiles, advanced statistics and genetic algorithms, SUS self-learns every single second how your system behaves, providing high-resolution, auto-calculated thresholds when your system behaves consistently and more tolerant auto-calculated thresholds when it does not. By doing this you will dramatically reduce the number of false positives, but most importantly you will be warned whenever the system starts to deviate, perhaps before users have experienced any issue.

Stay connected; I will soon provide more detail on how SUS works.

Simplest is possible

Yes, it's true: we iterate at least twice when designing any function, any key piece and any engine. It is our main mantra; we do not believe in complex tools showing teratons of weird metrics. We do not change what we believe is right only because of commercial issues or requirements.

OK, it is true that a lot of people still believe that the more metrics you have and the more complex the interface is, the more advanced the tool is and the more value it brings; however, this is completely opposite to what we believe is best: simply and clearly identifying when you have a problem, and where and why it happened.

SAD – Stop the finger-pointing game

Seated around a round table and listening: "It's not my fault, it must be someone else's… the database is working." "No, it's not the application server. There is no evidence; the thread pool and the connection pools are performing well. It should be an infrastructure issue." "Oh, definitely not, there is no infrastructure issue: there is CPU available as well as enough memory. No idea what is going on, but the servers are working well."

Why not have a topological, graphical target that relates servers, applications, backends, services, load, performance and issues, all at a glance? What if you were able to triage where your platform problems appear simply by following the "red path"? This is exactly how we believe any Application Performance Management solution should work: allowing you to quickly triage which systems, services, applications and backend servers are being impacted by a specific problem.

There is no need to define anything, absolutely nothing; just let Antorcha auto-discover and depict all of these relations for you. Have a look at it:

SAD target

Antorcha automatically discovers all your backends thanks to SOS: since everything is always being monitored, it is easy to identify any backend and show specific metrics for it, no longer as a vertical silo but as a key piece within the context of a specific application, belonging to a specific service, hosted on a specific application server that is using the conflicting backend.

We have not been able to find an easier and more precise way to quickly triage problems in your IT ecosystem. However, if you have a better approach, please share it using the comments box…

SOS – No more alone in the dark

Have you ever felt alone in the dark while trying to identify a performance issue? Do you need the exception stack trace that was never written to the log? Is your current APM solution not instrumenting the code you need? OK, this means that you don't know what SOS (System Overhead Suppressor) is, and how it can simplify your life as a developer, support engineer or operations manager.

Being able to drill down to the single line of source code without noticeable overhead has always been a dream. There are several fantastic APM tools out there able to provide useful information when you are trying to isolate a problem. However, all of them work in the same way: they only monitor what you tell them to monitor. The reason: if they monitored everything, all the time, their overhead would grow to profiler levels, turning the monitored platform into an electronic tortoise.

SOS brings you exactly the opposite: instead of telling the monitoring tool what you want monitored, Antorcha monitors everything by default (note for competitors: yes, we actually monitor 100% of the code, 100% of the time, with the same granularity, and with no "Start advanced monitoring" button).

Doing this, following an agile development methodology and, of course, not wasting any CPU time calculating useless analytical metrics in the agent is how Antorcha is able to monitor everything with no configuration and with an end-to-end overhead below 2% (average RTT increase).

Still don't believe it? Have a look at this picture:

Antorcha Transaction Trace