Support Play: A methodology for troubleshooting performance
This Support Play presents a comprehensive performance analysis methodology and troubleshooting process for Versions 4.2 and 5.1. The goal is to provide a framework that you can use to identify application performance problems and undertake successful tuning projects.
The analysis begins by investigating the performance of the application. If the source of the performance issue is not in the application, the investigation advances to examine the system environment.
Other PDN documents and articles provide details on specific tools and methodologies for analyzing and tuning performance. Refer to those documents for details on a particular process or tool.
Understanding Performance
Before performing any performance troubleshooting on a Process Commander system, it is necessary to understand how performance is defined and measured relative to different parts of the system.
Performance Context
The first step in any performance problem analysis or tuning project is to establish a context in which the performance of the system may be judged. In this Play, this concept will be called the performance context.
A performance context is the means by which an observer or user of the system may judge whether performance is “good” or “bad,” by using some pre-established measure. Because words that a user will employ to describe performance (“good,” “bad,” “slow,” etc.) are relative terms, it is necessary to create a performance context to provide an objective and quantifiable means to measure the relative performance of the system.
A performance context is data-based, and relies on at least one of the following measurements:
- SLA (Service Level Agreement) – performance target values for an application
- Baseline application performance data
- Scale or load test performance data
SLA
A Service Level Agreement is a performance measurement which is defined by the business requirements of this application. Example SLAs might be:
- “The system must perform an account lookup in 5 seconds or less.”
- “The system must be able to process a claim in 10 seconds.”
All applications should have SLAs defined – and they should be defined in the conception phase for the application, not after the application is already built (and “seems slow”). Developers and users must determine how they want the application to operate.
Pegasystems cannot define these SLAs; they vary from application to application, based on the type of work the application supports.
For example, a developer has created a form that retrieves and computes loan data. This new application replaces a manual approach that took a week to process each loan request. The new application performs a great deal of computation, with data drawn from several different systems. When a user submits the loan form, it takes the new application three minutes to complete. However, three minutes is a great improvement on one week, and if the organization only processes two of these forms per day, that time is acceptable.
Another organization with a call center uses Process Commander to enter and route calls. This application has 10,000 calls entered every day. In this case, a 3-minute processing time is clearly unacceptable; this form should be processed in no more than 5 seconds, or users will fall hopelessly behind. Thus, the business must determine a reasonable SLA for the work they are trying to do.
These SLAs can then be used, during the development process and during production, to assess whether the application performs as the organization requires.
NOTE: Don't confuse these business SLAs with the SLAs (service level rules) in the application. SLAs defined inside the application are part of the processing flow. (For example, “If this new call isn’t closed within two days, send an email to the worker’s manager to follow up.”) There is a distinct difference between how the office's work is processed and how quickly the application responds to the user; the SLA that matters for performance problems is the response time to the user.
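A business SLA of this kind can be written down as data and checked mechanically. The following is a minimal sketch in Python, not a Process Commander feature; the operation names are hypothetical, and the target values are taken from the example SLAs in this section.

```python
# Hypothetical business SLA targets, in seconds, keyed by operation name.
SLA_TARGETS = {
    "account_lookup": 5.0,
    "process_claim": 10.0,
}


def meets_sla(operation, measured_seconds, targets=SLA_TARGETS):
    """Return True if the measured response time is within the SLA target."""
    if operation not in targets:
        raise KeyError(f"No SLA defined for operation: {operation}")
    return measured_seconds <= targets[operation]


if __name__ == "__main__":
    print(meets_sla("account_lookup", 4.5))   # within the 5-second target
    print(meets_sla("process_claim", 19.0))   # exceeds the 10-second target
```

Raising an error for an operation with no target is a deliberate choice here: a missing SLA should be noticed, not silently treated as a pass.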
Baseline Application Performance Data
As a best practice, you must have business SLAs (as stated in the previous section). However, the occasional application will slip by without these having been defined. In this case, baseline application data may serve.
As the application is being developed, run PAL at multiple points in the development process to determine the elapsed time and CPU time required to complete a form. If you have a baseline reading of 3 seconds to process a form in development (for example), but when the application is moved into production, that same form requires 18 seconds, there might be a performance problem that requires further research.
When measuring performance using baseline data, the most accurate measurement will be using data from the same system (the same server and application in development that then went into production). If this parallel data is not available, use the closest possible approximation. For example, there may not be baseline data available from the development system; in this case, use performance data from a UAT system, if available.
Important: Always test performance during the development phase. Do not wait until the system is in production (or the development is almost finished) to find out that the way the application is built makes the system too slow to be usable.
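The baseline comparison described in this section comes down to a ratio between two readings. The sketch below assumes you have recorded elapsed times by hand or from PAL; the function names and the `max_factor` tolerance are illustrative assumptions, not documented values.

```python
def slowdown_factor(baseline_seconds, current_seconds):
    """Return how many times slower the current reading is than the baseline."""
    if baseline_seconds <= 0:
        raise ValueError("baseline must be positive")
    return current_seconds / baseline_seconds


def needs_investigation(baseline_seconds, current_seconds, max_factor=2.0):
    """Flag a reading that is more than max_factor times the baseline.

    max_factor is an assumed tolerance; pick one that suits your application.
    """
    return slowdown_factor(baseline_seconds, current_seconds) > max_factor


if __name__ == "__main__":
    # The example from the text: 3 seconds in development, 18 in production.
    print(slowdown_factor(3, 18))
    print(needs_investigation(3, 18))
```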
Scale or Load-Test Performance Data
Just as with baseline data, scale or load-testing performance should be measured during the development of the application. An application might perform well when five users are working on it, but once a full group of 2,000 users begins work, the system slows down to the point where it is unusable.
To prevent this problem from happening when the system goes into production, load-testing should be done in the development cycle. The amount of elapsed or CPU time required for one user should be measured in the system, and then compared to the amount of time required for large groups of users.
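One way to summarize load-test results is to express each measurement relative to the single-user time and to find the first tested load that breaks the SLA. This is a sketch only; the timings shown are made-up sample data, and the dictionary layout is an assumption.

```python
def load_degradation(single_user_seconds, timings_by_user_count):
    """Map each tested user count to its time relative to one user."""
    if single_user_seconds <= 0:
        raise ValueError("single-user time must be positive")
    return {
        users: seconds / single_user_seconds
        for users, seconds in sorted(timings_by_user_count.items())
    }


def first_unacceptable_load(timings_by_user_count, sla_seconds):
    """Return the smallest tested user count that exceeds the SLA, or None."""
    for users, seconds in sorted(timings_by_user_count.items()):
        if seconds > sla_seconds:
            return users
    return None


if __name__ == "__main__":
    sample = {5: 1.2, 500: 2.0, 2000: 9.5}  # hypothetical measurements
    print(load_degradation(1.0, sample))
    print(first_unacceptable_load(sample, sla_seconds=5.0))
```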
Performance Problem Types
As stated above, when anyone complains that “performance is slow,” more details are required to pinpoint the problem to begin troubleshooting.
There are several types of performance problems:
- Function-specific
- General system-wide
- Time-specific
- User-specific (one user or a group of users)
- Agent-related
- Service-related
- Environment
Function-Specific
A function-specific problem occurs only when the user performs one specific function. For example, users may experience slow performance when opening a form for a customer call; but if they try some other process, such as creating a new account record, that process runs fine, with no perceived slowness.
General System-Wide
If the issue is not with one particular flow or user, but the system seems overall to be slow, there may be a system-wide performance problem.
Time-Specific
There may be a performance issue, but it only occurs at certain times of the day. Users may be able to work efficiently in the morning, but the system reportedly “slows down” in the afternoon.
User-Specific
The performance problem may not affect all users. If only one user experiences the slowness, that indicates a different issue than if all users are affected.
Similarly, one group of users may have problems. “The Toronto Office” may be reporting much more slowness than “the main office in Chicago,” or “the HR division” may have performance issues, while the rest of the company (“Engineering,” “Finance,” “Legal”) may have no problems.
Agent-related
In Process Commander, processing is performed not only by users, but also by agents. In some cases, an agent may not have completed its work by the time it was expected.
Service-related
A Process Commander application can be accessed directly by users (the “BPM model”), or it may be used as a processing engine by a separate application (the “BRE model”), connecting through service rules. There may be performance problems relating to the expectations of these service calls (SOAP, MQ, etc.). If your application has a BRE setup (is service-based), you may have a situation where users feel the response time into the Process Commander system is slow because one or more services are not performing up to the service-level agreements for the system.
Environment
In some systems, the PAL statistics and log information will look fine, even though performance is slow. In this case, the problem may not be with the application itself; there may be an environment issue that is causing the system slowdown, such as issues with the network, the user workstation (client), or the PegaRULES database.
Investigation Tools
These important tools are available in the system to track different aspects of performance:
- PAL
- DB Trace
- Garbage Collection
- Log-Usage
PAL
PAL stands for Performance AnaLyzer, and is a collection of counters and timer readings that an application developer uses to analyze some performance issues in a system. Process Commander captures the information necessary to identify processing inefficiencies or excessive use of resources in your application and stores this data in “PAL counters” or “PAL readings” for one requestor or node.
PAL is a tool which should be used to gain insight into where the system is spending resources; poor performance, such as delays in processing, refreshing screens, submitting work objects, or other application functions, will be highlighted. Use PAL to determine whether there are resource issues impacting performance, or whether issues may arise when more load is added to the system.
NOTE: These PAL readings are not meant to give developers a definitive answer about performance problems. PAL readings highlight processes which fall outside of the norm. Depending upon how the application is constructed, there may be good reasons why a particular application has readings at a certain level; something which in general might be considered too high a reading might be correct for your application. PAL gives the developer the capability to analyze and explain these readings, as well as investigate problem readings.
For full details on how to take and analyze PAL readings, see:
- Using Performance Tools in Process Commander Version 4.2
- Using Performance Tools in Process Commander Version 5.1
DB Trace
DB Trace is a tracing facility used to assist in tuning system performance. If users are perceiving that work items in the system take a long time to process, and the PAL readings point to database interactions, the DBTrace facility might help determine where the time was being spent. DBTrace displays a lot of low-level detail about system-level interactions with the database.
This function records the sequence of database SQL operations that Process Commander performs during processing, such as reads, commits, etc. Unlike the Trace facility, DBTrace cannot support real-time display of operations data. Instead, DBTrace places data into a text output file which records all the low-level database operations that Process Commander performs. Then you can use an Excel template included in DBTrace materials to format this data for viewing.
DBTrace should be used when the PAL counts show:
- the Database Access counts are high
- Elapsed Time is high (especially if CPU time is low)
- the Database Threshold is exceeded
- Elapsed Time for Connects is high (this could be some other connect, such as Rule-Connect-SOAP, but it also tracks Rule-Connect-SQL, which is used for many SQL requests to the database)
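The conditions in this list can be treated as a checklist applied to one PAL reading. The sketch below is not part of PAL: the `PalReading` fields are hypothetical names for values you would copy from the PAL Detail display, and every numeric limit is a placeholder assumption to be replaced with values from your own performance context.

```python
from dataclasses import dataclass


@dataclass
class PalReading:
    """Hypothetical summary of one PAL reading; field names are illustrative."""
    database_access_count: int
    elapsed_seconds: float
    cpu_seconds: float
    database_threshold_exceeded: bool
    connect_elapsed_seconds: float


def reasons_to_run_dbtrace(reading, max_db_count=100, max_elapsed=5.0,
                           max_connect_elapsed=2.0):
    """Return the conditions from the checklist that this reading meets."""
    reasons = []
    if reading.database_access_count > max_db_count:
        reasons.append("database access counts are high")
    if reading.elapsed_seconds > max_elapsed:
        # Low CPU relative to elapsed time suggests the time is spent waiting.
        if reading.cpu_seconds < reading.elapsed_seconds / 2:
            reasons.append("elapsed time is high while CPU time is low")
        else:
            reasons.append("elapsed time is high")
    if reading.database_threshold_exceeded:
        reasons.append("database threshold exceeded")
    if reading.connect_elapsed_seconds > max_connect_elapsed:
        reasons.append("elapsed time for Connects is high")
    return reasons
```

An empty list means none of the conditions applies, in which case another tool is probably a better starting point.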
If a complex SQL query to the PegaRULES database is taking a long time to return, DBTrace can give further information on details of the query. For full details on using DBTrace, reference these System Tools documents:
- Using Performance Tools in Process Commander Version 4.2
- Using Performance Tools in Process Commander Version 5.1
Garbage Collection
Process Commander runs within the Java Virtual Machine (JVM), which allocates space for applications to run in virtual memory. This space is known as the “heap.” On initialization, the JVM allocates the whole heap in a single contiguous area of virtual storage; within that limit, the heap may expand or contract, depending upon the space required for various objects. Objects are allocated space in the heap, used, and then discarded.
An object continues to be “live” while a reference (or pointer) to it exists somewhere in the system; when an object ceases to be referenced from the active state, it becomes garbage. Rather than keeping all the discarded objects in the heap (which would eventually take up all the space in the system), the Garbage Collector runs and removes the objects from allocation, freeing up the memory they occupied for reuse.
Garbage collection is a key function to check when troubleshooting performance. The time the system spends collecting garbage is essentially all CPU processing time; thus, the system can suspend all user processing while running the garbage collection process. This means that garbage collection can have a massive effect on performance.
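A simple way to quantify this effect is to express the time spent in garbage collection as a percentage of an observation window. The sketch below assumes you have already extracted pause durations from your JVM's garbage collection output; how to obtain them depends on the JVM and is covered in the support plays listed below. The sample numbers are invented.

```python
def gc_overhead_percent(pause_seconds, window_seconds):
    """Percentage of an observation window spent in garbage collection."""
    if window_seconds <= 0:
        raise ValueError("window must be positive")
    return 100.0 * sum(pause_seconds) / window_seconds


def longest_pause(pause_seconds):
    """Longest single pause, which is what a user perceives as a freeze."""
    return max(pause_seconds, default=0.0)


if __name__ == "__main__":
    pauses = [0.5, 1.5, 1.0]            # hypothetical pauses, in seconds
    print(gc_overhead_percent(pauses, 60.0))
    print(longest_pause(pauses))
```

Both numbers are worth tracking: a low overall percentage can still hide one long pause that users notice.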
For full details on garbage collection in the JVM, reference the appropriate support play:
- Support Play: Tuning the IBM JVM 1.4.2 for performance
- Support Play: Tuning the Sun JVM 1.4.2 and 5.0 for performance
- Tuning Your IBM JVM 1.5 (future)
- Tuning Your Sun JVM 1.5 (future)
For details on tracking garbage collection in Version 4.2, search the Help system for "System Console".
For details on tracking garbage collection in Version 5.1, reference the System Management Reference Guide.
Log Usage
Unlike the PAL tool, which shows data for one node only, Log-Usage reports show overall system activity. Based on the number of interactions, Log-Usage shows various Performance Analyzer counts of system usage, so the system administrator can see what activities are going on from a system-level perspective.
For example, if a user complains that the system is slow at 4 p.m., the system administrator can choose Start and Stop parameters to highlight that time and see whether there was a sudden spike in the number of users, or in activities run by existing users, or some other occurrence that might cause a system slowdown.
- In Version 4.2, Log-Usage statistics are available from the System Console. For details, reference the Help system.
- In Version 5.1, Log-Usage statistics are available in the System Management Application. For details, reference the System Management Reference Guide.
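The time-window analysis described above can also be done on exported data. The sketch below counts distinct users per clock hour between a start and stop time. The `(datetime, user_id)` record shape is an assumption made for illustration; it is not the Log-Usage report format.

```python
from collections import defaultdict
from datetime import datetime


def users_per_hour(records, start, stop):
    """Count distinct users per clock hour for records with start <= time < stop.

    `records` is an iterable of (datetime, user_id) pairs (assumed shape).
    """
    seen = defaultdict(set)
    for when, user in records:
        if start <= when < stop:
            hour = when.replace(minute=0, second=0, microsecond=0)
            seen[hour].add(user)
    return {hour: len(users) for hour, users in sorted(seen.items())}


if __name__ == "__main__":
    day = (2024, 1, 1)  # arbitrary sample date
    sample = [
        (datetime(*day, 15, 10), "a"), (datetime(*day, 15, 20), "b"),
        (datetime(*day, 16, 5), "a"), (datetime(*day, 16, 10), "b"),
        (datetime(*day, 16, 15), "c"), (datetime(*day, 17, 5), "a"),
    ]
    counts = users_per_hour(sample, datetime(*day, 15), datetime(*day, 17))
    for hour, count in counts.items():
        print(hour.strftime("%H:%M"), count)
```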
Runtime Alerts
Beginning in Version 4.2 SP5, runtime alerts are available. You can set thresholds for different interactions; when these thresholds are exceeded, then warnings or errors are written to PegaRULES log files.
Version 4.2
Three alerts are available in V4.2.
Alert | Measures |
---|---|
Database Activity Threshold | the volume of data being retrieved from the BLOB in the database. This threshold is designed to recognize when queries are inefficiently designed, resulting in a return of too much data, or when a return technique has gone awry, and data is being loaded indiscriminately. |
Interaction Time Threshold | the amount of time the Process Commander server takes to respond to an HTTP request from a client machine. This threshold helps developers recognize when the request time is too long. |
Operation Time Threshold | the amount of time that one database operation should consume |
There are two levels set for these thresholds: warning and error.
- If an interaction exceeds the warning threshold set in the pegarules.xml file for these alerts, then a stack trace is written to the PegaRULES ALERT log file. (The location of this file varies by application server; for example, in a Tomcat installation, this file is in the bin directory under the main Tomcat subdirectory.)
- If an interaction exceeds the error threshold set in the pegarules.xml file for these alerts, then a stack trace is written to the PegaRULES log file. (The location of this file varies by application server; for example, in a Tomcat installation, this file is in the bin directory under the main Tomcat subdirectory.)
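The two-level scheme can be pictured as a simple classification of each measured value. This sketch is only an illustration of the logic described above: the threshold numbers are invented, and the actual setting names and defaults are documented in the pegarules.xml reference, not here.

```python
def classify(measured, warning_threshold, error_threshold):
    """Return 'error', 'warning', or 'ok' for one measured value."""
    if warning_threshold > error_threshold:
        raise ValueError("warning threshold must not exceed error threshold")
    if measured > error_threshold:
        return "error"      # stack trace goes to the PegaRULES log file
    if measured > warning_threshold:
        return "warning"    # stack trace goes to the PegaRULES ALERT log file
    return "ok"


if __name__ == "__main__":
    # Hypothetical thresholds, in milliseconds.
    for value in (400, 1500, 6000):
        print(value, classify(value, warning_threshold=1000, error_threshold=5000))
```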
For more details on these settings, reference the pegarules.xml File Settings Reference.
Other system performance tools
There are a number of tools that report on the system performance, such as the Alert log and Performance Analysis (PAL). However, unless the developer specifically seeks out the information, problems with the application can go unnoticed until the system is in production.
It is recommended that you use these tools to help you identify, investigate, diagnose, and remediate issues that may arise in development and production environments:
- The PegaRULES Log Analyzer (PLA) tool is a Web application that consolidates and summarizes three types of logs from individual JVM server nodes in your application system. The log data provide key information about operational and system health. For more information, see Understanding the PegaRULES Log Analyzer.
- The Autonomic Event Services (AES) Enterprise Edition analyzes performance and health indicators from multiple SmartBPM systems across the enterprise. AES is an intelligent agent that can predict and notify administrators when system performance or business logic problems occur. AES provides suggestions and administration tools to correct them. For more information, see About Autonomic Event Services (AES) Enterprise Edition.
Troubleshooting Strategy
There are a number of steps to take to troubleshoot performance. Investigation should begin by questioning the user. The complaint “Performance is slow” is vague; ask questions to focus the analysis in the correct area to solve the problem. The answers to these questions will point to different areas of the system, and suggest the use of tools appropriate for the investigation.
Users can provide a general estimate of how fast the system or process is running by timing the “return time” (the time the system takes to process a request and return control to the user) by their watches. This gives a basic idea of how slow the system is (“it takes five minutes by my watch to submit a form” is a problem!); more precise timings may be determined later, if necessary.
Test Against Performance Context
Before doing any other testing, it is important to determine whether the performance is actually slow, or whether it is just the user’s perception. The application’s SLA and the baseline readings should provide guidelines for this determination. For example, if the application SLA states that “The system must process an account form in 5 seconds or less,” and the form is processed in 4.5 seconds, but the user thinks it’s slow because Word documents save in 0.3 seconds, that is a user perception problem. The application is achieving its goal.
However, if the form in fact required 19 seconds to save, that would be a legitimate performance issue, and additional tests should be run to determine the type of performance problem.
Determine Type of Problem
After determining that there is a performance problem (instead of a user perception), ask the users questions to begin identifying the type of performance issue they are encountering.
User Questions
What were you doing when you noticed this problem?
This question helps determine whether this is a function-specific problem. If users report that they were opening a form or saving a form and the system was slow, then ask them to perform another process in the application, perhaps a different form, or creating a different kind of record (creating a new customer account, instead of opening a customer call). See if the system still seems slow, or if it was only that one function.
- If the problem is with one specific function, go to the Function-Specific Problem section.
- If the problem seems to occur with several processes, continue with the questions.
Do your coworkers notice this problem?
Have the user check whether another user on the same system has this problem. This is a two-fold problem, involving the user’s ID and their workstation. To test both of these conditions, have the user test:
- their ID on another workstation
- another ID on their workstation
- another ID on another workstation
If the user’s ID is slow on their workstation but fast on another workstation, then the problem might lie with the user’s workstation – go to the Client Performance Statistics section.
If the user’s ID is slow on both, but another ID on their workstation is fine, then the problem might lie with the user’s access rights or setup – go to the User-Specific Problem section.
If all trials are slow, then analysis continues.
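The trials above amount to a small decision table. The sketch below encodes it, on the assumption that the user's own ID on their own workstation has already been confirmed as slow. The function and parameter names are illustrative, and the "inconclusive" outcome is an addition for combinations the text does not cover.

```python
def next_step(own_id_other_ws_slow, other_id_own_ws_slow, other_id_other_ws_slow):
    """Map the three trial results to the next section to consult.

    Assumes the user's ID on their own workstation is already known to be slow.
    """
    if not own_id_other_ws_slow:
        # Same ID is fast elsewhere: suspect the workstation.
        return "Client Performance Statistics"
    if not other_id_own_ws_slow:
        # Another ID is fine on the same workstation: suspect the user's setup.
        return "User-Specific Problem"
    if other_id_other_ws_slow:
        # Every trial was slow: keep asking questions.
        return "continue analysis"
    return "inconclusive: repeat the trials"


if __name__ == "__main__":
    print(next_step(False, True, True))
    print(next_step(True, False, False))
    print(next_step(True, True, True))
```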
If the users are in an office which is separate from the main office, they might also try having someone in the home office log in. If the home office does not experience the problem, then this might be an issue confined to the user’s office or group – go to the User-Specific Problem section.
Does this issue occur at a specific time of the day?
Do the users see this problem only in the afternoon? If they only work on the system at a particular time, have them try logging in at another time of day and see if there is still a problem. If there is not, then this may be a time-of-day problem – go to the Time-Specific Problem section.
Problem Analysis
Function-Specific Problem
For a function-specific problem, there is typically one flow or one part of the process (such as opening a form to record a customer call) where the performance is slower than in the rest of the application. The best tool to use for researching this type of issue is the Performance Analyzer (PAL). Run PAL on the process in question, and then analyze the PAL counters where the most time was spent.
There are many PAL counters, each of which tracks different facets of Process Commander. However, some areas are more prone to have performance issues than others:
- Database I/O
- Connectors (includes Rule-Connect-SQL, which is how SQL queries connect to the PegaRULES database)
- Rules Assembly (also known as “first-use assembly”)
List-based Database I/O
One of the places where the most time can be spent in the system is in retrieving information from the database. Process Commander has many built-in efficiencies to cache different kinds of data, in order to avoid unnecessary database reads.
When the system retrieves multiple items from the database, either for a report or for a display list on a form (a drop-down choice for “state” or “country” in a customer address), that is known as reading a list. The system makes distinctions for the type of data being read for the list, and for the mechanism by which the list is being read.
Data types
There are two types of list data tracked:
- rule (instances of Rule- classes)
- non-rule (Work- or Data- class information)
These types can be seen in the Elapsed Time Detail and the CPU Time Detail sections of the Detail PAL display.
Most times, the Rule database lists will be read by the system, to accomplish the requested processing; many rules will be looked up when running an activity, calculating a declarative expression, etc.
The developer should pay more attention to a non-Rule – work or data – lookup. These numbers are indicative of the application having to list a significant quantity of work or data items for the requested process, or of retrieving data from the BLOB (see next section).
The Rule and non-Rule access times are linked to the counts of each type in the Database Access Counts section.
As it runs, the system can retrieve needed rules either from the cache, or from the PegaRULES database, when the system doesn't find them in the cache. Retrieving rules from the cache is efficient; retrieving directly from the database takes much more time. If many rules are continually being retrieved from the database, investigate cache settings.
Data Retrieval
In addition to the type of data being requested, there are two methods which could be used to retrieve the data from the database:
- Obj-List
- RDB-List
Most activities will use the Obj-List method to retrieve data from the database, which is the preferred method. Obj-List is a tool provided to the user which will automatically generate the SQL needed to access information from a PegaRULES database. Beginning in Version 5.1, Obj-List will also access data from an external database connected to a Process Commander application. The tool provides consistent SQL queries which developers do not have to write themselves.
Custom queries created through Rule-Connect-SQL, or reports created with List View rules or Summary View rules, on the other hand, use the RDB-List method*. This creates custom SQL statements which are sent to the database, which may or may not require data from the BLOB column.
*List view rules and summary views use the RDB-List method to retrieve Rule- data from the database, and will use the Obj-List method to retrieve non-Rule data.
The Database Access Counts section of the PAL Detail display shows how many Obj-List and RDB-List requests were made to the database, and whether data was required from the BLOB (“Storage Stream”).
When analyzing performance, any calls to the database that involve data from the BLOB should be closely scrutinized. For further details on what custom SQL statements were used that required data from the BLOB, use the DBTrace tool (see Example Analysis below).
The BLOB
Process Commander data is stored in tables in a relational database. Each of these tables has some exposed columns, which generally correspond to certain scalar properties. When requested data is stored in these exposed columns, it is efficient to report on, as it can be returned directly from these columns.
All of the data for each table are also stored in the Storage Stream column, which is also known as the BLOB – the Binary Large OBject column. (This column is currently part of all PegaRULES database tables.) The BLOB must be used to store data that is structured in forms that do not readily fit into the pre-defined database columns, like properties that contain multi-dimensional data such as pages, groups or lists.
BLOB data handling is slow, and it requires a lot of memory. To reduce the volume of data stored in this column, Process Commander compresses the data when storing it in the column, using one of several compression algorithms (“V4,” “V5,” or “V6”). When data is requested from the BLOB, it must be processed:
- read from the column
- decompressed
- translated from binary into an XML string
- stored on a temporary clipboard page
Depending upon what data was stored in the BLOB, each BLOB entry can take up to 1MB of space on the clipboard. Then, since the BLOB includes all the properties in a particular table entry, the data must be filtered, so that only the requested properties are returned; the temporary clipboard page is then thrown away, creating more garbage that the system must handle.
If this entire process is run to only extract one property from the BLOB, that is a highly inefficient and costly setup. It may be that certain properties which are frequently reported on should be exposed as columns in the database table, if they are continually read from the BLOB. This then allows them to be read directly, rather than going through the above process.
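The difference between reading an exposed column and extracting one property from the Storage Stream can be illustrated with a toy model. In this sketch, JSON and zlib are stand-ins chosen for convenience; they are not the format or the compression algorithms that Process Commander uses, and the function names are hypothetical.

```python
import json
import zlib


def store_row(properties, exposed):
    """Build a stand-in table row: exposed columns plus a compressed stream."""
    row = {name: properties[name] for name in exposed}
    row["stream"] = zlib.compress(json.dumps(properties).encode("utf-8"))
    return row


def read_exposed(row, name):
    """Read a property directly from its exposed column."""
    return row[name]


def read_from_stream(row, name):
    """Read one property the expensive way: the whole stream is processed."""
    raw = zlib.decompress(row["stream"])      # read and decompress everything
    page = json.loads(raw.decode("utf-8"))    # build a temporary page
    value = page[name]                        # filter to the requested property
    return value, len(raw)                    # the temporary page is discarded


if __name__ == "__main__":
    props = {"id": "W-1", "status": "Open", "notes": "x" * 10000}
    row = store_row(props, exposed=["id", "status"])
    print(read_exposed(row, "status"))
    value, bytes_processed = read_from_stream(row, "status")
    print(value, bytes_processed)
```

Both reads return the same value, but the second one has to decompress and parse every property in the row to get it, which is the cost that exposing a column avoids.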
Connects
Connect rules are used to connect the Process Commander system to external systems. There are different types of connect rules (Rule-Connect-EJB, Rule-Connect-SOAP, Rule-Connect-MQ, etc.) to connect using different integration methods.
In addition to external systems, one connect rule – Rule-Connect-SQL - is used to connect to the Process Commander database for custom SQL queries (such as those created by the RDB-List method). Connect time for all Connect rules is measured in the Elapsed and CPU sections of the PAL Detail display, and may be linked to data retrieved through RDB-List methods. (See the Example Analysis section.)
Rules Assembly
When a user logs in and runs a process for the first time, that process will be slow, as all of the rules involved in the process must be retrieved from the database and then generated into Java code and compiled by the system, in order to run them. After this first use, however, the generated and compiled code is cached, so that subsequent uses of any of these rules will be much faster. The cached code is keyed by the user’s RuleSet list.
Applications should be designed to take advantage of the Rules Assembly functionality. Since the generated code is keyed by the user’s RuleSet list, all the users who have the same RuleSet list can use the same code. To make the system run as efficiently as possible, all users who do the same processing in the application (use the same forms, run the same tasks) should have exactly the same RuleSet List, so the code for the rules they use only needs to be generated once.
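The effect of keying generated code by RuleSet list can be shown with a toy cache. This is a sketch of the idea only, not the product's implementation; the class, the rule name, and the RuleSet names are all hypothetical.

```python
class AssemblyCache:
    """Toy cache keyed by (rule name, RuleSet list)."""

    def __init__(self):
        self._cache = {}
        self.assemblies = 0  # number of times code had to be generated

    def get(self, rule_name, ruleset_list):
        # The order of the RuleSet list is part of the key.
        key = (rule_name, tuple(ruleset_list))
        if key not in self._cache:
            self.assemblies += 1
            self._cache[key] = f"compiled:{rule_name}"
        return self._cache[key]


if __name__ == "__main__":
    shared = ["AcmeApp:01-01", "AcmeCore:01-01"]        # hypothetical RuleSets
    personal = ["jsmith-personal"] + shared             # user with a personal RuleSet

    cache = AssemblyCache()
    cache.get("OpenCallForm", shared)     # first user: assembly needed
    cache.get("OpenCallForm", shared)     # second user, same list: cache hit
    cache.get("OpenCallForm", personal)   # different list: assembled again
    print(cache.assemblies)
```

Two users with identical lists share one assembly; the user with an extra personal RuleSet forces a second one for the same rule.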
If users are experiencing slow response time, one of the things to check is whether they are set up to take advantage of the Rules Assembly caching, or whether they are doing more Rules Assembly than necessary. If a user logs into the system in the middle of the day (when other users have been using the system all day, and all the rules should have already been assembled), and still sees high Rules Assembly times in the PAL Detail readings, more investigation needs to be done.
Generally, when Rules Assembly occurs for these users, it is due to having a different RuleSet List than other users. This situation can occur in one of two ways:
- a different Access Group with different RuleSet list
- incorrectly having the ability to check out rules
Different Access Groups
If each user has a different access group with just slightly different RuleSet Lists, then each user must do their own Rules Assembly, which will make performance slow. Users who all perform the same functions in the application should all have the same Access Group specified in their Operator ID records.
Check-out Rule Ability
In addition, if users who do not change rules have the Allow rule check out box selected (on the Security tab of the Operator ID instance), they will have a personal RuleSet, which also creates a RuleSet List difference.
Check the user’s Operator ID record to see whether either of these conditions is present.
Rules Assembly Cache
As described in the previous section, developers should verify that users are part of the correct access groups, so that they may take advantage of Rules Assembly caching. However, if a user is running a process for the first time, and they are the first user into the system, or they require a different RuleSet list than other users for their processing (which would require re-generating all code to this different RuleSet List), some time spent in Rules Assembly is to be expected.
However, if the user runs the same process several times in a row, those rules should already be assembled and cached for their RuleSet list, and the system should be using the cached code. If Rules Assembly is still taking place, and the Operator ID has been checked and is correct, that is a signal that something odd is occurring, and should be investigated. Are the Rule Caches too small?
Example Analysis
The following example takes an incorrectly-defined report in Version 4.2 and follows it through a performance analysis using PAL and DBTrace, to demonstrate the overall process of investigation for a function-specific problem.
Begin by running all the way through the problem process at least once, so the system goes through Rules Assembly. In this example, the report called AcmeReport is slow. When the user timed it with his watch, this report took 80 seconds to open. (Since the response time for the system should be less than a second, that’s too long.)
The developer then clicks on the Performance link. In V4.2, this link is in the Tools section at the bottom of the Explorer gadget:
In Version 5.x, this link is off the Run menu at the top of the screen:
Both links open the same PAL display.
Click Reset Data to clear the prior data from the display, so the reading only reflects the coming actions:
Run the process which is being investigated, and then click Add Reading in the PAL screen. When the Delta line appears, click that to display the PAL Detail display.
The PAL Detail display appears:
There are a number of interesting measurements in this reading.
Elapsed time for executing Connect Rules shows over 60 seconds. That’s a lot of time to connect to something.
To see whether the connection involves an RDB rule, look at the Database Access Counts section, which shows several values:
RDB-List requests to the database that:
- did not require a Storage Stream: 0
- required the Storage Stream: 1
Rows returned from an RDB-List request that:
- did not require a Storage Stream: 0
- required the Storage Stream: 251
And finally, Bytes read from database Storage Stream = 137 MB.
In summary, these readings state:
- there was one RDB-List request to the database
- it required reading the BLOB
- 251 rows were returned from the BLOB as a result of this one request
- the number of bytes in those 251 rows totals over 137 MB
This is an extremely large amount of information, and it took a very long time. To find out what was being looked at that required all that data, run DBTrace.
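To put these readings in perspective, the average amount of BLOB data per returned row can be worked out directly from the two values above. The following is only a quick arithmetic sketch using the numbers from this example; the variable names are illustrative and are not PAL counter names.

```python
# Values copied from the PAL reading in this example.
rows_returned = 251      # rows from the RDB-List request that required the Storage Stream
bytes_read_mb = 137      # "Bytes read from database Storage Stream", in MB

# Average BLOB data read for each row of the report.
mb_per_row = bytes_read_mb / rows_returned
kb_per_row = mb_per_row * 1024

print(f"Average BLOB data per row: {mb_per_row:.2f} MB (about {kb_per_row:.0f} KB)")
```

Roughly half a megabyte is being read for every row the report displays, which is why the single request dominates the elapsed time.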
Once again, clear the PAL Data. Then, before re-running the report, click Start DB Trace.
The link will change to Stop DB Trace, and the trace will be started. Re-run the report, and then click Stop DB Trace.
The Download pop-up window will appear:
Follow the directions to save the trace file and start Excel.
Analyzing data with DBTrace
There are several tabs available in the DBTrace Excel workbook; click the DB Trace Data tab to look directly at the data for these interactions. In most cases, it is necessary to sort the data by the Time column in descending order, so that the most significant lines of data move to the top of the worksheet. The 20 highest-value times will be highlighted in yellow. Then, once the highest times are identified, re-sort the data back into process order, to get the context for what is happening in the system at that time.
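The same sort, inspect, and re-sort routine can also be done outside Excel once the trace data has been exported. The sketch below is a minimal illustration only: the rows, the field names ("Sequence", "Time", "Operation") and the "otherOperation" value are assumptions made for the example, not the actual DBTrace layout. Only "readBlob" is taken from this Play.

```python
from operator import itemgetter

# Hypothetical rows standing in for the DB Trace Data worksheet.
# Field names and values are assumptions for illustration.
trace_rows = [
    {"Sequence": 1, "Time": 0.004, "Operation": "otherOperation"},
    {"Sequence": 2, "Time": 1.250, "Operation": "readBlob"},
    {"Sequence": 3, "Time": 0.010, "Operation": "otherOperation"},
    {"Sequence": 4, "Time": 2.900, "Operation": "readBlob"},
    {"Sequence": 5, "Time": 0.300, "Operation": "readBlob"},
]


def slowest(rows, top_n=20):
    """Return the top_n rows with the highest Time, slowest first."""
    return sorted(rows, key=itemgetter("Time"), reverse=True)[:top_n]


def context_for(rows, target, before=2, after=2):
    """Return the target row with its neighbours, in process order."""
    ordered = sorted(rows, key=itemgetter("Sequence"))
    position = ordered.index(target)
    return ordered[max(0, position - before): position + after + 1]


worst = slowest(trace_rows, top_n=1)[0]
for row in context_for(trace_rows, worst, before=1, after=1):
    print(row["Sequence"], row["Operation"], row["Time"])
```

Keeping the slowest rows and the process-order view as two separate steps mirrors the worksheet procedure: the first finds what is slow, the second shows what the system was doing around it.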
The raw data can be a bit daunting to look at. When closely inspected, however, the DBTrace data gives all sorts of useful information:
- for the last half of the screen, all the lines say “readBlob” in the Operation column, meaning the BLOB is being read for this operation
- the SQL in the Note column tells the developer that the report that was read was the Rule-Obj-ListView report defined on Rule-Obj-Flow, called “AcmeReport”
- there were various activities that were run to process this report
Since the report was run directly for this example, there is not a lot of other surrounding information. If this report were run as part of a large batch process, however, the DBTrace data could be used to pinpoint which report was giving problems, and what SQL queries were being run as a result.
In this case, the developer investigated the specified report, and noticed a warning at the bottom of the form:
The property .pyChildFlows is not exposed, but only stored in the BLOB, so the read of this information is forcing the system to read the BLOB to retrieve only one property.
This investigation pinpoints where the huge read from the database is, and allows the developer to consider whether there is another way to report this information.
In general, DBTrace data can highlight two groups of database problems:
- custom SQL queries (without the BLOB)
- queries which return the BLOB
Custom SQL Queries
These would be queries which are coded in a Connect SQL rule, which might have a performance problem. If the DBTrace reports that a particular query is taking a long time, look at the way the SQL is constructed.
Start by copying the query out of the DBTrace data, along with the substitution values (which are reported in a different column). Use the native tools for your particular database (Oracle, SQL Server, etc.) to analyze the query, and look for ways to optimize; coordinate this analysis with your DBA.
If the query contains insert or update statements, the DBA could investigate storage optimizations or look for extraneous indexes. Having a lot of indexes may actually slow down the insert/update statements, making performance worse, so it is important that they are added judiciously.
For additional details on writing appropriate custom SQL queries, reference the Writing SQL Technology Paper.
If a query joins two tables, it might be more efficient to create a Rule-Declare-Index table to report this data. Declarative Indexes should also be added judiciously, however, as they can also negatively affect performance. For full details on creating Declarative Indexes, reference the Declarative Indexing Technology Paper.
Queries returning the BLOB
These are also custom queries, of course, but queries involving the BLOB are a special case.
As stated above, extracting data from the BLOB can be slow and can create a lot of garbage. Therefore, queries that return data from the BLOB should be carefully scrutinized, to make sure they are as efficient as possible, and aren’t calling the BLOB unnecessarily.
In the DBTrace data, the developer can search for the following phrases:
- “performing list with blob – blob is necessary due to following X properties”
- “select pzPVStream”
pzPVStream is the name of the BLOB column, so any SQL statements that “select pzPVStream” are calling the BLOB. For many cases, this is the most efficient way to get the data. The developer should be aware of the BLOB calls, however, and if any of them are taking excessive time (as reported in DBTrace), should do further investigation.
The “performing list” statement will show which properties in the BLOB are being requested. In some cases, one or more properties will be named in this statement, pointing to the data that was actually requested. If there is only one property listed, the developer should consider whether it might be more efficient to expose that one property as a column, so this request wouldn’t need to extract any information from the BLOB. (If there are a number of properties being requested, then perhaps the BLOB statement is as efficient as possible.)
Sometimes, the DBTrace information will state “ . . . due to the following 0 properties”. This does not mean that no data was returned; it means that the requested properties are on an embedded page in the BLOB, so the top-level trace has no record of them. Since these are embedded properties, it is not possible to put them in an exposed column; instead, the developer should consider whether a Rule-Declare-Index would be more efficient.
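When a trace contains many lines, the phrase search described above can be scripted. The following is a rough sketch under stated assumptions: the sample notes are made up (including the table and column names in the SQL), and the exact wording of real DBTrace notes may differ, so the pattern only keys on the "following N properties" fragment and on "select pzPVStream".

```python
import re

# Matches the property count in notes such as
# "... blob is necessary due to following 1 properties".
PROPERTY_COUNT = re.compile(r"due to (?:the )?following (\d+) propert", re.IGNORECASE)


def classify_note(note):
    """Label a DBTrace note according to how it touches the BLOB."""
    match = PROPERTY_COUNT.search(note)
    if match:
        count = int(match.group(1))
        if count == 0:
            # Embedded-page properties: consider a Rule-Declare-Index.
            return "embedded-properties"
        if count == 1:
            # A single property: consider exposing it as a column.
            return "single-property"
        return "multiple-properties"
    if "select pzpvstream" in note.lower():
        return "blob-column-select"
    return "no-blob"


# Made-up notes for illustration only.
sample_notes = [
    "performing list with blob - blob is necessary due to following 1 properties: pyChildFlows",
    "performing list with blob - blob is necessary due to the following 0 properties",
    "select pzPVStream from some_table where some_key = ?",
    "select pyLabel from some_table",
]

for note in sample_notes:
    print(classify_note(note), "<-", note)
```

The labels map onto the cases discussed above: a single property suggests exposing a column, zero properties suggests embedded data and a possible Rule-Declare-Index, and a plain pzPVStream select is only a concern if DBTrace shows it taking excessive time.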
In some cases, getting data from the BLOB column is valid, but the reading process needs to be optimized. Check the size of the reads being done – many are up to 16K in size for one entry. If the time to do the reads seems excessive, but the queries themselves are legitimate, discuss with the DBA the default storage size in the BLOB. If the queries are 16K, and the default storage size is 4K, then that means that four reads are required for each line returned. The DBA may want to change the default storage size, so the reads themselves are more efficient.
For details on sizing the database BLOB appropriately, reference the Database LOB Sizing Technology Paper.
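The read-count arithmetic in the 16K and 4K example can be written out as a small calculation. This is a simplified model that assumes each read fetches exactly one default-storage-size unit; actual database behavior depends on the vendor and configuration, so treat it as a way to frame the discussion with the DBA rather than as a prediction.

```python
import math


def reads_per_entry(entry_size_kb, default_storage_kb):
    """Number of reads needed to fetch one BLOB entry (simplified model)."""
    return math.ceil(entry_size_kb / default_storage_kb)


print(reads_per_entry(16, 4))    # 4 reads, as in the example in the text
print(reads_per_entry(16, 16))   # 1 read once the storage size matches the entry size
```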
Finally, when returning data from the BLOB, it is important to limit the number of records returned (the results set). As stated in the BLOB information above (in the List-based Database I/O section), each BLOB entry can take up to 1MB of space in the clipboard. If too many BLOB entries are returned, out-of-memory errors can occur. Setting .pyMaxRecords allows developers to be certain that the number of entries returned will not overflow the JVM memory.
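A worst-case estimate helps when choosing a value for .pyMaxRecords. The sketch below uses the upper bound of 1 MB per BLOB entry quoted above; the 256 MB memory budget is a hypothetical figure chosen for illustration, and the function names are not part of the product.

```python
MAX_MB_PER_BLOB_ENTRY = 1  # upper bound per entry, as stated in the text


def worst_case_clipboard_mb(max_records):
    """Worst-case clipboard memory if every returned entry is at the upper bound."""
    return max_records * MAX_MB_PER_BLOB_ENTRY


def largest_safe_max_records(memory_budget_mb):
    """Largest record limit whose worst case still fits in the given budget."""
    return memory_budget_mb // MAX_MB_PER_BLOB_ENTRY


print(worst_case_clipboard_mb(5000))    # 5000 MB worst case for 5,000 records
print(largest_safe_max_records(256))    # 256 records for a hypothetical 256 MB budget
```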
General System-Wide Problem
A performance problem with one function or process can be seen across the entire system, and in a sense can be considered a “system-wide” problem. However, the issue is isolated to the one process; a developer can track and analyze it with the PAL tool.
Other performance problems may not be isolated to one process; the performance of the entire application is slow (as measured against the application SLAs). In this case, as all users see the problem, some analysis could also be done using PAL. Note, however, that the PAL tool will only look at the performance of one node (one server) – the node that this user is connected to. To get a full “system-wide” view of performance, use the Log-Usage tool.
Below is a screen shot of the V4.2 Log-Usage statistics, available from the System Console.
For a system-wide problem, the PAL statistics at the top of the Log-Usage section should be used.
Note that these statistics are averages, measured across all nodes and during all the hours the system has been running for the time period specified in the tool; they also include data for agents and services. This means that the numbers can’t be looked at directly for answers to where the performance could be improved, but must be compared against other related PAL measurements.
Examples:
- If there is a high reading in Other Browse Elapsed, check for Other Browse Rows Returned. See how many rows were returned; divide the Elapsed time by the number of rows returned, to see how long it took (on average) to return one row, and see if that time is reasonable.
- Look for the Interaction Count, which shows the number of requests to and from the server. Divide the Total Request Time by the Interaction Count in order to get the average time per interaction.
- Divide both the Java Assemble Count and the Java Compile Count by the number of interactions. The answer might be something like 10 assemblies and compiles per interaction. In development, where things are constantly changing and need to be reassembled frequently, that might be an acceptable number. In a production system, where the code should not be changing, this would be an indication of a serious problem.
- Compare the Connect Count with the Connect Elapsed time – how much time is being taken per connect?
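The ratios in the examples above can be computed in one pass once the Log-Usage values are written down. The sketch below uses the counter names mentioned in this Play, but every number is hypothetical, and it assumes that elapsed values are in seconds.

```python
def per_unit(total, count):
    """Average of total per count; None when the count is zero."""
    return total / count if count else None


# Hypothetical Log-Usage values for illustration only.
usage = {
    "Other Browse Elapsed": 120.0,
    "Other Browse Rows Returned": 6000,
    "Total Request Time": 900.0,
    "Interaction Count": 3000,
    "Java Assemble Count": 150,
    "Java Compile Count": 150,
    "Connect Count": 40,
    "Connect Elapsed": 20.0,
}

derived = {
    "seconds per browsed row": per_unit(
        usage["Other Browse Elapsed"], usage["Other Browse Rows Returned"]),
    "seconds per interaction": per_unit(
        usage["Total Request Time"], usage["Interaction Count"]),
    "assemblies per interaction": per_unit(
        usage["Java Assemble Count"], usage["Interaction Count"]),
    "compiles per interaction": per_unit(
        usage["Java Compile Count"], usage["Interaction Count"]),
    "seconds per connect": per_unit(
        usage["Connect Elapsed"], usage["Connect Count"]),
}

for name, value in derived.items():
    print(f"{name}: {value}")
```

Each derived value only becomes meaningful when compared with the SLA, a baseline, or the same ratio from a period when performance was acceptable.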
There are many PAL counters available in the V4.2 Log-Usage display, and more were added in Version 5.1. Rather than going through all the possible calculations in this Support Play, reference these supplemental Support Plays for full calculation details:
- Support Play: Troubleshooting Performance - PAL Reading Details in Version 4.2 (future publication)
- Support Play: Troubleshooting Performance - PAL Reading Details in Version 5.1 (future publication)
In addition to Log-Usage, VerboseGC should also be enabled for tracking system-wide performance issues. VerboseGC is a JVM startup argument that instructs the JVM to log statistics on garbage collection cycles. As garbage collection can definitely affect system performance, it is important to see how much garbage is being created in the system.
VerboseGC logging overhead is small, so enabling this switch in any environment is strongly recommended.
For full details on VerboseGC, garbage collection, and tuning your JVM to maximum efficiency, see:
- Support Play: Tuning the IBM JVM 1.4.2 for performance
- Support Play: Tuning the Sun JVM 1.4.2 and 5.0 for performance
Time-Specific Problem
In some systems, the performance problems only appear during specific times of the day. In this case, use the Log-Usage tool, but check the hour-by-hour printout for details. Check the time when users complain about system slowness.
The example below shows a Log-Usage report that was downloaded to a CSV file and printed out. It is clear that the usage is heaviest between 12:00 and 16:00 (4 p.m.), peaking around 14:00 (2 p.m.).
Again, for full details on the Log-Usage calculations, refer to the support play PAL Reading Details for the appropriate version.
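Once the hour-by-hour data is in a CSV file, the busiest hours can also be picked out programmatically. The sketch below is an assumption-laden illustration: the hourly figures are invented to resemble the pattern described above, a single interaction count per hour is assumed, and the 70% cut-off for a "busy" hour is an arbitrary choice.

```python
# Hypothetical interaction counts keyed by hour of day (24-hour clock).
hourly_interactions = {
    8: 400, 9: 900, 10: 1500, 11: 2100, 12: 3400,
    13: 3900, 14: 4600, 15: 3800, 16: 3300, 17: 1200,
}

# The single busiest hour.
peak_hour = max(hourly_interactions, key=hourly_interactions.get)
peak_value = hourly_interactions[peak_hour]

# Hours whose load is at least 70% of the peak (arbitrary threshold).
busy_hours = sorted(
    hour for hour, count in hourly_interactions.items()
    if count >= 0.7 * peak_value
)

print(f"Peak hour: {peak_hour}:00 with {peak_value} interactions")
print(f"Busy hours: {busy_hours}")
```

Comparing the busy hours against the times when users report slowness shows whether the complaints line up with load or point to something else, such as a scheduled agent.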
User-Specific Problem
If the user questions indicate an issue with just one user (or one group of users), the developer could run a set of PAL statistics for one of the problem users and for another user whose performance is good, and compare the readings, doing further investigation in the areas which have slower times or higher usage counts.
In addition, the developer should investigate the following issues:
- Does this user (or group) have a great many personal rules in their RuleSet that might be causing slowdowns?
- How do this user’s (or group's) access rights differ from those of other users?
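The comparison of PAL readings between a problem user and a well-performing user, described at the start of this section, can be sketched as a simple ratio check. The readings below are hypothetical, the counter names are borrowed from the Log-Usage discussion earlier in this Play, and the threshold of 2.0 is an arbitrary starting point.

```python
def compare_readings(slow_user, fast_user, threshold=2.0):
    """Return (counter, ratio) pairs where the slow user's value is at least
    threshold times the fast user's value, largest ratio first.
    Counters that are missing or zero for the fast user are skipped."""
    findings = []
    for counter, slow_value in slow_user.items():
        fast_value = fast_user.get(counter)
        if not fast_value:
            continue
        ratio = slow_value / fast_value
        if ratio >= threshold:
            findings.append((counter, ratio))
    return sorted(findings, key=lambda item: item[1], reverse=True)


# Hypothetical PAL readings for the same process run by two users.
slow_user = {"Total Request Time": 45.0, "Java Assemble Count": 120,
             "Connect Elapsed": 3.0, "Interaction Count": 50}
fast_user = {"Total Request Time": 9.0, "Java Assemble Count": 4,
             "Connect Elapsed": 2.5, "Interaction Count": 50}

for counter, ratio in compare_readings(slow_user, fast_user):
    print(f"{counter}: {ratio:.1f}x higher for the slow user")
```

In this made-up case the assembly count stands out, which would point back to the Access Group and rule check-out checks described earlier.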
Agent Performance Problem
Sometimes the performance issue is not seen by a user, but is observed when an agent runs. The process of analyzing an agent performance problem is very different from analyzing an interactive user’s performance issue, and is too long to be included in this Support Play. Therefore, once it is determined that there is a performance problem with an agent, refer to Support Play: Troubleshooting Performance in Agent or Service Activities.
Service Performance Problem
As stated above, sometimes the performance issue is not seen by a user, but is experienced when a Service rule runs. The process of analyzing a Service rule is very different from analyzing an interactive user’s performance issue.
Version 4.2
Version 4.2 didn’t have many of the performance-tracking tools that were added in later versions. There are some basic tools described in the following documents:
The Troubleshooting section in this doc includes information on:
- Testing Service Activities Manually
- SOAP Message Monitoring
- Process Commander Log File
The Troubleshooting section in this doc includes information on:
- SOAP Message Monitoring
- Process Commander Log File
- Common Error Messages
In addition, it is possible to trace some performance issues within the service activity itself. For details on this procedure, see Support Play: Troubleshooting Performance in Agent or Service Activities.
Version 5.1
Additional PAL counters and additional tracing functionality were added in Version 5.1. For full details on tracing service issues in Version 5.1, see Testing Services and Connectors in Version 5.1.
Environment Problem
In some systems, the PAL statistics and log information will look fine, even though performance is slow. In this case, the problem may not be with the application itself; there may be some environment issue that is causing the system slowdown, such as issues with the network, the client, or the database.
Because many tools are already available for diagnosing issues with different commercial products, Pegasystems does not provide its own environment diagnostic tools. The system administrator should use the tools he or she is most familiar with. The following environment areas should be regularly monitored.
Server and Network Performance Data
In addition to the performance of the application, the resource consumption and load on the servers that make up the operating environment are also critical to identifying performance problems and resource bottlenecks. There is a vast array of tools available to capture and analyze performance metrics for different operating system and hardware platforms; however, the tools bundled with the operating system being used, or some free open-source tools, should be adequate to provide the information necessary to identify performance problems or resource bottlenecks on a server. Below is a partial list of tools that could be used to evaluate server performance on various platforms, and their corresponding descriptions.
Unix and Linux:
- VMSTAT – Reports virtual memory statistics. VMSTAT output differs from platform to platform, but usually provides data on both memory and CPU usage.
- MPSTAT – Reports per-processor or per-processor-set statistics. MPSTAT provides processor statistics for a single CPU or group of CPUs, which may be valuable in large multiple CPU machines that are broken into operating domains.
- IOSTAT – Reports I/O statistics. IOSTAT provides disk subsystem statistics for a server. (Note: This data may be more important for a database server than for an application server.)
- NETSTAT – Reports network usage statistics. NETSTAT displays the status of network controllers and provides important usage statistics.
- SAR – Comprehensive System Activity Reporter (SAR). This tool has more advanced options than the above tools; it is typically more useful in a performance tuning exercise than in a performance troubleshooting exercise, given its increased complexity to set up and monitor.
Windows:
- Task Manager – Windows Task Manager is a comprehensive real-time performance analysis tool. Task Manager provides a view of the current system utilization level; however, it is not possible to save performance data from this tool.
- Windows Performance Logs and Alerts – Performance Logs and Alerts are built into the Windows NT operating system, and can be accessed from the Control Panel. Use this tool to monitor a wide array of performance statistics and save data for comparison and analysis.
Additional information on the Unix and Linux tools can be gained by consulting the man (manual) pages available on the server. Additional information on the Windows tools can be found in Windows help.
Database Performance Statistics
Because so many performance problems are linked to database access, performance statistics should be captured for the database software as well. The gathering of relational database performance statistics should be done using the database software native tools, such as:
- STATSPACK for Oracle
- SQL Profiler trace for Microsoft SQL Server
- DB2 Monitor for IBM DB2
Monitoring database performance should be done by the DBA responsible for the application database, or someone who is proficient in the tools.
Client Performance Statistics
One area of the operating environment that is often ignored is the client – the user’s workstation. Although the client component typically has a small impact on the overall performance equation and is rarely a problem, the client should be checked if the above areas are not yielding results, or if a complete tuning project is underway.
The following areas should be investigated:
Area | Details |
---|---|
Hardware | Does this user have enough memory in their PC? Do they have a fast enough processor to run Process Commander efficiently? |
Hard Drive | Does this user have enough space on their hard drive to keep the files they are trying to store? Is the hard drive so fragmented that the PC can’t quickly find the files it needs? |
Browser | Does this user have the correct version of Internet Explorer? Have they added appropriate patches? Are they trying to use an unsupported browser type (such as Firefox)? |
PC | Does the user have some kind of virus on their PC? |
Besides checking the primary performance statistics for a client machine as described above, it is important to check the configuration of the Internet Explorer browser to make sure the settings are appropriate. Check the following items when evaluating the Internet Explorer configuration:
- Temporary Internet File Cache Size – Check that the amount of space allocated to the browser’s temporary internet file cache is adequate. An “adequate” value can vary from system to system; generally, try to allocate as much space as can be spared, based on the size of the user’s disk. Typical configurations allocate 500 MB to this setting.
- Connections – Verify that the connection to the application server is considered a local intranet connection, and not an internet connection. Also verify that any proxy servers that are passed through between the client and the application server are required.
Additional Resources
- Using Performance Tools in Process Commander Version 4.2
- Using Performance Tools in Process Commander Version 5.1
- Support Play: Tuning Your IBM JVM 1.4.2 for performance
- Support Play: Tuning the Sun JVM 1.4.2 and 5.0 for performance
- System Management Reference Guide
- Support Play: Troubleshooting Performance in Agent or Service Activities
- Testing Services and Connectors (Version 5.1)
- pegarules.xml File Settings Reference
- Declarative Indexes
- Database LOB Sizing
Need Further Help?
If you have followed this Support Play, but still require additional help, you can contact Global Customer Support by logging a Support Request.