Monday, April 6, 2009

Automatic Workload Repository


Learn to use the new feature that collects database performance statistics and metrics for analysis and tuning, shows the exact time spent in the database, and even saves session information

When you have a database performance problem, what is the first thing you do to address it? One common approach is to see if a pattern exists: Answering questions such as "Is the same problem recurrent?", "Does it occur during a specific time period?", and "Is there a link between two problems?" will almost always lead to a better diagnosis.

As a DBA you probably have invested in a third-party or homegrown tool to collect elaborate statistics during database operation and derive performance metrics from them. In a crisis, you access those metrics for comparisons to the present. Replaying these past events can shed light on current problems, so continuously capturing relevant statistics becomes important for performance analysis.

For some time, Oracle's solution in this area has been its built-in tool, Statspack. While it can prove invaluable in certain cases, it often lacks the robustness required by performance troubleshooting exercises. Oracle Database 10g offers a significant improvement: the Automatic Workload Repository (AWR). The AWR installs along with the database and captures not only statistics, but the derived metrics as well.

A Quick Test Drive

AWR capability is best explained quickly by the report it produces from collected statistics and metrics, by running the script awrrpt.sql in the $ORACLE_HOME/rdbms/admin directory. This script, in its look and feel, resembles Statspack; it shows all the AWR snapshots available and asks for two specific ones as interval boundaries. It produces two types of output: text format, similar to that of the Statspack report but from the AWR repository, and the default HTML format, complete with hyperlinks to sections and subsections, providing quite a user-friendly report. Run the script and take a look at the report now to get an idea about capabilities of the AWR.

Implementation

Now let's explore how AWR is designed and structured. Basically, AWR is an Oracle built-in tool that collects performance related statistics and derives performance metrics from them to track a potential problem. Unlike Statspack, snapshots are collected automatically every hour by a new background process called MMON and its slave processes. To save space, the collected data is automatically purged after 7 days. Both the snapshot frequency and retention time can be modified by the user. To see the present settings, you could use:

select snap_interval, retention
from dba_hist_wr_control;

SNAP_INTERVAL RETENTION
------------------- -------------------
+00000 01:00:00.0 +00007 00:00:00.0

This SQL shows that the snapshots are taken every hour and the collections are retained 7 seven days. To change the settings--say, for snapshot intervals of 20 minutes and a retention period of two days--you would issue the following. The parameters are specified in minutes.

begin
dbms_workload_repository.modify_snapshot_settings (
interval => 20,
retention => 2*24*60
);
end;

AWR uses several tables to store the collected statistics, all stored under the SYS schema in the new special tablespace named SYSAUX, and named in the format WRM$_* and WRH$_*. The former type stores metadata information such as the database being examined and the snapshots taken, and the latter type holds the actual collected statistics. (As you might have guessed, H stands for "historical" and M stands for "metadata.") There are several views with the prefix DBA_HIST_ built upon these tables, which can be used to write your own performance diagnosis tool. The names of the views directly relate to the table; for example, the view DBA_HIST_SYSMETRIC_SUMMARY is built upon the table WRH$_SYSMETRIC_SUMMARY.

The AWR history tables capture a lot more information than Statspack, including tablespace usage, filesystem usage, even operating system statistics. A complete list of these tables can be seen from the data dictionary through:

select view_name from user_views where view_name like 'DBA\_HIST\_%' escape '\';

The view DBA_HIST_METRIC_NAME defines the important metrics the AWR collects, the groups to which they belong, and the unit in which they are collected. For example, here is one record (in vertical format):

DBID : 4133493568
GROUP_ID : 2
GROUP_NAME : System Metrics Long Duration
METRIC_ID : 2075
METRIC_NAME : CPU Usage Per Sec
METRIC_UNIT : CentiSeconds Per Second

It shows that a metric "CPU Usage Per Sec" is measured in units of "CentiSeconds Per Second" and belongs to a metric group "System Metrics Long Duration." This record can be joined with other tables such as DBA_HIST_SYSMETRIC_SUMMARY to get the activity, as in:

select begin_time, intsize, num_interval, minval, maxval, average, standard_deviation sd
from dba_hist_sysmetric_summary where metric_id = 2075;

BEGIN INTSIZE NUM_INTERVAL MINVAL MAXVAL AVERAGE SD
----- ---------- ------------ ------- ------- -------- ----------
11:39 179916 30 0 33 3 9.81553548
11:09 180023 30 21 35 28 5.91543912

... and so on ...

Here we see how the CPU time was consumed in centi-seconds. The standard deviation adds to our analysis by helping ascertain whether the average figure reflects the actual workload. In the first records, the average is 3 centi-seconds in CPU per second elapsed, but the standard deviation is 9.81, meaning the average of 3 is not reflective of the workload. In the second example, the value 28, with a standard deviation of 5.9, is more representative. This type of information trends help understanding the effects of several environmental parameters on performance metrics.

Using the Statistics

So far we have seen what AWR collects; now let's see what it does with the data.

Most performance problems do not exist in isolation, but rather leave tell-tale signs that will lead to the eventual root cause of the problem. Let's use a typical tuning exercise: You notice that the system is slow and decide to look into the waits. Your examination reveals that the "buffer busy wait" is very high. What could be the problem? There are several possibilities: there could be a monotonically increasing index, a table so packed that a single block is asked to be loaded to memory very quickly, or some other factors. In any case, first you want identify the segment in question. If it's an index segment, you could decide to rebuild it; change it to a reverse key index; or convert it to a hash-partitioned index introduced in Oracle Database 10g. If it's a table, you could consider changing storage parameters to make it less dense or move it over to a tablespace with automatic segment space management.

Your plan of attack is generally methodical and usually based your knowledge of various events and your experience in dealing with them. Now imagine if the same thing were done by an engine - an engine that captures metrics and deduces possible plans based on pre-determined logic. Wouldn't your job be easier?

That engine, now available in Oracle Database 10g, is known as Automatic Database Diagnostic Monitor (ADDM). To arrive at a decision, ADDM uses the data collected by AWR. In the above discussion, ADDM can see that the buffer busy waits are occurring, pull the appropriate data to see the segments on which it occurs, evaluate its nature and composition, and finally offer solutions to the DBA. After each snapshot collection by AWR, the ADDM is invoked to examine the metrics and generate recommendations. So, in effect you have a full-time robotic DBA analyzing the data and generating recommendations proactively, freeing you to attend to more strategic issues.

To see the ADDM recommendations and the AWR repository data, use the new Enterprise Manager 10g console on the page named DB Home. To see the AWR reports, you can navigate to them from Administration, then Workload Repository, and then Snapshots. We'll examine ADDM in greater detail in a future installment.

You can also specify alerts to be generated based on certain conditions. These alerts, known as Server Generated Alerts, are pushed to an Advanced Queue, from where they can be consumed by any client listening to it. One such client is Enterprise Manager 10g, where the alerts are displayed prominently.

Time Model

When you have a performance problem, what comes to mind first to reduce the response time? Obviously, you want to eliminate (or reduce) the root cause of the factor that adds to the time. How do you know where the time was spent--not waiting, but actually doing the work?

Oracle Database 10g introduces time models for identifying the time spent in various places. The overall system time spent is recorded in the view V$SYS_TIME_MODEL. Here is the query and its output.

STAT_NAME VALUE
------------------------------------- --------------
DB time 58211645
DB CPU 54500000
background cpu time 254490000
sequence load elapsed time 0
parse time elapsed 1867816
hard parse elapsed time 1758922
sql execute elapsed time 57632352
connection management call elapsed time 288819
failed parse elapsed time 50794
hard parse (sharing criteria) elapsed time 220345
hard parse (bind mismatch) elapsed time 5040
PL/SQL execution elapsed time 197792
inbound PL/SQL rpc elapsed time 0
PL/SQL compilation elapsed time 593992
Java execution elapsed time 0
bind/define call elapsed time 0

Note the statistic named DB Time, which represents the time spent in the database since the instance startup. Run the sample workload and select the statistic value from the view again. The difference should represent the time spent in the database for that workload. After another round of tuning, perform the same analysis and that difference will show the change in DB Time after the tuning, which can be compared to first change to examine the effect of the tuning exercise on the database time.

In addition to the database time, the V$SYS_TIME_MODEL view shows a whole lot of other statistics, such as time spent in different types of parsing and even PL/SQL compilation.

This view shows the overall system times as well; however, you may be interested in a more granular view: the session level times. The timing stats are captured at the session level as well, as shown in the view V$SESS_TIME_MODEL, where all the stats of the current connected sessions, both active and inactive, are visible. The additional column SID specifies the SID of the sessions for which the stats are shown.

In previous releases, this type of analysis was impossible to get and the user was forced to guess or derive from a variety of sources. In Oracle Database 10g, getting this information is a snap.

Active Session History

The view V$SESSION in Oracle Database 10g has been improved; the most valuable improvement of them all is the inclusion of wait events and their duration, eliminating the need to see the view V$SESSION_WAIT. However, since this view merely reflects the values in real time, some of the important information is lost when it is viewed later. For instance, if you select from this view to check if any session is waiting for any non-idle event, and if so, the event in question, you may not find anything because the wait must have been over by the time you select it.

Enter the new feature Active Session History (ASH), which, like AWR, stores the session performance statistics in a buffer for analysis later. However, unlike AWR, the storage is not persistent in a table but in memory, and is shown in the view V$ACTIVE_SESSION_HISTORY. The data is polled every second and only the active sessions are polled. As time progresses, the old entries are removed to accommodate new ones in a circular buffer and shown in the view. To find out how many sessions waited for some event, you would use

select session_id||','||session_serial# SID, n.name, wait_time, time_waited
from v$active_session_history a, v$event_name n
where n.event# = a.event#

This command tells you the name of the event and how much time was spent in waiting. If you want to drill down to a specific wait event, additional columns of ASH help you with that as well. For instance, if one of the events the sessions waited on is buffer busy wait, proper diagnosis must identify the segments on which the wait event occurred. You get that from the ASH view column CURRENT_OBJ#, which can then be joined with DBA_OBJECTS to get the segments in question.

ASH also records parallel query server sessions, useful to diagnose the parallel query wait events. If the record is for a parallel query slave process, the SID of the coordinator server session is identified by QC_SESSION_ID column. The column SQL_ID records the ID of the SQL statement that produced the wait event, which can be joined with the V$SQL view to get the offending SQL statement. To facilitate the identification of the clients in a shared user environment like a web application, the CLIENT_ID column is also shown, which can be set by DBMS_SESSION.SET_IDENTIFIER.

Since ASH information is so valuable, wouldn't it be nice if it were stored in a persistent manner similar to AWR? Fortunately, it is; the information is flushed to the disk by the MMON slave to the AWR table, visible through the view DBA_HIST_ACTIVE_SESS_HISTORY.

Manual Collection

Snapshots are collected automatically by default, but you can also collect them on demand. All AWR functionality has been implemented in the package DBMS_WORKLOAD_REPOSITORY. To take a snapshot, simply issue:

execute dbms_workload_repository.create_snapshot

It immediately takes a snapshot, recorded in the table WRM$_SNAPSHOT. The metrics collected are for the TYPICAL level. If you want to collect more detailed statistics, you can set the parameter FLUSH_LEVEL to ALL in the above procedure. The stats are deleted automatically but can also be deleted manually by calling the procedure drop_snapshot_range().

Baseline

A typical performance tuning exercise starts with a capturing a baseline set of metrics, making changes, and then taking another baseline set. These two sets can be compared to examine the effect of the changes made. In AWR, the same kind of analogy can be implemented for existing snapshots taken. Suppose a particularly resource intensive process named apply_interest ran between 1:00 and 3:00PM, corresponding to snapshot IDs 56 through 59. We could define a baseline named apply_interest_1 for these snapshots:

exec dbms_workload_repository.create_baseline (56,59,'apply_interest_1')

This action marks the snapshots 56 through 59 as part of a baseline named above. Checking for existing baselines:

select * from dba_hist_baseline;

DBID BASELINE_ID BASELINE_NAME START_SNAP_ID END_SNAP_ID
---------- ----------- -------------------- ------------- -----------
4133493568 1 apply_interest_1 56 59

After a few tuning steps, we can create another baseline--called, say apply_interest_2--and compare the metrics for only those snapshots related to these two baselines. Isolating snapshots to only a few sets like this helps in studying the effects of tuning on performance metrics. You can drop the baselines after the analysis using the procedure drop_baseline(); the snapshots will be preserved. Also, when the purge routine kicks in to delete the old snapshots, the ones related to baselines are not purged, allowing for further analysis.

Conclusion

This installment was intended to be merely an introduction to the very rudimentary aspects of the AWR. For a more complete coverage, see Oracle Database 10g documentation. Furthermore, an excellent treatise on AWR and ADDM can be found in the technical whitepaper The Self-Managing Database: Automatic Performance Diagnosis. In Week 18, you will learn more about ADDM and using it to solve real-life problems.

Happy Learning !

1 comment:

Thanks for you valuable comments !