Managing Processes and Memory Consumption on Oracle Exadata Database Machines.

This post applies equally to RAC databases deployed on Oracle Database Machines and on traditional server/storage architectures, running RHEL5 or OEL5 (OEL6 has a new feature called transparent hugepages).

The white paper Best Practices for Database Consolidation on Exadata Database Machine outlines a lot of good guidelines. However, these best practices occasionally get overlooked, in some cases resulting in node evictions/restarts in the cluster.

High memory and CPU consumption can result in some really bad things happening (like, for example, node evictions). So it is important to configure the operating system, and the databases you deploy on the machine, to use the available CPU and memory resources optimally.

Let us first review the available memory and CPU resources on each database node of an Oracle Exadata Database Machine; these vary by model, and the datasheet for your machine lists them.


Any time a workload deployed on a database server exceeds the operating system's ability to efficiently use the available CPU and memory resources, bad things can happen. (I know I am oversimplifying, but you get the idea.)

Managing Memory Resources.

Setting up HugePages

First we need to make sure that HugePages on each node is set up correctly for the databases you have deployed on the node.
Tanel Poder has explained the reasons for setting up HugePages on Exadata systems.
When a database machine is installed, HugePages is configured to accommodate the one database that is pre-installed. So if you change the SGA size for that database, or deploy more databases, you should adjust the HugePages setting accordingly. It goes without saying that if you have already configured HugePages to accommodate all the databases you have, and you then remove some databases from the node or shrink their SGAs, you should again resize HugePages to free up the excess huge pages.

Once all the database instances are configured and started up, you can run the script hugepages_settings.sh from MOS Note 401749.1, which calculates the number of huge pages you need to set up. I usually recommend adding 10% to the value that hugepages_settings.sh suggests.

With 11.2.0.2 and 11.2.0.3 you should also set USE_LARGE_PAGES, preferably to ONLY, following the instructions in MOS Note 1392497.1.
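
For example, a minimal sketch of setting this parameter from SQL*Plus (the parameter is static, so it goes into the spfile and takes effect at the next restart; adjust the SID clause for your RAC setup):

sqlplus -s / as sysdba <<'EOF'
-- require the SGA to be backed entirely by huge pages (from the next restart)
ALTER SYSTEM SET use_large_pages=ONLY SCOPE=SPFILE SID='*';
EOF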

You can then follow the instructions in MOS Note 361323.1, under the section “Kernel version 2.6”, to set the value. You need to restart the databases so the SGA gets allocated from huge pages.

You should also follow the instructions in MOS Note 1546861.1 and set vm.min_free_kbytes = 524288 in /etc/sysctl.conf, to avoid page allocation failure messages when kernel memory is depleted.
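
Putting these together, here is an illustrative sketch of the /etc/sysctl.conf entries and the commands to apply and verify them (the nr_hugepages value below is made up; use the number hugepages_settings.sh suggests, plus your chosen headroom):

# /etc/sysctl.conf entries (illustrative values)
vm.nr_hugepages = 28000
vm.min_free_kbytes = 524288

# apply and verify as root
sysctl -p
grep Huge /proc/meminfo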

Managing Memory Usage

Ideally, for critical implementations, the memory allocated to the SGAs, PGAs and individual server processes should not exceed 75% of the physical memory on the database node.

The Exadata consolidation white paper mentioned above suggests the following:

For critical Hardware Pools, we recommend an even more conservative approach by not exceeding 75% physical memory per database node.
OLTP applications:
SUM of databases (SGA_TARGET + PGA_AGGREGATE_TARGET) + 4 MB * (Maximum PROCESSES) < Physical Memory per Database Node
DW/BI applications:
SUM of databases (SGA_TARGET + 3 * PGA_AGGREGATE_TARGET) < Physical Memory per Database Node
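
As an illustration (the numbers below are made up, not a recommendation), consider an OLTP node with 96 GB of physical memory hosting two databases:

SGA_TARGET + PGA_AGGREGATE_TARGET: DB1 = 16 GB + 8 GB, DB2 = 12 GB + 6 GB, for a total of 42 GB
Maximum PROCESSES across both databases = 2000, so 4 MB * 2000 = 8 GB
42 GB + 8 GB = 50 GB, which is comfortably under 72 GB (75% of the 96 GB of physical memory).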

You can monitor PGA usage in a few different ways.

1) Run a query such as the following against gv$session and gv$process:

SELECT s.inst_id, s.sid, s.serial#, p.spid, s.username, s.program, p.pga_used_mem
FROM gv$session s
JOIN gv$process p ON p.addr = s.paddr AND p.inst_id = s.inst_id
WHERE s.type != 'BACKGROUND'
AND s.program NOT LIKE '%(P%'
AND p.pga_used_mem > <APPLICATION_MEMORY_THRESHOLD>
ORDER BY s.inst_id, s.sid, s.serial#;

The value you use for APPLICATION_MEMORY_THRESHOLD depends on your application and how much PGA it needs to run your queries efficiently (performing sorts in PGA memory is usually faster than spilling to the TEMP tablespace). The white paper suggests using 1 GB for OLTP applications and 10 GB for DSS applications.

2) You can monitor the sysmetric “Total PGA Allocated” from dba_hist_sysmetric_summary to see how much PGA is being allocated over time (see the sketch after this list).

3) You can use the “PGA Memory Advisory” and “SGA Target Advisory” sections of the AWR report to gauge the impact of increasing or decreasing the SGA and PGA allocations.

4) You can monitor memory usage at the OS level using the top (Mem: free), free, or vmstat (paging/swapping columns such as si/so and bi/bo) commands.
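
For item 2 above, the following is a minimal sketch of a query against dba_hist_sysmetric_summary (assuming your AWR retention covers the period you are interested in):

sqlplus -s / as sysdba <<'EOF'
-- average and peak PGA allocated per instance, per AWR interval
SELECT instance_number, begin_time, metric_unit, average, maxval
FROM   dba_hist_sysmetric_summary
WHERE  metric_name = 'Total PGA Allocated'
ORDER  BY instance_number, begin_time;
EOF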

Managing OS Processes (and, as a consequence, CPU usage).

It is important to control the number of processes that are actively executing on each node, since each of these processes consumes CPU and memory resources.

The following guidelines are important to follow.
– Use application connection pooling to appropriately limit the number of sessions connected to the Oracle database.

This is important for two reasons.
– Right-Size the resource consumption on each database node.
– Minimize the impact of connection storms.

– Configure the parallel query servers appropriately.
– Across all the databases deployed on a node, the total of the PARALLEL_MAX_SERVERS settings should be less than or equal to the following (see the sketch after these guidelines)

X2-2 or X3-2, <= 240
X2-8 or X3-8, <= 1280

I often see implementations where parallel_max_servers = 240 (or another high value) in each instance on the machine, which has some not-so-good consequences. I have occasionally seen systems with sustained 100% CPU usage and load averages above 300, resulting in node reboots.

– Use Instance Caging and Database Resource Manager to manage CPU allocation on the server.
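
As a minimal sketch of what this looks like on one instance (the cpu_count, parallel_max_servers and plan values below are illustrative; size them using the guidelines above and the consolidation white paper):

sqlplus -s / as sysdba <<'EOF'
-- enable instance caging: a resource manager plan plus an explicit cpu_count
ALTER SYSTEM SET resource_manager_plan = 'DEFAULT_PLAN' SCOPE=BOTH SID='*';
ALTER SYSTEM SET cpu_count = 8 SCOPE=BOTH SID='*';
-- cap the parallel query servers for this database
ALTER SYSTEM SET parallel_max_servers = 32 SCOPE=BOTH SID='*';
EOF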

It is always a good idea to review the Best Practices for Database Consolidation on Exadata Database Machine white paper and follow its guidelines, to keep the database environment performing well and stable.

It is also a good idea to review the performance management screens in Enterprise Manager 12c every day, and/or analyze the OSWatcher data, to get a good understanding of the database and server workload profiles.

Graph CPU usage on Exadata using OSWatcher files

On the Oracle Database Machine, OSWatcher is installed at setup time on both the database nodes and the Exadata cells. This utility collects Linux operating system level statistics, which come in very handy when troubleshooting operating system level issues. The data is collected in text files. There is a Java based utility (OSWG) provided by Oracle Support to graph the contents of these files; however, that utility does not work on the OSWatcher files generated on Exadata.

Here is a Python script that can graph the CPU usage from the mpstat information that OSWatcher captures. It has been tested on recent OSWatcher files from an X3-2. You first need to install a Python environment that has the numpy and matplotlib modules.

Install a Python Virtualenv.

If you build multiple applications with Python and end up needing different module versions, it is easier to maintain separate virtualenvs. You can create a Python virtualenv as shown below (on Ubuntu Linux).

curl -O https://pypi.python.org/packages/source/v/virtualenv/virtualenv-1.9.1.tar.gz
tar -xzvf virtualenv-1.9.1.tar.gz
cd virtualenv-1.9.1
python virtualenv.py ../p273env2        # create the virtualenv one level up
cd ..
. p273env2/bin/activate                 # activate the new virtualenv
pip install numpy
sudo apt-get install libfreetype6-dev   # needed to build matplotlib
pip install matplotlib

Now that you have a python environment, with your required libraries, you can go ahead and execute the script as shown below.

The OSWatcher files in /opt/oracle/oswatcher are .bz2 files, and there is one file per hour. Copy the mpstat .bz2 files into a directory and use bunzip2 to unzip them. In this example, let us say the directory is /u01/oswatcher/mpstat/tmp.
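
For example (a sketch; the exact archive directory layout can vary between OSWatcher versions, so adjust the source path to match your system):

mkdir -p /u01/oswatcher/mpstat/tmp
# copy the hourly mpstat archives from the oswatcher archive directory
cp /opt/oracle/oswatcher/archive/oswmpstat/*.bz2 /u01/oswatcher/mpstat/tmp
cd /u01/oswatcher/mpstat/tmp
bunzip2 *.bz2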

You can now run the script as shown below

python parseoswmp.py  /u01/oswatcher/mpstat/tmp
or
python parseoswmp.py  /u01/oswatcher/mpstat/tmp '06/14/2013 05:00:00 AM' '06/14/2013 07:00:00 AM'

The first command graphs the CPU usage for the entire time range covered by those files, and the second graphs the CPU usage for the date and time range you have specified.

It creates a file in the current directory, named oswmpstat.png, which has the graph.

You can find the full script here.

You can find a sample output graph here.

Exadata smart flash log

The Exadata development team has now released Exadata cell software version 11.2.2.4.0, which includes a new feature called “Smart Flash Log”.

In a nutshell, this new feature speeds up redo log writes. Exadata Smart Flash Log uses the Exadata Smart Flash Cache as temporary storage to provide low latency redo log writes. With this feature enabled, Oracle writes the redo both to physical disk and to flash cache simultaneously.

So if, for some reason, the flash cache writes are slow, the writes to physical disk will provide good response times; similarly, if the physical disk writes are slow, the flash cache writes will complete faster and provide good response times.

You can use the “create flashlog” CellCLI command to turn this feature on, and the “drop flashlog” CellCLI command to turn it off (see the sketch below).
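
For example, a sketch of turning the feature on and off with CellCLI (run on each cell, or pushed to all cells with dcli from a database node; verify the exact syntax against the users guide for your cell software version):

# on an individual cell
cellcli -e "create flashlog all"
cellcli -e "list flashlog detail"
cellcli -e "drop flashlog"

# or across all cells, assuming a cell_group file listing the cell hostnames
dcli -g cell_group -l celladmin cellcli -e "create flashlog all"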

The Exadata Storage Server Software User's Guide has been updated with this information.

You need Bundle Patch 11 (it actually works from BP9 onwards, but BP11 is recommended) and Exadata cell software 11.2.2.4.0 applied to get this functionality.

Please read the section “Exadata Smart Flash Logging: Flash for Database Logging” in the Oracle technical white paper Exadata Smart Flash Cache Features and the Oracle Exadata Database Machine for details.

Here is a video tutorial produced by Oracle Education on this topic, smart flash log.

Monitoring Exadata database machine with Oracle Enterprise Manager 11g

Oracle Enterprise Manager Grid Control is hands down the best monitoring and management tool for the Oracle Exadata Database Machine. It comes with plugins to monitor all the hardware components of the database machine, and sensible preset thresholds for proactive monitoring.

Update (Nov 2011): Enterprise Manager 12c is now available and certified for use with Exadata. The master MOS note 1110675.1 covers the installation and configuration details.

Some key points
  • You should use 11gR1 Enterprise Manager Grid Control for monitoring.
  • You should use 11gR1 Enterprise Manager agents to monitor the targets on the database machine.
  • If you use enterprise-wide monitoring tools like Tivoli, OpenView or Netcool, use SNMP traps from Oracle Enterprise Manager to notify those monitoring tools (i.e. do not try to use SNMP directly to monitor the Exadata components. You could do this, but it would be too time consuming).
  • You could potentially use 10.2.0.5 OEM with 11g agents to monitor the database machine, but this is not recommended as a stable/long-term solution.
  • The following components (and more) can be monitored using Enterprise Manager
    • Database hosts
    • Exadata cells
    • Cisco switch
    • KVM (Keyboard, Video, Mouse)
    • ILOM (Integrated Lights Out Manager)
    • InfiniBand switch
    • Power distribution unit (PDU)
You have 3 possible options to configure Enterprise Manager
  • If you have an existing 11gR1 Enterprise Manager Grid Control environment, you can patch it with the recommended patches and use it to monitor the database machine targets.
  • You can set up and configure a brand new 11gR1 Enterprise Manager Grid Control environment (on a separate server) and configure it to monitor the database machine targets.
    • Download the required software
      • WebLogic Server 10.3.2 (MOS Notes 1106105.1, 1063112.1)
      • JDK 64-bit (MOS Note 1063587.1)
      • 11gR1 OMS from download.oracle.com
    • Install Java and WebLogic Server (WLS)
      • MOS Note 1063762.1
    • Patch WebLogic Server
      • MOS Note 1072763.1
    • Install the 11gR1 Enterprise Manager Oracle Management Service (OMS)
      • Install/create an 11gR2 database to serve as the Enterprise Manager repository
      • Database pre-reqs for the 11.1.0.1 repository (MOS Note 1064441.1)
      • Install/configure the OMS (MOS Notes 1130958.1, 1059516.1)
    • Patch the OMS with the required patches to enable database machine monitoring
      • MOS Note 1323298.1
  • You can use an easy-install option to set up and configure an Enterprise Manager environment and configure the plugins.
    • The easy install is delivered as patch 11852882 (EMGC setup automation kit)
    • The configuration worksheet has to be filled out properly (before the installation) and the em.param file has to be generated.
    • Follow the instructions in the readme to do a quick install of a fully configured 11gR1 Enterprise Manager installation.
    • This method helps you install, patch and configure the full 11gR1 OMS in just a few steps and is a huge time saver.
Download the required plugins to monitor the following components
Download the plugins from the Enterprise Manager Extensions Exchange
http://www.oracle.com/technetwork/database/exadata/index.html#plug-in (Exadata cell plugin)
http://www.oracle.com/technetwork/oem/grid-control/exadata-plug-in-bundle-188771.html (All the rest of the plugins)
Install and Configure the Agent and the Plugins
       Additional tutorials with screenshots on configuring the plugins can be found below
Sending SNMP traps to 3rd party monitoring tools.
  • Get the MIB (Management Information Base) file from your Enterprise Manager management server and send it to the 3rd party tool administrator (e.g. for OpenView or Netcool). Follow MOS Note 389585.1 to get this MIB file.
  • Then configure your notification methods and rules to send the required SNMP traps to the 3rd party tool.

Exadata Storage Expansion Rack.

Updated on Jan 10 2012 to include information about the availability of High Performance disks.

Oracle has announced the introduction of the “Exadata Storage Expansion Rack”. This expansion rack can be used to add storage capacity to an X2-2 or X2-8 Oracle Database Machine environment.

The press release can be found here.

The datasheet can be found here.

Some Highlights.

  • The expansion rack comes with 4 to 18 storage servers
  • The expansion rack is delivered with either the same 600 GB High Performance (HP) hard disks or the 3 TB High Capacity (HC) hard disks.
  • Quarter Rack, HC disks: 4 Exadata Storage Servers with 48 x 3 TB SAS High Capacity disks and 1.5 TB of flash cache.
    • 64 TB of uncompressed usable capacity
  • Half Rack, HC disks: 9 Exadata Storage Servers with 108 x 3 TB SAS High Capacity disks and 3.4 TB of flash cache.
    • 144 TB of uncompressed usable capacity
  • Full Rack, HC disks: 18 Exadata Storage Servers with 216 x 3 TB SAS High Capacity disks and 6.75 TB of flash cache.
    • 288 TB of uncompressed usable capacity
  • Quarter Rack, HP disks: 4 Exadata Storage Servers with 48 x 600 GB SAS High Performance disks and 1.5 TB of flash cache.
    • 13 TB of uncompressed usable capacity
  • Half Rack, HP disks: 9 Exadata Storage Servers with 108 x 600 GB SAS High Performance disks and 3.4 TB of flash cache.
    • 29 TB of uncompressed usable capacity
  • Full Rack, HP disks: 18 Exadata Storage Servers with 216 x 600 GB SAS High Performance disks and 6.75 TB of flash cache.
    • 58 TB of uncompressed usable capacity
  • Uncompressed flash cache I/O bandwidth of up to 88 GB/second on the full rack.

 

Exadata cloning using Sun ZFS Storage Appliance snapshots

Oracle customers often prefer the ease of use of storage system snapshots (think EMC CLARiiON SnapView snapshots) for cloning databases. The Oracle Sun ZFS Storage Appliance has InfiniBand connectivity and storage snapshot capabilities that can be leveraged to efficiently create database clones.

Oracle has a new white paper that outlines the steps and best practices for using Data Guard in conjunction with the ZFS Storage Appliance to create database clones using snapshot technology. This method can also be used with the Oracle Exadata Database Machine, keeping in mind a couple of caveats that are outlined in the white paper.

The white paper can be found at http://www.oracle.com/technetwork/database/features/availability/maa-db-clone-szfssa-172997.pdf

What makes the Oracle Database Machine V2 incredibly awesome?

Lots of hardware power for the database servers

– 8 Real Application Clusters database server nodes (aka compute nodes) (Sun Fire X4170s)
– 2 quad-core Intel Xeon E5540 processors (2.53 GHz) in each server (total of 64 cores across the 8 nodes)
– 72 GB of RAM on each node (total of 576 GB of RAM across the database servers)
– 3 x 36-port QDR InfiniBand switches (40 Gbit/s InfiniBand)

Lots of hardware power for the storage servers

– 14 Exadata cells (Sun Fire X4275s).
– Each cell has
– 2 quad-core Intel Xeon E5540 processors (total of 112 CPU cores across all 14 cells)
– 24 GB of RAM (total of 336 GB of RAM across all 14 cells)
– 384 GB of flash cache (PCI-E flash cards) (about 5 TB across all 14 cells)
– 12 x 600 GB SAS disks (7.2 TB) or 12 x 1 TB SATA disks (12 TB) (a total of 100 TB with SAS disks, or 168 TB with SATA disks, across all 14 cells)
The above hardware gives the database machine the ability to read data at the rate of 21 gigabytes per second, or 1,000,000 IOs per second.

Balanced Configuration

Each SAS disk drive in the Sun Fire X4275 server is a 3.5 inch, 15k RPM, SAS 2.0, 600 GB drive. Each drive has an average read capacity of at least 125 megabytes per second for sequential scans. 168 disk drives can together scan (and return) data at the rate of 21,000 megabytes per second.
The InfiniBand connections between the storage cells and the compute nodes have enough network bandwidth to transport data at the rate of 21 gigabytes per second.
The 64 CPU cores can issue I/O requests at an approximate rate of 300 megabytes per core, hence requesting about 21 gigabytes of data per second.
So the system is architected to read optimally (enough CPUs to request the I/O, enough network bandwidth to ship the I/O, and enough disk capacity to service those I/O requests). This is why the Oracle Database Machine is a well balanced system.
At a read rate of 21 gigabytes a second, 1 terabyte of data can be read by the database servers in less than a minute.

Infiniband

Each database node and Exadata cell has dual-port, quad data rate (QDR) InfiniBand connectivity.
InfiniBand is used for the database node to Exadata cell connectivity, and also for the RAC high speed interconnect (the cache fusion network).
InfiniBand has the flexibility of a LAN network with the speed of a SAN.
Oracle's interconnect protocol uses RDMA to move data from the wire directly into memory without any additional copies being made.

Exadata Cells

The Exadata cells provide highly redundant, high performance hardware along with very intelligent software to efficiently process database I/O requests.
The hardware capability was discussed in the section “Lots of hardware power for the storage servers”.
The intelligence in the Exadata cells includes Smart Flash Cache, Smart Scans, storage indexes and Exadata Hybrid Columnar Compression.
Oracle Automatic Storage Management (ASM) ensures that all Oracle database files are evenly spread across all 168 disks available in the database machine.
The Oracle database uses the iDB protocol (built by Oracle, and aptly called the Intelligent Database protocol) to communicate with the Exadata cells. iDB is built on ZDP, the zero-loss, zero-copy implementation of the industry standard Reliable Datagram Sockets protocol (RDSv3).

Flash cache

Random read operations are cached in the 5 TB of flash cache available in the database machine, significantly improving OLTP performance.
The Exadata Smart Flash Cache, working with the database server, keeps track of data access patterns and intelligently manages the caching of blocks from the Oracle datafiles.
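
If you want to see how the flash cache is configured and being used on a cell, a quick sketch using CellCLI (run on the cell; attribute and metric names may differ slightly between cell software versions):

cellcli -e "list flashcache detail"
cellcli -e "list metriccurrent where objectType = 'FLASHCACHE'"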

Smart Scans

The Oracle database server uses the iDB protocol to push query predicates (which limit the data retrieved by the query using filters and join conditions) down to the Exadata cells.
This enables the cell to do three things
– Identify the rows that are needed by the query and ship only those rows back to the database server (not entire blocks)
– Identify the columns needed by the query and ship only the required columns of those rows back to the database server
– Use Bloom filters to process join conditions and ship only the matching rows back to the database server for queries with joins
This drastically reduces the amount of data sent back to the database server (reducing network usage).
Transferring blocks from the disks inside a cell to the physical memory of the cell can happen relatively fast. If the data that is not needed by the database server can be eliminated at the cell, then the amount of data that needs to go over the network to the database server is significantly reduced. So smart scans reduce the network I/O between the database servers and the Exadata cells.

Storage Indexes

Smart scans, however, do not reduce the disk I/O within the cell (i.e. the transfer from disk to the cell's physical memory).
Oracle maintains an in-memory array of structures that keep track of the minimum and maximum values of columns (columns used in the where clause that benefit from storage indexes), which lets Oracle identify whether specific 1 MB regions are needed, based on the filtering conditions applied to the tables.
So a storage index is a filter Oracle applies to prune away 1 MB chunks that do not have to be read.
This reduces the I/O within the Exadata cell.
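
To get a feel for how much work smart scans and storage indexes are saving on a running system, you can check the cell I/O statistics (a minimal sketch; these statistic names are as exposed by 11.2 on Exadata, and you can query v$sesstat instead for a single session):

sqlplus -s / as sysdba <<'EOF'
SELECT name, value
FROM   v$sysstat
WHERE  name IN ('cell physical IO bytes eligible for predicate offload',
                'cell physical IO interconnect bytes returned by smart scan',
                'cell physical IO bytes saved by storage index');
EOF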

Exadata Hybrid Columnar Compression

Traditional relational databases store data in database blocks in a “row” format. This type of storage limits the amount of compression that can be achieved.
Column store databases organize and store data by column. Storing column data together, with the same datatype and similar characteristics, allows significant compression to be achieved. However, if a query refers to more than a couple of columns in the table, or does more than modest updates and inserts, those queries and DML tend to have slower performance.
Exadata Hybrid Columnar Compression takes a blended approach. Oracle takes rows that fit into multiple blocks (called a compression unit), converts the rows into columns and stores the data in a columnar format within the compression unit.
Oracle uses 3 different compression formats and different transformations, depending on the compression level you have chosen.
There are 4 levels of compression (each level is a tradeoff between compression ratio and compression speed)
– Query Low
– Query High (the default when you specify COMPRESS FOR QUERY)
– Archive Low
– Archive High
With Exadata Hybrid Columnar Compression
1) The amount of storage required to store massive amounts of data can potentially be decreased by a factor of 10.
2) The amount of I/O to be issued (for queries that scan very, very large tables) is significantly reduced.
Together, these features are what make the Oracle Database Machine rock at database performance.
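
As a quick illustration of the syntax (the table names below are made up), a table can be created with, or moved into, a hybrid columnar compressed format by choosing one of the four levels above:

sqlplus -s / as sysdba <<'EOF'
-- illustrative only: pick the level based on the query/archive tradeoffs discussed above
CREATE TABLE sales_hcc COMPRESS FOR QUERY HIGH
AS SELECT * FROM sales;

ALTER TABLE sales_history MOVE COMPRESS FOR ARCHIVE LOW;
EOF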