Rac Starter Kit

Oracle has for a while had a rac assurance team (A team within oracle support, in the HA/Rac support team) that engages proactively with new rac customers. The Rac Assurance team used to provide the new customers with a “starter kit” of documents that  include

  1. A Rac best practices document
  2. A Step by Step installation guide
  3. Recommended patches
  4. A test plan.

If the customer follows these best practices, it sets them up with a solid foundation to be successful with their new  Rac implementation.

Now these test kits are public, you can access them by accessing metlink note 810394.1

You can also log a tar in metalink and ask support for the “Rac Starter Kit”, and they will give you the platform specific starter kit that includes the list of recommended patches.

Rac how to determine interconnect speed

During a Recent  Oracle 11g Rac installation on Solaris, i ran into  the following issue. After installing and configuring oracle clusterware, when we were trying to create the ASM instance, the ASM instance would only stay alive on one node of the cluster. The customer had configured the private interconnect to be a 100 base T connection (As opposed to GiGE). Once the customer re-configured the interconnect to be a GiGE, the ASM instance came up properly. Oracle recommends that you have a GiGE connection for your private interconnect.

Before starting your installation you can check if the interface you are using for the private interconnect, is configured to be a GIGE connection.

On Redhat or Oracle Enterprise Linux

Install the rpm ethtool

ethtool <interfacename> | grep Speed ,will give you the speed of the interface

On Solaris



kstat <interfacename> | grep link_speed  ,will give you the speed of the interface

ORA-00845 Memory_Target Not supported on this system

I was working on testing some 11g streams configurations today. I needed to startup 3 databases instances on the same server. I was using AMM (Automatic memory management). When i was trying to startup the 3rd database, i kept getting the error message “ORA-00845: MEMORY_TARGET not supported on this system”. I also had error messages in the alert log.

This is because, the space allocated for /dev/shm is not sufficient to allocate the SGA+PGA’s for all the 3 database instances (When using the initialization parameter memory_target). The space allocated needs to be >= the total SGA+PGA size of all the 3 instances together.

You can increase the space allocated using the command  “mount -t tmpfs shmfs -o size=2000m /dev/shm” ( I had 3 instances 600mb each SGA+PGA). You can persist this allocation across reboots by adding it to the /etc/fstab.

Determining the network bandwidth required for a dataguard physical standby implementation

References

Network bandwidth Implications of Oracle Dataguard

Dataguard Redo Transport & network configuration

Determine the redo generation rate

– You can query dba_hist_sysstat to find out your redo generation rate in bytes.
– You can choose to use either your average redo generation per hour or peak redo generation per hour (I would recommend that you size for the peak redo generation).
– Let us say that you determined that the redo generation rate, in Bytes, PER DAY, happened to be RedoBytes.
– You have to add a 30% overhead for tcp. So RedoBytesPlusOverhead=RedoBytes*1.3

Convert redo generation rate to Mbps (Megabitspersecond)

– (RedoBytesPlusOverhead*8)/((24*60*60)*(1024*1024))
– This is the theoretical minimum bandwidth that you are going to require.

Other important considerations

– Network latency is a huge factor in the amount of redo you will be able to transport from your primary site to the standby site. This value is unique to your network, so if you have a high latency network you might not be able to sustain the required rate of redo shipping.
– Usually the wide area network between the primary site and standby site is used by more than just dataguard (eg: e-mail etc). So those bandwidth requirements have to be factored in.
– The above two points is why customers should not rely too much on theoretical calculations and to actually deploy dataguard and test the actual redo generation and redo transport performance statistics.
– If you do not deploy a network that can ship redo at a rate of 45 mbps, all that means is that, at times your redo shipping will fall behind (ie your standby site will be behind the primary site) but dataguard still works. In a lot of cases this is acceptable (Based on the customers recovery point objective and recovery time objective).
– There are network tuning best practices outlined in “Dataguard Redo Transport & network configuration” , that you are optimizing the redo transport mechanism and the network. These have to be followed to achieve the best possible network performance.
– There are other techniques like network compression (Hardware compression using wan compression devices, or actual software compression in dataguard 11g) which enable you to reduce the network bandwidth requirements.

Scripts

– You can run the following script to extract redo generation information from dba_hist_sysstat.

set pages 0
set head off
set lines 132
set colsep ~
col curval format 9999999999999999999999
col prevval format 9999999999999999999999
Select
sn.snap_id
,cur_stat.snap_id
,prev_stat.snap_id
,to_char(sn.begin_interval_time,’DD-MON-YY HH24′)
,to_char(sn.end_interval_time,’DD-MON-YY HH24′)
,cur_stat.value curval
,prev_stat.value prevval
,(cur_stat.value-prev_stat.value) RedoGen
from dba_hist_snapshot sn,
(select snap_id,value from dba_hist_sysstat
where stat_name = ‘redo size’) cur_stat
,(select snap_id,value from dba_hist_sysstat where
stat_name = ‘redo size’) prev_stat
Where sn.snap_id = cur_stat.snap_id
and cur_stat.snap_id = prev_stat.snap_id + 1 order by 1;

– Spool the contents into a file RedoInfo.dat

– Create a table in the oracle database named RedoInfo

create table redoinfo (
inid    number,
bdate    date,
edate    date,
totredo    number
);

– Use sql*loader to load the contents of the spool file into redoinfo (At this point some would ask, “why dont i just do a create table as in the same database”, my assumption is that you probably dont want to be creating these temp tables in a production env.).

load data
infile ‘RedoInfo.dat’
append into table RedoInfo
fields terminated by “~” optionally enclosed by ‘”‘
(
field1 filler,
field2 filler,
inid,
bdate  Date “DD-MON-YY HH24”,
edate  Date “DD-MON-YY HH24”,
field3 filler,
field4 filler,
totredo
)

– Then you can run all kinds of queries on this to learn the different charachteristics of your redo generation
– The query below gives you the total redo generation per day and  the Mbps

select to_char(edate,’dd-mon-yy’) Day,sum(totredo)/(1024*1024) TotRedoMb
,(sum(totredo)*1.3)/(1024*1024) RedoPlusOvrHd,((sum(totredo)*1.25)*8)/(1024*1024) Mbits
,round(((sum(totredo)*1.25)*8)/(24*60*60*1024*1024)) Mbps FROM RedoInfo
group by to_char(edate,’dd-mon-yy’)
order by 1;

11g Snapshot Standby

Oracle 11g provides customers the “Snapshot Standby” database feature, so that customers can leverage their investments in standby database systems. When leveraging the “Snapshot Standby” feature, customers can temporarily open up the standby database and run their tests on this system. For example, customer might want to test an application upgrade that alters tables or insert/update/deletes data.

When you convert a physical standby database to a snapshot standby in 11g, the database automatically creates a guaranteed restore point (Obviously “Flashback Database” has to be enabled in order to do this). Once you are done with your testing, the database gets automatically flashed back to this guaranteed restore point and then the log apply starts again.

Point to note with the snapshot standby database is that, in 11g the log transport continues to work. So all the logs are being received on the standby database and gap resolution continues to work (So in case of a primary failure, the logs just need to get applied on the standby). In 10g you could open up the standby database and test on it, but the log transport used to stop.

Let us look at some of the steps involved in setting up and using a “Snapshot standby” database in 11g. For similar functionality with 10g refer to the following documents.

Using 10g Physical standby database for Read/Write Testing and Reporting

10g Snapshot Standby – Dell Case Study

First follow all the normal steps that you would in setting up a physical standby database.
Make sure that you enable “Flashback Database” on both the primary and the standby.
Make sure that log shipping and apply are working properly.
In our example RK01 is the primary database and RK01DG is the standby database.

Check the salary for empno 7934

Update the emp table on the primary database
Switch the logfile to ensure that the change gets propagated and applied on the standby database.

SQL> Update scott.emp set sal=1310 where empno=7934;
SQL> Commit;
SQL> connect / as sysdba
SQL> alter system switch logfile;

Login using sqlplus to the standby database RK01DG

Check the role of this standby database

Use the command “alter database convert to snapshot standby” to try and switch the database to a “Snapshot Standby”

Since the database is in managed recovery mode you get an error.
Cancel the managed recovery
Then convert the database using the “alter database convert to snapshot standby”
This command leaves the standby database in nomount mode.
Shutdown the database and start it back up (In my screen shot i am starting up in mount mode, you can start it all the way up)

Check the role of the standby database, it should say “Snapshot Standby”
Query v$restore_point to see the details of the guaranteed restore point that it created

List the columns in the employee table on the primary database (RK01). This is just for reference, because as part of our testing we are going to add columns to this table.

Login as the scott user to the standby database and run some transactions

Now that testing is over, we can convert this back to the physical standby database.
Issue the “alter database to convert to physical standby” command to convert this from a “Snapshot Standby” to a “Physical standby” database.

Since the database is in open mode, it complains.
Restart the database in mount mode and then issue the same convert command again

Once the command succeeds the database has to be restarted.
Startup the database in mount mode and put the database in managed recovery mode.
Check the role of the standby database. It should now say “physical standby”.

Run transactions on primary to make sure that log transport and apply are working

Put the standby database in read only mode and check if the transactions got applied on the standby

Also make sure that the changes made during testing have been rolled back.

Those are the basic steps to setup and test a “Snapshot Standby” database.

11g Active Dataguard

Oracle introduced the Active Dataguard option in 11g, to allow customers to have read-only access to the physical standby databases. This option gives customers the flexibility to offload resource intensive queries to the physical standby database, while the log shipping and apply continues its work.

In oracle 10g in order to run queries on database tables, in a physical standby database, one had to stop the log apply and open the database in read-only mode. This means that the log files are being queued up during that period and will have to be applied once the database is reverted to the managed recovery mode.

The steps to follow in order to setup Active dataguard are exactly the same as setting up a physical standby database (Creating a Physical Standby). Once the physical standby database is configured and log transport and log apply are working correctly, you can just do an “alter database open” to enable active dataguard.

Let us look at an example.

I have a primary database RK01 and a regualar physical standby database RK01DG.

If you try to query the table emp while the database is in managed recovery mode you will get an error

Make sure that you can update the table emp from the primary database.

In order to enable Active Dataguard, ie enable the ability to query the database while log apply is in progress. Issue the following commands

The main concept to note here is that you are doing a “alter database open” (As opposed to an “alter database open read only”).

Update the table emp again to check if the values you are updating to on the primary are query able on the standby (ie log apply is continuing)

Check the updated value on the standby

Try updating the emp table on the standby database (It should only allow you to query , not update)

The procedure’s to switchover and failover remain exactly the same as before (Irrespective of whether you are in active dataguard mode or not).

11g Data Recovery Advisor

Oracle 11g introduces a new feature called the Data Recovery Advisor. The goal of this advisor is to provide a mechanism for guided error detection and recovery for the oracle database. DRA automatically detects persistent data failures, presents repair options to the user and executes repair options upon user request.

When a user executes a database operation that results in an error, a database integrity check is triggered, which then checks for failures related to that error in the database. If failures are diagnosed they are recorded in the Automatic Diagnostics Repository (ADR).

There following major commands in rman that would be used to detect and correct errors.

  • list failure
  • advise failure
  • repair failure

Let us look at an example where the data recovery advisor is used to recover from database block corruptions.

Create a 11gR1 Database RK01 using DBCA. Set the database to be in archivelog mode.

Use the script $ORACLE_HOME/demo/schema/human_resources/hr_main.sql to create the human resources schema in the database. Use the tablespace “USERS” for the hr schema tables.

Take a full rman backup of the database

export ORACLE_SID=RK01
rman target /

run {
allocate channel d1 type disk;
setlimit channel d1 kbytes 2097150;
backup incremental level 0 format ‘/u01/app/oraback/df_%U’ database
include current controlfile;
sql “alter system archive log current”;
backup
archivelog all format ‘/u01/app/oraback/al_%U’;
}

Introduce corruption in a datafile by overwriting 3 oracle database blocks.

dd of=/u01/app/oradata/RK01/users01.dbf bs=8192 conv=notrunc seek=12 << EOF
CORRUPT corrupt CORRUPT corrupt CORRUPT corrupt CORRUPT corrupt CORRUPT corrupt
EOF

dd of=/u01/app/oradata/RK01/users01.dbf bs=8192 conv=notrunc seek=84 << EOF
CORRUPT corrupt CORRUPT corrupt CORRUPT corrupt CORRUPT corrupt CORRUPT corrupt
EOF

dd of=/u01/app/oradata/RK01/users01.dbf bs=8192 conv=notrunc seek=36 << EOF
CORRUPT corrupt CORRUPT corrupt CORRUPT corrupt CORRUPT corrupt CORRUPT corrupt
EOF

Trigger Error Detection by running a validate database

rman target /

validate database;
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
—- —— ————– ———— ————— ———-
4 FAILED 0 473 640 555120
File Name: /u01/app/oradata/RK01/users01.dbf
Block Type Blocks Failing Blocks Processed
———- ————– —————-
Data 0 43
Index 0 20
Other 3 104

validate found one or more corrupt blocks
See trace file /u01/app/oracle/diag/rdbms/rk01/RK01/trace/RK01_ora_7452.trc for details
channel ORA_DISK_1: starting validation of datafile
channel ORA_DISK_1: specifying datafile(s) for validation
including current control file for validation
channel ORA_DISK_1: validation complete, elapsed time: 00:00:01
List of Control File and SPFILE
===============================
File Type Status Blocks Failing Blocks Examined
———— —— ————– —————
Control File OK 0 594
Finished validate at 27-MAY-08

List all the failures

RMAN> list failure;

List of Database Failures
=========================

Failure ID Priority Status Time Detected Summary
———- ——– ——— ————- ——-
146 HIGH OPEN 27-MAY-08 Datafile 4: ‘/u01/app/oradata/RK01/users01.dbf’ contains one or more corrupt blocks

List the details for the failure

RMAN> list failure detail;

List of Database Failures
=========================

Failure ID Priority Status Time Detected Summary
———- ——– ——— ————- ——-
146 HIGH OPEN 27-MAY-08 Datafile 4: ‘/u01/app/oradata/RK01/users01.dbf’ contains one or more corrupt blocks
Impact: Some objects in tablespace USERS might be unavailable
List of child failures for parent failure ID 146
Failure ID Priority Status Time Detected Summary
———- ——– ——— ————- ——-
161 HIGH OPEN 27-MAY-08 Block 84 in datafile 4: ‘/u01/app/oradata/RK01/users01.dbf’ is media corrupt
Impact: Object LOCATIONS owned by HR might be unavailable
155 HIGH OPEN 27-MAY-08 Block 36 in datafile 4: ‘/u01/app/oradata/RK01/users01.dbf’ is media corrupt
Impact: Object PK_EMP owned by SCOTT might be unavailable
149 HIGH OPEN 27-MAY-08 Block 12 in datafile 4: ‘/u01/app/oradata/RK01/users01.dbf’ is media corrupt
Impact: Object DEPT owned by SCOTT might be unavailable

Ask Rman to provide recovery advice

RMAN> advise failure;
….

….

Automated Repair Options
========================
Option Repair Description
—— ——————
1 Recover multiple corrupt blocks in datafile 4
Strategy: The repair includes complete media recovery with no data loss
Repair script: /u01/app/oracle/diag/rdbms/rk01/RK01/hm/reco_2336332336.hm

RMAN>

Ask Rman to do the necessary to repair the failure

RMAN> repair failure;

Strategy: The repair includes complete media recovery with no data loss
Repair script: /u01/app/oracle/diag/rdbms/rk01/RK01/hm/reco_2336332336.hm

contents of repair script:
# block media recovery for multiple blocks
recover datafile 4 block 12, 36, 84;

Do you really want to execute the above repair (enter YES or NO)? YES
executing repair script

Starting recover at 27-MAY-08
using channel ORA_DISK_1

channel ORA_DISK_1: restoring block(s)
channel ORA_DISK_1: specifying block(s) to restore from backup set
restoring blocks of datafile 00004
channel ORA_DISK_1: reading from backup piece /u01/app/oraback/df_01jhdo2k_1_1
channel ORA_DISK_1: piece handle=/u01/app/oraback/df_01jhdo2k_1_1 tag=TAG20080527T093723
channel ORA_DISK_1: restored block(s) from backup piece 1
channel ORA_DISK_1: block restore complete, elapsed time: 00:00:01

starting media recovery
media recovery complete, elapsed time: 00:00:03

Finished recover at 27-MAY-08
repair failure complete

Check that the database is Fine after recovery

RMAN> list failure;

no failures found that match specification