Subscribing to PostgreSQL logical replication using Python and psycopg2

When PostgreSQL is used as a transactional database, there are use cases where data changes from the transactional database are captured and sent to other databases, like your data mart or data warehouse. You could use cloud services like AWS DMS or replication software like Debezium to do this. In this blog post I will show you how to use Python to read changes (CDC, change data capture) from a PostgreSQL database, using the wal2json output plugin and psycopg2.

When you are compiling PostgreSQL from source code, you can build and install the output plugins test_decoding and wal2json as shown below.

            cd /home/postgres/tmp/postgresql-14.4/contrib/test_decoding
            make PG_CONFIG=/u01/pg/14/bin/pg_config
            make PG_CONFIG=/u01/pg/14/bin/pg_config install
            cd /home/postgres/tmp/postgresql-14.4/
            wget https://github.com/eulerto/wal2json/archive/refs/tags/wal2json_2_4.tar.gz
            tar -xzvf wal2json_2_4.tar.gz
            cd wal2json-wal2json_2_4/ 
            make PG_CONFIG=/u01/pg/14/bin/pg_config
            make PG_CONFIG=/u01/pg/14/bin/pg_config install

where /home/postgres/tmp/postgresql-14.4 is the directory into which you untarred the PostgreSQL source code, before compiling and installing PostgreSQL to /u01/pg/14.

In order to proceed, you need to have the psycopg2 library installed for Python 3 (e.g. pip install psycopg2).

You also need to make sure that the parameter wal_level is set to 'logical' in the postgresql.conf file of your PostgreSQL database (changing wal_level requires a restart).
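
You can verify the current setting from psql before you start:

SHOW wal_level;
-- if this does not return 'logical', set wal_level = logical in postgresql.conf and restart the database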

Create a table named books in your PostgreSQL database.

create table books (bookid bigint primary key,bookname varchar(100));

Insert a few rows into the table.

insert into books values (1,'First Book');
insert into books values (2,'Second Book');
insert into books values (3,'Third Book');

The Python library psycopg2 has a module named extras, which provides helpers to read from PostgreSQL logical replication publishers. We will be using the functions create_replication_slot, start_replication and consume_stream from this module to create the publisher and subscriber for logical replication.

Here is the code sample for pglogical.py

from __future__ import print_function
import sys
import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    'host=localhost user=postgres port=5432',
    connection_factory=psycopg2.extras.LogicalReplicationConnection)
cur = conn.cursor()
replication_options = {
'include-xids':'1',
'include-timestamp':'1',
'pretty-print':'1'
}
try:
    cur.start_replication(
        slot_name='pytest', decode=True,
        options=replication_options)
except psycopg2.ProgrammingError:
    cur.create_replication_slot('pytest', output_plugin='wal2json')
    cur.start_replication(
        slot_name='pytest', decode=True,
        options=replication_options)


class DemoConsumer(object):
    def __call__(self, msg):
        print(msg.payload)
        msg.cursor.send_feedback(flush_lsn=msg.data_start)

democonsumer = DemoConsumer()

print("Starting streaming, press Control-C to end...", file=sys.stderr)
try:
   cur.consume_stream(democonsumer)
except KeyboardInterrupt:
   cur.close()
   conn.close()
   print("The slot 'pytest' still exists. Drop it with "
         "SELECT pg_drop_replication_slot('pytest'); if no longer needed.",
         file=sys.stderr)
   print("WARNING: Transaction logs will accumulate in pg_xlog "
         "until the slot is dropped.", file=sys.stderr)

The code above is a modified version of the code published by Marco Nenciarini here.

The code above uses the wal2json output plugin. You can change it to use the test_decoding output plugin, but you will also have to change the replication_options variable to options supported by test_decoding.
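
For example, a minimal sketch of the changes for test_decoding (the option names below are the ones test_decoding documents; check the documentation for your PostgreSQL version, and note that test_decoding emits plain text rather than JSON, so the consumer would have to parse a different format):

# test_decoding options instead of the wal2json ones
replication_options = {
'include-xids':'on',
'include-timestamp':'on'
}
cur.create_replication_slot('pytest', output_plugin='test_decoding')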

You can run the following command to create the replication slot and start the subscriber.

python3 pglogical.py

You will see a prompt saying “Starting streaming, press Control-C to end…”

Let us now make a few changes to the books table.

insert into books values (5,'Fifth Book');

do $$
<<first_block>>
begin
    insert into books values (6,'Sixth Book');
    delete from books where bookid = 3;
end first_block $$;

If you now go back to the screen where you ran your python program, you can see the following messages on screen.

{
        "xid": 741,
        "timestamp": "2022-07-11 21:29:37.301299+00",
        "change": [
                {
                        "kind": "insert",
                        "schema": "public",
                        "table": "books",
                        "columnnames": ["bookid", "bookname"],
                        "columntypes": ["bigint", "character varying(100)"],
                        "columnvalues": [5, "Fifth Book"]
                }
        ]
}
{
        "xid": 742,
        "timestamp": "2022-07-11 21:33:07.913942+00",
        "change": [
                {
                        "kind": "insert",
                        "schema": "public",
                        "table": "books",
                        "columnnames": ["bookid", "bookname"],
                        "columntypes": ["bigint", "character varying(100)"],
                        "columnvalues": [6, "Sixth Book"]
                }
                ,{
                        "kind": "delete",
                        "schema": "public",
                        "table": "books",
                        "oldkeys": {
                                "keynames": ["bookid"],
                                "keytypes": ["bigint"],
                                "keyvalues": [3]
                        }
                }
        ]
}

These are the changes the Python subscriber program is reading from the PostgreSQL logical replication publisher.

You can then write these changes either to a CSV file or to another database, as you choose.
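
As an illustration, here is a minimal sketch of a consumer that appends each change to a CSV file instead of printing it. It assumes the wal2json payload shape shown above; the file name changes.csv and the column layout are arbitrary choices for this example.

import csv
import json

class CsvConsumer(object):
    def __init__(self, filename):
        self.filename = filename

    def __call__(self, msg):
        payload = json.loads(msg.payload)
        with open(self.filename, 'a', newline='') as f:
            writer = csv.writer(f)
            for change in payload.get('change', []):
                # one row per change: kind, schema, table, then the new column values
                # (or the old key values for deletes)
                values = change.get('columnvalues') or change.get('oldkeys', {}).get('keyvalues', [])
                writer.writerow([change['kind'], change['schema'], change['table']] + values)
        # acknowledge the message so the replication slot can advance
        msg.cursor.send_feedback(flush_lsn=msg.data_start)

# pass an instance to consume_stream instead of DemoConsumer:
# cur.consume_stream(CsvConsumer('changes.csv'))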

Substitution variables in psql scripts

As PostgreSQL users and administrators, we tend to create a lot of scripts, and run them routinely, for common queries we need to execute against the database. It is likely that in some of those scripts you would want to parameterize values used in filter conditions. Below is an example of how this can be done.

psql is the terminal-based front end tool to interact with PostgreSQL. You can run scripts stored in files in the file system using the \i directive in psql. You can use the \set command to set variables in psql. If you want to prompt the user to enter a value, you can use the \prompt directive.

\prompt 'Enter Table Name : ' tabname


select last_vacuum,last_autovacuum,last_analyze ,last_autoanalyze,n_live_tup,n_dead_tup
from pg_stat_user_tables
where relname = :'tabname';

Now if you run this from psql, you will first be prompted for the table name, and then it will display the results for the table name you entered.

postgres=# \i pgstats.sql
Enter Table Name : nflstats

 last_vacuum | last_autovacuum |         last_analyze         |       last_autoanalyze        | n_live_tup | n_dead_tup 
-------------+-----------------+------------------------------+-------------------------------+------------+------------
             |                 | 2020-06-08 18:07:36.29538+00 | 2020-06-08 15:37:48.738104+00 |     270418 |          0
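
If you do not want an interactive prompt, the same query can take its value from \set, or from the -v option on the psql command line; a minimal sketch (the file name pgstats_noprompt.sql is just for illustration):

-- pgstats_noprompt.sql
select last_vacuum,last_autovacuum,last_analyze,last_autoanalyze,n_live_tup,n_dead_tup
from pg_stat_user_tables
where relname = :'tabname';

Run it either by passing the variable on the command line:

psql -v tabname=nflstats -f pgstats_noprompt.sql

or from within psql:

\set tabname nflstats
\i pgstats_noprompt.sql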

Listing errors from CloudWatch Logs using the AWS CLI

The following commands can be used to list the error messages from CloudWatch logs produced by DMS (Database Migration Service).

First list the log groups

aws logs describe-log-groups

Next list the log streams in your log group

aws logs describe-log-streams --log-group-name <YourLogGroupNameHere>

Next, list the error messages. Within the DMS log, errors are indicated with the pattern "E:" within the message string, so that is the pattern we search for.

aws logs filter-log-events --log-group-name <YourLogGroupNameHere> --log-stream-names <YourLogStreamNameHere> --filter-pattern "[message = \"*E:*\"]" --query 'events[*].message'

If you are searching CloudWatch logs produced by other services, you should replace E: with the pattern that flags error messages for that service.
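
If you only care about a particular time window, filter-log-events also accepts --start-time and --end-time, expressed as milliseconds since the epoch; a rough sketch (the timestamps and the GNU date usage are illustrative, adjust for your platform):

START=$(date -d '2022-07-11 00:00:00' +%s)000
END=$(date -d '2022-07-11 23:59:59' +%s)000
aws logs filter-log-events --log-group-name <YourLogGroupNameHere> --log-stream-names <YourLogStreamNameHere> --filter-pattern "[message = \"*E:*\"]" --start-time $START --end-time $END --query 'events[*].message'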

Script to compare tkprof output files

I often use the Oracle 10046 event tracing mechanism to capture SQLs from a session, to identify why certain transactions are running slower in different environments or at different points in time. Oracle does have a mechanism where you can save the trace information in database tables: one can use the INSERT parameter in tkprof to store the trace information into a database table. Once in the table, you can write SQL that compares multiple runs or multiple statements.
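
As a rough sketch of that tkprof/INSERT approach (the file names and credentials are placeholders; check the tkprof documentation for your release):

# generate the formatted report plus a SQL script that loads the trace statistics into a table
tkprof ora_12345.trc ora_12345.prf insert=load_trace.sql sys=no
# run the generated script against the database to populate the table
sqlplus scott/tiger @load_trace.sql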

I wrote a Python program that compares two different tkprof output files. The files are compared, and the following aspects of each of the SQL IDs in the tkprof output files are printed side by side. The output is sorted by the difference in elapsed time, in descending order.

  • Sql text
  • Plan Hash Value
  • Total Elapsed time
  • Total Logical Reads
  • Total Rows processed
  • Difference in Elapsed Time
  • Difference in Number of Rows processed
  • Difference in Logical reads.

Other columns can be added to this, if you desire.
I use this script output as a quick way to see which SQLs are running slower and are probably candidates for further analysis/tuning.

The SQL IDs from the file provided as the first argument to the script (referred to as the left file) are compared to the same SQL IDs in the file provided as the second argument (referred to as the right file). The following columns are displayed.

sqlid        sqlid being compared
text         First 20 chars of the sql text
lplan        Plan hash value from the left file
rplan        Plan hash value from the right file
lela         Total elapsed time from the left file
rela         Total elapsed time from the right file
llreads      Total logical reads (query+current) from the left file
rlreads      Total logical reads (query+current) from the right file
lrows        Total rows processed from the left file
rrows        Total rows processed from the right file
eladiff      lela - rela
rowsdiff     lrows - rrows
lreadsdiff   llreads - rlreads

Here is a sample syntax for running the script. (You need the python pandas package to be installed for this to execute successfully)

python ./difftk.py /u01/tkprofout/Newplans.prf /u01/tkprofout/Stage.prf

Here is a sample output

difftk sample output screenshot (click on the image to view a larger version).

The full script is below

#Python script to list differences between sql executions in two tkprof output files
#useful if comparing tkprof from prod and dev for example
#Author : rajeev.ramdas

import sys
import os
import pandas as pd
from pandas import DataFrame

# Define a class to hold the counters for each sqlid
class sqliddet:
    def __init__(self):
        self.sqlid=''
        self.text=''
        self.plan_hash=''
        self.tcount=0
        self.tcpu=0
        self.tela=0
        self.tdisk=0
        self.tquery=0
        self.tcurr=0
        self.trows=0

# Define 2 empty dictionaries to store info about each input file
leftsqliddict={}
rightsqliddict={}

# Process each file and add one row per sqlid to the dictionary
# We want to add the row to the dictionary only after the SQLID row and the total row has been read
# So the firstsqlid flag is used to make sure that we do not insert before total is read for the first row.

def processfile(p_file,p_sqliddict):

    myfile=open(p_file,"r")
    line=myfile.readline()
    firstsqlid=True
    linespastsqlid=99
    while line:
        linespastsqlid+=1
        line=myfile.readline()
        if line.startswith('SQL ID'):
            linespastsqlid=0
            if firstsqlid==True:
                firstsqlid=False
            else:
                p_sqliddict[currsqlid.sqlid]=[currsqlid.plan_hash,currsqlid.tcount,currsqlid.tcpu,currsqlid.tela,currsqlid.tdisk,currsqlid.tquery
                            ,currsqlid.tcurr,currsqlid.trows,currsqlid.text]
            currsqlid=sqliddet()
            currsqlid.sqlid=line.split()[2]
            currsqlid.plan_hash=line.split()[5]
        if linespastsqlid==2:
            currsqlid.text=line[0:20]
        if line.startswith('total'):
            a,currsqlid.tcount,currsqlid.tcpu,currsqlid.tela,currsqlid.tdisk,currsqlid.tquery,currsqlid.tcurr,currsqlid.trows=line.split()
        if line.startswith('OVERALL'):
            p_sqliddict[currsqlid.sqlid]=[currsqlid.plan_hash,currsqlid.tcount,currsqlid.tcpu,currsqlid.tela,currsqlid.tdisk,currsqlid.tquery
                       ,currsqlid.tcurr,currsqlid.trows,currsqlid.text]
        continue
    myfile.close()

# Main portion of script
if len(sys.argv) != 3:
   print('Syntax : python ./difftk.py tkprof1.out tkprof2.out')
   sys.exit()

if not os.path.isfile(sys.argv[1]) or not os.path.isfile(sys.argv[2]):
   print ("File Does not Exist")
   sys.exit()

processfile(sys.argv[1],leftsqliddict)
processfile(sys.argv[2],rightsqliddict)

t_difftk_lst=[]

# Match the sqlid's from the file on the left to the file on the right
# Gather up the statistics form both sides, insert into a list
# Transfer the list to a pandas dataframe, add some computed columns

for sqlid,stats in leftsqliddict.items():
    l_totlogical=int(stats[5])+int(stats[6])
    if sqlid in rightsqliddict:
       t_difftk_lst.append([sqlid,stats[8].rstrip(),stats[0],rightsqliddict[sqlid][0],float(stats[3])
                            ,float(rightsqliddict[sqlid][3]),float(l_totlogical),float(rightsqliddict[sqlid][5])+float(rightsqliddict[sqlid][6])
                            ,float(stats[7]),float(rightsqliddict[sqlid][7])
                           ])
    else:
       t_difftk_lst.append([sqlid,stats[8].rstrip(),stats[0],0,float(stats[3]),0
                            ,float(l_totlogical),0,float(stats[7]),0
                           ])

difftk_df=DataFrame(t_difftk_lst,columns=['sqlid','sqltext','lplan','rplan','lela','rela','llreads','rlreads','lrows','rrows'])
difftk_df['eladiff']=difftk_df['lela']-difftk_df['rela']
difftk_df['rowsdiff']=difftk_df['lrows']-difftk_df['rrows']
difftk_df['lreadsdiff']=difftk_df['llreads']-difftk_df['rlreads']

pd.set_option('display.width',1000)
# DataFrame.sort was removed in newer pandas versions; sort_values does the same thing here
print(difftk_df.sort_values(by='eladiff', ascending=False))

Using Pandas for CSV Analysis

As Oracle DBAs, we often come across situations where we are handed CSV (comma separated values) files by our managers or customers as raw data, based on which we need to do some work. The first task would be to analyze the file and come up with some summary statistics, so we can quantify the amount of work involved.

When faced with such circumstances, my favorite method is to use SQL*Loader to upload the file into a database table, and then run SQL statements on the data to produce summary info. Most people would probably use Excel formulas, macros and pivot tables to achieve similar results.

In this blog post, I present an alternate method that I've been using recently for CSV file summarization.

Pandas is a library written for the Python language for data manipulation and analysis. In order to proceed, first install Python and then install the Python package named 'pandas'. Pandas is a really good alternative to the R programming language. See my previous post on how to install Python and pandas.

For the examples in this post, I am using a CSV file that has NFL game play-by-play statistics for 2014.

Start by invoking the python interactive interpreter.

 

     $ python3
Python 3.4.2 (default, Dec 18 2014, 14:18:16) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.

First import the following libraries that we are going to use.

     
import pandas as pd
import numpy as np

Read the csv file into a Pandas DataFrame

 

df=pd.DataFrame(pd.read_csv('pbp-2014.csv',header=0))

Check the size of the DataFrame. Note that df.size returns the total number of cells (rows × columns); use len(df) or df.shape if you want the row count by itself.

 
df.size
2056275

List the columns in the DataFrame

 

list(df)
['GameId', 'GameDate', 'Quarter', 'Minute', 'Second', 'OffenseTeam', 'DefenseTeam', 'Down', 'ToGo', 'YardLine', 'Unnamed: 10', 'SeriesFirstDown', 'Unnamed: 12', 'NextScore', 'Description', 'TeamWin', 'Unnamed: 16', 'Unnamed: 17', 'SeasonYear', 'Yards', 'Formation', 'PlayType', 'IsRush', 'IsPass', 'IsIncomplete', 'IsTouchdown', 'PassType', 'IsSack', 'IsChallenge', 'IsChallengeReversed', 'Challenger', 'IsMeasurement', 'IsInterception', 'IsFumble', 'IsPenalty', 'IsTwoPointConversion', 'IsTwoPointConversionSuccessful', 'RushDirection', 'YardLineFixed', 'YardLineDirection', 'IsPenaltyAccepted', 'PenaltyTeam', 'IsNoPlay', 'PenaltyType', 'PenaltyYards']

Check how many games the dataset covers. The GameId column is a unique identifier that identifies each game. The nunique method returns the number of unique elements in that object.

 

df.GameId.nunique()
256

List details of all games played by the New England Patriots. This command shows how you can provide a filter condition. The filter specifies that all rows where the OffenseTeam is NE or the DefenseTeam is NE be listed.

 

df[(df['OffenseTeam'] == 'NE') | (df['DefenseTeam'] == 'NE')]

Subset a specific set of columns

 

df[['GameId','PlayType','Yards']]

           GameId     PlayType  Yards
0      2014090400     KICK OFF      0
1      2014090400         RUSH      6
2      2014090400         RUSH      3
3      2014090400         RUSH     15
4      2014090400         RUSH      2
5      2014090400         PASS     -2
...
...

Display all PASS and RUSH plays executed by New England. Here we are using an AND filter condition to limit the rows to those where the OffenseTeam is New England and the PlayType is either PASS or RUSH.

 

df[['GameId','PlayType','Yards']][(df['OffenseTeam'] == 'NE') & (df['PlayType'].isin(['PASS','RUSH']))]   
           GameId PlayType  Yards
1092   2014090705     RUSH      2
1093   2014090705     PASS      4
1094   2014090705     PASS      0
1102   2014090705     PASS      8
1103   2014090705     PASS      8
1104   2014090705     RUSH      4
     

Display the Number of Plays, Total Yards Gained, and Average Yards gained per PASS and RUSH play, per game.

 

df[['GameId','PlayType','Yards']][(df['OffenseTeam'] == 'NE') & (df['PlayType'].isin(['PASS','RUSH']))].groupby(['GameId','PlayType']).agg({'Yards': [np.sum,np.mean],'GameId':[np.size]}) 
                    Yards           GameId
                      sum      mean   size
GameId     PlayType                       
2014090705 PASS       277  4.540984     61
           RUSH       109  5.190476     21
2014091404 PASS       209  8.038462     26
           RUSH       158  4.157895     38
2014092105 PASS       259  6.641026     39
           RUSH        91  2.935484     31
2014092900 PASS       307  9.903226     31
           RUSH        75  4.687500     16
2014100512 PASS       301  7.921053     38
           RUSH       223  5.309524     42
2014101201 PASS       407  9.465116     43
           RUSH        60  2.307692     26
2014101600 PASS       267  6.675000     40
           RUSH        65  4.062500     16
2014102605 PASS       376  9.400000     40
           RUSH       121  3.781250     32
2014110208 PASS       354  6.210526     57
           RUSH        66  2.640000     25
2014111611 PASS       267  8.343750     32
           RUSH       248  6.048780     41
2014112306 PASS       393  6.894737     57
           RUSH        90  4.285714     21
2014113010 PASS       245  6.621622     37
           RUSH        85  5.000000     17
2014120713 PASS       317  6.604167     48
           RUSH        80  3.333333     24
2014121408 PASS       287  7.972222     36
           RUSH        92  3.407407     27
2014122105 PASS       189  5.250000     36
           RUSH        78  4.333333     18
2014122807 PASS       188  5.222222     36
           RUSH        92  4.181818     22


From the above examples you can see how easy it is to read a CSV file, apply filters and summarize the data set using pandas.
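
If you then need to hand the summary back as a CSV, the aggregated DataFrame can be written straight out with to_csv; a small sketch (the output file name is arbitrary):

summary = df[['GameId','PlayType','Yards']][(df['OffenseTeam'] == 'NE') & (df['PlayType'].isin(['PASS','RUSH']))].groupby(['GameId','PlayType']).agg({'Yards': [np.sum,np.mean],'GameId':[np.size]})
summary.to_csv('ne_pass_rush_summary.csv')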

Setting up a Python 2.7.6 Virtual Env for Python development

Python is an excellent language to learn for DBAs who want to automate all the repetitive tasks they need to perform. Once you start using Python, it is likely that you will want to set up multiple environments with different versions of Python and libraries, based on the project you are working on. Virtualenv is a tool to create isolated Python environments.

Below are the steps that I followed to install a brand new working Python 2.7.6 environment with the following packages.

 
           SQLAlchemy    - Object Relational Mapper and SQL Toolkit for Python
           numpy         - Fundamental package for scientific computing
           matplotlib    - Python 2D plotting library
           ipython       - Interactive Python Shell
           pandas        - Python Data Analysis Library 
           Flask         - An easy to use Python Lightweight Micro Framework

This installation is performed on Ubuntu Linux, and I have already installed the libsqlite3-dev package and the Oracle instant client.

Install Python 2.7.6

Download Python-2.7.6.tgz from http://www.python.org/getit

Install python 2.7.6 to your directory of choice.

		tar -xvf Python-2.7.6.tgz
		cd Python-2.7.6/

		./configure --prefix=/u01/Rk/Apps/Python/Python276
		make
		make install

Now you have python 2.7.6 installed into the /u01/Rk/Apps/Python/Python276 directory. (You will have a python binary in /u01/Rk/Apps/Python/Python276/bin)

Download virtualenv

curl -O https://pypi.python.org/packages/source/v/virtualenv/virtualenv-1.11.1.tar.gz

Install virtualenv

tar -xzvf virtualenv-1.11.1.tar.gz
cd virtualenv-1.11.1/
/u01/Rk/Apps/Python/Python276/bin/python virtualenv.py /u01/Rk/Apps/Python/p276env1

Activate the virtualenv

           . /u01/Rk/Apps/Python/p276env1/bin/activate

Install the additional Python Modules you need

	   pip install SQLAlchemy
	   pip install numpy
	   pip install matplotlib
	   pip install ipython
	   pip install pyzmq
	   pip install tornado
	   pip install jinja2
	   pip install pandas
	   pip install Flask
	   pip install Flask-SQLAlchemy

Install cx_Oracle

Ensure that the oracle instant client is installed, and the environment variables ORACLE_HOME and LD_LIBRARY_PATH are setup correctly.

Download the cx_Oracle source (a Python extension module that allows access to Oracle) from http://cx-oracle.sourceforge.net/

tar -xzvf cx_Oracle-5.1.2.tar.gz
cd  cx_Oracle-5.1.2/
python setup.py install

Setup an alias (In your .bash_profile) to simplify invoking the virtualenv every time you want to use it.

alias p276env1='. /u01/Rk/Apps/Python/p276env1/bin/activate'

Now, any time you want to execute a Python program in this environment, you can open a Linux command line and run

p276env1
python

Graph CPU usage on exadata using oswatcher files

On the Oracle Database Machine, oswatcher is installed at setup time, both on the database nodes and the Exadata cells. This utility collects Linux operating system level statistics, which come in very handy when troubleshooting operating system level issues. The data is collected in text files. There is a Java based utility (OSWG) provided by Oracle Support to graph the contents of these files; however, that utility does not work on the oswatcher files generated on Exadata.

Here is a Python script that can graph the CPU used, from the mpstat information that oswatcher captures. It has been tested on recent oswatcher files from an X3-2. You need to first install a Python environment that has the numpy and matplotlib modules installed.

Install a Python Virtualenv.

If you create multiple applications using Python and end up using different versions, it is easier to maintain different virtualenvs. You can create a Python virtualenv as shown below (on Ubuntu Linux).

curl -O https://pypi.python.org/packages/source/v/virtualenv/virtualenv-1.9.1.tar.gz
tar -xzvf virtualenv-1.9.1.tar.gz
cd virtualenv-1.9.1
python virtualenv.py ../p273env2
. p273env2/bin/activate
pip install numpy
sudo apt-get install libfreetype6-dev
pip install matplotlib

Now that you have a python environment, with your required libraries, you can go ahead and execute the script as shown below.

The oswatcher files in /opt/oracle/oswatcher are .bz2 files, and there will be one file per hour per day. Copy the mpstat .bz2 files into a directory and use bunzip2 to unzip them. In this example, let us say that the directory name is /u01/oswatcher/mpstat/tmp.
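
A rough sketch of that preparation step, assuming the mpstat archives have "mpstat" in their file names (adjust the paths to match your system):

mkdir -p /u01/oswatcher/mpstat/tmp
find /opt/oracle/oswatcher -name '*mpstat*.bz2' -exec cp {} /u01/oswatcher/mpstat/tmp \;
cd /u01/oswatcher/mpstat/tmp
bunzip2 *.bz2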

You can now run the script as shown below

python parseoswmp.py  /u01/oswatcher/mpstat/tmp
or
python parseoswmp.py  /u01/oswatcher/mpstat/tmp '06/14/2013 05:00:00 AM' '06/14/2013 07:00:00 AM'

The first command graphs the CPU usage for the entire time range covered by those files, and the second command graphs the CPU information for the date and time range you have specified.

It creates a file in the current directory, named oswmpstat.png, which has the graph.

You can find the full script here.

You can find a sample output graph here.

Plotting AWR database metrics using R

In a previous post I showed how you can connect from R to the Oracle database using the ROracle driver. In this post I will explain how we can run queries against the AWR history tables and gather data that can be plotted using ggplot.

When you install R on Linux, like I outlined in the above post, you get an executable named Rscript. Rscript is a non-interactive variant of the R command, so you can run an R batch file from the Linux shell (like running a bash shell script). I am using Rscript as the interpreter in my script (first line).

ggplot2 is an R library that can be used for plotting in R programs. There is native plotting capability in R, and there is another library named lattice. ggplot2 is much more robust and is based on the grammar of graphics. You have to install ggplot2 (install.packages("ggplot2")) in R before you can use it.

#!/usr/bin/Rscript
library(ROracle)
library(ggplot2)

 

Process the command line arguments. This script expects three command line arguments. Copy each argument to an R variable.

args <- commandArgs(TRUE)
l_dbid <- as.double(args[1])
l_bsnap <- as.double(args[2])
l_esnap <- as.double(args[3])

Connect to Oracle

drv <- dbDriver("Oracle")
con <- dbConnect(drv,username="system",password="manager",dbname="burl5vb1:1521/rk01")

Populate a data frame with the values you will need for bind variables in the query you will be submitting.

my.data = data.frame(dbid = l_dbid, bsnap =l_bsnap,esnap=l_esnap)

Prepare and Execute the query

res <- dbSendQuery(con,"select dhss.instance_number,dhss.snap_id,dhs.end_interval_time et,
round(sum(decode(dhss.metric_name,'User Transaction Per Sec',dhss.average,0))) utps,
round(sum(decode(dhss.metric_name,'Average Active Sessions',dhss.average,0))) aas,
round(sum(decode(dhss.metric_name,'Host CPU Utilization (%)',dhss.average,0))) hcpu,
round(sum(decode(dhss.metric_name,'Buffer Cache Hit Ratio',dhss.average,0))) bchr,
round(sum(decode(dhss.metric_name,'Logical Reads Per Sec',dhss.average,0))) lr,
round(sum(decode(dhss.metric_name,'I/O Megabytes per Second',dhss.average,0))) iombps,
round(sum(decode(dhss.metric_name,'I/O Requests per Second',dhss.average,0))) iops,
round(sum(decode(dhss.metric_name,'Redo Generated Per Sec',dhss.average,0))) rg,
round(sum(decode(dhss.metric_name,'Temp Space Used',dhss.average,0))) ts,
round(sum(decode(dhss.metric_name,'Physical Write Total IO Requests Per Sec',dhss.average,0))) pw,
round(sum(decode(dhss.metric_name,'Physical Read Total IO Requests Per Sec',dhss.average,0))) pr
from dba_hist_sysmetric_summary dhss,dba_hist_snapshot dhs
where
dhss.dbid = :1
and dhss.snap_id between :2 and :3
and dhss.metric_name in (
'User Transaction Per Sec',
'Average Active Sessions',
'Host CPU Utilization (%)',
'Buffer Cache Hit Ratio',
'Logical Reads Per Sec',
'I/O Megabytes per Second',
'I/O Requests per Second',
'Redo Generated Per Sec',
'Temp Space Used',
'Physical Write Total IO Requests Per Sec',
'Physical Read Total IO Requests Per Sec')
and dhss.dbid = dhs.dbid
and dhs.instance_number=1
and dhss.snap_id = dhs.snap_id
group by dhss.instance_number,dhss.snap_id,dhs.end_interval_time
order by 1,2",data=my.data
)

Fetch the rows, and disconnect from the db.

data <- fetch(res)
dbDisconnect(con)

Open a pdf file to save the graphs to.
Generate the graphs using ggplot.
print the graphs to the pdf file
Close the pdf file.

In the ggplot function call, ET and INSTANCE_NUMBER represent the End Snap Time and Instance Number columns output from the query, and AAS, UTPS, HCPU, PW and PR represent the AverageActiveSessions, UserTransactionPerSecond, HostCpu, PhysicalWrites and PhysicalReads columns from the query.

pdf("plotstat.pdf", onefile = TRUE)
p1<-ggplot(data,aes(strptime(ET,format="%Y-%m-%d %H:%M:%S"),AAS,group=INSTANCE_NUMBER,color=INSTANCE_NUMBER))+geom_point()+geom_line()+ggtitle("Average Active Sessions")+labs(x="Time of Day",y="Average Active Sessions")
p2<-ggplot(data,aes(strptime(ET,format="%Y-%m-%d %H:%M:%S"),UTPS,group=INSTANCE_NUMBER,color=INSTANCE_NUMBER))+geom_point()+geom_line()+ggtitle("Transactions Per Second")+labs(x="Time of Day",y="Transactions Per Second")
p3<-ggplot(data,aes(strptime(ET,format="%Y-%m-%d %H:%M:%S"),HCPU,group=INSTANCE_NUMBER,color=INSTANCE_NUMBER))+geom_point()+geom_line()+ggtitle("CPU Usage")+labs(x="Time of Day",y="Cpu Usage")
p4<-ggplot(data,aes(strptime(ET,format="%Y-%m-%d %H:%M:%S"),PW,group=INSTANCE_NUMBER,color=INSTANCE_NUMBER))+geom_point()+geom_line()+ggtitle("Physical Writes")+labs(x="Time of Day",y="Physical Writes")
p5<-ggplot(data,aes(strptime(ET,format="%Y-%m-%d %H:%M:%S"),PR,group=INSTANCE_NUMBER,color=INSTANCE_NUMBER))+geom_point()+geom_line()+ggtitle("Physical Reads")+labs(x="Time of Day",y="Physical Reads")
print(p1)
print(p2)
print(p3)
print(p4)
print(p5)
dev.off()

You can run this script as follows from the Linux Command Line. The first argument is the dbid, the second argument is the begin snap id and the last argument is the end snap id.

./plotstat.R 220594996 5205 5217

You will then see a pdf document named plotstat.pdf in your directory that has 5 separate graphs in it.
Click on the link below to see a sample file. This is plotting awr data from a 4 node Oracle Rac Database.

plotstat

Click Here to download the whole script, plotstat.R

ggplot2 : Elegant Graphics for Data Analysis is a great book to learn about ggplot2.

Shell script to create a tar archive of oracle trace files.

Whenever you have an Oracle database problem and Oracle Support asks you to upload the related trace files, the best option is to use the Oracle Incident Packaging Service (IPS) to create an archive file that has all the necessary information to be uploaded to Oracle.

If you just want to upload all the .trc files generated in the diagnostics trace directory (including but not limited to pmon traces), you can use the following script to generate such an archive file.

The following script accepts

  • The directory name (The location of your trace files)
  • The backup destination directory (The directory where you want the archive to be created. Ensure you have enough space here)
  • The date of the trace files (DD-MON-YYYY)
  • The begin time (HH24MI)
  • The end time (HH24MI)

Then it finds all .trc files that fall between those begin and end times on the date you specified, in the directory you specified, and creates a tar.gz archive file in the destination directory you specified. It creates a directory named trcbakMonDD in your destination directory and places the file in that directory. You can download this file and upload it to Oracle.

Usage example: ./backtraces.sh /u01/11gr2/diag/rdbms/rk01/rk01/trace /tmp '11-Sep-2012' 1315 1340

The above command will back up all .trc files from the directory /u01/11gr2/diag/rdbms/rk01/rk01/trace that have a timestamp between 13:15 and 13:40 on 11th Sep 2012, to a tar archive in the directory /tmp.

I have only tested it on Oracle Enterprise Linux 5. (It is likely that the syntax for the tar and date commands might be different on other platforms.)

Find the script code below

 

#!/bin/bash
#This script can be used to create a tar archive of trace files created in 
#The database diagnostics trace directory between a given time period
#Author : Rajeev Ramdas

if [ $# != 5 ]
then
   echo ./backtraces.sh tracefiledir backupdir DD-Mon-YYYY HH24MI HH24MI
   echo ./backtraces.sh /u01/Rk/Docs/11g/Scripts2 /tmp '09-Nov-2012' 0900 1332
   exit
fi

l_backup_base=$2
l_backdir=trcbak`date --date=${3} +%b%d`
l_backdest=${l_backup_base}/${l_backdir}
l_startdate=`date --date=${3} +%Y%m%d`
l_enddate=`date --date=${3} +%Y%m%d`
l_starttime="${l_startdate}${4}"
l_endtime="${l_enddate}${5}"
l_backfile="${l_backdest}/tracebak-${l_starttime}-${l_endtime}.tar.gz"

if [ ! -d ${1} ]
then
   echo Trace file directory does not exist
   exit 1
fi

if [ ! -d ${2} ]
then
   echo Wrong Backup Dest
   exit 1
fi

if [ -d ${l_backdest} ]
then
   echo Directory Exists
else
   mkdir ${l_backdest}
fi

if [ -f ${l_backfile} ]
then
   rm ${l_backfile}
fi

touch -t "$l_starttime" /tmp/tmpoldfile
touch -t "$l_endtime" /tmp/tmpnewfile

find $1 -type f -newer /tmp/tmpoldfile ! -newer /tmp/tmpnewfile -name '*.trc' |  xargs tar -czvf - | cat > ${l_backfile}

echo Your backup file is ${l_backfile}
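
Before uploading, listing the archive contents is a quick sanity check that the right trace files were picked up. For the usage example above, the script would print a file name like this one:

tar -tzvf /tmp/trcbakSep11/tracebak-201209111315-201209111340.tar.gz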

Using the R Language with an Oracle Database.

R Programming Language Connectivity to Oracle.

R is an open source programming language and software environment for statistical computing and graphics. It is a fully functional programming language, widely used by statisticians to perform data analysis. It can also be a neat tool for Oracle DBAs to graph and analyse database performance metrics. If you intend to embark on developing a sizable R + Oracle project, I'd encourage you to look at Oracle R Enterprise and/or Oracle Advanced Analytics.

Below are the steps on how to install and configure the R language on Ubuntu Linux with connectivity to Oracle.

These steps assume that you have an already installed and running Oracle 11gR2 database.

The high level steps are as follows

1) Install the R programming language environment
2) Download and install the oracle instant client
3) Download and install the following R packages
– DBI
– ROracle
4) Start using R with Oracle.

Install the R programming language environment

Refer to the installation instructions at www.r-project.org for your platform.
If you are installing this on Ubuntu Linux (As I have on Ubuntu 12.10), open the “Ubuntu Software Center” and install the following packages.
– R-base
– R-base-dev

Download and install the oracle instant Client

As your regular OS user, download and install the Oracle instant client (installation is nothing other than unzipping the downloaded file).
Download The instant client for your o/s platform from http://www.oracle.com/technetwork/database/features/instant-client/index-097480.html.
You need to download
– Instant Client package – Basic
– Instant Client package – SDK

For the purpose of this installation, we are going to assume that the instant client has been installed into /u01/Rk/Apps/oracle/instantclient_11_2.

 Download and install the R packages

DBI

– Download DBI from http://cran.r-project.org/web/packages/DBI/index.html (download the package source)
– sudo su -
– cd <to the directory where DBI_0.2-5.tar.gz was downloaded>

root# R CMD INSTALL DBI_0.2-5.tar.gz
* installing to library ‘/usr/local/lib/R/site-library’
* installing *source* package ‘DBI’ ...
** R
** inst
** preparing package for lazy loading
Creating a generic function for ‘summary’ from package ‘base’ in package ‘DBI’
** help
*** installing help indices
** building package indices
** installing vignettes
‘DBI.Rnw’
** testing if installed package can be loaded

* DONE (DBI)

ROracle

– Download the ROracle source from http://cran.r-project.org/web/packages/ROracle/index.html
– sudo su -
– cd <to the directory where ROracle_1.1-5.tar.gz was downloaded>

– Set the following environment variables

root# export OCI_LIB=/u01/Rk/Apps/oracle/instantclient_11_2
root# export LD_LIBRARY_PATH=/u01/Rk/Apps/oracle/instantclient_11_2:$LD_LIBRARY_PATH
root# R CMD INSTALL ROracle_1.1-5.tar.gz
* installing to library ‘/usr/local/lib/R/site-library’
* installing *source* package ‘ROracle’ ...
** package ‘ROracle’ successfully unpacked and MD5 sums checked
configure: creating ./config.status
config.status: creating src/Makevars
** libs
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -I/u01/Rk/Apps/oracle/instantclient_11_2/sdk/include -fpic -O2 -pipe -g -c rodbi.c -o rodbi.o
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -I/u01/Rk/Apps/oracle/instantclient_11_2/sdk/include -fpic -O2 -pipe -g -c rooci.c -o rooci.o
gcc -std=gnu99 -shared -o ROracle.so rodbi.o rooci.o -L/u01/Rk/Apps/oracle/instantclient_11_2 -lclntsh -L/usr/lib/R/lib -lR
installing to /usr/local/lib/R/site-library/ROracle/libs
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded

* DONE (ROracle)

Using The R Language with Oracle

Now you are ready to Run your first R program, Run a query against the database, and plot the output on a graph.

Invoke the R language command line by typing in the following

$ R

From the R command line use the following commands.

> library(ROracle)
> drv <- dbDriver("Oracle")
> con <- dbConnect(drv,username="sh",password="sh",dbname="burl5vb1:1521/rk01")
> res <- dbSendQuery(con,"select time_id,sum(quantity_sold) from sales
+ where time_id > to_date('20-DEC-2001','DD-MON-RR')
+ group by time_id")
> data <- fetch(res)
> data
               TIME_ID SUM(QUANTITY_SOLD)
1  2001-12-20 23:00:00                473
2  2001-12-21 23:00:00                374
3  2001-12-22 23:00:00               1034
4  2001-12-23 23:00:00               1662
5  2001-12-24 23:00:00                470
6  2001-12-25 23:00:00                289
7  2001-12-26 23:00:00               1076
8  2001-12-27 23:00:00               1196
9  2001-12-28 23:00:00                232
10 2001-12-29 23:00:00                758
11 2001-12-30 23:00:00                786

> plot(data)

You will see a plot like the one below

Happy R scripting.

If you want to learn the R language, I would recommend the book The Art of R Programming.