Adding documents and querying Elasticsearch

Elasticsearch is an open source text search engine. It is accessed through RESTful APIs and stores data as JSON documents. It allows users to search very large amounts of data at very high speed. Since it is written in the Java programming language, it can run on many platforms. See this blog post on how to install, configure and start Elasticsearch 8.5.0.

In this blog post I will show you how to create an index, insert data into the index and query the data in Elasticsearch. In the rest of this post, I assume that you are executing the curl commands from the host that is running the Elasticsearch server, so all the curl queries are directed to localhost. If you are running the curl commands from a client machine, replace localhost with the actual hostname (or IP address) of the server running Elasticsearch.

Inserting a single document into an Index

curl -k -X POST https://localhost:9200/messages/_doc -H "Content-Type: application/json" -d'
{
  "msg_id" : 1,
  "msg_case_id" : 55,
  "msg_message" : "Hi This is xyz from def media company"
}' --user "elastic:Password1"

In the command shown above, “messages” is the index that we are inserting the document into. With the default settings, Elasticsearch creates the index automatically if it does not exist yet. We also provide the Elasticsearch username and password so that the POST request is authenticated. The -k option tells curl to accept the self-signed certificate from Elasticsearch.
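
The same request can also be prepared from Python. The sketch below is only an illustration using the standard library: the function name build_index_request is hypothetical, and the URL, index and credentials are the ones used in the curl example above. It builds the request object without sending it; sending would additionally need an SSL context that accepts the self-signed certificate.

```python
import base64
import json
import urllib.request


def build_index_request(base_url, index, document, user, password):
    """Build (but do not send) a POST request that indexes one document."""
    # Basic authentication header, the equivalent of curl's --user option.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_doc",
        data=json.dumps(document).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


request = build_index_request(
    "https://localhost:9200",
    "messages",
    {"msg_id": 1, "msg_case_id": 55,
     "msg_message": "Hi This is xyz from def media company"},
    "elastic",
    "Password1",
)
print(request.get_method(), request.full_url)
```

Keeping the request construction separate from the network call makes it easy to inspect the URL, headers and body before anything is sent to the server.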

Bulk inserting documents into an Index

First create a file named msg.json with the following lines. The bulk API expects the file to end with a newline:

{"index":{"_id":"1"}}
{ "msg_id" : 1, "msg_case_id" : 55, "msg_message" : "Hi This is xyz from def media company" }
{"index":{"_id":"2"}} 
{ "msg_id" : 2, "msg_case_id" : 55, "msg_message" : "We provide targeted advertising to different platforms" }
{"index":{"_id":"3"}}
{ "msg_id" : 3, "msg_case_id" : 55, "msg_message" : "Includes TV, Radio, Online, Social Media etc" }
{"index":{"_id":"4"}}
{ "msg_id" : 4, "msg_case_id" : 55, "msg_message" : "Our conversion ratios are very high" }
{"index":{"_id":"5"}}
{ "msg_id" : 5, "msg_case_id" : 55, "msg_message" : "provides search engine optimization" }

You can batch insert all the documents in the file above into the Elasticsearch index named “messages” using the command below:

curl -k -X POST "https://localhost:9200/messages/_bulk?pretty" -H "Content-Type: application/x-ndjson" --user "elastic:Password1" --data-binary @msg.json

Note that we are using the content type application/x-ndjson here, and --data-binary so that curl sends the file with its newlines intact.
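
If the documents come from a program rather than a hand-written file, the bulk body can be generated. The following is a minimal sketch, assuming each document carries its id in the msg_id field; the helper name build_bulk_body is hypothetical. It produces one action line and one source line per document, and ends the body with the newline that the bulk API requires.

```python
import json


def build_bulk_body(documents, id_field="msg_id"):
    """Return an NDJSON string with one action line and one source line per document."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_id": str(doc[id_field])}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    {"msg_id": 1, "msg_case_id": 55,
     "msg_message": "Hi This is xyz from def media company"},
    {"msg_id": 2, "msg_case_id": 55,
     "msg_message": "We provide targeted advertising to different platforms"},
]
body = build_bulk_body(docs)
print(body, end="")
```

Writing body to msg.json gives a file that can be posted with the curl command shown above.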

You can delete all the documents in the “messages” index, using the command below

curl -k -XPOST 'https://localhost:9200/messages/_delete_by_query?conflicts=proceed&pretty' -H 'Content-Type: application/json' --user "elastic:Password1" -d'
{
    "query": {
        "match_all": {}
    }
}'

Querying documents

You can query all the documents from the “messages” index using the command below

curl -k -X GET 'https://localhost:9200/messages/_search' --user "elastic:Password1"
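
The response to a _search request is a JSON document in which the matching documents sit under hits.hits, each with its original content in _source. The sketch below shows one way to pull those out in Python; the function name extract_sources is hypothetical and the sample response is hand-written and trimmed, not real server output.

```python
import json


def extract_sources(response_text):
    """Return the _source of every hit in a _search response body."""
    response = json.loads(response_text)
    return [hit["_source"] for hit in response["hits"]["hits"]]


# Hand-written sample, trimmed to the fields used here.
sample = """
{
  "hits": {
    "total": {"value": 1, "relation": "eq"},
    "hits": [
      {"_index": "messages", "_id": "4",
       "_source": {"msg_id": 4, "msg_case_id": 55,
                   "msg_message": "Our conversion ratios are very high"}}
    ]
  }
}
"""
for source in extract_sources(sample):
    print(source["msg_id"], source["msg_message"])
```

Keep in mind that _search returns at most 10 hits by default, so a larger index needs the size parameter or paging to see every document.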

You can query documents that match specific criteria using the command below:

curl -k -X POST "https://localhost:9200/messages/_search?pretty" -H 'Content-Type: application/json' --user "elastic:Password1" -d'
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "msg_message": "ratios"
          }
        },
        {
          "term": {
            "msg_id": "4"
          }
        }
      ]
    }
  }
}
'

The above query returns the documents where the token ratios occurs in the msg_message property and the msg_id property is 4. A term query looks for the exact token stored in the index. With the default mapping, msg_message is analyzed into lowercase words without stemming, so the term must be ratios, as it appears in the document; ratio would not match.
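
When the filter criteria vary, the request body can be assembled in code instead of edited by hand. This is a small sketch, not an official client API: build_filter_query is a hypothetical helper that turns a mapping of field names to values into a bool query with one term filter per field.

```python
import json


def build_filter_query(criteria):
    """Build a bool query whose filter clause holds one term query per field."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {field: value}} for field, value in criteria.items()
                ]
            }
        }
    }


query = build_filter_query({"msg_case_id": 55, "msg_id": 4})
print(json.dumps(query, indent=2))
```

The printed JSON can be passed to curl with -d, in the same way as the hand-written body above.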

In this blog post I have shown you how to insert documents into Elasticsearch indexes and query them.

Installing Elasticsearch and Kibana 8.5.0

Elasticsearch is an open source search engine based on the Lucene library, and Kibana is an open source data visualization dashboard for Elasticsearch. In this post I will show you how to install Elasticsearch and Kibana on a virtual machine (in this case running the Amazon Linux 2 operating system).

Install the required packages

sudo yum update -y
sudo yum group install "Development Tools" -y
sudo yum install wget -y
sudo yum install readline-devel -y
sudo yum install openssl-devel -y
 

Install Elasticsearch

mkdir -p /home/ec2-user/elastic
cd /home/ec2-user/elastic
curl -O "https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.5.0-linux-x86_64.tar.gz"
tar -xvf elasticsearch-8.5.0-linux-x86_64.tar.gz
chown -R ec2-user:ec2-user /home/ec2-user/elastic
cd /home/ec2-user/elastic/elasticsearch-8.5.0/config

Add the following lines to elasticsearch.yml

transport.host: localhost
transport.port: 9300
http.port: 9200
network.host: 0.0.0.0

Install Kibana

cd /home/ec2-user/
curl -O "https://artifacts.elastic.co/downloads/kibana/kibana-8.5.0-linux-x86_64.tar.gz"
tar -xzf kibana-8.5.0-linux-x86_64.tar.gz
chown -R ec2-user:ec2-user /home/ec2-user/kibana-8.5.0
cd /home/ec2-user/kibana-8.5.0/config

Add the following lines to kibana.yml

server.host: "<your-host-ip-address>"

In the line shown above, make sure to replace <your-host-ip-address> with your host's actual IP address.

Start Elasticsearch and Kibana

cd /home/ec2-user/elastic/elasticsearch-8.5.0/bin
nohup ./elasticsearch -Epath.data=data -Epath.logs=log &
cd /home/ec2-user/kibana-8.5.0/bin
nohup ./kibana &

Set up a new password for the Elasticsearch superuser

cd /home/ec2-user/elastic/elasticsearch-8.5.0
bin/elasticsearch-reset-password -u elastic --interactive

Follow the prompts to set up a new password for the user elastic.

At this point Elasticsearch has been installed and started, and you can start creating your indexes and running your queries.

Subscribing to PostgreSQL logical replication using python and psycopg2

When postgresql is used as a transactional database, there are use cases where the data changes from the transactional database are captured and sent to other databases, like your data mart or data warehouse. You could use cloud services like AWS DMS or replication software like Debezium to do this. In this blog post I will show you how to use python to read changes (CDC, change data capture) from a postgresql database using the wal2json output plugin and psycopg2.

When you are compiling postgresql from source code, you can build and install the output plugins test_decoding and wal2json as shown below.

            cd /home/postgres/tmp/postgresql-14.4/contrib/test_decoding
            make PG_CONFIG=/u01/pg/14/bin/pg_config
            make PG_CONFIG=/u01/pg/14/bin/pg_config install
            cd /home/postgres/tmp/postgresql-14.4/
            wget https://github.com/eulerto/wal2json/archive/refs/tags/wal2json_2_4.tar.gz
            tar -xzvf wal2json_2_4.tar.gz
            cd wal2json-wal2json_2_4/ 
            make PG_CONFIG=/u01/pg/14/bin/pg_config
            make PG_CONFIG=/u01/pg/14/bin/pg_config install

Here /home/postgres/tmp/postgresql-14.4 is the directory where you untarred your postgres source code before compiling and installing postgres to /u01/pg/14.

In order to proceed, you need to have installed the library psycopg2 with python3 (e.g. pip install psycopg2).

You also need to make sure that the parameter wal_level is set to ‘logical’ in the postgresql.conf file of your postgres database.

Create a table named books in your postgresql database.

create table books (bookid bigint primary key,bookname varchar(100));

Insert a few rows into the table.

insert into books values (1,'First Book');
insert into books values (2,'Second Book');
insert into books values (3,'Third Book');

The python library psycopg2 has a module named extras, which provides helpers to read from a postgres logical replication stream. We will be using the functions create_replication_slot, start_replication and consume_stream from this module to create the replication slot and consume the changes it publishes.

Here is the code sample for pglogical.py:

import sys

import psycopg2
import psycopg2.extras

# Open a connection that speaks the logical replication protocol.
conn = psycopg2.connect(
    'host=localhost user=postgres port=5432',
    connection_factory=psycopg2.extras.LogicalReplicationConnection)
cur = conn.cursor()

# Options passed to the wal2json output plugin.
replication_options = {
    'include-xids': '1',
    'include-timestamp': '1',
    'pretty-print': '1',
}

try:
    # Start streaming from the slot if it already exists.
    cur.start_replication(
        slot_name='pytest', decode=True,
        options=replication_options)
except psycopg2.ProgrammingError:
    # The slot does not exist yet: create it, then start streaming.
    cur.create_replication_slot('pytest', output_plugin='wal2json')
    cur.start_replication(
        slot_name='pytest', decode=True,
        options=replication_options)


class DemoConsumer(object):
    def __call__(self, msg):
        # Print the decoded change, then confirm it so the server can recycle WAL.
        print(msg.payload)
        msg.cursor.send_feedback(flush_lsn=msg.data_start)


democonsumer = DemoConsumer()

print("Starting streaming, press Control-C to end...", file=sys.stderr)
try:
    cur.consume_stream(democonsumer)
except KeyboardInterrupt:
    cur.close()
    conn.close()
    print("The slot 'pytest' still exists. Drop it with "
          "SELECT pg_drop_replication_slot('pytest'); if no longer needed.",
          file=sys.stderr)
    print("WARNING: Transaction logs will accumulate in pg_wal "
          "until the slot is dropped.", file=sys.stderr)

The code above is a modified version of the code published by Marco Nenciarini here.

The code above uses the wal2json output plugin. You can change it to use the test_decoding output plugin, but you will also have to change the replication_options variable to the options supported by test_decoding.

You can run the following command to create the replication slot and start the consumer.

python3 pglogical.py

You will see a prompt saying “Starting streaming, press Control-C to end…”

Let us now make a few changes to the books table.

insert into books values (5,'Fifth Book');

do $$
<<first_block>>
begin
    insert into books values (6,'Sixth Book');
    delete from books where bookid = 3;
end first_block $$;

If you now go back to the screen where you ran your python program, you can see the following messages on screen.

{
        "xid": 741,
        "timestamp": "2022-07-11 21:29:37.301299+00",
        "change": [
                {
                        "kind": "insert",
                        "schema": "public",
                        "table": "books",
                        "columnnames": ["bookid", "bookname"],
                        "columntypes": ["bigint", "character varying(100)"],
                        "columnvalues": [5, "Fifth Book"]
                }
        ]
}
{
        "xid": 742,
        "timestamp": "2022-07-11 21:33:07.913942+00",
        "change": [
                {
                        "kind": "insert",
                        "schema": "public",
                        "table": "books",
                        "columnnames": ["bookid", "bookname"],
                        "columntypes": ["bigint", "character varying(100)"],
                        "columnvalues": [6, "Sixth Book"]
                }
                ,{
                        "kind": "delete",
                        "schema": "public",
                        "table": "books",
                        "oldkeys": {
                                "keynames": ["bookid"],
                                "keytypes": ["bigint"],
                                "keyvalues": [3]
                        }
                }
        ]
}

These are the changes that the python subscriber program is reading from the postgres logical replication slot.

You can then write these changes either to a csv file or to another database, as you choose.
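
As an illustration of the csv option, the sketch below flattens one wal2json transaction into rows and writes them with the csv module. It assumes the payload layout shown in the output above; the function name changes_to_rows and the choice of columns are hypothetical. The payload is a condensed copy of the second transaction printed earlier.

```python
import csv
import io
import json


def changes_to_rows(payload):
    """Flatten one wal2json transaction into (xid, kind, table, column, value) rows."""
    txn = json.loads(payload)
    rows = []
    for change in txn["change"]:
        if change["kind"] == "delete":
            # Deletes carry only the key columns, under "oldkeys".
            names = change["oldkeys"]["keynames"]
            values = change["oldkeys"]["keyvalues"]
        else:
            names = change["columnnames"]
            values = change["columnvalues"]
        for name, value in zip(names, values):
            rows.append((txn.get("xid"), change["kind"], change["table"], name, value))
    return rows


payload = """
{"xid": 742, "timestamp": "2022-07-11 21:33:07.913942+00", "change": [
  {"kind": "insert", "schema": "public", "table": "books",
   "columnnames": ["bookid", "bookname"],
   "columntypes": ["bigint", "character varying(100)"],
   "columnvalues": [6, "Sixth Book"]},
  {"kind": "delete", "schema": "public", "table": "books",
   "oldkeys": {"keynames": ["bookid"], "keytypes": ["bigint"], "keyvalues": [3]}}
]}
"""
rows = changes_to_rows(payload)

# Write to an in-memory buffer here; a real consumer would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["xid", "kind", "table", "column", "value"])
writer.writerows(rows)
print(buffer.getvalue(), end="")
```

In pglogical.py, a function like this could be called from DemoConsumer.__call__ with msg.payload in place of the print statement.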

Installing the pg_partman extension

pg_partman is an extension that simplifies the process of partition management in postgres.

Below are the steps that I followed to install the pg_partman extension with postgres 12.8.

Change your current working directory to the directory where you unzipped the postgres 12.8 source code.

cd <SomeStaticPath>/postgresql-12.8/contrib

git clone https://github.com/pgpartman/pg_partman.git

cd pg_partman

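make PG_CONFIG=<pghome>/bin/pg_config

make PG_CONFIG=<pghome>/bin/pg_config install

Then edit your postgresql.conf file and add pg_partman_bgw to the parameter shared_preload_libraries. (Do not build with NO_BGW=1 in this case: that option skips the background worker library, which shared_preload_libraries needs to load.)

Now restart your postgres instance.

At this point you are ready to create the extension from postgres and use it.

Use psql to log in to your database and run the following: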

CREATE SCHEMA partman;

CREATE EXTENSION pg_partman SCHEMA partman;

\dx (To list the extensions and the version installed in your database)

Loading IMDB data into postgresql

IMDB (Internet Movie Database) makes the movie dataset available for free download at https://datasets.imdbws.com/. The documentation for this dataset can be found at https://www.imdb.com/interfaces/.

In this blog post, I show you how to load this data into a PostgreSQL database. The steps are executed from an Ubuntu Linux workstation. I assume that you already have a postgresql database with about 50 GB of free space to load this data into, and that you know the connection information.

We’ll use the s32imdbpy.py script that can be downloaded from github.

From the ubuntu workstation, install the following packages:

sudo apt install python3-pip
sudo apt-get install libpq-dev

Now install the following python modules

pip3 install psycopg2
pip3 install imdbpy

Create a new directory, to store the downloaded datafiles and download the datafiles from https://datasets.imdbws.com into this directory

mkdir dat
cd dat
wget https://datasets.imdbws.com/name.basics.tsv.gz
wget https://datasets.imdbws.com/title.akas.tsv.gz
wget https://datasets.imdbws.com/title.basics.tsv.gz
wget https://datasets.imdbws.com/title.crew.tsv.gz
wget https://datasets.imdbws.com/title.episode.tsv.gz
wget https://datasets.imdbws.com/title.principals.tsv.gz
wget https://datasets.imdbws.com/title.ratings.tsv.gz
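
Before loading, it can help to look at what a downloaded file contains. The files are gzipped, tab-separated text with a header row, and the IMDb documentation states that \N marks a missing value. The sketch below is a minimal reader under those assumptions; preview_tsv is a hypothetical helper, and the demonstration runs on a tiny hand-made sample held in memory rather than on a real download.

```python
import csv
import gzip
import io
import itertools


def preview_tsv(source, limit=5):
    """Return the header and the first `limit` rows of a gzipped TSV file.

    `source` may be a file name or a binary file object.
    """
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: the data is plain tab-separated text without quoting.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        rows = [
            [None if value == r"\N" else value for value in row]
            for row in itertools.islice(reader, limit)
        ]
    return header, rows


# Hand-made sample in the same layout as title.ratings.tsv.gz.
sample = gzip.compress(b"tconst\taverageRating\tnumVotes\ntt0000001\t5.7\t\\N\n")
header, rows = preview_tsv(io.BytesIO(sample))
print(header)
print(rows)
```

To inspect a real download, pass the file name instead, for example preview_tsv("title.ratings.tsv.gz").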

Log in to your postgresql database and create a new schema to hold the imdb tables. (This step is optional. If you do not create this schema, the tables and the corresponding data get loaded into the public schema.)

create schema imdb;

Before you run the script, let's edit it and make one change, which enables the script to load the data into the newly created imdb schema. This change is made on line 183 of the file.

Change
engine = sqlalchemy.create_engine(db_uri, encoding='utf-8', echo=False)
To
engine = sqlalchemy.create_engine(db_uri, encoding='utf-8', echo=False,connect_args={'options': '-csearch_path={}'.format('imdb')})
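
To make clear what the added argument does: connect_args is handed to the database driver, and the libpq options setting sends command-line style options to the server when the connection starts, so -csearch_path=imdb sets search_path for every session the script opens. The helper below only builds that dictionary; its name is hypothetical and it is a sketch of the idea rather than part of the script.

```python
def search_path_connect_args(schema):
    """Return connect_args that set search_path to `schema` for each new connection."""
    # "-c name=value" sets a run-time parameter for the session at connection start.
    return {"options": "-csearch_path={}".format(schema)}


connect_args = search_path_connect_args("imdb")
print(connect_args)
```

Because unqualified table names resolve through search_path, the tables the script creates end up in the imdb schema instead of public.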

Then execute the script as shown below:

python3 s32imdbpy.py /home/ubuntu/dat postgresql://username:password@dbhostname/dbname

Here /home/ubuntu/dat is the directory where the imdb files were downloaded.

This will take some time to load (close to an hour on a reasonably sized ubuntu workstation) and will consume about 15 GB of space in your postgresql database.

Substitution variables in psql scripts

As postgresql users and administrators, we tend to create a lot of scripts, and run them routinely, for common queries we need to execute against the database. It is likely that in some of those scripts you will want to parameterize the values used in filter conditions. Below is an example of how this can be done.

psql is the terminal based front end tool to interact with postgresql. You can run scripts stored in files in the file system using the \i directive in psql. You can use the \set command to set variables in psql. If you want to prompt the user for a value, you can use the \prompt directive.

\prompt 'Enter Table Name : ' tabname


select last_vacuum,last_autovacuum,last_analyze ,last_autoanalyze,n_live_tup,n_dead_tup
from pg_stat_user_tables
where relname = :'tabname';

Now if you run this from psql you will be first prompted for the table name and then it will display the results for the table name you entered.

postgres=# \i pgstats.sql
Enter Table Name : nflstats

 last_vacuum | last_autovacuum |         last_analyze         |       last_autoanalyze        | n_live_tup | n_dead_tup 
-------------+-----------------+------------------------------+-------------------------------+------------+------------
             |                 | 2020-06-08 18:07:36.29538+00 | 2020-06-08 15:37:48.738104+00 |     270418 |          0
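
For unattended runs, psql variables can also be set on the command line with the -v option instead of \prompt. The sketch below builds such a command in Python. It assumes a copy of the script without the \prompt line, called pgstats_noprompt.sql here; that file name and the helper psql_command are hypothetical.

```python
import shlex


def psql_command(script, variables, dbname="postgres"):
    """Build a psql argument list that sets variables with -v and runs a script."""
    command = ["psql", "-d", dbname]
    for name, value in variables.items():
        command += ["-v", f"{name}={value}"]
    command += ["-f", script]
    return command


command = psql_command("pgstats_noprompt.sql", {"tabname": "nflstats"})
print(shlex.join(command))
```

The list can be passed to subprocess.run(command, check=True); inside the script, :'tabname' is then replaced with the quoted value in the same way as after \prompt.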

Listing Errors from CloudWatch logs using the AWS CLI

The following commands can be used to list the error messages from CloudWatch logs produced by DMS (Database Migration Service).

First list the log groups

aws logs describe-log-groups

Next list the log streams in your log group

aws logs describe-log-streams --log-group-name <YourLogGroupNameHere>

Next list the error messages. Within the DMS log, errors are indicated by the pattern “E:” within the message string, so that is the pattern we search for.

aws logs filter-log-events --log-group-name <YourLogGroupNameHere> --log-stream-names <YourLogStreamNameHere> --filter-pattern "[message = \"*E:*\"]" --query 'events[*].message'

If you are searching in CloudWatch logs produced by other services, you should replace E: with the pattern that flags error messages for that service.
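
The same filtering can be done locally when the events have already been saved. The sketch below assumes the JSON printed by aws logs filter-log-events without the --query option, which holds the events in a list named events, each with a message field. The function name error_messages is hypothetical and the sample messages are hand-written, not real DMS output.

```python
import json


def error_messages(filter_output, marker="E:"):
    """Return the messages containing `marker` from filter-log-events JSON output."""
    events = json.loads(filter_output).get("events", [])
    return [event["message"] for event in events if marker in event["message"]]


# Hand-written sample in the shape of the CLI output.
sample = json.dumps({
    "events": [
        {"logStreamName": "example-stream",
         "message": "[TASK_MANAGER ]I:  Task started"},
        {"logStreamName": "example-stream",
         "message": "[TARGET_LOAD ]E:  Failed to load table"},
    ]
})
for message in error_messages(sample):
    print(message)
```

For logs from another service, pass that service's error pattern as the marker argument.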

Webex configuration on Ubuntu 15.10 Wily WereWolf

I had to go through some additional package installations on Ubuntu 15.10 to get the webex client working from a Firefox browser.

Even though I was prompted to install the plugin, the plugin got installed, and I was getting to the Webex screen, I was unable to view the screens being presented via Webex. This happens because the plugin needs a lot of libraries to work properly, and these are missing after the base install of Ubuntu 15.10.

You can find the list of missing libraries as follows:

  • Open a terminal with a command line prompt
  • cd .webex
  • cd 1524  (Or whatever your directory is named)
  • ldd *.so | grep -i 'not found'
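
When many plugin files report the same library, a de-duplicated list is easier to work from. The sketch below parses saved ldd output in Python; missing_libraries is a hypothetical helper, and the sample text (including the file names atmgr.so and other.so) is hand-written in the usual ldd layout.

```python
def missing_libraries(ldd_output):
    """Return the sorted, de-duplicated library names that ldd reports as not found."""
    missing = set()
    for line in ldd_output.splitlines():
        if "not found" in line.lower():
            # A typical line looks like "libXmu.so.6 => not found".
            missing.add(line.split("=>")[0].strip())
    return sorted(missing)


sample = """\
atmgr.so:
\tlibXmu.so.6 => not found
\tlibc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xf7d00000)
\tlibXtst.so.6 => not found
other.so:
\tlibXmu.so.6 => not found
"""
print(missing_libraries(sample))
```

Each name in the result then needs to be matched to the package that provides it.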

I had to perform the following steps to get all these libraries installed:

  • Downloaded the 32-bit Java runtime for Linux from http://www.java.com/en/download/linux_manual.jsp
  • Installed the 32-bit JRE into /u01/Rk/Apps/Java/jre32 (you can install it wherever you want to, just make sure you set LD_LIBRARY_PATH to the correct directory in the next step)
  • Added the following directories to the LD_LIBRARY_PATH setting in .bash_profile
    • /u01/Rk/Apps/Java/jre32/lib/i386:/u01/Rk/Apps/Java/jre32/lib/i386/server
  • Used apt to install the following packages
    • apt-get install libxmu6:i386
    • apt-get install libpangoxft-1.0-0:i386
    • apt-get install libpangox-1.0-0:i386
    • apt-get install libxtst6:i386
    • apt-get install -y lib32stdc++6
    • apt-get install -y libxi6:i386
    • apt-get install -y libxv1:i386
    • apt-get install -y libasound2:i386
    • apt-get install -y libgtk2.0-0:i386

After the above-mentioned packages were installed, ldd did not report any missing libraries, and I was able to view and present using webex from Firefox.

Hope this helps others who have the same problem.

 

Oracle Exadata Statistics in AWR report – Part 2 (Outliers)

This blog post is a continuation of the previous blog post, titled Oracle Exadata Statistics in AWR report – Part 1 (Basics). In this post we go on to describe the performance details displayed in the section “Exadata Outlier Summary”.

Outlier Summary Cell Level

awex3-1

This section displays cells that have performance outliers. The AWR views DBA_HIST_CELL_DISK_SUMMARY and DBA_HIST_CELL_GLOBAL_SUMMARY contain samples for each cell, disk and flash card.
The individual sample values, the number of samples, the average and the square of the value are all stored. Using this data, the mean and the standard deviation are calculated, and the normal range is defined as the average plus or minus one standard deviation. Cells that have values above the mean plus the standard deviation are displayed.

This section helps us identify cells that have performance metrics outside of the standard operating norms of that cell.
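
The calculation described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the formula the AWR report is confirmed to use: it takes one average per cell, derives the population standard deviation from the count, the sum and the sum of squares, and flags values above mean plus one standard deviation. The cell names and numbers are made up.

```python
import math


def find_outliers(averages):
    """Return the mean, standard deviation and names above mean + stddev."""
    count = len(averages)
    total = sum(averages.values())
    sum_of_squares = sum(value * value for value in averages.values())
    mean = total / count
    # Population variance from the aggregates: E[x^2] - (E[x])^2.
    variance = max(sum_of_squares / count - mean * mean, 0.0)
    stddev = math.sqrt(variance)
    outliers = sorted(
        name for name, value in averages.items() if value > mean + stddev
    )
    return mean, stddev, outliers


cells = {"cell01": 10.0, "cell02": 10.0, "cell03": 10.0, "cell04": 30.0}
mean, stddev, outliers = find_outliers(cells)
print(f"mean={mean:.2f} stddev={stddev:.2f} outliers={outliers}")
```

With these made-up values the mean is 15 and the standard deviation about 8.66, so only cell04, at 30, lies above the upper bound of roughly 23.66.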

Outlier Summary – Disk Level

awex3-2

This section displays disks that have performance outliers. The AWR view DBA_HIST_CELL_DISK_SUMMARY contains this information. The individual sample values, the number of samples, the average and the square of the value are all stored. Using this data, the mean and the standard deviation are calculated, and the normal range is defined as the average plus or minus one standard deviation. Disks that have values above the mean plus the standard deviation are displayed.

This section helps us identify disks (flash or hard disk) that have performance metrics outside of the standard operating norms of that disk.

Exadata OS IO Statistics – Outlier Cells

awex4-1

This section displays cells that have IO statistics that are outliers. The per-cell averages, and the per-disk mean, standard deviation and ranges of the IOPS and IO MBPS are displayed. Averages exceeding the maximum stated capacity of the disk or cell are shown in dark red.

This section helps identify whether there are cells or disks that exceed their stated capacities.

Exadata OS IO Statistics – Outlier Disks

awex4-2

This section displays disks (flash and hard disk) that have IO statistics that are outliers. The per-disk mean, standard deviation and ranges of the IOPS, IO MBPS and disk utilization percentage are displayed. Averages exceeding the normal ranges are shown in dark red.

This section helps identify whether there are disks that are outside of the standard operating norms of that disk.

Exadata OS IO Latency – Outlier Cells

awex4-3
This section displays cells (flash and hard disk) that have IO latencies that are outliers.
Aggregated across all cells, the mean, standard deviation and ranges of the average service times and average wait times are displayed.
If there are cells whose averages exceed the normal range, they are displayed as outliers.

This section helps us identify whether there are cells that have I/O latencies that are outside of the standard operating norms for cells in this system.

Exadata OS IO Latency – Outlier Disks

awex5-1
This section displays disks (flash and hard disk) that have IO latencies that are outliers.
Aggregated across all cells, the mean, standard deviation and ranges of the average service times and average wait times are displayed.
If there are disks whose averages exceed the normal range of the cells, they are displayed as outliers.

This section helps us identify whether there are disks that have I/O latencies that are outside of the standard operating norms for disks in this system.

Exadata OS CPU Statistics – Outlier Cells

awex5-2
This section displays cells whose CPU utilization is an outlier.
Aggregated across all cells, the mean, standard deviation and ranges of CPU utilization are displayed.
If there are cells whose average CPU utilization exceeds the normal CPU utilization range of the cells, they are displayed as outliers.

Oracle Exadata Statistics in AWR report – Part 1 (Basic Info)

Starting with Exadata storage server 12.1.2.1.0, used in combination with Oracle Database release 12.1.0.2, new sections have been added to the Oracle AWR (Automatic Workload Repository) report that display statistics at the Exadata storage level.

This is a really valuable enhancement, which helps with drilling down from database level statistics to cell level statistics, to identify and analyze the workload profile.
You can click on the URLs in the section “Exadata Configuration and Statistics” to access this part of the report.

There are a few AWR history tables that store this information.

DBA_HIST_CELL_CONFIG
DBA_HIST_CELL_CONFIG_DETAIL
DBA_HIST_CELL_DB
DBA_HIST_CELL_DISKTYPE
DBA_HIST_CELL_DISK_NAME
DBA_HIST_CELL_DISK_SUMMARY
DBA_HIST_CELL_GLOBAL
DBA_HIST_CELL_GLOBAL_SUMMARY
DBA_HIST_CELL_IOREASON
DBA_HIST_CELL_IOREASON_NAME
DBA_HIST_CELL_METRIC_DESC
DBA_HIST_CELL_NAME
DBA_HIST_CELL_OPEN_ALERTS

The description of these views can be found in the Exadata Storage Server User's Guide.

The section starts off by showing the cell configuration information. Then it displays the kernel and the cell image versions.

awex1-1

This information comes from the awr view DBA_HIST_CELL_DISK_SUMMARY.

The next section, titled “Exadata Storage Information”, shows the number of disks and flash cards in each cell and in the entire rack.

awex1-2
The first row of the output shows the amount of flash cache in each cell, the size of the smart flash log, the number of hard disks in each cell, the number of flash cards in each cell, and the number of grid disks in each cell.
The second row shows the above columns aggregated for all cells in the rack.

The next section, titled “Exadata Griddisks”, shows the grid disk names, the number of grid disks in each cell, the grid disk size and the type of drive.

awex1-3
The next section, titled “Exadata Cell Disks”, shows the disk type, the size of the cell disk and the number of disks.

awex2-1
The next section, “ASM diskgroups”, shows the diskgroups used by this database.

awex2-2
It shows the diskgroup name, the total size of the diskgroup, the used space, the number of disks in the diskgroup and the redundancy type.

This is followed by a section “Exadata Server Health Report”, which has 3 subsections (Exadata Alerts Summary, Exadata Alerts Detail and Exadata Non-Online Disks) that display information regarding alerts on the cells and any offline disks.
The remaining sections of Exadata performance statistics in the AWR report display a great deal of Exadata cell performance numbers.

Before we venture much into those sections, it is important to understand some cell level concepts and how they are captured in Awr.

At the cell level, you can list the following attributes (here on an X5-2 cell with hard disk drives):

list cell attributes maxpdiops,maxpdmbps,maxfdiops,maxfdmbps

which returns the following values:

167 111 8929 343

These values are collected and stored in XML format in the confval column of DBA_HIST_CELL_CONFIG_DETAIL.

These base values are used to calculate the maximum capacities of the cells and disks in the sections that follow.
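
As a rough illustration of how such base values could scale up to a cell, the sketch below multiplies each per-device maximum by a device count. Several things here are assumptions rather than statements from the report: that the pd and fd attributes are per-hard-disk and per-flash-device limits, that a cell has 12 hard disks and 4 flash cards, and that the cell maximum is a simple product. The exact calculation used by AWR is not shown in this post.

```python
def cell_capacity(per_device_max, device_count):
    """Scale a per-device maximum to a whole cell by multiplying by the device count."""
    return per_device_max * device_count


# Values returned by the "list cell" command above, in attribute order.
names = ["maxpdiops", "maxpdmbps", "maxfdiops", "maxfdmbps"]
attributes = dict(zip(names, map(int, "167 111 8929 343".split())))

HARD_DISKS_PER_CELL = 12   # assumption for this example
FLASH_CARDS_PER_CELL = 4   # assumption for this example

capacity = {
    "hard_disk_iops": cell_capacity(attributes["maxpdiops"], HARD_DISKS_PER_CELL),
    "hard_disk_mbps": cell_capacity(attributes["maxpdmbps"], HARD_DISKS_PER_CELL),
    "flash_iops": cell_capacity(attributes["maxfdiops"], FLASH_CARDS_PER_CELL),
    "flash_mbps": cell_capacity(attributes["maxfdmbps"], FLASH_CARDS_PER_CELL),
}
for name, value in capacity.items():
    print(name, value)
```

Under these assumptions a cell tops out at 2004 hard disk IOPS and 35716 flash IOPS; different device counts change the result proportionally.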