Using Pandas for CSV Analysis

As Oracle Dba’s we often come across situations where we are handed CSV (Comma separated values) files, by our managers, or customers, as Raw data, based on which we need to do some work. The first task would be to analyze the file and come up with some summary satistics, so we can quantify the amount of work involved.

When faced with such circumstances, my favorite method is to use sqlloader to upload the file into a database table, and then run sql statements on the data to produce summary info. Most people would probably use excel, formulas, macros and pivot tables to achieve similar results.

In this blog post, i present an alternate method that i’ve been using recently, for csv file summarization.

Pandas is a library written for the Python language for data manipulation and analysis. In order to proceed, first install Python and then install the Python package named ‘pandas’. Pandas is a real good alternative to the R programming language.See my previous post on how to install python and pandas.

For the examples in this post, i am using a Csv file, that has NFL game, play by play statistics for 2014.

Start by invoking the python interactive interpreter.

 

     $ python3
Python 3.4.2 (default, Dec 18 2014, 14:18:16) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.

First import the following libraries that we are going to use.

     
import pandas as pd
import numpy as np

Read the csv file into a Pandas DataFrame

 

df=pd.DataFrame(pd.read_csv('pbp-2014.csv',header=0))

Check how many rows the dataframe has

 
df.size
2056275

List the columns in the DataFrame

 

list(df)
['GameId', 'GameDate', 'Quarter', 'Minute', 'Second', 'OffenseTeam', 'DefenseTeam', 'Down', 'ToGo', 'YardLine', 'Unnamed: 10', 'SeriesFirstDown', 'Unnamed: 12', 'NextScore', 'Description', 'TeamWin', 'Unnamed: 16', 'Unnamed: 17', 'SeasonYear', 'Yards', 'Formation', 'PlayType', 'IsRush', 'IsPass', 'IsIncomplete', 'IsTouchdown', 'PassType', 'IsSack', 'IsChallenge', 'IsChallengeReversed', 'Challenger', 'IsMeasurement', 'IsInterception', 'IsFumble', 'IsPenalty', 'IsTwoPointConversion', 'IsTwoPointConversionSuccessful', 'RushDirection', 'YardLineFixed', 'YardLineDirection', 'IsPenaltyAccepted', 'PenaltyTeam', 'IsNoPlay', 'PenaltyType', 'PenaltyYards']

Check how many games the dataset covers. The GameId column is a unique identifier that identifies each game. The nunique method returns the number of unique elements in that object.

 

df.GameId.nunique()
256

List details of all games played by the New England Patriots. This command shows how you can provide a filter condition. The filter specifies that all the rows, where the OffensiveTeam is NE or DefensiveTeam is NE be listed.

 

df[(df['OffenseTeam'] == 'NE') | (df['DefenseTeam'] == 'NE')]

Subset a specific set of columns

 

df[['GameId','PlayType','Yards']]

           GameId     PlayType  Yards
0      2014090400     KICK OFF      0
1      2014090400         RUSH      6
2      2014090400         RUSH      3
3      2014090400         RUSH     15
4      2014090400         RUSH      2
5      2014090400         PASS     -2
...
...

Display all Pass and Rush plays Executed by New England. Here we are using a and filter condition to limit the rows to those of New England and the PlayType is either a PASS or a RUSH.

 

df[['GameId','PlayType','Yards']][(df['OffenseTeam'] == 'NE') & (df['PlayType'].isin(['PASS','RUSH']))]   
           GameId PlayType  Yards
1092   2014090705     RUSH      2
1093   2014090705     PASS      4
1094   2014090705     PASS      0
1102   2014090705     PASS      8
1103   2014090705     PASS      8
1104   2014090705     RUSH      4
     

Display the Number of Plays, Total Yards Gained, and Average Yards gained per PASS and RUSH play, per game.

 

df[['GameId','PlayType','Yards']][(df['OffenseTeam'] == 'NE') & (df['PlayType'].isin(['PASS','RUSH']))].groupby(['GameId','PlayType']).agg({'Yards': [np.sum,np.mean],'GameId':[np.size]}) 
                    Yards           GameId
                      sum      mean   size
GameId     PlayType                       
2014090705 PASS       277  4.540984     61
           RUSH       109  5.190476     21
2014091404 PASS       209  8.038462     26
           RUSH       158  4.157895     38
2014092105 PASS       259  6.641026     39
           RUSH        91  2.935484     31
2014092900 PASS       307  9.903226     31
           RUSH        75  4.687500     16
2014100512 PASS       301  7.921053     38
           RUSH       223  5.309524     42
2014101201 PASS       407  9.465116     43
           RUSH        60  2.307692     26
2014101600 PASS       267  6.675000     40
           RUSH        65  4.062500     16
2014102605 PASS       376  9.400000     40
           RUSH       121  3.781250     32
2014110208 PASS       354  6.210526     57
           RUSH        66  2.640000     25
2014111611 PASS       267  8.343750     32
           RUSH       248  6.048780     41
2014112306 PASS       393  6.894737     57
           RUSH        90  4.285714     21
2014113010 PASS       245  6.621622     37
           RUSH        85  5.000000     17
2014120713 PASS       317  6.604167     48
           RUSH        80  3.333333     24
2014121408 PASS       287  7.972222     36
           RUSH        92  3.407407     27
2014122105 PASS       189  5.250000     36
           RUSH        78  4.333333     18
2014122807 PASS       188  5.222222     36
           RUSH        92  4.181818     22


From the above example’s you can see how easy it is to read a csv file, apply filters and summarize the data set using pandas.