As Oracle Dba’s we often come across situations where we are handed CSV (Comma separated values) files, by our managers, or customers, as Raw data, based on which we need to do some work. The first task would be to analyze the file and come up with some summary satistics, so we can quantify the amount of work involved.
When faced with such circumstances, my favorite method is to use sqlloader to upload the file into a database table, and then run sql statements on the data to produce summary info. Most people would probably use excel, formulas, macros and pivot tables to achieve similar results.
In this blog post, i present an alternate method that i’ve been using recently, for csv file summarization.
Pandas is a library written for the Python language for data manipulation and analysis. In order to proceed, first install Python and then install the Python package named ‘pandas’. Pandas is a real good alternative to the R programming language.See my previous post on how to install python and pandas.
For the examples in this post, i am using a Csv file, that has NFL game, play by play statistics for 2014.
Start by invoking the python interactive interpreter.
$ python3
Python 3.4.2 (default, Dec 18 2014, 14:18:16)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
First import the following libraries that we are going to use.
import pandas as pd
import numpy as np
Read the csv file into a Pandas DataFrame
df=pd.DataFrame(pd.read_csv('pbp-2014.csv',header=0))
Check how many rows the dataframe has
df.size
2056275
List the columns in the DataFrame
list(df)
['GameId', 'GameDate', 'Quarter', 'Minute', 'Second', 'OffenseTeam', 'DefenseTeam', 'Down', 'ToGo', 'YardLine', 'Unnamed: 10', 'SeriesFirstDown', 'Unnamed: 12', 'NextScore', 'Description', 'TeamWin', 'Unnamed: 16', 'Unnamed: 17', 'SeasonYear', 'Yards', 'Formation', 'PlayType', 'IsRush', 'IsPass', 'IsIncomplete', 'IsTouchdown', 'PassType', 'IsSack', 'IsChallenge', 'IsChallengeReversed', 'Challenger', 'IsMeasurement', 'IsInterception', 'IsFumble', 'IsPenalty', 'IsTwoPointConversion', 'IsTwoPointConversionSuccessful', 'RushDirection', 'YardLineFixed', 'YardLineDirection', 'IsPenaltyAccepted', 'PenaltyTeam', 'IsNoPlay', 'PenaltyType', 'PenaltyYards']
Check how many games the dataset covers. The GameId column is a unique identifier that identifies each game. The nunique method returns the number of unique elements in that object.
df.GameId.nunique()
256
List details of all games played by the New England Patriots. This command shows how you can provide a filter condition. The filter specifies that all the rows, where the OffensiveTeam is NE or DefensiveTeam is NE be listed.
df[(df['OffenseTeam'] == 'NE') | (df['DefenseTeam'] == 'NE')]
Subset a specific set of columns
df[['GameId','PlayType','Yards']]
GameId PlayType Yards
0 2014090400 KICK OFF 0
1 2014090400 RUSH 6
2 2014090400 RUSH 3
3 2014090400 RUSH 15
4 2014090400 RUSH 2
5 2014090400 PASS -2
...
...
Display all Pass and Rush plays Executed by New England. Here we are using a and filter condition to limit the rows to those of New England and the PlayType is either a PASS or a RUSH.
df[['GameId','PlayType','Yards']][(df['OffenseTeam'] == 'NE') & (df['PlayType'].isin(['PASS','RUSH']))]
GameId PlayType Yards
1092 2014090705 RUSH 2
1093 2014090705 PASS 4
1094 2014090705 PASS 0
1102 2014090705 PASS 8
1103 2014090705 PASS 8
1104 2014090705 RUSH 4
Display the Number of Plays, Total Yards Gained, and Average Yards gained per PASS and RUSH play, per game.
df[['GameId','PlayType','Yards']][(df['OffenseTeam'] == 'NE') & (df['PlayType'].isin(['PASS','RUSH']))].groupby(['GameId','PlayType']).agg({'Yards': [np.sum,np.mean],'GameId':[np.size]})
Yards GameId
sum mean size
GameId PlayType
2014090705 PASS 277 4.540984 61
RUSH 109 5.190476 21
2014091404 PASS 209 8.038462 26
RUSH 158 4.157895 38
2014092105 PASS 259 6.641026 39
RUSH 91 2.935484 31
2014092900 PASS 307 9.903226 31
RUSH 75 4.687500 16
2014100512 PASS 301 7.921053 38
RUSH 223 5.309524 42
2014101201 PASS 407 9.465116 43
RUSH 60 2.307692 26
2014101600 PASS 267 6.675000 40
RUSH 65 4.062500 16
2014102605 PASS 376 9.400000 40
RUSH 121 3.781250 32
2014110208 PASS 354 6.210526 57
RUSH 66 2.640000 25
2014111611 PASS 267 8.343750 32
RUSH 248 6.048780 41
2014112306 PASS 393 6.894737 57
RUSH 90 4.285714 21
2014113010 PASS 245 6.621622 37
RUSH 85 5.000000 17
2014120713 PASS 317 6.604167 48
RUSH 80 3.333333 24
2014121408 PASS 287 7.972222 36
RUSH 92 3.407407 27
2014122105 PASS 189 5.250000 36
RUSH 78 4.333333 18
2014122807 PASS 188 5.222222 36
RUSH 92 4.181818 22
From the above example’s you can see how easy it is to read a csv file, apply filters and summarize the data set using pandas.