
Exploratory data analysis (EDA) is an essential part of any data science project. It helps to prepare, process and visualize the data for better preliminary judgement. A high-quality and comprehensive EDA eases modeling and leads to more reliable and conclusive outcomes.

This project therefore focuses solely on EDA and uses a variety of techniques to present the data and the related analysis. A set of SAT scores by state was used to practise various EDA techniques, trying out different plots to aid understanding of the dataset, visualize it, gain insights and draw conclusions.

Package imports

import numpy as np
import pandas as pd
import csv
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats

plt.rcParams["patch.force_edgecolor"] = True
sns.set(style='darkgrid')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Part 1 : Data Loading Using csv Module vs Using Pandas


1.1 Load the file with the csv module and put it in a Python dictionary

sat = './sat_scores.csv'

# read the file with the csv module: the first row is the header, the rest are data rows
with open(sat, 'r', newline='') as f:
    reader = csv.reader(f)
    keys = next(reader)          # ['State', 'Rate', 'Verbal', 'Math']
    rows = list(reader)

# transpose the rows so each column becomes one list of values,
# then pair each header name with its column
columns = [list(col) for col in zip(*rows)]
sat_dict = dict(zip(keys, columns))
print(sat_dict)
{'State': ['CT', 'NJ', 'MA', 'NY', 'NH', 'RI', 'PA', 'VT', 'ME', 'VA', 'DE', 'MD', 'NC', 'GA', 'IN', 'SC', 'DC', 'OR', 'FL', 'WA', 'TX', 'HI', 'AK', 'CA', 'AZ', 'NV', 'CO', 'OH', 'MT', 'WV', 'ID', 'TN', 'NM', 'IL', 'KY', 'WY', 'MI', 'MN', 'KS', 'AL', 'NE', 'OK', 'MO', 'LA', 'WI', 'AR', 'UT', 'IA', 'SD', 'ND', 'MS', 'All'], 'Rate': ['82', '81', '79', '77', '72', '71', '71', '69', '69', '68', '67', '65', '65', '63', '60', '57', '56', '55', '54', '53', '53', '52', '51', '51', '34', '33', '31', '26', '23', '18', '17', '13', '13', '12', '12', '11', '11', '9', '9', '9', '8', '8', '8', '7', '6', '6', '5', '5', '4', '4', '4', '45'], 'Verbal': ['509', '499', '511', '495', '520', '501', '500', '511', '506', '510', '501', '508', '493', '491', '499', '486', '482', '526', '498', '527', '493', '485', '514', '498', '523', '509', '539', '534', '539', '527', '543', '562', '551', '576', '550', '547', '561', '580', '577', '559', '562', '567', '577', '564', '584', '562', '575', '593', '577', '592', '566', '506'], 'Math': ['510', '513', '515', '505', '516', '499', '499', '506', '500', '501', '499', '510', '499', '489', '501', '488', '474', '526', '499', '527', '499', '515', '510', '517', '525', '515', '542', '439', '539', '512', '542', '553', '542', '589', '550', '545', '572', '589', '580', '554', '568', '561', '577', '562', '596', '550', '570', '603', '582', '599', '551', '514']}

1.2 Make a pandas DataFrame object with the SAT dictionary, and another with the pandas .read_csv() function

# set up dataframe with raw data from dict derived earlier

df_sat_dict = pd.DataFrame(sat_dict)
df_sat_dict.head()
State Rate Verbal Math
0 CT 82 509 510
1 NJ 81 499 513
2 MA 79 511 515
3 NY 77 495 505
4 NH 72 520 516
df_sat_dict.tail()
State Rate Verbal Math
47 IA 5 593 603
48 SD 4 577 582
49 ND 4 592 599
50 MS 4 566 551
51 All 45 506 514

The last row ('All') is an aggregate figure across all states rather than an individual state, so it needs to be removed from the analysis to avoid skewing the data

df_sat_dict.drop(index=51, inplace=True)
df_sat_dict.tail()
State Rate Verbal Math
46 UT 5 575 570
47 IA 5 593 603
48 SD 4 577 582
49 ND 4 592 599
50 MS 4 566 551
df_sat_dict.dtypes 
State     object
Rate      object
Verbal    object
Math      object
dtype: object

All columns above come through as object dtype, including the ones that are supposed to be numeric

# set up dataframe with raw data loaded directly using read_csv()

df_sat = pd.read_csv(sat)
df_sat.head()
State Rate Verbal Math
0 CT 82 509 510
1 NJ 81 499 513
2 MA 79 511 515
3 NY 77 495 505
4 NH 72 520 516
df_sat.tail()
State Rate Verbal Math
47 IA 5 593 603
48 SD 4 577 582
49 ND 4 592 599
50 MS 4 566 551
51 All 45 506 514
df_sat.drop(index=51, axis=0, inplace=True)
df_sat.tail()
State Rate Verbal Math
46 UT 5 575 570
47 IA 5 593 603
48 SD 4 577 582
49 ND 4 592 599
50 MS 4 566 551
df_sat.dtypes 
# read_csv() infers the numeric columns as integers right away
State     object
Rate       int64
Verbal     int64
Math       int64
dtype: object

Observation on the difference :

  • If we leave the dictionary values as strings, the DataFrame columns all end up with dtype object and would still need an explicit conversion (as sketched below). With read_csv(), the numeric dtypes are inferred correctly without any extra work, so read_csv() is the better and faster option.
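A minimal sketch of that conversion, assuming the df_sat_dict DataFrame built above:

# convert the numeric columns of the dict-based DataFrame from object to int
for col in ['Rate', 'Verbal', 'Math']:
    df_sat_dict[col] = pd.to_numeric(df_sat_dict[col])

df_sat_dict.dtypes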

Part 2 : Create a “Data Dictionary”


A data dictionary is an object that describes the data. It helps users understand the data better and thus eases the EDA process. The data dictionary in this exercise is set up to include the name of each variable (column), the type of the variable, a description of what the variable is, and the number of records in the dataset.

data_dict = pd.DataFrame(columns=['Column_name', 'Data_type', 'Description', 'No_of_records'])
data_dict['Column_name'] = df_sat.columns
data_dict['Data_type'] = list(df_sat.dtypes)
data_dict['Description'] = ['US State of SAT scores', 'Rate of students that take the SAT in that state', 'Mean verbal score in that state', 'Mean math score in that state']
df_sat.info() # since all columns have the same no. of records, .shape[0] can be used to get this info
<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 50
Data columns (total 4 columns):
State     51 non-null object
Rate      51 non-null int64
Verbal    51 non-null int64
Math      51 non-null int64
dtypes: int64(3), object(1)
memory usage: 2.0+ KB
data_dict['No_of_records'] = [df_sat.shape[0] for v in data_dict['Column_name']]
data_dict
Column_name Data_type Description No_of_records
0 State object US State of SAT scores 51
1 Rate int64 Rate of students that take the SAT in that state 51
2 Verbal int64 Mean verbal score in that state 51
3 Math int64 Mean math score in that state 51

Part 3 : Data Analysis and Visualization


3.1 Check the distribution of numerical columns

Data was examined to see if it was normally distributed

fig, ax = plt.subplots(3,1, figsize=(14,12))
sns.distplot(df_sat['Rate'], bins=30, kde=False, ax=ax[0])
sns.distplot(df_sat['Verbal'], bins=30, kde=False, color='g', ax=ax[1])
sns.distplot(df_sat['Math'], bins=30, kde=False, color='r', ax=ax[2])

3.2 Get an overview of the correlation of each variable in the dataset

sns.pairplot(df_sat, hue='State', size=3, aspect=1)


Observation on Pairplot :

  • Rate is the least normally distributed of the three variables, compared with Verbal and Math
  • Positive correlation observed between Math and Verbal
  • Negative correlation observed between Math & Rate and between Verbal & Rate
  • SAT performance varies across states

3.3 : Comparison of Math and Verbal Scores


df_sat[['Math', 'Verbal']].plot.hist(bins=30, alpha=0.5, stacked=True, figsize=(14,5))
plt.title('Histogram of SAT Performance')
plt.xlabel('Score')

fig, ax = plt.subplots(2,3, figsize=(25,6))

sns.boxplot(data=df_sat['Rate'],orient='h', ax=ax[0][0])
sns.distplot(df_sat['Rate'], ax=ax[1][0],bins=30)

sns.boxplot(data=df_sat['Verbal'], orient='h', ax=ax[0][1], color='orange')
sns.distplot(df_sat['Verbal'], bins=30, ax=ax[1][1], color='orange')

sns.boxplot(data=df_sat['Math'], orient='h', ax=ax[0][2], color='purple')
sns.distplot(df_sat['Math'], bins=30, ax=ax[1][2], color='purple')

Observation :

  • Although the SAT data is not particularly well distributed, no outlier points are observed for any of the 3 key numerical variables (a quick check is sketched after this list)
  • An outlier is defined here as a data point lying more than 1.5 times the interquartile range (IQR) outside the box :
    • more than 1.5 × IQR below the 1st quartile
    • more than 1.5 × IQR above the 3rd quartile
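A minimal sketch of that check, assuming df_sat is loaded as above:

# flag values lying more than 1.5 * IQR outside the interquartile range
for col in ['Rate', 'Verbal', 'Math']:
    q1, q3 = df_sat[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = ((df_sat[col] < lower) | (df_sat[col] > upper)).sum()
    print(col, 'outliers found:', n_outliers)
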
fig = plt.figure(figsize=(14,5))
ax = fig.gca()

ax = sns.boxplot(data=df_sat[['Rate','Verbal','Math']], orient='h', palette='Set2')
ax = plt.xlabel('Score')
ax = plt.title('Boxplot of SAT Performance')
ax = plt.text(1,1,"Shouldn't plot Rate together as it has different score scale", fontsize=14)


Benefits of using a boxplot vs a scatterplot/histogram:

  • a good way to summarize large amounts of data
  • provides information about the range and distribution of the data set, including details such as :
    • minimum
    • 1st quartile
    • median
    • 3rd quartile
    • maximum
    • outliers
  • gives some indication of the data’s symmetry and skewness

However, for the boxplot above, Rate is not on the same scale as Verbal and Math, so plotting the raw Rate values in the same chart as the other two makes the comparison less meaningful. To make the comparison more relevant, the data needs to be standardized onto a common scale. This can be done with the standardization method below, which factors in the mean and standard deviation.
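Concretely, each value is replaced by its z-score \(z = \frac{x - \bar{x}}{s}\), where \(\bar{x}\) and \(s\) are the mean and standard deviation of that column.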

# standardize variables (z-scores) for the three numeric columns

cols = ['Rate', 'Verbal', 'Math']
df_sat_std = (df_sat[cols] - df_sat[cols].mean()) / df_sat[cols].std()
fig = plt.figure(figsize=(14,5))
ax = fig.gca()

ax = sns.boxplot(data=df_sat_std, orient='h', palette='Set2')
ax = plt.title('Standardized Boxplot for SAT Performance')

3.4 : Analysis on Subset of Data


3.4.a) States with Verbal scores greater than average Verbal scores across states

df_sat['Verbal'].mean()
532.5294117647059
df_sat_verbal_abv_average = df_sat[df_sat['Verbal']>df_sat['Verbal'].mean()].sort_values(by='Verbal', ascending=False)
df_sat_verbal_abv_average
State Rate Verbal Math
47 IA 5 593 603
49 ND 4 592 599
44 WI 6 584 596
37 MN 9 580 589
48 SD 4 577 582
38 KS 9 577 580
42 MO 8 577 577
33 IL 12 576 589
46 UT 5 575 570
41 OK 8 567 561
50 MS 4 566 551
43 LA 7 564 562
31 TN 13 562 553
45 AR 6 562 550
40 NE 8 562 568
36 MI 11 561 572
39 AL 9 559 554
32 NM 13 551 542
34 KY 12 550 550
35 WY 11 547 545
30 ID 17 543 542
28 MT 23 539 539
26 CO 31 539 542
27 OH 26 534 439
round(len(df_sat_verbal_abv_average) / df_sat['State'].nunique()*100,2)
47.06
  • 47.06% of US states have a Verbal score above the cross-state average
  • In other words, fewer than half of the states score above the mean

3.4.b) States with Verbal scores greater than median Verbal scores across states

df_sat['Verbal'].median()
527.0
df_sat_verbal_abv_median = df_sat[df_sat['Verbal']>df_sat['Verbal'].median()].sort_values(by='Verbal', ascending=False)
df_sat_verbal_abv_median
State Rate Verbal Math
47 IA 5 593 603
49 ND 4 592 599
44 WI 6 584 596
37 MN 9 580 589
48 SD 4 577 582
38 KS 9 577 580
42 MO 8 577 577
33 IL 12 576 589
46 UT 5 575 570
41 OK 8 567 561
50 MS 4 566 551
43 LA 7 564 562
31 TN 13 562 553
45 AR 6 562 550
40 NE 8 562 568
36 MI 11 561 572
39 AL 9 559 554
32 NM 13 551 542
34 KY 12 550 550
35 WY 11 547 545
30 ID 17 543 542
28 MT 23 539 539
26 CO 31 539 542
27 OH 26 534 439
round(len(df_sat_verbal_abv_median)/df_sat['Verbal'].nunique()*100,2)
61.54
  • 61.54% is obtained here because the denominator is df_sat['Verbal'].nunique(), i.e. the number of distinct Verbal scores, not the number of states
  • Measured against the 51 states, the share above the median is the same 47.06% as for the mean, since the same 24 states clear both cut-offs (no state scores strictly between the median of 527 and the mean of 532.5); see the check sketched below
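For comparison, a small sketch of that check against the number of states, using the DataFrames above:

# same subset, but divided by the number of states rather than by distinct Verbal scores
round(len(df_sat_verbal_abv_median) / df_sat['State'].nunique() * 100, 2)
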
# visualize the difference between mean and median, and the distribution of the Verbal score

plt.figure(figsize=(15,6))
sns.distplot(df_sat['Verbal'], bins=30)
plt.axvline(x=df_sat['Verbal'].mean(), linewidth=2.5, linestyle='dashed', color='r', label='Average score')
plt.axvline(x=df_sat['Verbal'].median(), linewidth=2.5, linestyle='dashed', color='g', label='Median score')
plt.legend()
plt.title('Distribution of Verbal Score')

3.4.c) Difference between the Verbal and Math scores

# set up new column to reflect the difference of Verbal vs Math score

df_sat['Difference_Verbal_Math'] = df_sat['Verbal'] - df_sat['Math']
df_sat.head()
State Rate Verbal Math Difference_Verbal_Math
0 CT 82 509 510 -1
1 NJ 81 499 513 -14
2 MA 79 511 515 -4
3 NY 77 495 505 -10
4 NH 72 520 516 4
# Top 3 states with greatest difference between Verbal and Math

df_sat_verbal_greaterThanMath = df_sat.sort_values(by='Difference_Verbal_Math', ascending=False).head(10)
df_sat_verbal_greaterThanMath.head(3)
State Rate Verbal Math Difference_Verbal_Math
27 OH 26 534 439 95
50 MS 4 566 551 15
29 WV 18 527 512 15
# Top 3 states with greatest difference between Math and Verbal

df_sat_math_greaterThanVerbal = df_sat.sort_values(by='Difference_Verbal_Math', ascending=True).head(10)
df_sat_math_greaterThanVerbal.head(3)
State Rate Verbal Math Difference_Verbal_Math
21 HI 52 485 515 -30
23 CA 51 498 517 -19
1 NJ 81 499 513 -14

3.5. Examine Summary Statistics


Summary statistics give a quick overview of the data, helping to check central tendency, spread, missing values and anomalies

df_sat.drop(columns=['Difference_Verbal_Math']).describe(include='all')
State Rate Verbal Math
count 51 51.000000 51.000000 51.000000
unique 51 NaN NaN NaN
top UT NaN NaN NaN
freq 1 NaN NaN NaN
mean NaN 37.000000 532.529412 531.843137
std NaN 27.550681 33.360667 36.287393
min NaN 4.000000 482.000000 439.000000
25% NaN 9.000000 501.000000 503.000000
50% NaN 33.000000 527.000000 525.000000
75% NaN 64.000000 562.000000 557.500000
max NaN 82.000000 593.000000 603.000000

count => no. of records (population size)
mean => arithmetic average of the values
std => standard deviation of the values
min => lowest value in the column
25% => 25th percentile (1st quartile)
50% => 50th percentile (median)
75% => 75th percentile (3rd quartile)
max => highest value in the column

3.5 a) Correlation with Pearson Coefficient

# check correlation of numerical variables in dataset

df_sat_corr = df_sat.drop(columns=['State','Difference_Verbal_Math']).corr()
df_sat_corr
Rate Verbal Math
Rate 1.000000 -0.888121 -0.773419
Verbal -0.888121 1.000000 0.899909
Math -0.773419 0.899909 1.000000
plt.figure(figsize=(10,6))
sns.heatmap(df_sat_corr)

Observations:

  1. Verbal and Math are positively correlated => a state doing well in Verbal is also likely to do well in Math
  2. Rate is negatively correlated with both Verbal and Math => states with a low SAT participation rate tend to get better Verbal / Math scores

3.5 b) Correlation with Spearman Coefficient

df_sat_spearman = df_sat.drop(columns=['State', 'Difference_Verbal_Math']).corr(method='spearman')
df_sat_spearman
Rate Verbal Math
Rate 1.000000 -0.836058 -0.811662
Verbal -0.836058 1.000000 0.909413
Math -0.811662 0.909413 1.000000

Observations

  • the Spearman coefficients are slightly stronger than the Pearson coefficients for Verbal–Math and Rate–Math, but slightly weaker for Rate–Verbal

Process calculating the spearman rank correlation

  1. Rank the values of each variable (here the largest value is given rank 1).
  2. Calculate the difference in ranks for each pair of values.
  3. Complete the rest of the calculation with the formula below (this simple form assumes no tied ranks): \(r = 1 - \frac {6 \sum d_i^2} {n^3 - n}\)

where \(d_i\) is the difference in paired ranks and \(n\) is the number of rows
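A minimal sketch of that calculation for Verbal vs Math, assuming df_sat as loaded above (with tied scores, pandas' average ranks make this a close approximation of corr(method='spearman')):

verbal_rank = df_sat['Verbal'].rank(ascending=False)   # rank 1 = largest value
math_rank = df_sat['Math'].rank(ascending=False)
d = verbal_rank - math_rank                            # difference in paired ranks
n = len(df_sat)
r_spearman = 1 - (6 * (d ** 2).sum()) / (n ** 3 - n)
r_spearman   # should come out close to the 0.909 reported above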

Useful Notes :

  • The Spearman rank correlation measures the association between the ranks of one variable and the ranks of another.
  • A strong positive Spearman rank correlation means that a state doing well in Math will most likely also do well in the Verbal test.
  • A strong negative Spearman rank correlation means that a state doing well in one will most likely do poorly in the other.
  • The Spearman rank correlation is not as sensitive to outliers because it does not take into account how far away the outliers are from the average.
  • It is only concerned with where those outliers fall in the ordering relative to everyone else.
  • The outliers in this dataset are the states that did well in one subject but poorly in the other (the two biggest differences). Since Spearman works on the sorted ranks, doing very well in Math while being only average in Verbal simply produces a large rank difference, which weakens the Spearman rank correlation; it is not the same distance-based effect the gap has on the linear Pearson correlation.

3.5 c) Covariance

# check covariance of the dataset

df_sat_cov = df_sat.drop(columns=['State', 'Difference_Verbal_Math']).cov()
df_sat_cov
Rate Verbal Math
Rate 759.04 -816.280000 -773.220000
Verbal -816.28 1112.934118 1089.404706
Math -773.22 1089.404706 1316.774902

Difference between covariance matrix and correlation matrix.

  • The covariance matrix measures how the variables vary together, but it is expressed on the scale (units) of the variables themselves.
  • The correlation matrix is dimensionless: regardless of the units of the variables, its values always fall between -1 and 1

Process to convert the covariance into the correlation

  • Conversion can be done using the following formula (a small sketch follows below): \(corr(X, Y) = \frac{cov(X, Y)}{std(X)std(Y)}\)
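A minimal sketch of that conversion, assuming df_sat_cov and df_sat as defined above (the result should match the Pearson matrix from section 3.5 a):

# divide each covariance entry by the product of the two column standard deviations
stds = df_sat[['Rate', 'Verbal', 'Math']].std()
df_sat_cov / np.outer(stds, stds)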

Reason correlation matrix is preferred over covariance matrix for examining relationships in data

  • Covariance values are hard to compare because they depend on the units and scales of the variables, whereas correlation is a scaled, standardized version of covariance whose values are guaranteed to lie between -1 and 1

3.5 d) Percentile Scoring

Examine the percentile scoring of the data, i.e. the conversion of numeric values to their equivalent percentile scores.

df_sat_percentile = df_sat.copy()
df_sat_percentile['Rate_percentile'] = df_sat_percentile['Rate'].apply(lambda x: stats.percentileofscore(df_sat_percentile['Rate'],x))
df_sat_percentile.head()
State Rate Verbal Math Difference_Verbal_Math Rate_percentile
0 CT 82 509 510 -1 100.000000
1 NJ 81 499 513 -14 98.039216
2 MA 79 511 515 -4 96.078431
3 NY 77 495 505 -10 94.117647
4 NH 72 520 516 4 92.156863
  • We could also rank the values in each variable by percentile when calculating the Spearman correlation
  • Percentile scoring and the Spearman rank correlation are related in the sense that both assign each value a number based on where it falls in the sorted order relative to the other values, not on its distance from the mean.
  • Percentile scoring places each value within the sorted order to find which percentile it falls into. For example, take New York: a percentile score of 94.12 means that the state's participation rate of 77 is at least as high as 94.12% of the rates in the dataset, i.e. roughly 3 other states had a higher participation rate ( (100% - 94.12%) x 51 ≈ 3 ); a quick check is sketched after this list.
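A small sketch of that check, assuming df_sat as above:

# number of states with a strictly higher participation rate than New York's 77
(df_sat['Rate'] > 77).sum()   # should be 3 (CT, NJ, MA)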

3.5 e) Percentiles vs outliers

# compare the distribution of the raw Rate values with their percentile scores

fig, ax = plt.subplots(1,2, sharey = True, figsize=(20,6))

sns.distplot(df_sat_percentile['Rate'], bins=30, ax=ax[0]) # Distribution as per normal values
ax[0].set_title('Distribution of Rate')
sns.distplot(df_sat_percentile['Rate_percentile'], bins=30, ax=ax[1], color='g')
ax[1].set_title('Distribution of Rate_in_Percentile')

plt.suptitle('Comparison of Distribution for Rate_Normal vs Rate_In_Percentile', size=14)
plt.show()

fig, ax = plt.subplots(2,1, figsize=(10,8))

sns.boxplot(data=df_sat_percentile[['Rate', 'Rate_percentile']], ax=ax[0])
sns.violinplot(data=df_sat_percentile[['Rate', 'Rate_percentile']], ax=ax[1])

Observations :

  • Percentile scoring is not as sensitive to outliers because it does not take into account how far away the outliers are from the average.
  • After the percentile transformation, Rate is spread much more evenly (roughly uniformly) across its range than the raw, skewed Rate values

Key Learnings


There are various ways and techniques we can use to analyze data. From loading and preparing the data to analyzing and visualizing it, there are many options that achieve a similar purpose. Different techniques have their own strengths in particular features and performance, so it is essential to evaluate which one serves best in terms of clarity and provides the visualization that best turns a complex observation into a simple explanation.

Sometimes we can combine multiple approaches / different plots so that they complement each other, since one type of chart demonstrates features that another does not. This combination helps us explain, 're-imagine' and understand the dataset better, and lets us discover patterns from different perspectives, making the analysis more comprehensive.