
This is another exercise to practise exploratory data analysis (EDA) using various techniques. The dataset was obtained from a public source. The challenge with this dataset is its shape: it has only 17 rows but 28 columns, so it needs a different approach from a typical long dataset.

Package imports

import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
import csv
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

plt.rcParams["patch.force_edgecolor"] = True
sns.set(style='darkgrid')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Part 1 : Data Loading and Cleaning


Check the dataset for anomalies and decide whether any transformation is needed.

# the full option name is required in recent pandas versions
pd.set_option('display.max_columns', 50)
drug = pd.read_csv('./drug-use-by-age.csv')
print(drug.shape)
drug.head()
(17, 28)
age n alcohol-use alcohol-frequency marijuana-use marijuana-frequency cocaine-use cocaine-frequency crack-use crack-frequency heroin-use heroin-frequency hallucinogen-use hallucinogen-frequency inhalant-use inhalant-frequency pain-releiver-use pain-releiver-frequency oxycontin-use oxycontin-frequency tranquilizer-use tranquilizer-frequency stimulant-use stimulant-frequency meth-use meth-frequency sedative-use sedative-frequency
0 12 2798 3.9 3.0 1.1 4.0 0.1 5.0 0.0 - 0.1 35.5 0.2 52.0 1.6 19.0 2.0 36.0 0.1 24.5 0.2 52.0 0.2 2.0 0.0 - 0.2 13.0
1 13 2757 8.5 6.0 3.4 15.0 0.1 1.0 0.0 3.0 0.0 - 0.6 6.0 2.5 12.0 2.4 14.0 0.1 41.0 0.3 25.5 0.3 4.0 0.1 5.0 0.1 19.0
2 14 2792 18.1 5.0 8.7 24.0 0.1 5.5 0.0 - 0.1 2.0 1.6 3.0 2.6 5.0 3.9 12.0 0.4 4.5 0.9 5.0 0.8 12.0 0.1 24.0 0.2 16.5
3 15 2956 29.2 6.0 14.5 25.0 0.5 4.0 0.1 9.5 0.2 1.0 2.1 4.0 2.5 5.5 5.5 10.0 0.8 3.0 2.0 4.5 1.5 6.0 0.3 10.5 0.4 30.0
4 16 3058 40.1 10.0 22.5 30.0 1.0 7.0 0.0 1.0 0.1 66.5 3.4 3.0 3.0 3.0 6.2 7.0 1.1 4.0 2.4 11.0 1.8 9.5 0.3 36.0 0.2 3.0
  • This is a ‘short & fat’ dataset: only 17 rows but 28 columns.
  • Cleaning is required.
  • Missing values (shown as ‘-’) are observed in some columns.
drug.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 28 columns):
age                        17 non-null object
n                          17 non-null int64
alcohol-use                17 non-null float64
alcohol-frequency          17 non-null float64
marijuana-use              17 non-null float64
marijuana-frequency        17 non-null float64
cocaine-use                17 non-null float64
cocaine-frequency          17 non-null object
crack-use                  17 non-null float64
crack-frequency            17 non-null object
heroin-use                 17 non-null float64
heroin-frequency           17 non-null object
hallucinogen-use           17 non-null float64
hallucinogen-frequency     17 non-null float64
inhalant-use               17 non-null float64
inhalant-frequency         17 non-null object
pain-releiver-use          17 non-null float64
pain-releiver-frequency    17 non-null float64
oxycontin-use              17 non-null float64
oxycontin-frequency        17 non-null object
tranquilizer-use           17 non-null float64
tranquilizer-frequency     17 non-null float64
stimulant-use              17 non-null float64
stimulant-frequency        17 non-null float64
meth-use                   17 non-null float64
meth-frequency             17 non-null object
sedative-use               17 non-null float64
sedative-frequency         17 non-null float64
dtypes: float64(20), int64(1), object(7)
memory usage: 3.8+ KB
  • some numeric columns are stored with the wrong data type (object instead of float)

Columns of data type object are examined further to understand why they were read as object although they hold numeric values.

print('Age : {}'.format(drug['age'].unique()))
Age : ['12' '13' '14' '15' '16' '17' '18' '19' '20' '21' '22-23' '24-25' '26-29'
 '30-34' '35-49' '50-64' '65+']
  • the age column mixes single ages with age ranges of inconsistent width, and one value carries a special character (+)
print('cocaine-frequency : {}'.format(drug['cocaine-frequency'].unique()))
print('crack-frequency : {}'.format(drug['crack-frequency'].unique()))
print('heroin-frequency : {}'.format(drug['heroin-frequency'].unique()))
print('inhalant-frequency : {}'.format(drug['inhalant-frequency'].unique()))
print('oxycontin-frequency : {}'.format(drug['oxycontin-frequency'].unique()))
print('meth-frequency: {}'.format(drug['meth-frequency'].unique()))
cocaine-frequency : ['5.0' '1.0' '5.5' '4.0' '7.0' '8.0' '6.0' '15.0' '36.0' '-']
crack-frequency : ['-' '3.0' '9.5' '1.0' '21.0' '10.0' '2.0' '5.0' '17.0' '6.0' '15.0'
 '48.0' '62.0']
heroin-frequency : ['35.5' '-' '2.0' '1.0' '66.5' '64.0' '46.0' '180.0' '45.0' '30.0' '57.5'
 '88.0' '50.0' '66.0' '280.0' '41.0' '120.0']
inhalant-frequency : ['19.0' '12.0' '5.0' '5.5' '3.0' '4.0' '2.0' '3.5' '10.0' '13.5' '-']
oxycontin-frequency : ['24.5' '41.0' '4.5' '3.0' '4.0' '6.0' '7.0' '7.5' '12.0' '13.5' '17.5'
 '20.0' '46.0' '5.0' '-']
meth-frequency: ['-' '5.0' '24.0' '10.5' '36.0' '48.0' '12.0' '105.0' '2.0' '46.0' '21.0'
 '30.0' '54.0' '104.0']
  • missing values are encoded as ‘-’
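
Printing every column by hand works for 7 columns, but the same check can be automated. The following is a minimal sketch, not the method used in this post: it coerces each non-numeric column with `pd.to_numeric(..., errors='coerce')` and lists the values that fail to parse. The `toy` frame and the helper name `non_numeric_tokens` are made up for illustration.

```python
import pandas as pd

def non_numeric_tokens(df):
    """Map each non-numeric column to the set of values that cannot be parsed as numbers."""
    tokens = {}
    for col in df.select_dtypes(exclude='number').columns:
        parsed = pd.to_numeric(df[col], errors='coerce')
        # present in the raw column but NaN after coercion -> not a number
        bad = df.loc[parsed.isna() & df[col].notna(), col]
        if not bad.empty:
            tokens[col] = set(bad)
    return tokens

# hypothetical data mimicking the '-' placeholder seen above
toy = pd.DataFrame({
    'use': [0.1, 0.0, 0.4],
    'frequency': ['5.0', '-', '3.5'],
})
print(non_numeric_tokens(toy))
```

On the real dataset this kind of helper would also flag the range labels in the age column (such as '22-23'), since they are not parseable as numbers either.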

Extraction of records containing ‘-’

# any(axis=1) keeps each matching row once, however many '-' cells it has
drug[(drug == '-').any(axis=1)]
age n alcohol-use alcohol-frequency marijuana-use marijuana-frequency cocaine-use cocaine-frequency crack-use crack-frequency heroin-use heroin-frequency hallucinogen-use hallucinogen-frequency inhalant-use inhalant-frequency pain-releiver-use pain-releiver-frequency oxycontin-use oxycontin-frequency tranquilizer-use tranquilizer-frequency stimulant-use stimulant-frequency meth-use meth-frequency sedative-use sedative-frequency
0 12 2798 3.9 3.0 1.1 4.0 0.1 5.0 0.0 - 0.1 35.5 0.2 52.0 1.6 19.0 2.0 36.0 0.1 24.5 0.2 52.0 0.2 2.0 0.0 - 0.2 13.0
1 13 2757 8.5 6.0 3.4 15.0 0.1 1.0 0.0 3.0 0.0 - 0.6 6.0 2.5 12.0 2.4 14.0 0.1 41.0 0.3 25.5 0.3 4.0 0.1 5.0 0.1 19.0
2 14 2792 18.1 5.0 8.7 24.0 0.1 5.5 0.0 - 0.1 2.0 1.6 3.0 2.6 5.0 3.9 12.0 0.4 4.5 0.9 5.0 0.8 12.0 0.1 24.0 0.2 16.5
16 65+ 2448 49.3 52.0 1.2 36.0 0.0 - 0.0 - 0.0 120.0 0.1 2.0 0.0 - 0.6 24.0 0.0 - 0.2 5.0 0.0 364.0 0.0 - 0.0 15.0

Replacement of ‘-’ cells with NaN

drug.replace('-', np.nan, inplace=True)
drug.head()
age n alcohol-use alcohol-frequency marijuana-use marijuana-frequency cocaine-use cocaine-frequency crack-use crack-frequency heroin-use heroin-frequency hallucinogen-use hallucinogen-frequency inhalant-use inhalant-frequency pain-releiver-use pain-releiver-frequency oxycontin-use oxycontin-frequency tranquilizer-use tranquilizer-frequency stimulant-use stimulant-frequency meth-use meth-frequency sedative-use sedative-frequency
0 12 2798 3.9 3.0 1.1 4.0 0.1 5.0 0.0 NaN 0.1 35.5 0.2 52.0 1.6 19.0 2.0 36.0 0.1 24.5 0.2 52.0 0.2 2.0 0.0 NaN 0.2 13.0
1 13 2757 8.5 6.0 3.4 15.0 0.1 1.0 0.0 3.0 0.0 NaN 0.6 6.0 2.5 12.0 2.4 14.0 0.1 41.0 0.3 25.5 0.3 4.0 0.1 5.0 0.1 19.0
2 14 2792 18.1 5.0 8.7 24.0 0.1 5.5 0.0 NaN 0.1 2.0 1.6 3.0 2.6 5.0 3.9 12.0 0.4 4.5 0.9 5.0 0.8 12.0 0.1 24.0 0.2 16.5
3 15 2956 29.2 6.0 14.5 25.0 0.5 4.0 0.1 9.5 0.2 1.0 2.1 4.0 2.5 5.5 5.5 10.0 0.8 3.0 2.0 4.5 1.5 6.0 0.3 10.5 0.4 30.0
4 16 3058 40.1 10.0 22.5 30.0 1.0 7.0 0.0 1.0 0.1 66.5 3.4 3.0 3.0 3.0 6.2 7.0 1.1 4.0 2.4 11.0 1.8 9.5 0.3 36.0 0.2 3.0
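
If the placeholder is known before loading, the same result can be reached at read time. The sketch below assumes a tiny three-row extract (the `csv_text` string is made up for illustration) and uses the `na_values` parameter of `pd.read_csv`, so the affected column is parsed as float directly and no later type conversion is needed.

```python
import io
import pandas as pd

# hypothetical three-row extract in the same layout as the source file
csv_text = """age,crack-use,crack-frequency
12,0.0,-
13,0.0,3.0
14,0.0,-
"""

# na_values tells the parser to treat a cell holding only '-' as missing,
# so crack-frequency is read as float64 instead of object
toy = pd.read_csv(io.StringIO(csv_text), na_values='-')
print(toy.dtypes)
print(toy.isnull().sum())
```

Only whole-cell matches are treated as missing, so range labels such as '22-23' in the age column would be left untouched.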

Examination of missing values and data type for each column

drug.isnull().sum()
age                        0
n                          0
alcohol-use                0
alcohol-frequency          0
marijuana-use              0
marijuana-frequency        0
cocaine-use                0
cocaine-frequency          1
crack-use                  0
crack-frequency            3
heroin-use                 0
heroin-frequency           1
hallucinogen-use           0
hallucinogen-frequency     0
inhalant-use               0
inhalant-frequency         1
pain-releiver-use          0
pain-releiver-frequency    0
oxycontin-use              0
oxycontin-frequency        1
tranquilizer-use           0
tranquilizer-frequency     0
stimulant-use              0
stimulant-frequency        0
meth-use                   0
meth-frequency             2
sedative-use               0
sedative-frequency         0
dtype: int64
drug.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 28 columns):
age                        17 non-null object
n                          17 non-null int64
alcohol-use                17 non-null float64
alcohol-frequency          17 non-null float64
marijuana-use              17 non-null float64
marijuana-frequency        17 non-null float64
cocaine-use                17 non-null float64
cocaine-frequency          16 non-null object
crack-use                  17 non-null float64
crack-frequency            14 non-null object
heroin-use                 17 non-null float64
heroin-frequency           16 non-null object
hallucinogen-use           17 non-null float64
hallucinogen-frequency     17 non-null float64
inhalant-use               17 non-null float64
inhalant-frequency         16 non-null object
pain-releiver-use          17 non-null float64
pain-releiver-frequency    17 non-null float64
oxycontin-use              17 non-null float64
oxycontin-frequency        16 non-null object
tranquilizer-use           17 non-null float64
tranquilizer-frequency     17 non-null float64
stimulant-use              17 non-null float64
stimulant-frequency        17 non-null float64
meth-use                   17 non-null float64
meth-frequency             15 non-null object
sedative-use               17 non-null float64
sedative-frequency         17 non-null float64
dtypes: float64(20), int64(1), object(7)
memory usage: 3.8+ KB

Adjustment of data type to numeric

# assign by column label so the converted columns keep their float dtype
drug[drug.columns[2:]] = drug.iloc[:, 2:].astype(float)
drug.dtypes
age                         object
n                            int64
alcohol-use                float64
alcohol-frequency          float64
marijuana-use              float64
marijuana-frequency        float64
cocaine-use                float64
cocaine-frequency          float64
crack-use                  float64
crack-frequency            float64
heroin-use                 float64
heroin-frequency           float64
hallucinogen-use           float64
hallucinogen-frequency     float64
inhalant-use               float64
inhalant-frequency         float64
pain-releiver-use          float64
pain-releiver-frequency    float64
oxycontin-use              float64
oxycontin-frequency        float64
tranquilizer-use           float64
tranquilizer-frequency     float64
stimulant-use              float64
stimulant-frequency        float64
meth-use                   float64
meth-frequency             float64
sedative-use               float64
sedative-frequency         float64
dtype: object
drug.describe()
n alcohol-use alcohol-frequency marijuana-use marijuana-frequency cocaine-use cocaine-frequency crack-use crack-frequency heroin-use heroin-frequency hallucinogen-use hallucinogen-frequency inhalant-use inhalant-frequency pain-releiver-use pain-releiver-frequency oxycontin-use oxycontin-frequency tranquilizer-use tranquilizer-frequency stimulant-use stimulant-frequency meth-use meth-frequency sedative-use sedative-frequency
count 17.000000 17.000000 17.000000 17.000000 17.000000 17.000000 16.000000 17.000000 14.000000 17.000000 16.000000 17.000000 17.000000 17.000000 16.000000 17.000000 17.000000 17.000000 16.000000 17.000000 17.000000 17.000000 17.000000 17.000000 15.000000 17.000000 17.000000
mean 3251.058824 55.429412 33.352941 18.923529 42.941176 2.176471 7.875000 0.294118 15.035714 0.352941 73.281250 3.394118 8.411765 1.388235 6.156250 6.270588 14.705882 0.935294 14.812500 2.805882 11.735294 1.917647 31.147059 0.382353 35.966667 0.282353 19.382353
std 1297.890426 26.878866 21.318833 11.959752 18.362566 1.816772 8.038449 0.235772 18.111263 0.333762 70.090173 2.792506 15.000245 0.927283 4.860448 3.166379 6.935098 0.608216 12.798275 1.753379 11.485205 1.407673 85.973790 0.262762 31.974581 0.138000 24.833527
min 2223.000000 3.900000 3.000000 1.100000 4.000000 0.000000 1.000000 0.000000 1.000000 0.000000 1.000000 0.100000 2.000000 0.000000 2.000000 0.600000 7.000000 0.000000 3.000000 0.200000 4.500000 0.000000 2.000000 0.000000 2.000000 0.000000 3.000000
25% 2469.000000 40.100000 10.000000 8.700000 30.000000 0.500000 5.000000 0.000000 5.000000 0.100000 39.625000 0.600000 3.000000 0.600000 3.375000 3.900000 12.000000 0.400000 5.750000 1.400000 6.000000 0.600000 7.000000 0.200000 12.000000 0.200000 6.500000
50% 2798.000000 64.600000 48.000000 20.800000 52.000000 2.000000 5.250000 0.400000 7.750000 0.200000 53.750000 3.200000 3.000000 1.400000 4.000000 6.200000 12.000000 1.100000 12.000000 3.500000 10.000000 1.800000 10.000000 0.400000 30.000000 0.300000 10.000000
75% 3058.000000 77.500000 52.000000 28.400000 52.000000 4.000000 7.250000 0.500000 16.500000 0.600000 71.875000 5.200000 4.000000 2.000000 6.625000 9.000000 15.000000 1.400000 18.125000 4.200000 11.000000 3.000000 12.000000 0.600000 47.000000 0.400000 17.500000
max 7391.000000 84.200000 52.000000 34.000000 72.000000 4.900000 36.000000 0.600000 62.000000 1.100000 280.000000 8.600000 52.000000 3.000000 19.000000 10.000000 36.000000 1.700000 46.000000 5.400000 52.000000 4.100000 364.000000 0.900000 105.000000 0.500000 104.000000
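  • some columns have a standard deviation close to or larger than their mean (for example stimulant-frequency: mean 31.1, std 86.0; heroin-frequency: mean 73.3, std 70.1)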
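
One way to make "high standard deviation" comparable across columns is the coefficient of variation (std divided by mean). This is a small sketch rather than part of the original workflow; the numbers are copied from the describe() output above for three frequency columns, and the frame name `summary` is illustrative.

```python
import pandas as pd

# mean and std copied from the describe() output for three frequency columns
summary = pd.DataFrame(
    {'mean': [14.705882, 73.281250, 31.147059],
     'std': [6.935098, 70.090173, 85.973790]},
    index=['pain-releiver-frequency', 'heroin-frequency', 'stimulant-frequency'],
)

# coefficient of variation: spread relative to the size of the mean
summary['cv'] = summary['std'] / summary['mean']
print(summary.sort_values('cv', ascending=False))
```

A value above 1 means the spread exceeds the mean itself, which is usually a sign of a few extreme values rather than a generally wide distribution.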

Part 2 : High Level Overview of Data


2.1 : Check age group vs sample size distribution

drug.plot.bar(x='age', y='n', figsize=(15,6), color='grey')
plt.title('Distribution of Sample Size by Age')
plt.ylabel('sample size')

  • Sample size for the single ages 12 to 21 is fairly stable, between about 2,200 and 3,100
    • for ages 12 to 17 it is around 2,750 to 3,050
    • for ages 18 to 21 it is around 2,200 to 2,500
  • Sample size for the age groups from 22 onwards varies much more, from 2,448 (65+) to 7,391 (35-49); these groups also span different numbers of years

2.2 : Convert each age group to a single representative age (the midpoint of its range), so that charts mixing single ages and age groups are easier to compare and less misleading

def age_modified(age):
    # '65+' is open-ended, so its lower bound is used as the representative age
    if '+' in age:
        age = float(age.strip('+'))
    # a range such as '22-23' becomes its midpoint (22.5)
    elif '-' in age:
        x = age.split('-')
        age = (float(x[1]) - float(x[0]))/2. + float(x[0])
    else:
        age = float(age)
    return age
drug['age'] = drug['age'].apply(age_modified)
drug
age n alcohol-use alcohol-frequency marijuana-use marijuana-frequency cocaine-use cocaine-frequency crack-use crack-frequency heroin-use heroin-frequency hallucinogen-use hallucinogen-frequency inhalant-use inhalant-frequency pain-releiver-use pain-releiver-frequency oxycontin-use oxycontin-frequency tranquilizer-use tranquilizer-frequency stimulant-use stimulant-frequency meth-use meth-frequency sedative-use sedative-frequency
0 12.0 2798 3.9 3.0 1.1 4.0 0.1 5.0 0.0 NaN 0.1 35.5 0.2 52.0 1.6 19.0 2.0 36.0 0.1 24.5 0.2 52.0 0.2 2.0 0.0 NaN 0.2 13.0
1 13.0 2757 8.5 6.0 3.4 15.0 0.1 1.0 0.0 3.0 0.0 NaN 0.6 6.0 2.5 12.0 2.4 14.0 0.1 41.0 0.3 25.5 0.3 4.0 0.1 5.0 0.1 19.0
2 14.0 2792 18.1 5.0 8.7 24.0 0.1 5.5 0.0 NaN 0.1 2.0 1.6 3.0 2.6 5.0 3.9 12.0 0.4 4.5 0.9 5.0 0.8 12.0 0.1 24.0 0.2 16.5
3 15.0 2956 29.2 6.0 14.5 25.0 0.5 4.0 0.1 9.5 0.2 1.0 2.1 4.0 2.5 5.5 5.5 10.0 0.8 3.0 2.0 4.5 1.5 6.0 0.3 10.5 0.4 30.0
4 16.0 3058 40.1 10.0 22.5 30.0 1.0 7.0 0.0 1.0 0.1 66.5 3.4 3.0 3.0 3.0 6.2 7.0 1.1 4.0 2.4 11.0 1.8 9.5 0.3 36.0 0.2 3.0
5 17.0 3038 49.3 13.0 28.0 36.0 2.0 5.0 0.1 21.0 0.1 64.0 4.8 3.0 2.0 4.0 8.5 9.0 1.4 6.0 3.5 7.0 2.8 9.0 0.6 48.0 0.5 6.5
6 18.0 2469 58.7 24.0 33.7 52.0 3.2 5.0 0.4 10.0 0.4 46.0 7.0 4.0 1.8 4.0 9.2 12.0 1.7 7.0 4.9 12.0 3.0 8.0 0.5 12.0 0.4 10.0
7 19.0 2223 64.6 36.0 33.4 60.0 4.1 5.5 0.5 2.0 0.5 180.0 8.6 3.0 1.4 3.0 9.4 12.0 1.5 7.5 4.2 4.5 3.3 6.0 0.4 105.0 0.3 6.0
8 20.0 2271 69.7 48.0 34.0 60.0 4.9 8.0 0.6 5.0 0.9 45.0 7.4 2.0 1.5 4.0 10.0 10.0 1.7 12.0 5.4 10.0 4.0 12.0 0.9 12.0 0.5 4.0
9 21.0 2354 83.2 52.0 33.0 52.0 4.8 5.0 0.5 17.0 0.6 30.0 6.3 4.0 1.4 2.0 9.0 15.0 1.3 13.5 3.9 7.0 4.1 10.0 0.6 2.0 0.3 9.0
10 22.5 4707 84.2 52.0 28.4 52.0 4.5 5.0 0.5 5.0 1.1 57.5 5.2 3.0 1.0 4.0 10.0 15.0 1.7 17.5 4.4 12.0 3.6 10.0 0.6 46.0 0.2 52.0
11 24.5 4591 83.1 52.0 24.9 60.0 4.0 6.0 0.5 6.0 0.7 88.0 4.5 2.0 0.8 2.0 9.0 15.0 1.3 20.0 4.3 10.0 2.6 10.0 0.7 21.0 0.2 17.5
12 27.5 2628 80.7 52.0 20.8 52.0 3.2 5.0 0.4 6.0 0.6 50.0 3.2 3.0 0.6 4.0 8.3 13.0 1.2 13.5 4.2 10.0 2.3 7.0 0.6 30.0 0.4 4.0
13 32.0 2864 77.5 52.0 16.4 72.0 2.1 8.0 0.5 15.0 0.4 66.0 1.8 2.0 0.4 3.5 5.9 22.0 0.9 46.0 3.6 8.0 1.4 12.0 0.4 54.0 0.4 10.0
14 42.0 7391 75.0 52.0 10.4 48.0 1.5 15.0 0.5 48.0 0.1 280.0 0.6 3.0 0.3 10.0 4.2 12.0 0.3 12.0 1.9 6.0 0.6 24.0 0.2 104.0 0.3 10.0
15 57.0 3923 67.2 52.0 7.3 52.0 0.9 36.0 0.4 62.0 0.1 41.0 0.3 44.0 0.2 13.5 2.5 12.0 0.4 5.0 1.4 10.0 0.3 24.0 0.2 30.0 0.2 104.0
16 65.0 2448 49.3 52.0 1.2 36.0 0.0 NaN 0.0 NaN 0.0 120.0 0.1 2.0 0.0 NaN 0.6 24.0 0.0 NaN 0.2 5.0 0.0 364.0 0.0 NaN 0.0 15.0

2.3 : Set up 2 dataframes: one for drug use, one for drug frequency

drug.columns
Index(['age', 'n', 'alcohol-use', 'alcohol-frequency', 'marijuana-use',
       'marijuana-frequency', 'cocaine-use', 'cocaine-frequency', 'crack-use',
       'crack-frequency', 'heroin-use', 'heroin-frequency', 'hallucinogen-use',
       'hallucinogen-frequency', 'inhalant-use', 'inhalant-frequency',
       'pain-releiver-use', 'pain-releiver-frequency', 'oxycontin-use',
       'oxycontin-frequency', 'tranquilizer-use', 'tranquilizer-frequency',
       'stimulant-use', 'stimulant-frequency', 'meth-use', 'meth-frequency',
       'sedative-use', 'sedative-frequency'],
      dtype='object')
use_columns = [col for col in drug.columns if 'use' in col]
frequency_columns = [col for col in drug.columns if 'frequency' in col]

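# .copy() creates independent frames, so insert() below does not act on a slice of drug
df_drugUse = drug[use_columns].copy()
df_drugFrequency = drug[frequency_columns].copy()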

# insert age column to the front of the new df
df_drugUse.insert(0, 'age', drug['age'] )
df_drugFrequency.insert(0, 'age', drug['age'])

Check if df by drugUse is correctly set up

df_drugUse.head()
age alcohol-use marijuana-use cocaine-use crack-use heroin-use hallucinogen-use inhalant-use pain-releiver-use oxycontin-use tranquilizer-use stimulant-use meth-use sedative-use
0 12.0 3.9 1.1 0.1 0.0 0.1 0.2 1.6 2.0 0.1 0.2 0.2 0.0 0.2
1 13.0 8.5 3.4 0.1 0.0 0.0 0.6 2.5 2.4 0.1 0.3 0.3 0.1 0.1
2 14.0 18.1 8.7 0.1 0.0 0.1 1.6 2.6 3.9 0.4 0.9 0.8 0.1 0.2
3 15.0 29.2 14.5 0.5 0.1 0.2 2.1 2.5 5.5 0.8 2.0 1.5 0.3 0.4
4 16.0 40.1 22.5 1.0 0.0 0.1 3.4 3.0 6.2 1.1 2.4 1.8 0.3 0.2

Check if df by drugFrequency is correctly set up

df_drugFrequency.head()
age alcohol-frequency marijuana-frequency cocaine-frequency crack-frequency heroin-frequency hallucinogen-frequency inhalant-frequency pain-releiver-frequency oxycontin-frequency tranquilizer-frequency stimulant-frequency meth-frequency sedative-frequency
0 12.0 3.0 4.0 5.0 NaN 35.5 52.0 19.0 36.0 24.5 52.0 2.0 NaN 13.0
1 13.0 6.0 15.0 1.0 3.0 NaN 6.0 12.0 14.0 41.0 25.5 4.0 5.0 19.0
2 14.0 5.0 24.0 5.5 NaN 2.0 3.0 5.0 12.0 4.5 5.0 12.0 24.0 16.5
3 15.0 6.0 25.0 4.0 9.5 1.0 4.0 5.5 10.0 3.0 4.5 6.0 10.5 30.0
4 16.0 10.0 30.0 7.0 1.0 66.5 3.0 3.0 7.0 4.0 11.0 9.5 36.0 3.0
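
The split above relies on the substrings 'use' and 'frequency' appearing only where intended. A slightly stricter variant, sketched here on a made-up two-drug extract, matches on the column-name suffix and returns a copy with the key column already in front. The helper name `split_by_suffix` and the `toy` frame are illustrative, not part of the original notebook.

```python
import pandas as pd

# hypothetical two-drug extract with the same naming pattern as the dataset
toy = pd.DataFrame({
    'age': [12.0, 13.0],
    'alcohol-use': [3.9, 8.5],
    'alcohol-frequency': [3.0, 6.0],
    'marijuana-use': [1.1, 3.4],
    'marijuana-frequency': [4.0, 15.0],
})

def split_by_suffix(df, suffix, key='age'):
    """Return a copy holding `key` plus every column whose name ends with `suffix`."""
    cols = [c for c in df.columns if c.endswith(suffix)]
    return df[[key] + cols].copy()

toy_use = split_by_suffix(toy, '-use')
toy_frequency = split_by_suffix(toy, '-frequency')
print(list(toy_use.columns))
print(list(toy_frequency.columns))
```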

2.4a : Visualize data by drugUse in stacked-bar chart

df_drugUse.plot(x='age', kind='bar',stacked=True, figsize=(20,8), colormap='tab20',rot=1)
plt.ylabel('% of population taking the drug')
plt.title('Distribution of Population by Age for Various Drug Types')

  • alcohol has the highest share of users in every age/age group
  • marijuana has the 2nd highest share of users across ages/age groups; its use declines from age 22 onwards
  • pain relievers are the 3rd most used drug; the share of users stays roughly stable (about 8.5% to 10%) for ages 17-21
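
Observations like these can be cross-checked numerically. As a rough sketch, the age at which each drug's use peaks can be read off with `idxmax`; the values below are the alcohol-use and marijuana-use columns copied from the table in 2.2, and the frame name `toy` is illustrative.

```python
import pandas as pd

# age midpoints and two use columns copied from the table in 2.2
toy = pd.DataFrame({
    'age': [12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0,
            22.5, 24.5, 27.5, 32.0, 42.0, 57.0, 65.0],
    'alcohol-use': [3.9, 8.5, 18.1, 29.2, 40.1, 49.3, 58.7, 64.6, 69.7, 83.2,
                    84.2, 83.1, 80.7, 77.5, 75.0, 67.2, 49.3],
    'marijuana-use': [1.1, 3.4, 8.7, 14.5, 22.5, 28.0, 33.7, 33.4, 34.0, 33.0,
                      28.4, 24.9, 20.8, 16.4, 10.4, 7.3, 1.2],
}).set_index('age')

# idxmax returns the index label (here: the age) at which each column peaks
peak_age = toy.idxmax()
print(peak_age)
```

For these two columns the peak falls at age 20 for marijuana and in the 22-23 group for alcohol, consistent with the decline in marijuana use noted above.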

2.4b : Another visualization of drugUse using line chart

df_drugUse.plot('age', xticks=np.arange(10,70,5), figsize=(20,8))
plt.ylabel('% of age population')
plt.title('Distribution of Population by Age for Various Drug Types')

2.5a : Visualization of drug data by Frequency in stacked-bar chart

df_drugFrequency.plot(x='age', figsize=(20,8), stacked=True, kind='bar', colormap='tab20')
plt.ylabel('Frequency')
plt.title('Distribution of Drug Intake Frequency by Age')

  • heroin has the highest intake frequency at age 19 (180) and in the 35-49 age group (280)
  • stimulant frequency spikes sharply in the 65+ age group (364)
  • marijuana frequency is fairly stable from age 18 to the 50-64 age group (mostly 52 to 60)

2.5b : Another visualization of drugFrequency using line chart

df_drugFrequency.plot('age', figsize=(20,8), xticks=np.arange(10,70,5))
plt.ylabel('Median Frequency of Drug Intake')
plt.title('Distribution of Drug Frequency by Age for Various Drug Types')

  • For this data, the stacked bar charts make comparison easier than the line plots

2.6 : Check the spread of the data for each category using boxplot

Since the value ranges differ widely between columns, the data is standardized before plotting so that the boxplots are comparable on the same scale.

drugUse_bx = df_drugUse.drop('age', axis=1)
drugFrequency_bx = df_drugFrequency.drop('age', axis=1)

# standardize data on same scale
std_drugUse_bx = (drugUse_bx - drugUse_bx.mean())/drugUse_bx.std()
std_drugFrequency_bx = (drugFrequency_bx - drugFrequency_bx.mean())/drugFrequency_bx.std()

std_drugUse_bx
alcohol-use marijuana-use cocaine-use crack-use heroin-use hallucinogen-use inhalant-use pain-releiver-use oxycontin-use tranquilizer-use stimulant-use meth-use sedative-use
0 -1.917098 -1.490293 -1.142945 -1.247469 -0.757849 -1.143818 0.228371 -1.348729 -1.373352 -1.486206 -1.220203 -1.455128 -0.596759
1 -1.745959 -1.297981 -1.142945 -1.247469 -1.057464 -1.000577 1.198949 -1.222402 -1.373352 -1.429173 -1.149164 -1.074556 -1.321394
2 -1.388802 -0.854828 -1.142945 -1.247469 -0.757849 -0.642476 1.306791 -0.748675 -0.880106 -1.086977 -0.793968 -1.074556 -0.596759
3 -0.975838 -0.369868 -0.922774 -0.823329 -0.458234 -0.463425 1.198949 -0.243366 -0.222444 -0.459617 -0.296693 -0.313412 0.852512
4 -0.570315 0.299042 -0.647561 -1.247469 -0.757849 0.002106 1.738159 -0.022293 0.270802 -0.231486 -0.083576 -0.313412 -0.596759
5 -0.228038 0.758918 -0.097134 -0.823329 -0.757849 0.503448 0.659739 0.704089 0.764048 0.395874 0.626817 0.828303 1.577148
6 0.121679 1.235516 0.563378 0.449089 0.140995 1.291271 0.444055 0.925161 1.257294 1.194333 0.768895 0.447732 0.852512
7 0.341182 1.210432 1.058762 0.873228 0.440610 1.864233 0.012687 0.988325 0.928463 0.795103 0.982013 0.067160 0.127877
8 0.530922 1.260601 1.499103 1.297367 1.639069 1.434512 0.120529 1.177816 1.257294 1.479496 1.479287 1.970019 1.577148
9 1.033176 1.176987 1.444061 0.873228 0.740225 1.040600 0.012687 0.861998 0.599632 0.624005 1.550326 0.828303 0.127877
10 1.070380 0.792363 1.278933 0.873228 2.238298 0.646689 -0.418681 1.177816 1.257294 0.909169 1.195130 0.828303 -0.596759
11 1.029455 0.499715 1.003719 0.873228 1.039839 0.396018 -0.634365 0.861998 0.599632 0.852136 0.484738 1.208875 -0.596759
12 0.940166 0.156899 0.563378 0.449089 0.740225 -0.069514 -0.850049 0.640925 0.435217 0.795103 0.271621 0.828303 0.852512
13 0.821113 -0.211002 -0.042091 0.873228 0.140995 -0.570856 -1.065733 -0.117038 -0.058029 0.452907 -0.367732 0.067160 0.852512
14 0.728103 -0.712684 -0.372347 0.873228 -0.757849 -1.000577 -1.173575 -0.653929 -1.044521 -0.516649 -0.936046 -0.693984 0.127877
15 0.437912 -0.971887 -0.702603 0.449089 -0.757849 -1.108008 -1.281417 -1.190820 -0.880106 -0.801813 -1.149164 -0.693984 -0.596759
16 -0.228038 -1.481931 -1.197988 -1.247469 -1.057464 -1.179628 -1.497101 -1.790875 -1.537767 -1.486206 -1.362281 -1.455128 -2.046029
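
As a quick sanity check on the standardization, the z-score of a single column can be recomputed by hand. The sketch below uses the pain-releiver-use values copied from the table in 2.2; the variable names are illustrative. After standardizing, the column should have mean 0 and standard deviation 1, and the first value should match the first row of the table above.

```python
import pandas as pd

# pain-releiver-use values for the 17 age groups, copied from the table in 2.2
pain_use = pd.Series([2.0, 2.4, 3.9, 5.5, 6.2, 8.5, 9.2, 9.4, 10.0,
                      9.0, 10.0, 9.0, 8.3, 5.9, 4.2, 2.5, 0.6])

# z-score with the sample standard deviation (pandas uses ddof=1 by default)
z = (pain_use - pain_use.mean()) / pain_use.std()

print(round(z.iloc[0], 6))                      # first age group (12)
print(round(z.mean(), 6), round(z.std(), 6))    # should be close to 0 and 1
```

Note that pandas divides by the sample standard deviation (ddof=1), whereas NumPy's `std` defaults to the population version (ddof=0), so the two give slightly different z-scores on 17 rows.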

Comparison of the spread of the various drugUse columns on the same scale

plt.figure(figsize=(20,10))
sns.boxplot(data=std_drugUse_bx, orient='h',)
plt.title('Boxplot on Standardized Scale for DrugUse')

Comparison of the spread of the various drugFrequency columns on the same scale

plt.figure(figsize=(20,10))
sns.boxplot(data=std_drugFrequency_bx, orient='h')
plt.title('Boxplot on Standardized Scale for DrugFrequency')

  • drugFrequency data is more scattered and has more outliers than drugUse data

2.7a : Check the correlation of data in drugUse dataset

drugUse_temp = df_drugUse.drop('age', axis=1)
drugFrequency_temp = df_drugFrequency.drop('age', axis=1)
plt.figure(figsize=(12,8))
sns.heatmap(drugUse_temp.corr(),cmap='RdBu_r',annot=True)

  • inhalant-use has weak or negative correlation with the other drug-use columns
  • the other drug-use columns are generally positively correlated with each other, to varying degrees
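
With 13 columns the heatmap holds 78 distinct pairs, so it can help to list the pairs in ranked order as well. The following sketch is one possible way to do that, shown on three use columns copied from the table in 2.2; it keeps the strict upper triangle of the correlation matrix so that each pair appears once. The names `toy`, `upper` and `pairs` are illustrative.

```python
import numpy as np
import pandas as pd

# three use columns copied from the table in 2.2
toy = pd.DataFrame({
    'pain-releiver-use': [2.0, 2.4, 3.9, 5.5, 6.2, 8.5, 9.2, 9.4, 10.0,
                          9.0, 10.0, 9.0, 8.3, 5.9, 4.2, 2.5, 0.6],
    'oxycontin-use': [0.1, 0.1, 0.4, 0.8, 1.1, 1.4, 1.7, 1.5, 1.7,
                      1.3, 1.7, 1.3, 1.2, 0.9, 0.3, 0.4, 0.0],
    'inhalant-use': [1.6, 2.5, 2.6, 2.5, 3.0, 2.0, 1.8, 1.4, 1.5,
                     1.4, 1.0, 0.8, 0.6, 0.4, 0.3, 0.2, 0.0],
})

corr = toy.corr()
# keep each pair once: mask everything except the strict upper triangle
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)
print(pairs)
```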

Visualization of the correlation between the 4 most used drugs (alcohol, marijuana, hallucinogen, and pain reliever)

sns.pairplot(drugUse_temp[['alcohol-use','marijuana-use','hallucinogen-use','pain-releiver-use']],kind='reg')

  • all 4 drugs are positively correlated with each other
  • alcohol-use shows a more scattered relationship with the other three drugs
  • the other 3 drugs (marijuana, hallucinogen, and pain reliever) are strongly positively correlated with each other

2.7b : Check the correlation of data in drugFrequency dataset

plt.figure(figsize=(12,8))
sns.heatmap(data=drugFrequency_temp.corr(), cmap='RdBu_r', annot=True)

  • the drugFrequency columns show a mixture of positive and negative correlations
  • crack-frequency and stimulant-frequency have a positive correlation of up to 0.9

Visualization of the correlation between the frequencies of the 4 most used drugs (alcohol, marijuana, hallucinogen, and pain reliever)

sns.pairplot(drugFrequency_temp[['alcohol-frequency','marijuana-frequency','hallucinogen-frequency','pain-releiver-frequency']],kind='reg')

  • No strong correlation is observed among the 4 drug frequencies
  • Hallucinogen-frequency sits at two extremes, with most data points at the low-frequency end
  • The high hallucinogen-frequency data points could be outliers or exceptional intakes

Part 3 : Hypothesis Generation and Testing


In data exploration, it is common to form assumptions and then test and validate them before summarizing the findings and drawing conclusions. In this section, the observations from Part 2 are used to practise hypothesis generation and testing.

Question to explore :

  • The correlation matrix showed a strong correlation (0.98) between oxycontin-use and pain-reliever-use.
  • Do pain-reliever users and oxycontin users follow a similar distribution across the age groups? The t-test used below compares the mean use of the two drugs:
    • \[H_0: \mu_{\text{pain-reliever-use}} = \mu_{\text{oxycontin-use}}\]
    • \[H_1: \mu_{\text{pain-reliever-use}} \neq \mu_{\text{oxycontin-use}}\]
  • For their frequencies, the correlation was only 0.56. Is this correlation statistically significant?
  • Do these 2 groups of drug users take pain relievers as frequently as oxycontin?
    • \[H_0: \mu_{\text{pain-reliever-frequency}} = \mu_{\text{oxycontin-frequency}}\]
    • \[H_1: \mu_{\text{pain-reliever-frequency}} \neq \mu_{\text{oxycontin-frequency}}\]

Deliverables :

  • joint plot
  • stats summary to include p-values

3.a : Comparison on Drug Use

1st examination through graphical view

sns.jointplot(x='pain-releiver-use', y='oxycontin-use', data=drugUse_temp, kind='reg')

2nd examination through stats library on p-value

p_value_drugUse = stats.ttest_ind(drugUse_temp['pain-releiver-use'],drugUse_temp['oxycontin-use'])
p_value_drugUse
Ttest_indResult(statistic=6.82263516475104, pvalue=1.0265878201430413e-07)

3.b : Comparison on Drug Frequency

1st examination through graphical view

  • Repeat the same workflow as in 3.a
sns.jointplot(x='pain-releiver-frequency', y='oxycontin-frequency', data=drugFrequency_temp, kind='reg')

p_value_drugFrequency = stats.ttest_ind(drugFrequency_temp['pain-releiver-frequency'], drugFrequency_temp['oxycontin-frequency'],nan_policy='omit')
p_value_drugFrequency
# need to set nan_policy='omit' in order to ignore nan value in oxycontin-frequency for p-value calculation
Ttest_indResult(statistic=-0.030003630957118617, pvalue=0.9762564938195634)

Conclusion :

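for drug use

The Pearson correlation coefficient is high at 0.98, so the share of pain-reliever users and the share of oxycontin users rise and fall together across the age groups. The t-test p-value is small (1.03e-07), so the null hypothesis of equal means is rejected: the mean share of users differs between the two drugs (6.27% for pain relievers vs 0.94% for oxycontin). Note that the t-test compares means; it does not test the correlation itself.

for drug frequency

The Pearson correlation coefficient is only 0.56. The t-test p-value is large (0.976), so the null hypothesis cannot be rejected: there is no evidence that the mean frequency differs between pain relievers (14.71) and oxycontin (14.81). No conclusion about a difference in frequency can be drawn.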
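
The question of whether a correlation is statistically significant is answered by a different test from the one that compares means. The sketch below is an addition to the original workflow, using the two use columns copied from the table in 2.2: `stats.pearsonr` returns the correlation together with a p-value for the null hypothesis of no linear correlation, and `stats.ttest_rel` is a paired t-test. Treating the two columns as paired by age group is an assumption made here, since both values in a row come from the same respondents' age group.

```python
import numpy as np
from scipy import stats

# use percentages for the 17 age groups, copied from the table in 2.2
pain_use = np.array([2.0, 2.4, 3.9, 5.5, 6.2, 8.5, 9.2, 9.4, 10.0,
                     9.0, 10.0, 9.0, 8.3, 5.9, 4.2, 2.5, 0.6])
oxy_use = np.array([0.1, 0.1, 0.4, 0.8, 1.1, 1.4, 1.7, 1.5, 1.7,
                    1.3, 1.7, 1.3, 1.2, 0.9, 0.3, 0.4, 0.0])

# Pearson test: H0 is "no linear correlation between the two columns"
r, p_corr = stats.pearsonr(pain_use, oxy_use)

# paired t-test: H0 is "the mean within-age-group difference is zero"
t, p_paired = stats.ttest_rel(pain_use, oxy_use)

print('r = {:.2f}, p = {:.2g}'.format(r, p_corr))
print('paired t = {:.2f}, p = {:.2g}'.format(t, p_paired))
```

For the frequency columns, rows with NaN in oxycontin-frequency would need to be dropped from both arrays first, because `pearsonr` does not skip missing values.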

Part 4 : Outliers Handling


Outlier handling is common in data analysis. In this section, a subset of the data is used to outline how outliers can be examined and removed.

  • Pain-reliever-frequency is used to study outlier effect
fig, ax = plt.subplots(2,1,figsize=(10,6), sharex=True)

sns.boxplot(data=drugFrequency_temp['pain-releiver-frequency'], orient='h',ax=ax[0])
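# histplot with kde=True replaces distplot, which is deprecated in recent seaborn
sns.histplot(drugFrequency_temp['pain-releiver-frequency'], bins=30, kde=True, stat='density', ax=ax[1])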

4 outlier data points were observed

4.a : Extraction of outlier data points

# Get the IQR
dataExamined = drugFrequency_temp['pain-releiver-frequency']
q25, q75 = np.percentile(dataExamined, [25,75])
IQR = q75 - q25

# Get outlier points below q25 - 1.5*IQR and above q75 + 1.5*IQR
outliers_abv = dataExamined[dataExamined>(q75+1.5*IQR)]
outliers_below = dataExamined[dataExamined<(q25-1.5*IQR)]

# List out all outlier points (Series.append was removed in pandas 2.0, so pd.concat is used)
outliers_list = list(pd.concat([outliers_below, outliers_abv]))
outliers_list
[7.0, 36.0, 22.0, 24.0]
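
The 1.5 × IQR rule above can be wrapped into a small reusable helper. This is a sketch under the same rule, with one change that is an assumption on my part: it uses `Series.quantile`, which skips NaN, so it would also work on columns such as oxycontin-frequency where `np.percentile` would return NaN. The function name `iqr_outlier_mask` is illustrative.

```python
import pandas as pd

def iqr_outlier_mask(values, k=1.5):
    """Boolean mask marking points outside [q25 - k*IQR, q75 + k*IQR]."""
    s = pd.Series(values, dtype=float)
    q25, q75 = s.quantile([0.25, 0.75])   # Series.quantile ignores NaN
    iqr = q75 - q25
    return (s < q25 - k * iqr) | (s > q75 + k * iqr)

# pain-releiver-frequency values copied from the table in 2.2
pain_freq = pd.Series([36.0, 14.0, 12.0, 10.0, 7.0, 9.0, 12.0, 12.0, 10.0,
                       15.0, 15.0, 15.0, 13.0, 22.0, 12.0, 12.0, 24.0])
mask = iqr_outlier_mask(pain_freq)
print(pain_freq[mask].tolist())
```

Here q25 = 12 and q75 = 15, so the fences are 7.5 and 19.5, and the same four points as above are flagged. Returning a mask instead of a list also makes removal by position possible (`pain_freq[~mask]`), which avoids dropping a legitimate value that happens to equal an outlier value elsewhere.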

4.b : Removal of outlier data points from examined dataset

dataExamined_clean = [v for v in dataExamined if v not in outliers_list]
print(len(dataExamined_clean))
dataExamined_clean
13
[14.0, 12.0, 10.0, 9.0, 12.0, 12.0, 10.0, 15.0, 15.0, 15.0, 13.0, 12.0, 12.0]

4.c : Comparison of mean, median and std dev for the dataset with vs without outliers

# set up dataframe for comparison
df_comparison = pd.DataFrame(dataExamined)
df_comparison.columns = ['With outliers']

# replace those identified as outlier point with nan in order to do stats calculation for scenario with no outliers
df_comparison['No outliers'] = df_comparison['With outliers'].map(lambda x: np.nan if x in outliers_list else x)
df_comparison.head()
With outliers No outliers
0 36.0 NaN
1 14.0 14.0
2 12.0 12.0
3 10.0 10.0
4 7.0 NaN

4.d : Transpose dataframe for ease of comparison and calculation by columns

df_comparison_T = df_comparison.T
df_comparison_T
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
With outliers 36.0 14.0 12.0 10.0 7.0 9.0 12.0 12.0 10.0 15.0 15.0 15.0 13.0 22.0 12.0 12.0 24.0
No outliers NaN 14.0 12.0 10.0 NaN 9.0 12.0 12.0 10.0 15.0 15.0 15.0 13.0 NaN 12.0 12.0 NaN
# take a snapshot of the 17 raw values first, so that a statistic added as a
# new column (e.g. 'mean') is not fed into the next statistic
values = df_comparison_T.copy()
df_comparison_T['mean'] = values.mean(axis=1)
df_comparison_T['median'] = values.median(axis=1)
df_comparison_T['stdDev'] = values.std(axis=1)
df_comparison_T
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 mean median stdDev
With outliers 36.0 14.0 12.0 10.0 7.0 9.0 12.0 12.0 10.0 15.0 15.0 15.0 13.0 22.0 12.0 12.0 24.0 14.705882 12.0 6.935098
No outliers NaN 14.0 12.0 10.0 NaN 9.0 12.0 12.0 10.0 15.0 15.0 15.0 13.0 NaN 12.0 12.0 NaN 12.384615 12.0 1.980676
df_comparison_T[['mean','median','stdDev']]
mean median stdDev
With outliers 14.705882 12.0 6.935098
No outliers 12.384615 12.0 1.980676
  • mean and stdDev are smaller once the outlier data points are excluded, while the median stays at 12.0
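
The same three statistics can also be computed without transposing, which avoids any risk of a derived column being included in a later calculation. The following is a minimal sketch using `DataFrame.agg`; the values are the pain-releiver-frequency column copied from the table in 2.2, and the names `comparison` and `summary` are illustrative.

```python
import pandas as pd

# pain-releiver-frequency values copied from the table in 2.2
with_outliers = pd.Series([36.0, 14.0, 12.0, 10.0, 7.0, 9.0, 12.0, 12.0, 10.0,
                           15.0, 15.0, 15.0, 13.0, 22.0, 12.0, 12.0, 24.0])
outliers = [7.0, 36.0, 22.0, 24.0]   # the four points flagged in 4.a

comparison = pd.DataFrame({
    'With outliers': with_outliers,
    # where() keeps a value if the condition holds and puts NaN otherwise
    'No outliers': with_outliers.where(~with_outliers.isin(outliers)),
})

# one row per statistic, one column per scenario; NaN cells are skipped
summary = comparison.agg(['mean', 'median', 'std'])
print(summary.round(3))
```

The median is a robust statistic, so it is expected to move little or not at all when a few extreme points are removed, while the mean and standard deviation react strongly.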
sns.boxplot(data=df_comparison,orient='h')

# additional plots to compare the distributions with and without outliers
# (histplot with kde=True replaces distplot, which is deprecated in recent seaborn)

fig, ax = plt.subplots(2,2,figsize=(15,6), sharex=True)

sns.boxplot(data=df_comparison['With outliers'], orient='h', ax=ax[0][0])
sns.histplot(df_comparison['With outliers'], bins=30, kde=True, stat='density', ax=ax[1][0])

sns.boxplot(data=df_comparison['No outliers'], orient='h', ax=ax[0][1], color='g')
sns.histplot(df_comparison.dropna()['No outliers'], bins=30, kde=True, stat='density', ax=ax[1][1], color='g')

Key Learnings


There are many techniques for exploring data, and no single method is best for every dataset. The approach depends heavily on the data, and there is room for creativity in how the data is presented and visualized. Most importantly, the chosen method should deepen the understanding of the data so that it can be used for further processing or modeling.