Analysis on Drug Usage

This is another exercise to practise EDA using various techniques. The dataset was obtained from a public source. The challenge with this dataset is its shape: it has only 17 rows but 28 columns. A different thought process is required to work with this type of data.

Package imports

import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
import csv
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

plt.rcParams["patch.force_edgecolor"] = True
sns.set(style='darkgrid')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Part 1 : Data Loading and Cleaning

Check whether the dataset contains any anomalies and whether any transformation is needed

pd.set_option('display.max_columns', 50)

drug = pd.read_csv('./drug-use-by-age.csv')
print(drug.shape)
drug.head()

(17, 28)

	age	n	alcohol-use	alcohol-frequency	marijuana-use	marijuana-frequency	cocaine-use	cocaine-frequency	crack-use	crack-frequency	heroin-use	heroin-frequency	hallucinogen-use	hallucinogen-frequency	inhalant-use	inhalant-frequency	pain-releiver-use	pain-releiver-frequency	oxycontin-use	oxycontin-frequency	tranquilizer-use	tranquilizer-frequency	stimulant-use	stimulant-frequency	meth-use	meth-frequency	sedative-use	sedative-frequency
0	12	2798	3.9	3.0	1.1	4.0	0.1	5.0	0.0	-	0.1	35.5	0.2	52.0	1.6	19.0	2.0	36.0	0.1	24.5	0.2	52.0	0.2	2.0	0.0	-	0.2	13.0
1	13	2757	8.5	6.0	3.4	15.0	0.1	1.0	0.0	3.0	0.0	-	0.6	6.0	2.5	12.0	2.4	14.0	0.1	41.0	0.3	25.5	0.3	4.0	0.1	5.0	0.1	19.0
2	14	2792	18.1	5.0	8.7	24.0	0.1	5.5	0.0	-	0.1	2.0	1.6	3.0	2.6	5.0	3.9	12.0	0.4	4.5	0.9	5.0	0.8	12.0	0.1	24.0	0.2	16.5
3	15	2956	29.2	6.0	14.5	25.0	0.5	4.0	0.1	9.5	0.2	1.0	2.1	4.0	2.5	5.5	5.5	10.0	0.8	3.0	2.0	4.5	1.5	6.0	0.3	10.5	0.4	30.0
4	16	3058	40.1	10.0	22.5	30.0	1.0	7.0	0.0	1.0	0.1	66.5	3.4	3.0	3.0	3.0	6.2	7.0	1.1	4.0	2.4	11.0	1.8	9.5	0.3	36.0	0.2	3.0

This is a ‘short & fat’ dataset consisting of only 17 rows but 28 columns.

cleaning is required

missing values are observed for some columns

drug.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 28 columns):
age                        17 non-null object
n                          17 non-null int64
alcohol-use                17 non-null float64
alcohol-frequency          17 non-null float64
marijuana-use              17 non-null float64
marijuana-frequency        17 non-null float64
cocaine-use                17 non-null float64
cocaine-frequency          17 non-null object
crack-use                  17 non-null float64
crack-frequency            17 non-null object
heroin-use                 17 non-null float64
heroin-frequency           17 non-null object
hallucinogen-use           17 non-null float64
hallucinogen-frequency     17 non-null float64
inhalant-use               17 non-null float64
inhalant-frequency         17 non-null object
pain-releiver-use          17 non-null float64
pain-releiver-frequency    17 non-null float64
oxycontin-use              17 non-null float64
oxycontin-frequency        17 non-null object
tranquilizer-use           17 non-null float64
tranquilizer-frequency     17 non-null float64
stimulant-use              17 non-null float64
stimulant-frequency        17 non-null float64
meth-use                   17 non-null float64
meth-frequency             17 non-null object
sedative-use               17 non-null float64
sedative-frequency         17 non-null float64
dtypes: float64(20), int64(1), object(7)
memory usage: 3.8+ KB

some numerical values are stored with the wrong data type (object instead of float)

Columns with the object data type are examined further to understand why they were read as object although they contain numeric values

print('Age : {}'.format(drug['age'].unique()))

Age : ['12' '13' '14' '15' '16' '17' '18' '19' '20' '21' '22-23' '24-25' '26-29'
 '30-34' '35-49' '50-64' '65+']

some values in the age column are single ages whilst others are ranges (with inconsistent interval widths), and one contains a special character (+)

print('cocaine-frequency : {}'.format(drug['cocaine-frequency'].unique()))
print('crack-frequency : {}'.format(drug['crack-frequency'].unique()))
print('heroin-frequency : {}'.format(drug['heroin-frequency'].unique()))
print('inhalant-frequency : {}'.format(drug['inhalant-frequency'].unique()))
print('oxycontin-frequency : {}'.format(drug['oxycontin-frequency'].unique()))
print('meth-frequency: {}'.format(drug['meth-frequency'].unique()))

cocaine-frequency : ['5.0' '1.0' '5.5' '4.0' '7.0' '8.0' '6.0' '15.0' '36.0' '-']
crack-frequency : ['-' '3.0' '9.5' '1.0' '21.0' '10.0' '2.0' '5.0' '17.0' '6.0' '15.0'
 '48.0' '62.0']
heroin-frequency : ['35.5' '-' '2.0' '1.0' '66.5' '64.0' '46.0' '180.0' '45.0' '30.0' '57.5'
 '88.0' '50.0' '66.0' '280.0' '41.0' '120.0']
inhalant-frequency : ['19.0' '12.0' '5.0' '5.5' '3.0' '4.0' '2.0' '3.5' '10.0' '13.5' '-']
oxycontin-frequency : ['24.5' '41.0' '4.5' '3.0' '4.0' '6.0' '7.0' '7.5' '12.0' '13.5' '17.5'
 '20.0' '46.0' '5.0' '-']
meth-frequency: ['-' '5.0' '24.0' '10.5' '36.0' '48.0' '12.0' '105.0' '2.0' '46.0' '21.0'
 '30.0' '54.0' '104.0']

some missing values are recorded as ‘-’

Extraction of records containing ‘-’ (in the output, a row is listed once for every ‘-’ it contains)

drug[drug.values =='-']

	age	n	alcohol-use	alcohol-frequency	marijuana-use	marijuana-frequency	cocaine-use	cocaine-frequency	crack-frequency	heroin-use	heroin-frequency	hallucinogen-use	hallucinogen-frequency	inhalant-use	inhalant-frequency	pain-releiver-use	pain-releiver-frequency	oxycontin-use	oxycontin-frequency	tranquilizer-use	tranquilizer-frequency	stimulant-use	stimulant-frequency	meth-use	meth-frequency	sedative-use	sedative-frequency
0	12	2798	3.9	3.0	1.1	4.0	0.1	5.0	-	0.1	35.5	0.2	52.0	1.6	19.0	2.0	36.0	0.1	24.5	0.2	52.0	0.2	2.0	0.0	-	0.2	13.0
0	12	2798	3.9	3.0	1.1	4.0	0.1	5.0	-	0.1	35.5	0.2	52.0	1.6	19.0	2.0	36.0	0.1	24.5	0.2	52.0	0.2	2.0	0.0	-	0.2	13.0
1	13	2757	8.5	6.0	3.4	15.0	0.1	1.0	3.0	0.0	-	0.6	6.0	2.5	12.0	2.4	14.0	0.1	41.0	0.3	25.5	0.3	4.0	0.1	5.0	0.1	19.0
2	14	2792	18.1	5.0	8.7	24.0	0.1	5.5	-	0.1	2.0	1.6	3.0	2.6	5.0	3.9	12.0	0.4	4.5	0.9	5.0	0.8	12.0	0.1	24.0	0.2	16.5
16	65+	2448	49.3	52.0	1.2	36.0	0.0	-	-	0.0	120.0	0.1	2.0	0.0	-	0.6	24.0	0.0	-	0.2	5.0	0.0	364.0	0.0	-	0.0	15.0
16	65+	2448	49.3	52.0	1.2	36.0	0.0	-	-	0.0	120.0	0.1	2.0	0.0	-	0.6	24.0	0.0	-	0.2	5.0	0.0	364.0	0.0	-	0.0	15.0
16	65+	2448	49.3	52.0	1.2	36.0	0.0	-	-	0.0	120.0	0.1	2.0	0.0	-	0.6	24.0	0.0	-	0.2	5.0	0.0	364.0	0.0	-	0.0	15.0
16	65+	2448	49.3	52.0	1.2	36.0	0.0	-	-	0.0	120.0	0.1	2.0	0.0	-	0.6	24.0	0.0	-	0.2	5.0	0.0	364.0	0.0	-	0.0	15.0
16	65+	2448	49.3	52.0	1.2	36.0	0.0	-	-	0.0	120.0	0.1	2.0	0.0	-	0.6	24.0	0.0	-	0.2	5.0	0.0	364.0	0.0	-	0.0	15.0
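The output above repeats a row for each ‘-’ it contains. If one line per affected row is preferred, a row-level boolean mask is one option. The snippet below is a minimal sketch on a hypothetical toy frame (`toy` and its values are illustrative, not the real dataset):

```python
import pandas as pd

# Toy frame (hypothetical) that mimics the '-' placeholder in the real data
toy = pd.DataFrame({
    'age': ['12', '13', '65+'],
    'crack-frequency': ['-', '3.0', '-'],
    'meth-frequency': ['-', '5.0', '-'],
})

# True for every row that has '-' in at least one column
has_dash = (toy == '-').any(axis=1)

# each affected row appears exactly once
rows_with_dash = toy[has_dash]
print(rows_with_dash)
```

The same mask could be built on `drug` in place of `toy`, assuming the placeholder is always exactly the string '-'.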

Replacement of cells containing ‘-’ with NaN values

drug.replace('-', np.nan, inplace=True)

drug.head()

	age	n	alcohol-use	alcohol-frequency	marijuana-use	marijuana-frequency	cocaine-use	cocaine-frequency	crack-use	crack-frequency	heroin-use	heroin-frequency	hallucinogen-use	hallucinogen-frequency	inhalant-use	inhalant-frequency	pain-releiver-use	pain-releiver-frequency	oxycontin-use	oxycontin-frequency	tranquilizer-use	tranquilizer-frequency	stimulant-use	stimulant-frequency	meth-use	meth-frequency	sedative-use	sedative-frequency
0	12	2798	3.9	3.0	1.1	4.0	0.1	5.0	0.0	NaN	0.1	35.5	0.2	52.0	1.6	19.0	2.0	36.0	0.1	24.5	0.2	52.0	0.2	2.0	0.0	NaN	0.2	13.0
1	13	2757	8.5	6.0	3.4	15.0	0.1	1.0	0.0	3.0	0.0	NaN	0.6	6.0	2.5	12.0	2.4	14.0	0.1	41.0	0.3	25.5	0.3	4.0	0.1	5.0	0.1	19.0
2	14	2792	18.1	5.0	8.7	24.0	0.1	5.5	0.0	NaN	0.1	2.0	1.6	3.0	2.6	5.0	3.9	12.0	0.4	4.5	0.9	5.0	0.8	12.0	0.1	24.0	0.2	16.5
3	15	2956	29.2	6.0	14.5	25.0	0.5	4.0	0.1	9.5	0.2	1.0	2.1	4.0	2.5	5.5	5.5	10.0	0.8	3.0	2.0	4.5	1.5	6.0	0.3	10.5	0.4	30.0
4	16	3058	40.1	10.0	22.5	30.0	1.0	7.0	0.0	1.0	0.1	66.5	3.4	3.0	3.0	3.0	6.2	7.0	1.1	4.0	2.4	11.0	1.8	9.5	0.3	36.0	0.2	3.0

Examination of missing values and data type for each column

drug.isnull().sum()

age                        0
n                          0
alcohol-use                0
alcohol-frequency          0
marijuana-use              0
marijuana-frequency        0
cocaine-use                0
cocaine-frequency          1
crack-use                  0
crack-frequency            3
heroin-use                 0
heroin-frequency           1
hallucinogen-use           0
hallucinogen-frequency     0
inhalant-use               0
inhalant-frequency         1
pain-releiver-use          0
pain-releiver-frequency    0
oxycontin-use              0
oxycontin-frequency        1
tranquilizer-use           0
tranquilizer-frequency     0
stimulant-use              0
stimulant-frequency        0
meth-use                   0
meth-frequency             2
sedative-use               0
sedative-frequency         0
dtype: int64

drug.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 28 columns):
age                        17 non-null object
n                          17 non-null int64
alcohol-use                17 non-null float64
alcohol-frequency          17 non-null float64
marijuana-use              17 non-null float64
marijuana-frequency        17 non-null float64
cocaine-use                17 non-null float64
cocaine-frequency          16 non-null object
crack-use                  17 non-null float64
crack-frequency            14 non-null object
heroin-use                 17 non-null float64
heroin-frequency           16 non-null object
hallucinogen-use           17 non-null float64
hallucinogen-frequency     17 non-null float64
inhalant-use               17 non-null float64
inhalant-frequency         16 non-null object
pain-releiver-use          17 non-null float64
pain-releiver-frequency    17 non-null float64
oxycontin-use              17 non-null float64
oxycontin-frequency        16 non-null object
tranquilizer-use           17 non-null float64
tranquilizer-frequency     17 non-null float64
stimulant-use              17 non-null float64
stimulant-frequency        17 non-null float64
meth-use                   17 non-null float64
meth-frequency             15 non-null object
sedative-use               17 non-null float64
sedative-frequency         17 non-null float64
dtypes: float64(20), int64(1), object(7)
memory usage: 3.8+ KB

Adjustment of data type to numeric

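# convert every column after 'age' and 'n' to float; assigning by column label replaces the object columns
num_cols = drug.columns[2:]
drug[num_cols] = drug[num_cols].astype(float)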
drug.dtypes

age                         object
n                            int64
alcohol-use                float64
alcohol-frequency          float64
marijuana-use              float64
marijuana-frequency        float64
cocaine-use                float64
cocaine-frequency          float64
crack-use                  float64
crack-frequency            float64
heroin-use                 float64
heroin-frequency           float64
hallucinogen-use           float64
hallucinogen-frequency     float64
inhalant-use               float64
inhalant-frequency         float64
pain-releiver-use          float64
pain-releiver-frequency    float64
oxycontin-use              float64
oxycontin-frequency        float64
tranquilizer-use           float64
tranquilizer-frequency     float64
stimulant-use              float64
stimulant-frequency        float64
meth-use                   float64
meth-frequency             float64
sedative-use               float64
sedative-frequency         float64
dtype: object
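As an alternative to replacing ‘-’ and converting data types in two separate steps, the placeholder can be declared as missing at load time. The sketch below uses a hypothetical three-row extract held in a string (`csv_text` is illustrative only) so that it runs without the real file:

```python
import io
import pandas as pd

# Hypothetical three-row extract in the same layout as drug-use-by-age.csv
csv_text = """age,crack-use,crack-frequency
12,0.0,-
13,0.0,3.0
15,0.1,9.5
"""

# na_values='-' makes pandas read '-' as NaN, so the column is parsed as float64
toy = pd.read_csv(io.StringIO(csv_text), na_values='-')
print(toy.dtypes)
print(toy.isnull().sum())
```

With the real file, the same `na_values='-'` argument could be passed to the `pd.read_csv` call in Part 1; the age column would still be read as object because of values such as '22-23' and '65+'.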

drug.describe()

	n	alcohol-use	alcohol-frequency	marijuana-use	marijuana-frequency	cocaine-use	cocaine-frequency	crack-use	crack-frequency	heroin-use	heroin-frequency	hallucinogen-use	hallucinogen-frequency	inhalant-use	inhalant-frequency	pain-releiver-use	pain-releiver-frequency	oxycontin-use	oxycontin-frequency	tranquilizer-use	tranquilizer-frequency	stimulant-use	stimulant-frequency	meth-use	meth-frequency	sedative-use	sedative-frequency
count	17.000000	17.000000	17.000000	17.000000	17.000000	17.000000	16.000000	17.000000	14.000000	17.000000	16.000000	17.000000	17.000000	17.000000	16.000000	17.000000	17.000000	17.000000	16.000000	17.000000	17.000000	17.000000	17.000000	17.000000	15.000000	17.000000	17.000000
mean	3251.058824	55.429412	33.352941	18.923529	42.941176	2.176471	7.875000	0.294118	15.035714	0.352941	73.281250	3.394118	8.411765	1.388235	6.156250	6.270588	14.705882	0.935294	14.812500	2.805882	11.735294	1.917647	31.147059	0.382353	35.966667	0.282353	19.382353
std	1297.890426	26.878866	21.318833	11.959752	18.362566	1.816772	8.038449	0.235772	18.111263	0.333762	70.090173	2.792506	15.000245	0.927283	4.860448	3.166379	6.935098	0.608216	12.798275	1.753379	11.485205	1.407673	85.973790	0.262762	31.974581	0.138000	24.833527
min	2223.000000	3.900000	3.000000	1.100000	4.000000	0.000000	1.000000	0.000000	1.000000	0.000000	1.000000	0.100000	2.000000	0.000000	2.000000	0.600000	7.000000	0.000000	3.000000	0.200000	4.500000	0.000000	2.000000	0.000000	2.000000	0.000000	3.000000
25%	2469.000000	40.100000	10.000000	8.700000	30.000000	0.500000	5.000000	0.000000	5.000000	0.100000	39.625000	0.600000	3.000000	0.600000	3.375000	3.900000	12.000000	0.400000	5.750000	1.400000	6.000000	0.600000	7.000000	0.200000	12.000000	0.200000	6.500000
50%	2798.000000	64.600000	48.000000	20.800000	52.000000	2.000000	5.250000	0.400000	7.750000	0.200000	53.750000	3.200000	3.000000	1.400000	4.000000	6.200000	12.000000	1.100000	12.000000	3.500000	10.000000	1.800000	10.000000	0.400000	30.000000	0.300000	10.000000
75%	3058.000000	77.500000	52.000000	28.400000	52.000000	4.000000	7.250000	0.500000	16.500000	0.600000	71.875000	5.200000	4.000000	2.000000	6.625000	9.000000	15.000000	1.400000	18.125000	4.200000	11.000000	3.000000	12.000000	0.600000	47.000000	0.400000	17.500000
max	7391.000000	84.200000	52.000000	34.000000	72.000000	4.900000	36.000000	0.600000	62.000000	1.100000	280.000000	8.600000	52.000000	3.000000	19.000000	10.000000	36.000000	1.700000	46.000000	5.400000	52.000000	4.100000	364.000000	0.900000	105.000000	0.500000	104.000000

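some columns have a standard deviation close to or larger than their mean (for example heroin-frequency and stimulant-frequency), which indicates a wide spread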

Part 2 : High Level Overview of Data

2.1 : Check age group vs sample size distribution

drug.plot.bar(x='age', y='n', figsize=(15,6), color='grey')
plt.title('Distribution of Sample Size by Age')
plt.ylabel('sample size')

Sample size for ages 12 to 21 is fairly stable, ranging from about 2,200 to 3,100

for ages 12 - 17 it is around 2,750 to 3,050

for ages 18 - 21 it is around 2,200 to 2,500

Sample size for the other age groups varies a lot (from about 2,400 to 7,400), and the age intervals are inconsistent from age 22 onwards

2.2 : To allow better comparison and avoid misleading graphs caused by mixing single ages with age groups, convert each age group to its midpoint (65+ is kept as 65)

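def age_modified(age):
    # '65+' has no upper bound, so use the lower bound
    if '+' in age:
        age = float(age.strip('+'))
    # a range such as '22-23' is replaced by its midpoint
    elif '-' in age:
        x = age.split('-')
        age = (float(x[1]) - float(x[0]))/2. + float(x[0])
    # a single age is converted as-is
    else:
        age = float(age)
    return age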

drug['age'] = drug['age'].apply(age_modified)
drug

	age	n	alcohol-use	alcohol-frequency	marijuana-use	marijuana-frequency	cocaine-use	cocaine-frequency	crack-use	crack-frequency	heroin-use	heroin-frequency	hallucinogen-use	hallucinogen-frequency	inhalant-use	inhalant-frequency	pain-releiver-use	pain-releiver-frequency	oxycontin-use	oxycontin-frequency	tranquilizer-use	tranquilizer-frequency	stimulant-use	stimulant-frequency	meth-use	meth-frequency	sedative-use	sedative-frequency
0	12.0	2798	3.9	3.0	1.1	4.0	0.1	5.0	0.0	NaN	0.1	35.5	0.2	52.0	1.6	19.0	2.0	36.0	0.1	24.5	0.2	52.0	0.2	2.0	0.0	NaN	0.2	13.0
1	13.0	2757	8.5	6.0	3.4	15.0	0.1	1.0	0.0	3.0	0.0	NaN	0.6	6.0	2.5	12.0	2.4	14.0	0.1	41.0	0.3	25.5	0.3	4.0	0.1	5.0	0.1	19.0
2	14.0	2792	18.1	5.0	8.7	24.0	0.1	5.5	0.0	NaN	0.1	2.0	1.6	3.0	2.6	5.0	3.9	12.0	0.4	4.5	0.9	5.0	0.8	12.0	0.1	24.0	0.2	16.5
3	15.0	2956	29.2	6.0	14.5	25.0	0.5	4.0	0.1	9.5	0.2	1.0	2.1	4.0	2.5	5.5	5.5	10.0	0.8	3.0	2.0	4.5	1.5	6.0	0.3	10.5	0.4	30.0
4	16.0	3058	40.1	10.0	22.5	30.0	1.0	7.0	0.0	1.0	0.1	66.5	3.4	3.0	3.0	3.0	6.2	7.0	1.1	4.0	2.4	11.0	1.8	9.5	0.3	36.0	0.2	3.0
5	17.0	3038	49.3	13.0	28.0	36.0	2.0	5.0	0.1	21.0	0.1	64.0	4.8	3.0	2.0	4.0	8.5	9.0	1.4	6.0	3.5	7.0	2.8	9.0	0.6	48.0	0.5	6.5
6	18.0	2469	58.7	24.0	33.7	52.0	3.2	5.0	0.4	10.0	0.4	46.0	7.0	4.0	1.8	4.0	9.2	12.0	1.7	7.0	4.9	12.0	3.0	8.0	0.5	12.0	0.4	10.0
7	19.0	2223	64.6	36.0	33.4	60.0	4.1	5.5	0.5	2.0	0.5	180.0	8.6	3.0	1.4	3.0	9.4	12.0	1.5	7.5	4.2	4.5	3.3	6.0	0.4	105.0	0.3	6.0
8	20.0	2271	69.7	48.0	34.0	60.0	4.9	8.0	0.6	5.0	0.9	45.0	7.4	2.0	1.5	4.0	10.0	10.0	1.7	12.0	5.4	10.0	4.0	12.0	0.9	12.0	0.5	4.0
9	21.0	2354	83.2	52.0	33.0	52.0	4.8	5.0	0.5	17.0	0.6	30.0	6.3	4.0	1.4	2.0	9.0	15.0	1.3	13.5	3.9	7.0	4.1	10.0	0.6	2.0	0.3	9.0
10	22.5	4707	84.2	52.0	28.4	52.0	4.5	5.0	0.5	5.0	1.1	57.5	5.2	3.0	1.0	4.0	10.0	15.0	1.7	17.5	4.4	12.0	3.6	10.0	0.6	46.0	0.2	52.0
11	24.5	4591	83.1	52.0	24.9	60.0	4.0	6.0	0.5	6.0	0.7	88.0	4.5	2.0	0.8	2.0	9.0	15.0	1.3	20.0	4.3	10.0	2.6	10.0	0.7	21.0	0.2	17.5
12	27.5	2628	80.7	52.0	20.8	52.0	3.2	5.0	0.4	6.0	0.6	50.0	3.2	3.0	0.6	4.0	8.3	13.0	1.2	13.5	4.2	10.0	2.3	7.0	0.6	30.0	0.4	4.0
13	32.0	2864	77.5	52.0	16.4	72.0	2.1	8.0	0.5	15.0	0.4	66.0	1.8	2.0	0.4	3.5	5.9	22.0	0.9	46.0	3.6	8.0	1.4	12.0	0.4	54.0	0.4	10.0
14	42.0	7391	75.0	52.0	10.4	48.0	1.5	15.0	0.5	48.0	0.1	280.0	0.6	3.0	0.3	10.0	4.2	12.0	0.3	12.0	1.9	6.0	0.6	24.0	0.2	104.0	0.3	10.0
15	57.0	3923	67.2	52.0	7.3	52.0	0.9	36.0	0.4	62.0	0.1	41.0	0.3	44.0	0.2	13.5	2.5	12.0	0.4	5.0	1.4	10.0	0.3	24.0	0.2	30.0	0.2	104.0
16	65.0	2448	49.3	52.0	1.2	36.0	0.0	NaN	0.0	NaN	0.0	120.0	0.1	2.0	0.0	NaN	0.6	24.0	0.0	NaN	0.2	5.0	0.0	364.0	0.0	NaN	0.0	15.0

2.3 : Set up 2 dataframes –> one by drugUse ; one by drugFrequency

drug.columns

Index(['age', 'n', 'alcohol-use', 'alcohol-frequency', 'marijuana-use',
       'marijuana-frequency', 'cocaine-use', 'cocaine-frequency', 'crack-use',
       'crack-frequency', 'heroin-use', 'heroin-frequency', 'hallucinogen-use',
       'hallucinogen-frequency', 'inhalant-use', 'inhalant-frequency',
       'pain-releiver-use', 'pain-releiver-frequency', 'oxycontin-use',
       'oxycontin-frequency', 'tranquilizer-use', 'tranquilizer-frequency',
       'stimulant-use', 'stimulant-frequency', 'meth-use', 'meth-frequency',
       'sedative-use', 'sedative-frequency'],
      dtype='object')

use_columns = [col for col in drug.columns if 'use' in col]
frequency_columns = [col for col in drug.columns if 'frequency' in col]

# keep the age column at the front of each new df; .copy() avoids working on a view of `drug`
df_drugUse = drug[['age'] + use_columns].copy()
df_drugFrequency = drug[['age'] + frequency_columns].copy()

Check if df by drugUse is correctly set up

df_drugUse.head()

	age	alcohol-use	marijuana-use	cocaine-use	crack-use	heroin-use	hallucinogen-use	inhalant-use	pain-releiver-use	oxycontin-use	tranquilizer-use	stimulant-use	meth-use	sedative-use
0	12.0	3.9	1.1	0.1	0.0	0.1	0.2	1.6	2.0	0.1	0.2	0.2	0.0	0.2
1	13.0	8.5	3.4	0.1	0.0	0.0	0.6	2.5	2.4	0.1	0.3	0.3	0.1	0.1
2	14.0	18.1	8.7	0.1	0.0	0.1	1.6	2.6	3.9	0.4	0.9	0.8	0.1	0.2
3	15.0	29.2	14.5	0.5	0.1	0.2	2.1	2.5	5.5	0.8	2.0	1.5	0.3	0.4
4	16.0	40.1	22.5	1.0	0.0	0.1	3.4	3.0	6.2	1.1	2.4	1.8	0.3	0.2

Check if df by drugFrequency is correctly set up

df_drugFrequency.head()

	age	alcohol-frequency	marijuana-frequency	cocaine-frequency	crack-frequency	heroin-frequency	hallucinogen-frequency	inhalant-frequency	pain-releiver-frequency	oxycontin-frequency	tranquilizer-frequency	stimulant-frequency	meth-frequency	sedative-frequency
0	12.0	3.0	4.0	5.0	NaN	35.5	52.0	19.0	36.0	24.5	52.0	2.0	NaN	13.0
1	13.0	6.0	15.0	1.0	3.0	NaN	6.0	12.0	14.0	41.0	25.5	4.0	5.0	19.0
2	14.0	5.0	24.0	5.5	NaN	2.0	3.0	5.0	12.0	4.5	5.0	12.0	24.0	16.5
3	15.0	6.0	25.0	4.0	9.5	1.0	4.0	5.5	10.0	3.0	4.5	6.0	10.5	30.0
4	16.0	10.0	30.0	7.0	1.0	66.5	3.0	3.0	7.0	4.0	11.0	9.5	36.0	3.0

2.4a : Visualize data by drugUse in stacked-bar chart

df_drugUse.plot(x='age', kind='bar',stacked=True, figsize=(20,8), colormap='tab20',rot=1)
plt.ylabel('% of population taking the drug')
plt.title('Distribution of Population by Age for Various Drug Types')

alcohol has the highest usage in all ages/age groups

marijuana has the 2nd highest usage in most ages/age groups. However, a reduction in marijuana use was observed from age 22 onwards

pain reliever is the 3rd most used drug; the % of people using it stayed roughly stable (8.5% to 10%) for ages 17-21

2.4b : Another visualization of drugUse using line chart

df_drugUse.plot('age', xticks=np.arange(10,70,5), figsize=(20,8))
plt.ylabel('% of age population')
plt.title('Distribution of Population by Age for Various Drug Types')

2.5a : Visualization of drug data by Frequency in stacked-bar chart

df_drugFrequency.plot(x='age', figsize=(20,8), stacked=True, kind='bar', colormap='tab20')
plt.ylabel('Frequency')
plt.title('Distribution of Drug Intake Frequency by Age')

heroin was the drug with the highest frequency of intake at age 19 and in the 35-49 age group

stimulant showed a sharp spike in frequency in the 65+ age group

marijuana frequency was fairly stable from age 18 through the 50-64 age group

2.5b : Another visualization of drugFrequency using line chart

df_drugFrequency.plot('age', figsize=(20,8), xticks=np.arange(10,70,5))
plt.ylabel('Median Frequency of Drug Intake')
plt.title('Distribution of Drug Frequency by Age for Various Drug Types')

The stacked bar chart gives a clearer comparison than the line plot

2.6 : Check the spread of the data for each category using boxplot

Since the data ranges vary significantly, the data is standardized before plotting so that all columns are comparable on the same scale

drugUse_bx = df_drugUse.drop('age', axis=1)
drugFrequency_bx = df_drugFrequency.drop('age', axis=1)

# standardize data on same scale
std_drugUse_bx = (drugUse_bx - drugUse_bx.mean())/drugUse_bx.std()
std_drugFrequency_bx = (drugFrequency_bx - drugFrequency_bx.mean())/drugFrequency_bx.std()

std_drugUse_bx

	alcohol-use	marijuana-use	cocaine-use	crack-use	heroin-use	hallucinogen-use	inhalant-use	pain-releiver-use	oxycontin-use	tranquilizer-use	stimulant-use	meth-use	sedative-use
0	-1.917098	-1.490293	-1.142945	-1.247469	-0.757849	-1.143818	0.228371	-1.348729	-1.373352	-1.486206	-1.220203	-1.455128	-0.596759
1	-1.745959	-1.297981	-1.142945	-1.247469	-1.057464	-1.000577	1.198949	-1.222402	-1.373352	-1.429173	-1.149164	-1.074556	-1.321394
2	-1.388802	-0.854828	-1.142945	-1.247469	-0.757849	-0.642476	1.306791	-0.748675	-0.880106	-1.086977	-0.793968	-1.074556	-0.596759
3	-0.975838	-0.369868	-0.922774	-0.823329	-0.458234	-0.463425	1.198949	-0.243366	-0.222444	-0.459617	-0.296693	-0.313412	0.852512
4	-0.570315	0.299042	-0.647561	-1.247469	-0.757849	0.002106	1.738159	-0.022293	0.270802	-0.231486	-0.083576	-0.313412	-0.596759
5	-0.228038	0.758918	-0.097134	-0.823329	-0.757849	0.503448	0.659739	0.704089	0.764048	0.395874	0.626817	0.828303	1.577148
6	0.121679	1.235516	0.563378	0.449089	0.140995	1.291271	0.444055	0.925161	1.257294	1.194333	0.768895	0.447732	0.852512
7	0.341182	1.210432	1.058762	0.873228	0.440610	1.864233	0.012687	0.988325	0.928463	0.795103	0.982013	0.067160	0.127877
8	0.530922	1.260601	1.499103	1.297367	1.639069	1.434512	0.120529	1.177816	1.257294	1.479496	1.479287	1.970019	1.577148
9	1.033176	1.176987	1.444061	0.873228	0.740225	1.040600	0.012687	0.861998	0.599632	0.624005	1.550326	0.828303	0.127877
10	1.070380	0.792363	1.278933	0.873228	2.238298	0.646689	-0.418681	1.177816	1.257294	0.909169	1.195130	0.828303	-0.596759
11	1.029455	0.499715	1.003719	0.873228	1.039839	0.396018	-0.634365	0.861998	0.599632	0.852136	0.484738	1.208875	-0.596759
12	0.940166	0.156899	0.563378	0.449089	0.740225	-0.069514	-0.850049	0.640925	0.435217	0.795103	0.271621	0.828303	0.852512
13	0.821113	-0.211002	-0.042091	0.873228	0.140995	-0.570856	-1.065733	-0.117038	-0.058029	0.452907	-0.367732	0.067160	0.852512
14	0.728103	-0.712684	-0.372347	0.873228	-0.757849	-1.000577	-1.173575	-0.653929	-1.044521	-0.516649	-0.936046	-0.693984	0.127877
15	0.437912	-0.971887	-0.702603	0.449089	-0.757849	-1.108008	-1.281417	-1.190820	-0.880106	-0.801813	-1.149164	-0.693984	-0.596759
16	-0.228038	-1.481931	-1.197988	-1.247469	-1.057464	-1.179628	-1.497101	-1.790875	-1.537767	-1.486206	-1.362281	-1.455128	-2.046029

Comparison of various drugUse spread on same scale

plt.figure(figsize=(20,10))
sns.boxplot(data=std_drugUse_bx, orient='h')
plt.title('Boxplot on Standardized Scale for DrugUse')

Comparison of various drugFrequency spread on same scale

plt.figure(figsize=(20,10))
sns.boxplot(data=std_drugFrequency_bx, orient='h')
plt.title('Boxplot on Standardized Scale for DrugFrequency')

drugFrequency data is more scattered and has more outliers than drugUse data

2.7a : Check the correlation of data in drugUse dataset

drugUse_temp = df_drugUse.drop('age', axis=1)
drugFrequency_temp = df_drugFrequency.drop('age', axis=1)

plt.figure(figsize=(12,8))
sns.heatmap(drugUse_temp.corr(),cmap='RdBu_r',annot=True)

inhalant-use has little to no, or negative, correlation with the rest of the drug-use columns

the other drug-use columns are generally positively correlated with each other, to varying degrees

Visualization of the correlation of top 4 popular drugs (alcohol, marijuana, hallucinogen, and pain-reliever)

sns.pairplot(drugUse_temp[['alcohol-use','marijuana-use','hallucinogen-use','pain-releiver-use']],kind='reg')

all 4 drugs showed positive correlation with each other

alcohol-use showed a more scattered relationship with the other drugs

whilst the other 3 drugs (marijuana, hallucinogen, and pain-reliever) were quite strongly positively correlated

2.7b : Check the correlation of data in drugFrequency dataset

plt.figure(figsize=(12,8))
sns.heatmap(data=drugFrequency_temp.corr(), cmap='RdBu_r', annot=True)

there is a mixture of positive and negative correlation among the various drugFrequency columns

crack-frequency and stimulant-frequency have a positive correlation of up to 0.9
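Reading the strongest pairs off an annotated heatmap by eye is error-prone, so they can also be listed programmatically. The helper below is a sketch only: `top_correlated_pairs` is a hypothetical name, and the toy frame copies the first five rows of three drug-use columns from the table above.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=3):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.corr()
    # keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # sort by absolute value so strong negative correlations are ranked too
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Values copied from the first five rows of the drug-use table (ages 12 to 16)
toy = pd.DataFrame({
    'alcohol-use': [3.9, 8.5, 18.1, 29.2, 40.1],
    'marijuana-use': [1.1, 3.4, 8.7, 14.5, 22.5],
    'inhalant-use': [1.6, 2.5, 2.6, 2.5, 3.0],
})
top = top_correlated_pairs(toy, n=2)
print(top)
```

Calling the helper on `drugFrequency_temp` or `drugUse_temp` would rank the pairs shown in the heatmaps, assuming those frames contain only numeric columns.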

Visualization of the correlation of drug frequency for the same top 4 popular drugs (alcohol, marijuana, hallucinogen, and pain-reliever)

sns.pairplot(drugFrequency_temp[['alcohol-frequency','marijuana-frequency','hallucinogen-frequency','pain-releiver-frequency']],kind='reg')

No strong correlation was observed among the frequencies of the 4 drugs

Hallucinogen-frequency falls into 2 extremes, with the majority of data points at the lower frequency level.

The high hallucinogen-frequency data points could be outliers/exceptional intakes

Part 3 : Hypothesis Generation and Testing

In the data exploration process, it is common to generate some assumptions, then test and validate them before summarizing the findings and drawing solid conclusions. For this section, the observations in Part 2 were used to practise hypothesis generation and testing.

Question to explore :

The correlation matrix showed a strong correlation (0.98) between oxycontin-use and pain-reliever-use.

Do pain-reliever users have a similar age group distribution to oxycontin users?

\[H_0: \text{Use}_{\text{pain-reliever}} = \text{Use}_{\text{oxycontin}}\]

\[H_1: \text{Use}_{\text{pain-reliever}} \neq \text{Use}_{\text{oxycontin}}\]

For their frequencies, however, the correlation was only 0.56. Is this correlation statistically significant?

Among these 2 groups of drug users, are pain relievers taken as frequently as oxycontin?

\[H_0: \text{frequency}_{\text{pain-reliever}} = \text{frequency}_{\text{oxycontin}}\]

\[H_1: \text{frequency}_{\text{pain-reliever}} \neq \text{frequency}_{\text{oxycontin}}\]

Deliverables :

joint plot

stats summary to include p-values

3.a : Comparison on Drug Use

1st examination through graphical view

sns.jointplot(x='pain-releiver-use', y='oxycontin-use', data=drugUse_temp, kind='reg')

2nd examination through stats library on p-value

p_value_drugUse = stats.ttest_ind(drugUse_temp['pain-releiver-use'],drugUse_temp['oxycontin-use'])
p_value_drugUse

Ttest_indResult(statistic=6.82263516475104, pvalue=1.0265878201430413e-07)

3.b : Comparison on Drug Frequency

1st examination through graphical view

Repeat the same workflow as in 3.a

sns.jointplot(x='pain-releiver-frequency', y='oxycontin-frequency', data=drugFrequency_temp, kind='reg')

p_value_drugFrequency = stats.ttest_ind(drugFrequency_temp['pain-releiver-frequency'], drugFrequency_temp['oxycontin-frequency'],nan_policy='omit')
p_value_drugFrequency
# need to set nan_policy='omit' in order to ignore nan value in oxycontin-frequency for p-value calculation

Ttest_indResult(statistic=-0.030003630957118617, pvalue=0.9762564938195634)

Conclusion :

for drug use

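The Pearson correlation coefficient is high at 0.98, so the percentage of pain-reliever users and the percentage of oxycontin users rise and fall together across age groups. The t-test p-value is small at 1.0265878201430413e-07, therefore the null hypothesis of equal use is rejected: the average usage rate of pain relievers (6.27%) differs from that of oxycontin (0.94%). Note that the t-test compares means only; it does not test whether the correlation itself is significant.

for drug frequency

The Pearson correlation coefficient is only 0.56. The t-test p-value is large at 0.9762564938195634, therefore the null hypothesis cannot be rejected: there is no evidence that the average frequency of pain-reliever intake (14.71) differs from that of oxycontin (14.81). No conclusion can be drawn from this test about the correlation between the two frequencies.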
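The t-tests in 3.a and 3.b compare the means of two columns. To ask whether a correlation itself is statistically significant, one option is `scipy.stats.pearsonr`, which returns the coefficient together with a p-value for the null hypothesis of no correlation. The sketch below copies the pain-releiver-use and oxycontin-use values from the table in 2.2; it is an illustration rather than part of the original workflow.

```python
from scipy import stats

# Values copied from the pain-releiver-use and oxycontin-use columns shown in 2.2
pain_reliever_use = [2.0, 2.4, 3.9, 5.5, 6.2, 8.5, 9.2, 9.4, 10.0,
                     9.0, 10.0, 9.0, 8.3, 5.9, 4.2, 2.5, 0.6]
oxycontin_use = [0.1, 0.1, 0.4, 0.8, 1.1, 1.4, 1.7, 1.5, 1.7,
                 1.3, 1.7, 1.3, 1.2, 0.9, 0.3, 0.4, 0.0]

# pearsonr tests H0: the two variables are uncorrelated
r, p_value = stats.pearsonr(pain_reliever_use, oxycontin_use)
print('r = {:.2f}, p-value = {:.2e}'.format(r, p_value))
```

For the frequency columns, rows with NaN in oxycontin-frequency would need to be dropped from both columns before calling `pearsonr`.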

Part 4 : Outliers Handling

Outlier handling is common in data analysis. In this section, a subset of the data is extracted and used to outline how outliers can be examined and corrected.

Pain-reliever-frequency is used to study the effect of outliers

fig, ax = plt.subplots(2,1,figsize=(10,6), sharex=True)

sns.boxplot(data=drugFrequency_temp['pain-releiver-frequency'], orient='h',ax=ax[0])
# sns.histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(x=drugFrequency_temp['pain-releiver-frequency'], bins=30, kde=True, stat='density', ax=ax[1])

4 outlier data points were observed

4.a : Extraction of outlier data points

# Get the IQR
dataExamined = drugFrequency_temp['pain-releiver-frequency']
q25, q75 = np.percentile(dataExamined, [25,75])
IQR = q75 - q25

# Get outlier points below q25 - 1.5*IQR and above q75 + 1.5*IQR
outliers_abv = dataExamined[dataExamined>(q75+1.5*IQR)]
outliers_below = dataExamined[dataExamined<(q25-1.5*IQR)]

# List out all outlier points (Series.append was removed in pandas 2.0, so use pd.concat)
outliers_list = list(pd.concat([outliers_below, outliers_abv]))
outliers_list

[7.0, 36.0, 22.0, 24.0]

4.b : Removal of outlier data points from examined dataset

dataExamined_clean = [v for v in dataExamined if v not in outliers_list]
print(len(dataExamined_clean))
dataExamined_clean

13

[14.0, 12.0, 10.0, 9.0, 12.0, 12.0, 10.0, 15.0, 15.0, 15.0, 13.0, 12.0, 12.0]
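Steps 4.a and 4.b can also be expressed with a single boolean mask, which avoids matching values against a list and keeps the original index. This is a minimal sketch: `iqr_outlier_mask` is a hypothetical helper name, and the values are the pain-releiver-frequency column copied from the table in 2.2.

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(values, k=1.5):
    """Boolean mask: True where a value lies outside [q25 - k*IQR, q75 + k*IQR]."""
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    return (values < q25 - k * iqr) | (values > q75 + k * iqr)

# pain-releiver-frequency values copied from the table in 2.2
freq = pd.Series([36.0, 14.0, 12.0, 10.0, 7.0, 9.0, 12.0, 12.0, 10.0,
                  15.0, 15.0, 15.0, 13.0, 22.0, 12.0, 12.0, 24.0])

mask = iqr_outlier_mask(freq)
outliers = freq[mask]    # the flagged points, in original row order
clean = freq[~mask]      # everything else

print(outliers.tolist())
print(len(clean))
```

The mask flags the same 4 points as the list above (in row order rather than below/above order) and leaves 13 values. Note that `np.percentile` assumes there are no NaN values, which holds for this column.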

4.c : Comparison of mean, median, std dev for dataset with outliers vs no outlier

# set up dataframe for comparison
df_comparison = pd.DataFrame(dataExamined)
df_comparison.columns = ['With outliers']

# replace those identified as outlier point with nan in order to do stats calculation for scenario with no outliers
df_comparison['No outliers'] = df_comparison['With outliers'].map(lambda x: np.nan if x in outliers_list else x)
df_comparison.head()

	With outliers	No outliers
0	36.0	NaN
1	14.0	14.0
2	12.0	12.0
3	10.0	10.0
4	7.0	NaN

4.d : Transpose dataframe for ease of comparison and calculation by columns

df_comparison_T = df_comparison.T
df_comparison_T

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
With outliers	36.0	14.0	12.0	10.0	7.0	9.0	12.0	12.0	10.0	15.0	15.0	15.0	13.0	22.0	12.0	12.0	24.0
No outliers	NaN	14.0	12.0	10.0	NaN	9.0	12.0	12.0	10.0	15.0	15.0	15.0	13.0	NaN	12.0	12.0	NaN

# compute each statistic from the 17 original value columns only, so that the
# new 'mean' column is not included when the median and std dev are calculated
value_cols = list(df_comparison.index)
df_comparison_T['mean'] = df_comparison_T[value_cols].mean(axis=1)
df_comparison_T['median'] = df_comparison_T[value_cols].median(axis=1)
df_comparison_T['stdDev'] = df_comparison_T[value_cols].std(axis=1)
df_comparison_T

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	mean	median	stdDev
With outliers	36.0	14.0	12.0	10.0	7.0	9.0	12.0	12.0	10.0	15.0	15.0	15.0	13.0	22.0	12.0	12.0	24.0	14.705882	12.0	6.935098
No outliers	NaN	14.0	12.0	10.0	NaN	9.0	12.0	12.0	10.0	15.0	15.0	15.0	13.0	NaN	12.0	12.0	NaN	12.384615	12.0	1.980676

df_comparison_T[['mean','median','stdDev']]

	mean	median	stdDev
With outliers	14.705882	12.0	6.935098
No outliers	12.384615	12.0	1.980676

the mean and stdDev are smaller when the outlier data points are excluded, while the median stays at 12.0, which shows that the median is robust to outliers

sns.boxplot(data=df_comparison,orient='h')

# additional plot to see the distribution of df with no outliers
# (sns.histplot with kde=True replaces the deprecated sns.distplot)

fig, ax = plt.subplots(2,2,figsize=(15,6), sharex=True)

sns.boxplot(data=df_comparison['With outliers'], orient='h', ax=ax[0][0])
sns.histplot(x=df_comparison['With outliers'], bins=30, kde=True, stat='density', ax=ax[1][0])

sns.boxplot(data=df_comparison['No outliers'], orient='h', ax=ax[0][1], color='g')
sns.histplot(x=df_comparison.dropna()['No outliers'], bins=30, kde=True, stat='density', ax=ax[1][1], color='g')

Key Learnings

There are many techniques that we can use to explore data, and there is no single best method that works perfectly for every dataset. The approach depends very much on the data, and there is room for creativity in how the data is presented and visualized. Most importantly, the chosen method should improve our understanding of the data and therefore make it more useful for higher-order processing or modeling.