Exploratory data analysis (EDA) is an essential part of any data science project. It helps to prepare, process and visualize the data so that better preliminary judgements can be made. A high-quality, comprehensive EDA eases the modeling stage and leads to more reliable and conclusive outcomes.
This project therefore focuses solely on EDA and uses a variety of techniques to present the data and the related analysis. A dataset of SAT scores by state was used to practise different EDA techniques, with different plots helping to understand and visualize the dataset, gain insights and draw conclusions.
Package imports
import numpy as np
import pandas as pd
import csv
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
plt.rcParams["patch.force_edgecolor"] = True
sns.set(style='darkgrid')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
Part 1 : Data Loading Using csv Module vs Using Pandas
1.1 Load the file with the csv module and put it in a Python dictionary
sat = './sat_scores.csv'
# read the raw file line by line ('rU' mode is deprecated, so plain 'r' is used)
with open(sat, 'r') as f:
    content = [line for line in f]

# first line holds the column names; the remaining lines are data rows
keys = content[0].replace('\n', '').split(',')
temp_values = content[1:]

k1_values = []
k2_values = []
k3_values = []
k4_values = []
for v_temp in temp_values:
    v_temp = v_temp.replace('\n', '').split(',')
    k1_values.append(v_temp[0])
    k2_values.append(v_temp[1])
    k3_values.append(v_temp[2])
    k4_values.append(v_temp[3])

# map each column name to its list of values
sat_dict = dict(zip(keys, [k1_values, k2_values, k3_values, k4_values]))
print(sat_dict)
{'State': ['CT', 'NJ', 'MA', 'NY', 'NH', 'RI', 'PA', 'VT', 'ME', 'VA', 'DE', 'MD', 'NC', 'GA', 'IN', 'SC', 'DC', 'OR', 'FL', 'WA', 'TX', 'HI', 'AK', 'CA', 'AZ', 'NV', 'CO', 'OH', 'MT', 'WV', 'ID', 'TN', 'NM', 'IL', 'KY', 'WY', 'MI', 'MN', 'KS', 'AL', 'NE', 'OK', 'MO', 'LA', 'WI', 'AR', 'UT', 'IA', 'SD', 'ND', 'MS', 'All'], 'Rate': ['82', '81', '79', '77', '72', '71', '71', '69', '69', '68', '67', '65', '65', '63', '60', '57', '56', '55', '54', '53', '53', '52', '51', '51', '34', '33', '31', '26', '23', '18', '17', '13', '13', '12', '12', '11', '11', '9', '9', '9', '8', '8', '8', '7', '6', '6', '5', '5', '4', '4', '4', '45'], 'Verbal': ['509', '499', '511', '495', '520', '501', '500', '511', '506', '510', '501', '508', '493', '491', '499', '486', '482', '526', '498', '527', '493', '485', '514', '498', '523', '509', '539', '534', '539', '527', '543', '562', '551', '576', '550', '547', '561', '580', '577', '559', '562', '567', '577', '564', '584', '562', '575', '593', '577', '592', '566', '506'], 'Math': ['510', '513', '515', '505', '516', '499', '499', '506', '500', '501', '499', '510', '499', '489', '501', '488', '474', '526', '499', '527', '499', '515', '510', '517', '525', '515', '542', '439', '539', '512', '542', '553', '542', '589', '550', '545', '572', '589', '580', '554', '568', '561', '577', '562', '596', '550', '570', '603', '582', '599', '551', '514']}
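For reference, a minimal sketch of the same load done with the csv module itself (which this sub-section's title refers to); sat_dict_alt and keys_alt are illustrative names, and for this simple comma-only file the result should match the dictionary built above.
# alternative load using the csv module (csv was imported above);
# the reader handles the comma splitting and newline stripping for us
with open(sat, 'r', newline='') as f:
    rows = list(csv.reader(f))

keys_alt = rows[0]                      # header row
columns = list(zip(*rows[1:]))          # transpose data rows into per-column tuples
sat_dict_alt = {k: list(col) for k, col in zip(keys_alt, columns)}
print(sat_dict_alt == sat_dict)         # expected True for this simple, comma-only file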
1.2 Make a pandas DataFrame object with the SAT dictionary, and another with the pandas .read_csv() function
# set up dataframe with raw data from dict derived earlier
df_sat_dict = pd.DataFrame(sat_dict)
df_sat_dict.head()
 | State | Rate | Verbal | Math |
---|---|---|---|---|
0 | CT | 82 | 509 | 510 |
1 | NJ | 81 | 499 | 513 |
2 | MA | 79 | 511 | 515 |
3 | NY | 77 | 495 | 505 |
4 | NH | 72 | 520 | 516 |
df_sat_dict.tail()
 | State | Rate | Verbal | Math |
---|---|---|---|---|
47 | IA | 5 | 593 | 603 |
48 | SD | 4 | 577 | 582 |
49 | ND | 4 | 592 | 599 |
50 | MS | 4 | 566 | 551 |
51 | All | 45 | 506 | 514 |
The last row ('All') is an aggregate over all states rather than an individual state, so it needs to be removed from the analysis to avoid skewing the data.
df_sat_dict.drop(index=51, inplace=True)
df_sat_dict.tail()
 | State | Rate | Verbal | Math |
---|---|---|---|---|
46 | UT | 5 | 575 | 570 |
47 | IA | 5 | 593 | 603 |
48 | SD | 4 | 577 | 582 |
49 | ND | 4 | 592 | 599 |
50 | MS | 4 | 566 | 551 |
df_sat_dict.dtypes
State object
Rate object
Verbal object
Math object
dtype: object
The data types of all columns above were found to be object (string), including those that should be numeric.
# set up dataframe with raw data loaded directly using read_csv()
df_sat = pd.read_csv(sat)
df_sat.head()
 | State | Rate | Verbal | Math |
---|---|---|---|---|
0 | CT | 82 | 509 | 510 |
1 | NJ | 81 | 499 | 513 |
2 | MA | 79 | 511 | 515 |
3 | NY | 77 | 495 | 505 |
4 | NH | 72 | 520 | 516 |
df_sat.tail()
 | State | Rate | Verbal | Math |
---|---|---|---|---|
47 | IA | 5 | 593 | 603 |
48 | SD | 4 | 577 | 582 |
49 | ND | 4 | 592 | 599 |
50 | MS | 4 | 566 | 551 |
51 | All | 45 | 506 | 514 |
df_sat.drop(index=51, axis=0, inplace=True)
df_sat.tail()
 | State | Rate | Verbal | Math |
---|---|---|---|---|
46 | UT | 5 | 575 | 570 |
47 | IA | 5 | 593 | 603 |
48 | SD | 4 | 577 | 582 |
49 | ND | 4 | 592 | 599 |
50 | MS | 4 | 566 | 551 |
df_sat.dtypes
# numeric columns are parsed as integers automatically when using read_csv()
State object
Rate int64
Verbal int64
Math int64
dtype: object
Observation on the difference:
- If we do not explicitly convert the string values in the dictionary to numbers, the columns in the DataFrame are of type object. With read_csv(), however, the data types are inferred correctly without any additional adjustment, making read_csv() the better and faster option (see the sketch below for the extra conversion step the dictionary route would need).
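As a sketch of that extra step, the object columns of the dict-based DataFrame can be converted explicitly (column names as defined above):
# convert the string columns of the dict-based DataFrame to integers
for col in ['Rate', 'Verbal', 'Math']:
    df_sat_dict[col] = pd.to_numeric(df_sat_dict[col])

df_sat_dict.dtypes   # Rate, Verbal and Math should now show as int64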
Part 2 : Create a “Data Dictionary”
A data dictionary is an object that describes the data. It helps users understand the data better and thus eases the EDA process. The data dictionary in this exercise includes the name of each variable (column), the type of the variable, a description of what the variable is, and the number of records (rows) for each variable.
data_dict = pd.DataFrame(columns=['Column_name', 'Data_type', 'Description', 'No_of_records'])
data_dict['Column_name'] = df_sat.columns
data_dict['Data_type'] = list(df_sat.dtypes)
data_dict['Description'] = ['US State of SAT scores', 'Rate of students that take the SAT in that state', 'Mean verbal score in that state', 'Mean math score in that state']
df_sat.info() # since all columns have the same number of records, .shape[0] can be used to get this info
<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 50
Data columns (total 4 columns):
State 51 non-null object
Rate 51 non-null int64
Verbal 51 non-null int64
Math 51 non-null int64
dtypes: int64(3), object(1)
memory usage: 2.0+ KB
data_dict['No_of_records'] = [df_sat.shape[0] for v in data_dict['Column_name']]
data_dict
 | Column_name | Data_type | Description | No_of_records |
---|---|---|---|---|
0 | State | object | US State of SAT scores | 51 |
1 | Rate | int64 | Rate of students that take the SAT in that state | 51 |
2 | Verbal | int64 | Mean verbal score in that state | 51 |
3 | Math | int64 | Mean math score in that state | 51 |
Part 3 : Data Analysis and Visualization
3.1 Check the distribution of numerical columns
Data was examined to see if it was normally distributed
fig, ax = plt.subplots(3,1, figsize=(14,12))
sns.distplot(df_sat['Rate'], bins=30, kde=False, ax=ax[0])
sns.distplot(df_sat['Verbal'], bins=30, kde=False, color='g', ax=ax[1])
sns.distplot(df_sat['Math'], bins=30, kde=False, color='r', ax=ax[2])
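As a rough complement to the visual check, a formal normality test from scipy.stats (already imported above) can be applied to each column; this is a sketch, and a small p-value suggests the column departs from a normal distribution.
# D'Agostino-Pearson normality test: the null hypothesis is that the data is normally distributed
for col in ['Rate', 'Verbal', 'Math']:
    stat, p = stats.normaltest(df_sat[col])
    print(f'{col}: statistic={stat:.3f}, p-value={p:.4f}')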
3.2 Get an overview of the correlation of each variable in the dataset
sns.pairplot(df_sat, hue='State', size=3, aspect=1)
Observations on the pairplot:
- Rate is the least normally distributed of the three variables, compared with Verbal and Math
- A positive correlation is observed between Math and Verbal
- Negative correlations are observed between Math & Rate and between Verbal & Rate
- SAT performance varies across states
3.3 : Comparison of Math and Verbal Scores
df_sat[['Math', 'Verbal']].plot.hist(bins=30, alpha=0.5, stacked=True, figsize=(14,5))
plt.title('Histogram of SAT Performance')
plt.xlabel('Score')
fig, ax = plt.subplots(2,3, figsize=(25,6))
sns.boxplot(data=df_sat['Rate'],orient='h', ax=ax[0][0])
sns.distplot(df_sat['Rate'], ax=ax[1][0],bins=30)
sns.boxplot(data=df_sat['Verbal'], orient='h', ax=ax[0][1], color='orange')
sns.distplot(df_sat['Verbal'], bins=30, ax=ax[1][1], color='orange')
sns.boxplot(data=df_sat['Math'], orient='h', ax=ax[0][2], color='purple')
sns.distplot(df_sat['Math'], bins=30, ax=ax[1][2], color='purple')
Observations:
- Although the SAT data is not very well distributed, no outliers are observed for any of the three key numerical variables (verified numerically in the sketch below)
- An outlier is defined as a data point lying beyond 1.5 times the interquartile range (IQR):
  - more than 1.5 × IQR below the 1st quartile
  - more than 1.5 × IQR above the 3rd quartile
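The 1.5 × IQR rule above can be checked numerically; a minimal sketch counting points outside the Tukey fences for each variable:
# compute the Tukey fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR) and count points outside them
for col in ['Rate', 'Verbal', 'Math']:
    q1, q3 = df_sat[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = ((df_sat[col] < lower) | (df_sat[col] > upper)).sum()
    print(f'{col}: fences = ({lower:.1f}, {upper:.1f}), outliers = {n_outliers}')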
fig = plt.figure(figsize=(14,5))
ax = fig.gca()
ax = sns.boxplot(data=df_sat[['Rate','Verbal','Math']], orient='h', palette='Set2')
ax = plt.xlabel('Score')
ax = plt.title('Boxplot of SAT Performance')
ax = plt.text(1,1,"Shouldn't plot Rate together as it has different score scale", fontsize=14)
Benefits of using boxplot vs scatterplot/histogram:
- good way to summarize large amounts of data
- provide information about the range and distribution of the data set, including details such as:
- minimum
- 1st quartile
- median
- 3rd quartile
- maximum
- outliers
- give some indication of the data’s symmetry and skewness
However, in the boxplot above, Rate is not on the same scale as Verbal and Math, so plotting the raw Rate values in the same chart as the others makes the comparison less meaningful. To make the comparison more relevant, the data needs to be standardized onto the same scale. This can be done with the standardization step below, which factors in the mean and standard deviation.
# standardize variables (z-scores): subtract the mean and divide by the standard deviation
score_cols = ['Rate', 'Verbal', 'Math']
df_sat_std = (df_sat[score_cols] - df_sat[score_cols].mean()) / df_sat[score_cols].std()
fig = plt.figure(figsize=(14,5))
ax = fig.gca()
ax = sns.boxplot(data=df_sat_std, orient='h', palette='Set2')
ax = plt.title('Standardized Boxplot for SAT Performance')
3.4 : Analysis on Subset of Data
3.4.a) States with Verbal scores greater than average Verbal scores across states
df_sat['Verbal'].mean()
532.5294117647059
df_sat_verbal_abv_average = df_sat[df_sat['Verbal']>df_sat['Verbal'].mean()].sort_values(by='Verbal', ascending=False)
df_sat_verbal_abv_average
 | State | Rate | Verbal | Math |
---|---|---|---|---|
47 | IA | 5 | 593 | 603 |
49 | ND | 4 | 592 | 599 |
44 | WI | 6 | 584 | 596 |
37 | MN | 9 | 580 | 589 |
48 | SD | 4 | 577 | 582 |
38 | KS | 9 | 577 | 580 |
42 | MO | 8 | 577 | 577 |
33 | IL | 12 | 576 | 589 |
46 | UT | 5 | 575 | 570 |
41 | OK | 8 | 567 | 561 |
50 | MS | 4 | 566 | 551 |
43 | LA | 7 | 564 | 562 |
31 | TN | 13 | 562 | 553 |
45 | AR | 6 | 562 | 550 |
40 | NE | 8 | 562 | 568 |
36 | MI | 11 | 561 | 572 |
39 | AL | 9 | 559 | 554 |
32 | NM | 13 | 551 | 542 |
34 | KY | 12 | 550 | 550 |
35 | WY | 11 | 547 | 545 |
30 | ID | 17 | 543 | 542 |
28 | MT | 23 | 539 | 539 |
26 | CO | 31 | 539 | 542 |
27 | OH | 26 | 534 | 439 |
round(len(df_sat_verbal_abv_average) / df_sat['State'].nunique()*100,2)
47.06
- 47.06% of US states have a Verbal score above the average.
- Fewer than half of the states score above the average.
3.4.b) States with Verbal scores greater than median Verbal scores across states
df_sat['Verbal'].median()
527.0
df_sat_verbal_abv_median = df_sat[df_sat['Verbal']>df_sat['Verbal'].median()].sort_values(by='Verbal', ascending=False)
df_sat_verbal_abv_median
 | State | Rate | Verbal | Math |
---|---|---|---|---|
47 | IA | 5 | 593 | 603 |
49 | ND | 4 | 592 | 599 |
44 | WI | 6 | 584 | 596 |
37 | MN | 9 | 580 | 589 |
48 | SD | 4 | 577 | 582 |
38 | KS | 9 | 577 | 580 |
42 | MO | 8 | 577 | 577 |
33 | IL | 12 | 576 | 589 |
46 | UT | 5 | 575 | 570 |
41 | OK | 8 | 567 | 561 |
50 | MS | 4 | 566 | 551 |
43 | LA | 7 | 564 | 562 |
31 | TN | 13 | 562 | 553 |
45 | AR | 6 | 562 | 550 |
40 | NE | 8 | 562 | 568 |
36 | MI | 11 | 561 | 572 |
39 | AL | 9 | 559 | 554 |
32 | NM | 13 | 551 | 542 |
34 | KY | 12 | 550 | 550 |
35 | WY | 11 | 547 | 545 |
30 | ID | 17 | 543 | 542 |
28 | MT | 23 | 539 | 539 |
26 | CO | 31 | 539 | 542 |
27 | OH | 26 | 534 | 439 |
round(len(df_sat_verbal_abv_median)/df_sat['Verbal'].nunique()*100,2)
61.54
- 61.54% of the Verbal scores are above the median by this calculation (the denominator here is the number of unique Verbal scores rather than the number of states)
- This percentage is slightly higher than in 3.4 a) because the median score is slightly lower than the average score
# visualize the mean, median and distribution of the Verbal score
plt.figure(figsize=(15,6))
sns.distplot(df_sat['Verbal'], bins=30)
plt.axvline(x=df_sat['Verbal'].mean(), linewidth=2.5, linestyle='dashed', color='r', label='Average score')
plt.axvline(x=df_sat['Verbal'].median(), linewidth=2.5, linestyle='dashed', color='g', label='Median score')
plt.legend()
plt.title('Distribution of Verbal Score')
3.4.c) Difference between the Verbal and Math scores
# set up new column to reflect the difference of Verbal vs Math score
df_sat['Difference_Verbal_Math'] = df_sat['Verbal'] - df_sat['Math']
df_sat.head()
 | State | Rate | Verbal | Math | Difference_Verbal_Math |
---|---|---|---|---|---|
0 | CT | 82 | 509 | 510 | -1 |
1 | NJ | 81 | 499 | 513 | -14 |
2 | MA | 79 | 511 | 515 | -4 |
3 | NY | 77 | 495 | 505 | -10 |
4 | NH | 72 | 520 | 516 | 4 |
# Top 3 states with greatest difference between Verbal and Math
df_sat_verbal_greaterThanMath = df_sat.sort_values(by='Difference_Verbal_Math', ascending=False).head(10)
df_sat_verbal_greaterThanMath.head(3)
 | State | Rate | Verbal | Math | Difference_Verbal_Math |
---|---|---|---|---|---|
27 | OH | 26 | 534 | 439 | 95 |
50 | MS | 4 | 566 | 551 | 15 |
29 | WV | 18 | 527 | 512 | 15 |
# Top 3 states with greatest difference between Math and Verbal
df_sat_math_greaterThanVerbal = df_sat.sort_values(by='Difference_Verbal_Math', ascending=True).head(10)
df_sat_math_greaterThanVerbal.head(3)
 | State | Rate | Verbal | Math | Difference_Verbal_Math |
---|---|---|---|---|---|
21 | HI | 52 | 485 | 515 | -30 |
23 | CA | 51 | 498 | 517 | -19 |
1 | NJ | 81 | 499 | 513 | -14 |
3.5. Examine Summary Statistics
Summary statistics provide a quick overview of the data, helping to check the spread of values, missing values and anomalies.
df_sat.drop(columns=['Difference_Verbal_Math']).describe(include='all')
 | State | Rate | Verbal | Math |
---|---|---|---|---|
count | 51 | 51.000000 | 51.000000 | 51.000000 |
unique | 51 | NaN | NaN | NaN |
top | UT | NaN | NaN | NaN |
freq | 1 | NaN | NaN | NaN |
mean | NaN | 37.000000 | 532.529412 | 531.843137 |
std | NaN | 27.550681 | 33.360667 | 36.287393 |
min | NaN | 4.000000 | 482.000000 | 439.000000 |
25% | NaN | 9.000000 | 501.000000 | 503.000000 |
50% | NaN | 33.000000 | 527.000000 | 525.000000 |
75% | NaN | 64.000000 | 562.000000 | 557.500000 |
max | NaN | 82.000000 | 593.000000 | 603.000000 |
count => number of records (population size)
mean => average of the values
std => standard deviation of the values
min => lowest value
25% => value at the 25th percentile (1st quartile)
50% => value at the 50th percentile (median)
75% => value at the 75th percentile (3rd quartile)
max => highest value
3.5 a) Correlation with Pearson Coefficient
# check correlation of numerical variables in dataset
df_sat_corr = df_sat.drop(columns=['State','Difference_Verbal_Math']).corr()
df_sat_corr
 | Rate | Verbal | Math |
---|---|---|---|
Rate | 1.000000 | -0.888121 | -0.773419 |
Verbal | -0.888121 | 1.000000 | 0.899909 |
Math | -0.773419 | 0.899909 | 1.000000 |
plt.figure(figsize=(10,6))
sns.heatmap(df_sat_corr)
Observations:
- Verbal and Math are positively correlated => states that do well in Verbal are also likely to do well in Math
- Rate is negatively correlated with Verbal and Math => states with a low SAT participation rate tend to achieve better Verbal / Math scores
3.5 b) Correlation with Spearman Coefficient
df_sat_spearman = df_sat.drop('Difference_Verbal_Math', axis=1).corr(method='spearman')
df_sat_spearman
 | Rate | Verbal | Math |
---|---|---|---|
Rate | 1.000000 | -0.836058 | -0.811662 |
Verbal | -0.836058 | 1.000000 | 0.909413 |
Math | -0.811662 | 0.909413 | 1.000000 |
Observations
- The Spearman correlation coefficients are broadly similar to the Pearson coefficients: slightly higher in magnitude for Verbal & Math and Rate & Math, and slightly lower for Rate & Verbal
Process calculating the spearman rank correlation
- Rank the values of each variable, with the largest value having rank 1.
- Calculate the difference in ranks
- Complete the rest of the calculation with the formula below: \(r = 1 - \frac {6 \sum d_i^2} {n^3 - n}\)
where \(d_i\) is the difference in paired ranks and \(n\) is the number of rows (a code sketch follows below)
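A minimal sketch of that calculation for Verbal vs Math (the closed-form formula strictly assumes no tied ranks, so the tie-aware result from pandas may differ slightly):
# rank each column (largest value = rank 1), take the rank differences, apply the formula
verbal_rank = df_sat['Verbal'].rank(ascending=False)
math_rank = df_sat['Math'].rank(ascending=False)
d = verbal_rank - math_rank
n = len(df_sat)
r_spearman = 1 - (6 * (d ** 2).sum()) / (n ** 3 - n)
print(round(r_spearman, 6))   # compare with df_sat_spearman.loc['Verbal', 'Math']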
Useful Notes :
- The Spearman Rank correlation measures the rank of one variable against another.
- A strong positive spearman rank correlation means that if one does well in math one will most likely also do well in verbal test.
- A strong negative Spearman rank correlation means that if one does well in one subject, one will most likely do poorly in the other.
- The Spearman Rank correlation is not as sensitive to outliers because it does not take into account how far away the outliers are from the average.
- It is only concerned about the order of those outliers versus everyone else.
- Outliers in this dataset were the states that did well in one subject but poorly in the other (the two biggest differences). But since the values are ranked, i.e. sorted according to their scores, doing very well in Math and only average in Verbal simply results in a large rank difference, which pushes the Spearman rank correlation towards the negative rather than producing a near-zero linear Pearson correlation.
3.5 c) Covariance
# check covariance of the dataset
df_sat_cov = df_sat.drop(columns=['Difference_Verbal_Math']).cov()
df_sat_cov
 | Rate | Verbal | Math |
---|---|---|---|
Rate | 759.04 | -816.280000 | -773.220000 |
Verbal | -816.28 | 1112.934118 | 1089.404706 |
Math | -773.22 | 1089.404706 | 1316.774902 |
Difference between the covariance matrix and the correlation matrix:
- The covariance matrix measures how related the variables are to each other, but it is expressed in terms of the variables' own variances and units.
- The correlation matrix is dimensionless: regardless of the variables' units, its values always lie between -1 and 1.
Process to convert the covariance into the correlation
- Conversion can be done using the following formula (applied in the sketch below): \(corr(X, Y) = \frac{cov(X, Y)}{std(X)std(Y)}\)
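Applying that formula to the matrices computed above is a quick check: dividing each covariance entry by the product of the two standard deviations should recover the Pearson correlation matrix (a sketch using the names defined earlier).
# scale the covariance matrix by the outer product of the standard deviations
std = df_sat[['Rate', 'Verbal', 'Math']].std()
corr_from_cov = df_sat_cov / np.outer(std, std)
print(corr_from_cov.round(6))   # should match df_sat_corr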
Reason correlation matrix is preferred over covariance matrix for examining relationships in data
- Covariance values are difficult to compare because they depend on the units and scales of the variables, whereas correlation is a scaled and standardized version of covariance whose values are guaranteed to lie between -1 and 1.
3.5 d) Percentile Scoring
Examine percentile scoring of data. In other words, the conversion of numeric data to their equivalent percentile scores.
df_sat_percentile = df_sat.copy()
df_sat_percentile['Rate_percentile'] = df_sat_percentile['Rate'].apply(lambda x: stats.percentileofscore(df_sat_percentile['Rate'],x))
df_sat_percentile.head()
 | State | Rate | Verbal | Math | Difference_Verbal_Math | Rate_percentile |
---|---|---|---|---|---|---|
0 | CT | 82 | 509 | 510 | -1 | 100.000000 |
1 | NJ | 81 | 499 | 513 | -14 | 98.039216 |
2 | MA | 79 | 511 | 515 | -4 | 96.078431 |
3 | NY | 77 | 495 | 505 | -10 | 94.117647 |
4 | NH | 72 | 520 | 516 | 4 | 92.156863 |
- We could also rank the values in each variable by percentile when calculating the Spearman correlation
- Percentile scoring and the Spearman rank correlation are related in the sense that both assign each value a position based on its rank relative to the others in sorted order, rather than on its distance from the mean
- Percentile scoring ranks each value against the sorted data to find which percentile the value falls in. Take New York, for example: a percentile score of 94.12 means that the state's participation rate of 77 is higher than 94.12% of the values in the dataset, so only about 3 states had a higher participation rate ((100% - 94.12%) x 51 ≈ 3). A quick numerical check follows below.
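As a sanity check of that interpretation (a sketch assuming New York's rate of 77 is unique in the data, which it is here), the percentile can be reproduced as the share of values at or below it:
# share of participation rates at or below New York's rate of 77
pct_ny = (df_sat['Rate'] <= 77).mean() * 100
print(round(pct_ny, 2))             # 94.12, matching percentileofscore
print((df_sat['Rate'] > 77).sum())  # number of states with a higher rate: 3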
3.5 e) Percentiles vs outliers
# compare the distribution of raw Rate values with their percentile scores
fig, ax = plt.subplots(1,2, sharey = True, figsize=(20,6))
sns.distplot(df_sat_percentile['Rate'], bins=30, ax=ax[0]) # Distribution as per normal values
ax[0].set_title('Distribution of Rate')
sns.distplot(df_sat_percentile['Rate_percentile'], bins=30, ax=ax[1], color='g')
ax[1].set_title('Distribution of Rate_in_Percentile')
plt.suptitle('Comparison of Distribution for Rate_Normal vs Rate_In_Percentile', size=14)
plt.show()
fig, ax = plt.subplots(2,1, figsize=(10,8))
sns.boxplot(data=df_sat_percentile[['Rate', 'Rate_percentile']], ax=ax[0])
sns.violinplot(data=df_sat_percentile[['Rate', 'Rate_percentile']], ax=ax[1])
Observations:
- Percentile scoring is not as sensitive to outliers because it does not take into account how far the outliers are from the average
- The percentile-scored Rate is spread out more evenly than the raw Rate values
Key Learnings
There are various ways and techniques that we can use to analyze data. Across data loading, data preparation, data analysis and data visualization, there are many options that achieve a similar purpose. Different techniques have their own strengths, so it is essential to evaluate which one offers the most clarity and provides the visualization that best turns a complex observation into a simple explanation.
Sometimes we can combine multiple approaches or different plots to complement features that one type of chart demonstrates but another does not. Such combinations help us explain, 're-imagine' and understand the dataset better, and help us discover patterns from different perspectives, making the analysis more comprehensive.