Monday, August 25, 2008

Learning Statistics Using R: Role Type Classification: Case III (1 of 5)

Let's start with our next example: understanding the relationship between two quantitative variables, i.e., both the explanatory and the response variables are quantitative.

In the previous two cases we compared the distribution of the response variable across the values of the explanatory variable. To be specific, in case I we compared the distribution of a quantitative response across the categories of a categorical explanatory variable, and in case II we compared the distribution of a categorical response across the categories of a categorical explanatory variable.
Now both variables are quantitative; more importantly, the explanatory variable is quantitative, so we can no longer split the response into a few groups and compare them. We will start exploring the relationship using a "scatterplot".

We start with an example taken from the course site.
A Pennsylvania research firm conducted a study in which 30 drivers (of ages 18 to 82 years old) were sampled and for each one the maximum distance at which he/she could read a newly designed sign was determined. The goal of this study was to explore the relationship between driver's age and the maximum distance at which signs were legible, and then use the study's findings to improve safety for older drivers. (Reference: Utts and Heckard, Mind on Statistics (2002). Original source: Data collected by Last Resource, Inc, Bellfonte, PA.)

Since the purpose of this study is to explore the effect of age on maximum legibility distance,
  • the explanatory variable is Age, and
  • the response variable is Distance.
Here is what the raw data look like; it's available here.
Note that the data structure is such that for each individual (in this case driver 1....driver 30) we have a pair of values (in this case representing the driver's age and distance). We can therefore think about this data as 30 pairs of values: (18,510), (32,410), (55,420)........(82,360).
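
To make the "pairs of values" idea concrete, here is a small sketch in R that stores only the four pairs quoted above as a data frame, one row per driver. The column names Age and Distance are my assumption (the headers in the actual file may differ), and the full data set has 30 rows, not 4.

```r
# Only the four pairs quoted in the text; the real data set has 30 drivers.
# Column names "Age" and "Distance" are assumed for illustration.
pairs4 <- data.frame(
  Age      = c(18, 32, 55, 82),
  Distance = c(510, 410, 420, 360)
)

# One row per driver, one column per variable
str(pairs4)

# The pair of values for the first driver
pairs4[1, ]
```

Each row is one individual, and reading across a row gives that driver's (age, distance) pair.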

The first step in exploring the relationship between driver age and sign legibility distance is to create an appropriate and informative graphical display. The appropriate graphical display for examining the relationship between two quantitative variables is the scatterplot.

Here is how a scatterplot is constructed for our example:

To create a scatterplot, each pair of values is plotted, so that the value of the explanatory variable (X) is plotted on the horizontal axis, and the value of the response variable (Y) is plotted on the vertical axis. In other words, each individual (driver, in our example) appears on the scatterplot as a single point whose x-coordinate is the value of the explanatory variable for that individual, and whose y-coordinate is the value of the response. The following images, taken from the course website, illustrate this.
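
The construction described above can be sketched in R by drawing an empty plot and then adding the drivers one point at a time. This is only an illustration using the four pairs quoted earlier, not the full 30-driver data set; the variable names age and distance are my own.

```r
# Four (age, distance) pairs quoted in the text; illustrative subset only.
age      <- c(18, 32, 55, 82)      # explanatory variable (X)
distance <- c(510, 410, 420, 360)  # response variable (Y)

# Set up empty axes that cover the range of both variables
plot(range(age), range(distance), type = "n",
     xlab = "Driver Age", ylab = "Sign Legibility Distance (Feet)")

# Add one point per driver: x = age, y = distance
for (i in seq_along(age)) {
  points(age[i], distance[i], pch = 19, col = "blue")
}
```

In practice a single call such as plot(age, distance) does the same thing; the loop is there only to show that each driver contributes exactly one point.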

Now that we have the data set, let's do the same using R.

# Read data using read.csv function, here separator is tab.
# Bug-Fix: Gabriele Righetti 
> signdist <- read.csv("signdistance.txt", sep="\t", header=TRUE)
> plot(signdist, type="b", main="Sign and Distance", col="blue", xlab="Driver Age",
+      ylab="Sign Legibility Distance (Feet)")
The plot command above draws our scatterplot. Note that our scatterplot looks different from the one on the course site: we have lines along with the points. That comes from the type="b" ("both") argument to plot; the lines join the points in the order the rows appear in the data. See the help page of the plot command (?plot) for all available types.
Here is another version, using the type="p" argument, which draws the points only.
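
To compare the two plot types side by side without needing the data file, here is a small sketch using the four pairs quoted earlier, deliberately stored out of age order. The data frame d and its column names are my own assumptions for illustration. It also shows that sorting by age first makes the type="b" lines run left to right, which may or may not be needed for the real file depending on how its rows are ordered.

```r
# Four pairs from the text, deliberately NOT sorted by age (illustrative only)
d <- data.frame(Age      = c(55, 18, 82, 32),
                Distance = c(420, 510, 360, 410))

op <- par(mfrow = c(1, 3))  # three panels side by side

# Points only: the usual scatterplot
plot(d, type = "p", main = "type = p")

# Points joined by lines, in the order the rows are stored
plot(d, type = "b", main = "type = b, row order")

# Sort by age first so the lines run left to right
d_sorted <- d[order(d$Age), ]
plot(d_sorted, type = "b", main = "type = b, sorted")

par(op)  # restore the previous graphics settings
```

For judging the relationship between two quantitative variables, the points-only version is usually the one to read, since the connecting lines suggest an ordering between drivers that the data do not have.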
