tag:blogger.com,1999:blog-5420536656665117812018-03-08T04:39:21.635+05:30~Krishna DagliKrishna Daglinoreply@blogger.comBlogger27125tag:blogger.com,1999:blog-542053665666511781.post-36560835728544309642010-03-15T23:18:00.003+05:302010-03-15T23:26:55.914+05:30Measure of Center: Mean Vs. Median<span style="font-family: verdana;font-size:100%; color:black;">
So now we know that there are two common numerical measures of the
center of a distribution: the <strong>mean</strong> and the
<strong>median</strong>.
<br/>
The mean describes the center as an average value, where the actual
values of the data points play an important role since we are summing
values and then dividing by count (number of values). The median, on
the other hand, locates the middle value as the center, and the order
of the data is the key to finding it (here we are sorting and only
counting).
<br/><br/>
Let's understand the difference between the two with an example.<br/>
Assume we have the following two data sets:
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
Data set A -> 64 65 66 68 70 71 73
Data set B -> 64 65 66 68 70 71 730
</pre></span>
Note that only the last value differs between the two sets: 73 in
data set A becomes 730 in data set B.<br/><br/>
For data set A, the mean is 68.1 and the median is 68. In data set B,
the observation 730 is far larger than the rest and is clearly an
outlier. The median of data set B is still 68, but the mean is pulled
up by the high outlier to 162. The message that we should take from
this example is:
<strong>The mean is very sensitive to outliers (as it factors in their
magnitude), while the median is resistant to outliers. </strong>
<br/><br/>
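The same comparison can be reproduced in R. This is only a minimal
sketch; the variable names set_a and set_b are my own and are not taken
from the course.
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# The two data sets from the example above.
set_a <- c(64, 65, 66, 68, 70, 71, 73)
set_b <- c(64, 65, 66, 68, 70, 71, 730)

# The mean uses the magnitude of every value, so the outlier pulls it up.
mean(set_a)    # about 68.1
mean(set_b)    # 162

# The median depends only on the order of the values, so it does not move.
median(set_a)  # 68
median(set_b)  # 68
</pre></span>
<br/>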
<Strong>Types of distributions and mean and median:</strong><br/>
Let's see what happens to the mean and median for our three basic
types of distribution.
<UL><LI><Strong>Symmetric distributions with no outliers:</strong>
Mean (X̄) is approximately equal to median (M).
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_4NC7AWExX5g/S550UQvaFGI/AAAAAAAAAtM/qKxg6H4aa8U/s1600-h/mean-skewed-left.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 303px;" src="http://4.bp.blogspot.com/_4NC7AWExX5g/S550UQvaFGI/AAAAAAAAAtM/qKxg6H4aa8U/s400/mean-skewed-left.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5448920490604893282" /></a>
<LI><strong>Skewed right distributions and/or datasets with high outliers:</strong>
Mean (X̄) is typically greater than median (M). (X̄ > M)
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_4NC7AWExX5g/S550T0F-GVI/AAAAAAAAAtE/o7ktiMijnmc/s1600-h/mean-skewed-right.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 298px;" src="http://2.bp.blogspot.com/_4NC7AWExX5g/S550T0F-GVI/AAAAAAAAAtE/o7ktiMijnmc/s400/mean-skewed-right.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5448920482914900306" /></a>
<LI><strong>Skewed left distributions and/or datasets with low outliers:</strong>
Mean (X̄) is typically less than median (M). (X̄ < M)
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_4NC7AWExX5g/S550TpNTfsI/AAAAAAAAAs8/ETZ8F6SMNoM/s1600-h/mean-sym.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 271px;" src="http://2.bp.blogspot.com/_4NC7AWExX5g/S550TpNTfsI/AAAAAAAAAs8/ETZ8F6SMNoM/s400/mean-sym.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5448920479992872642" /></a>
</UL>
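The three cases can be tried out in R with a few small vectors. This is
only a sketch: the numbers below are made up by me purely to give each
vector the right shape, and they do not come from the course.
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# Made-up illustration values, chosen only for their shape.
symmetric    <- c(2, 4, 5, 6, 8)
skewed_right <- c(1, 2, 2, 3, 3, 4, 5, 9, 16)        # long tail of high values
skewed_left  <- c(1, 8, 12, 13, 14, 14, 15, 15, 16)  # long tail of low values

# Symmetric: mean and median agree.
c(mean = mean(symmetric), median = median(symmetric))        # 5 and 5
# Skewed right: the high values pull the mean above the median.
c(mean = mean(skewed_right), median = median(skewed_right))  # 5 and 3
# Skewed left: the low values pull the mean below the median.
c(mean = mean(skewed_left), median = median(skewed_left))    # 12 and 14
</pre></span>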
<br/><br/>
<strong>Let's Summarize</strong>
<UL>
<LI> The two main numerical measures for the center of a
distribution are the mean (X̄) and the median (M). The mean is the
average of the values, while the median is the middle value.
<LI> The mean is very sensitive to outliers (as it factors in
their magnitude), while the median is resistant to outliers.
<LI> The mean is an appropriate measure of center only for
symmetric distributions with no outliers. In all other cases,
the median should be used to describe the center of the
distribution.
</UL>
</span>Krishna Daglinoreply@blogger.com3tag:blogger.com,1999:blog-542053665666511781.post-91836638056336404752010-03-03T23:37:00.002+05:302010-03-10T23:06:02.712+05:30Why numerical measures? Mean and Median<span style="font-family: verdana;font-size:100%; color:black;">
As we saw in the previous example, a graphical representation of a
quantitative variable is not enough on its own. From a graph we
can only get a rough estimate of the center and spread. A description
of the distribution of a quantitative variable must include, in
addition to the graphical display, a more precise numerical
description of the center and spread of the distribution.
So we will learn:
<UL>
<LI> how to quantify the center and spread of a distribution with
various numerical measures.
<LI> a few important properties of numerical measures.
<LI> how to choose the appropriate numerical measures of center and
spread to supplement the histogram/graphical representation.
</UL>
<br/>
<strong>Measure of Center (1 of 2):</strong><br/>
The two most important measures of the center of a distribution are the
<strong>mean</strong> and the <strong>median</strong>. They take
completely different approaches to describing the center of a
distribution.
<UL>
<LI> <strong>Mean/Arithmetic mean:</strong> is the sum of the
observations (values) divided by the number (count) of observations. If
X1, X2, X3, ..., Xn are the n observations, then their mean
X̄ (x bar) is: <br/><br/>
X̄ = (X1 + X2 + X3 + ... + Xn)/ n
<br/><br/>
Let's take an example using random values. We use R's
<strong>runif</strong> function to generate 10 random values and then
compute their mean:
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# runif(10) generates 10 random observations/values.
> observation <- runif(10)
> print(observation)
[1] 0.7080567 0.6582278 0.2415265 0.4169798 0.4172357 0.2258143 0.3805531
[8] 0.4568466 0.5952122 0.4650702
> mean(observation)
[1] 0.4565523
> sum(observation)/10.00
[1] 0.4565523
</pre></span><br/>
Returning to our example of Best Actress Oscar winners from 1970 to
2001, these are the ages at which the actresses won their Oscars:
<br/>
34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33
<br/><br/>
Their sum: 34+34+26+37+42......+35+33=1233 <br/>
and their count (total number of observations): 32. Hence the mean age of this dataset is :
<br/>
X̄ = 1233/32 = 38.5
<br/><br/>
<LI> <strong>Median:</strong> M is the center/midpoint of the
distribution. M is a number such that half of the observations
fall above it and half fall below. To find the median:
<UL><LI> Order the data from smallest to largest. (sort).
<LI> Consider whether n, the number of observations, is even or
odd.
<UL><LI> If n is odd, the median M is the center observation in
the ordered list. This observation is the one "sitting" in the
(n+1)/2 spot in the ordered list.
<LI> If n is even, the median M is the mean of the two
center observations in the ordered list. These two
observations are the ones "sitting" in the n/2 and n/2 + 1
spots in the ordered list.
</UL>
</UL>
This is better explained by a visualization provided on the course website:
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_4NC7AWExX5g/S46l0DiXh5I/AAAAAAAAAsQ/bWJCE7FwgQQ/s1600-h/median.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 392px; height: 280px;" src="http://4.bp.blogspot.com/_4NC7AWExX5g/S46l0DiXh5I/AAAAAAAAAsQ/bWJCE7FwgQQ/s400/median.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5444471313258743698" /></a>
Now let's return to our example of Best Actress Oscar winners from 1970 to 2001. These are the ages:
<br/>
34 34 26 37 42 41 35 31 41 33 30 74 33 49 38 61 21 41 26 80 43 29 33 35 45 49 39 34 26 25 35 33
<br/><br/>
We paste these values (ages) into a simple R vector:
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
> age = c(34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61, 21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33)
# First check the mean, so we know that we have copied all the values.
> mean(age)
[1] 38.53125
# Good. In R we can order/sort this dataset with one command.
> sort(age)
 [1] 21 25 26 26 26 29 30 31 33 33 33 33 34 34 34 35 35 35 37 38 39 41 41 41 42
[26] 43 45 49 49 61 74 80
> length(age)
[1] 32
# n is even (32), so the median is the mean of the 16th and 17th sorted observations.
> (sort(age)[16] + sort(age)[17])/2
[1] 35
# The 16th and 17th sorted observations are both 35. Let's cross-check using R's median function.
> median(age)
[1] 35
> median(sort(age))
[1] 35
# So R's median function correctly returns 35 even when the dataset is not sorted. But if we do
> (age[16]+age[17])/2
[1] 41
# then the answer is incorrect, since the age vector is not sorted.
</pre></span>
</ul>
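<br/>
The steps for finding the median can also be written out as a small R
function. This is only a sketch to mirror the procedure above;
manual_median is a name I made up, and in practice R's own median
function should be used.
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# manual_median is a hypothetical helper, written only to mirror the steps above.
manual_median <- function(x) {
  ordered <- sort(x)       # step 1: order the data from smallest to largest
  n <- length(ordered)     # step 2: is n odd or even?
  if (n %% 2 == 1) {
    # n is odd: the center observation, in spot (n+1)/2
    ordered[(n + 1) / 2]
  } else {
    # n is even: the mean of the observations in spots n/2 and n/2 + 1
    (ordered[n / 2] + ordered[n / 2 + 1]) / 2
  }
}

age <- c(34, 34, 26, 37, 42, 41, 35, 31, 41, 33, 30, 74, 33, 49, 38, 61,
         21, 41, 26, 80, 43, 29, 33, 35, 45, 49, 39, 34, 26, 25, 35, 33)
manual_median(age)         # 35, the same as median(age)
manual_median(c(3, 1, 2))  # 2, an odd-length example
</pre></span>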
</span>Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-77063008893286550062009-01-08T00:02:00.003+05:302010-03-10T23:02:27.085+05:30Why numerical measures?So far we used scatterplots to study the relationships between two
quantitative variables, we studied the overall pattern of the
relationship by looking at its direction (positive, negative, or
neither), form (linear, curvilinear etc), and strength (eg. stronger,
weaker).
<br/><br/>
We also noted that assessing the strength of a relationship just by
looking at the scatterplot is quite hard, and therefore we need to
supplement the scatterplot with some kind of numerical measure that
will help us assess the strength. This numerical measure is what we
are going to learn next.
<br/><br/>
Let's have a look at the following two images (from the course website) to
see why numerical measures are required along with the scatterplot to
assess the strength of a linear relationship.
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_4NC7AWExX5g/SWT10_nuJJI/AAAAAAAAAeI/rxwg5--AnHU/s1600-h/why-numerical-measure.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 307px; height: 400px;" src="http://4.bp.blogspot.com/_4NC7AWExX5g/SWT10_nuJJI/AAAAAAAAAeI/rxwg5--AnHU/s400/why-numerical-measure.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5288622153219843218" /></a>
<span style="font-family: verdana;font-size:100%; color:black;">
<br/>
We can see that in both cases, the direction of the relationship is
positive and the form of the relationship is linear. What about the
strength? Recall that the strength of a relationship is the extent to
which the data follow its form.
<br/><br/> At first glance the relationship in the first graph looks
stronger, but as the course website clarifies, both graphs show the same
data, just drawn using two different scales!
<br/><br/> The purpose of this example was to illustrate how assessing
the strength of the linear relationship from a scatterplot alone is
problematic, since our judgment might be affected by the range of
values that are plotted. This example, therefore, provides a
motivation for the need to supplement the scatterplot with a numerical
summary that will measure the strength of the linear relationship
between two quantitative variables.
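<br/><br/>
The effect of the plotting scale is easy to reproduce in R. The sketch
below is my own illustration, not the course's figure: it uses the
'cars' dataset that ships with R as a stand-in, and the axis ranges in
the second plot are arbitrary values I picked to squeeze the points
together.
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# 'cars' ships with R; it stands in for the course data here.
# Draw the same data twice, side by side, changing only the axis ranges.
old_par <- par(mfrow = c(1, 2))
plot(cars$speed, cars$dist, main = "Default scale",
     xlab = "Speed", ylab = "Distance")
plot(cars$speed, cars$dist, main = "Wider scale",
     xlab = "Speed", ylab = "Distance",
     xlim = c(-50, 80), ylim = c(-400, 500))
# Restore the previous plot layout.
par(old_par)
</pre></span>
The points are identical in both panels, yet the second panel packs them
into a small area, which changes how tight the pattern looks.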
</span>Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-37886586988738551872008-10-13T18:03:00.007+05:302009-01-07T22:37:42.907+05:30Learning Statistics Using R: Role Type Classification: Case III (5 of 5)<span style="font-family: verdana;font-size:100%; color:black;">
Now we learn to create a labeled scatterplot using R. In a labeled
scatterplot we indicate different subgroups or categories within the
data on the plot by labeling each subgroup differently.
<br/><br/> Until now we have used the "plot" or "scatter.smooth"
function to create our scatterplots, but I have not found a nice and
easy way of creating a labeled scatterplot with these functions. So
instead we will use another function from the <b>"car"
package/library. </b>
<br/><br/>
<a href="http://krishnadagli.blogspot.com/2008/07/learning-statistics-using-r-role-type.html">
Recall the hot dog example from case I</a>, in which 54 major hot dog
brands were examined. In this study both the calorie content and the
sodium level of each brand were recorded, as well as the type of hot
dog: beef, poultry, and meat (mostly pork and beef, but up to 15%
poultry meat). In this example we will explore the relationship
between the sodium level and calorie content of hot dogs, and use the
three different types of hot dogs to create a labeled scatterplot.
<br/><b>Let's do this with R using the scatter.smooth function</b><br/>
<span style="font-family: trebuchet ms;color:#003300; font-size:100%; weight:bold;">
<pre>
# hotdog.txt is the same as TA01_009.TXT
> hotdogdata <- read.table("hotdog.txt", sep="\t", header=T)
> attach(hotdogdata)
> names(hotdogdata)
[1] "HotDog" "Calories" "Sodium"
# Let's first try the scatter.smooth function.
# Note: 'quality' is an argument of jpeg(), not png(). R version 2.6.2 (2008-02-08)
# accepted it in png() without a warning, but newer versions do warn
# (pointed out by Gabriele Righetti), so it is left out here.
> png("/tmp/hotdogtry1.png", width=480)
> scatter.smooth(Sodium, Calories)
# We try to add labels to this plot using the text command/function.
# Bug-Fix: Gabriele Righetti
> text(Sodium,Calories, HotDog)
> dev.off()
</pre>
</span>
So here is our image, but this is not what we wanted. We have
tried to add text labels to the plot using the "text" function, but the
result is too cluttered to read. So now let's try creating a better labeled
scatterplot using the "car" library.
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_4NC7AWExX5g/SPNAw1qVVxI/AAAAAAAAAZM/5Tz4AZNTf2g/s1600-h/hotdogtry1.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://2.bp.blogspot.com/_4NC7AWExX5g/SPNAw1qVVxI/AAAAAAAAAZM/5Tz4AZNTf2g/s400/hotdogtry1.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5256616397854037778" /></a>
<br/><b>Let's do this with R using the "car" library</b><br/>
<span style="font-family: trebuchet ms;color:#003300; font-size:100%; weight:bold;">
<pre>
# The data is already attached, so let's first load the
# library.
> library("car")
> png("/tmp/hotdogtry2.png", width=480)
> scatterplot(Calories ~ Sodium | HotDog)
> dev.off()
</pre></span>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_4NC7AWExX5g/SPNBA7thXSI/AAAAAAAAAZU/6FnmqMZ5pFE/s1600-h/hotdogtry2.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://2.bp.blogspot.com/_4NC7AWExX5g/SPNBA7thXSI/AAAAAAAAAZU/6FnmqMZ5pFE/s400/hotdogtry2.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5256616674355928354" /></a>
So this is our second graph, and it looks much better than the
first one. There is also an entire movie explaining the concept of the labeled
scatterplot at the <a
href="https://oli.web.cmu.edu/jcourse/workbook/activity/page?context=859d4b1f80020c6901c7ceae194171df">
course website.</a>
<br/><br/>
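For readers who do not have the "car" package installed, one possible
alternative is to color the points by subgroup with base graphics. This
is only a sketch and not the approach used in this post: it uses the
'iris' dataset that ships with R as a stand-in for the hot dog data,
with Species playing the role of the HotDog type column.
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# 'iris' ships with R and stands in for the hot dog data here.
groups <- iris$Species
plot(iris$Sepal.Length, iris$Petal.Length,
     col = as.integer(groups),   # one color per subgroup
     pch = as.integer(groups),   # one plotting symbol per subgroup
     xlab = "Sepal length", ylab = "Petal length")
# The legend maps each color and symbol back to its subgroup.
legend("topleft", legend = levels(groups),
       col = seq_along(levels(groups)),
       pch = seq_along(levels(groups)))
</pre></span>
<br/>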
<b>Let's Summarize:</b>
<UL>
<LI>The relationship between two quantitative variables is visually
displayed using the scatterplot, where each point represents an
individual. We always plot the explanatory variable on the horizontal,
X-axis, and the response variable on the vertical, Y-axis.
<LI>When we explore a relationship using the scatterplot we should
describe the overall pattern of the relationship and any deviations
from that pattern. To describe the overall pattern consider the
direction, form and strength of the relationship. Assessing the
strength could be problematic.
<LI>Adding labels to the scatterplot, indicating different groups or
categories within the data, might help us get more insight about the
relationship we are exploring.
</UL>
</span>Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-36017236946811136042008-10-02T14:55:00.007+05:302008-10-02T17:02:15.715+05:30Learning Statistics Using R: Role Type Classification: Case III (4 of 5)<span style="font-family: verdana;font-size:100%; color:black;">
As a reminder, we are learning to examine the relationship between a
response and an explanatory variable. In our case III both variables are
quantitative (numerical).
<br/><br/>
We looked at the method for examining a case III relationship between two
quantitative variables in the <a href="">previous post.</a> Let's
understand it better with a few more examples.
<br/><br/>
<b>Example:</b> The average gestation period, or time of pregnancy, of
an animal is closely related to its longevity (the length of its
lifespan.) Data on the average gestation period and longevity (in
captivity) of 40 different species of animals have been examined, with
the purpose of examining how the gestation period of an animal is
related to (or can be predicted from) its longevity. (Reference:
Rossman and Chance, Workshop Statistics. Discovery with Data and
Minitab (2001). Original source: The 1993 World Almanac and Book of
Facts). <br/><br/> The actual dataset for this example is available <a
href="ftp://www.rossmanchance.com/pub/ws2/mtw/animals.dat">here </a>
and also as <a
href="http://krishnadagli.blogspot.com/2008/07/learning-statistics-using-r-data-sets.html">single
zip file</a> that includes all the datasets on this site.
<br/><b>Let's examine this dataset using R:</b><br/>
<span style="font-family: trebuchet ms;color:#003300; font-size:100%; weight:bold;">
<pre>
# Let's read the dataset animals.dat into R.
# The dataset is separated by tabs (\t). A header row
# is present but not very intuitive, so we will
# change it below.
> animals <- read.table("animals.dat", sep="\t", header=TRUE)
# print the names of the columns.
> names(animals)
[1] "animal" "gestati3" "longevi4"
# we want to change "gestati3" and "longevi4"
> names(animals) <- c("animal", "gestation", "longevity")
> names(animals)
[1] "animal" "gestation" "longevity"
# so assigning values to names() changes the names.
# now let's draw the scatterplot and examine the dataset.
# we want to create the scatterplot in a png file for upload.
> png("/tmp/animal.png", width=480)
# x is longevity (explanatory) and y is gestation (response), matching the axis labels.
> scatter.smooth(animals$longevity, animals$gestation, main="Lifespan and Pregnancy",
 xlab="Longevity (Years)", ylab="Gestation (Days)")
# the following writes the graph to the file.
> dev.off()
</pre></span>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_4NC7AWExX5g/SOShP6nEqDI/AAAAAAAAAYE/b7mQJnmxZAk/s1600-h/animals.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://2.bp.blogspot.com/_4NC7AWExX5g/SOShP6nEqDI/AAAAAAAAAYE/b7mQJnmxZAk/s400/animals.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5252500360224090162" /></a>
<UL>
<LI>Direction: The direction of the relationship is essentially
positive; that is, animals with longer lifespans tend to have longer
times of pregnancy.
<LI>Form: Again, the form of the relationship (between the response and
explanatory variables) is linear.
<LI>Outliers: There seems to be one outlier at around 40 years. Let's
use R to find out which observation this is.
<span style="font-family: trebuchet ms;color:#003300; font-size:102%; weight:bold;">
<pre>
# we search for more than 35 years, just to be careful.
> which(animals$longevity > 35)
[1] 15
# 'which' gives us the number of the observation that has
# longevity > 35. Let's display that observation.
# Using which() as a row index into the dataset gives the following:
> animals [which(animals$longevity > 35), ]
animal gestation longevity
15 elephant 645 40
</pre></span>
<br/> So our outlier is the observation for the elephant. Note that while
this outlier definitely deviates from the rest of the data in terms of
its magnitude, it does follow the direction of the data.
</ul>
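The which() idiom used above works on any data frame. Here is a small
sketch of the same idea on the 'mtcars' dataset that ships with R; it is
a stand-in of my choosing, and the threshold of 300 horsepower is
arbitrary. It also shows that a logical condition can be used as the
row index directly.
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# 'mtcars' ships with R and stands in for the animals data here.
# which() returns the positions where the condition is TRUE.
idx <- which(mtcars$hp > 300)
idx
# Using those positions as a row index shows the full observation.
mtcars[idx, c("mpg", "hp")]
# A logical vector can be used as the row index directly, with the same result.
mtcars[mtcars$hp > 300, c("mpg", "hp")]
</pre></span>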
<b>Comments from course site:</b> Another feature of the
scatterplot that is worthwhile observing is how the variation in
Gestation increases as Longevity increases. Note that the gestation period
for animals that live 5 years ranges from about 30 days up to about 120
days. On the other hand, the gestation period of animals that live 12
years varies much more, ranging from about 60 days up to more than
400 days.
<br/><br/><b>Example:</b> As another example, consider the relationship between
the average fuel usage (in liters) for driving a fixed distance in a
car (100 kilometers), and the speed at which the car drives (in
kilometers per hour). (Reference: Moore and McCabe, Introduction to
the Practice of Statistics, 2003. Original source: T.N. Lam
"Estimating fuel consumption for engine size", Journal of
transportation Engineering,111 (1985))
<br/><br/> The actual dataset for this example is available <a
href="http://bcs.whfreeman.com/ips5e/content/cat_060/PCDataSets/PC_Text.zip">here
(See chapter 2)</a> and also as <a
href="http://krishnadagli.blogspot.com/2008/07/learning-statistics-using-r-data-sets.html">single
zip file</a> that includes all the datasets on this site.
<br/><b>Let's examine this dataset using R:</b><br/>
<span style="font-family: trebuchet ms;color:#003300;"><pre>
> sf <- read.table("speedfuel.txt", sep="\t", header=TRUE)
> png("/tmp/speedfuel.png", width=480)
# check the column names (Speed and Fuel are used below).
> scatter.smooth(sf$Speed, sf$Fuel, main="", xlab="Speed (km/h)",
ylab="Fuel Used (liters/100km)")
> dev.off()
</pre></span>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_4NC7AWExX5g/SOSwfF6ugVI/AAAAAAAAAY8/Bx9QZVuXi2w/s1600-h/speedfuel.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://1.bp.blogspot.com/_4NC7AWExX5g/SOSwfF6ugVI/AAAAAAAAAY8/Bx9QZVuXi2w/s400/speedfuel.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5252517113631768914" /></a>
The data describe a relationship that decreases and then increases -
the amount of fuel consumed decreases rapidly to a minimum for a car
driving 60 kilometers per hour, and then increases gradually for
speeds exceeding 60 kilometers per hour. This suggests that the speed
at which a car economizes on fuel the most is about 60 km/h. This
forms a curvilinear relationship which seems to be very strong, as the
observations seem to perfectly fit the curve. Finally, there do not
appear to be any outliers.
<br/><br/>
<b>Example:</b> Another example and scatterplot, taken from the course
website, provide a great opportunity to interpret the form of
the relationship in context. The example examines how the percentage
of participants who completed a survey is affected by the monetary
incentive that researchers promised to participants. Here is the
scatterplot that displays the relationship:
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_4NC7AWExX5g/SOSxAFsq0RI/AAAAAAAAAZE/X2epCc-zrPw/s1600-h/scatterplot26.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://1.bp.blogspot.com/_4NC7AWExX5g/SOSxAFsq0RI/AAAAAAAAAZE/X2epCc-zrPw/s400/scatterplot26.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5252517680508490002" /></a>
The positive relationship definitely makes sense in context, but what
is the interpretation of the curvilinear form in the context of the
problem? How can we explain (in context) the fact that the
relationship seems at first to be increasing very rapidly, but then
slows down?
<br/><br/>
Note that when the monetary incentive increases from $0 to $10, the
percentage of returned surveys increases sharply: an increase of 27
percentage points (from 16% to 43%). However, the same increase of $10
from $30 to $40 doesn't produce the same dramatic increase in the
percentage of returned surveys: an increase of only 3 percentage points
(from 54% to 57%). The form displays the phenomenon of "diminishing
returns": a return rate that after a certain point fails to increase
proportionately to additional outlays of investment. An extra $10 is
worth more to people who would otherwise get $0 than to people who
would already get $30.
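<br/><br/>
The arithmetic behind this comparison can be checked in R. The sketch
below uses only the four figures quoted above; the vector names are my
own.
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# Figures quoted in the text above.
incentive <- c(0, 10, 30, 40)    # dollars promised
returned  <- c(16, 43, 54, 57)   # percent of surveys returned

# Gain in return rate for the first $10 and for the last $10.
returned[2] - returned[1]   # 27 percentage points (from $0 to $10)
returned[4] - returned[3]   # 3 percentage points (from $30 to $40)
</pre></span>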
</span>Krishna Daglinoreply@blogger.com1tag:blogger.com,1999:blog-542053665666511781.post-65881233413672250652008-09-09T00:44:00.003+05:302008-09-09T00:53:53.531+05:30Learning Statistics Using R: Role Type Classification: Case III (3 of 5)<span style="font-family: verdana;font-size:100%; color:black;">
In the next few posts we try to understand the direction, form, strength and
outliers of our Sign Legibility Distance vs. Age scatterplot.
<br/><br/> We start with a simple scatterplot and then try to find its
direction using R. This is how the scatterplot is drawn
using R.
<span style="font-family: trebuchet ms;color:#003300; font-size:100%; weight:bold;"> <pre>
# read the signdistance datafile that is included in here
# http://krishnadagli.blogspot.com/2008/07/learning-statistics-using-r-data-sets.html
> sigdist <- read.table("signdistance.txt", header=T, sep='\t')
# we want to create scatterplot in a png file, that is easy to upload.
> png("/tmp/signdist.png", width=480)
> plot(sigdist, main="Sign Distance", xlab="Driver Age(years)", ylab="Sign Legibility Distance(feet)")
> dev.off()
</pre></span>
<br/><br/> So this is how our scatterplot looks. We first try to find the
direction of this plot.
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_4NC7AWExX5g/SMV7Oj3nNvI/AAAAAAAAAXY/eHU10gNHD2k/s1600-h/signdist.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://1.bp.blogspot.com/_4NC7AWExX5g/SMV7Oj3nNvI/AAAAAAAAAXY/eHU10gNHD2k/s400/signdist.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5243732831219693298" /></a>
<br/>
<b>Find direction: </b> (FIXME: Is this correct, what alternatives?):
To find the direction of the scatterplot we use one of R's functions,
'scatter.smooth'. Its help page says that it plots and adds a smooth
curve computed by 'loess' to a scatter plot, and the Wikipedia entry <a
href="http://en.wikipedia.org/wiki/Local_regression">here </a>
mentions loess, or locally weighted scatterplot smoothing, as one of
many "modern" modeling methods. Here is the R code to add a curve to
our plot.
<span style="font-family: trebuchet ms;color:#003300; font-size:100%; weight:bold;"> <pre>
# we have already read the data into the sigdist object, so let's use it.
# As we want to create and save the plot, we use the png function.
> png("/tmp/sigdistsmooth.png", width=480)
> scatter.smooth(sigdist, main="Sign Distance Smooth",
 xlab="Driver Age(years)", ylab="Sign Legibility Distance(feet)")
# dev.off() closes the device and writes the file.
> dev.off()
</pre></span>
<br/><br/>
Here is how our scatterplot looks along with the curve.
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_4NC7AWExX5g/SMV7dQe8u-I/AAAAAAAAAXg/AQdQ0FYuYtU/s1600-h/sigdistsmooth.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://1.bp.blogspot.com/_4NC7AWExX5g/SMV7dQe8u-I/AAAAAAAAAXg/AQdQ0FYuYtU/s400/sigdistsmooth.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5243733083714010082" /></a>
<br/>
We can see that the direction of the relationship is negative, which
makes sense in context, since as you get older your eyesight weakens,
and in particular older drivers tend to be able to read signs only at
shorter distances. (How do we get an arrow drawn over the scatterplot
just like it's shown in the course?)
<br/><br/>
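As one possible way to read the direction off the smooth curve rather
than by eye, the curve can be computed with 'loess.smooth' (the function
that scatter.smooth uses to draw its curve) and its two ends compared.
This is only a rough sketch of my own, not a method from the course, and
it uses the 'cars' dataset that ships with R as a stand-in for the sign
distance data. For 'cars' the curve rises, so the direction is positive.
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# 'cars' ships with R and stands in for the sign distance data here.
# loess.smooth() computes the points of the curve that scatter.smooth() draws.
fit <- loess.smooth(cars$speed, cars$dist)
plot(cars$speed, cars$dist, xlab = "Speed", ylab = "Distance")
lines(fit)

# A rough direction check: does the smooth curve end higher or lower than it starts?
change <- fit$y[length(fit$y)] - fit$y[1]
if (change > 0) "positive" else "negative"
</pre></span>
This only compares the two ends of the curve, so it says nothing useful
about relationships that go down and then up.
<br/><br/>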
<b>Find form: </b>The form of the relationship seems to be linear. Notice
how the points tend to be scattered about the line. Although, as we
mentioned earlier, it is problematic to assess the strength without a
numerical measure, the relationship appears to be moderately strong,
as the data is fairly tightly scattered about the line. Finally, all
the data points seem to "obey" the pattern- <b>there do not appear to be
any outliers.</b>
</span>Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-30918985270252906762008-09-05T22:09:00.012+05:302008-09-05T23:38:19.987+05:30Learning Statistics Using R: Role Type Classification: Case III (2 of 5)<span style="font-family: verdana;font-size:100%; color:black;">
As we have seen in <a
href="http://krishnadagli.blogspot.com/2008/08/learning-statistics-using-r-role-type_25.html">
our previous scatterplot</a>, the
<b>explanatory variable is always plotted on the horizontal X-axis and the response
variable on the Y-axis. </b> If at times we are not able to clearly
identify the explanatory and response variables, then each of them can be
plotted on either axis.
<br/><br/>
<b>Interpreting a scatterplot: </b> In case I we used comparative box
plots and in case II comparative bar plots/histograms, but how
do we interpret a scatterplot? What we did there was describe the overall
pattern of the distribution (of the response variable) and any deviations
(outliers) from that pattern. We take the same approach for a
scatterplot: we describe the overall pattern by looking at
its <b>"Direction", "Form", and "Strength" </b>, and along with
this we look for outliers. The following figure from the course site
sums it up nicely. <br/><br/>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_4NC7AWExX5g/SMFxsfvI2nI/AAAAAAAAAWI/ONs29rrarPc/s1600-h/scatterplot5.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://1.bp.blogspot.com/_4NC7AWExX5g/SMFxsfvI2nI/AAAAAAAAAWI/ONs29rrarPc/s400/scatterplot5.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5242596450483362418" /></a>
<br/>
Let's discuss in detail each of these three features that describe the
overall pattern of the relationship.
<OL>
<LI>Direction: The direction of the relationship can be positive,
negative, or neither. We identify the direction of the relationship by
looking at how the scatterplot's points move across the x-y plane.
The following figures show examples of positive, negative, and neither
directions.
<UL>
<LI> Positive direction: A positive (or increasing) relationship means
that an increase in one of the variables is associated with an
increase in the other.<br/>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_4NC7AWExX5g/SMFx5zCe5jI/AAAAAAAAAWQ/ABPBmgSIvmE/s1600-h/scatterplot6.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://1.bp.blogspot.com/_4NC7AWExX5g/SMFx5zCe5jI/AAAAAAAAAWQ/ABPBmgSIvmE/s400/scatterplot6.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5242596679003072050" /></a>
<LI> Negative direction: A negative (or decreasing) relationship means
that an increase in one of the variables is associated with a decrease
in the other. <br/>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_4NC7AWExX5g/SMFyJrBJDzI/AAAAAAAAAWY/nOSO1gj2mto/s1600-h/scatterplot7.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://3.bp.blogspot.com/_4NC7AWExX5g/SMFyJrBJDzI/AAAAAAAAAWY/nOSO1gj2mto/s400/scatterplot7.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5242596951727869746" /></a>
<LI> Neither: Not all relationships can be classified as either
positive or negative.
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_4NC7AWExX5g/SMFyX7b_NnI/AAAAAAAAAWg/a9KN25nTnFc/s1600-h/scatterplot8.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://2.bp.blogspot.com/_4NC7AWExX5g/SMFyX7b_NnI/AAAAAAAAAWg/a9KN25nTnFc/s400/scatterplot8.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5242597196653606514" /></a>
</UL>
<LI>Form: The form of the relationship is its general shape. When
identifying the form, we try to find the simplest way to describe the
shape of the scatterplot. There are many possible forms. Here are a
couple that are quite common:
<UL>
<LI>Linear Form: Relationships with a linear form are most simply
described as points scattered about a line:
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_4NC7AWExX5g/SMFy8TgC92I/AAAAAAAAAWo/amnvVhJKhbQ/s1600-h/scatterplot9.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://3.bp.blogspot.com/_4NC7AWExX5g/SMFy8TgC92I/AAAAAAAAAWo/amnvVhJKhbQ/s400/scatterplot9.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5242597821588371298" /></a>
<LI>Curvilinear Form: Relationships with a curvilinear form are most
simply described as points dispersed around the same curved line:
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_4NC7AWExX5g/SMFzbPluhHI/AAAAAAAAAWw/u-f3WS-YGLA/s1600-h/scatterplot10.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://2.bp.blogspot.com/_4NC7AWExX5g/SMFzbPluhHI/AAAAAAAAAWw/u-f3WS-YGLA/s400/scatterplot10.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5242598353114399858" /></a>
<LI> Other Forms: There are many other possible forms for the
relationship between two quantitative variables, but linear and
curvilinear forms are quite common and easy to identify. Another
form-related pattern that we should be aware of is <b>clusters in the
data</b>:
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_4NC7AWExX5g/SMFztL2suYI/AAAAAAAAAW4/bpLukCud4C4/s1600-h/scatterplot11.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://4.bp.blogspot.com/_4NC7AWExX5g/SMFztL2suYI/AAAAAAAAAW4/bpLukCud4C4/s400/scatterplot11.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5242598661349489026" /></a>
</UL>
<LI>Strength: The strength of the relationship is determined by how
closely the data follow the form of the relationship. Let's look, for
example, at the following two scatterplots displaying a positive,
linear relationship:
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_4NC7AWExX5g/SMF0DKgS3jI/AAAAAAAAAXA/yl0Mse5OO-I/s1600-h/scatterplot12.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://2.bp.blogspot.com/_4NC7AWExX5g/SMF0DKgS3jI/AAAAAAAAAXA/yl0Mse5OO-I/s400/scatterplot12.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5242599038944206386" /></a>
<br/>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_4NC7AWExX5g/SMF0antBO5I/AAAAAAAAAXI/LGTIN21UBvQ/s1600-h/scatterplot13.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://3.bp.blogspot.com/_4NC7AWExX5g/SMF0antBO5I/AAAAAAAAAXI/LGTIN21UBvQ/s400/scatterplot13.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5242599441919196050" /></a>
In the top scatterplot the data points follow the linear pattern quite
closely; this is an example of a strong relationship. In the bottom
scatterplot the points also follow the linear pattern, but much less
closely, so we say that the relationship is weaker. In general, though,
assessing the strength of a relationship just by looking at the
scatterplot is quite problematic, and we need a numerical measure to
help us with that. We will discuss this later in this section.
<LI>Outliers: Data points that deviate from the pattern of the
relationship are called outliers. We will see several examples of
outliers during this section. Two outliers are illustrated in the
scatterplot below:
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_4NC7AWExX5g/SMF0wPJVy-I/AAAAAAAAAXQ/rx1Fr5RypLE/s1600-h/scatterplot14.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://1.bp.blogspot.com/_4NC7AWExX5g/SMF0wPJVy-I/AAAAAAAAAXQ/rx1Fr5RypLE/s400/scatterplot14.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5242599813284219874" /></a>
</OL>
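<br/>
As a rough illustration of direction and strength, here is a minimal R
sketch. The data are simulated with rnorm (they are not taken from the
course), and the variable names are only placeholders.
<span style="font-family: trebuchet ms; color:#003300; font-size:100%;">
<pre>
# Simulated data only: none of these values come from the course examples.
set.seed(42)
x <- 1:50
y_pos_strong <- 2 * x + rnorm(50, sd = 5)    # positive direction, points close to the line
y_pos_weak   <- 2 * x + rnorm(50, sd = 40)   # positive direction, points far from the line
y_neg        <- -2 * x + rnorm(50, sd = 5)   # negative direction

op <- par(mfrow = c(1, 3))                   # three panels in one row
plot(x, y_pos_strong, main = "Positive, strong", xlab = "X", ylab = "Y")
plot(x, y_pos_weak,   main = "Positive, weak",   xlab = "X", ylab = "Y")
plot(x, y_neg,        main = "Negative",         xlab = "X", ylab = "Y")
par(op)                                      # restore the previous layout
</pre></span>
The only difference between the first two panels is the amount of noise
added around the same line, which is what "strength" describes.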
</span>Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-53108874266720104292008-08-25T01:11:00.006+05:302009-01-07T22:12:43.327+05:30Learning Statistics Using R: Role Type Classification: Case III (1 of 5)<span style="font-family: verdana;font-size:100%; color:black;">
Let's start with our next example: understanding the relationship between
two quantitative variables, i.e. both the explanatory and the response
variables are quantitative.
<br/><br/>
In the previous two cases we compared the distribution of the response
variable across the values of the explanatory variable. To be specific,
in Case I we compared the distribution of a quantitative response across
the categories of a categorical explanatory variable, and in Case II we
compared the distribution of a categorical response across the
categories of a categorical explanatory variable.
<br/>
Now both variables are quantitative; more importantly, the explanatory
variable is quantitative, so there are no categories to compare across.
We will start understanding their relationship using a "scatterplot".
<br/><br/>
We start with an example taken from the course site. <br/>
A Pennsylvania research firm conducted a study in which 30 drivers (of
ages 18 to 82 years old) were sampled and for each one the maximum
distance at which he/she could read a newly designed sign was
determined. The goal of this study was to explore the relationship
between driver's age and the maximum distance at which signs were
legible, and then use the study's findings to improve safety for older
drivers. (Reference: Utts and Heckard, Mind on Statistics
(2002). Original source: Data collected by Last Resource, Inc,
Bellfonte, PA.) <br/><br/>
Since the purpose of this study is to explore the effect of age on
maximum legibility distance,<br/>
<UL>
<LI> the explanatory variable is Age, and
<LI> the response variable is Distance.
</UL>
Here is what the raw data look like; the data set is available <a
href="http://math.fullerton.edu/mori/Math120/Data/ascii/signdist.txt">here</a>.
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_4NC7AWExX5g/SLG6biXKSmI/AAAAAAAAAVo/-2T6tZybJKw/s1600-h/scatterplot2.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://4.bp.blogspot.com/_4NC7AWExX5g/SLG6biXKSmI/AAAAAAAAAVo/-2T6tZybJKw/s320/scatterplot2.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5238172823851780706" /></a>
<br/>
Note that the data structure is such that for each individual (in this
case driver 1 ... driver 30) we have a pair of values (in this case
representing the driver's age and distance). We can therefore think
about this data set as 30 pairs of values: (18,510), (32,410),
(55,420), ..., (82,360).
<br/><br/>
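To make the "pairs of values" idea concrete, here is a small R sketch
that stores only the four pairs quoted above in a data frame. The column
names Age and Distance are my assumption; the real file may use different
names, and the full data set has 30 rows.
<span style="font-family: trebuchet ms; color:#003300; font-size:100%;">
<pre>
# Only the four pairs quoted in the text; the full data set has 30 drivers.
# Column names are assumed for illustration.
drivers <- data.frame(
  Age      = c(18, 32, 55, 82),
  Distance = c(510, 410, 420, 360)
)
drivers          # one row per driver
nrow(drivers)    # number of (Age, Distance) pairs
drivers[1, ]     # the first pair: Age 18, Distance 510
</pre></span>
Each row is one individual, and each column is one of the two
quantitative variables.
<br/><br/>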
The first step in exploring the relationship between driver age and
sign legibility distance is to create an appropriate and informative
graphical display. The appropriate graphical display for examining
the relationship between two quantitative variables is the
<b>scatterplot</b>. <br/><br/>
Here is how a scatterplot is constructed for our example: <br/><br/>
To create a scatterplot, each pair of values is plotted, so that the
value of the explanatory variable (X) is plotted on the horizontal
axis, and the value of the response variable (Y) is plotted on the
vertical axis. In other words, each individual (driver, in our
example) appears on the scatterplot as a single point whose
x-coordinate is the value of the explanatory for that individual, and
the y-coordinate is the value of the response. The following images,
taken from the course website, illustrate this.
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_4NC7AWExX5g/SLG63WiklNI/AAAAAAAAAVw/4F-wxrx1Bt8/s1600-h/scatterplot3.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://1.bp.blogspot.com/_4NC7AWExX5g/SLG63WiklNI/AAAAAAAAAVw/4F-wxrx1Bt8/s400/scatterplot3.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5238173301714752722" /></a>
<br/><br/>
<b>Now that we have the data set, let's do the same using R.</b><br/><br/>
<span style="font-family: trebuchet ms;color:#003300; font-size:100%; weight:bold;"> <pre>
# Read data using read.csv function, here separator is tab.
# Bug-Fix: Gabriele Righetti
> signdist <- read.csv ("signdistance.txt", sep="\t", header=T)
> plot(signdist, type="b", main="Sign and Distance", col="blue", xlab="Driver Age",
ylab="Sign Legibility Distance (Feet)")
</pre></span>
The plot command above draws our scatterplot. Note that our plot differs
from the one on the course site: it has lines along with the points. That
comes from the "type=b" argument of the plot command. See the help page
of the plot command for all available types.
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_4NC7AWExX5g/SLG77q-wSPI/AAAAAAAAAV4/G7HpyCIeBEA/s1600-h/myscatterplot.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://2.bp.blogspot.com/_4NC7AWExX5g/SLG77q-wSPI/AAAAAAAAAV4/G7HpyCIeBEA/s400/myscatterplot.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5238174475432773874" /></a>
<br/>
Here is another version, drawn with the "type=p" argument (points only). <br/>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_4NC7AWExX5g/SLG9DRzfykI/AAAAAAAAAWA/hk9WiQa2T4g/s1600-h/myscatterplot1.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://4.bp.blogspot.com/_4NC7AWExX5g/SLG9DRzfykI/AAAAAAAAAWA/hk9WiQa2T4g/s400/myscatterplot1.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5238175705625250370" /></a>
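<br/>
The two plot types can be compared side by side with the sketch below.
It is only an illustration: instead of reading the data file, it uses the
four (age, distance) pairs quoted earlier in this post as stand-in data.
<span style="font-family: trebuchet ms; color:#003300; font-size:100%;">
<pre>
# Stand-in data: the four (Age, Distance) pairs quoted earlier in this post.
age      <- c(18, 32, 55, 82)
distance <- c(510, 410, 420, 360)

op <- par(mfrow = c(1, 2))
# type = "p" draws points only, the usual scatterplot
plot(age, distance, type = "p", col = "blue", main = "type = p",
     xlab = "Driver Age", ylab = "Sign Legibility Distance (Feet)")
# type = "b" draws points joined by lines, in the order the rows appear
plot(age, distance, type = "b", col = "blue", main = "type = b",
     xlab = "Driver Age", ylab = "Sign Legibility Distance (Feet)")
par(op)
</pre></span>
Because "type=b" joins the points in row order, the lines only read well
when the rows are sorted by the x variable.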
</span>Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-59089138002240145062008-08-03T14:47:00.004+05:302008-08-03T15:07:03.005+05:30Learning Statistics Using R: Role Type Classification : Case II (2 of 2)<span style="font-family: verdana;font-size:100%; color:black;">
The following example from the course website provides an opportunity to
analyze the relationship between two categorical variables using R.
<br/><br/>
<b>Study:</b> An Associated Press article captured the attention of
readers with the headline "Night lights bad for kids?" The article
was based on a 1999 study at the University of Pennsylvania and
Children's Hospital of Philadelphia, in which parents were surveyed
about the lighting conditions under which their children slept between
birth and age 2 (lamp, night-light, or no light) and whether or not
their children developed nearsightedness (myopia.) <b>The purpose of the
study was to explore the effect of a young child's night-time exposure
to light on later nearsightedness.</b>
<br> <br/><i>The actual Excel file can be downloaded from <a
href="https://oli.web.cmu.edu/repository/webcontent/859044f080020c690045b50e228faf24/_u2_summarizing_data/webcontent/excel/nightlight.xls">here</a>, or
it can be downloaded as a csv file along with the other data sets from
<a
href="http://krishnadagli.blogspot.com/2008/07/learning-statistics-using-r-data-sets.html">here.</a></i><br/>
<br/>
Let's try to do this analysis using R.
<br/>
<OL>
<LI> Identify the explanatory and response variables:
<UL>
<LI> Light: (No light/Night light/Lamp) this categorical variable is the
explanatory variable.
<LI> Nearsightedness: (Yes/No) this categorical variable is the response
variable, since we want to see whether its values (yes/no) depend on the
value of the Light variable.
</UL>
<br/>
<LI> Create a two-way summary table: The following steps show how we
summarize the data in a two-way table using R. Note that there are
multiple ways of creating a two-way table in R, but I have not been
successful with the others.
<span style="font-family: trebuchet ms;color:#003300; font-size:100%; weight:bold;">
<pre> # Read data using the read.csv function.
> nightlight<-read.csv("nightlight.csv", header=T,sep=",")
# Find row and column totals using addmargins function.
> addmargins(table(nightlight),FUN=sum, quiet=F)
Margins computed over dimensions
in the following order:
1: Light
2: Nearsightedness
Nearsightedness
Light No Yes sum
lamp 34 41 75
night light 153 79 232
no light 155 17 172
sum 342 137 479
</pre>
</span>
As we have seen earlier, the "table" command is used to tabulate data,
but we also need row and column totals; the "addmargins" function
provides them. (I think this can also be achieved using a combination of
apply, t, cut and a few other functions, but I have not been able to do so
with my limited knowledge of R. Have a look at <a
href="http://tolstoy.newcastle.edu.au/R/devel/02b/0137.html"> this
discussion thread.</a>)
<br/><br/>
<LI> Find percentages: As noted earlier, the totals alone do not help; we
need percentages to compare the distribution of the response
variable across the lighting conditions.
<br>
Again, as I am still learning R, I could not do this percentage
calculation with a combination of basic R commands, but I found a package
(rather, two) that does exactly what we want.
<br/>
<UL><LI>Using "CrossTable" function from library <a
href="http://cran.r-project.org/web/packages/gregmisc/index.html">
"gregmisc": </a><br/>
<span style="font-family: trebuchet ms;color:#003300; font-size:100%; weight:bold;">
<pre> # load the library
> library("gregmisc")
Loading required package: gdata
Loading required package: gmodels
Loading required package: gplots
Loading required package: gtools
# Read data using the read.csv function.
> nightlight <- read.csv("nightlight.csv", header=T,sep=",")
> CrossTable(table(nightlight), prop.t=FALSE, prop.c=FALSE, prop.r=TRUE, prop.chisq=F)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
|-------------------------|
Total Observations in Table: 479
| Nearsightedness
Light | No | Yes | Row Total |
-------------|-----------|-----------|-----------|
lamp | 34 | 41 | 75 |
| 0.453 | 0.547 | 0.157 |
-------------|-----------|-----------|-----------|
night light | 153 | 79 | 232 |
| 0.659 | 0.341 | 0.484 |
-------------|-----------|-----------|-----------|
no light | 155 | 17 | 172 |
| 0.901 | 0.099 | 0.359 |
-------------|-----------|-----------|-----------|
Column Total | 342 | 137 | 479 |
-------------|-----------|-----------|-----------|
</pre></span>
<br/> As we can see from the above output, the "CrossTable" function
gives us what we want ("prop.r=TRUE" requests the row
proportions).
<br/><br/>
<LI>Using <a
href="http://cran.r-project.org/web/packages/Rcmdr/index.html">
"Rcmdr" </a>library<br/>
This library provides a nice GUI with which we can find row percentages
and other details. Here is a screenshot:
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_4NC7AWExX5g/SJV5v6YJo3I/AAAAAAAAAVY/d6WPAZ6ynME/s1600-h/rcmdr.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp3.blogger.com/_4NC7AWExX5g/SJV5v6YJo3I/AAAAAAAAAVY/d6WPAZ6ynME/s400/rcmdr.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5230220406292063090" /></a>
<span style="font-family: verdana;font-size:100%; color:black;">
</ul>
<br/>
<LI> Interpret these results: Let's try to analyze these results. We
want to explore the effect of a young child's night-time exposure to
light on later nearsightedness. From the above results, let's note a few
things:
<UL>
<LI> The results suggest that a proportion of 0.547 (or 54.7%) of the
children who slept with a lamp developed nearsightedness. This proportion
is higher than the proportions for night light and no light,
which are 0.341 (or 34.1%) and 0.099 (or 9.9%) respectively.
<LI> That is, children who slept with a lamp were about 5.5 times as
likely (0.547 / 0.099) to develop nearsightedness as children who slept
without any light.
<LI> Even so, 9.9% of the children who slept without any light still
developed nearsightedness.
</UL>
</OL>
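<br/>
For readers who do not have the csv file or the extra packages, the same
totals and row proportions can be sketched in base R. This is a minimal
sketch, not the method used above: it types in the counts from the
two-way table printed earlier and uses the base functions addmargins and
prop.table. The object names are placeholders.
<span style="font-family: trebuchet ms; color:#003300; font-size:100%;">
<pre>
# Counts copied from the two-way table printed above.
counts <- matrix(c( 34, 41,
                   153, 79,
                   155, 17),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Light = c("lamp", "night light", "no light"),
                                 Nearsightedness = c("No", "Yes")))
tab <- as.table(counts)

addmargins(tab)                              # row and column totals
row_prop <- prop.table(tab, margin = 1)      # each row divided by its row total
round(row_prop, 3)

# Ratio of the "Yes" proportions: lamp versus no light
row_prop["lamp", "Yes"] / row_prop["no light", "Yes"]
</pre></span>
With margin = 1 every row sums to 1, so the "Yes" column holds the
conditional proportions that the interpretation above relies on.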
</span>Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-54079019989293517902008-07-17T00:17:00.004+05:302008-08-03T14:47:05.994+05:30Learning Statistics Using R: Role Type Classification : Case II (1 of 2)Case II of our role type classification includes study of relationship
between a Categorical Explanatory and a Categorical Response variable.
<br/><br/>
We start with an example from the course web site to explore
relationship between two categorical variables. <br/>
Example: In a survey, 1200 U.S. college students were asked about
their body-image: underweight, overweight, or about right. We have to
find answers to the following questions: <br/>
If we had separated our sample of 1200 U.S. college students by gender
and looked at males and females separately, would we have found a
similar distribution across body-image categories? <br/>
More specifically, are men and women just as likely to think their
weight is about right? Among those students who do not think their
weight is about right, is there a difference between the genders in
feelings about body-image? <br/><br/>
Answering these questions requires us to study the relationship
between two categorical variables. Both the response and the explanatory
variables are categorical, since we want to find how gender
(male/female) affects body image (underweight, overweight, right
weight).
In this study we have the following:
<UL>
<LI> Gender: (Male/Female) as explanatory variable and it is a
categorical variable.
<LI> Body-image:(underweight, overweight, right weight) as response
variable and it is a categorical variable.
</UL>
<br/> <B>As I could not find the raw data for this example, we will
directly use the results derived at the course site instead of reading
raw data into R and computing the results.</B><br/><br/>
To understand how body image is related to gender, we need an
informative display that summarizes the data. In order to summarize
the relationship between two categorical variables, we create a
display called a two-way table.<br/><br/>
Here is the two-way table for our example:<br/>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_4NC7AWExX5g/SIYeVasS73I/AAAAAAAAAUg/7lsomOu1LEY/s1600-h/caseii-1.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp3.blogger.com/_4NC7AWExX5g/SIYeVasS73I/AAAAAAAAAUg/7lsomOu1LEY/s320/caseii-1.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5225897770901237618" /></a>
<br/>
So our two-way table summarizes data of all 1200 students by gender and their body image as counts.
The "Total" row or column is a summary of one of the two categorical
variables, ignoring the other. In our example:
<UL>
<LI>The Total row gives the summary of the categorical variable
Body-image:
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp1.blogger.com/_4NC7AWExX5g/SIYfAA0l46I/AAAAAAAAAUo/kplGdufAJRQ/s1600-h/caseii-2.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp1.blogger.com/_4NC7AWExX5g/SIYfAA0l46I/AAAAAAAAAUo/kplGdufAJRQ/s320/caseii-2.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5225898502691087266" /></a>
<br/>
<LI>The Total column gives the summary of the categorical variable
Gender:
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp2.blogger.com/_4NC7AWExX5g/SIYfNXSxqPI/AAAAAAAAAUw/2JVbRhmul04/s1600-h/caseii-3.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp2.blogger.com/_4NC7AWExX5g/SIYfNXSxqPI/AAAAAAAAAUw/2JVbRhmul04/s320/caseii-3.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5225898732061567218" /></a>
<br/>
</UL>
<br/>
Remember, though, that our primary goal is to explore how body image
is related to gender. Exploring the relationship between two
categorical variables (in this case Body-image and Gender) amounts to
comparing the distributions of the response (in this case Body-image)
across the different values of the explanatory (in this case males and
females):
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp2.blogger.com/_4NC7AWExX5g/SIYfyehFRPI/AAAAAAAAAU4/rOrSxr2coTk/s1600-h/caseii-4.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp2.blogger.com/_4NC7AWExX5g/SIYfyehFRPI/AAAAAAAAAU4/rOrSxr2coTk/s320/caseii-4.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5225899369655780594" /></a>
<span style="font-family: verdana;font-size:100%; color:black;">
<br/>
Note that it does not make sense to compare raw counts, because there
are more females than males overall. For example, it is not very
informative to say "there are 560 females who responded 'About Right'
compared to only 295 males," since the 560 females are out of a total
of 760, and the 295 males are out of a total of only 440.
We need to supplement our display, the two-way table, with some
numerical summaries that will allow us to compare the
distributions. These numerical summaries are found by simply
converting the counts to percents within (or restricted to) each value
of the explanatory variable separately.
In our example,
we look at each gender separately, and convert the counts to percents
within that gender. Let's start with females:
<br/>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_4NC7AWExX5g/SIYgCUTX9gI/AAAAAAAAAVA/xRP9MNCxK2M/s1600-h/caseii-5.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp0.blogger.com/_4NC7AWExX5g/SIYgCUTX9gI/AAAAAAAAAVA/xRP9MNCxK2M/s320/caseii-5.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5225899641791837698" /></a>
<br/>
Note that each count is converted to a percent by dividing by the total
number of females, 760. These numerical summaries are called
conditional percents, since we find them by conditioning on one of the
genders.
<br/>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_4NC7AWExX5g/SIYhOuOaunI/AAAAAAAAAVI/Q13uuEnky2w/s1600-h/caseii-5.5.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp0.blogger.com/_4NC7AWExX5g/SIYhOuOaunI/AAAAAAAAAVI/Q13uuEnky2w/s320/caseii-5.5.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5225900954420427378" /></a>
<br/>
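The conversion from counts to conditional percents can be sketched in R
as follows. Only the counts quoted in the text are used (the 'About
Right' counts and the totals for each gender); the vector names are
placeholders.
<span style="font-family: trebuchet ms; color:#003300; font-size:100%;">
<pre>
# Counts quoted in the text: "About Right" responses and the total for each gender.
about_right  <- c(Female = 560, Male = 295)
gender_total <- c(Female = 760, Male = 440)

# Conditional percent = count within a gender / total for that gender * 100
cond_percent <- 100 * about_right / gender_total
round(cond_percent, 1)
</pre></span>
Dividing by each gender's own total is what makes the two groups
comparable even though there are more females than males in the sample.
<br/>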
<b>Comments</b>
<OL>
<LI>In our example, we chose to organize the data with the explanatory
variable Gender in rows and the response variable Body-image in
columns, and thus our conditional percents were row percents,
calculated within each row separately. Similarly, if the explanatory
variable happens to sit in columns and the response variable in rows,
our conditional percents will be column percents, calculated within
each column separately.
<LI>Another way to visualize the conditional percents, instead of a
table, is the double bar chart. This display is quite common in
newspapers.
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp1.blogger.com/_4NC7AWExX5g/SIYhh2dqOWI/AAAAAAAAAVQ/Vw04jh4DC_I/s1600-h/caseii-6.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp1.blogger.com/_4NC7AWExX5g/SIYhh2dqOWI/AAAAAAAAAVQ/Vw04jh4DC_I/s320/caseii-6.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5225901283049355618" /></a>
</OL>
<br/>
After looking at the numerical summary and graph, let's try to put the results in words:
<UL>
<LI>The results suggest that the proportion of male students who are happy with their body image ('About right') is somewhat lower than the proportion among female students: 73.7% of female students (560 out of 760) are happy with their body image, compared to only 67% of male students.
<LI> Female students who are not happy with their body image mostly feel they are overweight: 73.7% are happy, 21.4% feel they are overweight, and only 4.9% feel underweight.
<LI> Male students who are not happy with their body image feel overweight about as often as they feel underweight: 16.6% of male students feel they are overweight, while roughly the same share, 16.2%, feel they are underweight.
</UL>
</span>Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-53616153369065619712008-07-15T01:25:00.006+05:302009-01-04T19:14:51.658+05:30Learning Statistics Using R: Role Type Classification : Case I (1 of 1)As we show earlier, Case I of our role type classification includes
study of a relationship between a Categorical Explanatory and a
Quantitative Response variable.
<br/><br/>
We start with an example from the course website. <br/>
Example: People who are concerned about their health may prefer hot
dogs that are low in calories. A study was conducted by a concerned
health group in which 54 major hot dog brands were examined, and their
calorie contents recorded. In addition, each brand was classified by
type: beef, poultry, and meat (mostly pork and beef, but up to 15%
poultry meat).
<br/><br/>
<b>
The purpose of the study was to examine whether the number of calories
a hot dog has is related to (or affected by) its type.</b>
(Reference: Moore, David S., and George P. McCabe (1989). Introduction
to the Practice of Statistics. Original source: Consumer Reports, June
1986, pp. 366-367.)
Answering this question requires us to examine the relationship
between the categorical variable Type and the quantitative variable
Calories. Because the question of interest is whether the type of hot
dog affects calorie content,
<UL>
<LI>the explanatory variable is Type, and
<LI>the response variable is Calories.
</UL>
To explore how the number of calories is related to the type of hot
dog, we need an informative visual display of the data that will
compare the three types of hot dogs with respect to their calorie
content.
The visual display that we'll use is side-by-side boxplots. The
side-by-side boxplots will allow us to compare the distribution of
calorie counts within each category of the explanatory variable, hot
dog type. We use R to draw the side-by-side boxplots:
<span style="font-family: trebuchet ms; color:#003300; font-size:100%;">
<pre>
> hotdogdata <- read.csv ("TA01_009.TXT", header=T, sep="\t")
> attach(hotdogdata)
> boxplot(Calories[HotDog == 'Beef'], Calories[HotDog == 'Meat'], Calories[HotDog == 'Poul'],
border=c('blue', 'magenta','red'),
names=c('Beef','Meat', 'Poul'),
ylab='Calories',
main='Side By Side Comparative Boxplot of Calories')
</pre></span>
<br/>
<b>Gabriele Righetti</b> from Italy pointed out that it should be "names" and not "xlab" for
naming boxplots. Thanks Gabriele.<br/>
<br/>
The output of the above command is a boxplot like the following:
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_4NC7AWExX5g/SHuvzrQJiSI/AAAAAAAAAUQ/35j2cBetXGU/s1600-h/boxplot-case1.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp0.blogger.com/_4NC7AWExX5g/SHuvzrQJiSI/AAAAAAAAAUQ/35j2cBetXGU/s320/boxplot-case1.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5222961495184869666" /></a>
<span style="font-family: verdana;font-size:100%; color:black;">
<br/> Let's also find the five-number summary for each type using R. Here
is the output:
<span style="font-family: trebuchet ms; color:#003300; font-size:100%;">
<pre>
> summary(Calories[HotDog=='Beef'])
Min. 1st Qu. Median Mean 3rd Qu. Max.
111.0 140.5 152.5 156.8 177.2 190.0
> summary(Calories[HotDog=='Meat'])
Min. 1st Qu. Median Mean 3rd Qu. Max.
107.0 139.0 153.0 158.7 179.0 195.0
> summary(Calories[HotDog=='Poul'])
Min. 1st Qu. Median Mean 3rd Qu. Max.
86.0 102.0 129.0 122.5 143.0 170.0
</pre></span>
Let's summarize the results we got and interpret them in the context of the question we posed:
By examining the three side-by-side boxplots and the numerical
summaries, we see at once that poultry hotdogs as a group contain
fewer calories than beef or meat. <br/>The median number of calories in
poultry hotdogs (129) is less than the median (and even the first
quartile) of either of the other two distributions (medians 152.5 and
153). <br/>The spread of the three distributions is about the same if the
IQR is considered (roughly 37 for beef, 40 for meat, and 41 for poultry),
but the (full) ranges vary slightly more (beef: 79, meat: 88, poultry:
84). The general recommendation to the health-conscious consumer is to
eat poultry hotdogs. <br/>It should be noted, though, that since each of
the three types of hotdogs shows quite a large spread among brands,
simply buying a poultry hotdog does not guarantee a low-calorie food.
<br/>
What we learn from this example is that when exploring the
relationship between a categorical explanatory variable and a
quantitative response (Case I), we essentially compare the
distributions of the quantitative response for each category of the
explanatory variable using side-by-side boxplots supplemented by
descriptive statistics.
<br/><br/>
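The IQR and range figures quoted above can be checked with a few lines
of R. This is a sketch that types in the values printed by summary()
rather than reading the raw file, so the results are only as precise as
that rounded output; the object names are placeholders.
<span style="font-family: trebuchet ms; color:#003300; font-size:100%;">
<pre>
# Quartiles, minimum and maximum copied from the summary() output above.
five_num <- rbind(
  Beef = c(Min = 111, Q1 = 140.5, Median = 152.5, Q3 = 177.2, Max = 190),
  Meat = c(Min = 107, Q1 = 139.0, Median = 153.0, Q3 = 179.0, Max = 195),
  Poul = c(Min = 86,  Q1 = 102.0, Median = 129.0, Q3 = 143.0, Max = 170)
)

iqr        <- five_num[, "Q3"]  - five_num[, "Q1"]   # spread of the middle 50%
full_range <- five_num[, "Max"] - five_num[, "Min"]  # spread of all the data
cbind(IQR = iqr, Range = full_range)
</pre></span>
With the raw data loaded, the built-in IQR() and range() functions would
give the same quantities directly for each hot dog type.
<br/><br/>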
So we can safely say, the relationship between a categorical explanatory and a quantitative response variable is summarized using:
<UL>
<LI> Data display: side-by-side boxplots
<LI> Numerical summaries: descriptive statistics
</UL>
That is, we compare the distributions of the quantitative response variable for each category of the categorical explanatory variable (a factor, in R terms).
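<br/><br/>
Here is a small sketch of how a factor drives the per-category
comparison in R. It is an illustration only: the stand-in data are the
five summary values per type from the output above, not the 54 raw
observations, and the data frame name is a placeholder. It also uses the
formula interface of boxplot, which avoids attach().
<span style="font-family: trebuchet ms; color:#003300; font-size:100%;">
<pre>
# Stand-in data: the five summary values per type from the output above,
# not the 54 raw observations.
hotdogs <- data.frame(
  HotDog   = factor(rep(c("Beef", "Meat", "Poul"), each = 5)),
  Calories = c(111, 140.5, 152.5, 177.2, 190,
               107, 139.0, 153.0, 179.0, 195,
                86, 102.0, 129.0, 143.0, 170)
)

levels(hotdogs$HotDog)                              # categories of the explanatory variable
tapply(hotdogs$Calories, hotdogs$HotDog, median)    # one median per category

# Formula interface: response ~ explanatory, no attach() needed
boxplot(Calories ~ HotDog, data = hotdogs,
        ylab = "Calories",
        main = "Side By Side Comparative Boxplot of Calories")
</pre></span>
The formula Calories ~ HotDog reads as "response by explanatory", which
mirrors the role-type classification directly.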
</span>Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-31062572236245925172008-07-08T19:09:00.004+05:302008-07-08T21:12:42.366+05:30Learning Statistics using R: Role-type classification (2 of 2)<span style="font-family: verdana;font-size:100%; color:black;">
Let's try a few examples from the course web site. In these examples we
are presented with a brief description of a study involving two
variables. We are required to determine which of the four cases
represents the data of the problem. That is, we need to identify
whether each variable is categorical or quantitative, and which variable
is the response and which is the explanatory variable.
<br/>
<OL>
<LI>A store asked 250 of its customers whether they were satisfied
with the service or not. The purpose of this study was to examine the
relationship between the customer's satisfaction and gender.
<br/>
<br/>
In this example, Gender is the explanatory variable and Satisfaction
is the response variable. Both these variables are categorical,
and hence this is an example of Case II.
<br/>
<br/>
<LI>A study was conducted in order to explore the relationship between
the number of beers a person drinks, and his/her Blood Alcohol Level
(in %).
<br/>
In this example, both the explanatory (number of beers) and response
(BAC) variables are quantitative, and therefore this is
an example of Case III.
<br/>
<br/>
<LI>A study was conducted in order to determine whether longevity (how
long a person lives) is somehow related to the person's handedness
(right-handed/left-handed).
<br/><br/>
In this case the explanatory variable (handedness) is categorical and
the response variable (longevity) is quantitative. This is, therefore,
an example of Case I.
</OL>
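<br/>
The classification used in the three answers above can be written as a
small lookup function. This is a hypothetical helper, not something from
the course. Cases I to III follow the definitions used in these posts;
Case IV (quantitative explanatory, categorical response) is my
assumption about the fourth case, which is not described here.
<span style="font-family: trebuchet ms; color:#003300; font-size:100%;">
<pre>
# Hypothetical helper, not from the course: maps variable types to a case label.
# Case IV is an assumption; it is not described in this post.
role_case <- function(explanatory, response) {
  types <- c("categorical", "quantitative")
  explanatory <- match.arg(explanatory, types)   # stops on any other value
  response    <- match.arg(response, types)
  switch(paste(explanatory, response),
         "categorical quantitative"  = "Case I",
         "categorical categorical"   = "Case II",
         "quantitative quantitative" = "Case III",
         "quantitative categorical"  = "Case IV")
}

role_case("categorical", "categorical")     # gender and satisfaction
role_case("quantitative", "quantitative")   # number of beers and blood alcohol level
role_case("categorical", "quantitative")    # handedness and longevity
</pre></span>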
</span>Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-14051789007284330812008-07-08T18:57:00.012+05:302008-10-02T17:17:07.537+05:30Learning Statistics using R: Data sets used in examplesThe actual data is available from the course website but lately my links to actual data set is not working; perhaps those are changed? I am also not sure on how to upload these data sets on blogger. So I have uploaded a zip file containing all data sets at MegaShare. Here is the link to download file : <a href="http://www.MegaShare.com/497357">http://www.MegaShare.com/497357</a>
(Updated : 01-OCT-2008, size: ~5.2 K)Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-90935073044891931972008-06-18T22:58:00.005+05:302008-09-10T23:18:55.717+05:30Learning Statistics using R: Role-type classification (1 of 2)<span style="font-family: verdana;font-size:100%; color:black;">
<strong><em>Most of the material here is taken from the course website!</em></strong><br/>
The second module of the course explains the relationship between two
variables. In earlier sections we learned how to work with a
distribution of a single variable, either quantitative or categorical.
<br/>
This section starts with the role-type classification of two variables.
In most studies involving two variables, each of the variables has a
role. We distinguish between:
<UL>
<LI>Response variable: the outcome of the study.
<LI>Explanatory variable: the variable that claims to explain, predict or affect the response.
</UL>
Response variables are also known as "dependent" variables and
explanatory variables as "independent" variables; the dependent
variable depends on the independent variable, hence the name. A simple
analogy is a function that computes the sum of its arguments: the
arguments (the values whose sum we need to find) are the independent
variables, while the output (the sum of these values) is the
dependent variable.
<br/> Let's take 8 examples from the course website to make this clear. We
will use these examples again for further variable type
classification.<br/>
<OL>
<LI> We want to explore whether the outcome of the study - the score
on a test - is affected by the test-taker's gender. Therefore:
<UL><LI>Gender is the explanatory variable
<LI>Test score is the response variable
</UL>
<LI>How is the number of calories in a hot dog related to (or
affected by) the type of hot dog (beef, meat or poultry)? (In other
words, are there differences in the number of calories between the
three types of hot dogs?)
<UL><LI>Number of calories is the response variable
<LI>Type of hot dog is the explanatory variable
</UL>
<LI>In this study we explore whether nearsightedness of a person can
be explained by the type of light that person slept with as a baby.
Therefore:
<UL><LI>Light Type is the explanatory variable
<LI>Nearsightedness is the response variable
</UL>
<LI>Are the smoking habits of a person (yes/no) related to the
person's gender?
<UL><LI>Gender of the person (male/female) is the explanatory variable
<LI>Smoking habit is the response variable
</UL>
<LI>Here we are examining whether a student's SAT score is a good
predictor for the student's GPA in freshman year. Therefore:
<UL><LI>SAT score is the explanatory variable
<LI>GPA of Freshman Year is the response variable
</UL>
<LI>In an attempt to improve highway safety for older drivers, a
government agency funded research that explored the relationship
between drivers' age and sign legibility distance (the maximum
distance at which the driver can read a sign).
<UL><LI>Driver's age is the explanatory variable
<LI>Sign legibility distance is the response variable
</UL>
<LI>Here we are examining whether a person's outcome on the driving test
(pass/fail) can be explained by the length of time this person has
practiced driving prior to the test. Therefore:
<UL><LI>Time is the explanatory variable
<LI>Driving Test Outcome is the response variable
</UL>
<LI>Can you predict a person's favorite type of music
(Classical/Rock/Jazz) based on his/her IQ level?
<UL><LI>IQ Level is the explanatory variable
<LI>Type of music is the response variable
</UL>
</OL>
<br/>
<b>
The examples above help in identifying the response and explanatory
variables, but is the role classification always clear? In other
words, is it always clear which of the variables is the explanatory
and which is the response? <br/>
<Big>Answer:</Big> NO! There are studies in which the role classification
is not really clear. This mainly happens in cases when both variables
are categorical or both are quantitative. An example could be a study
that explores the relationship between the SAT Math and SAT Verbal
scores. In cases like this, any classification choice would be fine
(as long as it is consistent throughout the analysis).</b> <br/><br/>
We know that a variable is either categorical or quantitative. We use
this information to further classify the response and explanatory
variables. This role-type classification gives the following 4
possibilities:
<OL>
<LI> Case I: the explanatory variable is categorical and the response variable is quantitative.
<LI> Case II: the explanatory variable is categorical and the response variable is categorical.
<LI> Case III: the explanatory variable is quantitative and the response variable is quantitative.
<LI> Case IV: the explanatory variable is quantitative and the response variable is categorical.
</OL>
The following table, taken from the course website, summarizes the above 4 cases:
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp1.blogger.com/_4NC7AWExX5g/SFlGy0Wf5uI/AAAAAAAAALo/shN9Fs4_RSY/s1600-h/relationships_overview1.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp1.blogger.com/_4NC7AWExX5g/SFlGy0Wf5uI/AAAAAAAAALo/shN9Fs4_RSY/s320/relationships_overview1.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5213275882518472418" /></a>
<br/>
The course warns us that this role-type classification serves as the
infrastructure for the entire section. In each of the 4 cases,
different statistical tools (displays and numerical measures) should be
used in order to explore the relationship between the two variables.
<br/><br/>
Along with this, the course also gives us the following important rule:
<br/>
<b>Principle:</b><br/>
When confronted with a research question that involves exploring the
relationship between two variables, the first and most crucial step is
to determine which of the 4 cases represents the data structure of the
problem. In other words, the first step should be classifying the two
relevant variables according to their role and type, and only then can
we determine the appropriate statistical tools. <br/>
Let's go back to our 8 examples and classify the explanatory and
response variables as categorical or quantitative.
<OL>
<LI> We want to explore whether the outcome of the study - the score
on a test - is affected by the test-taker's gender. Therefore:
<UL>
<LI>Gender is the explanatory variable and it is a categorical variable.
<LI>Test score is the response variable and it is a quantitative variable.
<LI> Therefore this is an example of Case I.
</UL>
<LI>How is the number of calories in a hot dog related to (or
affected by) the type of hot dog (beef, meat or poultry)? (In other
words, are there differences in the number of calories between the
three types of hot dogs?)
<UL>
<LI>Type of hot dog is the explanatory variable and it is a categorical variable.
<LI>Number of calories is the response variable and it is a quantitative variable.
<LI> Therefore this is an example of Case I.
</UL>
<LI>In this study we explore whether nearsightedness of a person can
be explained by the type of light that person slept with as a baby.
Therefore:
<UL><LI>Light Type is the explanatory variable and it is a categorical variable.
<LI>Nearsightedness is the response variable and it is a categorical variable.
<LI> Therefore this is an example of Case II.
</UL>
<LI>Are the smoking habits of a person (yes/no) related to the
person's gender?
<UL>
<LI>Gender of the person (male/female) is the explanatory variable and it is a categorical variable.
<LI>Smoking habit is the response variable and it is a categorical variable.
<LI> Therefore this is an example of Case II.
</UL>
<LI>Here we are examining whether a student's SAT score is a good
predictor for the student's GPA in freshman year. Therefore:
<UL>
<LI>SAT score is the explanatory variable and it is a quantitative variable.
<LI>GPA of Freshman Year is the response variable and it is a quantitative variable.
<LI> Therefore this is an example of Case III.
</UL>
<LI>In an attempt to improve highway safety for older drivers, a
government agency funded research that explored the relationship
between drivers' age and sign legibility distance (the maximum
distance at which the driver can read a sign).
<UL>
<LI>Driver's age is the explanatory variable and it is a quantitative variable.
<LI>Sign legibility distance is the response variable and it is a quantitative variable.
<LI> Therefore this is an example of Case III.
</UL>
<LI>Here we are examining whether a person's outcome on the driving test
(pass/fail) can be explained by the length of time this person has
practiced driving prior to the test. Therefore:
<UL>
<LI>Time is the explanatory variable and it is a quantitative variable.
<LI>Driving Test Outcome is the response variable and it is a categorical variable.
<LI> Therefore this is an example of Case IV.
</UL>
<LI>Can you predict a person's favorite type of music
(Classical/Rock/Jazz) based on his/her IQ level?
<UL>
<LI>IQ Level is the explanatory variable and it is a quantitative variable.
<LI>Type of music is the response variable and it is a categorical variable.
<LI> Therefore this is an example of Case IV.
</UL>
</OL>
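The classification used in the examples above can be sketched in R. This is only an illustration: the helper names `is_categorical` and `role_type_case` are made up here, the toy data is invented, and the sketch assumes that categorical variables are stored as factors (or character/logical vectors) and quantitative variables as numeric vectors.

```r
# Hypothetical helper: TRUE if x is stored as a categorical variable.
# Assumption: factors, character and logical vectors count as categorical.
is_categorical <- function(x) is.factor(x) || is.character(x) || is.logical(x)

# Hypothetical helper: map (explanatory, response) types to Case I-IV.
role_type_case <- function(explanatory, response) {
  e_cat <- is_categorical(explanatory)
  r_cat <- is_categorical(response)
  if (e_cat && !r_cat) return("Case I")    # categorical  -> quantitative
  if (e_cat && r_cat) return("Case II")    # categorical  -> categorical
  if (!e_cat && !r_cat) return("Case III") # quantitative -> quantitative
  "Case IV"                                # quantitative -> categorical
}

# Made-up toy data, only to exercise the helper
gender <- factor(c("male", "female", "female", "male"))
score  <- c(71, 85, 90, 64)
smoker <- factor(c("yes", "no", "no", "yes"))
sat    <- c(1100, 1350, 1420, 980)

role_type_case(gender, score)   # gender vs. test score
role_type_case(gender, smoker)  # gender vs. smoking habit
role_type_case(sat, score)      # SAT score vs. another score
role_type_case(sat, smoker)     # quantitative explanatory, categorical response
```

The storage type is only a proxy for the variable type; a categorical variable coded as numbers (for example 0/1) would need to be converted with factor() first.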
After this we learn more about role-type classification and which tool
to use in which cases.
</span>Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-17726326199330750762008-06-17T01:00:00.004+05:302008-09-10T23:18:31.774+05:30Learning Statistics using R: Rule Of Standard Deviation<strong><em>Most of the material here is taken from the course website!
</em></strong><br/>
The following explains the Standard Deviation Rule, also known as the
Empirical Rule. The rule applies only to distributions that are
approximately normal (symmetric and mound-shaped).
<UL>
<LI>Approximately 68% of the observations fall within 1 standard
deviation of the mean.
<LI>Approximately 95% of the observations fall within 2 standard
deviations of the mean.
<LI>Approximately 99.7% (or virtually all) of the observations fall
within 3 standard deviations of the mean.
</UL>
This rule provides more insight into the standard deviation, and the
following picture, taken from the course website, illustrates it:
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_4NC7AWExX5g/SFf89qwZDoI/AAAAAAAAALQ/oVhKJ1Cwas8/s1600-h/sdgraph2.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp0.blogger.com/_4NC7AWExX5g/SFf89qwZDoI/AAAAAAAAALQ/oVhKJ1Cwas8/s320/sdgraph2.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5212913230084116098" /></a>
Let's understand this with an example. The following data represent
the heights of 50 males. Let's use R to find the five-number summary of
these data and to check whether the distribution is normal (mound-shaped).
<span style="font-family: trebuchet ms; color:#003300;"><pre>
# We use the 'c' function to build the male vector on which we will operate.
> male <- c(64, 66, 66, 67, 67, 67, 67, 68, 68, 68, 68, 68, 68, 69, 69, 69, 69, 69, 70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71, 71, 71, 72, 72, 72, 72, 72, 72, 73, 73, 73, 74, 74, 74, 74, 74, 75, 76, 76, 77)
> hist(male)
</pre></span>
In the above code sample the 'hist' command draws a histogram that has an
almost normal, mound-like shape. Here is the image that R draws for us.
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_4NC7AWExX5g/SFf9L9WIbPI/AAAAAAAAALY/0mtnMQLeT5g/s1600-h/mail-height.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp0.blogger.com/_4NC7AWExX5g/SFf9L9WIbPI/AAAAAAAAALY/0mtnMQLeT5g/s320/mail-height.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5212913475592416498" /></a>
<span style="font-family: verdana;font-size:100%; color:black;">
Let's find the five-number summary and check whether the Standard Deviation
Rule holds for this data set.
<span style="font-family: trebuchet ms; color:#003300;"><pre>
# A simple summary command gives the five-number summary (plus the mean)
> summary(male)
Min. 1st Qu. Median Mean 3rd Qu. Max.
64.00 68.25 70.50 70.58 72.00 77.00
> sd(male)
[1] 2.857786
# Let's apply the first rule - 68% of data points are within (mean - 1 * SD) and (mean + 1 * SD)
> male >= (mean(male) - (1 * sd(male))) & male <= (mean(male) + (1 * sd(male)))
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[13] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[37] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE
# The above command returns TRUE at each position of the male vector where our condition holds.
# Let's count how many such observations there are.
> length(male[male >= (mean(male) - (1 * sd(male))) & male <= (mean(male) + (1 * sd(male)))])
[1] 34
# So out of 50 observations, 34 are within mean +/- 1 SD, i.e.
> 34/50 * 100
[1] 68
# So, as the rule suggests, 68% of observations are within mean +/- 1 SD.
# Let's check the second rule - 95% of data points are within (mean - 2 * SD) and (mean + 2 * SD)
> length(male[male >= (mean(male) - (2 * sd(male))) & male <= (mean(male) + (2 * sd(male)))])
[1] 48
> 48/50 * 100
[1] 96
# So 96% of data points are within mean +/- 2 SD, close to the 95% the rule suggests.
# Let's check the third rule - 99.7% of data points are within (mean - 3 * SD) and (mean + 3 * SD)
> length(male[male >= (mean(male) - (3 * sd(male))) & male <= (mean(male) + (3 * sd(male)))])
[1] 50
> 50/50*100
[1] 100
# So all 50 data points (100%) are within mean +/- 3 SD, in line with the rule's 99.7%.
</pre></span>
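The three checks above repeat the same expression with a different multiplier, so they can be folded into one small function. A minimal sketch follows; the helper name `within_k_sd` is made up for this illustration, and it relies on the fact that the mean of a logical vector is the fraction of TRUE values.

```r
# Heights of 50 males, copied from the example above
male <- c(64, 66, 66, 67, 67, 67, 67, 68, 68, 68, 68, 68, 68, 69, 69, 69, 69,
          69, 70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71, 71, 71, 72, 72,
          72, 72, 72, 72, 73, 73, 73, 74, 74, 74, 74, 74, 75, 76, 76, 77)

# Hypothetical helper: proportion of observations within k SDs of the mean
within_k_sd <- function(x, k) {
  m <- mean(x)
  s <- sd(x)
  mean(x >= m - k * s & x <= m + k * s)  # mean of TRUE/FALSE = fraction TRUE
}

# Percentages within 1, 2 and 3 standard deviations of the mean
pct <- sapply(1:3, function(k) within_k_sd(male, k) * 100)
pct
```

For this data set the result should match the 68, 96 and 100 percent computed step by step above.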
The following table, taken from the course website, makes this clearer:
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp1.blogger.com/_4NC7AWExX5g/SFf9i15JqXI/AAAAAAAAALg/EG0U3KU3cdg/s1600-h/table-male-height.gif"><img style="display:block; margin:0px auto 20px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp1.blogger.com/_4NC7AWExX5g/SFf9i15JqXI/AAAAAAAAALg/EG0U3KU3cdg/s320/table-male-height.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5212913868728805746" /></a>
<br/><Big>Summary:</Big><br/>
<UL>
<LI> The standard deviation measures the spread by reporting a typical (average) distance between the data points and their average.
<LI> It is appropriate to use the SD as a measure of spread with the mean as the measure of center.
<LI>Since the mean and standard deviations are highly influenced by extreme observations, they should be used as numerical descriptions of the center and spread only for distributions that are roughly symmetric, and have no outliers.
<LI> For symmetric mound-shaped distributions, the Standard Deviation Rule tells us what percentage of the observations falls within 1, 2, and 3 standard deviations of the mean, and thus provides another way to interpret the standard deviation's value for distributions of this type.
</UL>
</SPAN>Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-90397524696895021432008-06-13T00:55:00.003+05:302008-09-10T23:18:09.161+05:30Learning Statistics using R: Standard Deviation and Histogram<span style="font-family: verdana;font-size:90%; color:black;">
<strong><em>Most of the material here is taken from the course website!
</em></strong><br/>
In the following example we will see how a histogram can help clarify
the concept of standard deviation.
<br/>
<strong>Example:</strong><br/>
At the end of a statistics course, the 27 students in the class were
asked to rate the instructor on a number scale of 1 to 9 (1 being
"very poor", and 9 being "best instructor I've ever had"). The
following table provides three hypothetical sets of ratings:
<br/>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_4NC7AWExX5g/SFF4yntKGPI/AAAAAAAAAKw/yTBvI7yGXew/s1600-h/deviationexample.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp3.blogger.com/_4NC7AWExX5g/SFF4yntKGPI/AAAAAAAAAKw/yTBvI7yGXew/s320/deviationexample.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5211079054891555058" /></a>
Following are the histograms of the ratings for each class:<br/>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_4NC7AWExX5g/SFF5AsuuKBI/AAAAAAAAAK4/Hrxi-eOw5ew/s1600-h/class1.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp3.blogger.com/_4NC7AWExX5g/SFF5AsuuKBI/AAAAAAAAAK4/Hrxi-eOw5ew/s320/class1.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5211079296758458386" /></a>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_4NC7AWExX5g/SFF5LW6HJDI/AAAAAAAAALA/mTVtwZOlmCo/s1600-h/class2.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp3.blogger.com/_4NC7AWExX5g/SFF5LW6HJDI/AAAAAAAAALA/mTVtwZOlmCo/s320/class2.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5211079479879214130" /></a>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_4NC7AWExX5g/SFF5WPSIKwI/AAAAAAAAALI/PATkgZnpzms/s1600-h/class3.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp3.blogger.com/_4NC7AWExX5g/SFF5WPSIKwI/AAAAAAAAALI/PATkgZnpzms/s320/class3.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5211079666811022082" /></a>
What can we say about the standard deviation by looking at these histograms and
data sets? <br/>
Let's assume that the mean of all three data sets is 5 (which is
reasonably clear from the histograms), and recall that (roughly) the
standard deviation is the average distance of the data points from their
mean.
<UL>
<LI>In the class I histogram most of the ratings are at 5, which is also
the mean of the data set. So the average distance between the mean and
the data points would be very small (since most of the data points are at
the mean).
<LI>In the class II histogram most of the ratings are far from the
mean of 5: most of the data points are at the two extremes, 1
and 9. So the average distance between the mean and the data points would be
larger.
<LI>In the class III histogram the data points are evenly distributed around
the mean. We can safely say that in this case the average distance between
the mean and the data points would be greater than that of class I but smaller
than that of class II, i.e. its standard deviation lies between those of
class I and class II.
</UL>
<br/>
Let's check our assumption by loading these data sets into R and
verifying the standard deviation of each. The data set, in an Excel file, can be downloaded from <a href="https://oli.web.cmu.edu/repository/webcontent/859044f080020c690045b50e228faf24/_u2_summarizing_data/webcontent/excel/sdintuition.xls">here.</a>
<span style="font-family: trebuchet ms; color:#003300;"><pre>
> class1 <- c(1,1,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,9,9)
> sd(class1)
[1] 1.568929
> summary(class1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 5 5 5 5 9
> class2 <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,5,9,9,9,9,9,9,9,9,9,9,9,9,9)
> sd(class2)
[1] 4
> summary(class2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1 5 5 9 9
> class3 <- c(1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9)
> sd(class3)
[1] 2.631174
> summary(class3)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 3 5 5 7 9
</pre>
</span>
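The same three rating vectors can be built more compactly from their counts, and their standard deviations compared in one call. This is a sketch only; the counts passed to rep() are read off the vectors typed out above, and the list name `ratings` is just a convenience introduced here.

```r
# Rebuild the three rating vectors from their counts using rep()
class1 <- c(rep(1, 2), rep(5, 23), rep(9, 2))   # almost everyone rates 5
class2 <- c(rep(1, 13), 5, rep(9, 13))          # ratings at the two extremes
class3 <- rep(1:9, times = 3)                   # ratings spread evenly

ratings <- list(class1 = class1, class2 = class2, class3 = class3)

# Mean and standard deviation of each class in one call
class_means <- sapply(ratings, mean)
class_sds   <- sapply(ratings, sd)
class_sds

# Classes ordered from smallest to largest spread
spread_order <- names(sort(class_sds))
spread_order
```

Sorting the standard deviations puts class I first and class II last, which is the ordering guessed from the histograms.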
So we have the following standard deviations for our 3 class ratings:
<UL>
<LI>Class I : 1.568929
<LI>Class II : 4.0
<LI>Class III : 2.631174
</UL>
(Note that Excel's result may differ slightly if you use its
stdev function.) So the calculated standard deviations confirm the
assumption we made by looking at the histograms.
</span>Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-51470108709133215862008-06-04T23:50:00.007+05:302008-09-10T23:17:48.408+05:30Learning Statistics using R: Standard Deviation<span style="font-family: verdana;font-size:90%; color:black;">
<strong><em>Most of the material here is taken from the course website!</em></strong><br/>
Earlier we examined the spread using the range (max - min) and
the IQR (the range covered by the middle 50% of the data). We also noted that the IQR
should be used when the median is used as the measure of center. Now we move
to another measure of spread called the standard deviation.
<br/><br/>
The idea behind the standard deviation is to quantify the spread of a
distribution by measuring how far the observations are from the mean of
the distribution. The standard deviation gives the average (or
typical) distance between an observation (data point) and the mean,
X-bar.
<br/><br/>
Let's understand the standard deviation using an example; we calculate
it step by step using R commands. (There is also a single R
function that does the same!)
<br/>
Assume we have the following data set of 8 observations: <br/>
7, 9, 5, 13, 3, 11, 15, 9
<UL>
<LI>Calculate mean: <br/>
We use R's 'c' function to combine the observations into a vector and then use the mean
function to calculate the mean of the data set.
<span style="font-family: trebuchet ms; color:#003300; font-size:100%; weight:bold"><pre>
> dataset <- c(7, 9, 5, 13, 3, 11, 15, 9)
> is.vector(dataset)
[1] TRUE
# we have stored our observation in a vector called dataset
> mean(dataset)
[1] 9
# so we have 9 as the mean of our data set; now we need to find
# the distance of each observation from this value: 9.
> deviation <- dataset - mean(dataset)
> deviation
[1] -2 0 -4 4 -6 2 6 0
# the above command directly gives us the deviation of each observation from
# the mean, which we have stored in another vector called deviation.
</pre></span>
<strong> Since the standard deviation is meant to be
an average (typical) distance between the data points and their mean,
it would make sense to average the deviations we got. Note, however,
that the sum of the deviations from the mean is always 0, so their plain average is always 0 as well. </strong>
<br/><br/>
<LI> Square the deviations:<br/>
To get around this, we square each deviation before averaging;
the following R code computes the squares:<br/>
<span style="font-family: trebuchet ms; color:#003300;font-size:100%; weight:bold"><pre>
# we can use either of following two methods to calculate square of
# deviations.
> deviation ^ 2
[1] 4 0 16 16 36 4 36 0
> deviation * deviation
[1] 4 0 16 16 36 4 36 0
</pre></span>
<LI>Average the squared deviations by adding them up and dividing by
n-1 (one less than the sample size). Let's do that in R.
<span style="font-family: trebuchet ms; color:#003300; font-size:100%; weight:bold"><pre>
> (sum(deviation ^ 2)) / (length(dataset) - 1)
[1] 16
</pre></span>
<Big>This average of the squared deviations is called the variance of the data.</big>
<br/><br/>
<LI> Find standard deviation:<br/>
The standard deviation of the data is the square root of the
variance. So in our case it would be square root of 16.
<span style="font-family: trebuchet ms; color:#003300; font-size:100%; weight:bold"><pre>
> sqrt(16)
[1] 4
</pre></span>
Why do we take the square root? Note that 16 is an average of the
squared deviations, and therefore has different units of
measurement: it is measured in squared units, which
are hard to interpret. We therefore take the square root in
order to compensate for the fact that we squared our deviations, and
to go back to the original units of measurement.
</UL>
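The four steps above can be collected into one small function and checked against R's built-in var() and sd(). This is only a sketch; the function name `manual_sd` is made up for this illustration.

```r
dataset <- c(7, 9, 5, 13, 3, 11, 15, 9)

# Hypothetical helper that follows the four steps literally
manual_sd <- function(x) {
  deviation <- x - mean(x)                      # step 1: deviations from the mean
  squared   <- deviation ^ 2                    # step 2: squared deviations
  variance  <- sum(squared) / (length(x) - 1)   # step 3: divide the sum by n - 1
  sqrt(variance)                                # step 4: back to original units
}

manual_sd(dataset)   # step-by-step result
var(dataset)         # built-in variance (also divides by n - 1)
sd(dataset)          # built-in standard deviation

# The manual calculation and the built-in function should agree
all.equal(manual_sd(dataset), sd(dataset))
```

Both routes give a variance of 16 and a standard deviation of 4 for this data set, matching the values worked out above.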
<br/>
<Big><strong>Properties of the Standard Deviation:</strong></Big>
<OL>
<LI>
It should be clear from the discussion thus far that the standard
deviation should be paired as a measure of spread with the mean as a
measure of center.
<LI> Note that the only way, mathematically, in which the standard
deviation = 0, is when all the observations have the same value.
Indeed in this case not only the standard deviation is 0, but also the
range and the IQR are 0.
<LI>Like the mean, the SD is strongly influenced by outliers in the
data. Consider our last example: 3, 5, 7, 9, 9, 11, 13, 15 (data
ordered). If the largest observation were wrongly recorded as 150,
the mean would jump to 25.9 and the standard deviation
to 50.3. Note that in this simple example it is easy to see
that while the standard deviation is strongly influenced by outliers, the IQR is
not! In both cases the IQR will be the same since, like the median,
the calculation of the quartiles depends only on the order of the data
rather than the actual values.
</OL>
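The effect of the mis-recorded value described in the last property can be checked directly. This is a sketch; the helper name `summarise_spread` is made up here, and R's IQR() uses its default quantile definition, which may give slightly different quartiles than the hand method used in the course.

```r
correct <- c(3, 5, 7, 9, 9, 11, 13, 15)
wrong   <- replace(correct, which.max(correct), 150)  # 15 recorded as 150

# Hypothetical helper: center and spread measures side by side
summarise_spread <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), IQR = IQR(x))
}

comparison <- rbind(correct = summarise_spread(correct),
                    wrong   = summarise_spread(wrong))
round(comparison, 1)
```

The mean and sd columns change sharply between the two rows, while the median and IQR columns stay the same.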
<br/>
<Big><Strong>Choosing Numerical Summaries</strong></big><br/>
<UL>
<LI>Use mean and the standard deviation as measures of center and spread
only for reasonably symmetric distributions with no outliers.
<LI>Use the five-number summary (which gives the median, IQR and
range) for all other cases. </UL>
<Big><strong>R function for Standard Deviation:</strong></Big>
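There is a single R function, "sd", that calculates the standard deviation of a numeric vector; just be careful to pass the "na.rm=TRUE" argument if you have NA values in your data. In older versions of R this function returned a vector of column SDs when given a data frame or matrix; in current versions that behaviour has been removed, so use sapply(df, sd) for a data frame or apply(m, 2, sd) for a matrix. Either way, it is the SD of each column, not of each row.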
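A minimal sketch of both points: handling NA values with na.rm, and getting one SD per column explicitly with sapply() or apply(), which does not depend on how sd() itself treats data frames. The vector `x` and data frame `df` are made-up values for illustration only.

```r
# Made-up vector with a missing value
x <- c(7, 9, NA, 13)
sd(x)                  # NA, because one observation is missing
sd(x, na.rm = TRUE)    # SD of the remaining values 7, 9, 13

# Made-up data frame, only for illustration
df <- data.frame(a = c(1, 2, 3, 4), b = c(2, 4, 6, 8))
col_sds <- sapply(df, sd)      # one SD per column
col_sds

m <- as.matrix(df)
apply(m, 2, sd)                # SD of each column of a matrix
row_sds <- apply(m, 1, sd)     # SD of each row, if that is what you need
```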
</span>Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-24838458563206143202008-05-13T22:05:00.010+05:302009-01-07T22:08:28.085+05:30Learning Statistics using R: Five Number Summary<span style="font-family: verdana;font-size:90%; color:black;">
<strong><em>Most of the material here is taken from the course website!
</em></strong><br/>
Before we go ahead and learn about the graphical representation of the Five
Number Summary, let's check out a few interesting course problems. <br/>
Here they are:<br/>
<OL>
<LI>Example 1: A survey taken of 140 sports fans asked the question:
"What is the most you have ever spent for a ticket to a sporting
event?"<br/>
The five-number summary for the data collected is: <br/>
min = 85, Q1 = 130, median = 145, Q3 = 150, max = 250 <br/>
Should the smallest observation be classified as an outlier?
<LI> Example 2: A survey taken in a large statistics class contained
the question: "What's the fastest you have driven a car (in mph)?" <br/>
The five-number summary for the 87 males surveyed is: <br/>
min = 55, Q1 = 95, median = 110, Q3 = 120, max = 155 <br/>
Should the largest observation in this data set be classified as an
outlier?
</OL>
<br/>
<Big>Important Summary About Spread </Big>
<UL>
<LI> The range covered by the data is the most intuitive measure of
spread and is exactly the distance between the smallest data point -
min and the largest one - Max.
<LI>Another measure of spread is the inter-quartile range (IQR) which is
the range covered by the middle 50% of the data.
<LI>IQR = Q3-Q1, the difference between the third and first
quartiles. The first quartile (Q1) is the value such that one quarter
(25%) of the data points fall below it. The third quartile is the
value such that three quarters (75%) of the data points fall below it.
<LI>The IQR should be used as a measure of spread of a distribution
only when the median is used as a measure of center.
<LI>The IQR can be used to detect outliers using the 1.5 * IQR (1.5 times
IQR) criterion.
</UL>
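The 1.5 * IQR criterion from the summary above can be applied to the two course problems using only their quartiles. This is a sketch; the helper name `iqr_fences` is made up for this illustration.

```r
# Hypothetical helper: outlier fences from the quartiles (1.5 * IQR criterion)
iqr_fences <- function(q1, q3) {
  iqr <- q3 - q1
  c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
}

# Example 1 (ticket prices): min = 85, Q1 = 130, Q3 = 150
ticket <- iqr_fences(130, 150)
ticket
min_is_outlier <- 85 < ticket[["lower"]]    # is the minimum below the lower fence?

# Example 2 (fastest speed): Q1 = 95, Q3 = 120, max = 155
speed <- iqr_fences(95, 120)
speed
max_is_outlier <- 155 > speed[["upper"]]    # is the maximum above the upper fence?

c(example1_min = min_is_outlier, example2_max = max_is_outlier)
```

With these numbers the lower fence for Example 1 is 100 and the upper fence for Example 2 is 157.5, so 85 is flagged as a suspected outlier while 155 is not.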
<br/>
<em><Strong>Five Number Summary:</strong></em>
So far, in our discussion about measures of spread, the key players were:<br/>
<UL>
<LI>The extremes (min and Max) which provide the range covered by
all the data, and
<LI>The quartiles (Q1, M and Q3), which together provide the IQR, the
range covered by the middle 50% of the data.
</UL>
The combination of all five numbers (min, Q1, M, Q3, Max) is
called the five number summary, and provides a quick numerical
description of both the center and spread of a distribution.
<br/>
A <strong>boxplot</strong> can be used to graphically summarize the five
number summary of the distribution of a quantitative variable. Let's start doing boxplots in R.
<br/><br>
<Big>Example: Best Actor Oscar Winners:</big><br/>
This time we use the best actor Oscar winners instead of the actresses and draw a boxplot using R.
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# read the actor.csv file.
> actor <- read.csv("actor.csv", header=TRUE, sep=",")
# refer to the column as actor$Age instead of attaching the data frame
> boxplot(actor$Age, border='blue', xlab='Actor Age')
</pre></span>
Here is how our boxplot of actor data set looks:<br/>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp0.blogger.com/_4NC7AWExX5g/SCnPXt9XM_I/AAAAAAAAAJI/M8W149VrOJQ/s1600-h/a.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp0.blogger.com/_4NC7AWExX5g/SCnPXt9XM_I/AAAAAAAAAJI/M8W149VrOJQ/s320/a.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5199915251156857842" /></a>
The boxplot graphically represents the distribution of a quantitative variable by visually
displaying the five number summary and any observation that was classified as a suspected
outlier using the 1.5(IQR) criterion.
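The numbers a boxplot displays can also be inspected without drawing it. The sketch below uses made-up ages (not the Oscar data, which lives in actor.csv) together with the base R functions fivenum() and boxplot.stats(); the latter is the function boxplot() relies on to compute the box, whiskers and suspected outliers.

```r
# Made-up ages, only for illustration (not the Oscar data)
ages <- c(31, 33, 35, 36, 38, 40, 41, 43, 45, 47, 52, 80)

five <- fivenum(ages)          # min, lower hinge, median, upper hinge, max
five

stats <- boxplot.stats(ages)   # the statistics a boxplot is drawn from
stats$stats                    # lower whisker, hinges, median, upper whisker
stats$out                      # observations flagged as suspected outliers

# boxplot(ages) would draw the same information graphically
```

Note that fivenum() and boxplot.stats() use hinges, which can differ slightly from the quartiles reported by summary().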
<br/><br/>
<Big>Example: Best Actress Oscar Winners:</big><br/>
We use our actress data set again to draw another box plot:
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# read the actress.csv file.
> actress <- read.csv("actress.csv", header=TRUE, sep=",")
# refer to the column as actress$Age instead of attaching the data frame
> boxplot(actress$Age, border='magenta', xlab='Actress Age')
</pre></span>
Here is the actress boxplot drawn by R:
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp2.blogger.com/_4NC7AWExX5g/SCnU0N9XNCI/AAAAAAAAAJg/WI-S_T1eeiU/s1600-h/actressbox.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp2.blogger.com/_4NC7AWExX5g/SCnU0N9XNCI/AAAAAAAAAJg/WI-S_T1eeiU/s320/actressbox.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5199921238341268514" /></a>
<Big>The following graph, taken from the course website, highlights various details of the boxplot for the actress data set.</Big>
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_4NC7AWExX5g/SCnUAd9XNAI/AAAAAAAAAJQ/1jstN42cF44/s1600-h/boxplot6.gif"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp3.blogger.com/_4NC7AWExX5g/SCnUAd9XNAI/AAAAAAAAAJQ/1jstN42cF44/s320/boxplot6.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5199920349283038210" /></a>
There are <a href="https://oli.web.cmu.edu/jcourse/workbook/activity/page?context=859d4ac480020c6900ccc10e7813a437">a couple of interactive examples</a> of boxplots on the course website; try doing them.
<br/><br/>
<big>Example: Best Actress and Actor Oscar Winners: Side by Side Comparative Boxplots.</big>
So far we have examined the age distributions of Oscar winners for males and females separately.
It will be interesting to compare the age distributions of actors and actresses who won the best acting Oscar. To do that we will look at side-by-side boxplots of the age distributions by gender.
<br/>
It's quite easy to do side-by-side boxplots in R. The following code shows how:
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# read the actor.csv and actress.csv files.
> actor <- read.csv("actor.csv", header=TRUE, sep=",")
> actress <- read.csv("actress.csv", header=TRUE, sep=",")
# Following is one single command
# Bug-Fix: Gabriele Righetti
> boxplot(actor$Age, actress$Age, border=c('blue','magenta'), names=c('Actor','Actress'), ylab='Age',main='Side-By-Side (Comparative) Boxplots\nAge of Best Actor/Actress Winners (1970-2001)')
</pre></span>
This is how our final output from R looks:
<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://bp3.blogger.com/_4NC7AWExX5g/SCnUUd9XNBI/AAAAAAAAAJY/MWrMq2iatXI/s1600-h/sidebyside.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp3.blogger.com/_4NC7AWExX5g/SCnUUd9XNBI/AAAAAAAAAJY/MWrMq2iatXI/s320/sidebyside.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5199920692880421906" /></a>
<br/>
Recall also that we found the five-number summary and means for both distributions:
<UL><LI>Actors: min=31, Q1=37.25, M=42.5, Q3=50.25, Max=76
<LI> Actresses: min=21, Q1=32, M=35, Q3=41.5, Max=80
</UL>
Based on the graph and numerical measures, we can make the following comparison between the two distributions:<br>
<UL><LI>Center: The graph reveals that the males' age distribution sits higher than the females' age distribution. This is supported by the numerical measures. The median age for females (35) is lower than for the males (42.5). In fact, even the third quartile of the females' distribution (41.5) is lower than the median age for males. We therefore conclude that in general, actresses win the Best Actress Oscar at a younger age than the actors do.
<LI>Spread: Judging by the range of the data, there is more variability in the females' distribution (range = 80 - 21 = 59) than there is in the males' distribution (range = 76 - 31 = 45). On the other hand, if we look at the IQR, which measures the variability only among the middle 50% of the distribution, we see more spread among males (IQR=13) than the females (IQR=9.5). We conclude that among all the winners, the actors' ages are more alike than the actresses' ages. However, the middle 50% of the age distribution of actresses is more homogeneous than the actors' age distribution.
<LI>Outliers: We see that we have outliers in both distributions. There is only one high outlier in the actors' distribution (76, Henry Fonda, On Golden Pond), compared with three high outliers in the actresses' distribution.
</UL>
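<br/>
The range, IQR and outlier figures used in the comparison above can be rechecked from the two five-number summaries alone. The following is only a small sketch: the helper name spread_from_summary is made up for illustration, and the upper fence is the usual Q3 + 1.5 * IQR cut-off.
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# five-number summaries as quoted above (min, Q1, M, Q3, max)
actors    <- c(min = 31, Q1 = 37.25, M = 42.5, Q3 = 50.25, max = 76)
actresses <- c(min = 21, Q1 = 32,    M = 35,   Q3 = 41.5,  max = 80)

# hypothetical helper: range, IQR and upper 1.5 * IQR fence from a summary
spread_from_summary <- function(s) {
  iqr <- unname(s["Q3"] - s["Q1"])
  c(range = unname(s["max"] - s["min"]),
    IQR = iqr,
    upper_fence = unname(s["Q3"]) + 1.5 * iqr)
}

spread_from_summary(actors)     # range 45, IQR 13,  upper fence 69.75
spread_from_summary(actresses)  # range 59, IQR 9.5, upper fence 55.75
</pre></span>
Both maxima (76 and 80) lie above their upper fences, which agrees with the high outliers seen in the boxplots.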
</span>Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-20384793537531380282008-05-05T19:29:00.005+05:302008-05-05T20:07:43.637+05:30Send Your Name to the Moon<p>NASA invites people of all ages to join the lunar exploration journey with an opportunity to <a href="http://lro.jhuapl.edu/NameToMoon/index.php">send their names to the moon</a> aboard the Lunar Reconnaissance Orbiter, or LRO, spacecraft.
</p>
<center><Strong>Here is my certificate of participation!</strong></center>
<a href="http://bp1.blogger.com/_4NC7AWExX5g/SB8a9ntiaUI/AAAAAAAAAJA/RW2saKgdi58/s1600-h/mymoon.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp1.blogger.com/_4NC7AWExX5g/SB8a9ntiaUI/AAAAAAAAAJA/RW2saKgdi58/s400/mymoon.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5196902140943165762" /></a>
<br/>Krishna Daglinoreply@blogger.com1tag:blogger.com,1999:blog-542053665666511781.post-64893315050669690722008-05-04T00:16:00.004+05:302008-09-10T23:17:07.176+05:30Learning Statistics using R:Detecting Outliers with IQR<span style="font-family: verdana;font-size:90%; color:black;">
<strong><em>Most of the material here is taken from the course website!
</em></strong><br/>
An outlier is an observation (data point) in a set of observations (data
set) whose value is far removed from the other observations. An
outlier is either a very large or a very small value and, as noted earlier,
it affects the mean of the data set. More about outliers is <a
href="http://en.wikipedia.org/wiki/Outlier">here.</a>
<br/><br/>
<strong>The (1.5 * IQR) criterion for finding outliers:</strong><br/>
An observation is a suspected outlier if it is:<br/>
<OL>
<LI>Below Q1 - (1.5 * IQR): that is, Q1 minus 1.5 times IQR.
<LI>Above Q3 + (1.5 * IQR): that is, Q3 plus 1.5 times IQR.
</OL>
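<br/>
The two rules above can be wrapped in a small function. This is only a sketch under assumptions: the function name iqr_fences is made up for illustration, and the ages are typed in from the stemplot of the Best Actress data shown in the stemplot post of this blog (so they are in sorted order, not in the order of the CSV file).
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# Best Actress ages, read off the stemplot (assumption: sorted, not CSV order)
age <- c(21, 25, 26, 26, 26, 29, 30, 31, 33, 33, 33, 33, 34, 34, 34, 35,
         35, 35, 37, 38, 39, 41, 41, 41, 42, 43, 45, 49, 49, 61, 74, 80)

# hypothetical helper: lower and upper k * IQR fences (k = 1.5 by default)
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)  # Q1 and Q3
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

f <- iqr_fences(age)
f                                         # lower 19.375, upper 54.375
age[age < f["lower"] | age > f["upper"]]  # suspected outliers: 61 74 80
</pre></span>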
<br/>
The following picture illustrates the 1.5 * IQR rule:
<a href="http://bp3.blogger.com/_4NC7AWExX5g/SBy063tiaTI/AAAAAAAAAI4/CRhsyi23TUA/s1600-h/IQR-Outlier.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp3.blogger.com/_4NC7AWExX5g/SBy063tiaTI/AAAAAAAAAI4/CRhsyi23TUA/s320/IQR-Outlier.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5196226993559071026" /></a>
<br/>
<big>Example: Best Actress Oscar Winners:</big><br/>
We continue with our Best Actress Oscar winner data set. Here we will
try to find the names of the actresses whose ages lie beyond the 1.5 * IQR
range.
<span style="font-family: trebuchet ms; color:#003300; font-size:100%; font-weight:bold;">
<pre>
# we have the data in the 'actress' data frame; first check the quartiles
> quantile(Age, names=T)
   0%   25%   50%   75%  100% 
21.00 32.50 35.00 41.25 80.00 
# Let's check the IQR
> IQR(Age)
[1] 8.75
# Now let's see how to retrieve the value of Q1, the first quartile
> quantile(Age, 0.25)
 25% 
32.5 
# (Is this the correct method?)
# As can be seen above, passing the value 0.25 returns the first quartile.
# now let's see what the value of Q1 - (1.5 * IQR) is
> quantile(Age, 0.25) - (IQR(Age) * 1.5)
   25% 
19.375 
# Okay, so this also works!
# now how do we get the names of all actresses whose age is less than 19.375?
> which(Age < (quantile(Age, 0.25) - IQR(Age) * 1.5))
integer(0)
# So in our data set there is no actress whose age is less than 19.375, since the youngest is 21!
# now let's try to find the upper/higher suspected outliers.
# Remember it's Q3 + (IQR * 1.5)
> quantile(Age, 0.75) + (IQR(Age) * 1.5)
   75% 
54.375 
# Okay, so far so good; let's get the ages that are greater than 54.375
> Age > quantile(Age, 0.75) + (IQR(Age) * 1.5)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[13] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# See, it returns TRUE at the indices where our condition matches.
> which(Age > quantile(Age, 0.75) + (IQR(Age) * 1.5))
[1] 12 16 20
# the which command returns the same information as actual index numbers; but how do we get the names?
> actress[which(Age > quantile(Age, 0.75) + (IQR(Age) * 1.5)),]
   Year            Name                   Movie Age
12 1981 Kathryn Hepburn          On Golden Pond  74
16 1985  Geraldine Page A Trip to the Bountiful  61
20 1989   Jessica Tandy      Driving Miss Daisy  80
# so the comma (,) at the end does the trick. Let's try to do it without the "which" command.
> actress[Age > (quantile(Age, .75, names=T) + (IQR(Age) * 1.5)),]
   Year            Name                   Movie Age
12 1981 Kathryn Hepburn          On Golden Pond  74
16 1985  Geraldine Page A Trip to the Bountiful  61
20 1989   Jessica Tandy      Driving Miss Daisy  80
# so finally we got the suspected outliers!
</pre>
</span>
<strong><em>Other methods:</em></strong><br/>
We can also draw a boxplot to detect outliers visually, or try the other commands shown below. (But these do not seem that helpful?)
<span style="font-family: trebuchet ms; color:#003300; font-size:100%; font-weight:bold;"><pre>
> boxplot(Age)
# or
> plot(lm(Age ~ 1))
# or using library car
# (newer versions of car name this function outlierTest)
> library("car")
> outlier.test(lm(Age ~ 1))
max|rstudent| = 3.943262, degrees of freedom = 30,
unadjusted p = 0.0004461754, Bonferroni p = 0.01427761
Observation: 20
</pre></span>
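For the boxplot route, the points that R would draw as outliers can also be listed directly. A small sketch, assuming the ages typed in from the stemplot of this data set elsewhere in this blog (sorted, not in CSV order); boxplot.stats is the base R function that boxplot relies on for its statistics.
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# Best Actress ages, read off the stemplot (assumption: sorted, not CSV order)
age <- c(21, 25, 26, 26, 26, 29, 30, 31, 33, 33, 33, 33, 34, 34, 34, 35,
         35, 35, 37, 38, 39, 41, 41, 41, 42, 43, 45, 49, 49, 61, 74, 80)

b <- boxplot.stats(age)   # default coef = 1.5, i.e. the 1.5 * IQR rule
b$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out    # the points beyond the whiskers: 61 74 80
</pre></span>
Note that boxplot.stats works with hinges rather than with the output of quantile, so its cut-offs differ slightly from the 54.375 computed above, but for this data the same three ages are flagged.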
</span>Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-88179837028440874042008-05-02T22:34:00.009+05:302008-05-02T23:11:17.704+05:30Learning Statistics using R: IQR<span style="font-family: verdana;font-size:90%; color:black;">
<strong><em>I am including the notes from the course itself. None of the material is
mine, other than the errors! </em></strong><br/>
As seen earlier, the range gives us the overall extent of the
distribution, while the IQR measures the spread of the distribution by giving
us the range covered by the middle 50% of the data.
<br/>
The <a
href=https://oli.web.cmu.edu/repository/webcontent/859044f080020c690045b50e228faf24/_u2_summarizing_data/_m1_examining_distributions/webcontent/spread2.gif>
picture </a> taken from the course website makes this clearer.
<a href="http://bp1.blogger.com/_4NC7AWExX5g/SBtOontiaNI/AAAAAAAAAII/aCXs4zCliRc/s1600-h/IQR-1.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp1.blogger.com/_4NC7AWExX5g/SBtOontiaNI/AAAAAAAAAII/aCXs4zCliRc/s320/IQR-1.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5195833054863714514" /></a>
<br/>
Here is how the course suggests finding the IQR:
<br/>
<OL>
<LI>First sort the data so that we can easily find the median. The
median divides the data set in such a way that 50% of the data
points are below the median and 50% of the data points are above the
median. That is, the data set divides into two equal halves: the lower
half, containing min to median, and the top half, containing
median to max.
<LI>Find the median of the lower 50% of the data (the first
half). This is called the first quartile of the distribution and is
denoted by Q1. It is called the first
quartile since one quarter of the data points fall below it.
<LI>Repeat this for the top 50% of the data: find the median of
the top 50% of the data. This is called the third quartile of the
distribution and is denoted by Q3. Q3 is called the third quartile
since three quarters of the data points fall below it.
<LI>The IQR is the range covered by the middle 50% of the data, between Q1 and Q3, and is calculated as
follows: IQR = Q3 - Q1.
</OL>
<br/>
Here is another <a href="
https://oli.web.cmu.edu/repository/webcontent/859044f080020c690045b50e228faf24/_u2_summarizing_data/_m1_examining_distributions/webcontent/spread4.gif">picture</a> taken from the course website that visually explains how the first quartile,
Q1, is found:
<a href="http://bp3.blogger.com/_4NC7AWExX5g/SBtPTHtiaOI/AAAAAAAAAIQ/xAs-ZgP9Vt0/s1600-h/IQR-2.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp3.blogger.com/_4NC7AWExX5g/SBtPTHtiaOI/AAAAAAAAAIQ/xAs-ZgP9Vt0/s320/IQR-2.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5195833785008154850" /></a>
<br/><br/>
The course makes a few very important observations:
<UL>
<LI>From the first picture we can see that Q1, M, and Q3 divide the
data into four quarters with 25% of the data points in each, where the
median is essentially the second quartile. The use of IQR=Q3-Q1 as a
measure of spread is therefore particularly appropriate when the
median M is used as a measure of center.
<LI>We can define a bit more precisely what is considered the bottom
or top 50% of the data. The bottom (top) 50% of the data is all the
observations whose position in the ordered list is to the LEFT (RIGHT)
of the location of the overall median M. The following picture will
visually illustrate this for the simple cases of n=7 and n=8.
</UL>
<a href="http://bp1.blogger.com/_4NC7AWExX5g/SBtQGntiaPI/AAAAAAAAAIY/YaQCwWg8tUs/s1600-h/IQR-3.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp1.blogger.com/_4NC7AWExX5g/SBtQGntiaPI/AAAAAAAAAIY/YaQCwWg8tUs/s320/IQR-3.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5195834669771417842" /></a>
Note that when n is odd (like in n=7 above), the median is not
included in either the bottom or top half of the data; when n is even
(like in n=8 above), the data are naturally divided into two halves.
<br/><br/>
Example: Best Actress Oscar Winners:
The course uses a stemplot to find the IQR; we use a simple R command
instead. Please note that the IQR found using R is different from the
one in the course example.
<br/>
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# we have data in 'actress' data frame.
> quantile(Age, names=T)
0% 25% 50% 75% 100%
21.00 32.50 35.00 41.25 80.00
> IQR(Age)
[1] 8.75
</pre></span>
<br/>
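As can be seen in the above code, the 'quantile' R command outputs the following quartiles:
<UL>
<LI> Q1 32.50 (shown as 25%; course calculated value is 32)
<LI> Q3 41.25 (shown as 75%; course calculated value is 41.5)
</UL>
R's IQR function returns 8.75, while the course's calculated value is 9.5 (41.5 - 32). The difference arises because R's quantile function, by default, interpolates between data points instead of taking the median of each half.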
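<br/>
The course's procedure (the median of each half) can be written out directly to see where the difference comes from. This is a sketch under assumptions: course_quartiles is a made-up helper name, and the ages are typed in from the stemplot of this data set shown in the stemplot post of this blog.
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# Best Actress ages, read off the stemplot (assumption: sorted, not CSV order)
age <- c(21, 25, 26, 26, 26, 29, 30, 31, 33, 33, 33, 33, 34, 34, 34, 35,
         35, 35, 37, 38, 39, 41, 41, 41, 42, 43, 45, 49, 49, 61, 74, 80)

# hypothetical helper: Q1 and Q3 as the medians of the lower and top halves
course_quartiles <- function(x) {
  x <- sort(x)
  n <- length(x)
  half <- n %/% 2                  # for odd n the median itself is left out
  lower <- x[seq_len(half)]
  upper <- x[(n - half + 1):n]
  c(Q1 = median(lower), Q3 = median(upper))
}

course_quartiles(age)         # Q1 = 32, Q3 = 41.5, so IQR = 9.5
quantile(age, c(0.25, 0.75))  # 32.5 and 41.25 (R's default), so IQR = 8.75
fivenum(age)                  # 21 32 35 41.5 80
</pre></span>
For this data the hinges reported by fivenum agree with the course's quartiles, so fivenum is one way to get the course's numbers without writing a helper.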
</span>Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-12627257398448498262008-04-29T22:10:00.004+05:302008-09-10T23:16:42.204+05:30Learning Statistics using R: Measure of Spread<span style="font-family: verdana;font-size:90%; color:black;">
<strong><em>
Most of the material here is taken from the course website!</em></strong>
<br/>
To describe a distribution, along with a measure of center we also need
to know its spread, also known as the variability of the distribution. As the course
describes, there are 3 commonly used measures of spread/variability,
each describing spread differently:
<br/>
<UL>
<LI> Range
<LI> Inter-quartile range (IQR)
<LI> Standard deviation
</UL>
<br/>
<UL>
<LI>Range:</LI>
The range is the simplest measure of spread and is exactly the distance
(difference) between the smallest data point (min) and the largest data point (max).
Let's find the range of our Best Actress data set:
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
> actress <- read.csv("actress.csv", sep=",", header=T)
> attach(actress)
> summary(Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  21.00   32.50   35.00   38.53   41.25   80.00 
> range(Age)
[1] 21 80
> diff(range(Age))
[1] 59
</pre>
</span>
Yes, the summary command gives us all the details, but we want to learn a few
more R commands. As can be seen in the above example, the "range" function
gives the minimum and maximum values of the "Age" distribution. If we
subtract the min from the max we get the number of years covered, as shown by
the "diff" command.
<br/>
80 (max) - 21 (min) = 59 (Range)
</UL>
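For reference, all three measures of spread listed above can be computed in a few lines. A small sketch, assuming the ages typed in from the stemplot of the Best Actress data (shown in the stemplot post of this blog) rather than read from actress.csv:
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# Best Actress ages, read off the stemplot (assumption: sorted, not CSV order)
age <- c(21, 25, 26, 26, 26, 29, 30, 31, 33, 33, 33, 33, 34, 34, 34, 35,
         35, 35, 37, 38, 39, 41, 41, 41, 42, 43, 45, 49, 49, 61, 74, 80)

max(age) - min(age)  # range: 80 - 21 = 59, same as diff(range(age))
IQR(age)             # inter-quartile range: 8.75
sd(age)              # standard deviation
</pre></span>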
</span>Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-9969479177212176762008-04-28T23:44:00.002+05:302008-09-10T23:13:18.189+05:30Learning Statistics using R: Comparing Mean and Median<span style="font-family: verdana;font-size:90%; color:black;">
The mean and the median are measures of center, each describing the center in a different
way. The mean is the average value of all observations, so the actual
values of the observations affect it, while the median
is the middle value in an <b>ordered</b> data set.
<br/><br/>
Let's understand this with a few simple examples:
<br/>
<UL>
<LI> Assume we have a dataset with these three values: 1, 2, 5. The
median is 2 and the mean is (1 + 2 + 5) / 3 = 8 / 3 ≈ 2.67.
<LI> If we just change the last observation from 5 to 50, the
median is still 2 but the mean changes to (1 + 2 + 50) / 3 = 53 / 3 ≈ 17.67.
</UL>
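The two examples above take only a few lines to check in R; this is just a quick sketch with the same numbers:
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
x <- c(1, 2, 5)
y <- c(1, 2, 50)     # same data with the last value changed to 50

median(x); mean(x)   # 2 and 2.666667
median(y); mean(y)   # 2 and 17.66667
</pre></span>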
<br/><br/>
The course brings out the main point: <big>"The mean is very
sensitive to outliers (as it factors in their magnitude), while the
median is resistant to outliers."</big>
<br/><br/>
So as the course explains: <UL> <LI>For symmetric distributions with no
outliers: X̄ is
approximately equal to M.
<LI> For skewed right distributions and/or datasets with high
outliers: X̄ > M.
<LI>For skewed left distributions and/or datasets with low outliers:
X̄ < M.
</UL>
Hence the mean is an appropriate measure of center for symmetric distributions with no outliers, while the
median is the better measure of center in the other cases.
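<br/>
The Best Actress ages make a handy check of the second rule, since that data set has three high outliers. A small sketch, assuming the ages typed in from the stemplot of that data set shown in the stemplot post of this blog:
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# Best Actress ages, read off the stemplot (assumption: sorted, not CSV order)
age <- c(21, 25, 26, 26, 26, 29, 30, 31, 33, 33, 33, 33, 34, 34, 34, 35,
         35, 35, 37, 38, 39, 41, 41, 41, 42, 43, 45, 49, 49, 61, 74, 80)

mean(age)                # 38.53125
median(age)              # 35
mean(age) > median(age)  # TRUE: the high outliers pull the mean above the median
</pre></span>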
</span>Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-3238517651193573292008-04-27T00:20:00.002+05:302009-01-07T20:36:33.740+05:30Learning Statistics using R: Numerical Measure<span style="font-family: verdana;font-size:90%; color:black;">
<OL type="1"> Numerical Measures: We proceed to numerical measures of
the distribution of a quantitative variable. <BIG>From this point
onward I am including the notes from the course itself. None of the
material is mine, other than the errors!</BIG> The distribution of a
quantitative variable is described by its shape, center, and
spread. With a histogram we can describe the shape of the distribution,
but we can only get a rough estimate of the center and spread. So
along with the graphical display we need a more precise numerical
description of the center and spread of the distribution.
<br/><br/>
In this section we will learn:
<UL>
<LI> how to quantify the center and spread of a distribution with various numerical measures.
<LI> some of the properties of these numerical measures, and
<LI> how to choose the appropriate numerical measures of center and spread to supplement the histogram.
</UL>
<br/><br/>
<OL>
<LI type="a">Measure of Center: The two main numerical measures for
the center of a distribution are the mean and the median. Each one of
these measures is based on a completely different idea of describing
the center of a distribution. We will first present each one of the
measures, and then compare their properties.
<br/><br/>
<OL><LI>Mean:
The mean is the sum of the observations divided by the number of
observations. If the n observations are X1, X2, ... Xn, their mean,
which we denote by X̄ (and read X-bar), is
therefore: X̄ =
(X1+X2+...+Xn)/n.
<br/><br/>
<strong>
Example: Best Actress Oscar Winners: We continue with our
Best Actress Oscar Winners <a
href="https://oli.web.cmu.edu/jcourse/workbook/activity/page?context=859d4f0980020c69005402d08707e3c0">
dataset.</a></strong>
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# read the actress.csv file into an actress data frame. [Bug-Fix: Gabriele Righetti]
> actress <- read.csv("actress.csv", header=T, sep=",")
# with the following command we do not have to keep writing actress$Age to refer to the Age column of actress.
> attach(actress)
# a single summary command can give us all the details, but we want to learn a few more R commands.
> mean(Age)
[1] 38.53125
</pre>
</span>
As can be seen from the above example, "mean" is an R command that
gives the average of the distribution (a measure of center). <br/><br/>
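The formula can also be spelled out in R and compared with the built-in function. A minimal sketch, assuming the ages typed in from the stemplot of the Best Actress data (shown in the stemplot post of this blog) instead of the attached data frame:
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# Best Actress ages, read off the stemplot (assumption: sorted, not CSV order)
age <- c(21, 25, 26, 26, 26, 29, 30, 31, 33, 33, 33, 33, 34, 34, 34, 35,
         35, 35, 37, 38, 39, 41, 41, 41, 42, 43, 45, 49, 49, 61, 74, 80)

sum(age) / length(age)  # (X1 + X2 + ... + Xn) / n = 1233 / 32 = 38.53125
mean(age)               # same value
</pre></span>
<br/><br/>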
<LI>Median:
The median M is the midpoint of the distribution. It is the number
such that half of the observations fall above and half fall below. To
find the median:
<UL>
<LI> Order the data from smallest to largest.
<LI>Consider whether n, the number of observations, is even or odd. <br/>
<UL><LI>If n is odd, the median M is the center observation in
the ordered list. This observation is the one "sitting" in the
(n+1)/2 spot in the ordered list.
<LI>If n is even, the median M is the mean of the two center
observations in the ordered list. These two observations are
the ones "sitting" in the n/2 and n/2 + 1 spots in the ordered
list. </UL>
<br/>
<strong>Finding median using Best Actress data set:</strong>
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# we already have data read in the actress data frame.
> attach(actress)
> median(Age)
[1] 35
</pre>
</span>
As seen in the above code, we can use the "median" command of R to find the
median value of the distribution.
<br/><br/>
<strong>Example: Finding median.</strong>
Here are the numbers of hours that 9 students spend on the computer on a typical day: <br/>
1, 6, 7, 5, 5, 8, 11, 12, 15
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# store the numbers of hours spent in an hours vector.
> hours <- c(1, 6, 7, 5, 5, 8, 11, 12, 15)
> median(hours)
[1] 7
> mean(hours)
[1] 7.777778
# as we have 9 observations in total, the median is the (n+1)/2th observation (in the sorted data), i.e. the 5th.
</pre>
</span>
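The steps for finding the median can also be written as a small function, to compare with R's own "median". This is only a sketch; the name manual_median is made up for illustration.
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# hypothetical helper following the steps above
manual_median <- function(x) {
  x <- sort(x)                     # order the data from smallest to largest
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]                 # n odd: the observation in the (n+1)/2 spot
  } else {
    mean(x[c(n / 2, n / 2 + 1)])   # n even: mean of the n/2 and n/2 + 1 spots
  }
}

hours <- c(1, 6, 7, 5, 5, 8, 11, 12, 15)
manual_median(hours)         # 7, the 5th value of the sorted data
manual_median(c(hours, 20))  # n = 10: (7 + 8) / 2 = 7.5
</pre></span>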
</UL>
</OL>
</OL>
</OL>Krishna Daglinoreply@blogger.com0tag:blogger.com,1999:blog-542053665666511781.post-10203483781061712852008-04-23T20:17:00.003+05:302008-04-28T00:42:27.041+05:30Learning Statistics using R : One Quantitative Variable:<span style="font-family: verdana;font-size:90%; color:black;">
The
course in this section teaches how to explore data collected from a
quantitative variable and summarize important features of its
distribution. This section starts with graphs and then moves on to
numerical measures of the distribution and the five-number summary.
As the course suggests, to display data from one quantitative variable
graphically we can use either the <big>histogram, stem-plot, or
box-plot.</big> We will try to do the course examples using R.
<br/><br/>
<OL type="a">
<LI>Histogram:
For a histogram we break the range of values into intervals and count how
many observations fall into each interval. The first example
illustrates a histogram with the exam grades of 15 students. In the
example, the bins/intervals are 10 points wide, and each
includes its first value while excluding its last value.
<br/><br/>
Here are details:
<br/>
<p>
Exam grades of 15 students: 88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73.
The bins (intervals) that are chosen, along with the count of observations in each:
</p>
<table style="align:center">
<tr><td>Score</td><td>Count</td></tr>
<tr><td>[40-50)</td><td> 1</td></tr>
<tr><td>[50-60)</td><td> 2</td></tr>
<tr><td>[60-70)</td><td> 4</td></tr>
<tr><td>[70-80)</td><td> 5</td></tr>
<tr><td>[80-90)</td><td> 2</td></tr>
<tr><td>[90-100]</td><td> 1</td></tr>
</table>
As shown above, the intervals are closed at one end and open at the other
end. Let's try to plot the histogram using R.
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
>grades<-c(88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73)
# The above command uses R's 'c' (combine) function to create a vector of grades.
>hist(grades, right=FALSE)
# Just a simple command draws the following histogram.
</pre></span>
<a href="http://bp2.blogger.com/_4NC7AWExX5g/SBC-CntiaII/AAAAAAAAAHA/Fc_07TnY2bo/s1600-h/grades-hist.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp2.blogger.com/_4NC7AWExX5g/SBC-CntiaII/AAAAAAAAAHA/Fc_07TnY2bo/s320/grades-hist.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5192859322587179138" /></a>
<br/>
Note that if you do not use the "right=FALSE" argument in the above "hist"
command, then the observation with value 60 will be counted in the
(50, 60] interval. The R help page says the "right" argument is logical;
if TRUE, the histogram cells are right-closed (left-open)
intervals. In our command we used "right=FALSE", which means the
intervals are of the form [a, b), which is what we wanted.
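<br/>
To see the effect of "right" without looking at a plot, the bin counts can be pulled out of "hist" directly. A small sketch; the explicit break points (40, 50, ..., 100) are chosen here to match the table above, and plot=FALSE only suppresses the drawing.
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
grades <- c(88, 48, 60, 51, 57, 85, 69, 75, 97, 72, 71, 79, 65, 63, 73)
brk <- seq(40, 100, by = 10)   # explicit break points 40, 50, ..., 100

# intervals of the form [a, b): 60 is counted in [60, 70)
hist(grades, breaks = brk, right = FALSE, plot = FALSE)$counts  # 1 2 4 5 2 1

# intervals of the form (a, b]: 60 is counted in (50, 60]
hist(grades, breaks = brk, right = TRUE, plot = FALSE)$counts   # 1 3 3 5 2 1
</pre></span>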
<br/><br/>
<LI>More on histogram: The Center of distribution is its midpoint -
the value that divides the distribution so that approximately half the
observations take smaller values, and approximately half observations
take larger values.
<span style="font-family: trebuchet ms; color:#003300;">
<pre># we use our above "grades" vector.
>summary(grades)
Min. 1st Qu. Median Mean 3rd Qu. Max.
48.0 61.5 71.0 70.2 77.0 97.0
</pre>
</span>
As you can see in the above example, the "summary" command displays various
important numbers. The output displays the center of the distribution
along with the min, the max and other values.
<br/><br/>
<LI>Best Actor Oscar Winners: Applying a histogram to actual data. The
actual data set is
<a href="src=859d4a6a80020c69009b0d6efa1ec180&dst=bestactressdataset">here.</a>
As usual we save the Excel file as a CSV (comma-separated) file. In the
following text we show how to do interesting things with a few R
commands.
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# read the actor.csv file in an actor data frame.
>actor <- read.csv ("actor.csv", header=T, sep=",")
# we run the summary command, just to see the five-number summary (plus the mean).
>summary(actor)
Age
Min. :31.00
1st Qu. :37.75
Median :42.50
Mean :44.72
3rd Qu. :48.75
Max. :76.00
# with following command we do not have to keep writing actor$Age to refer to Age column of actor.
>attach(actor)
# as we have only one column, summary(Age) would also give same output as above.
</pre></span>
As explained <a
href="https://oli.web.cmu.edu/jcourse/workbook/activity/page?context=859d4a7580020c69003149349f5f3f16">here</a>,
our minimum data point is 31, and our max is 76. We'll use a bin width
of 5, and make bins from 30 to 80. So we have 10 bins in total.
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
> hist(Age, breaks=10, xlim=c(30,80))
</pre></span>
In the above command we refer directly to the "Age" column. The "breaks"
parameter can specify either the number of bins or the actual break points;
when it is a single number, R treats it only as a suggestion and picks
"pretty" break points. The
"xlim" parameter defines the range of x values shown, with sensible
defaults. The help page also mentions that "xlim" is not used to define the
histogram breaks, but only for plotting.
Here is what the final histogram looks like. Play with the "xlim" and "ylim"
options and see what happens to the histogram. In our example the
frequency count goes only up to 8 on the Y-axis; try adjusting it to 9.
<br/><br/>
<a href="http://bp3.blogger.com/_4NC7AWExX5g/SBDAK3tiaKI/AAAAAAAAAHM/yW5yHONBDWg/s1600-h/actor-hist.png"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://bp3.blogger.com/_4NC7AWExX5g/SBDAK3tiaKI/AAAAAAAAAHM/yW5yHONBDWg/s320/actor-hist.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5192861663344355490" /></a>
<br/><br/>
<LI>Stemplot: The stemplot, also called a stem-and-leaf plot,
is another graphical display of the distribution of quantitative
data. As the course suggests, to draw a stemplot we separate each data
point into a stem and a leaf, where <br/>
<UL><LI><Big>Leaf - right-most digit.</BIG>
<LI><Big>Stem - anything but the right-most digit.</Big>
</UL>
<br/>
So if the data point is 34 then the leaf is 4 and the stem is 3, but if the data
point is 3.41 then the leaf is 1 and the stem is 3.4.
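<br/>
For two-digit whole numbers such as ages, the split into stem and leaf is just integer division and remainder. A tiny sketch with three example values (34 from the text above, plus 21 and 80 from the Best Actress ages):
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
x <- c(34, 21, 80)
x %/% 10   # stems:  3 2 8  (everything but the right-most digit)
x %% 10    # leaves: 4 1 0  (the right-most digit)
</pre></span>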
<br/><br/> <LI>Best Actress Oscar Winners: Stemplot of actual data.
We now use the Best Actress data set. The actual data set is here. It is also
included in the document at the end.
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# read the actress.csv file in an actress data frame.
>actress <- read.csv ("actress.csv", header=T, sep=",")
# with the following command we do not have to keep writing actress$Age to refer to the Age column of actress.
>attach(actress)
# as we are interested in "Age" column of the data frame, we will work with it only, but check summary
>stem(Age, scale=2)
</pre></span>
As you can see, a simple "stem" R command draws the stemplot. The
"scale" option/argument to the function expands the scale of the
plot. Here is what our stemplot looks like:
<pre>
When we use scale=2
The decimal point is 1 digit(s) to the right of the |
2 | 1
2 | 56669
3 | 013333444
3 | 555789
4 | 11123
4 | 599
5 |
5 |
6 | 1
6 |
7 | 4
7 |
8 | 0
When we use scale=1
The decimal point is 1 digit(s) to the right of the |
2 | 156669
3 | 013333444555789
4 | 11123599
5 |
6 | 1
7 | 4
8 | 0
</pre>
<br/>
Well, I do not know how to rotate the stemplot 90 degrees counter-clockwise
using R so that it visually resembles a histogram; if anybody knows, please
let me know.
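<br/>
One possible workaround, offered as a sketch rather than a true rotation of the stemplot: a histogram with a bin width of 5 and intervals of the form [a, b) has one bar per row of the scale=2 stemplot, so it gives the same picture turned on its side. The ages below are typed in from the stemplot above, and the break points 20, 25, ..., 85 are chosen here to match its rows.
<span style="font-family: trebuchet ms; color:#003300;">
<pre>
# Best Actress ages, read off the stemplot above
age <- c(21, 25, 26, 26, 26, 29, 30, 31, 33, 33, 33, 33, 34, 34, 34, 35,
         35, 35, 37, 38, 39, 41, 41, 41, 42, 43, 45, 49, 49, 61, 74, 80)

# one bin per stemplot row: [20,25), [25,30), ..., [80,85]
h <- hist(age, breaks = seq(20, 85, by = 5), right = FALSE,
          main = "Age of Best Actress Winners", xlab = "Age")
h$counts   # 1 5 9 6 5 3 0 0 1 0 1 0 1, the leaf counts of the 13 rows
</pre></span>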
</OL>
</span>Krishna Daglinoreply@blogger.com2