Wednesday, June 18, 2008

Learning Statistics using R: Role-type classification (1 of 2)

Most of the material here is taken from the course website!
The second module of the course covers the relationship between two variables. In earlier sections we learned how to work with the distribution of a single variable, either quantitative or categorical.
This section starts with the role-type classification of two variables. In most studies involving two variables, each of the variables has a role. We distinguish between:
  • Response variable: the outcome of the study.
  • Explanatory variable: the variable that claims to explain, predict or affect the response.
The response variable is also known as the "dependent" variable and the explanatory variable as the "independent" variable: the dependent variable depends on the independent variable, hence the names. A simple example is a function that computes the sum of its arguments; the arguments (the values to be summed) are the independent variables, while the output (their sum) is the dependent variable.
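To make the function analogy concrete, here is a tiny R sketch (the function name is our own illustration, not from the course): the inputs play the role of independent variables, and the returned sum depends on them.
# A toy function: its arguments are the "independent" inputs and
# the returned sum is the "dependent" output.
> sum_values <- function(x, y) x + y
> sum_values(3, 4)
[1] 7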
Let's take 8 examples from the course website to make this clear. We will reuse these examples later for further variable type classification.
  1. We want to explore whether the outcome of the study - the score on a test - is affected by the test-taker's gender. Therefore:
    • Gender is the explanatory variable
    • Test score is the response variable
  2. How is the number of calories in a hot dog related to (or affected by) the type of hot dog (beef, meat or poultry)? (In other words, are there differences in the number of calories between the three types of hot dogs?)
    • Number of calories is the response variable
    • Type of hot dog is the explanatory variable
  3. In this study we explore whether nearsightedness of a person can be explained by the type of light that person slept with as a baby. Therefore:
    • Light Type is the explanatory variable
    • Nearsightedness is the response variable
  4. Are the smoking habits of a person (yes/no) related to the person's gender?
    • Gender of person (male/female) is the explanatory variable
    • Smoking habit is the response variable
  5. Here we are examining whether a student's SAT score is a good predictor for the student's GPA in freshman year. Therefore:
    • SAT score is the explanatory variable
    • GPA of Freshman Year is the response variable
  6. In an attempt to improve highway safety for older drivers, a government agency funded research that explored the relationship between drivers' age and sign legibility distance (the maximum distance at which the driver can read a sign).
    • Driver's age is the explanatory variable
    • Sign legibility distance is the response variable
  7. Here we are examining whether a person's outcome on the driving test (pass/fail) can be explained by the length of time this person has practiced driving prior to the test. Therefore:
    • Time is the explanatory variable
    • Driving Test Outcome is the response variable
  8. Can you predict a person's favorite type of music (Classical/Rock/Jazz) based on his/her IQ level?
    • IQ Level is the explanatory variable
    • Type of music is the response variable

The above examples help in identifying the response and explanatory variables, but is the role classification always clear? In other words, is it always clear which of the variables is the explanatory and which is the response?
Answer: NO! There are studies in which the role classification is not really clear. This mainly happens when both variables are categorical or both are quantitative. An example could be a study that explores the relationship between SAT Math and SAT Verbal scores. In cases like this, either classification choice is fine (as long as it is used consistently throughout the analysis).


We know that a variable is either categorical or quantitative. We use this information to further classify the response and explanatory variables. This role-type classification gives the following 4 possibilities:
  1. Case I: Explanatory is categorical and response is quantitative.
  2. Case II: Explanatory is categorical and response is categorical.
  3. Case III: Explanatory is quantitative and response is quantitative.
  4. Case IV: Explanatory is quantitative and response is categorical.
The following table, taken from the course website, summarizes the above 4 cases:
The course warns us that this role-type classification serves as the infrastructure for the entire section. In each of the 4 cases, different statistical tools (displays and numerical measures) should be used to explore the relationship between the two variables.
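In R, the "type" half of this classification maps naturally onto data structures: categorical variables are usually stored as factors and quantitative variables as numeric vectors. A minimal sketch (the variable names are our own illustration, not from the course):
# Categorical (e.g. gender) is stored as a factor,
# quantitative (e.g. test score) as a numeric vector.
> gender <- factor(c("male", "female", "female", "male"))
> score <- c(83, 91, 78, 88)
> is.factor(gender)
[1] TRUE
> is.numeric(score)
[1] TRUE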

Along with this, the course also suggests the following important rule:
Principle:
When confronted with a research question that involves exploring the relationship between two variables, the first and most crucial step is to determine which of the 4 cases represents the data structure of the problem. In other words, the first step should be classifying the two relevant variables according to their role and type, and only then can we determine the appropriate statistical tools.
Let's go back to our 8 examples and classify the explanatory and response variables as categorical or quantitative.
  1. We want to explore whether the outcome of the study - the score on a test - is affected by the test-taker's gender. Therefore:
    • Gender is the explanatory variable and it is a categorical variable.
    • Test score is the response variable and it is a quantitative variable.
    • Therefore this is an example of Case I.
  2. How is the number of calories in a hot dog related to (or affected by) the type of hot dog (beef, meat or poultry)? (In other words, are there differences in the number of calories between the three types of hot dogs?)
    • Type of hot dog is the explanatory variable and it is a categorical variable.
    • Number of calories is the response variable and it is a quantitative variable.
    • Therefore this is an example of Case I.
  3. In this study we explore whether nearsightedness of a person can be explained by the type of light that person slept with as a baby. Therefore:
    • Light Type is the explanatory variable and it is a categorical variable.
    • Nearsightedness is the response variable and it is a categorical variable.
    • Therefore this is an example of Case II.
  4. Are the smoking habits of a person (yes/no) related to the person's gender?
    • Gender of person (male/female) is the explanatory variable and it is a categorical variable.
    • Smoking habit is the response variable and it is a categorical variable.
    • Therefore this is an example of Case II.
  5. Here we are examining whether a student's SAT score is a good predictor for the student's GPA in freshman year. Therefore:
    • SAT score is the explanatory variable and it is a quantitative variable.
    • GPA of freshman year is the response variable and it is a quantitative variable.
    • Therefore this is an example of Case III.
  6. In an attempt to improve highway safety for older drivers, a government agency funded a research that explored the relationship between drivers' age and sign legibility distance (the maximum distance at which the driver can read a sign).
    • Driver's age is the explanatory variable and it is a quantitative variable.
    • Sign legibility distance is the response variable and it is a quantitative variable.
    • Therefore this is an example of Case III.
  7. Here we are examining whether a person's outcome on the driving test (pass/fail) can be explained by the length of time this person has practiced driving prior to the test. Therefore:
    • Time is the explanatory variable and it is a quantitative variable.
    • Driving Test Outcome is the response variable and it is a categorical variable.
    • Therefore this is an example of Case IV.
  8. Can you predict a person's favorite type of music (Classical/Rock/Jazz) based on his/her IQ level?
    • IQ Level is the explanatory variable and it is a quantitative variable.
    • Type of music is the response variable and it is a categorical variable.
    • Therefore this is an example of Case IV.
Next we will learn more about role-type classification and which tools to use in each of the cases.

Tuesday, June 17, 2008

Learning Statistics using R: Rule Of Standard Deviation

Most of the material here is taken from the course website!
The following explains the standard deviation rule, also known as the Empirical Rule. The rule applies only to normal (symmetric, mound-shaped) distributions.
  • Approximately 68% of the observations fall within 1 standard deviation of the mean.
  • Approximately 95% of the observations fall within 2 standard deviations of the mean.
  • Approximately 99.7% (or virtually all) of the observations fall within 3 standard deviations of the mean.
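These percentages come straight from the normal distribution; in R you can recover them with pnorm (this check is our own addition, not part of the course material):
# Probability mass within k standard deviations of the mean
# for a normal distribution.
> pnorm(1) - pnorm(-1)
[1] 0.6826895
> pnorm(2) - pnorm(-2)
[1] 0.9544997
> pnorm(3) - pnorm(-3)
[1] 0.9973002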
This rule provides more insight into the standard deviation, and the following picture, taken from the course website, illustrates it. Let's understand the rule with an example. The following data represent the heights of 50 males. Let's use R to find the five-number summary and to confirm that the distribution is roughly normal (mound-shaped).
# We use the 'c' function to populate the 'male' vector, on which we
# will carry out our operations.
male <- c(64, 66, 66, 67, 67, 67, 67, 68, 68, 68, 68, 68, 68, 69, 69, 69, 69, 69, 70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71, 71, 71, 72, 72, 72, 72, 72, 72, 73, 73, 73, 74, 74, 74, 74, 74, 75, 76, 76, 77)
> hist(male)
In the above code the 'hist' command draws a histogram with a roughly normal, mound shape. Here is the image that R draws for us. Let's find the five-number summary and check whether the standard deviation rule holds for this data set.
# The 'summary' command gives the five-number summary (plus the mean).
> summary(male)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  64.00   68.25   70.50   70.58   72.00   77.00
> sd(male)
[1] 2.857786
# Let's check the first rule: 68% of data points are within (mean - 1 * SD) and (mean + 1 * SD).
> male >= (mean(male) - (1 * sd(male))) & male <= (mean(male) + (1 * sd(male)))
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
[13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[37]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE
# The above command returns a logical vector that is TRUE wherever the
# condition holds (not the indices themselves).
# Let's count how many such observations there are.
> length(male[male >= (mean(male) - (1 * sd(male))) & male <= (mean(male) + (1 * sd(male)))])
[1] 34
# So 34 out of 50 observations are within mean +/- 1 SD, i.e.
> 34/50 * 100
[1] 68
# So, as the rule suggests, 68% of observations are within mean +/- 1 SD.

# Let's check the second rule: 95% of data points are within (mean - 2 * SD) and (mean + 2 * SD).
> length(male[male >= (mean(male) - (2 * sd(male))) & male <= (mean(male) + (2 * sd(male)))])
[1] 48
> 48/50 * 100
[1] 96
# So indeed about 95% (here 96%) of the data points are within mean +/- 2 SD.

# Let's check the third rule: 99.7% of data points are within (mean - 3 * SD) and (mean + 3 * SD).
> length(male[male >= (mean(male) - (3 * sd(male))) & male <= (mean(male) + (3 * sd(male)))])
[1] 50
> 50/50*100
[1] 100
# This shows that virtually all (here 100%, vs. the rule's 99.7%) of the data points are within mean +/- 3 SD.
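The three checks above repeat the same computation, so a small helper function (our own sketch, not part of the course material) makes the pattern explicit:
# Percentage of observations within k standard deviations of the mean.
> pct_within <- function(x, k) {
+   m <- mean(x); s <- sd(x)
+   100 * sum(x >= m - k * s & x <= m + k * s) / length(x)
+ }
> pct_within(male, 1)
[1] 68
> pct_within(male, 2)
[1] 96
> pct_within(male, 3)
[1] 100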
The following table, taken from the course website, makes this clearer:
Summary:
  • The standard deviation measures the spread by reporting a typical (average) distance between the data points and their average.
  • It is appropriate to use the SD as a measure of spread with the mean as the measure of center.
  • Since the mean and standard deviation are highly influenced by extreme observations, they should be used as numerical descriptions of the center and spread only for distributions that are roughly symmetric and have no outliers.
  • For symmetric mound-shaped distributions, the Standard Deviation Rule tells us what percentage of the observations falls within 1, 2, and 3 standard deviations of the mean, and thus provides another way to interpret the standard deviation's value for distributions of this type.

Saturday, June 14, 2008

Krishna Dagli: Resume

Krishna Dagli
krishna dot dagli at gmail dot com
9167705218


  • POSITIONS HELD
    • Technical Consultant, Ministry of Finance Feb 2005 - Dec 2006
    • Project Manager, Infotech Financials Pvt. Ltd. Oct 2002 - Sep 2006
    • Team Leader, Infotech Financials Pvt. Ltd. Aug 2000 - Sep 2002
    • Senior Programmer, Infotech Financials Pvt. Ltd. Feb 1999 - Jul 2000

  • CONSULTANCY
    • Ministry of Finance (Feb 2005 - Dec 2006)

      Member of the technical committee to upgrade IT infrastructure at the Ministry of Finance. The agenda was to determine which software was to be purchased based on the requirements of each department, the minimum standards to be adopted while purchasing software and hardware, and the use of open source software to replace proprietary software. Linux consultation.

    • Visa Pay (Jan 2006)

      To review the performance of purchased hardware and suggest configuration changes, mainly for cryptographic hardware solutions.

    • Siyaram Silk Mills Ltd. (Mar 2006)

      Email solution using open source technologies. I was brought in as a consultant for optimization and configuration problems.

  • PROJECTS
    • Algorithmic Trading System (Jan 2008 - )

      The second phase of the algorithmic trading solution.

    • Dealer Back Office Automation and Reporting (Jan 2007-)

      Enterprise integration system capable of generating performance and analysis reports for a broker.

    • Capital Market Risk Analyzer (Dec 2006 -)

      A system that calculates a client's risk and other parameters by taking into account the client's holdings and trading positions.

    • Basic CRM - BCM (Nov 2006 - Feb 2007)

      A contact management, interaction logging and reporting system for institutional clients of an equity brokerage firm.

    • Algorithmic Trading System (Feb 2005 - Nov 2006)

      An algorithmic trading solution that allows brokers to define trading strategies, based on which orders can be triggered automatically to the exchange. The system was amongst the first of its kind in India: real-time financial software with adequate risk checks and 100% uptime.

    • TAN & PAN Logistics Control Module (Mar 2004 - Jan 2005)

      An in-house application for National Securities Depository Ltd. (NSDL) that allows them to track the document (TAN and PAN card) flow of their TAN and PAN management services.

    • Telerate Systems' Capital Market Module (Feb 2004)

      The conceptualisation and design of a plugin for Money Line's Active8 which displays company charts, reports and financial summaries depending on the screen and scrip selection.

    • Zero Coupon Yield Curve (Jan 2003 - Dec 2003)

      Zero coupon yield curve estimation with reduced SSE. The system was able to perform 20 times faster than the existing system.

    • Realtime Arbitrage Position Monitor (Mar 2002 - Nov 2002)

      A real-time application used to calculate the net position of dealers (jobbers) across stocks and across markets in an equity broking firm.

    • Chanakya (Jan 2002 - Mar 2002)

      Chanakya is a decision support product for derivatives trading, capable of handling multiple data feeds from Exchange or information vendors such as Moneyline Telerate.

    • e-Brokerage (Jan 2001 - Dec 2001)

      A web-based trading system that allowed clients in various parts of the globe to enter orders into the broker's central server, from where a dealer routes the orders manually to the exchange.

    • i-Depository (Jan 2001 - Dec 2001)

      An online reporting and emailing system which allows clients of brokers to see their holdings and transaction statements online.

    • i-Fund (May 2000 - Sep 2000)

      An index fund management product developed in-house for index fund tracking-error calculation.

  • KEY ACCOMPLISHMENTS
    • Socket Library in C: Developed a socket-handling library in C and released it as GNU software. Contributed as a developer to the open source Gammu project (version 0.2), which allows users to interface with mobile phones from Linux.
    • Reverse engineering of the MBUS protocol for the Nokia 5110 handset. Used to design a mobile PCO with the Nokia 5510 handset, a keypad and assembled electronic components. The device detects when a call is connected and starts a timer, which stops when the call is disconnected. The project was done under the guidance of Prof. Pankaj Siriah of IIT Bombay.
    • Designed software to sign contract notes using digital signatures and encrypt each note before dispatch to clients.
    • Developed a WAP-based system for trading via the Internet for a premier mobile telecommunications company.
    • Installation, configuration and performance tuning of Sybase and DB2.
    • Considerable knowledge of diskless Linux setup, and web server and mail server setup (Apache, qmail). Integration of WAP and SMS services with Apache using the Kannel gateway.
  • ACADEMICS
    • Systems Analysis and Design: From UC Berkeley (Oct 2007).
    • Secure Programming and Security: From Stanford University (Oct 2005).
    • NSE's derivatives certification exam with a score of 83%.
    • Certification in AIX from IBM.
    • DB2 UDB V8.1 Family Fundamentals (Test 700) with a score of 81%.
    • DB2 UDB V8.1 for Linux, Unix and Windows Database Administration (Test 701) with a score of 82%.
    • BSc, Mathematics, Mumbai University (1998), with a score of 68%.
Friday, June 13, 2008

Learning Statistics using R: Standard Deviation and Histogram

Most of the material here is taken from the course website!
In the following example we will see how a histogram can help clarify the concept of standard deviation.
Example:
At the end of a statistics course, the 27 students in the class were asked to rate the instructor on a number scale of 1 to 9 (1 being "very poor", and 9 being "best instructor I've ever had"). The following table provides three hypothetical sets of rating data:
Here are the histograms of the data for each class:
What can we say about the standard deviation by looking at these histograms and data sets?
Let's assume that the mean of all three data sets is 5 (which is reasonably clear from the histograms), and recall that the standard deviation is (roughly) the average distance of the data points from their mean.
  • For the class I histogram most of the ratings are at 5, which is also the mean of the data set. So the average distance between the mean and the data points is very small (since most of the data points sit at the mean).
  • For the class II histogram most of the ratings are far from the mean of 5. In this case most of the data points sit at the two extremes, 1 and 9. So the average distance between the mean and the data points is large.
  • For the class III histogram the data points are evenly distributed around the mean. We can safely say that in this case the average distance between the mean and the data points is greater than that of class I but smaller than that of class II, i.e. a standard deviation in between those of classes I and II.

Let's check our assumptions by loading these data sets into R and computing the standard deviation of each. The Excel file containing the data sets can be downloaded from here.
> class1 <- c(1,1,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,9,9)
> sd(class1)
[1] 1.568929
> summary(class1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       5       5       5       5       9 

> class2 <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,5,9,9,9,9,9,9,9,9,9,9,9,9,9)
> sd(class2)
[1] 4
> summary(class2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       1       5       5       9       9 

> class3 <- c(1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9)
> sd(class3)
[1] 2.631174
> summary(class3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       3       5       5       7       9 
So we have the following standard deviations for our 3 class ratings:
  • Class I: 1.568929
  • Class II: 4.0
  • Class III: 2.631174
(Note that Excel's results may differ slightly if you are using its STDEV function.) So the calculated standard deviations confirm the assumption we made by looking at the histograms.
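The course shows the three histograms as images; with the vectors defined above you can reproduce similar plots yourself (a sketch of our own, not from the course):
# Draw the three class histograms side by side for comparison.
> par(mfrow = c(1, 3))
> hist(class1, main = "Class I")
> hist(class2, main = "Class II")
> hist(class3, main = "Class III")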

Wednesday, June 4, 2008

Learning Statistics using R: Standard Deviation

Most of the material here is taken from the course website!
Earlier we examined measures of spread: the range (max - min) and the IQR (the range covered by the middle 50% of the data). We also noted that the IQR should be used when the median is the measure of center. Now we move to another measure of spread called the standard deviation.

The idea behind the standard deviation is to quantify the spread of a distribution by measuring how far the observations are from their mean. The standard deviation gives the average (or typical) distance between a data point and the mean, x-bar.

Let's understand the standard deviation using an example; we calculate it step by step using R commands. (There is a single R function that does the same; see the end of this post!)
Assume we have the following data set of 8 observations:
7, 9, 5, 13, 3, 11, 15, 9
  • Calculate the mean:
    We use R's 'c' function to combine the observations into a vector and then use the 'mean' function to calculate the mean of the data set.
    > dataset <- c(7, 9, 5, 13, 3, 11, 15, 9)
    > is.vector(dataset)
    [1] TRUE
    # We have stored our observations in a vector called 'dataset'.
    > mean(dataset)
    [1] 9
    # So we have 9 as the mean of our data set; now we need to find the
    # distance of each observation from this value: 9.
    > deviation <- dataset - 9
    > deviation
    [1] -2  0 -4  4 -6  2  6  0
    # The above command directly gives us the deviation of each observation
    # from the mean, stored in another vector called 'deviation'.

    Thinking about the idea behind the standard deviation being an average (typical) distance between the data points and their mean, it makes sense to average the deviations we got. Note, however, that the sum of the deviations from the mean is always 0, so we cannot simply average them.

  • Square the deviations:
    So we square each of the deviations and then take their average, which the following R code does:
    # We can use either of the following two expressions to calculate the
    # squares of the deviations.
    > deviation ^ 2
    [1]  4  0 16 16 36  4 36  0

    > deviation * deviation
    [1]  4  0 16 16 36  4 36  0

  • Average the squared deviations by adding them up and dividing by n - 1 (one less than the sample size). Let's do that in R:
    > (sum(deviation ^ 2)) / (length(dataset) - 1)
    [1] 16

    This average of the squared deviations is called the variance of the data.

  • Find the standard deviation:
    The standard deviation of the data is the square root of the variance, so in our case it is the square root of 16.
    > sqrt(16)
    [1] 4

    Why do we take the square root? Note that 16 is an average of the squared deviations, and therefore has different units of measurement. In this case 16 is measured in "squared deviations", which obviously cannot be interpreted. We therefore take the square root to compensate for the fact that we squared our deviations, and to get back to the original units of measurement. The sketch below consolidates all of these steps into a single function.
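Putting the steps together, here is a small function of our own (not from the course) that computes the standard deviation from scratch and agrees with R's built-in sd:
# Standard deviation from scratch: deviations from the mean,
# squared, averaged over n - 1, then square-rooted.
> my_sd <- function(x) {
+   sqrt(sum((x - mean(x)) ^ 2) / (length(x) - 1))
+ }
> my_sd(dataset)
[1] 4
> sd(dataset)
[1] 4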

Properties of the Standard Deviation:
1. It should be clear from the discussion thus far that the standard deviation should be paired, as a measure of spread, with the mean as a measure of center.
2. Note that the only way, mathematically, for the standard deviation to be 0 is when all the observations have the same value. Indeed, in this case not only is the standard deviation 0, but the range and the IQR are 0 as well.
3. Like the mean, the SD is strongly influenced by outliers in the data. Consider our last example: 3, 5, 7, 9, 9, 11, 13, 15 (data ordered). If the largest observation were wrongly recorded as 150, the average would jump up to 25.9 and the standard deviation would jump up to 50.3. Note that in this simple example it is easy to see that while the standard deviation is strongly influenced by outliers, the IQR is not! In both cases the IQR is the same since, like the median, the calculation of the quartiles depends only on the order of the data rather than the actual values. (See the check below.)
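A quick R check of this claim (our own addition, using the same data):
# Effect of one wrongly recorded outlier on the mean, SD and IQR.
> good <- c(3, 5, 7, 9, 9, 11, 13, 15)
> bad <- c(3, 5, 7, 9, 9, 11, 13, 150)
> mean(bad)
[1] 25.875
> sd(bad)
[1] 50.25489
> IQR(good)
[1] 5
> IQR(bad)
[1] 5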

Choosing Numerical Summaries
  • Use the mean and the standard deviation as measures of center and spread only for reasonably symmetric distributions with no outliers.
  • Use the five-number summary (which gives the median, IQR and range) for all other cases.
R function for standard deviation: There is a single R function, "sd", that calculates the standard deviation of a data set; just be careful to pass the "na.rm=TRUE" argument if your data set contains NA values. If the data set is a data frame or matrix, this function returns a vector of the standard deviations of its columns; remember, it works on columns, not rows, by default.
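A short usage sketch (our own addition) showing NA handling, plus a column-wise computation via sapply that works regardless of R version:
# sd() with missing values, and column-wise SDs via sapply.
> x <- c(7, 9, 5, 13, 3, 11, 15, 9, NA)
> sd(x)
[1] NA
> sd(x, na.rm = TRUE)
[1] 4
> df <- data.frame(a = c(1, 2, 3), b = c(2, 4, 6))
> sapply(df, sd)
a b 
1 2 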