Computer Practical:
1. Please read the information about R and login in the week one problem sheet posted on the course website. You may have problems logging in with your UniKey account for the first time. If possible, try to log in before your first computer practical and ask Mike Wilson in 636 if you have a problem. You may also ask him for other support, say when the printer runs out of paper.
2. After you log in to the system, open the problem sheet from the course website. Then, from the Application menu bar, select Nedit to open a window where you can type your program. You may copy and paste the commands from the problem sheet into your program to avoid typing errors. Remember to enter values into vectors and variables and change their names whenever necessary.
3. When you have finished one question, save your program under a file name, say 'prac1' (no space in the name). Then, from the Application menu bar, select R-tutorial and type 'process(prac1)' to generate a pdf file of the report that includes the commands, results, graphs and your written comments.
4. Save and process your program after each question so that it is easier to trace errors. If you are in doubt about an R command, you may check it by copying, pasting and running it directly in R without having to 'process' it.
5. Enclose R commands with \code and \end and R graphic commands with \graph and \end. Otherwise all R code will be treated as comment and NO graph will be generated. On the other hand, you should write your own comments whenever asked. You should know how to read and interpret the results from the output.
6. When there is no error message, say 'chunk 1 ...', and no pdf file has come out yet, the system may just be slow. You can press 'enter'. If there is still no file after a while, it may be due to an editing error, i.e. an error outside \code and \end, say '\paragraph{Q1)' where the right bracket is wrong. Look for this type of error.
To avoid this type of error, keep it simple by typing 'Q1' instead of '\paragraph{Q1}'.
7. You may have forgotten to change a variable name and ended up with wrong results in the computer practical report. You should redo it and hand it in to me again. Otherwise, you will just get low marks.
8. You may get lines of nuisance statements such as
   The following object(s) are masked from survey (position 3) ...
   To get rid of these in your final report, just quit and open a new R-tutorial to run process(.).
9. You may have problems generating the pdf file during the computer practical. In that case you may print the program or text file of R code and hand it in first. You should then debug your program and hand in the pdf file later. If you have problems debugging the program file, you may ask me.

Weekly comment:
1. Week 2: You should write the hypotheses, check the model assumptions and state the test result, which includes the test statistic, p-value and decision, whenever you are asked to do so.
2. Week 2: Remember that you test H0, NOT H1, because the p-value is the probability of the observed or a more extreme event calculated assuming H0 is true. Hence you should say either
   'Since the p-value > 0.05, the data is consistent with H0' (or 'there is insufficient evidence in the data to reject H0'), or
   'Since the p-value < 0.05, there is (strong) evidence in the data against H0.'
   You should NOT say it the other way round, say
   'Since the p-value > 0.05, there is evidence in the data against H1', or
   'Since the p-value < 0.05, the data is consistent with H1' (or 'there is insufficient evidence in the data against H1'),
   because you test H0, not H1: the test statistic is calculated assuming H0 is true.
3. Week 2: It should be 'alternative' or 'alt', NOT 'alternatives'!
4. Week 4: The following are some mistakes in writing the comments:
   'The normal assumption is more powerful.' It should be 'The t-test is more powerful.' It does not make sense to say that an assumption is powerful or not.
   'As the p-value is low, the normal assumption is right.' The p-value is the probability of observing a result at least as extreme as the one observed, calculated assuming H0 is true. We do not test but only check the normality assumption. What we test is H0.
   't-test is more accurate in rejecting H0.' It should be 't-test is more powerful in rejecting H0', that is, it has a higher chance of rejecting H0 if H0 is false. In a real situation, you never know whether your test decision is right unless you know the true mean obtained from a census. The example on P.85 of the lecture notes shows that the t-test is more likely to give a wrong decision if the normality assumption is not satisfied. That is the cost of being a more powerful test: the extra power relies on the normality assumption being satisfied.
   'sign-test has lower accuracy.' It should be 'sign-test has lower power.' The sign test is less likely to reject H0 even if H0 is false because it discards the information on magnitude in the data, resulting in insufficient evidence to detect a difference between the sample mean and the hypothesized mean and hence to reject H0.
   'Wilcoxon test has similar power as t-test since both assume normal distribution.' The Wilcoxon Signed Rank (WSR) test also assumes a symmetric data distribution, same as the sign test, but it is more powerful than the sign test because it uses the information on magnitude (rank) from the data, not just the 'sign'. It is difficult to say in general whether the WSR test or the t-test is more powerful; it depends on the data. However, the WSR test can be applied in more general situations, an advantage over the t-test, because it only assumes a less restrictive symmetric data distribution rather than a normal one.
   't-test is a stronger test.' What is meant by a stronger test? We do not define a 'strong' test. Instead, we define a test to be powerful if it has a high chance of rejecting H0 given that H0 is false.
   'When there are ties, we assume normal data distribution.' We use a normal approximation in calculating the p-value for the WSR test. We do not make a normality assumption for the data. Remember!
   The WSR test is a non-parametric (distribution-free) test.
5. Week 5: Q1b, you should comment on both the normality assumption from the QQ plot and the equality of variance assumption from the boxplots. If the spreads of the two boxplots are similar, the equality of variance assumption is satisfied. If the points lie close to a straight line in the QQ plot, the normality assumption is satisfied.
6. Week 5: Q1b, if there are no plots, that is because you forgot to enclose the R code with \graph and \end. Remember!
7. Week 5: Q1c, you should use mu_d=0, NOT the difference of sample means bar x - bar y, to test if H0 should be rejected. If mu_d=mu_x-mu_y=0 is included in the confidence interval (CI), do not reject H0. Otherwise reject H0. Note that bar x - bar y is the center of the CI, and so the CI ALWAYS contains bar x - bar y. Some of you even use the test statistic t0 to test, which is wrong.
8. Week 5: Q1d, you should also note the difference in d.f. between the two t-tests with and without the equality of variance assumption. Some of you said that the test with the equal variance assumption has lower power. This is in general incorrect. We can't say that the test which gives the higher p-value has lower power. Whether the model assumptions are valid should also be considered.
9. Week 5: Q2a, you can't say that the normal approximation should be used because the normality assumption is verified from the QQ plot. The t-test uses the original measurements, which are assumed to follow a normal distribution, to construct the test statistic. However, in the WRS test, the ranks, NOT the original measurements, are used instead, and if there are ties, we assume the test statistic W (sum of ranks for the smaller sample) approximately follows a normal distribution.
10. Week 6: Q2, make sure that the order of the variances and d.f. in the numerator and denominator of f0 agree. Otherwise you may get a p-value which is greater than 1, which is impossible! You should make f0 > 1. Otherwise, you should specify a lower-side p-value.
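The Week 6 comment on f0 can be tried out with a minimal sketch like the one below. The vectors x and y hold made-up numbers (they are not the practical's data), and the names vx, vy, df1, df2 and p.value are only illustrative. The sketch puts the larger sample variance in the numerator so that f0 > 1, keeps each d.f. with its own variance, and takes twice the smaller tail as one way of getting a two-sided p-value that cannot exceed 1.

```r
# Hypothetical data, for illustration only (not from the practical).
x <- c(12.1, 11.4, 13.0, 12.7, 11.9, 12.5, 13.3, 11.6)
y <- c(10.2, 14.8, 9.5, 15.1, 12.0, 8.9, 13.7)

vx <- var(x)
vy <- var(y)

# Put the larger sample variance in the numerator so that f0 >= 1,
# and keep each d.f. with the sample its variance came from.
if (vx >= vy) {
  f0  <- vx / vy
  df1 <- length(x) - 1   # numerator d.f.
  df2 <- length(y) - 1   # denominator d.f.
} else {
  f0  <- vy / vx
  df1 <- length(y) - 1
  df2 <- length(x) - 1
}

# Two-sided p-value: twice the smaller of the two tail probabilities,
# so the result can never be greater than 1.
lower   <- pf(f0, df1, df2)
p.value <- 2 * min(lower, 1 - lower)

# Cross-check against R's built-in F test for two variances.
check <- var.test(x, y)$p.value

print(c(f0 = f0, df1 = df1, df2 = df2, p.value = p.value, var.test = check))
```

If the variance and d.f. orders are mismatched (for example vy/vx with x's d.f. first), the tail probability refers to the wrong F distribution, which is how an impossible p-value can arise. In your practical file the commands would sit between \code and \end as usual.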
11. Week 6: Q3, the QQ plot is used to check normality whereas the boxplots are used to check equality of variances by looking at their spreads. If the spreads are similar, the equality of variance assumption is satisfied.
12. Week 7: Q1, some of you still say that "The boxplots are not symmetric and hence equality of variance does not hold." That is wrong. To check for equality of variance, you should compare the spreads of the boxplots.
13. Week 7: alpha=0.1 is used throughout this practical. Please read the questions carefully and be aware that an alpha other than 0.05 can be used. You should realize that alpha=0.1 is used to get a significant result for the ANOVA test, that is, the means ARE NOT ALL EQUAL, so that you can use the Bonferroni multiple comparison to check which pair(s) give rise to the difference. Also you should say "the means are not all equal", not just "the means are not equal" or "the means are different", because some pairs are equal but some are not.
14. Week 7: You should say that pair (2,3) is significantly different, NOT just MOST DIFFERENT, when you state the test result.
15. Week 8: Q1, you shouldn't say that the spreads across the 5 boxplots of range are not equal whereas those of range1 are similar, because they are from the same data except for the outlier. The spreads of the 5 boxplots of range1 (with outlier) look similar just because of the enlarged scale needed to accommodate the outlier, but the huge difference in the spread for R4 seriously violates the equality of variance assumption. The same applies to the check of the normality assumption. The points from range1 seem to lie close to a straight line because of the enlarged scale, but the normality assumption is seriously violated by the huge difference between the outlying point and the straight line. Always be aware that large differences look smaller if they are drawn on a larger scale.
16. Week 8: Q1, a nonparametric test is less powerful but it is less affected by outliers.
This question illustrates the sensitivity of the ANOVA test to the presence of outliers. Because of the serious violation of the model assumptions, the significance result is not reliable. However, you shouldn't say that the KW test is more powerful or has a higher chance of rejecting H0. It is just less affected by (more robust to) outliers.
17. Week 8: Q2, the variability of means across the three appraiser groups is hidden by the variability across the car sizes in the boxplots. The ANOVA test separates out the effect due to car size, resulting in a much smaller unexplained error and hence a clearer appraiser group difference relative to the error.
18. Week 9: Q1(b) You may say that the two results from the Friedman test are in agreement with the ANOVA test from last week. The Friedman test is less sensitive to violations of the normality and equality of variance assumptions, but if these assumptions are satisfied, the ANOVA test, which uses the original data rather than ranks, would be more powerful in detecting the block effect, as suggested by its much lower p-value.
19. Week 9: Q1(b) You shouldn't say that the ANOVA test is more accurate. As the truth about the existence of a certain effect is never known, it is difficult to say whether the ANOVA test is accurate or not.
20. Week 9: Q2(d) You shouldn't say that as the price increases, the discount rate increases. It should be the other way round because price is the dependent variable and discount an independent variable. You should say that as the discount rate increases, the price increases. The same applies to the promotion effect.
21. Week 9: Q2(d) You shouldn't say that, as the trend lines for discount are nearly parallel, there is no interaction for discount, but as the trend lines for promotion nearly cross, there is interaction for promotion. Interaction refers to inconsistency in the direction of the effect of one factor across levels of another factor. Hence both factors are involved when you describe interaction.
If the effect of one factor on the outcome is inconsistent across levels of another factor, the effect of the other factor on the outcome is also inconsistent across levels of the first factor.
22. Week 9: Q2(d) Read the solution to see how to describe the trend of the effects of discount and promotion alone and their interaction on price (the outcome). It will give you a better understanding of the ANOVA test as a whole.
23. Week 10: Q1(a) Many students set the wrong x and y vectors. Some interchanged x and y and got wrong results. Be careful.
24. Week 10: Q1(b) Some students said that the boxplot was symmetric and hence showed equality of variance. That's wrong! Instead, symmetry of the boxplot supports normality of the residuals. Unlike in the ANOVA test, the x variable is continuous instead of categorical. Hence the residuals are plotted in a scatter (residual) plot instead of boxplots to show equality of variance. When the residuals are randomly scattered around the x-axis with equal spread across x, the equality of variance assumption is satisfied.
25. Week 10: Q1(b) Some students said that the QQ plot showed that the data follow a straight line. That's wrong. The fitted line plot shows that the data follow a linear relationship between x and y. The QQ plot shows normality of the residuals.
26. Week 10: Q1(c) Some students said that the new regression model is not much affected. That's wrong. You can see that the regression coefficients change a lot with an outlier. Again, the enlarged scale needed to accommodate the outlier makes the new regression line look only slightly higher than the original regression line. You should also mention that the residual plot no longer shows a random scatter but a pattern with a downward trend across x. Hence the regression model is no longer suitable when an outlier is added.
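The Week 10 comments can be tried out with a minimal sketch like the following. The data frames d and d2 and all the numbers in them are made up for illustration (they are not the practical's data), and the added point (x = 1, y = 30) is just one hypothetical outlier. The sketch fits y on x, draws the residual plot and QQ plot of the residuals, then refits with the outlier and puts the two sets of coefficients side by side.

```r
# Hypothetical data, for illustration only (not from the practical).
d <- data.frame(
  x = 1:10,
  y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9)
)

# Response (y) on the left of ~, predictor (x) on the right.
# Interchanging them fits a different model and gives different results.
fit <- lm(y ~ x, data = d)

# Residual plot: look for a random scatter around zero
# with equal spread across x (equality of variance).
plot(d$x, resid(fit), xlab = "x", ylab = "Residuals")
abline(h = 0, lty = 2)

# QQ plot of the residuals: points close to the line support normality.
qqnorm(resid(fit))
qqline(resid(fit))

# Add one hypothetical outlier and refit.
d2   <- rbind(d, data.frame(x = 1, y = 30))
fit2 <- lm(y ~ x, data = d2)

# Compare the coefficients numerically rather than by eye,
# since the enlarged plotting scale can hide the change.
print(rbind(original = coef(fit), with.outlier = coef(fit2)))

# Residual plot for the refitted model: check whether a pattern appears.
plot(d2$x, resid(fit2), xlab = "x", ylab = "Residuals (with outlier)")
abline(h = 0, lty = 2)
```

Comparing the printed coefficients avoids the scale problem mentioned in the comments: a fitted line can look almost unchanged on a plot that has been stretched to fit the outlier. In your practical file the plotting commands would go between \graph and \end.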