Monday, September 26, 2011

Empirical evidence of the fairness and quality of peer evaluations.

INTRODUCTION

A challenge to academics has long been the fair and rigorous evaluation of student performance, where such evaluation is called for (as in most western European and American universities). During the 1990s, owing to a surge in international efforts directed toward making accounting education more participative (see, for example, AAA, 1986; AECC, 1990; Libby, 1991; Albrecht et al., 1994; Lindquist, 1995; United Nations, 2003), pedagogical methods such as group work, case analysis, and team projects have made the evaluation of student performance more complex (Humphreys et al., 1997).

Student peer evaluations offer a variety of benefits in supplementing the instructor's task of evaluating students. First, when working in groups, fellow students have a unique perspective from which to evaluate the relative contributions of group members. Greguras et al. (2001) observed that the proximity of peers in the performance of tasks makes them uniquely positioned to observe the level and quality of peer performance. Second, if asked to assume partial ownership of the education process, students should be more engaged in that process. Thus, if expected to submit peer evaluations, students should be invested in paying attention to, being prepared for, and taking seriously the work executed by their peers in order to compose a fair evaluation of that work. An additional benefit of peer evaluations is their increasing use within firms (Greguras et al., 2001). Upon graduation, new employees often find themselves called upon to evaluate those with whom they work. Guidance provided to students in formulating evaluations of their peers, as well as the experience of being evaluated by peers while in school, should carry over into their ensuing professional lives.

Several problems associated with the use of peer evaluations present themselves, however. First, there exists the possibility of a prisoner's dilemma when students are asked to evaluate each other. (Numerous works describe the prisoner's dilemma; see, for example, Poundstone, 1992, pp. 8-9.) A prisoner's dilemma exists when two players (for example, two students), in the absence of collaboration, make independent decisions that lead to a suboptimal outcome for both players. Consider a simple example of two students who are asked to evaluate one another. Each student can choose either to evaluate the other student fairly or unfairly (i.e., lower than deserved). Each student, when facing his or her decision, will weigh the alternatives in light of what the other student may choose. A "dominant strategy" exists whenever one alternative is better in every case, no matter the choice made by the other player.

In a strictly competitive game (which a class using peer evaluations and a competitive grading model almost certainly is), regardless of what another student does, and in the absence of signaling, a student's best option (dominant strategy) will always be to assign a lower evaluation to the work of his or her peer. Thus, one concern of this study is that students, acting in their own self-interest, will systematically grade their peers lower in an effort to make their own evaluations relatively better.
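To make the dominant-strategy argument concrete, the following is a minimal sketch in Python. The payoff values are hypothetical, chosen only so that scoring a peer low always yields a higher relative payoff than scoring fairly; they are not drawn from the study.

    # Hypothetical payoffs (relative grade advantage) for the row student.
    # Under competitive grading, scoring a peer "low" improves one's
    # relative standing regardless of what the other student chooses,
    # yet mutual "low" (2) leaves both worse off than mutual "fair" (3).
    PAYOFFS = {
        ("fair", "fair"): 3, ("fair", "low"): 1,
        ("low",  "fair"): 4, ("low",  "low"): 2,
    }
    STRATEGIES = ("fair", "low")

    def dominant_strategy(payoffs, strategies):
        """Return a strategy strictly best against every opponent choice,
        or None if no such strategy exists."""
        for mine in strategies:
            alternatives = [s for s in strategies if s != mine]
            if all(payoffs[(mine, opp)] > payoffs[(alt, opp)]
                   for opp in strategies for alt in alternatives):
                return mine
        return None

    print(dominant_strategy(PAYOFFS, STRATEGIES))  # -> "low"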
A second potential problem is that students may not have the capacity to judge the work of their peers. Technical courses in particular (e.g., accounting courses) present an environment in which, prior to the completion of the educational cycle, the student is not yet equipped to judge the technical competency of a complex solution. How, for example, can a student evaluate the correctness of a solution to a cash flow problem if the student has not yet mastered the preparation of a cash flow statement?

COURSE ENVIRONMENT AND PEER EVALUATIONS

The course in which peer evaluations were implemented and examined was a four-semester-hour course covering introductory financial and managerial accounting, offered at the graduate level for MBA students at a major public American university. Observations of behavior were made over two semesters and covered three sections of the course, with an average enrollment of 35 students per section. Twenty-one Harvard Business School cases were used each semester, with students taking on team responsibilities in presenting the cases. In general, teams of two students were assigned one case each, based on a bidding scheme that rewarded teams for taking on more difficult cases.

The case presentation counted for five percent of a student's grade and was earned by the team rather than by individuals separately. Each student in the course, whether presenting or not, was expected to be thoroughly prepared for each case. Preparedness was monitored through a series of quizzes administered frequently but on a random basis, and participation was observed and graded to provide additional incentives for case preparation among class members. A variety of benefits accrue from requiring student preparation and presentation of cases. Adler et al. (2004) argued that the self-directed learning that emerges in student presentation of cases is more consistent with the learning objectives intended in the case method than a teacher-led case pedagogy. These benefits include enhancement of communication skills, building of confidence, increased willingness to confront new experiences, and promotion of self-directed learning, among others.

As the semester progressed and cases were presented, students were asked to evaluate their peers on five dimensions (professionalism, technical quality, clarity and organization, identification of issues, and use of external resources), on a scale of 0-5 for each dimension. The five dimensions were provided on an evaluation form to which the students responded following each presentation. Evaluations were e-mailed to the professor, along with an assessment of the degree of difficulty of the case.

Peer evaluations presented several challenges. Students in the first semester were not given specific instructions with respect to the timeliness of their evaluations or the importance of actually completing them; as a consequence, the response rate was only about 50%. By comparison, in the second semester, when students were asked to provide their evaluations within two days of the presentation and were told that their response rate might factor into their participation grade, the response rate improved significantly, rising to over 80%.

Kilpatrick et al. (2001) identified several characteristics of peer evaluations that, according to students, are desirable: a structured evaluation form, allowance for additional comments, and confidentiality for evaluators. Each of these characteristics was incorporated into the peer evaluation process used in the courses observed in this study.
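Before turning to the research questions, a minimal sketch (Python) of the scoring mechanics described above may be useful. The field names and the simple grand-mean aggregation are assumptions for illustration; the paper does not specify the exact aggregation rule used.

    # Five dimensions from the evaluation form, each scored 0-5.
    DIMENSIONS = ("professionalism", "technical_quality",
                  "clarity_organization", "issue_identification",
                  "external_resources")

    def mean_evaluation(forms):
        """Average one presentation's peer forms across all dimensions.

        `forms` is a list of dicts mapping each dimension to a 0-5 score.
        Returns the grand mean, i.e., an unadjusted score for the case.
        """
        per_form = [sum(f[d] for d in DIMENSIONS) / len(DIMENSIONS)
                    for f in forms]
        return sum(per_form) / len(per_form)

    forms = [
        {d: 5 for d in DIMENSIONS},  # a uniform "all 5s" rater
        {**{d: 4 for d in DIMENSIONS}, "technical_quality": 2},
    ]
    print(round(mean_evaluation(forms), 2))  # -> 4.3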
RESEARCH QUESTIONS AND HYPOTHESES

Of significant concern is whether peer evaluations add to or detract from a fair and impartial score. In MBA classes under a quasi-cohort system, one would be naive to expect peer evaluations to be completely impartial. One expects that both alliances and rivalries would develop over time: most obviously, friends would score friends highly; and, possibly, rivalries or animosities may emerge among students, having the opposite effect.

There are also potential sources of bias that arise from purely self-interested behavior. In its simplest form, self-interested behavior might manifest itself in lower scores assigned by students hoping to gain a competitive advantage over their colleagues. The grading mechanism in these classes was competitive, in the sense that grades were assigned based on performance relative to that of one's peers. Under that circumstance, and if one recognizes the opportunity, assigning low peer evaluations can secure a competitive advantage over those who evaluate their peers fairly. In response to this concern, the research question the study asks is:

R1: Are there groups of students who systematically act in their own self-interest in evaluating their peers?

A logical extension of this question is whether students who exhibit lower levels of moral development are more likely to use a peer evaluation system to put themselves at a systematic advantage over their classmates. To answer this question, the Defining Issues Test (DIT) was administered to each student in an effort to quantify various dimensions of the student's moral reasoning. The most recent version of the DIT, the DIT-2, provides several measures that help identify progressively higher levels of moral reasoning. The N2SCORE is a developmental index that attempts to measure levels of sophistication in thinking about moral issues (Bebeau and Thoma, 2003, pp. 19-20). While it does not necessarily follow that more sophisticated thinking (and rejection of "simpler and biased" thinking) will produce ethical behavior, it is reasonable to expect a systematic bias toward more moral behavior among higher-level thinkers. Thus, the first hypothesis tested by this study is:

H1.1: Mean evaluations by students with a higher N2SCORE are higher than mean evaluations by students with a lower N2SCORE.

As results are discussed, whether the null is rejected or not, and whether that outcome is interpreted as desirable or undesirable, will vary with the nature of the question. In this case, the regression results (Table 1) do not support rejection of the null, suggesting that there is no systematic, self-interested behavior exhibited during the peer evaluation process by students with a lower N2SCORE. Further, students with a higher N2SCORE (i.e., higher measured levels of moral development) are not at a systematic disadvantage to those with lower scores.
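The paper does not reproduce its estimation code; the following is a minimal sketch, in Python with statsmodels, of the kind of bivariate regression summarized in Table 1 (mean evaluation regressed on N2SCORE). The variable names and synthetic data are assumptions for illustration only; the sample size of 103 is implied by Table 1's total degrees of freedom.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)

    # Hypothetical stand-ins for the study's variables: one N2SCORE and
    # one mean peer evaluation per student (Total DF of 102 implies n = 103).
    n2score = rng.uniform(20, 60, size=103)
    mean_eval = 4.4 + rng.normal(0, 0.39, size=103)  # no built-in relationship

    model = sm.OLS(mean_eval, sm.add_constant(n2score)).fit()
    print(model.rsquared, model.f_pvalue)
    # cf. Table 1: R-square 0.0205, Pr > F 0.1494 -- failing to reject the null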
Another interesting question is whether students who are doing poorly in the course either consciously or subconsciously lower their evaluations in an effort to improve their standing in the class. Two testable hypotheses were developed to address this question:

H1.2: The change in mean evaluations by students from the first to the second half of the course is inversely related to their scores on the midterm exam.

H1.3: Student scores on the midterm exam are positively related to their mean evaluations in the second half of the course.

In the case of H1.2, upon receiving a poor score on the midterm exam, a student may seek any competitive advantage he or she can find. One possible source would be to lower peer evaluations for the duration of the semester. Since students are informed that grading in the course is competitive, this behavior would represent a dominant strategy if the goal is to raise one's relative position in the class. In much the same way, students performing well on the midterm should be more confident (i.e., less insecure) about their grades and feel less pressure to lower their scores. In H1.3, mean peer evaluations prior to the midterm exam are assumed to be equivalent; this assumption was supported by an analysis of the data.

Results of the statistical tests of these hypotheses are presented below (Tables 2 and 3). Again, in neither instance are these assertions supported; and, once again, this should be interpreted as a desirable outcome. Of course, there can be several explanations of why students appear to behave in a way true to the task of evaluating their peers fairly. The most optimistic interpretation is that students are behaving responsibly toward their peers, judging their work fairly, and acting altruistically, in a way consistent with Kant's first categorical imperative (Beck, 1990, p. 38). It is also possible that students do not realize the marginal advantage to be gained by lowering their peer evaluations; or that they do understand, but consider the probabilistic benefit so low that they do not wish to risk their peers discovering the source of their low evaluations. In any event, there appears to be no evidence of gaming taking place in the peer evaluation process.

Another interesting question posed here is whether there is evidence that the peer evaluation system has characteristics that diminish the quality of the assessment. Although Kilpatrick et al. (2001) presented evidence that students favor student input into the evaluation process, there may be problems associated with the content of those evaluations. The research question suggested here is:

R2: Are there characteristics in student peer evaluations that would suggest qualitative shortcomings to those evaluations?

A variety of ways exist to approach this question. One interesting observation, for example, is the proportion of students who appeared to give uniform evaluations, offering very little discrimination among case presentations. Several examples illustrate this point. In one student presentation of Crystal Meadows of Tahoe, Inc. (HBS Case 192-150), which required preparation of a cash flow statement, an income statement was presented instead. Because the error was so egregious, control of the presentation was temporarily assumed by the professor in order to correct any impression that the income statement might be a cash flow statement. Still, under technical merit, several students assigned a "5", even though a major technical flaw had been assertively pointed out. In several presentations, students dressed in shorts and t-shirts or were otherwise unprofessional in appearance, and their groups often suggested a lack of preparedness; other groups dressed in business suits and delivered smooth, professional presentations. Still, a critical mass of students failed to discriminate between these two levels of apparent effort, assigning a "5" in each instance on the "conducted in a professional manner" dimension.
While this study did not attempt to measure these more subjective qualities, they stand as evidence that the marginal efforts made by some students were perhaps not rewarded in the peer evaluation process.

Another concern is that students who came to class unprepared may not have had a basis upon which to evaluate certain dimensions of the presentation. Question 2 on the evaluation form asked the reviewer to evaluate the presentation on its technical merits. Absent knowledge of the case and insight into viable solutions, a student may have given the presenter the benefit of the doubt and submitted a high evaluation. During both semesters, short quizzes were administered at random at the beginning of class periods; these quizzes were part of the grading mechanism and served as a proxy for student preparedness. The hypothesis thus suggested is:

H2.1: Students performing poorly on daily quizzes submitted higher evaluations of technical merit for cases than students performing well on daily quizzes.

By a similar logic, students who performed well, and by extension are presumed to have been prepared each day, should have had more consistent insights into the technical merit of a presentation. The scores assigned by those students, therefore, should be more narrowly distributed than scores assigned by students who were less well prepared. Regarding workload, preparing for an easier case takes less of a commitment on the part of students not assigned to present. With more difficult cases, one might expect that fewer students will have prepared, and thus they would be less informed in evaluating the peers responsible for presenting. In those cases, too, one might expect evaluations to be more widely dispersed than when the assigned case was less difficult. Based on these arguments, the following hypothesis was developed:

H2.2: The dispersion of students' evaluations of technical merit is inversely related to scores on the midterm exam.

Tables 4 and 5 provide the statistical results for the preceding two hypotheses. The results suggest no evidence that the potential problems implied by either hypothesis exist. Again, failure to reject the null is a desirable outcome in each instance, indicating that lack of preparedness did not interfere with assessments when compared to those of students who were more prepared.
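The construction of the dispersion measure in H2.2 is not spelled out in the text. One plausible reading, sketched below in Python, takes each student's dispersion to be the sample standard deviation of the technical-merit scores he or she assigned, regressed on that student's midterm score. Everything here, including the synthetic data, is an assumption for illustration; the sample size of 105 is implied by Table 5's total degrees of freedom.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)

    # Hypothetical data: each row holds one student's technical-merit
    # scores across the 21 case presentations, plus a midterm score.
    scores = np.clip(rng.normal(4.4, 0.4, size=(105, 21)), 0, 5)
    midterm = rng.uniform(55, 98, size=105)

    dispersion = scores.std(axis=1, ddof=1)  # per-student sample std. dev.
    model = sm.OLS(dispersion, sm.add_constant(midterm)).fit()
    print(model.params[1], model.f_pvalue)
    # H2.2 predicts a negative slope; cf. Table 5's Pr > F of 0.2189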
An interesting possibility is the "halo" effect that may accompany the presentation of more difficult cases. Anyone familiar with the judging of diving understands this effect. Presumably, easier dives should be easier to execute and should thus be accompanied by better scores. More difficult dives, however, seem to be those that draw the 9s and 9.5s from the judges, while easier dives tend not to be scored as well. There thus seems to be a subconscious awarding of additional credit for attempting the more difficult dives, even though the degree-of-difficulty system is intended to compensate for this automatically (Thomas et al., 2005, p. 208). In the same way, one expects that students presenting easier cases should receive higher scores for their presentations. If the opposite were true, as seems to be the case in diving scores, rewards for cases would be distributed in a way other than intended. The following hypothesis, therefore, tests this notion:

H2.3: Unadjusted peer evaluations of cases are positively related to their degree of difficulty.

Results (Table 6) suggest a strong statistical relationship between unadjusted peer evaluations and case difficulty, consistent with the aforementioned "halo" effect. The coefficient is positive, consistent with the hypothesized direction of the relationship. If there is solace to be found in this result, one might find it in two places. First, the adjusted R-square is only 0.0567, suggesting that other, more important variables would better explain the variance among subjects. Second, this may be a "problem" that is acceptable: students take on risk and additional work by bidding aggressively on more difficult cases, and the effect discussed here is simply a hidden reward associated with that extra risk.

Another indication of uninformed evaluations may be inconsistencies in the distribution of evaluations on days when multiple cases were presented. When one case is assigned for a given day, the task of preparing adequately is more manageable than on days when multiple cases are assigned. It may also be true that, if evaluations of grouped cases are more widely distributed, a case could be made that students, in formulating their evaluations, are less focused because of the additional inputs. The fourth hypothesis for the second research question is thus suggested:

H2.4: Mean evaluations of cases presented alone are more narrowly distributed than mean evaluations of cases presented on days when multiple cases are presented.

Examining the results (F test for unequal variances, Table 7), the variances for the two samples were shown to be unequal at the 0.03 level of significance; however, evaluations of the isolated cases were more widely dispersed than those of the grouped cases. This result is opposite the relationship suggested in the hypothesis. The null, therefore, is not rejected.
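The F test reported in Table 7 can be verified mechanically from its summary figures. The sketch below (Python with scipy) does so under the assumption that the reported p-value is one-tailed; the small difference in the F statistic reflects rounding of the reported variances.

    from scipy.stats import f

    # Sample variances and sizes from Table 7.
    var_grouped, n_grouped = 0.0085, 11    # mean evaluations, grouped cases
    var_isolated, n_isolated = 0.0302, 9   # mean evaluations, isolated cases

    F = var_grouped / var_isolated                      # 0.2815 (Table 7: 0.2813)
    p = f.cdf(F, n_grouped - 1, n_isolated - 1)         # one-tailed p-value
    print(round(F, 4), round(p, 4))                     # p near Table 7's 0.0323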
CONCLUSIONS AND RECOMMENDATIONS

The purpose of this paper has been to explore the fairness and quality of student peer evaluations in accounting courses. Two questions were asked: 1) did students exhibit self-interested behaviors in assessing the performance of their peers; and 2) were there qualitative shortcomings to peer evaluations? On both questions, there was little evidence in the data gathered either that a) students behaved in a self-interested way, or b) there were qualitative problems with the peer evaluations.

On the subject of peer evaluations, guidance, perhaps in the form of specific instructions, should be offered to students on how to assign scores to the different dimensions of the peer evaluations. Knechel (1992) describes an interesting alternative to the method adopted here: rather than having students evaluate each case presentation, Knechel suggests having students, at the end of the semester, name the five best presentations. Students would then be rank-ordered according to the number of votes they received. There are obvious scaling issues that might be encountered with this approach (e.g., several or many groups receiving no votes, a recency effect, etc.); it may, however, offer better discrimination.

One dimension not covered in the evaluations was intra-group evaluation. There were, of course, several confidential complaints by team members that they were "doing all the work." The decision to assign grades equally to the team, rather than allowing intra-group allocations, was made more for expediency than anything else. Since the grade component for the case was only 5% of the overall grade, the cost of administering an intra-group evaluation was judged to be greater than its benefits. Were the component higher, or if there were greater concern about the extent of free-rider problems, an intra-group evaluation might be advisable. Several citations exist on methods of incorporating such an evaluation (see, for example, Knechel, 1992; Stout, 1996; or Greenstein and Hall, 1996). Additional studies of those pedagogical models are needed in order to assess the fairness of the evaluation processes related to them.

This study was not an experiment in the traditional sense. Rather, it examined various characteristics associated with a particular pedagogy and its implementation in a real classroom. Obviously, the first priority in the class was to have the best possible pedagogy and associated evaluation system in place, such that learning potential was maximized; there were, therefore, no experimental manipulations among subjects. Future research may be well served by examining student behaviors within an experimental setting, where variables similar to those examined in this study can be evaluated under more controlled circumstances.

In particular, this study is limited in that the students examined, for the most part, were traditional students who matriculated directly into the graduate program. Further, the students examined were predominately non-Hispanic white males. Effects of interactions among more diverse student populations are well worth considering in future study. Numerous studies, for example, find that male and female students are rated differently in peer evaluations (e.g., Park, DiRaddo and Calogero, 2009; Sellnow and Treinen, 2004; Aires, 1996; and the many studies conducted by Sadker and Sadker, e.g., 1990). Gender-based interactions, as well as those among populations enriched with foreign students, African American students, non-traditional students, etc., suggest possible extensions of the current study.

REFERENCES

Accounting Education Change Commission (1990) Objectives of education for accountants: Position statement No. 1, Issues in Accounting Education, 5(2), pp. 307-312.

Adler, R. W., Whiting, R. H. and Wynn-Williams, K. (2004) Student-led and teacher-led case presentations: Empirical evidence about learning styles in an accounting course, Accounting Education, 13(2), pp. 213-229.

Aires, E. (1996) Men and women in interaction: Reconsidering the differences. New York: Oxford University Press.

Albrecht, W. S., Clark, D. C., Smith, J. M., Stocks, K. D., and Woodfield, L. W. (1994) An accounting curriculum for the next century, Issues in Accounting Education, 9(2), pp. 401-425.

American Accounting Association: Committee on the Future Structure, Content, and Scope of Accounting Education (1986) Future accounting education: preparing for the expanding profession, Issues in Accounting Education, 1(1), pp. 168-195.

Ballantine, J. A. and Larres, P. M. (2004) A critical analysis of students' perceptions of the usefulness of the case study method in an advanced management accounting module: the impact of relevant work experience, Accounting Education, 13(2), pp. 171-189.

Bebeau, M. J. and Thoma, S. J. (2003) Draft Guide for DIT-2. (Minneapolis: Center for the Study of Ethical Development).

Beck, L. W. (1990) Kant: Foundations of the Metaphysics of Morals, 2nd ed. (New York: Macmillan).

Greenstein, M. M. and Hall, J. A. (1996) Using student-generated cases to teach accounting information systems, Journal of Accounting Education, 14(4), pp. 493-514.
Greguras, G., Robie, C. and Born, M. (2001) Applying the social relations model to self and peer evaluations, Journal of Management Development, 20(6), pp. 508-525.

Humphreys, P., Greenan, K. and McIlveen, H. (1997) Developing work-based transferable skills in a university environment, Journal of European Industrial Training, 21(2), pp. 63-69.

Kilpatrick, D. J., Linville, M. and Stout, D. E. (2001) Procedural justice and the development and use of peer evaluations in business and accounting classes, Journal of Accounting Education, 19(4), pp. 225-246.

Knechel, W. R. (1992) Using the case method in accounting instruction, Issues in Accounting Education, 7(2), pp. 205-217.

Libby, P. A. (1991) Barriers to using cases in accounting education, Issues in Accounting Education, 6(2), pp. 193-213.

Lindquist, T. M. (1995) Traditional versus contemporary goals and methods in accounting education: Bridging the gap with cooperative learning, Journal of Education for Business, 70(5), pp. 278-284.

Park, L. E., DiRaddo, A. M. and Calogero, R. M. (2009) Sociocultural influence and appearance-based rejection sensitivity among college students, Psychology of Women Quarterly, 33, pp. 108-119.

Poundstone, W. (1992) Prisoner's Dilemma. (New York: Anchor Books).

Sadker, M. and Sadker, D. (1990) Confronting sexism in the classroom. In S. L. Gabriel & I. Smithson (Eds.), Gender equity in the classroom: Power and pedagogy (pp. 176-187). Urbana: University of Illinois.

Sellnow, D. D. and Treinen, K. P. (2004) The role of gender in perceived speaker competence: An analysis of student peer critiques, Communication Education, 53(3), pp. 286-296.

Sherrard, W. R., Raafat, F. and Weaver, R. R. (1994) An empirical study of peer evaluations: Students rating students, Journal of Education for Business, 70(1), pp. 43-47.

Stout, D. E. (1996) Experiential evidence and recommendations regarding case-based teaching in undergraduate cost accounting, Journal of Accounting Education, 14(3), pp. 293-317.

Thomas, J. R., Nelson, J. K. and Silverman, S. J. (2005) Research Methods in Physical Activity, 5th ed. (Champaign, IL: Human Kinetics).

United Nations (2003) Revised Model Accounting Curriculum, United Nations Conference on Trade and Development, Geneva, Switzerland.

David Malone, Weber State University

Table 1. Average Evaluation = f(N2SCORE)

R-square       0.0205    Root MSE   0.3898
Adj R-square   0.0108    C.V.       8.6966

Source    DF    SS       MS       F       Pr > F
Model      1    0.321    0.321    2.110   0.1494
Error    101   15.34     0.152
Total    102   15.67

Table 2. Change in Evaluation = f(Midterm Exam)

R-square       0.001     Root MSE   0.258
Adj R-square  -0.011     C.V.    -276

Source    DF    SS       MS       F       Pr > F
Model      1    0.008    0.008    0.118   0.732
Error     82    5.467    0.067
Total     83    5.475

Table 3. Second Half Mean Evaluations = f(Midterm Exam)

R-square       0.008     Root MSE   0.386
Adj R-square  -0.0012    C.V.       8.572

Source    DF    SS       MS       F       Pr > F
Model      1    0.129    0.129    0.869   0.353
Error    109   16.21     0.149
Total    110   16.34

Table 4. Average Technical Evaluation = f(Quiz Average)

R-square       0.0054    Root MSE   0.3861
Adj R-square  -0.0038    C.V.       8.583

Source    DF    SS       MS       F       Pr > F
Model      1    0.088    0.088    0.588   0.4447
Error    109   16.25     0.149
Total    110   16.34

Table 5. Dispersion of Scores of Technical Merit = f(Midterm Exam)

R-square       0.0146    Root MSE   0.2481
Adj R-square   0.0051    C.V.      53.543

Source    DF    SS       MS       F       Pr > F
Model      1    0.094    0.094    1.530   0.2189
Error    103    6.342    0.062
Total    104    6.436
Table 6. Unadjusted Peer Evaluations = f(Case Difficulty)

R-square       0.0652
Adj R-square   0.0567

Source    DF    SS       MS       F       Pr > F
Model      1    0.356    0.356    7.737   0.006
Error    111    5.109    0.046
Total    112    5.465

Table 7. Mean Peer Evaluations = f(Case Isolation)

                       Grouped Cases   Isolated Cases
Mean                       4.559           4.557
Variance                   0.0085          0.0302
Observations              11               9
Degrees of Freedom        10               8
F                          0.2813
p-value                    0.0323
