The convergent and discriminant validity of NSSE scalelet scores

Faculty and administrators are more likely to take responsibility for student learning and development if they believe that assessment data represent their students and identify specific actions for improvement. An earlier study found that NSSE scalelets provide dependable metrics for assessing student engagement at the university, college, and department levels. Building on the earlier study, the findings of the current research indicate that the NSSE scalelets have greater explanatory power and provide richer detail than the NSSE benchmarks.


Gary R. Pike
Assessment has become an integral part of American higher education, and surveys of constituents are an important element in assessment efforts. The National Center for Postsecondary Improvement (NCPI) reported that 96% of the 1,400 institutions responding to its survey had implemented some form of assessment, and 75% used surveys in their assessment efforts (Peterson, Einarson, Augustine, & Vaughan, 1999). Although assessment is widespread, examples of assessment data, including survey results, being used to effect institutional change are relatively rare (Banta, 2002; Ewell, 2002; Peterson et al.). Pike (2002, p. 147) concluded "there is no greater problem in assessment than our inability to influence decision making with assessment results." A major barrier to using survey data for improvement is that many campus decision makers, particularly deans and department heads, find the results of institutional surveys to be too general (Kuh, Gonyea, & Rodriguez, 2002). That is, the results do not suggest specific courses of action. Experience indicates that faculty and administrators are more likely to take responsibility for student learning and development if they believe that assessment data represent their students and identify specific actions for improvement. Case studies of effective use of survey results reveal that surveys lead to improvement when the data are broken down or disaggregated at the college or department level and focus on a few highly related items that suggest specific actions (El-Khawas, 2003; Kezar, 2002, 2003). Presenting results that are specific to a department or college frequently requires that a survey be administered to a large number of students in order to produce dependable measures (Indiana University Center for Postsecondary Research, 2001; Kuh, Gonyea, & Rodriguez, 2002). This is a requirement that institutions may not be able to meet for many of their programs.
In an earlier article, I proposed that researchers and assessment professionals use scalelets to overcome the challenges posed by the need to present survey data that are specific to a department or college (Pike, 2006). My generalizability study found that the 12 NSSE scalelets I developed yielded dependable college experience mean scores based on relatively few (i.e., 25 to 50) respondents. The present research builds on my earlier generalizability study and examines the convergent and discriminant validity of using the 12 NSSE scalelet scores in assessment research.

Background
The term scalelet is derived from the concept of testlets proposed by Wainer and Kiely (1987, p. 190): "A testlet is a group of items related to a single content area that is developed as a unit and contains a fixed number of predetermined paths that an examinee may follow [in a computerized adaptive test]." The use of testlets allows developers to construct test units that contain more than one item and reduces problems associated with context and order effects (Wainer et al., 1990). The result is test scores with greater dependability and less error than scores based on single items.
A scalelet consists of a set of survey questions related to a specific aspect of the educational experiences of a group of students. Three elements of this definition require elaboration. First, a scalelet consists of a set of survey questions. It is not possible to make generalizations about a construct, such as involvement in cocurricular activities, based on a single survey question (e.g., the number of organizations to which a student belongs). The richness of the constructs used in outcomes assessment requires that generalizations based on a single question be limited to that question. In the preceding example, generalizations should be made only about the number of organizations to which the student belongs.
Although the ideal would be to base generalizations about students' experiences on all possible questions about those experiences, the reality of survey research requires that assessment professionals base their conclusions on responses to a sample of questions. The second element in the definition of scalelets, that questions relate to a specific educational experience, allows a relatively small sample of items to be included in a scalelet. In effect, there is a continuum ranging from very broad generalizations, based on many survey questions, to specific conclusions based on a single item. Scalelets strike a balance between the breadth of generalizations and the number of questions in a survey.
The third element, that scalelets represent the experiences of groups of students, is based on an understanding that assessment for accountability or program improvement requires data about groups (Ewell, 1991; Pike, 1994). The cocurricular experiences of a student are important, but data about a single student provide little information about programs at the institution. Evaluations of the cocurriculum should ideally be based on data about the experiences of all students or at least a representative sample of students.
The use of scalelets requires that researchers and assessment professionals make several different generalizations from samples to populations. Evaluating the quality and effectiveness of a program requires that an assessor make generalizations about the effectiveness of a program based on a sample of questions about the program. Likewise, the assessor may need to make generalizations about all of the students in a program based on a sample of students in that program. Generalizability theory, developed by Cronbach, Gleser, Nanda, and Rajaratnam (1972), provides a mechanism for researchers, assessment professionals, and policy makers to identify the limits of the inferences they can draw from their samples (Brennan, 1983; Shavelson & Webb, 1991).
In an earlier study, I examined the dependability (i.e., generalizability) of 12 scalelets drawn from questions used by the National Survey of Student Engagement (NSSE). Because the scalelets required that generalizations be made over samples of items and samples of students, the generalizability of group means was the focus of the study (Kane, Gillmore, & Crooks, 1976; Pike, 1994). Based on the responses of 50 randomly selected seniors each from 50 randomly selected institutions, I found that all 12 scalelets produced dependable group means (Eρ² > 0.70) with 25 to 50 respondents (Pike, 2006).
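For readers unfamiliar with generalizability coefficients for group means, the general form for a design in which persons are nested within groups and crossed with items (the design family treated by Kane, Gillmore, & Crooks, 1976) can be sketched as follows; the specific design and notation used in Pike (2006) may differ:

$$
E\rho^2_{(g)} \;=\; \frac{\sigma^2_{g}}{\sigma^2_{g} \;+\; \dfrac{\sigma^2_{gi}}{n_i} \;+\; \dfrac{\sigma^2_{p:g}}{n_p} \;+\; \dfrac{\sigma^2_{pi:g,e}}{n_p\, n_i}}
$$

Here $\sigma^2_{g}$ is the variance among group (e.g., institution) means, $\sigma^2_{p:g}$ the variance among persons within groups, $\sigma^2_{gi}$ the group-by-item interaction variance, and $n_p$ and $n_i$ the numbers of persons and items sampled. Because the error terms shrink as $n_p$ grows, the coefficient rises toward 1.0 with more respondents, which is why 25 to 50 respondents can suffice for dependable scalelet means.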
Although dependability is a necessary condition for demonstrating that scalelets can provide valid scores for assessment research, it is not sufficient (Messick, 1989). Additional criteria must be satisfied. Banta and Pike (1989) have argued that these criteria should include the convergent and discriminant validity of scores.
Given that the objective of using scalelets in outcomes assessment is to make judgments about educational quality, validity provides a reference point for evaluating scalelets because questions about validity focus on the adequacy and appropriateness of the inferences drawn from data (Cronbach, 1971; Messick, 1989). Although several criteria can be used to evaluate the validity of a measure, there is a growing sentiment that validity should be treated as a unitary concept with construct validity at its core (Angoff, 1988; Messick). Loevinger (1957) grouped questions related to construct validity into three categories: (a) the extent to which items are accounted for by the construct (a substantive component), (b) the extent to which relationships among the items reflect relationships within the construct (a structural component), and (c) the extent to which relationships between scores and other variables are consistent with theories of the construct (an external component).
The approach used in this study is based on Loevinger's (1957) external component of construct validity and focuses on the concepts of convergence and discrimination (Campbell, 1960; Campbell & Fiske, 1959; Fiske, 1982). Banta and Pike (1989) and Pike (1989, 1992) used this approach to evaluate several tests of general-education outcomes. For an assessment measure to be a valid indicator of program effectiveness, scores should be associated (i.e., converge) with other measures of educational quality and student learning. Moreover, these relationships should transcend institutional characteristics such as size, selectivity, mission, and control. In addition, the scores should discriminate among different quality indicators. That is, scores for some measures should be associated with one set of learning outcomes, whereas the scores for other measures should be associated with different learning outcomes. Absence of discrimination would indicate that scalelets are not needed and that a total score would be a sufficient indicator of program quality.
Because the purpose of this research was to evaluate measures of student engagement, student-engagement theory served as the construct against which the scalelets were judged. This theory has its origin in the work of Pace (1980, 1984), Astin (1984, 1985), and Kuh and his colleagues (Kuh, Schuh, Whitt, & Associates, 1991). Although the writers used different terminology (e.g., quality of effort, involvement, and engagement) to describe their concepts, their views were based on the deceptively simple premise that students learn from what they do (Kuh, 2003). A second important premise of student-engagement theory is that, even though the focus is on student engagement, institutional actions influence levels of engagement and learning on campus (Astin, 1985; Kuh, Schuh, et al.; Pace, 1984).
Research has provided consistent support for both assumptions. Studies show that engagement is positively related to test scores and students' reports of learning (Gellin, 2003; Kuh, Hu, & Vesper, 2000; Pascarella et al., 1996; Pike, 1995; Pike, Kuh, & Gonyea, 2003). Moreover, different types of engagement have been found to be differentially related to learning outcomes. For example, Pike (1995) found that students' writing experiences and their interactions with faculty and peers were positively related to English outcomes but negatively related to learning in mathematics. Conversely, involvement in academic activities and extracurricular involvement were not related to learning in English but were positively related to learning in mathematics and the social sciences.
Within the context of student-engagement theory, it is reasonable to expect that measures of student engagement will converge with and discriminate among measures of student learning. Two questions, corresponding to the concepts of convergence and discrimination, formed the basis for the current research:

1. Are NSSE scalelet scores significantly related to institutional measures of student learning and development after accounting for institutional characteristics?

2. Are NSSE scalelet scores differentially related to student learning outcomes after accounting for institutional characteristics?
The answer to the first question provides evidence of convergent validity, whereas the answer to the second question provides evidence of discriminant validity.

Research Methods

Data Sources
The data for this study came from the 2004 administration of the NSSE survey, the Integrated Postsecondary Education Data System (IPEDS) data files, Barron's ratings of institutional selectivity, and institutional enrollment reports. The initial sample consisted of 114,061 seniors attending 473 four-year institutions. A comparison of the characteristics of the NSSE 2004 institutions and all four-year colleges and universities revealed that NSSE institutions were very similar to the national profile in terms of geographic region and urban-rural location. Public institutions and master's universities were overrepresented among survey participants, whereas baccalaureate general colleges were underrepresented (Indiana University Center for Postsecondary Research, 2004). At the conclusion of the survey cycle, 45,208 seniors had responded to the survey, a response rate of slightly less than 40%. A comparison of respondents' characteristics to the characteristics of the student populations at participating institutions revealed that women were overrepresented among respondents, as were Caucasians and full-time students. However, the differences were relatively small and should not affect the generalizability of the results (Indiana University Center for Postsecondary Research, 2004). Institutional means based on seniors' responses were used in this study. Complete data, including institutional benchmark and scalelet scores based on the responses of at least 50 seniors, were available for 454 colleges and universities.

Measures
Forty-nine questions from the NSSE survey were used to create 12 scalelets. A list of the questions comprising the scalelets, along with generalizability coefficients for group means based on 50 students, is included in the appendix. The items comprising each scalelet were selected based on face and content validity, and the content of the scalelets paralleled the content of the NSSE benchmarks. For example, most of the items in the Course Challenge, Writing Experiences, and Higher-Order Thinking Skills scalelets were drawn from the Level of Academic Challenge benchmark. Items included in the Active Learning and Collaborative Learning scalelets were from the Active and Collaborative Learning benchmark, and items in the Course Interaction and Out-of-Class Interaction scalelets were from the Student Interaction with Faculty Members benchmark. Many of the items in the Varied Experiences and Information Technology scalelets were taken from the Enriching Educational Experiences benchmark. The Diversity Experiences scalelet also was composed of items from the Enriching Educational Experiences benchmark. The items included in the Support for Student Success and Interpersonal Environment scalelets came from the Supportive Campus Environment benchmark.
Questions about gains in learning were used to create two outcome measures. Originally developed by Kuh, Gonyea, and Palmer (2001), the Gains in General Education scale includes questions about gains in writing, speaking, analytical skills, and general education. The Gains in Practical Competence scale includes gains in computer and information technology, quantitative skills, and knowledge and skills needed for work. In addition to the scalelets and gain scores, the NSSE benchmarks were included in the study, and the items comprising the benchmarks are identified in the appendix.
Several institutional characteristics were included in the study. These variables were institutional control (1 = Private, 0 = Public), Carnegie classification (dummy coded as Doctoral/Research-Extensive, Doctoral/Research-Intensive, Master's, Baccalaureate Liberal Arts, and Baccalaureate General [not coded]), percent of female students, percent of minority students, percent of on-campus students, and percent of full-time students. These measures were taken from IPEDS data. Two other characteristics, Barron's selectivity ratings and Fall 2003 enrollment as reported by the institutions, were included in the study.

Data Analysis
Institutions served as the units of analysis in the study. Initially, correlations among institutional characteristics, NSSE benchmarks, scalelets, and outcome measures were calculated to aid in interpreting the results of subsequent analyses. Next, four multiple regression models were specified and tested. In the first model, general-education gains were regressed on institutional characteristics and NSSE benchmarks. In the second model, general-education gains were regressed on institutional characteristics and scalelet scores. Gains in practical skills were regressed on institutional characteristics and NSSE benchmarks in the third model, and practical-skill gains were regressed on institutional characteristics and scalelet scores in the final model.
Testing of regression models was a two-step process. First, gain scores were regressed on institutional characteristics. Second, benchmark or scalelet scores were added to the models. Significance tests and measures of explained variance were calculated for both steps to evaluate convergence. Significance tests provided indications of whether there were relationships between benchmarks or scalelets and gains, whereas changes in explained variance indicated whether the relationships were educationally important. Standardized regression coefficients from the second step identified unique relationships between benchmarks or scalelets and gains. Different patterns of relationships across scalelets and gains indicated that the scalelets were able to discriminate among different types of engagement and different learning outcomes.
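The two-step (hierarchical) procedure can be illustrated with a small numerical sketch. The data and variable names below are hypothetical, not the actual NSSE variables or results; the sketch simply shows how the change in explained variance between the two steps is computed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical institution-level data (names are illustrative only).
n = 200
private = rng.integers(0, 2, n).astype(float)  # institutional control
selectivity = rng.normal(size=n)               # selectivity rating
scalelet = rng.normal(size=n)                  # an engagement scalelet mean
gains = 0.3 * private + 0.4 * selectivity + 0.5 * scalelet + rng.normal(size=n)

def r_squared(predictors, y):
    """Proportion of variance explained by an OLS fit (intercept added)."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# Step 1: institutional characteristics only.
r2_step1 = r_squared([private, selectivity], gains)

# Step 2: add the engagement measure; the increase in R^2 indicates
# how much variance engagement explains beyond institutional characteristics.
r2_step2 = r_squared([private, selectivity, scalelet], gains)
r2_change = r2_step2 - r2_step1
```

In the study itself, the step-2 predictors would be the full set of benchmark or scalelet scores, and an F test on the R² change would accompany the descriptive comparison shown here.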
Tests of multicollinearity and influence diagnostics were calculated for each model. Preliminary analyses indicated that multicollinearity was not a problem. However, examination of the influence statistics revealed that two institutions had extreme scores that exerted undue influence on the regression results. Both institutions were small private liberal arts colleges with extremely high levels of student engagement. Those institutions were dropped from the final analyses.
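For readers unfamiliar with these diagnostics, the following sketch, using made-up data, shows how variance inflation factors (for multicollinearity) and Cook's distances (for influence) can be computed. The cutoff used below (Cook's D > 4/n) is a common rule of thumb, not necessarily the criterion applied in this study:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up predictors; x2 is deliberately correlated with x1.
n = 100
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = x1 + x2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])  # design matrix with intercept

def vif(X, j):
    """Variance inflation factor for column j: equivalent to 1 / (1 - R^2)
    from regressing predictor j on the remaining columns."""
    others = np.delete(X, j, axis=1)
    beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ beta
    tss = ((X[:, j] - X[:, j].mean()) ** 2).sum()
    return tss / (resid @ resid)

# Cook's distance for each observation, from leverages and residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
p = X.shape[1]
mse = (resid @ resid) / (n - p)
leverage = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
cooks = (resid ** 2 / (p * mse)) * leverage / (1 - leverage) ** 2

influential = np.where(cooks > 4 / n)[0]  # rule-of-thumb cutoff
```

Observations flagged as influential would then be examined individually, much as the two extreme institutions were examined and dropped here.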

Results
Table 1 displays the correlations and standardized regression coefficients representing the relationships between the gain measures and institutional characteristics, NSSE benchmarks, and scalelet scores. Most of the independent variables included in the regression analyses were significantly correlated with general-education gains, and many of the independent variables were significantly correlated with gains in practical skills. Moreover, the results of the regression analyses provided clear evidence of the convergent validity of the NSSE benchmarks and scalelet scores. Institutional characteristics and NSSE benchmarks accounted for 78.0% of the variance in general-education gains, and the NSSE benchmarks alone accounted for 30.7% of the variance. The relationships were slightly stronger for the model that included scalelet scores. Institutional characteristics and scalelet scores accounted for 81.3% of the variance in general-education gains, and the scalelets accounted for 34.0% of the gain-score variance.
Evidence supporting the convergent validity of scalelet scores was more pronounced for gains in practical skills. Institutional characteristics and NSSE benchmarks combined to explain 40.3% of the variance in practical-skill gains, and 22.2% of this variance was explained by the benchmarks. Institutional characteristics and scalelet scores explained 53.6% of the variance in gains in practical skills. Of the variance in practical-skill gains, 35.5% was explained by the scalelet scores.
The standardized regression coefficients, shown in Table 2, also provide evidence of convergent validity. However, caution should be used in interpreting the coefficients due to the intercorrelations among institutional characteristics and engagement measures. Statistically significant regression coefficients that have the same sign as the corresponding correlations, which are also statistically significant, indicate that the variables uniquely contribute to the variance in a gain score. Nonsignificant coefficients or coefficients with signs that are opposite the signs of the corresponding correlations indicate that the variable does not uniquely contribute to the gain measure.
Evidence of the convergent validity of the NSSE scalelets can be found in the fact that gains in general education, which include gains in writing and analytical skills, were related to the Writing Experiences and Higher-Order Thinking Skills scores. Practical-skill gains, which include gains in understanding and using information technology, were positively related to scores for the Information Technology scalelet.
The multiple regression results also provide evidence of discriminant validity. Two types of evidence were found. First, the relationships between scalelet scores and gains were more highly differentiated than the relationships between gain scores and the NSSE benchmarks. For example, both the Course Interaction and Varied Experiences scalelets uniquely contributed to the variance in general-education gains, but the Student Interaction with Faculty Members and Enriching Educational Experiences benchmarks were not uniquely related to gains. Although the Active and Collaborative Learning benchmark was positively related to gains in practical skills, only the Collaborative Learning scalelet had a statistically significant relationship with this type of gain. The Enriching Educational Experiences benchmark was negatively related to practical-skill gains, but the relationships for the scalelets derived from this benchmark were mixed (see Table 1).

A comparison of the relationships between scalelet scores and gains across the two outcome measures provided additional evidence of discriminant validity. All three of the scalelets derived from the Level of Academic Challenge benchmark were related to general-education gains, but only the Higher-Order Thinking Skills scalelet was related to practical-skill gains. Neither the Active Learning nor the Collaborative Learning scalelets were related to general-education outcomes, but Collaborative Learning scores were related to gains in practical skills. Conversely, Course Interaction scores were related to general-education but not practical-skill gains. Varied Experiences scores were positively related to gains in general education but negatively related to gains in practical skills. Information Technology scores were positively related to gains in practical skills.

Limitations

Although the results for NSSE 2004 are generally consistent with the results reported across the first few years of surveys, only one year of data was analyzed in this study. If institutions participating in other years were included, the results might differ in unknown ways. In addition, the data in this study are specific to the NSSE survey. Consequently, the results of this study do not indicate that valid scalelets can be developed for other surveys. Furthermore, the data for the validity analyses were at the institution, rather than the college or department, level. If department-level data had been used, different results might have been obtained.
The most serious limitation is that the criterion variables for establishing convergent and discriminant validity were students' self-reports of their learning. Although self-report data have been studied extensively and shown to yield valid assessment information (see Kuh, 2001), both the measures to be evaluated and the criteria for evaluation used the same measurement method. Messick (1989) noted that the use of a single measurement method in validity studies may produce misleading results due to shared, method-specific variance. The presence of method-specific variance in this study may explain why most of the correlations between outcome measures, NSSE benchmarks, and scalelet scores were positive and statistically significant.

Discussion
Despite these limitations, the results of the present research have important implications for assessment practice. For institutions that participate in the NSSE, this study indicates that NSSE scalelet scores provide valid measures of students' educational experiences and can be used for institutional assessment and improvement. The presence of strong relationships between scalelet scores and self-reported gains in student learning is one indication of the convergent validity of these scores. Scalelet scores accounted for approximately 35% of the variance in both gain measures. Scalelet scores also evidenced greater explanatory power than the NSSE benchmark scores. Increases in explained variance ranged from 3% for general-education gains to 13% for gains in practical skills. Most important, the relationships supporting the convergent validity of scalelet scores were consistent with student-engagement theory. That is, a particular type of involvement was associated with gains in a corresponding content area or skill. For example, greater involvement with writing was positively related to gains in general education, which included gains in writing. Likewise, experience with information technology was positively related to gains in practical skills, including gains in the ability to use information technology effectively.
The results of this study also provide evidence of the discriminant validity of NSSE scalelet scores. Generally, the relationships between engagement and outcomes were more nuanced for scalelet scores than for the NSSE benchmark scores. For example, the Active and Collaborative Learning benchmark was positively related to gains in practical skills. However, the analysis of scalelet scores revealed that this relationship was present for collaborative learning but not active learning. Similarly, the regression analyses indicated that scores on the Student Interaction with Faculty Members benchmark were positively related to gains in practical skills. Regression results for the scalelet scores indicated that out-of-class interaction with faculty members was related to gains in practical skills but interaction during class was not.
Once again, the evidence supporting the discriminant validity of scalelet scores is consistent with student-engagement theory. Take, for example, the relationship between the Varied Experiences and Information Technology scores and the two learning-outcome measures. Both the Varied Experiences and Information Technology scalelets are subsumed by the Enriching Educational Experiences benchmark. In the current study, Enriching Educational Experiences scores are not significantly related to gains in general education, but they are negatively related to gains in practical skills. Both results are surprising given that student-engagement theory suggests that many of the activities included in the benchmark, such as interacting with diverse groups of students, attending campus events, and participating in learning communities, are associated with gains in general education. Similarly, use of electronic technology, which is also included in the Enriching Educational Experiences benchmark, should be positively related to gains in practical skills. When gain scores were regressed on the scalelet scores, the results conformed to expectations. Scores on the Varied Experiences scalelet were positively related to gains in general education but negatively related to gains in practical skills. Scores on the Information Technology scalelet were positively related to gains in practical skills but not significantly related to gains in general education.
These results do not warrant abandoning the NSSE benchmark scores. Those scores serve the very useful purpose of providing senior administrators with a general overview of engagement on their campuses. Scalelet scores are most useful to academic affairs, student affairs, and assessment professionals who are charged with taking NSSE results and translating them into a series of action items to improve the student experience on campus. In addition, the present research should be considered a starting point for further research on the NSSE scalelets. The results of this study indicate that the NSSE scalelet scores can provide useful information for improvement at the institution level. More research is needed to demonstrate the convergent and discriminant validity of scalelet scores at the college and department levels. As Messick (1989) noted, validation is an ongoing process of collecting and synthesizing information about the accuracy and appropriateness of scores for a variety of uses and across a variety of contexts.
The results of this research also have important implications for institutions that are involved in survey research but do not participate in NSSE. Scalelets are not unique to NSSE. Institutions that use other commercially available surveys, such as the College Student Experiences Questionnaire or the Cooperative Institutional Research Program freshman survey, may want to consider developing scalelets that are specific to those instruments. The key to developing scalelets is identifying a limited number of survey questions that are related to a specific aspect of students' educational experiences. Once possible scalelets have been identified, research can be conducted to evaluate the generalizability and validity of scalelet scores.
Scalelets can also be constructed for institutions using locally developed surveys. In fact, scalelets may hold the greatest promise for surveys developed by colleges and universities because they suggest a new model for survey construction. Wainer and Kiely (1987) argued that testlets are the basic building blocks of computer adaptive tests. Test questions are important only insofar as they contribute to the development of testlets. In an earlier article, I argued that the same model can be applied to survey development (Pike, 2006). The development process would begin with the identification of the constructs to be assessed by the survey. The definitions of these constructs would serve as the frameworks around which the scalelets would be developed. Samples of items would be generated and tested for each construct in order to identify groups of four or five questions that would yield generalizable and valid scalelet scores. The final step in the process would be to combine the items comprising the scalelets with biographic and demographic questions to form a completed survey.
Although identifying strategies for improving student learning was not an objective of this study, the results do suggest that some types of engagement initiatives may result in broad learning gains, whereas other engagement initiatives may yield more focused improvements. If an institution is interested in increasing gains in a broad range of learning outcomes, the evidence suggests that the institution should focus on strategies that would improve support for student success, the quality of the interpersonal environment, and students' higher-order thinking skills. Institutions interested in improving specific learning outcomes should focus on improving collaborative learning, course interaction, the variety of experiences available to students, and the use of information technology.
Correspondence concerning this article should be addressed to Gary R. Pike, Office of Planning and Institutional Improvement, Indiana University-Purdue University Indianapolis, 355 N. Lansing Street, AO 140, Indianapolis, IN 46202-2896.