Year: 2020 | Volume: 3 | Issue: 3 | Page: 87-92
Survey research methods: Preparing a validity argument
George M Harrison, Katie A Azama
Department of Educational Psychology, College of Education, University of Hawaii at Manoa, Honolulu, Hawaii, USA
Date of Submission: 16-Aug-2020
Date of Acceptance: 10-Sep-2020
Date of Web Publication: 6-Nov-2020
Dr. George M Harrison
Curriculum Research & Development Group, College of Education, University of Hawaii at Manoa, 1776 University Avenue, Honolulu 96822, Hawaii
Source of Support: None, Conflict of Interest: None
Validity occupies a central role in studies that use survey instruments to contribute to their research conclusions. In a typical process, researchers collect survey data, draw inferential claims about what those data mean, and imply how these interpretations should be used. In this article, we introduce the argument-based approach to validation as a means to clarify and evaluate these claims and their underlying assumptions. We provide an example of how to prepare an argument by stating the proposed interpretations of a nurse practitioner survey along with the anticipated challenges to these inferential claims. We propose the types of evidence we need to counter these challenges and anticipate the limitations that will remain. Using this approach, we economize our validation work, identify evidence that we might otherwise overlook, and avert overstated claims. Researchers can employ similar methods to document their validation work and further strengthen their larger research conclusions.
Keywords: Nurse practitioner, self-efficacy, survey research, surveys, validation, validity evidence
How to cite this article:
Harrison GM, Azama KA. Survey research methods: Preparing a validity argument. Educ Health Prof 2020;3:87-92
Introduction
Survey instruments are pervasive in health profession education research, largely because they are perceived to be easy to develop and are convenient for measuring attitudes, perceptions, and other concepts that cannot easily be observed or tested directly. Typically, the data collected from a survey are interpreted and then used (often in conjunction with other sources of data) to draw larger conclusions in a research study. As a result, when critics investigate the credibility of the study's conclusions, the validity of those survey interpretations becomes a central concern.
Given this central role of validity, it is important for researchers to determine which types of evidence they need to secure in order to convince critics that their survey interpretations are legitimate. An efficient approach to doing this is to anticipate the challenges critics will use to rebut our claims, determine if those challenges will indeed pose serious threats to the claims we intend to make with our survey, and draw from the existing body of knowledge about validity evidence to articulate the type of backing we need to warrant our claims. This constitutes the fundamental process of an argument-based approach to validation. In this methods note, we introduce the argument-based approach and present an example to illustrate how researchers can prepare for such an argument.
Validity and Validation
Although the concept of validity has undergone considerable debate among scholars, there is general agreement in the field of educational and psychological measurement, as documented in the Standards for Educational and Psychological Testing (hereafter the Standards), that validity is “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” and validation “can be viewed as a process of constructing and evaluating arguments for and against the intended interpretations of test scores and their relevance to the proposed use”.[7(p11)] Though these definitions include the terms test and test scores, they apply equally to survey instruments and scores resulting from survey data. Furthermore, consistent with the Standards, most measurement scholars agree that (a) validity is not a property of the instrument but of the inferences drawn from the instrument (i.e., the interpretations), (b) there are no separate types of validity but rather separate types of evidence, (c) validity is a question of strength rather than a yes-or-no property, and (d) validation can be an ongoing endeavor as scientific scholarship reveals new understandings.
For many researchers in the health professions, this conceptualization of validity has not yet been reflected in their lexicon. For example, it is not uncommon in published research to encounter the term “validated survey,” implying that validity is a property of the instrument. In some circles, validity is narrowly conceptualized as a validity coefficient, which provides a convenient numeric estimate but which reduces validity to one type of evidence (an adjusted correlation between two instruments purported to measure the same construct). Moreover, an outdated, yet still tenaciously advised, practice is to address separate types of validity, with researchers referring to an instrument's content validity, construct validity, and criterion-related validity. These are more appropriately addressed as types of evidence for the validity of the claims made using the instrument.
There has been methodological advice to develop a validation study around the five types of evidence described in the Standards. These types include evidence based on (a) instrument content, (b) response processes of respondents, (c) internal structure of instrument items, (d) relations of the resulting scores to other variables, and (e) the consequences of instrument use (all of which we will refer to in the example herein). These classifications are indeed valuable; however, in the argument-based approach, it is the proposed interpretations (for proposed uses) that should drive what types of evidence are needed. Furthermore, with this stance, we can be open to types of evidence that do not fit into these categories and we do not have to commit ourselves to collecting evidence that adds little to our argument, which in turn helps to set realistic goals for a validation study.
The Argument-Based Approach to Validation
The argument-based approach to validation provides a means to prepare for and conduct a validation study that is consistent with the Standards. The approach comprises two stages: (1) a preparation stage in which the proposed inferential claims (i.e., the intended interpretations and uses) and accompanying assumptions that will warrant those claims are documented and (2) an evaluation stage in which the inferences are appraised for coherence and completeness. In the preparation stage, we can anticipate those assumptions that will most likely be challenged and identify the types of evidence we will need to marshal in order to back these assumptions. This in turn will help us plan our evidence-gathering efforts so that we can build the strongest possible argument when it is appraised during the second stage.
Aside from these two stages, there is no prescriptive set of steps for researchers to follow to prepare a validity argument. However, examples of how to apply this approach can guide researchers in their own planning. In the example that follows, we illustrate how to engage in the preparation stage when working with a survey to draw inferences about recent nurse practitioner graduates' self-efficacy levels with professional skills, which will subsequently be used to inform graduate programs' future curriculum decisions.
An Example of How to Prepare for a Validity Argument
Our example involves a survey to be used by nurse practitioner education programs to make decisions about their curricula. If the interpretations drawn from this instrument have weak validity, the curriculum decisions will be misinformed, which can result in undesirable educational experiences for future students and possibly a subsequently underprepared workforce. In what follows, we present the proposed use and interpretations of our example survey. Then, for the proposed interpretation claim, we pose questions that challenge our inferential reasoning. We respond by anticipating the extent to which the challenge would threaten the validity of our interpretations and identify evidence we can rally to back our reasoning. We also provide provisional statements to delimit the proposed claim, or to use Kane's terminology, we identify qualifiers to the survey claims. Finally, we explain the importance of also considering the completeness and coherence of the validity argument when selecting which evidences to marshal for our overall interpretations and uses. In documenting our proposed inferences, anticipated counterarguments, needed evidences, and qualifiers, we reveal the dialectic reasoning inherent in our validity argument and we set forth a plan for validation.
The proposed use of the survey information
The main purpose of this survey is to inform nurse practitioner education programs about their recent graduates' levels of self-efficacy in performing tasks that are required for a nurse practitioner. Education programs can use this information to make decisions about their curricula. For example, if the results suggest that the recent graduates of a program have low self-efficacy in the nurse practitioner skill captured in the NONPF standard, “demonstrates an understanding of the interdependence of policy and practice,” the program can change the curriculum to emphasize activities that develop self-efficacy in this skill. The intended consequence is for programs to provide a better education and to have future students graduate with higher levels of self-efficacy. The intended downstream effect is to improve job retention and workplace satisfaction.
What unanticipated negative consequences are conceivable with this proposed use?
One possible negative consequence is that programs end up diverting too many resources away from skills that are important for reasons other than self-efficacy. With this, it is incumbent upon the programs to document their decision-making and track the outcomes of their curriculum changes. Another scenario, which would arise if the scores reflected something other than self-efficacy, would be that future students are subjected to self-efficacy education experiences that are not needed and which draw resources away from other important curriculum topics. To mitigate this threat to the use claim, it is important to establish the validity of the interpretation claim. Taken together, the evidence needed is classifiable in the Standards as evidence about the consequences of instrument use. For this use inference, qualifying clauses can be “insofar as the programs also use other pertinent sources of information in determining the contents of their curricula,” and “insofar as the survey interpretations have acceptable validity.”
The proposed interpretations
For this proposed use, we need recent graduates' self-efficacy scores for each type of skill that is deemed to be important for programs' curricula. That is, for each skill, we need a score that estimates how self-efficacious graduates are. These are our proposed score interpretations. Using these, the graduate–program developers can identify which skills require attention in their planning decisions.
One Plausible Challenge to our Inference
How can you be sure that the instrument will yield self-efficacy scores for skills needed to inform curriculum decisions?
This is important for determining the meaning of the scores. We need backing to support the interpretation claims about graduates' levels of self-efficacy on each of the skills that are included in programs' curricula. This concern about how well the scores represent the proposed construct is often labeled as construct representation.
The type of backing needed is classified as evidence based on the content. One evidence source is documentation of what the proposed survey's content components are – that is, the types of skills – and which survey items are intended to measure each component. A blueprint of the survey and a description of the content domain will serve as evidence. Justification for how specific these scores will need to be (i.e., their grain size) for the proposed use needs to be explicated. This description and the blueprint explain how the proposed content aligns with the Nurse Practitioner Core Competencies Content standards, which are intended to guide educational programs in curriculum development. It is important to recognize that some skills are inherently more difficult than others (e.g., “develops new practice approaches based on the integration of research, theory, and practice knowledge” is more difficult than “translates research and other forms of knowledge to improve practice processes and outcomes”) and that we would expect self-efficacy to be lower with difficult skills than with easy skills. A blueprint and narrative description of the content domain document this hypothesized skill difficulty, which in turn can help programs use the score interpretations to make curriculum decisions.
A second source of evidence is from subject-matter-expert review. Judgments from external reviewers who are knowledgeable in the content can address questions about (a) the extent to which the components are indeed important for the proposed interpretations, (b) how appropriate the grain size is for these intended interpretations and uses, (c) how accurate the blueprint is, and (d) how well the items align with those components listed in the blueprint.
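The blueprint and expert-review evidence described above can be kept in a simple, checkable form. The following is a minimal sketch under stated assumptions, not the authors' procedure: the skill labels paraphrase competencies quoted in this article, while the item IDs, the minimum-items threshold, and the reviewer judgments are all hypothetical.

```python
# Hypothetical blueprint: skill labels paraphrase competencies quoted in the
# article; item IDs and the min_items threshold are illustrative assumptions.
BLUEPRINT = {
    "policy_practice_interdependence": ["q1", "q2", "q3"],
    "develops_new_practice_approaches": ["q4", "q5", "q6"],
    "translates_research_to_practice": ["q7", "q8"],
}


def underrepresented_skills(blueprint, min_items=3):
    """Return the skills that have fewer than min_items survey items."""
    return [skill for skill, items in blueprint.items() if len(items) < min_items]


def alignment_agreement(ratings):
    """Proportion of reviewers who judged each item aligned with its skill.

    ratings maps an item ID to a list of booleans, one per reviewer.
    """
    return {item: sum(votes) / len(votes) for item, votes in ratings.items()}


# Hypothetical judgments from four reviewers on two items.
EXPERT_RATINGS = {
    "q1": [True, True, True, True],
    "q7": [True, False, True, False],
}

if __name__ == "__main__":
    print(underrepresented_skills(BLUEPRINT))
    print(alignment_agreement(EXPERT_RATINGS))
```

A skill flagged as underrepresented, or an item with low reviewer agreement, would prompt a revision of the items or the blueprint before data collection; what counts as too few items or too little agreement is a judgment the researchers would need to justify.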
Documentation and expert reviews do not completely safeguard against this threat to validity. Enough reviewers with the appropriate qualifications are needed to justify the validity of this evidence. Qualifiers can be “insofar as this content indeed represents what programs should teach,” and “insofar as the subject-matter experts' judgments represent those of their peers.”
A Second Plausible Challenge
How can you be sure that the scores represent self-efficacy and not some other construct such as motivation, social desirability, or comprehension of the items?
This is important because the primary warrant on which we are basing our interpretation claim is that the scores reflect graduates' self-efficacy levels with the skills. If the scores are unduly affected by constructs other than this, our interpretation claims are spurious. This threat is often labeled as construct-irrelevant variance. Indeed, motivation is something we need to consider because recent research has suggested that many survey scales purported to measure self-efficacy tend to conflate self-efficacy and motivation.
Social desirability is a plausible contaminating construct if respondents believe that their data will be shared with their supervisors, as this may influence their willingness to reveal their actual levels of self-efficacy with skills that are important for their work. In a similar manner, recent graduates likely have a self-interest in promoting the quality of their graduate program, which can influence their desire to inflate their self-efficacy reports if they believe that the results will be used to evaluate the quality of their degree.
Item wording can also contaminate the proposed interpretations. For example, if the items measuring one skill are phrased in a way that stimulates low self-efficacy responses, such as with the words “I am confident I can always…” or “I can… no matter what,” the scores can underestimate graduates' self-efficacy. If this wording is not equivalent across skills, the scores will not be comparable when programs use them to make curriculum decisions, which poses a threat to the proposed use inference.
One source of backing is classifiable in the Standards as evidence based on the relations with other variables. If there were another instrument measuring self-efficacy with this same content, we could use that. Unfortunately, none seems to exist. Another external variable is from known groups, such as very experienced nurse practitioners in one group and very recent graduates in another group. We hypothesize self-efficacy scores to be high with the very experienced respondents and low with their counterparts.
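The known-groups comparison can be illustrated with a standardized mean difference. This is only a sketch: the scores below are invented for illustration, and the choice of Cohen's d with a pooled standard deviation is our assumption rather than a method prescribed in this article.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference (a minus b) using the pooled variance."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical skill-level self-efficacy scores (1-5 scale) for two groups.
experienced = [4.5, 4.0, 5.0, 4.5, 4.0]
recent_graduates = [3.0, 3.5, 2.5, 3.0, 3.5]

if __name__ == "__main__":
    d = cohens_d(experienced, recent_graduates)
    print(f"Standardized difference (experienced minus recent): {d:.2f}")
```

A positive difference is in the hypothesized direction, but a difference alone does not rule out alternative explanations for why experienced respondents score higher.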
A second source of empirical evidence is classifiable in the Standards as evidence based on response processes. Using cognitive interviews with participants from the population, we can document what participants report themselves to be thinking while responding to the survey. These are similar to think-aloud protocols but they also include interviewer probes to explicitly ask whether participants' answers reflected motivation, social desirability, acquiescence, or other variables.
A third source of evidence, also classifiable as being about response processes, will be documentation of the item development and the decisions justifying the item wordings. These justifications can be derived from guidelines present in the survey development literature. For instance, if item difficulty (such as difficulty to endorse) is a perceived threat, we can document that the items and response scales are worded the same across skills. Items and instructions can be worded to assure respondents that their data will only be used to recommend changes to the program and not to evaluate the quality of their work or education. A related source of evidence for reviewing item wording can be from reviews by experts in applied linguistics or in the psychology of survey response.
These sources of evidence themselves are prone to counterarguments. For example, an alternative explanation for higher scores with more experienced respondents may be that these participants feel they should respond in this manner. For this reason, the known-groups comparisons alone should not be relied upon to back this claim. Cognitive interviews may allay concerns about motivation as a conflating construct, but they are susceptible to social desirability bias, so the same bias may be present in these interviews, making it difficult to isolate self-efficacy from this construct. Survey development guides often offer conflicting advice, such as the need for briefly worded items versus the need for precision. These limitations are consistent with the caveat that construct-irrelevant variance cannot ever be fully eliminated. Given these limitations, we must qualify our claims, using language such as “the scores likely represent self-efficacy,” and “insofar as the data from the cognitive interviews indeed capture how respondents think through the items.”
A Third Plausible Challenge
How can you be sure that the instrument distinguishes among these skills? What if the instrument really only measures a general nurse practitioner's self-efficacy construct?
This is important because the decisions about which program skills should receive more attention require scores that reflect those skills. In other words, just as was the case with the instrument content, this is an issue with granularity. Our warrant is that the items hypothesized to stimulate responses that reflect graduates' self-efficacy with one skill are more strongly related to each other than the items hypothesized to stimulate responses with other skills. This justifies our use of skill-specific scores rather than an overall self-efficacy score, which would not inform specific curriculum changes. An undesirable pattern, in contrast, would be that the entire set of items simply measures a single construct, as this would not provide useful information.
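The warrant stated above, that items for one skill relate more strongly to each other than to items for other skills, can be checked descriptively before any formal modeling. The sketch below assumes a hypothetical two-skill blueprint and invented responses from six respondents; it compares the average within-skill and between-skill Pearson correlations and is a rough screening step, not a substitute for a factor-analytic model.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

# Hypothetical blueprint and 1-5 responses (one list per item, six respondents).
BLUEPRINT = {"skill_a": ["q1", "q2"], "skill_b": ["q3", "q4"]}
RESPONSES = {
    "q1": [1, 2, 3, 4, 5, 3],
    "q2": [2, 2, 3, 5, 5, 3],
    "q3": [4, 1, 5, 2, 3, 3],
    "q4": [5, 1, 4, 2, 3, 4],
}


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def within_between(responses, blueprint):
    """Mean correlation for same-skill item pairs and for cross-skill pairs."""
    skill_of = {item: skill for skill, items in blueprint.items() for item in items}
    within, between = [], []
    for a, b in combinations(sorted(responses), 2):
        r = pearson(responses[a], responses[b])
        (within if skill_of[a] == skill_of[b] else between).append(r)
    return mean(within), mean(between)


if __name__ == "__main__":
    w, b = within_between(RESPONSES, BLUEPRINT)
    print(f"mean within-skill r = {w:.2f}, mean between-skill r = {b:.2f}")
```

If the between-skill correlations were about as high as the within-skill ones, that pattern would be consistent with a single general construct and would weaken the case for skill-specific scores.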
This concern is a question of dimensionality. Another way to state our warrant is that each score is aligned with its hypothesized dimension. Theoretically, each dimension consists of a latent variable (which represents the construct) that causally determines the responses in its respective items. The instrument blueprint presents this hypothesized dimensional structure.
One source of evidence is from confirmatory factor analysis (CFA). CFA helps us to judge how well the patterns in a set of response data match up with – or fit – the hypothesized dimensional structure. We can also compare the fit of two or more hypothesized structures to see which provides a better explanation of the response data. If the dimensional structure we hypothesized with our blueprint fits the data well (based on fit indexes from the CFA model output), we have supportive evidence. If we also find that our hypothesized structure fits the data better than a general, single-dimension, structure, we have evidence to support the calculation of a separate score for each skill. In the Standards, this is classifiable as evidence based on internal structure.
There are many decisions we face when we use CFA. One of them is how we treat the item-response data. Likert-type response scale data are generally not normally distributed interval-level data, so an appropriate estimation method should be considered (such as methods that estimate ordered categorical data or those that are robust to nonnormality). Evidence supporting these decisions requires documentation, and given that models by their very nature are never perfect representations of the observed data, there will likely be plausible challenges to the conclusions drawn from the CFA. Rather than fall into infinite regress, we will qualify the interpretation claims with “provisionally,” “probably,” and “under the assumption that the CFA results are accurate.”
A Fourth Plausible Challenge
How can you be sure that the items are sensitive to variations in respondents' levels of self-efficacy?
This is a question about the calibration of the items in our instrument. If the items are poorly calibrated to the level of self-efficacy of the respondents, the precision of the scores will be poor. For instance, if the instrument were administered to students in the early stages of their graduate education, we would expect everyone to report low self-efficacy for skills that they are expected to learn about in the program; if it were administered to experts in the field, with decades of experience, we would expect very high self-efficacy scores. The variance of the scores with these two populations should be low compared to the systematic variance with our target population. In other words, an instrument that is well calibrated to the target population will be sensitive to variation in the construct of interest.
A source of empirical evidence is from reliability estimates with each score. This can be from internal consistency estimates, such as coefficient alpha or omega, and analysis of item functioning to determine, for instance, if all of the response scale categories for each item are functioning as predicted. Further backing can be from the results of item response modeling, such as a polytomous Rasch model that reveals how well each item's response scale targets the respondents' levels on the construct of interest. In the Standards, this backing is classifiable as evidence based on internal structure.
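As an illustration of one internal consistency estimate, coefficient alpha for a single skill score can be computed from the item variances and the variance of the summed score. The sketch below uses invented responses for three hypothetical items; omega and Rasch-based analyses would require model-fitting software and are not shown.

```python
from statistics import variance


def coefficient_alpha(item_scores):
    """Coefficient alpha for one scale.

    item_scores is a list of per-item response lists, with respondents in the
    same order in every list. Sample variances (n - 1 denominator) are used.
    """
    k = len(item_scores)
    totals = [sum(responses) for responses in zip(*item_scores)]
    item_variance_sum = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


# Hypothetical 1-5 responses from five respondents to three items of one skill.
SKILL_ITEMS = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 4, 5],
    [1, 3, 3, 5, 4],
]

if __name__ == "__main__":
    print(f"alpha = {coefficient_alpha(SKILL_ITEMS):.3f}")
```

An estimate would be computed separately for each skill score, because the proposed interpretations rest on skill-specific scores rather than on an overall score.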
The methods used to obtain these sources of evidence also require careful procedures and attention to assumptions and threats. For instance, novices may overestimate their self-reported self-efficacy until they become aware of how difficult those tasks really are. With this, qualifiers to the claim can be something such as “insofar as the novices truly understand the task difficulty,” and “insofar as these estimates of reliability and item functioning are accurate.”
A Fifth Plausible Challenge
How can you be sure that the scores will represent all of the graduates from a program? It seems plausible that people who are motivated to participate in the survey will differ in their levels of self-efficacy from those who are not motivated.
This is important because our claim is based on the warrant that the scores can be interpreted as representing the self-efficacy levels of all of the graduates from a program. In survey methodology, this threat to validity is referred to as nonresponse error. That is, those people who respond differ from those who do not on the construct being measured.
One source of evidence to collect is follow-up interviews with a sample of the people who were invited but did not respond. If those interviewees who did not earlier respond to the survey are arguably similar in their levels of self-efficacy to those who did respond, nonresponse error may not be unduly affecting the score interpretations.
If those interviews reveal self-efficacy patterns that differ from those who did respond to the survey, an exception to the claim must be added. An exception may be something like this: “The claims apply to people who tend to opt into this type of survey and cannot be interpreted as the self-efficacy levels of the full population of graduates.” If the patterns in the follow-up interviews appear to be similar, the nonresponse error may still be present even if it is undetectable. Therefore, the claims can be qualified with “probably,” or “insofar as the participants represent the population of graduates.”
Anticipating evaluations of the completeness and coherence of our argument
Throughout this process, we also need to anticipate broader challenges about the argument's completeness and coherence. If the argument fails to address crucial plausible challenges, it is incomplete. For instance, if we only considered the challenge about dimensionality and its accompanying factor analysis evidence, other challenges that threaten the validity of our claims would remain. That is, if we were to ignore concerns about the instrument's content and the degree to which the scores will represent self-efficacy with each of the skills (among other challenges), our proposed use (to inform curriculum changes) will lack legitimacy.
Coherence has to do with how well the evidences connect to the larger argument. As an example, consider a practice that is often employed with exploratory factor analysis: if we were to draft a large set of survey items that we believed measured the content that is important for curriculum decisions, administered these items to a large sample, identified dimensions as they emerged from the exploratory factor analysis of the data, and then used these dimensions to dictate which scores we therefore need for informing curriculum decisions, we had better be sure these dimensions adequately serve our intended uses, which is usually documented with an instrument blueprint. If these dimensions do not align with our blueprint, our argument will lack coherence.
The degree of completeness and coherence that we strive for will depend on how ambitious we seek our claims to be. With proposed uses that have high stakes, all eyes will be on our argument. Thus, in preparing for ambitious claims, our budget allocation must cover the high costs of evidence collection, as many large-scale testing companies will attest. With low-ambition low-stakes uses, we can afford more provisional statements and cautious claims, therefore requiring fewer resources for evidence collection. Overall, in preparing for the validity argument, researchers can secure backing for these broader challenges about completeness and coherence through the evaluative judgments of external reviewers who are familiar with the discipline and the audience for whom the research will be important.
Conclusion
This example introduces how we can prepare for a validity argument. With this preparation having been documented, we have guidance in what types of evidence we need to collect and we disclose the provisionality of the proposed interpretations so as to avert overambitious claims about how the information should be used. It is worth noting that all five types of evidence were mentioned in this prepared argument but that the argument was arranged by the perceived threats to validity rather than the specific types of evidence. It is also worth noticing that the final threat included a source of evidence, having to do with sampling, that is not classifiable in the five types of evidence documented in the Standards. This illustrates the value of an argument-based approach because we do not restrict ourselves to particular types of evidence, but instead focus on the proposed claims and the backing needed to support the validity of those claims. The next steps to complete the validation are to carry out the evidence collection work and appraise the strength of the argument given that evidence. Such an appraisal can make it clear to readers the degree to which the claims are indeed valid, which in turn will inform judgments about the credibility of the overall conclusions drawn in the research study.
Finally, it is worth noting that this documentation of the validity argument does not have to reside in the research article itself. Given the space limitations in journal manuscripts, Kane advises that researchers cite outside documentation of the validity argument referred to in the research study. Technical appendices or accompanying publications can serve this role. Minimally, what should be included in research articles is an explicit statement of the proposed interpretations and uses and a description of the most pertinent assumptions and accompanying evidences that are used to justify statements about how valid these interpretations are for their intended uses.
Financial support and sponsorship
Conflicts of interest
There are no conflicts of interest.
References
Artino AR Jr., La Rochelle JS, Dezee KJ, Gehlbach H. Developing questionnaires for educational research: AMEE Guide No. 87. Med Teach 2014;36:463-74.
Kane MT. Validating the interpretations and uses of test scores. J Educ Meas 2013;50:1-73.
Kane MT. An argument-based approach to validity. Psychol Bull 1992;112:527-35.
Kane MT. Validation. In: Brennan RL, editor. Educational Measurement. 4th ed. Washington, DC: American Council on Education; 2006. p. 17-64.
Cronbach LJ. Five perspectives on validity argument. In: Wainer H, Braun HI, editors. Test Validity. Hillsdale, NJ: Lawrence Erlbaum; 1988. p. 3-17.
Newton PE, Baird JA. The great validity debate. Assess Educ Princ Policy Pract 2016;23:173-7.
American Educational Research Association, APA, NCME. Standards for Educational and Psychological Testing. American Educational Research Association; 2014.
Cronbach LJ. Test validation. In: Thorndike RL, editor. Educational Measurement. 2nd ed. Washington, DC: American Council on Education; 1971.
Royal KD. Four tenets of modern validity theory for medical education assessment and evaluation. Adv Med Educ Pract 2017;8:567-70.
Hojat M, Erdmann JB, Gonnella JS. Personality assessments and outcomes in medical education and the practice of medicine: AMEE Guide No. 79. Med Teach 2013;35:e1267-301.
DeVellis RF. Scale Development: Theory and Applications. 4th ed. Thousand Oaks, CA: Sage Publications, Inc.; 2017.
Tsang S, Royse CF, Terkawi AS. Guidelines for developing, translating, and validating a questionnaire in perioperative and pain medicine. Saudi J Anaesth 2017;11:S80-S89.
Tavakol M, Dennick R. The foundations of measurement and assessment in medical education. Med Teach 2017;39:1010-5.
Kane MT. The argument-based approach to validation. Burns M, ed. Sch Psychol Rev 2013;42:448-57.
Newton PE, Shaw SD. Validity in Educational & Psychological Assessment. Los Angeles, CA: Sage; 2014.
Kane M. Validity studies commentary. Educ Assess 2020;25:83-9.
Messick S. Validity. In: Linn RL, editor. Educational Measurement. 3rd ed. New York: Macmillan; 1989. p. 13-103.
Williams DM, Rhodes RE. The confounded self-efficacy construct: Conceptual analysis and recommendations for future research. Health Psychol Rev 2016;10:113-28.
Beauchamp MR, McEwan D. Response processes and measurement validity in health psychology. In: Zumbo BD, Hubley AM, editors. Understanding and Investigating Response Processes in Validation Research. Vol. 69. Cham, Switzerland: Social Indicators Research Series. Springer International Publishing; 2017. p. 13-30.
Zumbo BD, Hubley AM, editors. Understanding and Investigating Response Processes in Validation Research. Cham, Switzerland: Springer; 2017.
Leighton J. Using Think Aloud Interviews and Cognitive Labs in Educational Research. Oxford, UK: Oxford University Press; 2017.
Dillman DA, Smyth JD, Christian LM. Internet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design Method. 4th ed. Hoboken, NJ: Wiley; 2014.
Tourangeau R, Rasinski KA. Cognitive processes underlying context effects in attitude measurement. Psychol Bull 1988;103:299-314.
Peters GJ. The alpha and the omega of scale reliability and validity: Why and how to abandon Cronbach's alpha and the route towards more comprehensive assessment of scale quality. Eur Health Psychol 2014;16:56-69.
McNeish D. Thanks coefficient alpha, we'll take it from here. Psychol Methods 2018;23:412-33.
Raykov T, Marcoulides GA. Thanks coefficient alpha, we still need you! Educ Psychol Meas 2019;79:200-10.
Wilson M. Constructing Measures: An Item Response Modeling Approach. New York: Psychology Press; 2005.
Masters GN. A Rasch model for partial credit scoring. Psychometrika 1982;47:149-74.