Measuring Non-Cognitive Skills: Psychometric Validation of Scales
Non-cognitive skills (NCS) are a group of behavioural and attitudinal traits and abilities, such as conscientiousness, perseverance, and self-control (see the Glossary for more examples). Research shows that NCS matter for a range of life outcomes, from education to employment (Cunha et al., 2010; Brunello and Schlotter, 2011). For this reason, economists are increasingly interested in measuring NCS and psychological traits in their work (Kattan, 2017). Studies of NCS in low- and middle-income countries (LMICs) reveal a relationship between lower levels of these skills and both lower earnings (Gertler et al., 2014) and poorer health (Haushofer et al., 2019), a relationship with important implications for poverty alleviation programmes.
Using appropriate tools to measure NCS is crucial to learning accurately about the levels of these skills in different populations. However, researchers wishing to study NCS face the problem that standard assessment tools have been developed in high-income countries (HICs) and may not capture the same constructs accurately in other contexts. Studies testing the performance of standard tools in LMIC settings often find that these measures do not capture the same constructs in a reliable way (Cheng et al., 2013; Esopo et al., 2018; Laajaj and Macours, 2020).
Researchers seeking to measure NCS in LMICs therefore have two options: developing new measures, or using existing measures selected through extensive piloting and validation tests. Developing a new instrument can take many years, making existing scales a much quicker and cheaper option. Baron et al. (2017), Esopo et al. (2018) and Laajaj and Macours (2020) provide best practices to follow when selecting and piloting a pool of candidate instruments, such as back-translation and clinical comparison. This post outlines the process of validating Likert-type scales used to measure NCS. To ensure scales perform well, this validation should be carried out during piloting and as part of quality checks conducted during data collection.
Table 1: Properties used to assess scale performance

Property | Definition
Internal consistency reliability | The extent to which all the items in a scale reliably measure the same attribute, or the interrelatedness of scale items
Stability reliability | A measure of whether a scale is likely to produce similar results under similar conditions when the attributes under study are not expected to change
Endorsement rates | The distribution of responses for each item, evaluated to search for features such as floor and ceiling effects (a high concentration of observations at the extremes)
Construct validity | The ability of the scale instrument to capture only the construct of interest
Comprehension | How well respondents understand instrument items
Table 1 provides an overview of some of the properties that researchers may wish to test to assess scale performance. This list is not exhaustive;[1] however, it provides a broad enough set of properties to gauge how successful the instrument is in capturing the measures of interest.
[1] See DeVon et al. (2007) for further examples of properties of interest.
Table 2 lists some of the psychometric tests and techniques that can be used to assess the properties of candidate scales. For each test, the thresholds and conditions for success are specified in column 3. Stata commands are either listed in column 4 or provided in the Appendix log file (available to download at the bottom of the page). Illustrative Stata sketches for some of these checks are also given below the table.
Table 2: Psychometric tests for assessing scale properties

Test | Properties tested | Test threshold | Stata command/code
Cronbach's alpha (Cronbach, 1951) | Internal consistency reliability | α≥0.7 indicates acceptable internal consistency; α>0.9 may indicate item redundancy, though shortening the scale may not be necessary (DeVon et al., 2007) | alpha varlist [if] [in] [, options]
Test-retest reliability over a period of time (two weeks to one month) | Stability reliability | High correlation, ρ≥0.7 | correlate [varlist] [if] [in] [weight] [, correlate_options]
Exploratory factor analysis (EFA) | Construct validity | Eigenvalues >1.0 | See the detailed post on Exploratory Factor Analysis on CSAE's Coders Corner: https://www.csae.ox.ac.uk/coders-corner/coders-corner
Confirmatory factor analysis (CFA)[1] | Construct validity | Factor loadings >0.4 (see method for EFA); measures of goodness of fit, for example: Comparative Fit Index (CFI)≥0.95; root mean square error of approximation (RMSEA)<0.08; Tucker-Lewis Index (TLI)>0.90 | Appendix Section 1
Maximum endorsement frequencies | Endorsement rates | Flag items for which >80% of responses fall in the same response category | Appendix Section 2
Proportion of scale items each respondent is able to answer | Observation-level comprehension | Drop individual observations for a scale of: 4-5 items, if ≥2 were missing; 6-8 items, if ≥3 were missing; 9+ items, if ≥4 were missing | Appendix Section 3
Proportion of missing responses for each item across all respondents | Item-level comprehension | Drop items which could not be answered by 20% or more of respondents | Appendix Section 3
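As a concrete starting point, the sketch below strings the table's reliability and screening checks into a short Stata quality-check do-file run on pilot data. It is a minimal illustration only, not the code in the Appendix log file: the item variables item1-item5 (a hypothetical five-item scale coded 1-5) and the round-specific totals score_t1 and score_t2 are placeholder names, and the missing-data rule shown is the one for 4-5 item scales.

* ---- Internal consistency: Cronbach's alpha (look for alpha >= 0.7) ----
* item1-item5 are hypothetical Likert items coded 1-5
alpha item1-item5, item            // "item" reports alpha with each item removed

* ---- Stability: test-retest correlation (look for rho >= 0.7) ----
* score_t1 and score_t2 are scale totals from two rounds two to four weeks apart
correlate score_t1 score_t2

* ---- Endorsement rates: largest response share per item (flag if above 0.80) ----
foreach v of varlist item1-item5 {
    quietly count if !missing(`v')
    local nonmiss = r(N)
    local maxshare = 0
    quietly levelsof `v', local(cats)
    foreach c of local cats {
        quietly count if `v' == `c'
        if r(N)/`nonmiss' > `maxshare' local maxshare = r(N)/`nonmiss'
    }
    display "`v': largest response share = " %4.2f `maxshare'
}

* ---- Item-level comprehension: share of respondents unable to answer each item ----
* Drop items missing for 20% or more of respondents
foreach v of varlist item1-item5 {
    quietly count if missing(`v')
    display "`v': share missing = " %4.2f r(N)/_N
}

* ---- Observation-level comprehension: missing items per respondent ----
* Rule for a 4-5 item scale: drop respondents missing 2 or more items
egen n_missing = rowmiss(item1-item5)
drop if n_missing >= 2

The endorsement loop reports, for each item, the largest share of respondents choosing a single response category; values above 0.80 suggest the item adds little information. Which items and observations to drop should follow the thresholds in Table 2 for the relevant scale length.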
[1] While EFA explores the number of common "factors", or constructs, present in the data, CFA tests the collected data against the theoretical factor structure proposed in the literature. For example, the Big Five measure of personality should fit a five-factor model. To conduct CFA, first follow the process used in EFA and evaluate the factor loadings; then estimate a structural model to test how well the data fit the theory. Appendix Section 1 illustrates how to do this for a two-factor model.
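To make the two-step procedure in the footnote concrete, the sketch below fits a hypothetical two-factor CFA with Stata's sem command; the latent factors F1 and F2 and the items item1-item6 are placeholder names, and the Appendix log file remains the worked example referred to in Table 2.

* Hypothetical two-factor CFA: item1-item3 load on F1, item4-item6 on F2
* (names beginning with a capital letter are treated as latent by -sem-)
sem (F1 -> item1 item2 item3) (F2 -> item4 item5 item6), ///
    method(mlmv) standardized

* Goodness of fit: look for CFI >= 0.95, RMSEA < 0.08, TLI > 0.90
estat gof, stats(all)

The standardized output reports the factor loadings directly, so the >0.4 rule in Table 2 can be checked from the same estimation; method(mlmv) retains observations with some missing items rather than dropping them listwise.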