Item Banking for an Adaptive Measurement of Neuroticism

This paper presents the psychometric properties of a bank of 36 items measuring Neuroticism based on the Five-Factor Model. The items cover the facets identified by McCrae and Costa. The sample comprised 1133 adults residing in the Buenos Aires Metropolitan Area, Argentina. Women accounted for 52.1% of the sample, and the mean age was 29.5 years (SD = 11.32). A random 70% of the cases was used to calibrate the items under Item Response Theory (Graded Response Model), obtain the bank's information functions, and estimate associations with other instruments. An adaptive administration was simulated with the remaining 30% to test two stopping rules: a) a fixed length of 18 items and b) a standard error ≤ 0.25. Correlations greater than .95 were found between the scores estimated with the full bank and those from the two adaptive versions. The advantages of the adaptive Neuroticism measurement over well-known instruments that use conventional long formats, as well as abbreviated ones, are discussed.


Psychological Thought
South-West University "Neofit Rilski" 2020, Vol. 13(2), 459-485 https://doi.org/10.37708/psyct.v13i2.503
The measurement of Neuroticism for screening purposes in the clinical context, as with other personality traits, requires efficient tests that allow a valid and reliable measurement in the shortest possible time. A similar demand arises in large-scale epidemiological assessments, where fewer Neuroticism items would free administration time for measuring other variables of interest (Baldasaro et al., 2013). The most recognized inventories are usually extensive because they define a hierarchical model of Neuroticism composed of facets and use a considerable number of items to evaluate each of these sub-dimensions exhaustively (Goldberg et al., 2006; McCrae & Costa, 2010). There are also shorter scales that perform a one-dimensional assessment of the domain (Goldberg, 1992; McCrae & Costa, 2007) or reduce the number of facets by either eliminating them (Soto & John, 2017b) or subsuming them (DeYoung et al., 2007). Even short and extra-short tests have been developed (Donnellan et al., 2006; Gosling et al., 2003; Soto & John, 2017a) in order to reduce measurement errors caused by fatigue or boredom.
However, the practical gain provided by these short or abbreviated forms comes at the expense of psychometric quality (Credé et al., 2012). A brief scale that evaluates a broad construct such as Neuroticism includes items with only moderate intercorrelations, which reflect the relative heterogeneity of the content and, consequently, lower the internal consistency indices used to examine reliability (Baldasaro et al., 2013; Sibley, 2012).
The most frequent solution has been to select the items with greater discriminative capacity to elevate the internal consistency, neglecting the representativeness and exhaustiveness of the content (Milojev et al., 2013;Morizot, 2014;Ziegler et al., 2014).
Advances in modern psychometrics have made it possible to apply Item Response Theory (IRT) to make the evaluation of personality traits more flexible and efficient by means of Item Banks and Computerized Adaptive Testing (CAT) (Attorresi et al., 2009; Reise & Revicki, 2015). In a CAT, an algorithm progressively chooses, based on the subject's previous responses, the items in the bank that provide the most information. The administration continues until either the standard error drops below a specified level or the participant has answered the maximum number of questions. This adaptive procedure has an important practical advantage: it allows the evaluation of Neuroticism to be shortened without the loss of reliability or validity that occurs when traditional forms of administration are shortened.
Item Banks and CAT have gained visibility in instrumental studies that evaluate personality-related aspects (e.g., Abal et al., 2019; Nieto et al., 2017; Rubio et al., 2007; Stark et al., 2012). Their construction is more expensive than that of a conventional test, but they have demonstrated important practical advantages (Reise & Revicki, 2015). Indeed, Clinical and Health Psychology are the areas in which the development of CAT has been encouraged for the detection of pathological levels of Depression and Anxiety (e.g., Beiser et al., 2016; Devine et al., 2016; Gibbons et al., 2016).
The general objective of this study was to introduce the construction process of a bank of items for the measurement of the Neuroticism domain according to McCrae and Costa's Five Factor Model (2010). The aim was to contribute with an assessment tool appropriate to the characteristics of the local population, with discriminatory capacity based on individual differences, with solid evidence of validity, and with reliability studies of the generated measurements. In light of the instrumental demands pointed out to obtain an optimal measurement of Neuroticism, and considering the most recent psychometric advances, the following objectives were proposed for this study: a) to calibrate a set of Neuroticism items with IRT to build a bank, b) to obtain evidence of validity based on the correlation between Neuroticism and the variables of personality and psychological symptomatology, c) to examine if the adaptive administration of these items could show advantages and d) to analyze if adaptive administration affects the correlation of Neuroticism with external criteria.
Drawing from these objectives, the following hypotheses were formulated: H1) Despite Neuroticism being defined as a personality domain made up of facets, the construct is expected to fit a unidimensional model to guarantee the bank modeling using IRT.
H2) The score estimated from the Neuroticism item bank will be significantly associated with its conceptually-related personality and psychopathological variables. H3) Adaptive administration allows the instrument to be shortened without compromising its measuring quality.

Participants
The sample consisted of 1133 general-population adults residing in the metropolitan area of Buenos Aires, Argentina, who chose to collaborate. The subjects were selected through non-probability (convenience) sampling. Of these, 52.1% identified as female. The mean age of all participants was 29.5 years (SD = 11.32; Min = 18, Max = 82).
According to McCrae and Costa (2010), the definition of each facet is based on some kind of negative emotion or feeling that gives it its identity. The facets Anxiety and Hostility are built upon the emotions of fear and anger, respectively, while Depression and Self-consciousness are based on the feelings of sadness and shame. Impulsivity and Vulnerability, on the other hand, are behavioral in nature. While the former is described as the inability to resist temptations, the latter is characterized by difficulty in implementing effective coping strategies in stressful situations.
An initial screening of the items was carried out based on the review of seven expert judges and a pilot study, which led to the selection of the 36 items (see Appendix) that were administered (six items for each of the facets). Of these, six items belonged to the Argentine adaptation of the IPIP-NEO Inventory (Cupani et al., 2014) that was included in the International Personality Item Pool (IPIP) by Goldberg et al. (2006). In the experts' opinion, all the items were congruent with the conceptual definition of the facet that they operationalize (Aiken's V ≥ .85). To avoid violating the IRT assumption of local independence, a qualitative analysis of the selected items' content was carried out, which verified that they were not mutually redundant (Abal et al., 2010; Reise & Rodriguez, 2016).
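As an illustration of the content-validity index mentioned above, the following minimal Python sketch computes Aiken's V for a single item using the standard formula V = S / [n(c - 1)], where S is the sum of the ratings minus the lowest scale value, n is the number of judges and c the number of rating categories. The 1-5 rating scale and the seven ratings are hypothetical, since the scale used by the judges is not described here.

```python
def aikens_v(ratings, lowest, highest):
    """Aiken's V for one item: 0 means total disagreement with the item's
    relevance, 1 means every judge gave the highest rating."""
    n = len(ratings)
    # S: sum of each rating's distance from the lowest possible category.
    s = sum(r - lowest for r in ratings)
    return s / (n * (highest - lowest))

# Hypothetical ratings from seven judges on an assumed 1-5 congruence scale.
example_ratings = [5, 5, 4, 5, 5, 4, 5]
v = aikens_v(example_ratings, lowest=1, highest=5)
print(f"Aiken's V = {v:.2f}")
```

Under these assumed ratings the item would exceed the .85 threshold reported for the bank.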
All items had a four-option Likert response format (Disagree, Slightly Disagree, Slightly Agree and Agree). This decision was based on recommendations derived from empirical and simulation studies (e.g., Abal et al., 2018; Lozano et al., 2008) in which four categories were found to be an optimal number for balancing IRT model fit and measurement reliability.
Eysenck Personality Questionnaire Revised, short version, EPQ-R (Eysenck & Eysenck, 1994; adapted by Squillace et al., 2013). It comprised 42 items with a dichotomous response format. At the local level, the adapters replicated the three-factor structure of the Eysenck model (Psychoticism, Extraversion and Neuroticism) and the fourth Lie factor. The reliability studies of the four scales recorded adequate internal consistency indices (KR-20 between .66 and .84), which were slightly lower than those obtained with the sample of the present study.
Symptom Checklist-90-Revised, SCL-90-R (Derogatis, 1994). It consisted of 90 items grouped to measure the intensity of the symptomatology perceived over the previous seven days in nine clinical dimensions (Somatization, Obsessive-Compulsive, Interpersonal Sensitivity, Depression, Anxiety, Hostility, Phobic Anxiety, Paranoid Ideation and Psychoticism). It also yielded three global indices: Global Severity Index, Positive Symptom Distress Index and Positive Symptom Total. The items had a five-option response format (from 0, not at all, to 4, extremely). The local adaptation showed adequate validity evidence and reliability for both non-clinical (Casullo, 2004) and clinical populations (Sánchez & Ledesma, 2009).

Procedure
Participants responded to the protocol individually, with no time limit, in paper-and-pencil format. The administrations were carried out by trained and supervised psychologists, who ensured that the evaluation environment was appropriate.
The examinees were informed about the purpose of this study. Before its administration it was explained to them that the task consisted in responding to a series of inventories that sought to evaluate personality features. It was emphasized that there were no correct or incorrect answers to the questions and that dedication and sincerity in answering was desired. They were informed about the voluntary nature of their participation and the possibility of abandoning the evaluation at any time during the activity. They were also notified that the anonymity and confidentiality of their responses were guaranteed. These considerations were detailed in writing and formed part of the consent that the subjects had to sign before responding.
Item Calibration. A Confirmatory Factor Analysis (CFA) was performed using the Mplus program (Muthén & Muthén, 2010) in order to verify the GRM assumption of unidimensionality. The parameters were estimated using the Weighted Least Squares Mean and Variance Adjusted (WLSMV) method on the basis of the polychoric correlation matrix. To assess the degree of fit, the indicators and criteria recommended by Byrne (2012) were considered: a Comparative Fit Index (CFI) and a Tucker-Lewis Index (TLI) greater than .90 and a Root Mean Square Error of Approximation (RMSEA) less than .08.
The GRM item calibration was performed using the MULTILOG program (Thissen, 2003). The Marginal Maximum Likelihood procedure was used to estimate the item parameters. For each of the 36 items, a discrimination parameter (a) and three location parameters (b1, b2, b3) that separate the adjacent categories of the Likert scale were estimated. In addition, a θ parameter was estimated for each subject to quantify their trait level. The GRM fit was studied with MODFIT (Stark, 2007). This program provided graphs that allowed the comparison of observed and expected probabilities for each item response category at 25 trait levels determined by default. Thus, the program offered information to judge whether the model adequately predicted the empirical curves. The fit was also assessed with the ratio of χ² to its degrees of freedom (χ²/df), computed by MODFIT from the comparison of pairs and triads of items. Following Drasgow et al. (1995), χ²/df values over 3 were considered to reflect model-fit problems.
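To make the role of the a and b parameters concrete, the following is a minimal Python sketch of the category probabilities under Samejima's Graded Response Model for a four-option item. It is not the MULTILOG estimation routine; the function name and the parameter values are hypothetical and chosen only for illustration.

```python
import math

def grm_category_probs(theta, a, bs):
    """Probabilities of the len(bs) + 1 ordered categories under the GRM."""
    # Cumulative curves P*(X >= k | theta), one per location parameter b_k.
    cumulative = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in bs]
    # Pad with P*(X >= 0) = 1 and P*(X >= m + 1) = 0, then take
    # differences of adjacent cumulative curves.
    bounds = [1.0] + cumulative + [0.0]
    return [bounds[k] - bounds[k + 1] for k in range(len(bs) + 1)]

# Hypothetical item: a = 1.2 and ordered locations b1 < b2 < b3.
probs = grm_category_probs(theta=0.0, a=1.2, bs=[-1.5, 0.0, 1.5])
for label, p in zip(["Disagree", "Slightly Disagree", "Slightly Agree", "Agree"], probs):
    print(f"{label}: {p:.3f}")
```

Each location parameter marks the trait level at which the probability of responding in that category or a higher one equals .50, which is why b1, b2 and b3 are said to separate adjacent categories.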

Item Bank reliability and validity studies. Global reliability indicators were obtained: Cronbach's alpha, ordinal alpha and marginal reliability (Thissen, 2003). The Test Information Function (TIF) and the corresponding standard error of measurement were plotted. Additionally, evidence of convergent and discriminant validity was obtained from the correlations between the N estimates made with the complete bank and the EPQ-R and SCL-90-R scales.
Adaptive algorithm. The adaptive administration was studied with the Firestar software (Choi, 2009). A post hoc simulation was performed using the data matrix of the 340 cases set aside for this purpose. In this procedure, the algorithm progressively chose the items it would present if an examinee were responding to a CAT and, after each selection, retrieved the subject's stored response in order to choose the next item.

Successive provisional estimates of θ were made with the Bayesian Expected A Posteriori (EAP) method, using the standard normal distribution as the prior. Items were selected with the Maximum Fisher Information criterion, which progressively picks, from the pool of items not yet presented, the most informative items for each provisional θ estimate. In order to achieve greater content representativeness in the sampling of items, the selection was made at random among the three items with maximum information.
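The two steps described above can be sketched in Python as follows. This is a simplified illustration, not the Firestar implementation: the quadrature grid (61 points between -4 and 4), the function names and the item parameters are all assumptions made for the example. The EAP step multiplies a standard normal prior by the GRM likelihood of the responses given so far; the selection step draws at random among the three most informative unused items.

```python
import math
import random

def grm_probs(theta, a, bs):
    """GRM category probabilities for one item."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in bs] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(bs) + 1)]

def eap_estimate(responses, items, n_points=61, lo=-4.0, hi=4.0):
    """EAP estimate of theta and its posterior SD.

    responses: dict mapping item index -> observed category (0-based).
    items: dict mapping item index -> (a, [b1, b2, b3]).
    """
    grid = [lo + i * (hi - lo) / (n_points - 1) for i in range(n_points)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta ** 2)  # unnormalised standard normal prior
        for idx, category in responses.items():
            a, bs = items[idx]
            w *= grm_probs(theta, a, bs)[category]
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)

def pick_next_item(info_by_item, administered, top_n=3, rng=random):
    """Random choice among the top_n most informative unused items.

    info_by_item: dict mapping item index -> information at the current theta.
    """
    candidates = [i for i in info_by_item if i not in administered]
    ranked = sorted(candidates, key=lambda i: info_by_item[i], reverse=True)
    return rng.choice(ranked[:top_n])

# Hypothetical two-item example: the subject chose the highest category twice.
example_items = {0: (1.2, [-1.5, 0.0, 1.5]), 1: (0.8, [-1.0, 0.5, 2.0])}
theta_hat, posterior_sd = eap_estimate({0: 3, 1: 3}, example_items)
print(round(theta_hat, 2), round(posterior_sd, 2))
```

Randomizing among the three best candidates trades a small amount of information at each step for a more varied sample of item content across examinees.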
Finally, two stopping criteria were tested: a) fixed length when administering 18 items (equivalent to 50% of the bank) and b) variable length when a target measurement precision has been attained (standard error of ≤ 0.25, equal to a classical reliability of .94).
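The equivalence between the target standard error and a classical reliability follows from the usual approximation reliability = 1 - SE², which holds when θ is scaled to unit variance. The short Python sketch below applies it and encodes the two stopping rules; the function names are hypothetical, and the 36-item cap for the variable-length rule corresponds to the size of the bank.

```python
def reliability_from_se(se):
    """Classical reliability implied by a standard error (theta variance = 1)."""
    return 1.0 - se ** 2

def should_stop(n_administered, se, rule, bank_size=36, fixed_length=18, target_se=0.25):
    """Return True when the chosen stopping rule says the CAT should end."""
    if rule == "fixed":
        # Rule a): stop after a fixed number of items, whatever the error.
        return n_administered >= fixed_length
    if rule == "variable":
        # Rule b): stop at the target precision or when the bank is exhausted.
        return se <= target_se or n_administered >= bank_size
    raise ValueError("rule must be 'fixed' or 'variable'")

print(reliability_from_se(0.25))          # 0.9375, i.e. about .94
print(should_stop(12, 0.24, "variable"))  # True
print(should_stop(12, 0.24, "fixed"))     # False
```

By the same formula, a standard error of 0.31 corresponds to a reliability of about .90.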
The efficiency of both procedures was analyzed by correlating the θ estimated from the CAT (in its different stopping rules) with the θ estimated by responding to all items. The impact of adaptive measurement on the relationship of θ with the EPQ-R and SCL-90-R scales was also examined.

Item Calibration
Unidimensionality. The results showed that the data properly fitted the unidimensional model.
GRM application. Fifty-eight iterations were required to reach the convergence criterion for parameter estimation. Table 1 shows the item parameters and their standard errors of estimation, ordered according to the facet that the items operationalize: Anxiety (items 1-6), Hostility (items 7-12), Depression (items 13-18), Self-consciousness (items 19-23), Impulsivity (items 24-29) and Vulnerability (items 30-36). The location parameters b1, b2 and b3 of the set of items were spread along the different levels of the trait, mainly between -3 and 3. A b parameter outside this range was associated with items whose trait description was extreme. The a parameters showed, on average, moderate values, with a mean of 1.18 (SD = 0.38, Min = 0.70, Max = 2). The comparison of the a parameters according to item content revealed variations associated with the facets, with Impulsivity and Hostility showing the lowest values.

In conclusion, both the graphical methods and the fit indices indicated that the GRM was adequate for modeling the Neuroticism items. The set of 36 calibrated items formed the Item Bank that was used in the next phase of the CAT study. The TIF was relatively symmetrical with respect to θ = 0, indicating that the bank measured reliably at the Neuroticism levels where the largest number of individuals were located (Figure 1).

Figure 1. Test Information and Standard Error Functions
The associations between the θ estimates with the complete bank and the EPQ-R and SCL-90-R scales showed results according to what was expected from a theoretical perspective (

CAT simulation
A high positive correlation was found between the θ estimates from the full bank and those from both the fixed-length, r(338) = .98, p < .001, and the variable-length, r(338) = .95, p < .001, adaptive versions. These correlations were weaker when the CAT estimates were related to the total score calculated with classical test theory (Table 3).
The standard error of estimation of θ obtained with the fixed-length CAT varied between 0.18 and 0.31, with an average of 0.22 (SD = 0.023). This implied that an optimum level of precision was reached even when the number of items administered was reduced by half. Under the conditions established by the variable-length version, an average of 12.6 items (SD = 4.41) per subject had to be administered. After 12 items, 59.4% of participants had reached an error ≤ 0.25, and 91.7% required 18 items or fewer. Only two people (0.59%) did not reach the pre-established error, and their evaluation was interrupted on reaching the 36-item limit. Both examinees obtained θ scores more than 1.5 standard deviations below the trait mean, and their standard error did not exceed .27.
Table 1 shows the percentage of cases in which each item was administered in the adaptive versions. Given the starting rule (trait mean) and the item selection method (Maximum Fisher Information), the most used bank items were those whose b parameters were located close to θ = 0 and which had high or moderate a parameters. Items linked to Hostility (items 7 to 12) were unlikely to be chosen because the low values of their a parameters reduced their chances of being administered. This problem was more pronounced in the variable-length CAT, in which Hostility items showed lower usage percentages than in the fixed-length CAT.
Table 3 shows the correlations between the θ estimated with the CAT and the EPQ-R and SCL-90-R variables. These correlations showed that, in general, the relations of Neuroticism with external variables were not altered. This suggested that the tested versions of the adaptive administration did not substantially affect the evidence of convergent and discriminant validity.

Discussion
The clinical relevance of Neuroticism demonstrated in recent years makes it crucial to develop evaluation instruments suited to the demands of application contexts in which efficient measurement is prioritized. The theoretical consolidation of the FFM has motivated the creation of short instruments for Neuroticism and the other personality domains that have proved useful for evaluating large samples (e.g., Cupani & Lorenzo-Seva, 2016; Donnellan et al., 2006; Gosling et al., 2003; Natividade & Hutz, 2015; Soto & John, 2017a, 2017b). But the strategies used within Classical Test Theory to shorten tests carry a psychometric cost that may limit the practical benefit. In this sense, the IRT framework offers an alternative solution for optimizing the evaluation of Neuroticism through Computerized Adaptive Testing.
One aspect that should be analyzed is the relatively lower a parameter values estimated when calibrating the Hostility and Impulsivity items. By definition, Neuroticism is a complex domain composed of a variety of negative feelings and emotions that are interrelated but also heterogeneous enough to be conceptually isolated into facets. In recent years some authors have pointed out that the inclusion of anger and impulse control as an essential part of Neuroticism entails a very risky theoretical compromise (e.g., Tackett & Lahey, 2017; Widiger, 2009). Both facets tend to stand out in validation studies because they present the lowest factor loadings on the Neuroticism factor and because they are associated with similar intensity with other domains of the model. Indeed, quite a few FFM theorists have proposed alternative taxonomies to that of McCrae and Costa (2010) and have operationalized Neuroticism without including at least one of these two facets (e.g., Aguado et al., 2008; Saucier, 2002; Soto & John, 2017b; Taylor & DeBruin, 2006; Watson et al., 2017).
However, in the absence of a consolidated theoretical model that recognizes these variations in the delimitation of the construct, it was decided to maintain a top-down approach that could provide a basis for the bank's development.
Regarding the CAT methodology implemented here, it has been demonstrated that it is possible to obtain estimates of Neuroticism with an optimal degree of precision over much of the trait continuum, even when only part of the items that make up the bank are administered. For both CAT versions, the correlations of θ with the entire bank were high, meeting even the most demanding criterion of r ≥ .95 suggested by Thompson (2009). In the two CAT versions the standard errors of estimation of θ were low (equivalent to a classical minimum reliability of .90), even for those subjects who were two standard deviations above or below the trait mean. With the fixed-length stopping rule only 50% of the bank's items were administered, while with the variable-length stopping rule (error ≤ .25) this proportion fell, on average, to 35% (12.6 of the 36 items).

If the results of the two CAT stopping rules are compared, it can be concluded that the 18-item fixed-length variant was more efficient. Although it was necessary to administer more items, the mean standard error of estimation (0.22) was lower than the target set for the variable-length version (0.25). In addition, the fixed-length CAT estimates of θ were more strongly associated both with the θ of the whole bank and with the total score calculated under classical theory. These latter correlations are particularly important if they are interpreted in light of the content validity of the CAT. The adaptive algorithm selects one item among all the available items based on a psychometric quality criterion (maximum information), regardless of the content of the item. As a consequence, the under-representation of Hostility and Impulsivity items is worse in the variable-length CAT. Because the correlation between the θ estimated with the fixed-length CAT and the θ obtained with the complete bank was higher, content sampling appears to have had less impact on the measurement of the construct in the fixed-length CAT than in the variable-length CAT.
The administration time saved in measuring Neuroticism with an 18-item adaptive version is considerable if the 48 items of the NEO-PI-3 or the 60 items of the NEO-IPIP (Goldberg et al., 2006) are considered. In this regard, it is worth highlighting two aspects that differentiate conventional abbreviated tests from the measurements obtained with the CAT: 1) In the conventional short versions, the coverage of the construct is usually reduced by eliminating items whose content does not show high discrimination capacity at average trait levels. For example, in the NEO-FFI-3, Impulsivity is considered irrelevant for the brief Neuroticism measurement, so it does not include items that operationalize this content.
On the other hand, although with different exposure rates, all items of the bank were used for some of the subjects adaptively evaluated in this study. This implies that the reduction in the number of items administered with the adaptive version does not restrict the representation of the facets in the trait measurement. The content is available in the bank and serves for the evaluation of individuals as long as it can provide information about the trait. Nonetheless, the current limitation of the bank's control of content coverage will be discussed further below.

2) The adaptive procedure offers guarantees of a measurement with a higher level of accuracy even for the extreme trait values, for which conventional tests are more error-prone (Aguado et al., 2005; Reise & Revicki, 2015). Given that it is precisely the extremely high Neuroticism scores that may be most relevant in the clinical context, it seems justified to measurably increase the number of items that would be applied with a short instrument in order to achieve a better precision in the trait estimation.

Conclusion
The results obtained in this study are encouraging because they show that the bank's current psychometric properties would enable an adaptive Neuroticism measurement that is accurate and faster than instruments using conventional forms of administration. However, one limitation of the adaptive design presented is the low control over the content of the items actually answered by each examinee. While content coverage in the entire bank has been guaranteed, the same is not true for the adaptive versions that the subjects responded to. The strong correlations found between the θ estimates with the complete bank and with each CAT indicate that these variations in the content sampled for each examinee did not substantially affect the measurement of the construct. Even so, adaptive algorithms that regulate the representativeness of the facets in the CAT will be analyzed in future studies. To this end, the incorporation of new items into the bank is essential. A greater degree of specificity will be required to identify item contents applicable in the local culture that show more discriminatory capacity, especially for the Hostility and Impulsivity facets. The inclusion of these new items will make it possible to propose improvements in the adaptive procedure to further optimize the Neuroticism measurement.
A complementary line of research under development aims to carry out adaptive Neuroticism measurements at the facet level (Abal et al., 2019). The debate about whether personality traits are better assessed with narrow or broad measures has no conclusive answer (e.g., Ashton et al., 2014; Salgado et al., 2015). However, obtaining a measure of each facet would allow a more complete description and prediction of the profiles of those evaluated. In the future, it will be the bank user who decides whether to measure the domain or the facets according to their evaluation objectives.

Limitations of the study
At this stage of the Neuroticism bank construction, no differential item functioning (DIF) studies have been carried out, which constitutes a methodological limitation of the present study. DIF analysis provides validity evidence that makes it possible to guarantee that the bank's measurements are not conditioned by an individual's membership in a specific group. This would allow the detection of potential bias based on, for example, gender, age range, or even the clinical/non-clinical condition of the evaluated person.

Implications for future research
Further research will also seek to validate a cut-off on the Neuroticism scale in order to differentiate subjects with clinically significant levels. The study of a stopping rule based on clinical criteria can optimize adaptive measurement if it is intended to be used for screening purposes (Fonseca-Pedrero et al., 2013; Rudick et al., 2013). In this circumstance, administration may be briefer, because only the items that show maximum discrimination capacity at the trait level associated with the cut-off point are selected.

Other Support/Acknowledgement
The authors have no support to report.