ORIGINAL REPORT
Malan ZHANG, MD, PhD1, Yun ZHANG, BMS2, Minghong SUI, MD, PhD3, Liyin WANG, BSc4, Ziling LIN, MMSc5, Wei SHEN, MMSc6, Jiani YU, MD, PhD7 and Tiebin YAN, MD, PhD8,9
From the 1Department of Exercise Rehabilitation, College of Exercise and Health, Guangzhou Sport University, Guangzhou; Departments of Rehabilitation: 2The Fifth Hospital of Xiamen, Xiamen, 3Shenzhen Nanshan People’s Hospital, Shenzhen, 4Clifford Hospital, Guangzhou, 5The Fifth Affiliated Hospital, Sun Yat-sen University, Zhuhai, 6Guangdong 999 Brain Hospital, Guangzhou, 7Guangdong Province Hospital of Chinese Medicine, Guangzhou, 8Department of Rehabilitation Medicine, Sun Yat-sen Memorial Hospital, Sun Yat-sen University, Guangzhou and 9Guangdong Engineering Technology Research Center for Rehabilitation and Elderly Care, Guangzhou, China
Objective: To quantify the agreement between functional assessments by a single rater and a team using the Chinese version of the International Classification of Functioning, Disability and Health Rehabilitation Set in a clinical situation.
Design: Inter-rater, multi-centre agreement study.
Subjects: A total of 193 adult inpatients admitted to 5 rehabilitation centres at 5 hospitals in China.
Methods: The Chinese version of the International Classification of Functioning, Disability and Health Rehabilitation Set was used by either a single rater or a team to assess 193 patients at 5 Chinese hospitals. Percentage of agreement and quadratic-weighted kappa coefficients were computed. Evaluation times were compared with paired t-tests.
Results: The mean team and individual evaluation times were not significantly different. The percentage of agreement ranged from 46.1% to 94.2% depending on the item, and the quadratic-weighted kappas ranged from 0.43 to 0.92. Eight categories (26.6%) showed a weighted kappa between 0.4 and 0.6, 11 others (36.7%) between 0.6 and 0.8, and another 11 (36.7%) produced kappas of more than 0.8.
Conclusion: Either a single rater or a team of raters can produce valid and consistent ratings when using the Chinese version of the International Classification of Functioning, Disability and Health Rehabilitation Set to assess patients in a rehabilitation department. The team rating approach is suitable for clinical application.
A new team evaluation approach to implementing the rehabilitation measures of the World Health Organization’s International Classification of Functioning, Disability and Health was tested by asking teams including a physician, a nurse, and a physiotherapist or an occupational therapist to evaluate 193 adult inpatients admitted to the rehabilitation departments of 5 hospitals in China. The teams’ ratings were compared with those of single physicians and therapists. The agreement of the assessment results and the time taken by a single rater and a team were compared. There was moderate to high consistency in the ratings, and the mean times taken by the teams and the individual raters were not significantly different. In conclusion, team and single rating can both produce consistent assessments.
Key words: assessment; International Classification of Functioning, Disability and Health; team evaluation; rehabilitation.
Citation: J Rehabil Med 2023; 55: jrm14737. DOI: https://doi.org/10.2340/jrm.v55.14737.
Copyright: © Published by Medical Journals Sweden, on behalf of the Foundation for Rehabilitation Information. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (https://creativecommons.org/licenses/by-nc/4.0/)
Accepted: Sep 26, 2023; Published: Dec 4, 2023
Correspondence address: Tiebin Yan, Department of Rehabilitation Medicine, Sun Yat-sen Memorial Hospital, Sun Yat-sen University, Guangzhou, China. E-mail: yantb@mail.sysu.edu.cn
Competing interests and funding: The authors have no conflicts of interest to declare.
The International Classification of Functioning, Disability and Health (ICF) is officially endorsed by the World Health Organization (WHO) as the international standard for describing and measuring functioning and disability (1). It conceives of functioning as a dynamic interaction between a person’s health, environmental factors and other personal factors (2). In its Global Disability Action Plan 2014–2021 (3) the WHO recommended the ICF as a framework for collecting comprehensive information on functioning and disability. The ICF includes nearly 1,500 categories covering diverse domains of functioning and a wide range of content and related concepts. This makes the ICF difficult to apply in clinical practice (4). To address this problem, ICF Core Sets, condensed from the whole set of ICF categories, have been developed to provide application-tailored shorter lists better related to specific health conditions and healthcare contexts (5–8).
Among the ICF Core Sets, one is a minimal, generic rehabilitation Core Set designed to address one of the most important challenges in health measurement: the comparability of data across studies and countries (9, 10). Although the ICF generic set has demonstrated application feasibility and good properties (11–13), it has only 7 categories, which limits its clinical application. An ICF Rehabilitation Set (ICF-RS) was therefore developed from the ICF generic set to reflect more key functional information universal among different patient populations (10). The ICF-RS includes 9 categories specifically for physical functioning and 21 categories for activities and participation. It can serve as a starting point for developing practical tools that compare a minimum set of data on disability across studies and countries (14).
The original ICF-RS had only a list of categories with some rather unclear definitions. Chinese rehabilitation professionals have been working with the ICF Research Branch to generate simple, intuitive descriptions of the categories in Chinese to promote their nationwide implementation (15). However, the detailed information professionals need to guide the application of the categories in clinical settings is still lacking. To alleviate this problem, an assessment standard has been developed for each category in the Chinese version of the ICF-RS. This provides detailed items easily applied in rehabilitation practice. The standards have demonstrated good validity and reliability (16, 17).
The clinical application of the Chinese assessment standards has, however, raised some problems. An evaluation using the standards involves interviews and clinical examination. It was difficult for a rater to complete the entire evaluation in a single setting, especially with a patient with complex complaints or poor language expression. In addition, the categories refer to 3 dimensions: body functioning, activity and participation. Some of the categories may be more relevant to and better rated by certain professionals.
To address these difficulties, a Delphi expert survey was conducted aiming to develop a new team evaluation approach rather than the default single rater approach to implementing the ICF-RS. It groups the 30 categories into 4 groups to be rated by a physician, a nurse, a physiotherapist, or an occupational therapist according to the content, with 6 categories assigned to the physician, 7 to the nurse, 9 to the physiotherapist, and 8 to the occupational therapist (18, 19). Using this team rating approach, each professional is responsible for evaluating the categories closest to their routine practice. Thus, assessments can be completed more easily without investing too much time. The assessors can easily generate the necessary information in their routine work.
The aim of this study was to quantify the agreement between functional assessments by a single rater and a team using the Chinese version of the ICF-RS in a clinical situation. The study compared the agreement between a single rater and a team of raters and also the time taken to complete the evaluation.
This study applied a design in which each patient was evaluated separately by a single rater and a team of raters who were blinded to each other’s collection of the data. Five rehabilitation departments from general or specialized hospitals participated. Four were from Guangdong Province, including 2 from Guangzhou and 1 each from Shenzhen and Zhuhai. The other participating hospital was in Fujian Province. The Chinese qualitative standards of the ICF-RS have been applied for years in those rehabilitation departments to assess patients’ functioning. Many staff there have been formally trained to use the ICF-RS, so they are familiar with the assessment process.
The participants were recruited from among the inpatients admitted to the 5 rehabilitation departments between July and December 2019. The following inclusion criteria were applied: older than 18 years; at least 2 weeks since onset; conscious with a score ≥6 on the Chinese version of Hodkinson’s abbreviated mental test (good cognitive ability); and continuously able to communicate verbally. Patients were excluded if they were scheduled for discharge within 3 days, were critically ill with unstable vital signs, or were unwilling to cooperate with the whole evaluation process.
Participants meeting the inclusion criteria were recruited by quota sampling. Candidates were first classified as having a neurological, musculoskeletal, cardiopulmonary or other condition. The target proportions at each rehabilitation department were then specified as 50% nervous system dysfunction, 25% musculoskeletal conditions, and 25% cardiopulmonary and other conditions (e.g. tumour, geriatric) (17). The only exception was the Guangdong 999 Brain Hospital, which is a specialized hospital for neurological diseases.
A sample size of at least 50 is considered acceptable for reliability studies (20). Allowing for 20% attrition, the target minimum sample size was therefore set at 63 in this study. The purpose, benefits, risks and confidentiality of the study were explained to each candidate. Any patient could withdraw from the study at will and their treatment would not be affected. The study protocols were approved by the ethics committees of the collaborating hospitals.
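The text does not state how the figure of 63 was derived. One conventional reading, offered here only as an assumption, is that the minimum of 50 was divided by the expected retention rate (1 minus the 20% allowance) and rounded up. The function name `inflate_for_attrition` is hypothetical and used for illustration.

```python
import math


def inflate_for_attrition(n_required: int, attrition_rate: float) -> int:
    """Smallest recruitment target that still leaves n_required cases
    if a fraction `attrition_rate` of recruited cases is lost.

    Assumed formula (not stated in the paper): n / (1 - attrition), rounded up.
    """
    if not 0 <= attrition_rate < 1:
        raise ValueError("attrition_rate must be in [0, 1)")
    return math.ceil(n_required / (1 - attrition_rate))


# Minimum of 50 for a reliability study, 20% allowance for lost cases.
target = inflate_for_attrition(50, 0.20)
print(target)  # 63
```

Under this reading, 50 / 0.8 = 62.5, which rounds up to the 63 reported. Simply adding 20% to 50 would give 60, so the division form is the one consistent with the stated target.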
Five professionals were recruited at each collaborating rehabilitation department. In the single-rater approach, either a physician or therapist served as the single rater. The others formed a team of 4 raters with 1 physician, 1 nurse, 1 physiotherapist and 1 occupational therapist as suggested in the Delphi survey (19). All of the raters had passed a 2-day unified and rigorous training course, which included theoretical study and clinical practice. After the training, each had independently passed the test for ICF-RS raters with an inpatient under the supervision of a trainer to make sure that they had mastered the basic concepts, the evaluation rules and matters needing attention with the Chinese assessment standards. A special group was set up to provide further assessment guidance and to answer any questions in the process of independent evaluation. All of the raters were registered members of their profession and had worked in a rehabilitation department for at least 3 years; hence they had the necessary knowledge and experience related to rehabilitation assessment.
At the beginning of the rating process, the single raters completed a personal and disease information questionnaire describing each person rated, including their age, sex, marital status, education level, occupation, diagnosis and other information.
Hodkinson’s abbreviated mental test (AMT) assesses basic cognitive functioning (21). It has 10 items covering orientation, memory, attention, calculation, and recall. The questions are scored with 1 point for each correct answer, giving a total possible score of 10 points (22). The test was administered to each candidate and patients with an AMT score of 6 or more were included in the subsequent formal evaluation.
The Chinese assessment standard of the original ICF-RS had 9 categories for body function, 14 for activities, and 7 for participation (10). It is used across China to assess the key functions of patients from the acute to the chronic stage (16, 17). In each category, the severity of dysfunction receives 1 of 5 grades. No dysfunction is graded 0; mild dysfunction earns a 1; moderate dysfunction means grade 2; severe dysfunction means grade 3 and complete dysfunction is graded 4. There is also a grade 8 for failure to provide relevant information and a grade 9 used when a category is not applicable to a patient (23).
In contrast, the team evaluation version of the ICF-RS consists of 4 parts (19). In this study 6 categories were assigned to the physician, 7 to the nurse, 9 to the physical therapist (PT), and 8 to the occupational therapist (OT) (see Table I).
The single raters evaluated all 30 items independently. The team raters arranged themselves to complete their parts separately whenever they had free time during working hours but within 3 days of the patient’s admission to the hospital. There were team meetings but the assessment results were not shared among the raters. To further demonstrate consistency of the 2 rating approaches, the evaluation time taken by each rater was also recorded (except at the specialized Guangdong 999 Brain Hospital). The reasons for failure to assess were recorded by the rater if any part of the whole rating was not completed within a patient’s 3-day window. The case was excluded if more than 10% of the data on the 30 categories were missing (24).
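The exclusion rule above can be sketched as a simple check. This is a minimal illustration, assuming that a missing rating is stored as `None`; the paper does not say how missing values were coded, nor whether grades 8 and 9 counted as missing, so neither is treated as missing here. The names `exclude_case`, `N_CATEGORIES` and `MAX_MISSING_PERCENT` are hypothetical.

```python
from typing import Optional, Sequence

N_CATEGORIES = 30          # categories in the ICF-RS
MAX_MISSING_PERCENT = 10   # a case is dropped when MORE than this share is missing


def exclude_case(ratings: Sequence[Optional[int]]) -> bool:
    """Return True when more than 10% of the 30 category ratings are missing.

    Assumption: a missing rating is represented by None.
    """
    if len(ratings) != N_CATEGORIES:
        raise ValueError(f"expected {N_CATEGORIES} ratings, got {len(ratings)}")
    missing = sum(1 for r in ratings if r is None)
    # Integer comparison avoids floating-point edge cases at exactly 10%.
    return missing * 100 > MAX_MISSING_PERCENT * N_CATEGORIES


three_missing = [None] * 3 + [0] * 27   # exactly 10% missing: kept
four_missing = [None] * 4 + [0] * 26    # more than 10% missing: excluded
print(exclude_case(three_missing), exclude_case(four_missing))  # False True
```

With 30 categories, the threshold works out to at most 3 missing ratings per case.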
The data were analysed using SPSS version 25 (IBM, Armonk, NY, USA) and Stata version 12.0. Descriptive statistics were compiled to summarize the patients’ demographic and disease-related information. Measurement data were expressed as mean ± standard deviation (SD). Paired t-tests were used for intra-group comparison of normally distributed data and paired Wilcoxon tests were used when the data were not normally distributed. A p-value ≤ 0.05 was considered to indicate statistical significance.
Paired t-tests were also applied to compare the evaluation times reported by the single raters and the teams. The agreement between a single rater and a team on each category’s rating was quantified as the percentage of agreement and as a weighted κ with a bias-corrected, bootstrapped 95% confidence interval (95% CI). Weighted kappa coefficients are commonly used to quantify the agreement between 2 raters on ordinal scales with K categories. A linear-weighted kappa coefficient relates the mean distance between 2 raters’ classifications with respect to what would be expected by chance. That makes it suitable here, since statistical distributions are usually primarily described in terms of location and variability. A quadratic-weighted kappa coefficient provides changes in the centre of inertia about the agreement cells. Both coefficients were computed because they provide complementary information about the distribution of any disagreements (25, 26). Weighted kappas range from −1 to 1, where 1 indicates perfect agreement, 0 indicates no additional agreement beyond what is expected by chance alone, and a negative value indicates disagreement. A kappa value of 0.81–1.00 is viewed as almost perfect agreement, 0.61–0.80 as substantial, 0.41–0.60 as moderate, 0.21–0.40 as fair, and 0.00–0.20 as slight agreement (27).
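The weighted kappa described above can be illustrated with a short sketch. It is not the authors' SPSS or Stata procedure: it is a plain implementation of the standard Cohen weighted kappa with disagreement weights (|i − j| / (K − 1)) raised to the power 1 (linear) or 2 (quadratic), and it assumes ratings are limited to grades 0 to 4 (grades 8 and 9 would have to be handled beforehand). The bootstrapped confidence interval is omitted. The function names and the example ratings are hypothetical.

```python
def weighted_kappa(rater_a, rater_b, categories=(0, 1, 2, 3, 4), power=2):
    """Cohen's weighted kappa for two raters on an ordinal scale.

    power=1 gives linear weights, power=2 quadratic weights.
    Ratings outside `categories` raise KeyError (assumption: grades 8/9
    have already been removed).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be paired and non-empty")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Observed joint proportions of (rater A grade, rater B grade).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, 1 at maximum distance.
        return (abs(i - j) / (k - 1)) ** power

    d_obs = sum(weight(i, j) * observed[i][j] for i in range(k) for j in range(k))
    d_exp = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if d_exp == 0:
        raise ValueError("kappa is undefined when both raters use one identical grade")
    return 1 - d_obs / d_exp


def interpret_kappa(kappa):
    """Map kappa to the agreement bands quoted in the text (27)."""
    if kappa < 0:
        return "disagreement"
    for lower, label in ((0.80, "almost perfect"), (0.60, "substantial"),
                         (0.40, "moderate"), (0.20, "fair")):
        if kappa > lower:
            return label
    return "slight"


# Made-up ratings for illustration only: one near-miss in six patients.
single = [0, 1, 2, 3, 4, 0]
team = [0, 1, 2, 3, 4, 1]
kq = weighted_kappa(single, team, power=2)
kl = weighted_kappa(single, team, power=1)
print(round(kq, 3), round(kl, 3), interpret_kappa(kq))
```

Because quadratic weights penalize a one-grade disagreement less than linear weights do, the quadratic coefficient is the larger of the two in this example.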
A total of 217 patients were initially contacted. Six produced AMT scores < 6, and 8 declined to participate, hence 203 patients were eventually recruited. Of those, 10 could not be included in the final statistical analysis because of incomplete data. Among them, 6 were excluded because a rater did not complete the assessment within 3 days of admission. Another 2 were discharged early, and the other 2 subjects dropped out for personal reasons. Hence, 193 patients were included in the final data analyses.
The participants had a mean age of 52.6 ± 16.7 years, with 69.4% younger than 60 years. Sixty percent (n = 116) were men, and 73.6% said they had not attended university. Most of the patients were unemployed after their injury (112, 58%). Of the participants, 139 (72%) had a nervous system dysfunction, 43 (22.3%) had musculoskeletal problems, and 11 (5.7%) had cardiopulmonary system diseases. The patients’ general characteristics are shown in Table II.
There were 5 raters at each of the 5 rehabilitation departments. Five of them worked as a single rater assessing all 30 categories. The other 20 participated as team raters. They had a mean age of 40.5 ± 6.76 years, with 60% older than 30 years. Sixteen (64%) were men. 48% had an intermediate title or better. They had a mean of 6.6 ± 4.9 years of experience working in a rehabilitation centre and most of them (80%) had 3–9 years of work experience. Almost all of the raters (23, 92%) had started learning about the ICF-RS within the previous year. The general characteristics of the raters are shown in Table III.
Of the 193 cases collected, 29 were from the Guangdong 999 Brain Hospital, where time data were not recorded. Time data for 54 patients at the remaining 4 rehabilitation departments were invalid because a rater forgot to record the time, so a final total of 110 assessments with full evaluation times were analysed. The mean time taken to complete an evaluation was 16.1 ± 5.3 min for a single rater and almost the same (16.3 ± 4.4 min) for a team. A paired t-test confirmed that there was no significant difference between the 2 groups (t = –0.429, p = 0.67). Paired t-tests also showed that there was no significant difference between a single rater and a team at any of the individual rehabilitation centres (Fig. 1).
Fig. 1. Evaluation times at each department by a single rater and a team. ICF-RS: International Classification of Functioning, Disability and Health Rehabilitation Set; TFAH-SYN: The Fifth Affiliated Hospital, Sun Yat-sen University; CH: Clifford Hospital; SNPH: Shenzhen Nanshan People’s Hospital; TFHX: The Fifth Hospital of Xiamen; TOTAL: mean evaluation time of a single rater or a team.
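The paired comparison of evaluation times reported above rests on the standard paired t statistic: the mean of the per-patient differences divided by its standard error, with n − 1 degrees of freedom (109 for the 110 pairs analysed). The sketch below computes only the statistic, not the p-value, and uses made-up times. The function name `paired_t` and the choice to subtract team time from single-rater time are assumptions.

```python
import math
import statistics


def paired_t(single_times, team_times):
    """Paired t statistic and degrees of freedom for matched measurements.

    Differences are taken as single minus team (an assumed convention).
    """
    if len(single_times) != len(team_times) or len(single_times) < 2:
        raise ValueError("need at least 2 paired observations")
    diffs = [s - t for s, t in zip(single_times, team_times)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t_stat, n - 1


# Made-up evaluation times in minutes, for illustration only.
single_minutes = [10, 12, 14]
team_minutes = [11, 14, 17]
t_stat, df = paired_t(single_minutes, team_minutes)
print(round(t_stat, 3), df)
```

A negative statistic under this convention means the team took longer on average, which matches the direction of the reported means (16.1 min versus 16.3 min).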
The observed agreement and the weighted kappas with bootstrapped 95% CIs are shown in Table IV. The percentage of agreement ranged from 46.1 to 94.2% depending on the category. A category’s weighted kappa ranged from 0.43 to 0.92, with 8 categories (26.6%) showing a weighted kappa between 0.4 and 0.6. Eleven (36.7%) had weighted kappas between 0.6 and 0.8, and for another 11 (36.7%) it was more than 0.8. The categories are ranked by kappa value from highest to lowest in Table IV. The category “d450 Walking” had the highest weighted kappa, while the category “d710 Basic interpersonal interactions” had the lowest.
The ICF is used as a reference model in the assessment of functioning, mostly in assessing specific health conditions in a rehabilitation context (28). Many researchers have sought to reduce the size or perceived complexity of the ICF by creating short lists of ICF domains for specific recording or measurement purposes (29). The team evaluation approach was developed through a Delphi study to facilitate the evaluation of the Chinese version of the ICF-RS (19). In that study the 30 categories were grouped into 4 parts to suit the 4 types of professionals and best bring to bear their diverse skills and experience. This study then aimed to evaluate the agreement between the results of a single rater and those of a team with specialist expertise. The results demonstrate that there were no significant differences in terms of evaluation time and that the ratings of a single rater and a team demonstrated moderate to high agreement.
Much has been published about the reliability of scales used by 2 or more individual professional raters (30–33), but the reliability of team evaluation has rarely been reported. Alvsåker reported good inter-rater reliability when the Early Functional Abilities scale was used independently by experts from 4 different professions. However, there was no team division of labour (34). The Functional Independence Measure (FIM) is a rating scale designed for use by a multidisciplinary team (35, 36). A group led by Young has reported (37) that the mean total FIM rating was similar regardless of whether a team of healthcare professionals (generally consisting of a nurse, a physical therapist, an occupational therapist and a social worker) or a single non-clinician was the interviewer. The Catz-Itzkovich Spinal Cord Independence Measure (SCIM) can also be scored by a team of professionals (38), but in that case research has shown that assessment by a single nurse is not as accurate as by a multidisciplinary team (39).
In this study the single raters and the team both produced valid and consistent ratings similar to those reported in previous studies. Those studies with the FIM and SCIM tested the feasibility of single raters because they thought single scoring might be less burdensome and expensive, but the single raters selected were only nurses or non-clinicians. That may not be suitable for the Chinese ICF-RS, as it includes categories that require more specialized skills such as “b710 Mobility of joint functions”. Physicians and therapists in the clinic are the preferred raters. Team rating did not, however, increase the cost of the rating and promises better accuracy in functional assessment. Also, reduced personal assessment time in a busy workday and more targeted assessment may increase willingness to use the instrument, increase the raters’ attention and promote more active intervention in clinical practice. In addition, consistent category ratings can be shared via a mobile application (18).
The weighted κ results of all of the categories were greater than 0.4, indicating moderate to high consistency between the single raters and the teams. ICF-RS assessments involve interviews and clinical examinations. The category “d450 Walking” had the highest weighted kappa. It is scored by looking at the patient’s ability to walk 10 m on flat ground, wearing a brace or prosthetic limb or using a walking aid if necessary. It is graded according to the need for “supervision, prompting, or assistance”. High consistency can be achieved because most professionals are familiar with the 10-m walking evaluation, and can give objective ratings through simple observation. The assessment of “d710 Basic interpersonal interactions” calls for the rater to make a judgment based on the subject’s enthusiasm, appropriateness, language organization ability, expression ability, etc. in interpersonal communication. The patient’s self-assessment and the opinions of family members may also be considered. The ratings range from excellent (0) to very poor (4) using Likert 5-level scoring. The ratings in that category demonstrated low consistency because the professionals, the patients and their families made different evaluations of the interviewees’ interpersonal communication. To improve the situation the evaluation could be based entirely on the professional’s rating after communicating with the interviewee.
Agreement was relatively low in the categories where subjective judgment plays a larger role. The results of a particular evaluation also depend to some extent on the degree of cooperation from the patient at that time as well as on the rater’s skill. The ICF is helpful in establishing a common language between different professionals and with patients, caregivers, administrators and health policy-makers (40). This study has shown that the team approach to ICF-RS assessment is feasible and gives results consistent with those of a single rater.
An obvious limitation is that all 5 rehabilitation departments involved in this study were in China. The findings need to be extended to other contexts. Also, this study was only conducted in the rehabilitation departments of tertiary general or specialized hospitals in China. Further research will be needed to verify the suitability of team rating in community and rural rehabilitation centres. Furthermore, the quota sampling did not cover all of the patient population available during the study period. To do so would have disallowed random sampling. There was no formal debriefing. Ideally, semi-structured interviews should have been conducted with the raters involved in the team evaluations to better understand the acceptability of the team assessment approach. A further limitation is that the patients were not interviewed to collect their perspectives on single or team rating.
Team rating using the ICF-RS produces ratings consistent with those of a single rater. Team assessment is thus potentially useful in the clinic. It can be an effective technique for producing consistent ICF-RS assessments.
The authors thank all the professionals for their participation in the study.
The study was funded by the National Natural Science Foundation of China (grant number 72104060) and the Guangdong Province University characteristic innovation project (grant number 2021KTSCX057). The funding bodies had no role in the design of the study, the collection, analysis, or interpretation of the data, or in writing the manuscript.
The data generated and analysed are not publicly available in order to preserve the anonymity of the participants, but they are available from the corresponding author on reasonable request.
Ethics approval for this study involving human participants was provided by the ethics committee of the Sun Yat-sen Memorial Hospital and each collaborating centre (2019085). Written informed consent was obtained from all of the participants or their family members.