Automatic Assignment of ICD9 Codes To Discharge Summaries
Leah S. Larkey, Ph.D. and W. Bruce Croft, Ph.D.
Department of Computer Science, University of Massachusetts
An information retrieval approach was combined with a k-nearest-neighbor
algorithm to assign ICD9 codes automatically to dictated inpatient discharge
summaries. Experiments were carried out to determine how to assign a score
to a candidate code for a test document based on the codes and scores assigned
to retrieved documents, and how best to turn a test discharge summary into a
query to maximize the chance of retrieving documents with the correct codes for
the test document.
INTRODUCTION
A great deal of human time and effort is expended in assigning codes of
various types to patient medical records. Because this coding determines reim-
bursement, it is important to accomplish this task as easily and as accurately
as possible. At the Center for Intelligent Information Retrieval (CIIR) at UMass
Amherst, we are working with several medical organizations on information re-
trieval problems. We present work in progress on automatically assigning ICD9
codes to dictated inpatient discharge summaries.
We are following several different approaches to automatic coding, and will
ultimately combine some of them. First, we are pursuing a k-nearest-neighbor
[?] approach using scores from an information retrieval system as the distance
metric. Second, we are training statistical classifiers [?] for each code. Third, we
are experimenting with direct lookup in the ICD-9-CM manuals (Alphabetic In-
dex and Tabular List). Because the discharge summaries contain a large number
of terms that are not relevant to the coding task, we are incorporating several
different methods for selecting and giving extra weight to words, phrases, and
document sections that provide the most diagnostic evidence. These methods
include natural language processing and heuristics to divide the summaries into
fields that represent different sections of the documents.
Here we present research using the INQUERY information retrieval en-
gine [?] and a k-nearest-neighbor classification scheme on a corpus of discharge
summaries. A similar approach has been used for other classification tasks in in-
formation retrieval [?, ?]. The major questions our k-nearest-neighbor research
has addressed so far are:
· How to assign a score to a candidate code for a test document, based on
the codes and scores assigned to retrieved documents.
· How best to turn a test discharge summary into a query, to maximize
the chance of retrieving documents with the correct codes for the test
document.
· Whether the text of the discharge summaries should be indexed in fields
corresponding to sections of the document, allowing field specific retrieval
for automatic coding.
· Whether natural language processing (NLP) can aid the automatic coding
process by tagging items as negated, as diagnoses, or as signs and
symptoms.
METHOD
The Corpus
The corpus consists of 11,599 dictated inpatient discharge summaries, divided
into a training set of 10,902 documents, a test set of 187 documents, and a
tuning set of 510 documents. We are using the discharge summaries rather
than the entire patient medical record, because this is the part of the medical
record that has been computerized.
The discharge summaries range from around 100 to nearly 3000 words in
length with a mean length of 633 words. Each document has between one and
15 ICD-9 codes assigned to it, with a mean of 4.43 codes per document. 90% of
the documents have fewer than 9 codes.
In style, the discharge summaries are fairly typical of hospital discharge sum-
maries. Most of the documents in the corpus follow a standard medical docu-
ment chronology, usually consisting of an assessment, history of present illness,
past medical history, physical examination, laboratory examination, hospital
course, and disposition. Some have operations and procedures. A small proportion of the documents were aberrant in format or were very short addenda
to other documents. No effort was made to screen these out of the corpus, or
to attach addenda to other documents that they may belong with. The docu-
ments were produced by a large number of practitioners and were consequently
heterogeneous in linguistic style and in the way the sections were labeled.
Automatic coding of such documents is particularly difficult because there
is so much free form text in each document, much of it is not relevant or only
indirectly relevant to the coding task, and the portion of text relevant to each
code is not explicitly associated with its code in any way.
K-nearest-neighbor paradigm
The training collection of 10,902 discharge summaries was indexed and built
into an INQUERY database. The test documents were stripped of their codes
and presented one at a time to INQUERY as queries. It should be emphasized
that the queries in this paradigm are full free-text documents in natural English.
This information retrieval step retrieves a list of discharge summaries from the
database which are the most similar to the test discharge summary to be coded.
Each retrieved document has an associated belief score, and the list is ranked
by this score. Each code found in a retrieved document is a candidate for assignment to the test document. The output from the information retrieval step is exemplified in Table 1.

Table 1: Sample information retrieval output (test document 77)

    Doc     Score    Codes
    5891    .4302    622.8  620.8  620.2  618.2  625.6
    4408    .4293    625.6  618.2  709.2  906.6  V61.8
    ...
Our preliminary studies showed that the optimal number of documents to
retrieve for each test document was 20. In all subsequent work this number
remained fixed at 20.
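As a concrete illustration, the sketch below shows the retrieval step in this paradigm. The retrieve callable is a hypothetical stand-in for a call to the INQUERY engine (whose programming interface is not shown here); it is assumed to return the top-ranked training documents with their belief scores.

    K = 20  # number of retrieved documents, fixed after the preliminary studies

    def nearest_neighbors(test_summary_text, retrieve, training_codes):
        """Retrieve the K most similar training summaries and their ICD9 codes.

        test_summary_text -- full free text of the test discharge summary
        retrieve          -- hypothetical callable(query_text, k) returning a
                             ranked list of (doc_id, belief_score) pairs
        training_codes    -- dict mapping doc_id to its list of ICD9 codes
        """
        ranked = retrieve(test_summary_text, K)
        return [(doc_id, score, training_codes[doc_id]) for doc_id, score in ranked]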
The second step in assigning codes to the test document based on the codes in
the retrieved documents is to rank order the codes from the retrieved documents.
We have experimented with several different ways of assigning scores to
candidate codes for each test document. The simplest and most obvious method
is to use as a code's score the number out of the twenty retrieved documents
that have that code assigned to them, but this produces too many ties. Instead,
we start by summing the scores on the retrieved documents assigned that code,
weighting the scores in various ways before summing, i.e.
    Score_c = Σ_i (score_i × w_ic)

where i ranges over the retrieved documents,
Score_c is the test document's score for code c,
score_i is the document score for retrieved document i,
w_ic is the weight for code c in document i.
We have tried the following weighting methods:
Baseline. Weight is 1 if the code is assigned to the retrieved document, 0
otherwise. In other words, Scorec is the sum of the scores of the retrieved
documents assigned the code.
Rank weights. The weight is a function of the rank of the document in the
list of retrieved documents. We tried various linear functions of the rank.
Top-code weights. The weight is determined by the rank of the code in the retrieved document's list of codes. The first code listed is the code for the
principal diagnosis, and all the others are of equal significance. We experimented
with upweighting the top code of each retrieved document by giving it a weight
ranging from 1 to 2. All the other codes' weights were fixed at 1.
Icf weights. The weight is determined by the frequency of the code in the col-
lection. We experimented with an inverse code frequency (icf) analogous to the
standard inverse document frequency (idf) [?], computed by: log(numdocs/cf) +
1, where numdocs is the total number of documents in the collection and cf is
the number of docs assigned code c.
Normalized icf weights. The icf weights are normalized by the log of the maximum icf value, producing weights that range from 0 to 1: log(numdocs/cf) / log(numdocs).
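A minimal sketch of this scoring step, combining the formula above with the weighting schemes just described, is given below. The function and variable names are our own; the baseline condition corresponds to top_code_weight = 1 with no icf factor, and cf and numdocs are assumed to be precomputed from the training collection.

    import math
    from collections import defaultdict

    def icf(code, cf, numdocs):
        # inverse code frequency, analogous to idf: log(numdocs / cf) + 1
        return math.log(numdocs / cf[code]) + 1

    def score_codes(neighbors, cf=None, numdocs=None, top_code_weight=1.0):
        """Rank candidate codes by Score_c = sum over i of score_i * w_ic.

        neighbors -- list of (doc_id, score, codes) for the retrieved documents,
                     with each code list ordered so the principal code is first
        """
        totals = defaultdict(float)
        for doc_id, score, codes in neighbors:
            for rank, code in enumerate(codes):
                w = 1.0                               # baseline weight
                if rank == 0:
                    w *= top_code_weight              # top-code weighting (best value 1.8)
                if cf is not None and numdocs is not None:
                    w *= icf(code, cf, numdocs)       # icf weighting
                totals[code] += score * w
        return sorted(totals.items(), key=lambda item: item[1], reverse=True)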
Turning documents into queries
For the baseline condition and for testing the weighting schemes above, each test
document was stripped of its ICD9 codes and input in full text form. In addition,
we tested various ways of making use of document structure. We reasoned
that certain sections of the documents (for example PRINCIPAL DIAGNOSIS:)
would be more relevant to diagnosis and hence to code assignment than other
sections. INQUERY's flexible query language allowed us to formulate each query
as a weighted sum of the sections of the documents. Two subtasks made up this
part of the research: identifying document sections, and tuning the weights on
the sections.
Sections were identified heuristically. This is the only part of the processing
that was not automatic. Although not all documents had the same sections,
and the sections were labeled in various ways, they were usually identified by a
title in upper case and terminated by a colon. We used the Unix tools flex and
awk to list the titles in reverse frequency order. All the titles above a threshold
were categorized under one of ten section types: ADD (addendum), ADMIN
(administrative), DISCH (disposition), DX (diagnosis), HOSP (hospital course),
HPI (history of present illness), LAB (laboratory examination), OR (operations
and procedures), PE (physical examination), PMH (past medical history).
A flex script was written to recognize the high-frequency titles using reg-
ular expressions, and to add tags marking the beginnings and endings of each
section. Then we iteratively examined the untagged sections, and made the reg-
ular expressions more general to encompass more variations on the titles. We
stopped when most of the remaining untagged sections belonged to the same
category as the preceding tagged section and then modified the algorithm to
include any untagged material in the section preceding it.
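A Python analogue of this section tagger is sketched below. The title patterns are illustrative only; the actual regular expressions were developed iteratively with flex against the corpus, and in the real system untagged material is folded into the preceding section.

    import re

    # Illustrative section-title patterns; the real set was larger and was
    # refined iteratively to cover variant titles in the corpus.
    SECTION_PATTERNS = [
        ("DX",    re.compile(r"^(PRINCIPAL |FINAL |DISCHARGE )?DIAGNOS(IS|ES):", re.M)),
        ("HPI",   re.compile(r"^HISTORY OF (THE )?PRESENT ILLNESS:", re.M)),
        ("PMH",   re.compile(r"^PAST MEDICAL HISTORY:", re.M)),
        ("PE",    re.compile(r"^PHYSICAL EXAM(INATION)?:", re.M)),
        ("LAB",   re.compile(r"^LABORATORY (DATA|EXAMINATION):", re.M)),
        ("HOSP",  re.compile(r"^HOSPITAL COURSE:", re.M)),
        ("DISCH", re.compile(r"^DISPOSITION:", re.M)),
    ]

    def tag_sections(text):
        """Split a discharge summary into (section_type, section_text) segments."""
        hits = sorted((m.start(), stype)
                      for stype, pattern in SECTION_PATTERNS
                      for m in pattern.finditer(text))
        segments = []
        for i, (start, stype) in enumerate(hits):
            end = hits[i + 1][0] if i + 1 < len(hits) else len(text)
            segments.append((stype, text[start:end]))
        return segments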
To weight sections differentially, we used INQUERY's weighted sum (#wsum)
and #sum operators, as follows:
    #wsum ( 1
        w_i #sum ( text for section i )
        w_j #sum ( text for section j ) ... )

where w_i is the weight on section i.
The weights in the baseline weighted sum condition were all equal to 1.
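A sketch of this query formulation, using the operator syntax shown above, might look as follows; the helper name is ours, and with every weight left at 1 it reproduces the baseline weighted-sum query.

    def build_wsum_query(segments, weights=None):
        """Build an INQUERY #wsum query from (section_type, text) segments."""
        weights = weights or {}
        parts = []
        for stype, text in segments:
            w = weights.get(stype, 1.0)                 # default section weight of 1
            parts.append("%g #sum ( %s )" % (w, text))
        return "#wsum ( 1 %s )" % " ".join(parts)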
Weights were tuned using the tuning set divided into two sets with 255
documents each. We used a hill-climbing algorithm [?], and accepted each
successive change in weights that improved the first tuning set without hurting
performance on the second tuning set.
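The tuning loop can be sketched as follows. Here evaluate is a hypothetical function returning average precision for a given weight assignment on one of the two tuning sets, and the step size and number of passes are our own assumptions.

    import itertools

    def tune_weights(weights, section_types, evaluate, step=0.5, passes=5):
        """Hill-climb section weights, accepting a change only if it improves
        tuning set 1 without hurting tuning set 2."""
        best1 = evaluate(weights, "tune1")
        best2 = evaluate(weights, "tune2")
        for _ in range(passes):
            for stype, delta in itertools.product(section_types, (step, -step)):
                trial = dict(weights)
                trial[stype] = max(0.0, trial.get(stype, 1.0) + delta)
                p1 = evaluate(trial, "tune1")
                p2 = evaluate(trial, "tune2")
                if p1 > best1 and p2 >= best2:          # accept only safe improvements
                    weights, best1, best2 = trial, p1, p2
        return weights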
Field Specific Retrieval
A fielded version of the database was created by indexing the material in the
different section types into different fields. Queries were formulated by replacing
the #sum (...) operator in the weighted sum query above with a #field
(fieldname ...) operator, indicating that the retrieval system should look
for that material only inside the corresponding field in the training documents.
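Under the assumption that each section type is indexed as a field of the same name, the field-specific variant of the query builder sketched earlier differs only in the operator used:

    def build_field_query(segments, weights=None):
        """Like build_wsum_query, but restrict each section's text to the
        corresponding field of the indexed documents via #field."""
        weights = weights or {}
        parts = []
        for stype, text in segments:
            w = weights.get(stype, 1.0)
            parts.append("%g #field ( %s %s )" % (w, stype, text))
        return "#wsum ( 1 %s )" % " ".join(parts)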
Natural Language Processing
In addition to the section tagging, we also tagged text with "without" (WO)
tags, in cases like:
...patient denied <WO>fevers</WO>, <WO>chills</WO>, ...
The aim was to avoid spurious retrievals of cases with positive mentions of
items occurring in WO fields of the test documents. We tried many different
Table 2: Coding accuracy

Scoring full codes
             Average         Principal code       Principal code
             precision       is top candidate     in top 10
    base     34.6            24.1                 59.4
    top      35.6 (+ 3.0)    30.5 (+26.7)         65.2 (+ 9.9)
    wsum     38.3 (+10.9)    36.4 (+51.1)         69.0 (+16.2)
    sec      39.6 (+14.6)    38.5 (+60.0)         72.2 (+21.6)

Scoring categories
    base     45.8            42.2                 74.9
    top      47.6 (+ 3.9)    49.7 (+17.7)         78.1 (+ 4.3)
    wsum     50.4 (+10.1)    54.0 (+27.8)         81.8 (+ 9.3)
    sec      51.0 (+11.3)    55.1 (+30.4)         84.0 (+12.1)
ways of using the WO tags, including leaving the WO sections out of the queries,
downweighting them, and indexing them in their own fields in the database and
using field operators in the queries. This tagging was carried out using a simple
finite automaton incorporating rules for the scoping of negation and a lexicon
of medical words.
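A toy version of this tagging is sketched below. The real tagger was a finite automaton with a medical lexicon and scoping rules; here a single regular expression and a fixed list of negation cues, both illustrative choices of our own, wrap the negated items in WO tags.

    import re

    NEGATION_CUES = r"(denied|denies|without|no evidence of)"

    def tag_without(clause):
        """E.g. 'patient denied fevers, chills' ->
        'patient denied <WO>fevers</WO>, <WO>chills</WO>'."""
        m = re.search(NEGATION_CUES + r"\s+(.*)", clause, re.I)
        if not m:
            return clause
        head = clause[:m.end(1)]          # text up to and including the negation cue
        items = m.group(2).split(",")
        tagged = ", ".join("<WO>%s</WO>" % item.strip() for item in items)
        return head + " " + tagged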
Measures
We report three measures of coding success. These measures reflect the success
at getting all the codes as high as possible in the list of candidates without
considering a cutoff for acceptance.
Average precision [?]. Precision (the proportion of listed codes that were correct) was computed at recall levels (the proportion of the document's correct codes that were listed) of .10, .20, through 1.00, and averaged across the 10 recall points.
Top candidate. Proportion of cases where the test document's principal diagnosis (first) code is the top candidate in the list of codes ordered by Score_c.
Top 10. Proportion of cases where the test document's principal code is in the
top 10 candidates.
Full codes vs categories. ICD9 codes have two parts, a major category
(before the decimal point) and a subcategory (additional digits after the decimal
point). Although a completely automatic coder would have to assign full codes
including subcategories, we have included some measures that reflect partial
success. Therefore, we report the three measures above for two different scoring
schemes. Full Codes means that the whole code with subcategory had to match
to be counted as correct. Categories means that only the category -- the part of
the code before the decimal point -- had to match to be counted as correct.
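The sketch below shows one way these measures can be computed for a single test document, assuming the standard interpolated form of average precision; setting categories to True truncates every code at the decimal point so that only the major category has to match.

    def category(code):
        # major category: the part of an ICD9 code before the decimal point
        return code.split(".")[0]

    def measures(ranked, truth, categories=False):
        """ranked -- candidate codes ordered by Score_c (best first)
        truth  -- the document's assigned codes, principal code first"""
        if categories:
            ranked = [category(c) for c in ranked]
            truth = [category(c) for c in truth]
        total = len(set(truth))
        remaining = set(truth)
        hits = 0
        recall_precision = []
        for rank, code in enumerate(ranked, start=1):
            if code in remaining:
                remaining.discard(code)
                hits += 1
                recall_precision.append((hits / total, hits / rank))
        # interpolated precision averaged over recall points .10, .20, ..., 1.00
        avg_precision = sum(
            max((p for r, p in recall_precision if r >= level), default=0.0)
            for level in (i / 10 for i in range(1, 11))
        ) / 10
        top_candidate = bool(ranked) and ranked[0] == truth[0]
        top_10 = truth[0] in ranked[:10]
        return avg_precision, top_candidate, top_10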
RESULTS
Table 2 shows performance on the three measures described above for the
baseline and best document-score weighting conditions, and for the baseline
and best section weighting conditions. The table also shows percentage increase
over the baseline for the nonbaseline conditions.
Baseline accuracy
The first row of Table 2 shows performance for the baseline condition (base)
scoring full codes. The fifth row shows performance for the baseline condition
scoring categories. Average precision for the baseline condition is 34.6%. The
principal code was the top candidate in 24.1% of the cases, and was in the top
ten in 59.4% of the cases. When we score categories rather than full codes, the
average precision is 45.8%. The principal category is the top candidate in 42.2%
of the cases, and is in the top 10 in 74.9% of the cases.
Document-score weighting
The best value for the top-code weight was 1.8. As can be seen in Table 2, in the row labeled "top," this weighting scheme produced a 3% increase in average
precision over the baseline, a 26.7% increase in the top candidate measure, and
a 9.9% increase in the top 10 measure. A similar pattern is seen with category
scores.
Note that top-code weighting causes a larger increase in the top candidate
measure than in average precision or in the top 10 measure. This weighting
scheme moves correct candidates to the top of the list more than it gets correct
codes onto the list of candidates.
None of the other document-score weighting methods produced any improve-
ment, and are not shown in the table.
Dividing documents into sections
Table 2 shows the results when the test document is converted into a weighted sum of sections. Formulating the query as a weighted sum with weights of 1, combined with a top-code weight of 1.8 (wsum condition), produces a 10.9% improvement in average precision over the baseline, a 51.1% increase in the top candidate measure, and a 16.2% increase in the top 10 measure. Combining the optimal weights found in the tuning procedure with the best top-code weight (sec condition) yields a 14.6% improvement in average precision, a 60% increase in the top candidate measure, and a 21.6% increase in the top 10 measure. A
similar pattern is seen with category scores.
It is interesting that the wsum version of the documents is such an improve-
ment over the flat free-text version, even before the sections are differentially
weighted. The improvement is probably due to length normalization INQUERY
performs at each #sum node, which has the effect of giving more weight to short
sections and less weight to long sections.
NLP tags and fields
None of the methods of using the WO tags improved the results. No improve-
ments occurred using field operators and the fielded database, either for sections
or for WO tags.
DISCUSSION
Is this good enough?
Performance in this task is far from the level required for unsupervised automatic
coding. However, this system could form a component of a computer-aided coding system. It could present a list of codes as candidates to be checked by an expert coder. Note also that the training set of 10,902 documents is quite small considering that we are trying to train more than 14,000 codes. More
training data should improve the results. An automatic coding system could
identify the codes which have sufficient training data to perform reliably.
Generality of the Approach
Taking advantage of the section structure of the documents afforded a substan-
tial gain in retrieval accuracy. Since the labelling of sections was a heuristic
step requiring a few weeks of tweaking a set of regular expressions, one could
question the general value of this approach. However, the section labelling pro-
gram would continue to successfully tag new documents like the old ones. At
a site where the format of discharge summaries was more standardized, or in a
database where the sections were already in different fields, this step could be
more completely automated.
Future Directions
Our next step is to take advantage of yet another level of structure in these docu-
ments. Our associates are using NLP techniques to tag phrases in the discharge
summaries with five subtypes each of diagnoses and signs or symptoms [?]. Our
hypothesis is that performance will be improved by giving more weight to the
items selected in this way.
Another direction we expect to take this research is to try two other methods
of accessing the codes for documents. The first additional classification method
would be to train statistical classifiers for each code with sufficient data.
A second additional classification method would be to obtain the text of
the ICD9-CM Tabular List and Alphabetic Index, and to automate the lookup
procedures used by manual coders. This method would be particularly useful
in just the cases where the other two classification methods would fail --- the
codes for which there is too little (or no) training data.
This technique would confront us with a vocabulary mismatch problem which
was not a major part of the k-nearest-neighbor and other classification tech-
niques mentioned above. In our work so far, we have been matching test docu-
ments against a corpus of training documents of exactly the same kind. They
have a varied vocabulary, but there is no systematic difference between the
training and test documents. In contrast, the vocabulary used in the discharge
summaries is systematically different from the controlled vocabulary of the ICD9
descriptors. Our preliminary research has shown this to be a serious problem,
and we are now experimenting with using the UMLS Metathesaurus to alleviate
the mismatch problem.
Acknowledgments
I would like to thank David Aronow for his help in categorizing the section titles
in the documents. I would also like to thank David Fisher, Fang-Fang Feng, and
Stephen Soderland for the NLP tagging. This work is supported by ARPA contract N66001-94-D-6054. Stephen I. Gallant also contributed to this work,
supported by National Cancer Institute Grant 1 R43 CA 65250-01 to Belmont
Research Inc. Data are courtesy of Brigham and Women's Hospital, Boston,
MA.
References