Automatic Assignment of ICD9 Codes To Discharge Summaries

Leah S. Larkey, Ph.D. and W. Bruce Croft, Ph.D.
Department of Computer Science, University of Massachusetts

             An information retrieval approach was combined with a k-nearest-neighbor
         algorithm to assign ICD9 codes automatically to dictated inpatient discharge
         summaries. Experiments were carried out to determine how to assign a score
         to a candidate code for a test document based on the codes and scores assigned
         to retrieved documents, and how best to turn a test discharge summary into a
         query to maximize the chance of retrieving documents with the correct codes for
         the test document.

                                      INTRODUCTION

             A great deal of human time and effort is expended in assigning codes of
         various types to patient medical records. Because this coding determines reim-
         bursement, it is important to accomplish this task as easily and as accurately
          as possible. At the Center for Intelligent Information Retrieval (CIIR) at UMass
         Amherst, we are working with several medical organizations on information re-
         trieval problems. We present work in progress on automatically assigning ICD9
         codes to dictated inpatient discharge summaries.
             We are following several different approaches to automatic coding, and will
         ultimately combine some of them. First, we are pursuing a k-nearest-neighbor
         [?] approach using scores from an information retrieval system as the distance
         metric. Second, we are training statistical classifiers [?] for each code. Third, we
         are experimenting with direct lookup in the ICD-9-CM manuals (Alphabetic In-
          dex and Tabular List). Because the discharge summaries contain a large number
         of terms that are not relevant to the coding task, we are incorporating several
         different methods for selecting and giving extra weight to words, phrases, and
         document sections that provide the most diagnostic evidence. These methods
         include natural language processing and heuristics to divide the summaries into
         fields that represent different sections of the documents.
             Here we present research using the INQUERY information retrieval en-
         gine [?] and a k-nearest-neighbor classification scheme on a corpus of discharge
          summaries. A similar approach has been used for other classification tasks in in-
         formation retrieval [?, ?]. The major questions our k-nearest-neighbor research
         has addressed so far are:

             · How to assign a score to a candidate code for a test document, based on
               the codes and scores assigned to retrieved documents.

             · How best to turn a test discharge summary into a query, to maximize
               the chance of retrieving documents with the correct codes for the test
               document.

             · Whether the text of the discharge summaries should be indexed in fields
               corresponding to sections of the document, allowing field specific retrieval
               for automatic coding.

             · Whether natural language processing (NLP) can aid the automatic coding
                process by tagging items as negated, as diagnoses, or as signs and
               symptoms.

                                           METHOD

         The Corpus
         The corpus consists of 11,599 dictated inpatient discharge summaries, divided
         into a training set of 10,902 documents, a test set of 187 documents, and a
         tuning set of 510 documents.  We are using the discharge summaries rather
         than the entire patient medical record, because this is the part of the medical
         record that has been computerized.
             The discharge summaries range from around 100 to nearly 3000 words in
         length with a mean length of 633 words. Each document has between one and
         15 ICD-9 codes assigned to it, with a mean of 4.43 codes per document. 90% of
         the documents have fewer than 9 codes.
             In style, the discharge summaries are fairly typical of hospital discharge sum-
         maries. Most of the documents in the corpus follow a standard medical docu-
         ment chronology, usually consisting of an assessment, history of present illness,
         past medical history, physical examination, laboratory examination, hospital
         course, and disposition.  Some have operations and procedures.  A small pro-
         portion of the documents are aberrant in format, or were very short addenda
         to other documents. No effort was made to screen these out of the corpus, or
         to attach addenda to other documents that they may belong with. The docu-
         ments were produced by a large number of practitioners and were consequently
         heterogeneous in linguistic style and in the way the sections were labeled.
             Automatic coding of such documents is particularly difficult because there
         is so much free form text in each document, much of it is not relevant or only
         indirectly relevant to the coding task, and the portion of text relevant to each
         code is not explicitly associated with its code in any way.

         K-nearest-neighbor paradigm
         The training collection of 10,902 discharge summaries was indexed and built
         into an INQUERY database. The test documents were stripped of their codes
         and presented one at a time to INQUERY as queries. It should be emphasized
         that the queries in this paradigm are full free-text documents in natural English.
         This information retrieval step retrieves a list of discharge summaries from the
         database which are the most similar to the test discharge summary to be coded.
         Each retrieved document has an associated belief score, and the list is ranked
          by this score. Each code found in a retrieved document is a candidate for
          assignment to the test document.

                           Table 1: Sample information retrieval output

          Test doc 77
          Doc     Score    Codes
          5891    .4302    622.8 620.8 620.2 618.2 625.6
          4408    .4293    625.6 618.2 709.2 906.6 V61.8
          ...

          The output from the information retrieval step is exemplified in Table 1.
             Our preliminary studies showed that the optimal number of documents to
         retrieve for each test document was 20.  In all subsequent work this number
         remained fixed at 20.
              The second step in assigning codes to the test document is to rank order
          the candidate codes drawn from the retrieved documents.
             We have experimented with several different ways of assigning scores to
         candidate codes for each test document. The simplest and most obvious method
         is to use as a code's score the number out of the twenty retrieved documents
          that have that code assigned to them, but this produces too many ties. Instead,
         we start by summing the scores on the retrieved documents assigned that code,
         weighting the scores in various ways before summing, i.e.
                                Score_c = Σ_i score_i × w_ic

          where i ranges over the retrieved documents,
          Score_c is the test document's score for code c,
          score_i is the document score for retrieved document i, and
          w_ic is the weight for code c in document i.
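
          To make the scoring step concrete, the following Python sketch accumulates
          Score_c over the top-ranked documents. The data layout and function names are
          our own, not actual INQUERY output; with the default weight function of 1 it
          corresponds to the baseline weighting described below.

              from collections import defaultdict

              def score_candidate_codes(retrieved, weight_fn=lambda rank, pos: 1.0):
                  """Return candidate codes ranked by Score_c = sum_i score_i * w_ic.

                  retrieved -- list of (score_i, codes_i) pairs for the top 20 documents,
                               already ordered by belief score
                  weight_fn -- returns w_ic given the document's rank and the position
                               of the code in that document's code list (1.0 = baseline)
                  """
                  code_scores = defaultdict(float)
                  for rank, (doc_score, codes) in enumerate(retrieved):
                      for pos, code in enumerate(codes):
                          code_scores[code] += doc_score * weight_fn(rank, pos)
                  return sorted(code_scores.items(), key=lambda kv: kv[1], reverse=True)

              # The two documents shown in Table 1:
              retrieved = [
                  (0.4302, ["622.8", "620.8", "620.2", "618.2", "625.6"]),
                  (0.4293, ["625.6", "618.2", "709.2", "906.6", "V61.8"]),
              ]
              print(score_candidate_codes(retrieved)[:3])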

         We have tried the following weighting methods:

         Baseline.  Weight is 1 if the code is assigned to the retrieved document, 0
         otherwise.  In other words, Scorec is the sum of the scores of the retrieved
         documents assigned the code.

         Rank weights. The weight is a function of the rank of the document in the
         list of retrieved documents. We tried various linear functions of the rank.

         Top-code weights. The weight is determined by the rank of the code in the
          retrieved document's list of codes. The first code listed is the code for the
         principal diagnosis, and all the others are of equal significance. We experimented
         with upweighting the top code of each retrieved document by giving it a weight
         ranging from 1 to 2. All the other codes' weights were fixed at 1.

         Icf weights. The weight is determined by the frequency of the code in the col-
         lection. We experimented with an inverse code frequency (icf) analogous to the
          standard inverse document frequency (idf) [?], computed as log(numdocs/cf) + 1,
          where numdocs is the total number of documents in the collection and cf is the
          number of documents assigned code c.

          Normalized icf weights. The icf weights are normalized by the maximum value of
          log(numdocs/cf), producing weights that range from 0 to 1: log(numdocs/cf) / log(numdocs).
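
          A minimal sketch of the two code-frequency weights (function names are ours):

              import math

              def icf_weight(numdocs, cf):
                  """Inverse code frequency: log(numdocs / cf) + 1."""
                  return math.log(numdocs / cf) + 1

              def normalized_icf_weight(numdocs, cf):
                  """icf normalized to the 0-1 range: log(numdocs / cf) / log(numdocs)."""
                  return math.log(numdocs / cf) / math.log(numdocs)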





         Turning documents into queries
         For the baseline condition and for testing the weighting schemes above, each test
         document was stripped of its ICD9 codes and input in full text form. In addition,
         we tested various ways of making use of document structure.  We reasoned
         that certain sections of the documents (for example PRINCIPAL DIAGNOSIS:)
         would be more relevant to diagnosis and hence to code assignment than other
         sections. INQUERY's flexible query language allowed us to formulate each query
         as a weighted sum of the sections of the documents. Two subtasks made up this
         part of the research: identifying document sections, and tuning the weights on
         the sections.
             Sections were identified heuristically. This is the only part of the processing
         that was not automatic.  Although not all documents had the same sections,
         and the sections were labeled in various ways, they were usually identified by a
         title in upper case and terminated by a colon. We used the Unix tools flex and
         awk to list the titles in reverse frequency order. All the titles above a threshold
         were categorized under one of ten section types: ADD (addendum), ADMIN
         (administrative), DISCH (disposition), DX (diagnosis), HOSP (hospital course),
         HPI (history of present illness), LAB (laboratory examination), OR (operations
         and procedures), PE (physical examination), PMH (past medical history).
             A flex script was written to recognize the high-frequency titles using reg-
         ular expressions, and to add tags marking the beginnings and endings of each
         section. Then we iteratively examined the untagged sections, and made the reg-
         ular expressions more general to encompass more variations on the titles. We
         stopped when most of the remaining untagged sections belonged to the same
         category as the preceding tagged section and then modified the algorithm to
         include any untagged material in the section preceding it.
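
          The sketch below gives the flavor of this step in Python rather than flex and
          awk. The regular expressions shown cover only a few title variants for four of
          the ten section types and are purely illustrative; the real rule set was
          developed iteratively as described above.

              import re

              # A few illustrative title patterns (the real rules covered many variants).
              SECTION_PATTERNS = [
                  ("DX",   re.compile(r"^(PRINCIPAL|FINAL|DISCHARGE)? ?DIAGNOS[EI]S:", re.M)),
                  ("HPI",  re.compile(r"^HISTORY OF (THE )?PRESENT ILLNESS:", re.M)),
                  ("PMH",  re.compile(r"^PAST MEDICAL HISTORY:", re.M)),
                  ("HOSP", re.compile(r"^HOSPITAL COURSE:", re.M)),
              ]

              def tag_sections(text):
                  """Split a summary into (section_type, text) spans at recognized titles.

                  Material before the first recognized title, and any untagged material,
                  is folded into the preceding section (labeled ADMIN at the start).
                  """
                  boundaries = sorted(
                      (m.start(), sec)
                      for sec, pattern in SECTION_PATTERNS
                      for m in pattern.finditer(text)
                  )
                  spans, prev_off, prev_sec = [], 0, "ADMIN"
                  for off, sec in boundaries:
                      spans.append((prev_sec, text[prev_off:off]))
                      prev_off, prev_sec = off, sec
                  spans.append((prev_sec, text[prev_off:]))
                  return spans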
             To weight sections differentially, we used INQUERY's weighted sum (#wsum)
         and #sum operators, as follows:
                              #wsum ( 1
                                      w_i #sum ( text for section_i )
                                      w_j #sum ( text for section_j )
                                      ... )

                              where w_i is the weight on section i.

             The weights in the baseline weighted sum condition were all equal to 1.
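
          As a rough illustration, a query of this shape can be assembled from the
          tagged sections as in the helper below. The helper and its argument layout are
          our own; only the operator form follows the description above.

              def build_wsum_query(sections, weights):
                  """Assemble a weighted-sum query string from (section_type, text) pairs.

                  weights maps section types (DX, HPI, ...) to weights; unlisted sections
                  default to 1, as in the baseline weighted-sum condition.
                  """
                  parts = ["%s #sum ( %s )" % (weights.get(sec, 1), text)
                           for sec, text in sections]
                  return "#wsum ( 1 %s )" % " ".join(parts)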
             Weights were tuned using the tuning set divided into two sets with 255
         documents each.  We used a hill-climbing algorithm [?], and accepted each
         successive change in weights that improved the first tuning set without hurting
         performance on the second tuning set.
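
          The sketch below only illustrates the accept/reject rule just described; the
          move set, step size, and number of passes are our own illustrative choices,
          not values from the experiments.

              def tune_section_weights(weights, eval_set1, eval_set2, step=0.5, passes=3):
                  """Greedy hill climbing over section weights.

                  A candidate change is kept only if it improves average precision on the
                  first tuning set (eval_set1) without hurting the second (eval_set2).
                  """
                  best1, best2 = eval_set1(weights), eval_set2(weights)
                  for _ in range(passes):
                      for sec in list(weights):
                          for delta in (step, -step):
                              trial = dict(weights, **{sec: max(0.0, weights[sec] + delta)})
                              s1, s2 = eval_set1(trial), eval_set2(trial)
                              if s1 > best1 and s2 >= best2:
                                  weights, best1, best2 = trial, s1, s2
                  return weights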

         Field Specific Retrieval
         A fielded version of the database was created by indexing the material in the
         different section types into different fields. Queries were formulated by replacing
         the #sum (...)  operator in the weighted sum query above with a #field
         (fieldname ...)  operator, indicating that the retrieval system should look
         for that material only inside the corresponding field in the training documents.
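
          Continuing the illustrative helper above, the fielded variant simply wraps each
          section's text in a field-restricted operator (the exact argument layout within
          #field is our assumption):

              def build_fielded_query(sections, weights):
                  """Fielded variant: restrict each section's text to its own index field."""
                  parts = ["%s #field ( %s %s )" % (weights.get(sec, 1), sec.lower(), text)
                           for sec, text in sections]
                  return "#wsum ( 1 %s )" % " ".join(parts)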

         Natural Language Processing
         In addition to the section tagging, we also tagged text with "without" (WO)
         tags, in cases like:

                ...patient denied <WO>fevers</WO>, <WO>chills</WO>, ...

             The aim was to avoid spurious retrievals of cases with positive mentions of
         items occurring in WO fields of the test documents. We tried many different





                                     Table 2: Coding accuracy

          Scoring full codes
                    Average         Principal code is    Principal code
                    precision       top candidate        in top 10
          base      34.6            24.1                 59.4
          top       35.6 (+3.0)     30.5 (+26.7)         65.2 (+9.9)
          wsum      38.3 (+10.9)    36.4 (+51.1)         69.0 (+16.2)
          sec       39.6 (+14.6)    38.5 (+60.0)         72.2 (+21.6)

          Scoring categories
          base      45.8            42.2                 74.9
          top       47.6 (+3.9)     49.7 (+17.7)         78.1 (+4.3)
          wsum      50.4 (+10.1)    54.0 (+27.8)         81.8 (+9.3)
          sec       51.0 (+11.3)    55.1 (+30.4)         84.0 (+12.1)

          We tried many different ways of using the WO tags, including leaving the WO
          sections out of the queries,
         downweighting them, and indexing them in their own fields in the database and
         using field operators in the queries. This tagging was carried out using a simple
         finite automaton incorporating rules for the scoping of negation and a lexicon
         of medical words.
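
          The actual tagger was a finite automaton with negation-scoping rules and a
          medical lexicon; the fragment below is a much-simplified stand-in (the trigger
          words, scope terminators, and lexicon shown are illustrative only) conveying
          the kind of behavior involved.

              NEGATION_TRIGGERS = {"denied", "denies", "without", "no"}
              SCOPE_TERMINATORS = {".", ";", "but"}

              def tag_negations(tokens, findings):
                  """Wrap finding words inside a negation scope in <WO> tags.

                  tokens   -- the summary as a list of word/punctuation tokens
                  findings -- lexicon of medical finding words, e.g. {"fevers", "chills"}
                  A trigger word opens a negation scope; a terminator closes it.
                  """
                  out, negated = [], False
                  for tok in tokens:
                      low = tok.lower()
                      if low in NEGATION_TRIGGERS:
                          negated = True
                          out.append(tok)
                      elif low in SCOPE_TERMINATORS:
                          negated = False
                          out.append(tok)
                      elif negated and low in findings:
                          out.append("<WO>%s</WO>" % tok)
                      else:
                          out.append(tok)
                  return out

              # e.g. tag_negations("patient denied fevers , chills .".split(),
              #                    {"fevers", "chills"})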

         Measures
         We report three measures of coding success. These measures reflect the success
         at getting all the codes as high as possible in the list of candidates without
         considering a cutoff for acceptance.

          Average precision [?]. Precision (the proportion of listed candidate codes
          that are correct) was computed at recall levels (the proportion of the
          document's correct codes that appear in the list) of .10, .20, through 1.00,
          and averaged across the 10 recall points.

          Top candidate. Proportion of cases where the test document's principal diag-
          nosis (first) code is the top candidate in the list of codes ordered by Score_c.

         Top 10. Proportion of cases where the test document's principal code is in the
         top 10 candidates.
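
          A sketch of the three measures for a single test document follows. The exact
          interpolation used for the 10-point average is not spelled out above, so the
          version here (interpolated precision at each recall level) is one common
          reading, with helper names of our own choosing.

              def average_precision_10pt(ranked_codes, correct):
                  """Precision averaged over recall levels .1, .2, ..., 1.0.

                  ranked_codes -- candidate codes ordered by Score_c, best first
                  correct      -- set of codes actually assigned to the test document
                  """
                  hits, points = 0, []  # (recall, precision) after each correct code found
                  for k, code in enumerate(ranked_codes, start=1):
                      if code in correct:
                          hits += 1
                          points.append((hits / len(correct), hits / k))
                  total = 0.0
                  for level in (r / 10 for r in range(1, 11)):
                      attainable = [p for r, p in points if r >= level]
                      total += max(attainable) if attainable else 0.0
                  return total / 10

              def principal_is_top(ranked_codes, principal_code):
                  """Top candidate measure for one document."""
                  return bool(ranked_codes) and ranked_codes[0] == principal_code

              def principal_in_top_10(ranked_codes, principal_code):
                  """Top 10 measure for one document."""
                  return principal_code in ranked_codes[:10]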

         Full codes vs categories.  ICD9 codes have two parts, a major category
         (before the decimal point) and a subcategory (additional digits after the decimal
         point). Although a completely automatic coder would have to assign full codes
         including subcategories, we have included some measures that reflect partial
         success. Therefore, we report the three measures above for two different scoring
         schemes. Full Codes means that the whole code with subcategory had to match
         to be counted as correct. Categories means that only the category -- the part of
         the code before the decimal point -- had to match to be counted as correct.
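
          The category comparison amounts to truncating codes at the decimal point
          before matching; a trivial sketch (helper names are ours):

              def to_category(code):
                  """Strip the subcategory: '622.8' -> '622', 'V61.8' -> 'V61'."""
                  return code.split(".")[0]

              def matches(candidate, gold_codes, full_codes=True):
                  """Full Codes scoring requires an exact match; Categories scoring
                  only compares the part before the decimal point."""
                  if full_codes:
                      return candidate in gold_codes
                  return to_category(candidate) in {to_category(c) for c in gold_codes}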

                                           RESULTS

          Table 2 shows performance on the three measures described above for the
         baseline and best document-score weighting conditions, and for the baseline
         and best section weighting conditions. The table also shows percentage increase
         over the baseline for the nonbaseline conditions.





         Baseline accuracy
          The first row of Table 2 shows performance for the baseline condition (base)
         scoring full codes. The fifth row shows performance for the baseline condition
         scoring categories. Average precision for the baseline condition is 34.6%. The
         principal code was the top candidate in 24.1% of the cases, and was in the top
         ten in 59.4% of the cases. When we score categories rather than full codes, the
         average precision is 45.8%. The principal category is the top candidate in 42.2%
         of the cases, and is in the top 10 in 74.9% of the cases.

         Document-score weighting
          The best value for the top-code weight was 1.8. As can be seen in Table 2 in
         the row labeled "top," this weighting scheme produced a 3% increase in average
         precision over the baseline, a 26.7% increase in the top candidate measure, and
         a 9.9% increase in the top 10 measure. A similar pattern is seen with category
         scores.
             Note that top-code weighting causes a larger increase in the top candidate
         measure than in average precision or in the top 10 measure.  This weighting
         scheme moves correct candidates to the top of the list more than it gets correct
         codes onto the list of candidates.
             None of the other document-score weighting methods produced any improve-
         ment, and are not shown in the table.

         Dividing documents into sections
          Table 2 shows the results when the test document is converted into a weighted
         sum of sections. Formulating the query as a weighted sum with weights of 1,
          combined with a top-code weight of 1.8 (wsum condition), produces a 10.9%
          improvement in average precision over the baseline, a 51.1% increase in the top
          candidate measure, and a 16.2% increase in the top 10 measure. Combining the
          optimal weights found in the tuning procedure with the best top-code weight
         (sec condition), yields a 14.6% improvement in average precision, a 60% increase
         in the top candidate measure, and a 21.6% increase in the top 10 measure. A
         similar pattern is seen with category scores.
             It is interesting that the wsum version of the documents is such an improve-
         ment over the flat free-text version, even before the sections are differentially
          weighted. The improvement is probably due to the length normalization INQUERY
         performs at each #sum node, which has the effect of giving more weight to short
         sections and less weight to long sections.

         NLP tags and fields
         None of the methods of using the WO tags improved the results. No improve-
         ments occurred using field operators and the fielded database, either for sections
         or for WO tags.

                                         DISCUSSION

         Is this good enough?
         Performance in this task is far from the level required for unsupervised automatic
         coding.  However, this system could form a component of a computer aided
         coding system. It could present a list of codes as candidates to be checked by
          an expert coder. Note also that the training set of 10,902 documents is quite
          small considering that there are more than 14,000 codes to be learned. More
         training data should improve the results.  An automatic coding system could
         identify the codes which have sufficient training data to perform reliably.





         Generality of the Approach
         Taking advantage of the section structure of the documents afforded a substan-
         tial gain in retrieval accuracy.  Since the labelling of sections was a heuristic
         step requiring a few weeks of tweaking a set of regular expressions, one could
         question the general value of this approach. However, the section labelling pro-
         gram would continue to successfully tag new documents like the old ones. At
         a site where the format of discharge summaries was more standardized, or in a
         database where the sections were already in different fields, this step could be
         more completely automated.

         Future Directions
         Our next step is to take advantage of yet another level of structure in these docu-
         ments. Our associates are using NLP techniques to tag phrases in the discharge
         summaries with five subtypes each of diagnoses and signs or symptoms [?]. Our
         hypothesis is that performance will be improved by giving more weight to the
         items selected in this way.
             Another direction we expect to take this research is to try two other methods
         of accessing the codes for documents. The first additional classification method
         would be to train statistical classifiers for each code with sufficient data.
             A second additional classification method would be to obtain the text of
         the ICD9-CM Tabular List and Alphabetic Index, and to automate the lookup
         procedures used by manual coders. This method would be particularly useful
          in just the cases where the other two classification methods would fail -- the
         codes for which there is too little (or no) training data.
             This technique would confront us with a vocabulary mismatch problem which
         was not a major part of the k-nearest-neighbor and other classification tech-
         niques mentioned above. In our work so far, we have been matching test docu-
         ments against a corpus of training documents of exactly the same kind. They
         have a varied vocabulary, but there is no systematic difference between the
         training and test documents. In contrast, the vocabulary used in the discharge
         summaries is systematically different from the controlled vocabulary of the ICD9
         descriptors. Our preliminary research has shown this to be a serious problem,
         and we are now experimenting with using the UMLS Metathesaurus to alleviate
         the mismatch problem.

                                      Acknowledgments

         I would like to thank David Aronow for his help in categorizing the section titles
         in the documents. I would also like to thank David Fisher, Fang-Fang Feng, and
          Stephen Soderland for the NLP tagging. This work is supported by ARPA
         contract N66001-94-D-6054. Stephen I. Gallant also contributed to this work,
         supported by National Cancer Institute Grant 1 R43 CA 65250-01 to Belmont
         Research Inc.  Data are courtesy of Brigham and Women's Hospital, Boston,
         MA.


                                            References