dataflow analysis which is used extensively to provide the information upon which off-line compilation is based
the developed off-line compilation techniques make crucial use of the fundamental properties of the hpsg formalism
of course the authors are responsible for all remaining errors
the optimizations of our earley generator lead to significant gains in efficiency
structure sharing determines the dataflow within the rules of the grammar
the first processing complement el of the head h has been displaced
this can be verified on the basis of a sample lexical entry for a main verb
we show that our use of off-line grammar optimization overcomes problems with empty or displaced heads
the backward index of edge NUM is therefore identified with the forward index of edge NUM
run time creates much overhead and locally determining the optimal evaluation order is often impossible
while it can appear in a number of different configurations we wish to match only one of its forms against the sitspec though
if a sitspec encodes the situation of tom removing all the water from a tank then the verb to drain is a candidate lexeme
hence a verb's denotation cannot contain that information and it follows that it is not present in the psemspec either
transformative verbs involve a change of some state without a clearly recognizable event that would be responsible for it (the room lit up)
in a nutshell valency as a lexical property needs to supplement the participant circumstance requirements that can be stated for types of processes
moreover the um does not know that the connectee is optional in the verbalization it does not distinguish between obligatory and optional participants
exchange the role names in the psemspec of vo as prescribed by ft0c and importantly in the order they appear there
from the entry point of a verb arcs can be followed and rules applied if the respective alternation is specified in the lexical entry
semspecs are constructed from sitspecs by selection; a sitspec is meant to be neutral between the target languages and between particular paraphrases
lexicalization is the main instrument for the mapping step and we examine the role of verb semantics in the process
as can be seen keystroke savings range roughly from NUM to NUM for NUM predictions and from NUM to NUM for NUM predictions
time saving was measured as the number of output characters produced during a given time and efficiency as a decrease in the number of keystrokes for a given text
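as a minimal sketch, the two measures just described could be computed as follows; the function names are hypothetical and are not taken from profet itself

```python
def keystroke_savings(chars_in_text, keystrokes_used):
    # percentage of keystrokes saved relative to typing every character
    return 100.0 * (chars_in_text - keystrokes_used) / chars_in_text

def chars_per_minute(chars_produced, minutes):
    # time-saving measure: output characters produced per unit of time
    return chars_produced / minutes
```

for example, producing a 100-character text with 65 keystrokes corresponds to a 35% keystroke saving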
the swedish lexicon was created from a NUM million running word balanced corpus augmented with a NUM NUM word frequency list and a NUM NUM high school word list
with syntactic prediction based on the char parsing method the savings were NUM NUM with NUM predictions and NUM NUM with NUM
unexpectedly grammatical trigrams do not appear to add more than NUM in savings at the most over bigrams
testing the same system with syntactic prediction with automaton yielded savings of NUM NUM and NUM NUM with NUM and NUM predictions respectively
it has also proved to be beneficial in spelling and text construction for persons with reading and writing difficulties due to linguistic impairments
qualitative aspects of the texts such as intelligibility and stylistics were judged by readers uninitiated as to the purpose of the study
furthermore an efficient lexicon development algorithm has been developed facilitating the creation of new lexica from either untagged or grammatically tagged text
for comparison with the present version of profet the size of the new lexicons was set to NUM words and NUM NUM bigrams respectively
two subjects with motoric dysfunction and reading and writing difficulties and five persons with dyslexia will participate in the evaluation of the new version
secondly the semantic tagging was done statically i.e. each word received one and only one semantic tag independent of context
our goal then is to compare texts written by these two individuals with the current vs new version respectively of profet
first of all the addition of semantic tags increased the total number of tags from NUM to NUM resulting in sparser training data
with subjects a and b the number of characters in text per minute increased and the total number of keystrokes decreased as expected
it is difficult to quantify the effect of the distribution of classes on a learning algorithm particularly when using naturally occurring data
in general these matrices reveal that both the em algorithm and ward's method are more biased toward balanced distributions of senses than is mcquitty's method
count_i = Σ_m p(s_m | y_m), where count_i is the current estimate of the expected count and p(s_m | y_m) is formulated using NUM
with higher linguistic demands on support for individuals with severe reading and writing difficulties including dyslexia the need for an improved version of profet has arisen
sources: språkdata NUM million words, srf tal punkt NUM million words, göteborgsposten NUM million words and pressens bild NUM million words
each of the text types was divided into a NUM word section and a NUM word section each of which was contained within the larger
the application of guarded constraints within computational linguistics has not been well explored
although formal evaluation has yet to be performed the models examined so far with the crude wsd seem to improve on those without
it is hoped that with some alteration to the automatic seed derivation and allowance for a coarser grained distinction this would be viable
the wilks and stevenson style strategy was chosen instead because it requires storage of one parameter only and is exceptionally easy to apply
two approaches were investigated as possible ways for pretagging the head nouns that are used as input to the preference acquisition system
the verbs in our target set have between NUM (clean) and NUM (make) instances
there was no distinction of verb senses for the preferences acquired and the data and atcm for serve highlights this
wsd is particularly useful when the quantity of data is small as is the case when collecting data for a specific predicate
the information is arranged as follows: on the right hand side is the case frame of the verb written as the semspec participant keywords each starting with a colon
having explained denotations and psemspecs specifically for verbs we can now deal with the task of accounting for the different alternations a verb can undergo
examples are the passive or the substance source alternation (the tank leaked oil / oil leaked from the tank); the truth conditions do not change
moose produces a range of different paraphrases for the same underlying sitspec and one instrument to that end is the generation of several verb configurations
jackendoff in his treatment of the alternation suggests encoding the holistic feature in a primitive (sally sprayed paint onto the wall)
this is the expected value of the log likelihood function for the complete data d = (y, s) where y is the observed data and s is the missing sense value
in these experiments each of the NUM unsupervised disambiguation methods is applied to each of the NUM words using each of the NUM feature sets this defines a total of NUM different experiments
in this study the accuracy of the unsupervised algorithms was less than that of the majority classifier in every case where the percentage of the majority sense exceeded NUM
our feature sets are composed of various combinations of the following five types of features
merging of the two closest clusters continues until only some specified number of clusters remain
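the merging loop described above can be sketched as follows, assuming a caller-supplied pointwise distance function and single-link cluster distance, which is one simple choice among several

```python
def agglomerate(points, k, dist):
    # start with each observation in its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link distance between clusters i and j
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the two closest clusters
        del clusters[j]
    return clusters
```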
in future work we will evaluate using the one sense per collocation rule to seed our various methods
this may help in dealing with very skewed distributions of senses since we currently select collocations based simply on frequency
despite this consistency there were some observable trends associated with changes in feature set
the em algorithm was most accurate for last and line with all three feature sets
NUM m-step: the sufficient statistics from the e-step are used to re-estimate the model parameters θ_i
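as an illustration of the alternating e-step and m-step, the following is a toy em loop for a two-component bernoulli mixture over a single binary feature; this is a didactic sketch, not the model used in the experiments

```python
def em_bernoulli_mixture(data, iters=50):
    pi = [0.5, 0.5]       # mixture weights
    theta = [0.25, 0.75]  # P(x = 1 | component)
    for _ in range(iters):
        # e-step: posterior responsibility of each component for each point
        resp = []
        for x in data:
            lik = [pi[c] * (theta[c] if x == 1 else 1 - theta[c]) for c in (0, 1)]
            z = sum(lik)
            resp.append([l / z for l in lik])
        # m-step: re-estimate parameters from the expected counts
        for c in (0, 1):
            n_c = sum(r[c] for r in resp)
            pi[c] = n_c / len(data)
            theta[c] = sum(r[c] * x for r, x in zip(resp, data)) / n_c
    return pi, theta
```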
for the clustering algorithms a high standard deviation indicates that ties are having some effect on the cluster analysis
observations are grouped in the manner that minimizes the distance between the members of each class
figure NUM example of problematic complement displacement
results may become more consistent if the number of parameters that must be estimated was reduced
despite varying the feature sets the relative accuracy of the three algorithms remains rather consistent
for the general np and pp subgrammars we obtained a recall of NUM and a precision of NUM concerning correct head modifier structure
for application NUM new subgrammars for company names and currency expressions have to be defined as well as a task specific reference resolution method
since next time the upper bound is NUM no more fragments will be considered for the set nec NUM in a similar manner add opt is processed
if the current token is a fragment of type np or name np then inspect the set named nec and select the constraint set typed np
after the analysis of the partially unrecognized messages including the misclassified ones we identified the following major bottlenecks of our current system
our task is to identify those messages which are about violations of the peace treaty and to extract the information about location aggressor defender and victims
special edges: there exist some special basic edges, namely var(var), current-pos(pos) and seek(name, var)
if the current token agrees in case which is tested by type subsumption then push it to lcompl and reduce the upper bound by NUM
tdl allows the user to define hierarchically ordered types consisting of type constraints and feature constraints and was originally developed for supporting high level competence grammar development
if significant changes in the values of certain statistical variables are detected associated terms will be selected as being topic oriented and included in a suggested list
our experience showed that to draw terms that are reflective of a given topic a much larger and more general base sample is required
our intuitions led us to believe that topical single words should appear more frequently and more regularly i.e. at approximately even intervals in the focused sample than in the base sample
to compare the intervals the base mean log interval was subtracted from the focused mean log interval and divided by the raw standard deviation from the base sample
with the mutual information scores in hand a delta score was generated by subtracting the base mutual information score from the focused mutual information score
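a sketch of the delta score, assuming pointwise mutual information computed from joint and marginal probabilities; the function names are hypothetical

```python
import math

def pointwise_mi(p_joint, p_x, p_y):
    # pointwise mutual information of a word pair
    return math.log2(p_joint / (p_x * p_y))

def delta_score(focused_mi, base_mi):
    # delta score: focused-sample MI minus base-sample MI
    return focused_mi - base_mi
```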
thus if we observe the likelihood at point b and it goes over a certain threshold we decide it is the synchronization point for the input data
we used ten thousand tv news texts between NUM and NUM, NUM texts each year, for the evaluation
figure NUM constraint solving ii deals with the interaction of feature and set membership constraints
thus we need to summarize the news program text and then show it on the tv screen
figure NUM number of sentences per article (solid line: tv, dotted line: newspaper)
preliminary research on detection of synchronization points is conducted by using the data we have created
we collected seven and a half hours of recordings of twenty people both male and female
the main research topics we are concerned with include easy portability and adaptability of the core system to extraction tasks of different complexity and domains
starting from untagged corpora mona is used for initial tagging where unknown words are ambiguously tagged as noun verb and adjective
we assume that for each component of the system for which fragment extraction patterns are to be defined a set of basic edges exists
unifying the user annotated start category with the left hand side of this phrase structure rule leads to the annotation of the path specifying the logical form of the construction as bound see below
a head driven generator has to rely on a similar solution as it will not be able to find a successful ordering for the local trees either simply because it does not exist
an innovative approach to hpsg processing is described that uses an off-line compiler to automatically prime a declarative grammar for generation or parsing and inputs the primed grammar to an advanced earley processor
a novel approach to hpsg based natural language processing is described that uses an off-line compiler to automatically prime a declarative grammar for generation or parsing and inputs the primed grammar to an advanced earley style processor
the binding annotations of the lexical entries defining the auxiliary verb are used to determine with how many lexical entries the right hand side category of the rule maximally unifies i.e. its maximal degree of nondeterminacy
clearly then the order of evaluation of the complements in a rule can profoundly influence the efficiency of generation and an efficient head driven generator must order the evaluation of the complements in a rule accordingly
if the generator generates the main verb second then the subcat list of the main verb instantiates the subcat list of the head and generation becomes a deterministic procedure in which complements are generated in sequence
NUM NUM automatic synchronization of text and speech
thus the second approach is not suitable for the tv news text
the result is shown in table NUM
where they appear in the text
it means that the likelihood at point b increases
the evaluation details are as follows
at the same time we have started to create a news speech database
text summarization research in the past may be grouped into three approaches
we thus attempt to minimize the effect of incorrect tagging on the parsing component by allowing label ambiguities but control the increase in indeterminacy and concomitant decrease in subsequent processing efficiency by applying the thresholding technique
in this paper we have outlined an approach to robust domain independent parsing in which subcategorisation constraints play no part resulting in coverage that greatly improves upon more conventional grammar based approaches to nl text analysis
since the coverage on sec is increasing at the same time as on susanne we can conclude that the grammar has not been specifically tuned to the particular sublanguages or genres represented in the development corpus
see taylor et al NUM alshawi et al NUM it is clear that the subsequent grammar refinement phases have led to major improvements in coverage and reductions in spurious ambiguity
the apbs for susanne and sec of NUM NUM and NUM NUM respectively indicate that sentences of average length in each corpus could be expected to be assigned of the order of NUM and NUM analyses i.e.
in this latter figure the mean number of crossings NUM NUM is greater than zero mainly because of incompatibilities between the structural representations chosen by the grammarian and the corresponding ones in the treebank
NUM NUM NUM NUM substantial increase in coverage on the development corpus susanne corresponding to a drive to increase the general coverage of the grammar by analyzing parse failures on actual corpus material
probabilities are assigned to transitions in the lalr NUM action table via a process of supervised training based on computing the frequency with which transitions are traversed in a corpus of parse histories
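the supervised training step can be sketched as relative-frequency estimation over (state, action) transitions; this is an illustration under simplified assumptions, not the actual trainer

```python
from collections import Counter

def transition_probs(parse_histories):
    # count how often each (state, action) transition is traversed,
    # then normalize by the total count for the state
    counts = Counter()
    state_totals = Counter()
    for history in parse_histories:
        for state, action in history:
            counts[(state, action)] += 1
            state_totals[state] += 1
    return {(s, a): c / state_totals[s] for (s, a), c in counts.items()}
```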
these are not likely to be indicative of a particular sense but more reflect the general nature of the wall street journal corpus
we are particularly interested in the impact of the overall dimensionality of the feature space and in determining how indicative different feature types are of word senses
this bottleneck is eliminated through the use of unsupervised learning approaches which distinguish the sense of a word based only on features that can be automatically identified
naturally this approach will work better for verbs which select more strongly for their arguments
this explanation seems more plausible for the em algorithm where features are weighted but less so for mcquitty's and ward's which use a representation that does not allow feature weighting
it seems more reasonable to assume that such text will not usually be available and attempt to pursue unsupervised approaches that rely only on the features in a text that can be automatically identified
the expanded sense definitions are then compared to the context of an ambiguous word and the sense definition with the greatest number of word overlaps with the context is selected as correct
a confusion matrix shows the number of cases where the sense discovered by the algorithm agrees with the manually assigned sense along the main diagonal disagreements are shown in the rest of the matrix
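such a matrix can be built as in the following sketch, with hypothetical function names

```python
from collections import Counter

def confusion_matrix(manual, discovered):
    # keys are (manually assigned sense, discovered sense) pairs
    return Counter(zip(manual, discovered))

def diagonal_agreement(matrix):
    # number of cases on the main diagonal, i.e. agreements
    return sum(n for (m, d), n in matrix.items() if m == d)
```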
the window size the number of values for the pos features and the number of words considered in the collocation features are kept deliberately small in order to control the dimensionality of the problem
those cases where the average accuracy of one algorithm for a particular feature set is significantly higher than another algorithm as judged by the t test p NUM are shown in bold face
the adjectives were significantly more accurate when using mcquitty's method and feature set c
section its content is shown by a sentence in the content area e.g.
as an illustration we will consider the content attribute for the form reader requesting retirement pension of reader
although theoretically speaking our logic comes with some restrictions these have no practical consequences whatsoever
to compare the standard deviations the normalized base standard deviation was subtracted from the normalized focused standard deviation
the general method is to compare a topically focused sample created around the predefined topic with a larger and more general base sample
recall is NUM and precision is NUM when using the log likelihood to order the decision list with a stopping condition that the tagged portion exceeds NUM of the target data
space limitations force us to abstract over the recursive optimization of the rules defining the right hand side categories through considering only the defining lexical entries
since phonology does not guide generation the phonological realization of the head of a construction plays no part in the generation of that construction
the test grammar does make this division and always guarantees the correct order of the complements on the comps list with respect to the obliqueness hierarchy
it is based on the notion of essential arguments arguments which must be instantiated to ensure the efficient and terminating execution of a node
we further improved a typed extension of gerdemann's earley generator with a number of techniques that reduce the number of edges created during generation
more specifically the head of the right hand side of each grammar rule is distinguished and distinguished categories are scanned or predicted upon first
the system is intended for use by the technical authors and translators who design forms
to approximate this complex lp constraint employing the kind of logical machinery described in this paper we can use a description such as the one given in NUM
constraints such as dom1 is included in dom2 essentially build larger domains from smaller ones and can be thought of as achieving the same effect as reape's domain union operation
this work was supported by the commission of the european communities through the project lre NUM NUM reusable grammatical resources where the logic described in this paper has been implemented
test tasks will include dictation and free writing
we would like to make use of speech recognition technology to help the task of synchronizing text with speech
one of the features of the tv news texts is that the first sentence is the most important
based on the hmms key word pair models were obtained from the phonetic transcription
as the next step we need to synchronize the text and the speech
the names of all the applicable rules those that we have discussed here for a verb appear below the line
therefore both elements also occur in the denotation of to disconnect and a co indexed variable provides the link to the psemspec
locative circumstances like in the garage are not restricted to particular verbs and can occur in addition to paths required by the verb
in contrast a semelfactive verb denotes a single occurrence thus in our system a momentaneous activity as for example to knock
thus an explicit transition between instantiated domain knowledge and a language specific semantic sentence representation is seen as the central step in generation
specifically moose accounts for the fact that events can receive different verbalizations even in closely related languages such as english and german
using parts of the target lexicon see section NUM NUM the lexical options for verbalizing the sitspec are determined
salience assignment: for verbs only, a specification of the different degrees of prominence that the verb assigns to the participants
notice that this is different from levin's locatum subject alternation since it does not involve a causer
the former is in its basic form durative (the cat moved) and the latter transformative (the door opened)
profet previously called predict has been evaluated for several years initially together with individuals with slow and laborious writing stemming from a motoric dysfunction
first of all with the dutch version of profet they varied between NUM and NUM depending on the setting of the test parameters
the degree of improvement relating to speed and efficiency was found to vary considerably among subjects depending on their underlying writing abilities and which strategies they employed
however even with these inadequacies the cut with wsd appears to provide a reasonable set of preferences as opposed to the cut at the root node which is uninformative
the head right for example contributes to a higher association score at the location node though its correct sense really falls under the abstraction node
for example in the direct object of build one instance of the word entity occurred which appears at one of the roots in wordnet
the connectee in the denotation therefore must have its counterpart in the psemspec that is the source but there it is marked as optional see figure NUM below
figure NUM experimental results accuracy standard deviation
we record the values of three nominal features for each observation
this makes it difficult to learn their distributions without prior knowledge
our findings show that each of these algorithms is negatively impacted by highly skewed sense distributions
similarly x ⊲*ρ y states that x is related to y via the transitive reflexive closure of ρ
in the following sections we show that our constraint solving rules are sound and every clash free constraint system in normal form is consistent
the constraint V_dom ⊲_dom {v} ensures that every element of the set V_dom precedes the sign v
this constraint is similar to the constraint x ⊲+ρ y except that it permits x and y to be equal
we have shown that the logic of linear precedence can be handled elegantly and deterministically by adding new logical primitives to feature logic
to keep the example simple we assume that the whole domain is in the middle field and we ignore fronting or extraposition
chief has a majority class distribution of NUM and not surprisingly these three content words are all indicative of the dominant sense which is highest in rank
at each step in mcquitty's method a new cluster ckl is formed by merging the clusters ck and cl that have the fewest number of dissimilar features between them
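the merge step can be sketched as follows, assuming dissimilarity between observations is the count of differing nominal features and using the standard mcquitty (wpgma) update, which sets the merged cluster's dissimilarity to any other cluster to the average of the two merged clusters' dissimilarities to it

```python
def mcquitty(observations, k):
    # pairwise dissimilarity: number of features on which two observations differ
    d = {}
    for i in range(len(observations)):
        for j in range(i + 1, len(observations)):
            d[(i, j)] = sum(a != b for a, b in zip(observations[i], observations[j]))
    clusters = {i: [i] for i in range(len(observations))}
    while len(clusters) > k:
        a, b = min(d, key=d.get)          # pair with fewest dissimilar features
        clusters[a].extend(clusters.pop(b))
        del d[(a, b)]
        for c in clusters:
            if c != a:
                ac = (min(a, c), max(a, c))
                bc = (min(b, c), max(b, c))
                # mcquitty update: average of the two merged dissimilarities
                d[ac] = (d[ac] + d.pop(bc)) / 2
    return list(clusters.values())
```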
figure NUM shows the average accuracy and standard deviation of disambiguation over NUM random trials for each combination of word feature set and learning algorithm
to get a feeling for the adequacy of these feature sets we performed supervised learning experiments with the interest data using the naive bayes model
figures NUM NUM and NUM show the confusion matrices associated with the disambiguation of concern interest and help using feature sets a b and c respectively
a context vector is formed for each occurrence of an ambiguous word by summing the vectors of the contextual words the number of contextual words considered in the sum is unspecified
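a sketch of the summation, assuming word vectors are plain lists of floats and that words without a stored vector contribute nothing

```python
def context_vector(context_words, word_vectors, dims):
    # sum the vectors of the contextual words of one occurrence
    v = [0.0] * dims
    for w in context_words:
        vec = word_vectors.get(w)
        if vec is not None:  # unknown words contribute nothing
            for i in range(dims):
                v[i] += vec[i]
    return v
```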
we will continue to integrate the natural language processing and speech processing technology for an efficient closed caption production system and put it to practical use as soon as possible
the specification can be thought of as a formal specification of the intuitive description given in NUM
on the other hand we certainly can not prohibit atoms since they are crucially required in grammar specification
the constraints in NUM are a weaker representation of the disjunctive specification given in NUM
other constraints such as the following involving immediate precedence and first element of a domain are of lesser importance
a deterministic computational model is achieved by weakening the logic such that it is sufficient for linguistic applications involving word order
in other words the set V_dom is in the domain precedence relation with the singleton {v}
rule transconj eliminates the weaker constraint x ⊲+ρ y when both x ⊲ρ y and x ⊲+ρ y hold
to illustrate this consider the following constraints x f yay aax 3f z where a is assumed to be an atom
rule cycle detects cyclic relations that are consistent namely when x precedes or equals y and vice versa then x y is asserted
we have made a number of significant improvements to the system since then the most fundamental being the use of multiple labels for each word
thirdly the largest source of error on unseen input is the omission of appropriate subcategorisation values for lexical items mostly verbs preventing the system from finding the correct analysis
our goal is to develop a system with performance comparable to extant part of speech taggers returning a syntactic analysis from which predicate argument structure can be recovered and which can support semantic interpretation
although this is a different set of sentences it is likely that the upper asymptote for accuracy for the test corpus lies in this region
this work is part of an effort to develop a robust domain independent syntactic parser capable of yielding the unique correct analysis for unrestricted naturally occurring input
the probabilistic model is a refinement of probabilistic context free grammar pcfg conditioning cf backbone rule application on lr state and lookahead item
NUM the three, miles j cooperman, sheldon teller and richard austin, and eight other defendants were charged in six indictments with conspiracy to violate federal narcotic law
nevertheless the number of parameters in the probabilistic model is large it is the total number of possible transitions in an lalr NUM table containing over NUM actions
however we will continue to investigate the appropriateness of this assumption
the distance between two points in a multi dimensional space can be measured using any of a wide variety of metrics see e.g.
this is the accuracy that can be obtained by a majority classifier a simple classifier that assigns each ambiguous word to the most frequent sense in a sample
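a minimal sketch of such a majority classifier baseline; the function name is hypothetical

```python
from collections import Counter

def majority_baseline(sample_senses, test_senses):
    # always predict the most frequent sense observed in the sample
    majority, _ = Counter(sample_senses).most_common(1)[0]
    return sum(s == majority for s in test_senses) / len(test_senses)
```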
li and abe estimate class frequencies by dividing the frequencies of nouns occurring in the set of synonyms of a class between all the classes in which they appear
wordnet does not adhere to this stipulation and so they prune the hierarchy at classes where a noun featured in the set of synonyms has occurred in the data
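the frequency-division estimate described above can be sketched as follows, with hypothetical names; the wordnet lookup is replaced here by an explicit mapping from classes to their synonym sets

```python
def class_frequencies(noun_freqs, class_synonyms):
    # each noun's frequency is divided evenly between all classes
    # whose synonym set contains it
    freqs = {c: 0.0 for c in class_synonyms}
    for noun, f in noun_freqs.items():
        homes = [c for c, syns in class_synonyms.items() if noun in syns]
        for c in homes:
            freqs[c] += f / len(homes)
    return freqs
```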
the set of content words used in formulating the co occurrence features are shown in figure NUM
from the loom model a text drafter generates equivalent texts in the three supported languages guided by some broad stylistic parameters which the author can control
we then need to ensure that precedence constraints never have to consider atoms as their values
determining consistency of such constraints in general involves solving for the following disjunctive choices of constraints
let f p l be a distinct relation symbol then we can equivalently define the first daughter constraint by
instead precedence constraints are directly embedded in feature logic and a deterministic constraint solving procedure is provided
theorem NUM termination the consistency checking procedure terminates in a finite number of steps
this restriction can be imposed by the system i.e. a typed feature formalism itself
the basic idea is to treat immediate precedence as a functional relation whose inverse too is functional
furthermore a sound complete and terminating consistency checking procedure is described
and thanks to ralf steinberger for providing useful comments on an earlier draft
this then raises the question is it possible to generate linearised models
spatio temporal information is generally seen as a circumstance
figure NUM excerpts of sample lexical entries for verbs
tom put the book on the table
on the one hand bateman et al
their denotation includes an activity and a post state
to compute these configurations automatically we define an alternation or extension rule as a NUM tuple with the following components: nam, a unique name; dxt, extension of the denotation
figure NUM sitspecs for configurations of to spray
in the literature such verbs are often also called inchoative NUM the final verb-inherent feature we use is the well-known causative which reflects the presence of a causer in the denotation
the critical group is NUM because if we derive verb configurations from others and rewrite the denotation in this process it has to be ensured that the process is monotonic
in this paper we adopt the latter theoretically more interesting perspective
active edge NUM resulted from active edge NUM through prediction
as the optimal evaluation order of our phrase structure rule for argument composition
the monostratal uniform treatment of syntax semantics and phonology supports NUM
the approach allows efficient bidirectional processing with similar generation and parsing times
the following is an example of such a lexical entry
note that the nonverbal complements do not become further instantiated
the third indexes lexical entries which is necessary to obtain constant time lexical access
completion of an active edge results in an edge with identical backward index
eight persons with motor disabilities participated in the study six with cerebral palsy and one with a muscular disease two of them also evidencing writing difficulties of a linguistic nature
though the numbers of matches vary we find that our automatic suggestion process provides more terms than the manual process that are useful for describing the predefined topic
we found that the second order statistics such as variance or standard deviation of term frequencies across documents provide greater insight into topicality
y np dat e dom then x y guards on set members
NUM daß in der straße einen mann er laufen sah
however it is not entirely clear how order constituent is supposed to interpret various linear precedence statements such as lp1
NUM daß er einen mann in der straße laufen sah (that he saw a man walking in the street)
we expect to increase the accuracy by improving the unsupervised tagger through the use of more linguistic information determined by mona especially for the case of unknown words
the output of the text scanner is a stream of tokens where each word is simply represented as a string of alphabetic characters including delimiters e.g.
currently these verb frames are handled by the shallow parser with no ordering restriction which is reasonable because german is a language with relatively free word order
in case NUM additional fsts for the text structure have been added since the text structure is an important source for the location of relevant information
however we have now started the implementation of a new application together with a commercial partner where a more systematic evaluation of the system is carried out
future research will focus on automatic adaptation and acquisition methods e.g. automatic extraction of subgrammars from a competence base and learning methods for domain specific extraction patterns
here smes is applied on a quite different domain namely news items concerning the german ifor mission in former yugoslavia
therefore the output description part is realized through a function called build item which receives as input the edge variables and a symbol denoting the class of the fst
output description part the output structure of an fst is constructed by collecting together the variables of the recognition part s basic edges followed by some specific construction handlers
however in our current system we only allow the use of type subsumption which is performed by tdl very efficiently and constraints used very carefully and restrictively
we will create a system by integrating the summarization and synchronization techniques with techniques for superimposing characters
we describe main research issues and the project schedule and show the results of preliminary research
while nlp data is typically not well characterized by a normal distribution see e.g.
thus three scores were generated for each word the mean log interval the standard deviation of the mean log interval and the normalized standard deviation of the mean log interval
unlike probability and entropy statistics which yield average scores for the whole document the use of interval makes it possible to get an instantaneous measure at any location in the document
our experiments indicated that the first order statistics probability and entropy alone are not sufficient for gathering information about the topicality of a word in running text
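The interval statistics referred to above (the mean log interval, its standard deviation, and a normalized standard deviation) can be sketched as follows; measuring intervals in token positions and normalizing by the mean are assumptions on my part:

```python
import math

def interval_stats(tokens, word):
    """Sketch of the per-word interval statistics described in the text:
    intervals are gaps between successive occurrences of `word` (in tokens);
    we return the mean log interval, its standard deviation, and the
    standard deviation normalized by the mean (assumed definitions)."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    intervals = [b - a for a, b in zip(positions, positions[1:])]
    if len(intervals) < 2:
        return None  # too few occurrences to measure variability
    logs = [math.log(i) for i in intervals]
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / len(logs)
    std = math.sqrt(var)
    return mean, std, (std / mean if mean else float("inf"))
```

Unlike whole-document probability or entropy, this can be evaluated at any point of the running text by restricting `tokens` to a window.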
for the sake of discussion hereafter we may sometimes refer to the focused sample as focused and the base sample as base
if significant changes in the values of certain statistical variables are detected associated terms are selected from the focused sample as being topic oriented and included in a suggested list
if a two word term occurs in both data samples and receives a negative delta score it would be included in the suggested list as being topically oriented
as we are facing the growing amount of on line text the use of text analysis techniques to access information from electronic sources has become more popular and at the same time more difficult
if a single word term is found in both data samples and it receives negative scores from both interval and standard deviation comparisons it would be included in the suggested list as being topically oriented
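The selection step described in this passage might look like the sketch below, where the per-term statistic and the delta score (focused value minus base value, with negative deltas taken as topical) are assumed definitions:

```python
def suggest_topical(focused, base):
    """Suggest terms whose statistic (e.g. mean log interval) is lower in the
    focused sample than in the base sample, i.e. whose delta score is
    negative; the exact statistic and threshold are assumptions."""
    return sorted(
        term for term in focused
        if term in base and focused[term] - base[term] < 0
    )
```

A term absent from either sample is skipped, matching the requirement that a candidate occur in both data samples.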
the first is to generate summarized sentences based on understanding of the text
the project started in NUM and will end in NUM
this is similar to the seek edge known from augmented transition networks with the notable distinction that in our system recursive calls are disallowed
all unfiltered resulting structures of each component are cached so that a component can take into account results of all previous components
a better solution would be to attach optional constraints directly with lexical entries and to splice them into an fcp after its selection
application of an fcp then starts directly from the input position of the anchor and searches the left and right input parts for candidate fragments
fcps are attached to lexical entries e.g. verbs and are selected right after a corresponding lexical entry has been identified
although these current results should and can be improved we are convinced that the idea of developing a core ie engine is a worthwhile venture
NUM the np pp subgrammars cover e.g. coordinate nps different forms of adjective constructions genitive expressions pronouns
time expr interface to tdl the interface to tdl a typed feature based language and inference system is also realized through basic edges
the only major change to be done concerned the extension of the output description function build item for building up the new template structure
consequently as a first actual extension of smes to the new domain we extended the shallow parser to cope with passive constructions
the baseline test consisted of two tasks a to copy from a given text and b to write about a topic that was chosen freely before the test began
the current project started in july NUM and originated through the search for new applications the desire for more accurate prediction and enhancement of the pedagogical aspects of the user interface
the project is funded by the national labor market board ams the swedish handicap institute hi and the national social insurance board rfv
in an effort to systematically investigate the aid provided by this program a study was conducted in which time saving and effort saving were chosen as parameters
subject d who was not extremely slow felt that the program helped her because it forced her to use a more efficient typing strategy
it should be noted that the test texts belonged to the same corpus from which the lexicon had been built thus assuring good lexicon coverage
of the two subjects who had difficulties with spelling and text construction one showed substantial improvement and the other showed moderate improvement but reported a significant difference in ease of writing
testing jal NUM jal NUM for spanish with frequency based prediction yielded savings of NUM NUM and NUM NUM with the number of predictions set to NUM and NUM respectively
since the focused sample was drawn from the source of NUM upi the construction of its corresponding base sample was also initiated from the same source of the same year
NUM a the disembodied head b the_AT disembodied_VVN head_NN1 similar idiosyncratic rules are incorporated for dealing with gerunds adjective noun conversions idiom sequences and so forth
we also imposed a per sentence time out of NUM seconds cpu time running in franz allegro common lisp NUM NUM on an hp pa risc NUM NUM workstation with NUM mbytes of physical memory
this may yield a system accurate enough for some types of application given that the system is not restricted to returning the single highest ranked analysis but can return the n highest ranked for further applicationspecific selection
firstly improvements in probabilistic parse selection will require a lexicalised grammar parser in which minimally probabilities are associated with alternative subcategorisation possibilities of individual lexical items
a more meaningful parser comparison would require application of different parsers to an identical and extended test suite and utilisation of a more stringent standard evaluation procedure sensitive to node labellings
on susanne retagging allowing only a single label per word results in a NUM NUM label word assignment accuracy whereas multilabel tagging with this thresholding scheme results in NUM NUM accuracy
we therefore ran the same experiment as above using geig to measure the accuracy of the system on the NUM held back sentences but varying the amount of training data with which the system was provided
once the baseline statistics are generated from both data collections a meaningful comparison could spot terms that occur with unusual frequency in the focused sample
the compilation process is illustrated on the basis of the phrase structure rule for argument composition discussed in NUM NUM
automatic text summarization: method for dividing a sentence into smaller sections, key word extraction, method for connecting sentence sections
automatic synchronization of text and speech: transcription speech model, integration system, maximum likelihood matching system, speech database
efficient closed caption production system: integrated simulation system for closed caption production
the second is to digest the text by making use of text structures such as paragraphs
this is true regardless of the feature set employed
these steps iterate until the parameter estimates NUM and NUM converge
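The E-step/M-step iteration referred to above can be sketched for a naive Bayes mixture over binary features; the random initialization, the fixed iteration count, and the smoothing constants are assumptions (a real implementation would instead test convergence of the parameter estimates):

```python
import math
import random

def em_mixture(data, n_senses=2, n_iter=50, seed=0):
    """Bare-bones EM for a mixture of naive Bayes components over binary
    feature vectors, as a sketch of the iteration described in the text."""
    rng = random.Random(seed)
    n_feats = len(data[0])
    prior = [1.0 / n_senses] * n_senses
    p = [[rng.uniform(0.25, 0.75) for _ in range(n_feats)]
         for _ in range(n_senses)]
    post = []
    for _ in range(n_iter):
        # e-step: posterior distribution over senses for each observation
        post = []
        for x in data:
            logs = []
            for s in range(n_senses):
                lp = math.log(prior[s])
                for f, v in enumerate(x):
                    lp += math.log(p[s][f] if v else 1.0 - p[s][f])
                logs.append(lp)
            m = max(logs)
            ws = [math.exp(v - m) for v in logs]
            z = sum(ws)
            post.append([w / z for w in ws])
        # m-step: re-estimate priors and feature probabilities from
        # expected counts, with a small smoothing term
        for s in range(n_senses):
            ns = sum(post[i][s] for i in range(len(data)))
            prior[s] = ns / len(data)
            for f in range(n_feats):
                num = sum(post[i][s] for i, x in enumerate(data) if x[f])
                p[s][f] = (num + 1e-6) / (ns + 2e-6)
    return prior, p, post
```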
we conduct research on the above issues and create a prototype system in the first stage
first the importance of each sentence is calculated by the high frequency key word or tf idf method
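A minimal version of the tf-idf sentence scoring mentioned above, treating each sentence as its own document; whitespace tokenization and length normalization are assumptions, not details taken from the text:

```python
import math
from collections import Counter

def score_sentences(sentences):
    """Score each sentence by the summed tf-idf of its words, normalized by
    sentence length (a sketch of the importance calculation described)."""
    docs = [s.lower().split() for s in sentences]
    df = Counter()
    for d in docs:
        df.update(set(d))  # document frequency: one count per sentence
    n = len(docs)
    scores = []
    for d in docs:
        tf = Counter(d)
        score = sum(tf[w] * math.log(n / df[w]) for w in tf)
        scores.append(score / max(len(d), 1))
    return scores
```

Words occurring in every sentence contribute nothing (idf of zero), so the score favors sentences carrying distinctive key words.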
many regions in europe have multiple languages gist focuses on the trentino alto adige region of northern italy in which all official documentation has to be produced in two languages italian and german laid out side by side on the page
by clicking on the buttons the author can create various types of form part including sections text fields and multiple choice questions the whole form is also considered to be a form part
the most important attribute the content is shown in the main window the other attributes can be viewed by double clicking the relevant line of the model and include the following applicability condition a condition which determines whether a question or section applies to the form filler e.g. the question about the reader s previous surname only applies to married women
when specifying the content of the form the author indirectly edits a knowledge base in the language loom
we will demonstrate the gist system which generates social security forms in english italian and german
only argument heads consisting of common nouns days of the week and months and personal pronouns with the exception of it were used
in this paper some modifications to li and abe s system are described and a comparison is made of the use of some word sense disambiguation
serve has a number of senses including the sense of meet the needs of or set food on the table or undergo a due period
the tree cut features in a model termed an association tree cut model atcm which identifies an association score for each of the classes in the cut
despite these errors the advantages of using automatic parsing are significant in terms of the quantity of data thereby made available and portability to new domains
the reasons are that there is a predominant sense of eat which selects strongly for its direct object and many of the heads in the data were monosemous e.g.
our dataflow analysis treats heads and complements alike and includes the head in the calculation of the optimal evaluation order of a rule
in this paper we focus on the application of the developed techniques in the context of the comparatively neglected area of hpsg generation
this clearly demonstrates an extremely important consequence of using our dataflow analysis to compile a declarative grammar into a grammar optimized for generation
moreover the interchanging of arguments in recursive procedures as proposed by strzalkowski fails to guarantee that input and output grammars are semantically equivalent
testing the developed techniques uncovered important constraints on the form of the phrase structure rules in a grammar imposed by the compiler
we are therefore forced to split this schematic phrase structure rule into two more specific rules at least during the optimization process
a strict top down evaluation strategy suffers from what may be called head recursion i.e. the generation analog of left recursion in parsing
we have reduced the sense inventory of these words so that only the two or three most frequent senses are included in the text being disambiguated
there are two potential problems when using the em algorithm
these common features may be sufficient for the level of disambiguation achieved here
this may explain the better performance of mcquitty s method in disambiguating those words with the most skewed sense distributions the adjectives and adverbs
in this model all features are conditionally independent given the value of the classification feature i.e. the sense of the ambiguous word
it is possible to adjust the em algorithm away from this tendency towards discovering balanced distributions by providing prior knowledge of the expected sense distribution
the results presented are preliminary but show an accuracy percentage in the mid nineties when applied to dixon a name found to be quite ambiguous
the gist consortium includes two organizations that have to implement this requirement the italian social security institute inps and the local government agency for the bolzano province pab
although sometimes clumsy sentences in this language are easily understood
in future work we will investigate modifications of these algorithms and feature set selection that are more effective on highly skewed sense distributions
our objective is to evaluate the effect that different types of features have on the accuracy of unsupervised learning algorithms such as those discussed here
nineteen correspond to the NUM most frequent words that occur in that fixed position in all of the sentences that contain the particular ambiguous word
thus we actually lack some modularity concerning this issue
a vgf is applied after fragment processing took place
processing is very robust and fast between NUM and
NUM part of speech disambiguation morphologically ambiguous readings are disambiguated wrt.
figure NUM a blueprint of the core system
NUM NUM h and expands abbreviations
the german version has a very broad control flow
the output texts serve as drafts which the authors and translators can modify or extend
figure NUM shows part of the gist main window during the definition of a simple pension form
most systems using controlled languages allow users to enter sentences in free text e.g.
an applicable question may be optional if the requested information is inaccessible or sensitive
our approach demonstrates that it is not necessary to employ a non deterministic operation such as domain union to manipulate domains
however if we add immediate precedence to our logic then it is not clear whether we can guarantee linearisable models
NUM daß einen mann in der straße er laufen sah
NUM daß in der straße er einen mann laufen sah
thanks to wojciech skut for developing sample grammars to test the implementation and for working on the interface to profit
when the inflections of a specific word are presented visually or aurally the subject must be able to distinguish between the forms and make the correct selection
at the heart of the em algorithm lies the q-function
the variance between ck and cl is computed as follows
for example in figure NUM we have four observations
for space reasons our treatment is necessarily somewhat superficial since we do not take into account other interacting phenomena such as fronting or extraposition
in effect what we add to our logic is both precedence as a feature and a new constraint for representing the inverse functional precedence
for space reasons we do not cover the logic of guarded feature constraints guards on set membership constraints and guards on precedence constraints
a domain union operation as given in NUM is then employed to construct word order domains locally within an hpsg derivation step
in order to keep to their definition of a tree cut all nouns in the hierarchy need to be positioned at leaves
next a comparison algorithm was applied to these two sets of terms to single out those that were common to both samples yet whose patterns of occurrences differed between these two samples
though using interval alone might still not be sufficient for identifying word topicality it allowed us to measure the variance which would help identify words that were always changing in their rates of occurrence
for each speaker a twoloop four mixture distribution phonetic hmm was learned
guarded constraints can be thought of as conditional constraints whose execution depends on the presence of other constraints
in the following sections we describe a constraint language for specifying lp constraints that overcomes both these deficiencies
we note that precedence can be restricted to nonatomic types such as hpsg signs without compromising the grammar in any way
NUM handling immediate precedence in this section we provide additional constraint solving rules for handling immediate precedence
by x y cs we mean replacing every occurrence of x with y in cs
let succ x f and succ x p denote the sets
the essential difference being that we interpret every feature relation as a binary relation on the domain of interpretation
the condition part g of a guarded constraint if g then s else t is known as a guard
the constraint f x p g y is analogous
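The behaviour of a guarded constraint if g then s else t can be illustrated with a toy evaluation loop; representing the constraint store as a plain set and entailment as set membership is a drastic simplification of the logic described here, used only to show how guards stay suspended until decided:

```python
def apply_guards(store, guards):
    """Repeatedly fire guarded constraints (guard, then_set, else_set):
    once the store entails a guard commit the then-branch, once it
    disentails it (here: contains ("not", guard)) commit the else-branch;
    undecided guards remain pending."""
    pending = list(guards)
    changed = True
    while changed:
        changed = False
        for g in pending[:]:
            guard, then_c, else_c = g
            if guard in store:
                store |= then_c
                pending.remove(g)
                changed = True
            elif ("not", guard) in store:
                store |= else_c
                pending.remove(g)
                changed = True
    return store, pending
```

This mirrors the intuition that a guarded constraint's execution depends on the presence of other constraints in the store.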
the implication of these findings is that the effect and efficiency of a writing aid of this type to a great extent is dependent upon the underlying writing strategy and skills of the user
these results indicated the power of prediction techniques as linguistic support for writing and stimulated the interest for the present focus on use of word prediction for persons with reading and writing difficulties and or dyslexia
NUM description of the new version of profet to date the modifications of the prediction system include extension of scope addition of grammatical and semantic information as well as automatic grammatical tagging of user words
however the improvement in spelling outweighs this problem and the possibility of adding speech synthesis to the other functions of profet was an important and helpful feature to severely dyslexic individuals
therefore speed and efficiency will not be studied
geoff nunberg provided encouragement and much advice on the analysis of punctuation and greg grefenstette undertook the original corpus tokenisation and segmentation for the punctuation experiments
we then attempted to parse the test sentences using the derived verbal entries instead of the original generic entries which generalised over all the subcategorisation possibilities
thus perhaps a more informative test of the accuracy of our probabilistic system would be evaluation against the manually disambiguated corpus of analyses assigned by the grammar
we constructed randomly from susanne a test corpus of NUM in coverage sentences and in this for each word tagged as possibly being an open class verb i.e.
table NUM shows the results of this test with respect to the original susanne bracketings using the grammar evaluation interest group scheme geig see e.g.
we have not relaxed this requirement since it increases ambiguity our primary interest at this point being the extraction of subcategorisation information from full clauses in corpus data
NUM a he told them his reason he would not renegotiate his contract but he did not explain to the team owners
techniques to resolve ambiguities whilst the latter goal favors a more sophisticated grammatical formalism than is typical in statistical approaches to robust analysis of corpus material
for this reason the measure to order the decision list 3the only collocation used was within a window of NUM words either side of the target
in an experiment NUM mid frequency nouns were trained and the algorithm used to disambiguate the same nouns appearing in the semcor files of the brown corpus
these researchers also use variations on the association score given in equation NUM the log of which gives the measure from information theory known as mutual information
for clarity only some of the nodes have been shown and classes are labeled with some of the synonyms belonging to that class in wordnet
this has the advantage that it does not rely on a quantity of handtagged data however the time taken for training remains an issue
on account of the training time that would be required yarowsky s unsupervised algorithm was abandoned for the purpose of tagging the argument heads
this feature is not used for adjectives
reliance on frequency based features as used in this work means that the more skewed the sample is the more likely it is that the features will be indicative of only the majority class
figure NUM help feature set c
we are now ready to consider consistency checking rules for our constraint language
and a variable assignment a such that to a cs
but disallowing functional precedence is less problematic from a grammar development perspective
we shall construct an interpretation 7c l n n
our implementation of the logic as an extension to the profit typed feature formalism shows that a reasonably efficient implementation is feasible
if the sense tagged text were not available as would often be the case in an unsupervised experiment this mapping would have to be performed manually
where frequency data was not available for the target word the word was simply treated as ambiguous and class frequencies were calculated as in experiment NUM
our features represent part of speech pos tags morphological characteristics and word co occurrence such features are nominal and their values do not have scale
the goal of our research is a grammatically more accurate prediction psychological user support and integration with spellchecking developed by hadar in malmö sweden into a writing support tool for dyslexics
the results of some of these adaptations are described here along with a comparison of the selectional preferences acquired with and without one of these methods
as opposed to a general spl term a semspec must contain only upper model concepts and no domain concepts recall that the domain model in moose is not subsumed by the upper model
while to eat usually requires a direct object it can also be used intransitively due to the strong semantic expectation it creates on the nature of the object independent of the context
resultative verbs on the other hand characterize situations in which something is going on and then comes to an end thereby resulting in some new state culminations in our ontology
regarding spatial relationships we find verbs that specifically require path expressions which can not be treated on a par with circumstances recall to put which requires a direct object and a destination
but neither this syntactic division corresponding to participants and circumstances direct or indirect object versus adverbs or prepositional phrases nor the um s semantic postulate that spatio temporal aspects are circumstances holds in general
therefore we define the directionality for group NUM to the effect that an alternation always adds meaning the newly derived form communicates more than the old form the denotation gets extended
the idea is to see verb alternations not just as relations between different verb forms but to add directionality to the concept of alternation and treat them as functions that map one into another
the following alternation rule applies to durative verb readings that denote activities of something being moved to somewhere and extends them to also cover the post state which must be subsumed by completion state
from the perspective of the domain model the roles on the left hand side of the arrows are required to be filled as is encoded in the loom definitions of the underlying concept
the key word pair model is shown in fig NUM the model consists of two strings of words keywords1 and keywords2 before and after the synchronization point when the speech is fed to the model the nonsynchronizing input data travel through the garbage arc while the synchronizing data go through the keywords
reasons why the availability is low are firstly that thousands of characters are used in the japanese language and secondly that the closed captions are produced manually at present and it is a time consuming and costly task
for example if nouns or proper nouns appear in the headline they are considered as important and may be used as measures of finding out how important the other parts of the text are
however if the generator generates second some complement other than the main verb then the subcat list of the head contains no restricting information to guide deterministic generation and generation becomes a generate and test procedure in which complements are generated at random only to be eliminated by further unifications
to mimic the evaluation of the auxiliary verb we determine the information common to all defining lexical entries by taking their generalization i.e. the most specific feature structure subsuming all and unify the result with the original right hand side category in the phrase structure rule
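The generalize-then-unify step described above can be illustrated on flat attribute-value dictionaries; actual typed feature structures are recursive and typed, so this is only a sketch of the idea, with made-up attribute names:

```python
def generalize(entries):
    """Most specific structure subsuming all entries, over flat
    attribute-value dicts: keep only the pairs shared by every entry."""
    common = dict(entries[0])
    for e in entries[1:]:
        common = {k: v for k, v in common.items() if e.get(k) == v}
    return common

def unify(a, b):
    """Flat unification: merge two dicts, failing (None) on conflict."""
    out = dict(a)
    for k, v in b.items():
        if k in out and out[k] != v:
            return None
        out[k] = v
    return out
```

Generalizing the defining lexical entries and unifying the result into the rule's right-hand-side category restricts that category without committing to any single entry, which is the effect the text describes.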
in the case of our example this would be the steps from edge NUM to edge NUM and edge NUM to edge NUM as nothing gets predicted from a passive edge NUM it does not have a forward index
we have conducted preliminary research for automatic text summarization and synchronization of text and speech and the results are as follows
this was done without vs with an increasing number of grammatical tag types NUM unigrams NUM unigrams and bigrams and NUM unigrams bigrams and trigrams
only sentences containing the ambiguous word were used to establish word frequencies
a knowledge specification tool allows the author to build a model of the form in the knowledge representation language loom
we have therefore preferred an input mechanism in which sentences are built through a series of menu guided choices
the words disambiguated and their sense distributions are shown in figure NUM
this will be explored in future work
NUM rules transclos and domprec assert a relation r between pairs of variables x y
we say that a set of constraints in normal form contains a clash if it contains constraints of the form
we say that a set of constraints are in normal form if no constraint solving rules are applicable to it
the constraint solving rules given in figure NUM deal with constraints involving features setmemberships subset and first daughter
to simplify the presentation we have split up the rules into two groups given in figure NUM and figure NUM
the definition in NUM does not make specific assumptions about whether a context free backbone is employed or not
the constraint x 3p y just says that x is related to y via the transitive closure of p
guards on precedence constraints NUM ∃x∃y f x np acc ∈ dom ∧
an important advantage of using tdl in this way is that it supports the specification of very compact and modular finite state expressions
we will only briefly describe some of the current applications built on top of this core machinery see section NUM
an fst consists of a unique name the recognition part the output description and a set of compiler parameters
in the latter case most of the unrecognized expressions concern expressions like nach taszar ungarn im serbischen bzw
size of the text which ranges from very short texts a few sentences up to short texts one page
although this mechanism turned out to be practical enough for our current applications we have defined also complex verb group fragments vgf
in this paper we will concentrate on the technical and implementational aspects of the ie core technology used for achieving the desired portability
in order to support re usability of fst to other applications it is important to separate the construction handlers from the fst definition
starting from the assumption that the core machinery can be used unchanged we first measured the coverage of the existing linguistic knowledge sources
keystroke savings for these systems are presented below
though overlaps vary we find that the automatic suggestion provides more terms that are useful for describing the predefined topic
topically prominent two word terms normally have lower scores in the focused sample that is keyed to their topic
first the mutual information score was computed for each pair of words that occur in each of the two samples
in order to compare the standard deviations across words of different intervals we found this normalization process quite useful
for our experiments of automatic term suggestion we selected a predefined topic called european politics and business
the assumption we make is that precedence is specified in such a way that is appropriate only for nonatomic types
table NUM grammar coverage and ambiguity during
tag unigram NUM bigram NUM and trigram NUM NUM lexicons were created with the same lexicon creating algorithm as the word lexicons
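A tag n-gram lexicon of the kind described can be built in a few lines; the exact counting scheme used for the unigram, bigram, and trigram lexicons is not specified in the text, so this is an assumed variant:

```python
from collections import Counter

def ngram_lexicon(tags, n):
    """Count every contiguous n-gram in a sequence of part-of-speech tags,
    giving the frequency table behind an n-gram tag lexicon."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))
```

Calling it with n set to 1, 2, and 3 over the training tag sequence yields the three lexicons the experiment varies.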
where xk is the mean observation for cluster ck nk is the number of observations in ck and l and nl are defined similarly for cl
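Given the quantities named in this sentence (cluster means x̄k and x̄l, cluster sizes nk and nl), one standard definition of the between-cluster variance is the Ward-style term (nk·nl/(nk+nl))·(x̄k − x̄l)²; whether this is exactly the formula intended is an assumption, and one-dimensional observations are assumed for simplicity:

```python
def between_cluster_variance(obs_k, obs_l):
    """Ward-style variance between clusters ck and cl:
    (nk * nl / (nk + nl)) times the squared distance between the means."""
    nk, nl = len(obs_k), len(obs_l)
    mean_k = sum(obs_k) / nk
    mean_l = sum(obs_l) / nl
    return (nk * nl) / (nk + nl) * (mean_k - mean_l) ** 2
```

Agglomerative clustering would merge, at each step, the pair of clusters for which this quantity is smallest.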
in both cases the cuts are only used to disambiguate the heads appearing with the target verb and the full data sample required for the prior distribution tcm is left as in experiments NUM and NUM
if the tree were pruned at the entity class there would be no possibility for the preference of build to distinguish between the subclasses life form and physical object
in experiment NUM class frequencies were calculated in much the same way as in li and abe s original experiments dividing frequencies for each noun between the set of classes in which they featured as synonyms
because of this the noise from erroneous senses is not as easily filtered and wsd does seem to make a difference although this depends on the verb and the degree of polysemy of the most common arguments
the tree cuts obtained in experiments NUM and NUM have been used for wsd in a bootstrapping approach where heads disambiguated by selectional preferences are then used as input data to the preference acquisition system
if a string of text was found to include that word unit and at the same time occur most frequently in the concordance its leading and trailing closed set words if any were chopped off
using this representation observations that fall close together in feature space are likely to belong to the same class and are grouped together into clusters
in this study we evaluated the performance of three unsupervised learning algorithms on the disambiguation of NUM words in naturally occurring text
for each word the most accurate overall experiment i.e. algorithm feature set combination and those that are not significantly less accurate are underlined
the primary goal of collocation research is to build a comprehensive lexicographic toolkit or to assist automatic language generation applications
after reading each of the relevant documents found in the focused sample the domain expert manually determined NUM topical terms
the size of this dataset is about NUM times larger than the sample data file see table NUM
main research issues in the project are as follows
- automatic text summarization
- automatic synchronization of text and speech
- building an efficient closed caption production system
we would like to have the following system figure NUM based on the research on the above issues
however the equivalent of this kind of minimal types in untyped feature structure grammars are constants which can be used in a similar fashion for off line optimization
our off line compiler extends the techniques developed in the context of the dia in that it compiles typed feature structure grammars rather than simple logic grammars
empty or displaced heads pose us no problem since the optimal evaluation order of the right hand side of a rule is determined regardless of the head
this way we provide an elegant solution to the problems with empty heads and efficient bidirectional processing which is illustrated for the special case of hpsg generation
the developed techniques are direction independent in the sense that they can be used for both generation and parsing with hpsg grammars
this uncovered some important constraints on the form of the phrase structure rules in a grammar imposed by the compiler section NUM
the maximal degree of nondeterminacy introduced by a right hand side category equals the maximal number of rules and or lexical entries with which this category unifies given its binding annotations
the inversion process consists of both the automatic static reordering of nodes in the grammar and the interchanging of arguments in rules with recursively defined heads
for several of the words there are minority senses that form a very small percentage i.e. NUM of the total sample
once the new cluster ckl is created the dissimilarity matrix is updated to reflect the number of dissimilar features between ckl and all other existing clusters
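The dissimilarity-matrix update after creating the merged cluster ckl can be sketched as follows; averaging the two old rows corresponds to McQuitty's method, which the surrounding text names, though whether that exact update is used here is an assumption:

```python
def mcquitty_update(dissim, k, l):
    """Given a dissimilarity matrix as a dict of dicts, return the new
    row for the cluster formed by merging k and l: its dissimilarity to
    each remaining cluster is the average of k's and l's."""
    merged = {}
    for other in dissim:
        if other in (k, l):
            continue
        merged[other] = (dissim[k][other] + dissim[l][other]) / 2
    return merged
```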
the average accuracies for each feature set over NUM random trials were as follows a NUM NUM b NUM NUM and c NUM NUM
other clustering approaches to word sense disambiguation have been based on measures of semantic distance defined with respect to a semantic network such as wordnet
measures of semantic distance are based on the path length between concepts in a network and are used to group semantically similar concepts e.g.
the classes discovered by the unsupervised learning algorithms are mapped to dictionary senses in a manner that maximizes their agreement with the sense tagged text
unfortunately there is little to be done in this case other than reducing the dimensionality of the problem so that fewer parameters are estimated
for words with skewed sense distribution it is likely that the most frequent content words will be associated only with the dominant sense
finally nl t0 contains new roles and fillers that are to be added to the new psemspec these will also contain variables from the denotation extension
since atcms have only been obtained for the subject and object slot and for NUM target verbs no formal evaluation has been performed as yet
the selectional preferences of verbal predicates are an important component of lexical information useful for a number of nlp tasks including disambiguation of word senses
a tcm is similar to an atcm except that the scores associated with each class on the cut are probabilities and should sum to NUM
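the relation between the two models can be sketched as a simple normalization a hypothetical illustration with invented class names

```python
def to_tcm(association_scores):
    # an atcm carries association scores for each class on the cut; a tcm
    # carries probabilities instead, so the scores are normalized to sum to 1
    total = sum(association_scores.values())
    return {cls: score / total for cls, score in association_scores.items()}
```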
in contrast establish only has NUM instances and without any wsd the atcm consists of the root node with a score of NUM NUM
in order to obtain the atcms tree cut models tcms for the target slot irrespective of verb are obtained
in this paper the new functionality will be presented and the possible implications for support at different linguistic levels will be discussed
accordingly we see the upper model as a taxonomy of lexical classes
moose is designed with the goal of strong lexical paraphrasing capabilities in mind
NUM alternations that do change the denotation of the verb
when the model is complete the panel of style settings can be edited through the style menu and the output languages chosen through the language menu after these preliminaries another option in the language menu can be selected in order to generate draft texts
hierarchical relationships among form parts are shown by indenting thus the form is composed of two sections the first section is composed of four text fields and a documentation request and the second section is composed of a multiple choice question with five options
when clustering our data each observation is represented by its corresponding row or column in the dissimilarity matrix
such minority classes are not yet well handled by unsupervised techniques therefore we do not consider them in this study
for the nouns there was no significant difference between feature sets a and b when using the em algorithm
overall the most successful of our procedures was mcquitty s similarity analysis in combination with a high dimensional feature set
in order to use the em algorithm the parametric form of the model representing the data must be known
each pos feature can have one of NUM possible values noun verb adjective adverb or other
overall the most accurate of these procedures is mcquitty s similarity analysis in combination with a high dimensional feature set
ward s and mcquitty s method are agglomerative clustering algorithms that differ primarily in how they compute the distance between clusters
there the em algorithm is used as part of a supervised learning algorithm to distinguish city names from people s names
generally only nouns and proper names are written in standard german with a capitalized initial letter e.g. der wagen the car vs wir wagen we venture
for each analyzed word it returns a set of triples containing the stem or a list of stems in case of a compound the part of speech and inflectional information
in general a template is a record like structure consisting of features and their values in some sense this mechanism behaves like the subcategorization principle employed in constraint based lexical grammars
for the date time expressions we obtained a recall of NUM and a precision of NUM and for the location expressions we obtained NUM and NUM respectively
for simple and complex date and time expressions person names company names currency expressions as well as shallow grammars for general nominal phrases prepositional phrases and general verb modifier expressions
the main components are a tokenizer based on regular expressions it scans an ascii text file for recognizing text structure special tokens like date and time expressions abbreviations and words
on the one hand to demonstrate task independent component technologies of information extraction and on the other hand to encourage work on increasing portability and deeper understanding of
the seek mechanism is very useful in defining modular grammars since it allows for a hierarchical definition of finite state grammars from general to specific constructions or vice versa
however he does not handle unknown words
please note that what we aim at is to synchronize the original text rather than the summarized text with the speech
to illustrate the difference between tv news program text and newspaper articles we compared one thousand randomly selected articles from both domains
although all types of tv programs are to be handled in the project the first priority is given to tv news programs
each form part is presented on a single line of the model its type is shown by a label in the outline area e.g.
in this paper we use ward s and mcquitty s methods to form clusters of observations where each observation is represented by a row in a dissimilarity matrix
based on the average accuracy over part of speech categories the em algorithm performs with the highest accuracy for nouns while mcquitty s method performs most accurately for verbs and adjectives
based on NUM examples of manufacturing plant and NUM examples of living plant this algorithm is able to distinguish between two senses of plant for NUM NUM examples with NUM percent accuracy
this representation uses NUM features to characterize a word where each feature is a linear combination of letter four grams formulated by a singular value decomposition of a NUM by NUM matrix of letter four gram co occurrence frequencies
however for decomposable models such as the naive bayes the e step simplifies to the calculation of the expected counts in the marginal distributions of interdependent features where the expectation is with respect to NUM
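for a naive bayes mixture this simplified e step might look as follows a sketch under the assumption of a single set of conditionally independent features all names here are our own

```python
from math import prod

def e_step(observations, prior, cond):
    # e step for a naive bayes mixture: with a decomposable model the step
    # reduces to accumulating expected counts in the marginal distributions,
    # weighting each observation by the posterior P(class | features)
    exp_class = {c: 0.0 for c in prior}
    exp_feat = {c: {} for c in prior}
    for obs in observations:
        joint = {c: prior[c] * prod(cond[c][f] for f in obs) for c in prior}
        z = sum(joint.values())
        for c in prior:
            p = joint[c] / z
            exp_class[c] += p
            for f in obs:
                exp_feat[c][f] = exp_feat[c].get(f, 0.0) + p
    return exp_class, exp_feat
```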
as illustrated in this figure consistency checking of constraints involving both linear precedence and immediate precedence with a semantics that requires linearised models is not trivial
if the current set of constraints neither entail nor disentail g then the execution of the whole guarded constraint is blocked until more information is available
additionally our constraint language provides a broad range of constraints for specifying linear precedence that go well beyond what is available within current typed feature formalisms
the above data can be captured precisely if we can state that sah requires both its verbal argument laufen and its np argument er to precede it
proof sketch the termination claim can be easily verified if we first exclude rules subset transclos and domprec from consideration
the constraint dom1 dom2 states that every element of the set described by dom1 precedes every element of the set described by dom2
thus NUM should rule out the ungrammatical example in NUM if the assumptions regarding focus are made as in NUM
constraints in this section we provide formal definitions for the syntax and semantics of an extended feature logic that directly supports linear precedence constraints as logical primitives
an appropriateness condition just states that a given feature in our case a relation can only be defined on certain appropriate types
the lp constraint in NUM states that for every pair of constituents in the middle field at least one of the conditions should apply otherwise the sentence is considered ungrammatical
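such a disjunctive lp constraint can be sketched as a check over constituent pairs a toy rendering with our own function names

```python
def lp_consistent(constituent_pairs, lp_conditions):
    # the disjunctive lp constraint holds iff for every pair of middle-field
    # constituents at least one condition applies; if all conditions are
    # violated for some pair, the sentence is ruled out
    return all(any(cond(x, y) for cond in lp_conditions)
               for x, y in constituent_pairs)
```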
at present this is done by hand when the closed captions are produced
information status an indication of whether the requested information is obligatory or optional
the standard deviations give an indication of the effect of ties on the clustering algorithms and the effect of the random initialization on the em algorithm
prior to introduction of word prediction to the writer a baseline was established during repeated sessions with texts written without any writing support
syntax let be the set of relation symbols and let NUM be the set of irreflexive relation symbols
the logic we have described comes with NUM limitations which at first glance appear to be somewhat severe namely no atomic values and no precedence as a feature this is so because it turns out that adding both functional precedence and atoms in general leads to a non deterministic constraint solving procedure
we identify two deficiencies in reape s approach namely that the system is non deterministic following a generate and test paradigm and that it is not possible to be agnostic about order this is so since domain union is a non deterministic operation and secondly underspecification of ordering within elements of a domain is not permitted
in mcquitty s method clusters are based on a simple averaging of the feature mismatch counts found in the dissimilarity matrix
perhaps the most striking aspect of these results is that across all experiments only the nouns are disambiguated with accuracy greater than that of the majority classifier
we conducted the evaluation by taking advantage of the feature
such a base sample should be randomly sampled from the same source as the focused sample and it should contain an array of different topics
the more negative the value the more significant the change and the more prominent the word would appear in the focused sample
the method for suggesting two word terms turned out to be much simpler than that for single word terms though the same techniques are equally applicable
once an interesting word unit of distance two was selected a concordance was built of all sentences containing that word unit
once significant terms of all these three types are identified a comparison algorithm is applied to differentiate terms across the two data samples
thus the description in NUM can be solved deterministically
thanks to gregor erbach for demoing the extended system dubbed cl one
this we believe is a generalisation of reape s approach
so we do not explore this scenario in this paper
a more complex condition would be needed to handle these
if every lp constraint is violated then an inconsistency results
a precise semantics of our constraint language is given next
we shall require that NUM and NUM are disjoint
let cs be a clash free constraint system in normal form
finally rule domprec propagates constraints involving domain precedence
these terms may be highly prominent for the topic yet may not necessarily occur in the focused sample
with these observations in mind they developed an algorithm which has proved to be effective and domain independent
the focused sample was originally created by the domain expert using the NUM united press international upi
where p w1w2 is the frequency in the data collection of the two word compound w1 w2 and p w1 and p w2 the frequency of the word constituents
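the score in question is pointwise mutual information which might be computed as below a minimal sketch where the probabilities are assumed to be relative frequencies from the collection

```python
import math

def pmi(p_w1w2, p_w1, p_w2):
    # pointwise mutual information of the compound w1 w2: how much more often
    # the two words co-occur than chance (independence) would predict
    return math.log2(p_w1w2 / (p_w1 * p_w2))
```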
though the ratio between the focused and base samples was arbitrary in order to generate meaningful statistics we felt that the base sample should be at least NUM times larger in size than the focused sample
our experiments demonstrated that in order to obtain a random assortment of topics to be included in the base sample it may be meaningful to sample documents from the time period before and after the focused documents
after scores were generated for all the words in both the focused sample and the base sample score comparisons between the two samples were carried out in two ways comparing the intervals and comparing the standard deviations
levin points out that this alternation has received much attention in linguistics research and notes that in spite of the efforts a satisfactory definition of the holistic facet has not been found
in this way it derives the readings tom loaded hay onto the wagon tom loaded the wagon with hay jill stuffed the feathers into the cushion jill stuffed the cushion with the feathers
causer sally object paint function ona; ona is a derivative of on and means that something distributively covers a surface e.g. the paint covers all of the wall
now since our instrument for ensuring the well formedness of psemspecs and semspecs is the upper model we need to inspect the role of valency information in the um
circumstances on the other hand are in the um coded as loom relations and there are no restrictions as to what circumstances can occur with what processes
instead we wish to represent the common kernel of the different configurations only once and use a set of lexical rules to derive the alternation possibilities
specifically it can contain variables these can co occur in the cov list the items that the new verbalization covers in addition to those of the old one
a generator needs to know that a verb like to fill can occur in a variety of configurations water filled the tank the tank filled with water tom filled the tank with water
we use the type hierarchy and an extension of the unification and generalization operations such that path annotations are preserved to determine the flow of semantic information between the rules and the lexical entries in a grammar
the dataflow analysis is used to determine the relative efficiency of a particular evaluation order of the right hand side categories in a phrase structure rule by computing the maximal degree of nondeterminacy introduced by the evaluation of each of these categories
we chose instead to deal with the ordering problem by using off line compilation to automatically optimize a grammar such that it can be used for generation without additional provision for dealing with the evaluation order by our earley generator
the first supplies each edge in the chart with two indices a backward index pointing to the state in the chart that the edge is predicted from and a forward index pointing to the states that are predicted from the edge
goal freezing can also overcome the ordering problem but is equally unappealing goal freezing is computationally expensive it demands the procedural annotation of an otherwise declarative grammar specification and it presupposes that a grammar writer possesses substantial computational processing expertise
by subsequent investigation of the maximal degree of nondeterminacy introduced by the evaluation of the complements in various permutations we find that the logical form of a sentence only restricts the evaluation of the nonverbal complements after the evaluation of the verbal complement
more specifically this problem arises when a complement receives essential restricting information from the head of the construction from which it has been extracted while at the same time it provides essential restricting information for the complements that stayed behind
for verbs the applicable alternations and extensions are computed see section NUM and added to the set of options
this separation is in principle widely accepted but views differ on where to draw the line and how to motivate it
aktionsart and denotation pattern: stative state x; durative protracted activity x; semelfactive momentaneous activity x
the variety of phenomena in aktionsart is far from clear cut and there is no generally accepted and well defined set of features
for these a stative culmination extension derives the resultative causative form directly from the stative one
for this subgroup of the fill verbs we define an extension rule that derives from a state reading a resultative one
hence we assign two different sitspecs for the sentences one activity and one event as shown in figure NUM
accordingly we split the alternation in two which only differ in the dxt component reflecting the difference in aktionsart
most of this rule covers two kinds of locative alternation which levin distinguishes the spray load alternation and the clear transitive alternation
figure NUM number of characters per sentence. fig NUM and fig NUM show that in comparison with
if it matches make a copy vo' of vo and assign it a new name as well as the denotation just formed
the clear verbs except for to clean can in addition be intransitive and levin states a separate alternation for them
the lower number is associated with adjectives and the higher with verbs
we disambiguated NUM senses using a NUM NUM training to test ratio
we return to this point in section NUM NUM
in few cases is the standard deviation very small
wsd using the atcms simply selects all senses for a noun that fall under the node in the cut with the highest association score with senses for this word
in experiment NUM cuts obtained in experiment NUM without any initial wsd are used to disambiguate the heads before these are then fed back in
the constraints f x p g y and f x p g y are intended to enforce precedence between two word ordering domains
for these rules NUM rule subset increases the size of succ x f but since none of our rules introduces new variables this is terminating
importantly the denotation of a lexeme need not be a single concept instead it can be a complete configuration of concepts and roles cf
the focused sample represents more or less a topical sublanguage set while the base sample a general language set
therefore the final base sample was created by randomly drawing documents from the years of NUM NUM and NUM
therefore the focus is on the extraction of all interesting word pattems without distinction of domain specificity
in this paper a preliminary experiment is presented in automatically suggesting significant terms for a predefined topic
the normalized standard deviation is produced by simply dividing the raw standard deviation by the mean log interval
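this normalization can be written directly a trivial sketch with our own argument names

```python
def normalized_std(raw_std, mean_log_interval):
    # divide the raw standard deviation of a word's log intervals by its
    # mean log interval to correct for longer intervals having larger spread
    return raw_std / mean_log_interval
```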
the statistical information generated from the sample documents was not rich and sufficient enough for any discriminative judgment
therefore the more negative the delta score the more topically sensitive the two word term is
in most cases raw standard deviation is found to be larger for words having long mean intervals
for automatic suggestion of topical terms initial attempts were made using the sample documents the domain expert created
the highest mutual information score indicates that the individual probabilities are low while the two words occur together frequently
we are responsible for any mistakes
sentences of up to NUM tokens words plus sentenceinternal punctuation are parsed in an average of under NUM second each whilst those around NUM tokens take on average around NUM seconds
the failures are mostly text segmentation problems
table NUM grammar coverage on susanne and sec
the results obtained are given in figure NUM
this paper describes an experimental comparison of three unsupervised learning algorithms that distinguish the sense of an ambiguous word in untagged text
we also need to research on other aspects such as what the best way is to show the characters on the screen for the handicapped viewers
in addition the prototype system is to be used to produce closed captions and the capability and functions of the system will be evaluated
a speech model is produced by using three hours of recordings from four male and four female speakers as training data
we provide a constraint based computational model of linear precedence as employed in the hpsg grammar formalism
similarly laufen would require both its arguments einen mann and in der strafle to precede it
the description in NUM non deterministically requires that at least one of the lp constraints hold
the feature constraint syn dom mf coinstantiates the middle field domain to the variable mf
then for the remainder of the rules termination is obvious since these rules only simplify existing constraints
we only provide a graphical argument given in figure NUM to illustrate that this is indeed possible
feature constraints then require that they behave functionally on the variable upon which the constraint is expressed
rule transclos effectively computes the transitive closure of the precedence relation one step at a time
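the effect of this rule can be sketched as an iterative closure computation a simplified stand in where the precedence relation is a set of pairs

```python
def transitive_closure(prec):
    # repeatedly add (a, d) whenever (a, b) and (b, d) are present, mirroring
    # the one-step-at-a-time propagation of the transclos rule
    closure = set(prec)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure
```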
NUM proceeded to propose six different levels of valency binding
consider to cover in these examples from jackendoff snow covered the ground
this dataflow analysis takes as input a specification of the paths of the start category that are considered fully instantiated
our compiler uses the type hierarchy to determine paths with a value of a minimal type without appropriate features as bound
an example of problematic complement displacement taken from our test grammar is given in figure NUM see next page
our earley generator and the described compiler for off line grammar optimization have been extensively tested with a large hpsg grammar
both the eaa and the dia were presented as approaches to the inversion of parser oriented grammars into grammars suitable for generation
as a result the use of generalization does not suffice to mimic the evaluation of the respective right hand side categories
this allows the evaluation of the filler right after the evaluation of the auxiliary verb but prior to the subject
in case of generation this means that the user annotates the path specifying the logical form i.e. the path cont or some of its subpaths as bound
in order to use passive edge NUM for completion of an active edge we only need to consider those edges which have a forward index identical to the backward index of NUM
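the indexing scheme can be sketched as follows a toy chart where the class and method names are our own

```python
from collections import defaultdict

class Chart:
    # each active edge is stored under its forward index; completion for a
    # passive edge only needs to consider the active edges whose forward index
    # equals the passive edge's backward index
    def __init__(self):
        self.active_by_forward = defaultdict(list)

    def add_active(self, edge, forward_index):
        self.active_by_forward[forward_index].append(edge)

    def completion_candidates(self, passive_backward_index):
        return self.active_by_forward[passive_backward_index]
```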
because both the generalization and the unification operations preserve binding annotations this leads via structure sharing to the annotation that the logical form of the verbal complement can be considered instantiated
the compiler is not able to find an evaluation order such that the earley generator has sufficient restricting information to generate all subparts of the construction efficiently in particular cases of complement displacement
these mutual dependencies between the subconstituents of two different local trees lead either to the unrestricted generation of the partial vp or to the unrestricted generation of the subject in the mittelfeld
our off line grammar optimization is based on a generalization of the dataflow analysis employed in the dia to a dataflow analysis for typed feature structure grammars
to better illustrate the problem that underspecified heads pose consider the sentence hat karl marie geküßt has karl marie kissed
in addition the paths with a value of a maximally specific type for which there are no appropriate features specified for example the path cat can be considered bound
as a result of the structure sharing between the left hand side of the rule and the auxiliary verb category the cont value of the auxiliary verb can be treated as bound as well
this research was partially supported by the national science foundation through grant iri9310819
comparative experiments on disambiguating word senses an illustration of the role of bias in machine learning
the algorithms tested include statistical neural network decision tree rule based and case based classification techniques
unfortunately there have been very few direct comparisons of alternative methods on identical test data
subjectspecific neighborhoods are composed of words having at least one sense marked with that subject code
it is composed of all other words that co occur with the designated word a significant number of times in the ldoce sense definitions
a supervised learning algorithm is trained with a small amount of manually sense tagged text and applied to a held out test set
second if the likelihood function is very irregular it may always converge to a local maximum and not find the global maximum
where dki is the number of dissimilar features between clusters ck and ci and dli is similarly defined for clusters cl and ci
this is at least partially explained by the fact that as a class the nouns have the most uniform distribution of senses
they suggest that a word should potentially have different neighborhoods corresponding to the different ldoce subject code
all such algorithms begin by placing each observation in a unique cluster i.e. a cluster of one
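the agglomerative scheme with mcquitty linkage can be sketched as follows a compact illustration with our own names not the authors implementation

```python
def mcquitty_cluster(dist, k):
    # agglomerative clustering with mcquitty (wpgma) linkage: start with one
    # observation per cluster and repeatedly merge the closest pair, updating
    # distances as the simple average of the two merged clusters' distances
    clusters = [[i] for i in range(len(dist))]
    d = [row[:] for row in dist]
    while len(clusters) > k:
        # pick the closest pair of clusters (i, j) with i < j
        _, i, j = min((d[a][b], a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        merged = [(d[i][m] + d[j][m]) / 2 for m in range(len(clusters))]
        clusters[i] += clusters[j]
        for m in range(len(clusters)):
            d[m][i] = merged[m]
        d[i] = merged
        del clusters[j]
        d = [row[:j] + row[j + 1:] for row in d]   # drop column j
        del d[j]                                   # drop row j
    return clusters
```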
however all of these methods require that manually sense tagged text be available to train the algorithm
it should be noted that the em algorithm relates to a large body of work in speech processing
in general clustering methods rely on the assumption that classes occupy distinct regions in the feature space
unlike ward s method mcquitty s method makes no assumptions concerning the distribution of the data sample
for unsupervised approaches this problem is exacerbated by the difficulty in distinguishing the characteristics of the minority classes from noise
for verbs the value of m indicates the tense of the verb and can have up to NUM possible values
the creation of sense tagged text sufficient to serve as a training sample is expensive and time consuming
mcquitty s method was significantly more accurate for chief common public and help regardless of the feature set
a collocational expression as choueka defines it is a sequence of words whose unambiguous meaning can not be derived from that of its components
in many cases it is possible that the domain expert would introduce some terms based on his or her own professional knowledge about the topic
two word terms are identified through the computation of mutual information and the extension of mutual information assists in capturing multi word terms
the terms suggested are split into three categories single word terms two word terms and multi word terms or phrases
by carefully examining relevant documents in the focused sample a list of terms that are deemed to be significant for the definition of the topic is identified
to manually select significant terms for a predefined topic the domain expert first creates a topic focused sample from one specific source or a combination of sources
finally we analyzed and presented this set of terms as content oriented candidates for the predefined topic in this case european politics and business
more explicitly it yields the number of standard deviations that the focused mean log interval is different from the base mean log interval
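the comparison amounts to a z score a minimal sketch with our own argument names

```python
def delta_score(focused_mean, base_mean, base_std):
    # number of base-sample standard deviations separating the focused
    # sample's mean log interval from the base sample's; more negative means
    # shorter intervals in the focused sample, i.e. a more prominent word
    return (focused_mean - base_mean) / base_std
```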
since typing errors are relatively rare in press releases or similar documents the application of case sensitive rules is a reliable and straightforward tagging means for the german language
a major drawback of previous systems was their restrictive degree of portability towards new domains and tasks which was also caused by a restricted degree of re usability of the knowledge sources
for supporting modularity the different possible kinds of tokens are handled via basic edges where a basic edge can be viewed as a predicate for a specific class of tokens
the small recall is due to some lexical gaps including proper names and unforeseen complex expressions like die mehrzahl der auf NUM NUM geschätzten moslemischen flüchtlinge the majority of the muslim refugees estimated at NUM NUM
resolving this kind of slot sharing requires processing of elliptic expressions of different kinds as well as the need of domain specific inference rules which we have not yet foreseen as part of the core system
tdl is used in smes for performing type driven lexical retrieval e.g. for concept driven filtering and for the evaluation of syntactic agreement tests during fragment processing and combination
a finite state grammar consists of a set of fragment extraction patterns defined as finite state transducers fst where modularity is achieved through a generic input output device
the recognition part of an fst is used for describing regular patterns over such tokens. measurement has been performed on a sun NUM using an on line lexicon of NUM NUM entries
furthermore we assume that such a set of basic edges remains fixed at some point in the development of the system and thus can be re used as pre specified basic building blocks by a grammar writer
the function build item then discriminates according to the specified type and constructs the desired output to some pre defined requests note that in the above case the variables det and adj might have received no token
experiments with NUM other words using collocation seeds result in an average accuracy of NUM percent
the algorithms are mcquitty s similarity analysis ward s minimum variance method and the em algorithm
their mutual information score must be NUM or greater
here the traditional mutual information score was used
table NUM presents statistical information about this dataset
the general method we adopted is as follows
table NUM provides the statistical breakdown of these terms
the result represents the change of mean log intervals
verbs adverbs or conjunctions are extremely rare
i automatic suggestion of significant terms for a predefined topic
table NUM focused and base samples
these sentences were compared for matching text
we also showed results of preliminary research on tv news text summarization and synchronization of text and speech
for example the sense of chicken under victuals would be preferred over the senses under life form when occurring as the direct object of eat
only items appearing with an asterisk in front of them are optional in the sitspec for example a sitspec underlying an open event is well formed without a causer being present
the double box in the middle is the entry point for both transformative and resultative verbs but the incoming arrows produce resultative forms
the results are shown in table NUM although the results indicate this is rather a limited method of disambiguating it was hoped that it would improve the
we computed the accuracy of the method by looking at whether the first sentence is ranked the first or ranked either the first or the second
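this kind of rank based accuracy can be sketched as a top k metric a hypothetical helper where ranks gives the position of the reference first sentence in each case

```python
def rank_accuracy(ranks, k):
    # fraction of cases where the reference first sentence is ranked within
    # the top k positions (k=1: ranked first; k=2: first or second)
    return sum(1 for r in ranks if r <= k) / len(ranks)
```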
the main aim of the project is to establish the technology of producing closed captions for tv programs efficiently using natural language processing and speech recognition technology
as suggested by the example in NUM in general we would want support within typed feature formalisms for at least the following kinds of lp constraints
the above constraints state that y is the f value of x and y is the atom a and z is related to x by the reflexive transitive closure of f
this approach has the advantage of simplicity and training data is only needed for the estimation of one parameter the sense frequencies
most of the closed captions are literal transcriptions of what is being said
the results are shown in fig NUM and fig NUM
the real data will be taken from both radio and tv news programs
clearly the distribution of classes is not the only factor affecting disambiguation accuracy compare the performance of these algorithms on bill and public which have roughly the same class distributions
the final result to be discussed in section NUM NUM is that the differences in the accuracy of these three algorithms are statistically significant both on average and for individual words
the translation states that y which is the f p l value of x precedes or is equal to every f value of x and y is an f value of x
for this to work we require that the feature symbol f p l appears only in the translation of the constraint x f p NUM y
a wide range of constraints involving precedence is provided directly in feature logic ranging from constraints expressing precedence between variables precedence between domains to guards on precedence constraints
the dissimilarity between any existing cluster ci and ckl is computed as the simple average of dki and dli
the baum welch forward backward algorithm has been used extensively in speech recognition e.g.
note that million and company occur frequently
content collocations: features of the form cl1
figure NUM concerns feature set a
at each stage of planning decisions may be influenced by the stylistic parameters and plans for the three languages may diverge in accordance with cultural as well as linguistic variations
apart from the menu bar the window has three areas the button panel on the left followed by the outline area and the content area
NUM for our purposes however free text input is unsatisfactory because users would need training in the controlled language and might still make errors
during generation a text structurer consults the loom model in order to build a text plan NUM comprising a hierarchy of communicative goals
finally tactical generators for english italian and german compute natural language texts from the spl representations NUM
by clicking on the button the user obtains a list of more specific patterns including person requesting benefit
all attributes are presented in a controlled natural language resembling english note form italian and german versions of this language are also supported
the drafts are displayed in text editing windows one for each language from which they can be saved as text files
large scale automatic semantic tagging of texts in sufficient quantity for preference acquisition has received little attention as most research in word sense disambiguation has concentrated on quality word sense disambiguation of a handful of target words
this provides a recall of NUM and precision of NUM which can be compared to a baseline precision of NUM which is calculated as in equation NUM from the number of senses
note crucially that within our approach the specification of precedence constraints such as sign1 precedes sign2 and dom1 precedes dom2 is independent of the domain building constraint i.e. the constraint that dom1 is included in dom
we shall write f(e) to mean the set { e' | (e, e') ∈ f } we say that an interpretation I and a variable assignment α satisfy a constraint c written I, α ⊨ c if the following conditions are satisfied
this is summarised by representing x immediately precedes y by a pair of constraints over the precedence relation p the additional rules given in the figure below are all that is needed to handle immediate precedence
proof sketch for the first part let cs be a constraint system containing a clash then it is clear from the definition of clash that there is no interpretation e and variable assignment a which satisfies cs
NUM sign1 precedes sign2 NUM dom1 and dom are set valued NUM dom1 is included in dom the constraint sign1 precedes sign2 states that sign1 precedes sign2
note that it is not necessary to know whether the pp in der straße is focussed to rule out NUM since the fact that the pronoun ihn is focus is enough to trigger the inconsistency
if x ∃f y ∈ cs where g ranges over ∃g however for practical reasons we want to eliminate any form of backtracking since this is very likely to be expensive for implemented systems
the rest of the definition in NUM ensures that for every pair of elements x and y such that x and y are both members of mf and x precedes y at least one of the lp constraints hold
on the other hand the description in NUM waits until either one of the lp constraints is satisfied in which case it succeeds or all the lp constraints are violated in which case it fails
the consequent t is executed if the current set of constraints disentail the guard g
hence these do not introduce costly choice points
this is highlighted in figure NUM
this means that these rules converge
examples such as NUM are considered ungrammatical
that he saw a man walking in the street
this is illustrated schematically in NUM below
a threshold value was also set such that if any two word unit occurred less than NUM times in the sample or received a mutual information score lower than NUM NUM it was eliminated and would not participate in the next comparison measurement
our intuitions led us to believe that if there is a significant statistical linkage i.e. a high mutual information score between such a pair of words it is highly possible that they belong to a larger linguistic component
therefore any pair which contained closed class words such as determiners prepositions auxiliaries or single letters digit numbers or overly common verbs like give take etc was excluded
that observation leads us to propose that the example is best analyzed as involving a mere activity in the with configuration and an additional transition in the onto configuration
figure NUM provides a synopsis the boxes contain the denotation patterns that correspond to the aktionsart feature and the rules transform a configuration with one aktionsart into another
and also from the first configuration another rule derives the resultative reading which adds the information that the tank ended up empty the tank drained of the water
semspecs are constructed from sitspecs by selecting a um process and mapping sitspec elements to participant roles of that process so that all elements of the sitspec are covered
therefore its lexicon is rich in information so that lexical choices can be made on the basis of various generation parameters which are not discussed in this paper
for to drain the first configuration is the water drained from the tank and the second is either the tank drained or the tank drained of the water
alternations are derived by rewriting the partial semspec and in the case of extensions adding a new subgraph to the denotation and possibly adding nodes to the covering list
we sketch the architecture of a sentence generation module that maps a language neutral deep representation to a language specific sentence semantic specification which is given to a front end generator
transformative: event, pre-state x, post-state not x; resultative: event, activity x, post-state y; causative: activity, causer x
the optimal evaluation order of the right hand side categories is found by comparing the maximal degree of nondeterminacy introduced by the evaluation of the individual categories with the degree of nondeterminacy the grammar is allowed to introduce
if the degree of nondeterminacy introduced by the evaluation of one of the right hand side categories in a rule exceeds the admissible degree of nondeterminacy the ordering at hand is rejected
because both verbal categories have defining lexical entries which do not instantiate the logical form of the nonverbal arguments the dataflow analysis leads to the conclusion that the logical form of the nonverbal complements never becomes instantiated
in combination with the underspecification of the complements this allows the rule not only to be used for argument composition constructions as discussed above but also for constructions in which a finite main verb becomes saturated
this means that the logical form of the nonverbal complements becomes available either upon the evaluation of the tagged complement in case of argument composition or upon the evaluation of the finite verb in case the head of the rule is a ditransitive main verb
our dataflow analysis ignores the grammatical head but identifies instead the processing head and no less importantly the first processing complement the second processing complement and so on
with circumstances the situation is different a sitspec is complete and well formed without the information on for instance the location of an event
if the value is different from full it also gets verbalized such as in jill filled the tank to the second mark
the filler of the value role in the post state appears in angle brackets because it is a default value which we do not discuss further here though
the class formed by them is somewhat heterogeneous with respect to the aktionsart though it contains for example to move as well as to open
the first rule derives for example tom moved the cat from the cat moved and the second tom closed the door from the door closed
thus the arrows give the argument linking for the base form of the verb which can be quite simple as in open or move
on the left hand side are excerpts from the denotation the names of the roles whose fillers are co indexed with the respective position in the case frame
however our data does not immediately lend itself to a distance based interpretation
the m step simplifies to the calculation of new parameter estimates from these counts
this kind of service has been provided for more than NUM of the tv programs in the united states and in europe however it is available in only NUM of the tv programs in japan
for most domains such text is not available and is expensive to create
morphosyntactic features: standard features needed by the surface generator to produce correct utterances
stative resultative example: water filled the tank / the tank filled with water
for this brief description of the system architecture see figure NUM
the water drained from the tank { in the garage }
these neighborhoods are used to increase the number of words in the ldoce sense definitions while still maintaining some measure of lexical cohesion
as an example consider the NUM most frequent content words occurring in the sentences that contain chief officer executive and president
the following three sections describe in detail the methods for generating each of the three categories
when automatically suggesting content two word terms we looked at the mutual information scores for adjacent words
to capture topicality we were only interested in pairs of words with high mutual information scores
one of the applications of text analysis is to identify and extract significant terminology from running text
the remaining text string was presented as a suggested multi word term
two steps led to our automatic suggestion of topic oriented two word terms
at the discourse level technical terms tend to be repetitive
the information in both measurements is essentially equivalent since entropy is just the log inverse of probability
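a small numeric check of this equivalence (illustrative only): the surprisal -log2 p carries the same information as the probability p, since each is recoverable from the other.

```python
import math

# surprisal as the log inverse of probability: -log2(p) = log2(1/p)
p = 0.125
surprisal = -math.log2(p)
print(surprisal)             # 3.0
print(2 ** -surprisal == p)  # True: p is recovered exactly
```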
compared with newspaper text the tv news program text has the following features: fewer sentences per text and longer sentences
if we summarize tv news program text by selecting sentences from the text it will be a rough summarization
the third is to detect important or relevant words (segments in the case of japanese) and determine which section of the text is important and then put them together to obtain a summarization of the text
the syntax of the constraint language is defined by the following bnf definitions
rule subset deals with subset constraints and adds a new constraint
for subject e who was extremely slow and very easily exhausted the program had only begun to have an effect but was expected to continue to improve performance even after the study had ended
it has been designed to enumerate possible valencies for predicates verbs adjectives and nouns by including separate rules for each pattern of possible complementation in english
many of the failures are due to the root sentence requirement enforced by the parser when dealing with fragments from dialogue and so forth
we utilised the anlt metagrammatical formalism to develop a feature based declarative description of part of speech pos label sequences see e.g.
the two corpora differ considerably since the former is drawn from american written text whilst the latter represents british transcribed spoken material
NUM NUM NUM NUM improving the accuracy of the system by trying to ensure that the correct analysis was in the set returned
and a more conventional one is that it incorporates some rules specifically designed to overcome limitations or idiosyncrasies of the tagging process
although we report promising results parse selection that is sufficiently accurate for many practical applications will require a more lexicalised system
further significant improvements in this area would require corpusspecific additions and tuning whose benefit would not necessarily carry over to other corpora
but would stay b she left ( who could blame her ) during the chainsaw scene and went home
thus his grammar is thoroughly integrated and it would be harder to extract an independent text grammar or build a modular semantics
it is designed as a sentence generation module that pays attention to language specific lexical idiosyncrasies and that can be incorporated into a larger scale text generator
causative to pour requires a direct object as well as a path with either a source or a destination or both pour the water from the can into the bucket
in this section we will introduce a number of extension rules for which we can give a clear definition in terms of aktionsart features as they were introduced in section NUM NUM
in this work they distinguish senses to the homograph level given the correct part of speech and report a NUM accuracy using the most frequent sense specified by ldoce ranking
for nouns m is binary indicating singular or plural
co occurrences: features of the form ci are binary co occurrence features
every experiment utilizes all of the sentences available for each word
morphology: the feature m represents the morphology of the ambiguous word
an extended feature logic which adds a wide range of constraints involving precedence is described
note that in this example the verb itself is not part of its own domain
thanks to herbert ruessink and craig thiersch for using and providing feedback on the implementation
the definition given in NUM extends the description given in NUM
some further work is necessary to determine the computational complexity of our constraint solving procedure
for instance when pronominal complements are involved then not all permutations are acceptable
the consequent s is executed if the current set of constraints entail the guard g
our approach is in the spirit of reape s approach but improves upon it
the binary constraint vi v enforces precedence ordering between the signs vi and v
a few stative verbs can not be resultative without being also causative
optional participants are enclosed in angle brackets
finally add nr0 to the psemspec
in our ontology these are transitions
we illustrate our goal with an example
such a case is represented schematically in figure NUM see next page
the second optimization creates a table of the categories from which predictions have been made
gerdemann s generator follows a head driven strategy in order to avoid inefficient evaluation orders
the optimal evaluation order for a phrase structure rule need not necessarily be head first
extensive testing with a large hpsg grammar revealed some important constraints on the form of the grammar
this is problematic in case cl provides essential bindings for the successful evaluation of the complement c2
use the prediction step to restrict feature instantiations on the predicted phrases and thus lacks goal directedness
note that subpaths of a path marked as bound are considered bound too
however despite these heuristic improvements the problem of goal directedness is not solved
given the above semantics it turns out that the first daughter constraint can be defined in terms of other constraints in the logic
we proposed a set of alternation extension rules that derive such configurations from the basic configuration which is the only one stored in the lexicon
we will improve the prototype system in the second stage
conversely accuracy is still improving at NUM trees with no sign of overtraining although it appears to be approaching an upper asymptote
we also discuss the role of bias in machine learning and its importance in explaining performance differences observed on specific problems
an important question is whether some methods perform significantly better than others on particular types of problems
the current coverage the proportion of sentences for which at least one analysis was found of this system on a general corpus e.g.
in this case the mean crossing figure drops to NUM NUM and the recall and precision rise to NUM NUM as shown in table NUM
however it seems likely that say rule to rule semantic interpretation will be easier with hand constructed grammars with an explicit determinate rule set
system accuracy evaluation results are also improved since we now output trees that conform more closely to the annotation conventions employed in the test treebank
to date we have spent about one person year on grammar development with the effort spread fairly evenly over a two and a half year period
the e step finds the expected values of the sufficient statistics of the complete model using the current estimates of the model parameters
the m step is usually easy assuming it is easy for the complete data problem the e step is not necessarily so
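as a hedged illustration of this e step / m step interplay (a textbook two-coin mixture, not the model used here): the e step softly assigns each trial to one of two biased coins, and the m step re-estimates the biases in closed form from the expected counts.

```python
# illustrative EM sketch: mixture of two biased coins with equal prior
def em(trials, theta_a, theta_b, iters=20):
    for _ in range(iters):
        ha = ta = hb = tb = 0.0
        for heads, n in trials:                       # e step: soft assignment
            la = theta_a ** heads * (1 - theta_a) ** (n - heads)
            lb = theta_b ** heads * (1 - theta_b) ** (n - heads)
            wa = la / (la + lb)                       # responsibility of coin a
            ha += wa * heads; ta += wa * (n - heads)
            hb += (1 - wa) * heads; tb += (1 - wa) * (n - heads)
        theta_a = ha / (ha + ta)                      # m step: closed form
        theta_b = hb / (hb + tb)
    return theta_a, theta_b

trials = [(9, 10), (8, 10), (1, 10), (2, 10)]         # (heads, tosses) per trial
print(em(trials, 0.6, 0.4))                           # converges near (0.85, 0.15)
```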
a variety of options for automatically selecting seeds are discussed; one is to identify collocations that uniquely distinguish between senses
the set of context vectors for the word to be disambiguated are then clustered and the clusters are manually sense tagged
the features used in this work are complex and difficult to interpret and it is not clear that this complexity is required
a narrow window of context one or two words to either side was found to perform better than wider windows
the grammatical information that was added to our system consisted of a set of NUM grammatical tags based on that of suc
we define three different feature sets for use in these experiments
the question that arises is then what happens when we add immediate precedence
for plant the collocations manufacturing plant and living plant make such a distinction
this algorithm requires a small number of training examples to serve as a seed
however this text is only used to evaluate the accuracy of our methods
frequency based features like this one contain little information about low frequency classes
to contrast this set of features with the unrestricted collocations consider concern again
the weight associated with each feature reflects all usages of the word in the sample
in order to evaluate the unsupervised learning algorithms we use sense tagged text in these experiments
suppose that we have n observations in a sample where each observation has q features
two word terms are identified through the computation of mutual information and an extension of mutual information assists in capturing multi word terms
then he or she reads the documents providing a relevance judgment i.e. a reader assigned score to each document
the general method is to compare a topic focused sample based on the predefined topic with a larger and more general base sample
once significant terms of all these three types are identified a comparison algorithm is applied to differentiate terms across the two samples
the use of a log scale for these measurements is to minimize the effect of unduly large variations in words with long mean intervals
more specifically an interval can be measured instantaneously at any point in the text between the occurrences of a particular word
I(w1, w2) = log2 ( P(w1 w2) / ( P(w1) P(w2) ) )
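assuming the standard pointwise formulation of the mutual information score, the computation together with the frequency and score thresholds mentioned earlier can be sketched as follows (threshold values illustrative):

```python
import math
from collections import Counter

# mutual information for adjacent word pairs, with frequency and score
# thresholds applied as filters (threshold values illustrative)
def scored_pairs(tokens, min_freq=2, min_score=1.0):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    out = {}
    for (w1, w2), c in bigrams.items():
        if c < min_freq:
            continue                      # pair occurs too rarely in the sample
        score = math.log2((c / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
        if score >= min_score:            # low-score pairs are eliminated
            out[(w1, w2)] = score
    return out

tokens = "mutual information mutual information the of the of mutual information".split()
print(scored_pairs(tokens))
```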
such an exclusion not only helped getting pairs of words with high mutual information scores but also sped up computation significantly
however one might argue that using tdl in this way could have dramatic effects on the efficiency of the whole system if the whole power of tdl were used
searching stops either if the beginning or the end of a text has been reached or if some punctuation text tokens or other anchors defined as stop markers have been recognized
the advantage of verb group fragments is that they help to handle more complex constructions e.g. time or speech act in a more systematic but still shallow way
if in the same fst a similar edge for noun tokens follows which also makes reference to the variable agr the new value for agr is checked against its old value
np NUM generic syntactic verb subcategorization frames are defined by fragment combination patterns e.g. for the transitive verb frame
in other words this means that each token which is not mentioned in the fcp will stop the application of the fcp on the current input part left or right
the edges ignore token and ignore fragment are used to explicitly specify what sort of tokens will not be considered by add nec or add opt
expressions which contain modal auxiliary verbs or separated verb prefixes are handled by lexical rules which are applied after fragment processing and before shallow processing
the time and date subgrammar covers a wide range of expressions including nominal prepositional and coordinated expressions as well as combined date time expressions e.g. vom NUM
second the wrong recognition of messages is often due to the lack of semantic constraints which would be applied during shallow parsing in a similar way as the subcategorization constraints
identification of single word terms is based on the notion of word intervals
the difference symbolizes how the word is distributed in the focused sample
automatically suggesting single word terms as being topically oriented has been most challenging
though all are statistically based their definitions of collocations differ from one another
identifying domain specific terminology is another research effort
having one unspecified word separating them
for multi word terms the mutual information score was calculated for non adjacent words
this is because the constituent words distribute in a wider range of contexts
first we identified statistically significant terms from both samples
two criteria apply when selecting interesting word units
as we decrease the threshold the detection rate increases however the false alarm rate increases rapidly
we plan to record actual programs as real data in addition to the simulation recording in NUM
the project is divided into two stages: the first three years and the remaining two years
twenty one key word pairs were taken from the data which was not used in the training and selected for evaluation
in NUM and NUM the following research has been conducted and will be continued
we fed speech from one male and one female speaker to the model in the evaluation
applying such a rule to a verbalization option vo works as follows add the contents of dxt to the denotation of vo and match the new part against the sitspec
this notion is different from the standard non directional way in which alternations are seen in linguistics to label the difference we call alternations of group NUM extensions
in particular we propose a set of rules that derive a range of verb alternations from a single base form which is one source of lexical paraphrasing in the system
among these verbs are fill flood soak encrust or saturate the kitchen flooded with water means the same as the kitchen became flooded with water
NUM NUM examples: lexical entries for verbs
to illustrate our treatment of valency argument linking and alternation extension rules figure NUM shows excerpts from lexical entries of eight different verbs
subsumed by the general ontological system a domain model is defined that holds the concepts relevant for representing situations in a technical sample domain and that specifies the exact conditions for the
next the user can click either on person or on benefit to expand the pattern further this process continues until all expandable elements have been eliminated
each form part is characterized by a set of attributes
furthermore we use abbreviations of paths such as cont for synsem loc cont and assume that the semantics principle is encoded in the phrase structure rule
a potential problem for our approach is posed by the requirement that the phrase structure rules in the grammar need to have a particular degree of specificity for the generalization operation to be used successfully to mimic its evaluation
this is best illustrated on the basis of the following more schematic phrase structure rule
this causes the rejection of all possible evaluation orders for this rule as the evaluation of an unrestricted nonverbal complement clearly exceeds the allowed maximal degree of nondeterminacy of the grammar
in this case the maximal degree of nondeterminacy that the evaluation of the auxiliary verb introduces is very low as the logical form of the auxiliary verb is considered fully instantiated
it is shown that combining off line techniques with an advanced earley style generator provides an elegant solution to the general problem that empty or displaced heads pose for conventional head driven generation
this mixture of top down and bottom up information flow is crucial since the top down semantic information from the goal category must be integrated with the bottom up subcategorization information from the lexicon
however evaluating the head of a construction prior to its dependent subparts still suffers from efficiency problems when the head of a construction is either missing displaced or underspecified
the degree of nondeterminacy the grammar is allowed to introduce is originally set to one and consecutively incremented until the optimal evaluation order for all rules in the grammar is found
to determine the importance of a sentence we counted the number of the key words in the sentence and then divided it by the number of words including function and content words
in the tf idf method first the weight of each word is computed by multiplying its frequency in the text tf and its idf in a given text collection
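a minimal sketch of that weighting (the logarithm base and the small test collection are assumptions, not the paper's setup):

```python
import math

# tf.idf sketch: weight each word by its frequency in the text (tf)
# times its inverse document frequency over a collection (idf)
def tfidf_weights(text, collection):
    n_docs = len(collection)
    weights = {}
    for w in set(text):
        tf = text.count(w)                              # frequency in the text
        df = sum(1 for doc in collection if w in doc)   # documents containing w
        weights[w] = tf * math.log(n_docs / df)         # idf = log(N / df)
    return weights

collection = [["news", "program", "caption"],
              ["news", "speech", "recognition"],
              ["caption", "production"]]
text = ["news", "caption", "caption"]
w = tfidf_weights(text, collection)
print(w["caption"] > w["news"])  # True: "caption" occurs more often in this text
```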
the aim of the research on automatic text summarization is to summarize the text fully or partially automatically to a proper size in order to assist the closed caption production
we have described a national project in which speech of tv programs is changed into captions and superimposed on the original programs for the benefit of the hearing impaired people in japan
the evaluation results are shown in table NUM
in NUM we collected the speech data by simulating news programs i.e. the tv news texts were read and recorded in a studio rather than actual tv news programs on the air were recorded
first the written tv news text is changed into the stream of phonetic transcriptions and then synchronization is done by detecting the time points of the text sections and their corresponding speech sections
the outline of each research issue is described next
then the sentences are ranked according to their importance
causative extensions example: the napkin soaked / tom soaked the napkin
this point will be elaborated on in section NUM NUM
figure NUM matrix of feature values
all features of this form have NUM possible values
thus before we employ either clustering algorithm
initially this attribute is set to the pattern form title the square brackets indicating an element to be expanded in the interface such elements are implemented as buttons
information source: an indication of where to find the requested information
if selected this becomes the current pattern in place of form title
to specify an attribute value the user must create a sentence in the controlled language
uli and uri indicate the word occurring in the position i places to the left or right respectively of the ambiguous word
cl1 and cr1 indicate the content word occurring in the position NUM place to the left or right respectively of the ambiguous word
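a possible rendering of these positional features (the stop list and the helper names are hypothetical): ul_i / ur_i pick the word i places left or right of the ambiguous word, and cl1 / cr1 the nearest content word on either side.

```python
# illustrative stop list standing in for closed-class words
STOP = {"the", "a", "of", "in", "is"}

def window_features(tokens, pos, width=2):
    feats = {}
    for i in range(1, width + 1):
        feats[f"ul{i}"] = tokens[pos - i] if pos - i >= 0 else None
        feats[f"ur{i}"] = tokens[pos + i] if pos + i < len(tokens) else None
    left = [w for w in tokens[:pos] if w not in STOP]       # content words left
    right = [w for w in tokens[pos + 1:] if w not in STOP]  # content words right
    feats["cl1"] = left[-1] if left else None
    feats["cr1"] = right[0] if right else None
    return feats

tokens = "the chief concern of the company is growth".split()
print(window_features(tokens, tokens.index("concern")))
```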
for NUM of the NUM words there was a single algorithm that was always significantly more accurate than the other two across all features
the fact that adverb is represented by a larger number than noun is purely coincidental and implies nothing about the relationship between nouns and adverbs
microplanning rules are applied to this plan in order to obtain plans for individual sentences expressed in extended spl sentence planning language NUM
the former are durative activities that result from repeating the same occurrence
cl can not be evaluated prior to the head and once h is evaluated it is no longer possible to evaluate cl prior to c2
if a constraint solving rule transforms cs to cs' then I, α ⊨ cs iff I, α ⊨ cs' proof sketch the soundness claim can be verified by checking that every rule indeed preserves the interpretation of every variable and every relation symbol
however it is possible to add immediate precedence and extend the constraint solving rules described in this paper in such a way that they are sound and complete with respect to the current semantics which does not insist on linearised models
the interpretation function R is defined as follows: R(f) maps each feature f to its successor relation and R(p) to succ; it can be shown by a case by case analysis that I, α ⊨ k for every constraint k in cs
free then one needs the following disjunctive statements: x ≺ y ≺ z ∨ x ≺ z ≺ y ∨ y ≺ x ≺ z ∨ y ≺ z ≺ x ∨ z ≺ x ≺ y ∨ z ≺ y ≺ x it is simply not possible to be agnostic about the relative ordering of sequence elements within reape s system
ordered dags: the models generated by the completeness theorem interpret the map of every precedence relation p as a directed acyclic graph dag as depicted in figure NUM
however sentences in natural languages are always totally ordered i.e. they are strings of words
NUM { er } { sah } { einen mann in der straße } { laufen }
our idea is to employ a specification such as the one given in NUM which is a partial specification of the lexical entry for the verb sah
figure NUM difficulty in guaranteeing linearisable models with immediate precedence
as slow writing speed is often believed to be a very important issue for individuals with motoric impairments its main purpose was to accelerate the writing process
the subjects must be linguistically competent enough to benefit from the different features of the new version of profet i.e. able to make a choice
additionally the natural object class changes from a slight preference in the atcm without wsd to a score below NUM indicating no evidence for a preference with wsd
ribas explains that this occurs because some individual nouns occur particularly frequently as complements to a given verb and so all senses of these nouns also get unusually high frequencies
this is an important requirement for practical systems
the precedence constraint such as sign1 precedes sign is intended to be captured by the constraint sign1 3p sign where p denotes the user chosen immediate precedence relation
thus in example NUM the domain of the verb is constructed by including the domains of the subcategorised arguments enforced by constraints that include the np domain and the verb domain in dom
the constraint x if p NUM y states that y is the first daughter amongst the f values of x i.e. is in the p relation with every f value of x
however if a cfg backbone is employed then we assume that the value of the subcat attribute is treated as an unordered sequence i.e. a set as defined in NUM
our constraint solving rules are deterministic and incremental
subject areas feature logic constraint based grammars
the constraint f x p g y states that every f value of x precedes i.e. is in the p relation with every g value of y
however we believe that it is polynomial
we represent our data sample in terms of a dissimilarity matrix
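one way to realize this, assuming binary feature vectors and a simple mismatch fraction as the dissimilarity measure (the measure is illustrative, not the paper's):

```python
# pairwise dissimilarity over binary feature vectors as the fraction of
# the q features on which two observations differ (illustrative measure)
def dissimilarity_matrix(sample):
    q = len(sample[0])
    n = len(sample)
    return [[sum(a != b for a, b in zip(sample[i], sample[j])) / q
             for j in range(n)] for i in range(n)]

sample = [(1, 0, 1, 0), (1, 0, 0, 0), (0, 1, 0, 1)]
d = dissimilarity_matrix(sample)
print(d[0][0])  # 0.0: every observation matches itself
print(d[0][1])  # 0.25: the two vectors differ on one of four features
```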
experiments were conducted to disambiguate NUM different words using NUM different feature sets
the essential idea is to use set valued descriptions to model word order domains
NUM daß einen mann er in der straße laufen sah
NUM daß er in der straße einen mann laufen sah
a sound complete and terminating deterministic constraint solving procedure is given
these constraint solving rules can be employed for building an efficient implementation
NUM daß in der straße ihn er laufen sah
figure NUM linearisation of precedence ordered dags
those examples in the test set that are most confidently disambiguated are added to the training sample
the topicalized partial vp anna lieben receives its restricting semantic information from the auxiliary verb and upon its evaluation provides essential bindings not only for the direct object but also for the subject that stayed behind in the mittelfeld together with the auxiliary verb
in this approach important parts of the text which are to be kept in the summarization are determined by their locations i.e.
while it is desirable it is not at present a practical method for summarizing actual tv news program text
for most of the tv news programs today scripts i.e. written text are available before they are read out by newscasters
the auxiliary verb category is unified with its defining lexical entries under preservation of the binding annotations
off line compilation section NUM is used to produce grammars for the earley style generator section NUM
by matching forward and backward indices the edges that must be combined for completion can be located faster
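the index matching idea can be illustrated with a small sketch. the data layout here is assumed, not taken from the generator described in the text: edges are stored in a dict keyed by forward index, so candidates for completion are found by one lookup on an edge's backward index instead of scanning the whole chart.

```python
# Illustrative index-keyed edge lookup (layout assumed, not the paper's):
# chart maps a forward index to the edges that start there, so the edges
# whose forward index matches a given backward index are found directly.
from collections import defaultdict

chart = defaultdict(list)  # forward index -> edges starting there

def add_edge(edge):
    chart[edge["fwd"]].append(edge)

def completion_candidates(edge):
    """Edges whose forward index equals this edge's backward index."""
    return chart[edge["bwd"]]

add_edge({"cat": "np", "fwd": 2, "bwd": 4})
e = {"cat": "vp", "fwd": 0, "bwd": 2}
assert [c["cat"] for c in completion_candidates(e)] == ["np"]
```

the lookup is constant time per completion attempt, which is where the claimed speedup over linear chart scans would come from.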
now we mark the paths of the defining lexical entries whose instantiation can be deduced from the type hierarchy
on the basis of this annotated rule we investigate the lexical entries defining its right hand side categories
the authors wish to thank paul king detmar meurers and shuly wintner for valuable comments and discussion
underspecification of the head of the rule allows it to unify with both finite auxiliaries and finite ditransitive main verbs
theorem NUM soundness let z o be any interpretation assignment pair and let cs be any set of constraints
the constraint solving rules given in figure NUM deal with constraints involving the precedes and the precedes or equal to relations and domain precedence
a simple way of treating alternations is using a separate lexical entry for every configuration but that would clearly miss the linguistic generalizations
from this viewpoint there are two groups of alternations NUM alternations that do not affect the denotation of the verb
the rules will be conveniently simple to state thanks to the upper model which provides the right level of abstraction from syntax
r0c is a list of pairs that exchange participant role names or the um type in the psemspec this replacement can also change optionality
examples to fill is stative to drain is durative to open is transformative to remove is resultative causative
consequences are higher computational cost in finding lexical options but also a higher flexibility in finding different verbalizations of the same event
alternation rules for verbs only pointers to lexical rules that represent alternations the verb can undergo see section NUM
it is not possible to specify a path expression which will be realized as a prepositional phrase as an obligatory participant
here our best performance using a larger sample with a natural distribution of senses is only an increase of NUM percentage points over the accuracy of the majority classifier
once the original news program text is summarized it should be synchronized with the actual sound or the speech of the programs
however there are strong constraints on ordering in the middle field
a constraint system cs in normal form is consistent iff cs is clash free
for the logic that we have described this is always possible
pri represents the pos of the word i positions to the right
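a plausible reading of this feature definition can be made concrete in a few lines. the function name and tag values below are our own illustrations, not the paper's implementation.

```python
# Hedged sketch of the positional POS feature: pr_i is the part-of-speech
# tag of the word i positions to the right of the target word; positions
# past the sentence edge yield None. Names here are illustrative.

def pos_feature(tags, target, i):
    """POS of the word i positions to the right of index `target`."""
    pos = target + i
    return tags[pos] if 0 <= pos < len(tags) else None

tags = ["DT", "NN", "VBZ", "JJ"]
assert pos_feature(tags, 1, 1) == "VBZ"   # pr1 of the word at index 1
assert pos_feature(tags, 3, 1) is None    # falls off the sentence edge
```

a mirror-image feature for words to the left would pass a negative offset.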
a value of NUM indicates that observations i and j are identical
conceptually the neighborhood of a word is a type of equivalence class
supervised learning approaches to word sense disambiguation fall victim to the knowledge acquisition bottleneck
prior to the development of their algorithm they performed a thorough study on the linguistic properties of technical terminology
to check the quality of the suggested terms we compare them against terms manually determined by a domain expert
our first step was to compute mutual information scores for a word unit separated by a distance of two i.e.
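the distance-two mutual information computation can be sketched as follows. this uses the standard pointwise mutual information estimator over raw counts; the paper's exact estimator may differ, and all names are ours.

```python
# Hedged PMI sketch for word pairs separated by a fixed distance (here
# two), computed from raw counts; the exact weighting in the source may
# differ from this textbook formulation.
import math

def pmi(pair_count, x_count, y_count, n_pairs, n_words):
    """PMI = log2( p(x,y) / (p(x) * p(y)) )."""
    p_xy = pair_count / n_pairs
    p_x = x_count / n_words
    p_y = y_count / n_words
    return math.log2(p_xy / (p_x * p_y))

def pairs_at_distance(tokens, d=2):
    """All (w_i, w_{i+d}) pairs in a token sequence."""
    return [(tokens[i], tokens[i + d]) for i in range(len(tokens) - d)]

toks = ["mutual", "x", "information", "mutual", "y", "information"]
assert ("mutual", "information") in pairs_at_distance(toks, 2)
```

a high score flags word units whose co-occurrence at that distance is much more frequent than chance would predict.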
we selected the interval between the occurrences of a word as the basis for analysis
this paper presents a preliminary experiment in automatically suggesting significant terms for a predefined topic
a set of statistical measures is used to identify significant word units in both samples
in technical terminology word constituents are limited to adjectives nouns and occasionally prepositions
they report that structurally technical terms make heavy use of noun compounds
application of nlp technology to production of closed caption tv programs in japanese for the hearing impaired takahiro wakao telecommunications advancement organization tao of japan terumasa ehara nhk science and technical research labs tao eiji sawamura tao yoshiharu abe mitsubishi electric corp information technology r d center
to find important words in the text high frequency key word method and tf idf term frequency inverse document frequency method have been adopted and the two methods are evaluated automatically on a large scale in our preliminary research
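the tf idf scoring mentioned above can be sketched briefly. this uses the standard formulation, term frequency times log inverse document frequency; the weighting scheme actually used in the project is not specified here, and the names are ours.

```python
# Minimal tf-idf sketch (textbook formulation, not necessarily the
# project's exact scheme): tf is the in-document frequency of the word,
# idf discounts words that appear in many documents.
import math

def tf_idf(word, doc, docs):
    """tf-idf of `word` in `doc`, relative to the collection `docs`."""
    tf = doc.count(word)
    df = sum(1 for d in docs if word in d)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

docs = [["tao", "news", "news"], ["news", "text"], ["speech"]]
assert tf_idf("news", docs[0], docs) > tf_idf("text", docs[0], docs)
```

by contrast the high frequency key word method would rank words by raw tf alone, which is what the idf factor is meant to correct.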
i would also like to thank geoff towell for providing access to the line corpus
the parser throughput on these tests for sentences successfully analyzed is around NUM words per cpu second on an hp pa risc NUM NUM
the probabilistic parser was tested on the NUM sentences held out from the manually disambiguated treebank of lengths NUM NUM tokens mean NUM NUM
despite nunberg s observation that text grammar is distinct from syntax text grammatical ambiguity favors interleaved application of text grammatical and syntactic constraints
the work was also supported by uk dti salt project NUM NUM integrated language database and by serc epsrc advanced fellowships to both authors
for both corpora the majority of sentences analyzed successfully received under NUM parses although there is a long tail in the distribution
we have experimented with increasing the richness of the lexical feature set by incorporating subcategorisation information for verbs into the grammar and lexicon
to determine what this might be we ran the system on a set of NUM sentences randomly extracted from the training corpus
unification of the residue of features not incorporated into the backbone is performed at parse time in conjunction with reduce operations
from the model in figure NUM the system will generate the text shown in figure NUM along with equivalent versions in italian and german
this measures the relatedness between two words or in the class based work on selectional preferences between a class c and the predicate v
in their approach selectional preferences are represented as a set of classes or a tree cut across the hierarchy which dominates all the leaf nodes exhaustively and disjointly
as yet the description length has assumed a tree rather than a dag and it is apparent that cuts at nodes with shared daughters will be penalized in the current scheme
null in contrast where the quantity of data is sparse and the verb selects less strongly the cut obtained from fully ambiguous data experiment NUM is unhelpful for wsd
for the selectional preference acquisition experiments NUM and NUM described below it was decided to use the criteria freq NUM ratio NUM and d ignore difficult nouns
it is hoped that this will not matter where we are collecting information from many heads in a particular slot because any mistagging will be outweighed by correct taggings overall
this research was supported by the office of naval research under grant number n00014 NUM NUM NUM
there words are represented in terms of the co occurrence statistics of four letter sequences
however bear in mind that in unsupervised experiments the distribution of senses is not generally known
while the choice of feature set impacts accuracy overall it is only to a small degree
the two closest clusters are merged to form a new cluster that replaces the two merged clusters
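one step of this merge process over a dissimilarity matrix can be sketched as follows. average-link distance is chosen here purely for illustration; the linkage criterion actually used is not stated in this passage, and all names are ours.

```python
# Sketch of one agglomerative step over a dissimilarity matrix
# (average-link chosen for illustration): find the closest pair of
# clusters and replace both with their union.

def merge_closest(clusters, dissim):
    """clusters: list of frozensets of item ids; dissim[i][j]: item distance."""
    best = None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            d = sum(dissim[i][j] for i in clusters[a] for j in clusters[b])
            d /= len(clusters[a]) * len(clusters[b])   # average linkage
            if best is None or d < best[0]:
                best = (d, a, b)
    _, a, b = best
    merged = clusters[a] | clusters[b]
    return [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]

dissim = [[0, 1, 9], [1, 0, 9], [9, 9, 0]]
clusters = [frozenset({0}), frozenset({1}), frozenset({2})]
clusters = merge_closest(clusters, dissim)
assert frozenset({0, 1}) in clusters and len(clusters) == 2
```

iterating this step until a stopping criterion is reached yields the full agglomerative hierarchy.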
the number of sentences available per word is shown as total count in figure NUM
bootstrapping approaches require a small amount of disambiguated text in order to initialize the unsupervised learning algorithm
the m step makes maximum likelihood estimates of the model parameters using the sufficient statistics from the e step
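for a naive-bayes-style mixture the m step described here reduces to normalizing expected counts. the sketch below is our own minimal formulation of that idea, not the paper's code; the parameter layout is assumed.

```python
# Illustrative M-step for a naive-Bayes-style mixture: given expected
# feature counts per sense from the E step, the maximum likelihood
# estimates of P(feature | sense) are just the normalized counts.

def m_step(expected_counts):
    """expected_counts[sense][feature] -> P(feature | sense)."""
    params = {}
    for sense, counts in expected_counts.items():
        total = sum(counts.values())
        params[sense] = {f: c / total for f, c in counts.items()}
    return params

counts = {"s1": {"bank": 3.0, "river": 1.0}}
params = m_step(counts)
assert params["s1"]["bank"] == 0.75
```

the e step would then recompute expected counts under these parameters, and the two steps alternate until the likelihood converges.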
they use co occurrence data gathered from the machine readable version of ldoce to define neighborhoods of related words
the definition in NUM can be understood as follows
word order domains in reape s approach are totally ordered sequences
to motivate our approach we start with an example on scrambling in german subordinate clauses
the constraint x ∃f y states that y is one of the f values of x
the telecommunications advancement organization tao of japan with the support of the ministry of posts and telecommunications has initiated a project in which electronically available text of tv news programs is summarized and synchronized with the speech and video automatically then superimposed on the original programs for the benefit of the hearing impaired people in japan
as we describe later tv news text is different from newspaper articles in that it does not have obvious structures i.e. the tv news text has fewer sentences and usually only one paragraph without titles or headlines
its annual budget is NUM million yen
this assumption is based on the success of the naive bayes model when applied to supervised word sense disambiguation e.g.
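the naive bayes scoring itself can be sketched as follows. the smoothing scheme and feature set below are illustrative choices, not the model described in the text: pick the sense maximizing log prior plus summed log conditional feature probabilities.

```python
# Hedged naive Bayes sense scorer (add-one smoothing is our choice, not
# necessarily the paper's): choose the sense s maximizing
# log P(s) + sum_f log P(f | s) over the observed context features.
import math

def nb_classify(features, priors, cond, alpha=1.0):
    vocab = {f for c in cond.values() for f in c}
    best, best_score = None, float("-inf")
    for sense, prior in priors.items():
        total = sum(cond[sense].values()) + alpha * len(vocab)
        score = math.log(prior)
        for f in features:
            score += math.log((cond[sense].get(f, 0) + alpha) / total)
        if score > best_score:
            best, best_score = sense, score
    return best

priors = {"money": 0.5, "river": 0.5}
cond = {"money": {"loan": 5, "cash": 3}, "river": {"water": 6, "bank": 2}}
assert nb_classify(["loan", "cash"], priors, cond) == "money"
```

in the supervised setting the counts come from labeled data; in the unsupervised setting discussed here they would be the expected counts produced by EM.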
the following table lists the correspondences
figure NUM shows the overall taxonomy of situation types
i thank two anonymous reviewers for comments on this paper
it thus relates a causative and a non causative form
bill covered the ground with snow
sally sprayed paint onto the wall
figure NUM dependency of extension rules
here of the water is an optional constituent
