Frequently Asked Questions
Answers to frequently asked questions regarding the use of
the xAIgent RESTful service.
1. What is the meaning of the numbers associated
with each keyphrase?
The numbers indicate the score of a phrase which is an
estimate of its value as a keyphrase. Keyphrases are
ranked in order of descending score. A score can be any
positive real number. The scores with long documents as
input tend to be higher than the scores with short
documents. For some applications, it might be desirable
to normalize the score.
2. How can I normalize the score?
For some applications, it might be desirable to
normalize the score, so that the scores of keyphrases
from different documents can be compared.
Here are some suggestions for normalization:
a) Ignore the scores produced by xAIgent. Given a
large collection of documents (e.g., web pages), score
each keyphrase by the percentage of documents for which
the given keyphrase was suggested by the xAIgent.
(Example: "The keyphrase 'corporate merger' was
generated for 45 of the 100 documents. Thus 'corporate
merger' has a score of 45%.")
b) Take the score produced by the xAIgent and
normalize it so that it ranges from 0% to 100%, by
dividing the score of each keyphrase by the score of the
first keyphrase. (The first keyphrase always has the
highest score.) (Example: xAIgent suggests three
phrases: 'corporate merger' with a score of 50, 'stocks'
with a score of 30, and 'bonds' with a score of 10. The
normalized scores are 100%, 60%, and 20%, respectively.)
c) Longer documents often seem to have better
keyphrases than shorter documents. The problem with
suggestion (b) is that it ignores the document length.
One possibility would be to multiply the normalized
score of (b) by (say) the logarithm of the length of the
document (measured in number of words or in bytes).
Another possibility would be to sort the document
collection by length and increase the keyphrase scores of
each document according to the length percentile in which
it appears.
(Example: "The keyphrase 'corporate merger' appears in
document #345. The keyphrase has a normalized score of
60%. However, since document #345 is in the top 25% of
documents in the collection, according to
length, we will boost the score of 'corporate merger' by
20 percentage points, for an adjusted score of 80%.")
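The suggestions above can be sketched in a few lines of Python. This is only an illustration, not part of the xAIgent service: the function names and the input shapes (a list of keyphrase lists per document, or a list of (phrase, score) pairs) are assumptions made for the example. For suggestion (c), only the logarithm variant is shown.

```python
import math
from collections import Counter


def document_frequency_scores(keyphrases_per_doc):
    """Suggestion (a): percentage of documents for which each phrase was suggested."""
    n_docs = len(keyphrases_per_doc)
    if n_docs == 0:
        return {}
    counts = Counter()
    for phrases in keyphrases_per_doc:
        counts.update(set(phrases))  # count a phrase at most once per document
    return {phrase: 100.0 * c / n_docs for phrase, c in counts.items()}


def relative_scores(scored):
    """Suggestion (b): divide every score by the highest score in the list.

    `scored` is a list of (phrase, score) pairs with positive scores.
    """
    if not scored:
        return []
    top = max(score for _, score in scored)
    return [(phrase, 100.0 * score / top) for phrase, score in scored]


def length_weighted_scores(scored, n_words):
    """Suggestion (c), logarithm variant: scale (b) by log(document length)."""
    weight = math.log(n_words)
    return [(phrase, score * weight) for phrase, score in relative_scores(scored)]


example = [("corporate merger", 50), ("stocks", 30), ("bonds", 10)]
print(relative_scores(example))
```

Note that the length-weighted scores no longer fall between 0% and 100%; they are only meaningful for comparing keyphrases across documents.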
3. Given a sentence such as, "I am not skiing
today," why does the xAIgent select "skiing" as a
keyphrase instead of "not skiing"?
The intention of the xAIgent is to capture the main
topics that are discussed in the input document. xAIgent
does not attempt to convey exactly how these topics are
discussed. For example, if a document discusses legal
issues concerning guns, the xAIgent might suggest the
keyphrase "gun law". This keyphrase does not indicate
whether the document supports strict legal control of
guns or opposes any government involvement in gun
control. The design of the xAIgent was based on a study
of how authors use keyphrases. We have examined several
thousand documents with keyphrases supplied by their
authors. None of the keyphrases we have seen so far
include the word "not".
4. I want to use the xAIgent for automatic
document classification. Can you help me?
Automatic document classification is the use of software
to sort documents into various pre-defined categories. A
similar task is automatic document clustering, in which
there are no pre-defined categories, so the software
must create the categories by itself. If you want to
learn more about automatic document classification and
clustering, there is a hypertext Bibliography on Machine
Learning Applied to Text. xAIgent can be used to
generate features for use in feature vectors for machine
learning algorithms. (If you are not familiar with this
terminology, it should become clear to you as you read
the papers in the bibliography.) If you wish to use the
xAIgent to generate feature vectors, we suggest the
following approach:
a) Apply the xAIgent to all of the documents in
your sample collection.
b) Take the union of all of the extracted
keyphrases as the feature set.
c) For each document and each feature, let the
value of the feature be the number of times that the
given phrase occurs in the given document (regardless of
whether the xAIgent extracted it from the given
document).
d) Apply your favourite machine learning algorithm
(e.g., decision tree induction, neural network, genetic
algorithm, etc.) to the resulting feature vectors.
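Steps (a) to (c) can be sketched as follows. This is a minimal illustration under stated assumptions: the keyphrases are taken as already extracted (one list per document), `build_feature_vectors` is a hypothetical helper name, and matching is case-insensitive on whole words, which the approach above does not specify.

```python
import re


def count_occurrences(text, phrase):
    """Count non-overlapping whole-word occurrences of `phrase` in `text`."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text))


def build_feature_vectors(documents, keyphrases_per_doc):
    """Return (features, vectors) for a collection of documents.

    documents          -- list of document texts
    keyphrases_per_doc -- list of keyphrase lists, one per document (step a)
    """
    # Step (b): the feature set is the union of all extracted keyphrases.
    features = sorted({p.lower() for phrases in keyphrases_per_doc for p in phrases})

    # Step (c): each feature value is the number of times the phrase occurs
    # in the document, whether or not it was extracted from that document.
    vectors = []
    for text in documents:
        lowered = text.lower()
        vectors.append([count_occurrences(lowered, f) for f in features])
    return features, vectors


docs = ["Gun law debate. The gun law passed.", "Stocks and bonds. Stocks rose."]
phrases = [["gun law"], ["stocks", "bonds"]]
features, vectors = build_feature_vectors(docs, phrases)
print(features)
print(vectors)
```

The resulting rows can be passed to whichever learning algorithm you choose for step (d).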
5. How can I combine keyphrases that were
extracted from many different documents?
For some applications, you may wish to have a list of
keyphrases that covers a whole collection of documents,
where each document has been processed individually by
the xAIgent. If you have no constraints on the size of
the list of keyphrases, you might simply take the union
of all of the phrases as your combined list. To reduce
the size of the list slightly, you might merge phrases that
share the same stem (e.g., "automobile" and
"automobiles"). If you want to substantially reduce the
size of the list, then you can assign a normalized score
to each keyphrase and select the keyphrases with the
highest normalized scores.
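A rough sketch of this combination step is shown below. It assumes each document's keyphrases have already been given normalized scores, and it uses a deliberately crude plural-stripping function as a stand-in for a real stemmer. The names `crude_stem` and `combine_keyphrases`, and the choice to keep the highest-scoring variant of each stem, are assumptions of this example.

```python
def crude_stem(phrase):
    """Very rough stand-in for a stemmer: lowercase and strip a trailing 's'."""
    words = []
    for word in phrase.lower().split():
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        words.append(word)
    return " ".join(words)


def combine_keyphrases(normalized_per_doc, limit=None):
    """Merge per-document (phrase, normalized_score) lists into one ranked list.

    Phrases with the same stem are merged, keeping the highest-scoring variant.
    If `limit` is given, only the top `limit` phrases are returned.
    """
    best = {}  # stem -> (phrase, score)
    for scored in normalized_per_doc:
        for phrase, score in scored:
            key = crude_stem(phrase)
            if key not in best or score > best[key][1]:
                best[key] = (phrase, score)
    ranked = sorted(best.values(), key=lambda item: item[1], reverse=True)
    return ranked if limit is None else ranked[:limit]


per_doc = [
    [("automobile", 100.0), ("engine", 40.0)],
    [("automobiles", 80.0), ("stocks", 90.0)],
]
print(combine_keyphrases(per_doc, limit=2))
```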
6. Can the xAIgent handle language X?
The xAIgent currently works with monolingual documents
in English, French, Japanese, German, Spanish, or
Korean.
7. Can the xAIgent handle character encoding X?
The xAIgent currently supports ISO-8859-1 for English,
French, German and Spanish. ISO-8859-1 is also known as
ISO Latin-1. The xAIgent currently supports Unicode UCS-2
for Japanese and Korean.
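If your text is held as Unicode strings, it can be converted to these encodings before submission. The sketch below is an assumption-laden illustration: the helper name `encode_for_xaigent` is hypothetical, and the little-endian byte order chosen for UCS-2 is a guess, since the byte order expected by the service is not stated here. Python has no codec named UCS-2, so the example uses UTF-16 and rejects characters outside the Basic Multilingual Plane, for which the two encodings differ.

```python
LATIN1_LANGUAGES = {"english", "french", "german", "spanish"}
UCS2_LANGUAGES = {"japanese", "korean"}


def encode_for_xaigent(text, language):
    """Encode `text` in the character encoding listed for `language`."""
    lang = language.lower()
    if lang in LATIN1_LANGUAGES:
        # Raises UnicodeEncodeError if the text contains non-Latin-1 characters.
        return text.encode("iso-8859-1")
    if lang in UCS2_LANGUAGES:
        # UCS-2 cannot represent code points above U+FFFF.
        if any(ord(ch) > 0xFFFF for ch in text):
            raise ValueError("text contains characters outside UCS-2")
        # Assumption: little-endian byte order, no byte-order mark.
        return text.encode("utf-16-le")
    raise ValueError(f"unsupported language: {language}")


print(encode_for_xaigent("caf\u00e9", "French"))
```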
8. How can I generate 100 keyphrases?
The xAIgent currently allows the user to specify from 3
to 30 keyphrases. For some applications, you may wish to
have more keyphrases. One solution is to break the
document into smaller sections and pass each section to
the xAIgent. Suppose we gave you a book and asked you to
give us a list of key phrases that capture the main
topics of the book. When your list approached 30 key
phrases, we think you would be struggling to think of more
key phrases. It seems likely that there are fewer than 30
"main topics" for most books. Perhaps an average book
only has 10 or 15 "main topics", but you could cover
each topic with 2 or 3 synonymous key phrases, to yield
a total of about 30 key phrases. On the other hand, if
we took any single chapter from the same book, and asked
you to give us a list of key phrases that capture the
main topics of the chapter, we think the list would be
approximately the same size as the list you would give
us for the whole book. A key phrase that captures the
"main topic" of the chapter might only capture a "minor
topic" of the whole book. So the union of the keyphrases
for each chapter would be a superset of the keyphrases
for the whole book. This is why the xAIgent has a
maximum of 30 key phrases per "chunk". If you want more
key phrases, then you can break the document into
smaller "chunks" and take the union of the key phrases
for each individual "chunk". We believe that this
strategy will produce a superior list to the strategy of
treating the document as a single, homogeneous whole.
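The chunking strategy can be sketched as follows. The `extract` argument is a hypothetical callable standing in for a call to the xAIgent (it takes a text and a keyphrase count and returns a list of phrases); it is not an actual xAIgent function. Splitting by a fixed word count is also an assumption of this example. Splitting on chapter or section boundaries, where they are known, is closer to the reasoning above.

```python
def split_into_chunks(text, max_words=2000):
    """Split `text` into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def keyphrases_for_long_document(text, extract, max_words=2000, per_chunk=30):
    """Union of the keyphrases of each chunk, in order of first appearance.

    `extract(chunk_text, count)` is a placeholder for the keyphrase service.
    """
    seen = set()
    combined = []
    for chunk in split_into_chunks(text, max_words):
        for phrase in extract(chunk, per_chunk):
            key = phrase.lower()
            if key not in seen:  # skip phrases already produced by another chunk
                seen.add(key)
                combined.append(phrase)
    return combined


def fake_extract(chunk_text, count):
    """Toy extractor for demonstration only: the first `count` words."""
    return chunk_text.split()[:count]


print(keyphrases_for_long_document(
    "alpha beta gamma delta alpha epsilon", fake_extract, max_words=3, per_chunk=2
))
```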
9. When I give a document to the xAIgent and ask
for four keyphrases and then take the same document and
ask for seven keyphrases, the four keyphrases are not
always a subset of the seven keyphrases. Why?
This is explained in detail in Learning to Extract
Keyphrases from Text. If it is important for your
application that the four keyphrases that you get when
you ask for four keyphrases should be the same as the
first four keyphrases that you get when you ask for
seven keyphrases, then ask for seven keyphrases but only
take the first four. In general, if you currently want M
keyphrases but you might eventually want N keyphrases
(where N > M), then ask the xAIgent for N keyphrases,
but only take the first M keyphrases. Better yet, store
all N keyphrases, so you can later look up the remaining
N - M keyphrases instead of running the xAIgent twice.
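A small sketch of this "ask for N, keep M" approach follows. As before, `extract` is a hypothetical callable standing in for a request to the xAIgent, and the in-memory dictionary is just one assumed way of storing the results.

```python
def make_cached_extractor(extract, n=30):
    """Return a function that always requests `n` keyphrases and caches them.

    `extract(text, count)` is a placeholder for the keyphrase service.
    """
    cache = {}  # doc_id -> list of n keyphrases, in ranked order

    def top_keyphrases(doc_id, text, m):
        if not 1 <= m <= n:
            raise ValueError(f"m must be between 1 and {n}")
        if doc_id not in cache:
            cache[doc_id] = list(extract(text, n))  # one request per document
        return cache[doc_id][:m]

    return top_keyphrases


def fake_extract(text, count):
    """Toy extractor for demonstration only."""
    return ["a", "b", "c", "d", "e", "f", "g"][:count]


top = make_cached_extractor(fake_extract, n=7)
print(top("doc-1", "some text", 4))
print(top("doc-1", "some text", 7))
```

Because both calls read from the same stored list, the four keyphrases are always the first four of the seven.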
10. In our documents, we have phrases with four
and more words. What does the xAIgent do? Is there a
limit to the number of words in a keyphrase?
The xAIgent is designed to extract key phrases with one,
two, or three words. In a study of thousands of
documents with key phrases supplied by the authors, we
found that authors only create key phrases with four or
more words about 5% of the time. When we try to include
phrases with four or more words, we can cover a few more
of the authors' key phrases, but we also introduce a few
more errors. Since there is a net loss, the xAIgent does
not attempt to cover these longer phrases. In order to
capture these longer phrases, you might try
inspecting the key phrases relative to each other. If
the xAIgent outputs a phrase of the form "A B C" and a
phrase of the form "B C D", then you can conjecture that
these are parts of a longer phrase "A B C D", and join
them together. For example, "National Research Council"
and "Research Council Canada" would be joined to make
"National Research Council Canada".
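The joining heuristic from question 10 can be sketched as below. The function name `join_overlapping` is an assumption of this example, and the comparison is case-insensitive, which the answer does not specify. Only pairs of three-word phrases of the form "A B C" and "B C D" are joined.

```python
def join_overlapping(phrases):
    """Conjecture four-word phrases from overlapping three-word keyphrases.

    If one phrase has the form "A B C" and another the form "B C D",
    the candidate "A B C D" is produced.
    """
    split = [p.split() for p in phrases]
    joined = []
    for left in split:
        for right in split:
            if len(left) != 3 or len(right) != 3 or left is right:
                continue
            # The last two words of `left` must equal the first two of `right`.
            if [w.lower() for w in left[1:]] == [w.lower() for w in right[:2]]:
                candidate = " ".join(left + right[2:])
                if candidate not in joined:
                    joined.append(candidate)
    return joined


print(join_overlapping(
    ["National Research Council", "Research Council Canada", "stocks"]
))
```

Since the result is only a conjecture, it may be worth checking that each joined phrase actually occurs in the document before using it.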
Yesterday will be the last day you re-read a document to
enter it into an Enterprise Content Management System.
Feature Highlights:
• Automatically Create Contextually Accurate Tags for a Document
• Automatically Create Contextually Accurate Tags for a Document Folder
• Automatically Add Document Meta Data Tags
• Subject Domain Agnostic - additional training NOT required
• Automatic Processing - supervision is NOT required
• Document Tags are presented by Weighted Importance
• Process Reporting ...
• Tags Generated
• Documents Processed
• Document Location
• Examine Comparative Document Tags
• Find Documents with 'These' Tags
• Multi Language Support ...
• English
• French
• German
• Japanese
• Korean
• Spanish
• Contextual Key Terms Automatically Extracted for Each Document
Turn long lost content back into the valuable resource it once was