"We can only see a short distance ahead, but we can see plenty there that needs to be done" (Alan Turing)
Software and tools
Summarization System for English Language
Imagine that you don't have time to read through 10 business reports, each 40 pages long, before your meeting in 20 minutes. What you need is a summarization tool: something that reduces the volume of reading to a few pages while not losing precious pieces of information. This is called information compression, and it is one of the features of an automatic summarization system.
I gave you a small example of when you need to use such a tool, but the applications, as you can see, are much more varied and enticing.
How can you build such a system? How can you make the computer understand which pieces of information it should give you and which it should omit? There are many approaches to automatic summarization, of which I would mention two:
a) the cut-and-paste method
b) abstraction method
The software presented here as a downloadable application uses the first method. With this method, the computer evaluates the information contained in the document sentence by sentence, using some sort of ranking system. It then extracts the most important pieces of information (sentences/paragraphs) from the document and builds the summary.
The first thing you should know is that such a system incorporates other components/tools for natural language processing. Some of them were discussed in other sections/applications of this website (e.g. Sentence splitter, POS tagging, etc.)
The components of the summarization system presented here are:
1. Sentence splitter
It is very important to cut the text into sentences and paragraphs in order to rank them afterward and decide which sentences are more important and should be included in the summary. Read the description of the Sentence Splitter application for more details. One aspect of summarization is that sentences that are very short tend to be excluded from the document summary.
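A minimal sketch of such a splitter, assuming a simple regex-based approach (the actual Sentence Splitter application may work differently), could look like this:

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break on '.', '!' or '?' followed by
    whitespace and a capital letter. A real splitter must also handle
    abbreviations, decimal numbers, quotations, etc."""
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
    return [s.strip() for s in parts if s.strip()]
```

For example, `split_sentences("I went home. It was late.")` yields the two sentences separately, which can then be ranked individually.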
2. Tokenization
This process breaks the text into words, phrases, and symbols in order to analyze it further. During this process, lemmatization is applied, meaning that each word is reduced to a basic form called a lemma. For example, a word like 'cars' is reduced to 'car'.
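A toy illustration of tokenization with suffix-stripping lemmatization, assuming simple rules for English plurals (a production system would use a dictionary-based lemmatizer):

```python
import re

def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens,
    dropping punctuation."""
    return re.findall(r"[a-z']+", sentence.lower())

def lemmatize(token):
    """Toy suffix-stripping lemmatizer (illustrative only):
    'cars' -> 'car', 'cities' -> 'city'."""
    if token.endswith('ies') and len(token) > 4:
        return token[:-3] + 'y'
    if token.endswith('s') and not token.endswith('ss'):
        return token[:-1]
    return token
```

With these two steps, 'cars' and 'car' are counted as the same word when keyword frequencies are computed later.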
3. POS tagging
In this stage, each word is tagged according to the part of speech. For instance a sentence like:
'I go to the cinema everyday' will be tagged as Pronoun/Verb/Preposition/Article/Noun/Adverb
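The idea can be sketched with a tiny hand-made lexicon; this is only an illustration of the lookup step, since real taggers use statistical models trained on corpora, and the fallback-to-Noun heuristic is an assumption:

```python
# Hypothetical mini-lexicon covering just the example sentence.
LEXICON = {
    "i": "Pronoun", "go": "Verb", "to": "Preposition",
    "the": "Article", "cinema": "Noun", "everyday": "Adverb",
}

def pos_tag(tokens):
    """Look each token up in the lexicon; unknown words default
    to 'Noun', a common fallback heuristic."""
    return [(t, LEXICON.get(t.lower(), "Noun")) for t in tokens]
```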
The two previous components, tokenization and POS tagging, are important in the keyword extraction stage. One of the most important ways to evaluate sentence importance is by the number and importance of the keywords it contains.
4. Keyword identification and extraction
Keywords are considered to be those words that have significant impact on the information presented in the document. There are several techniques for keyword extraction and most of them are based on the following rules:
a) Keywords have to be nouns or verbs
Only these two categories give meaning to the text. For instance, take a sentence like
I bought a green car yesterday.
One can notice that words like 'I', 'a', and 'yesterday' are not important for transmitting the message. What matters are 'car' and 'bought': the object and the action performed on it. Thus the automatic summarization system will take only nouns and verbs as possible keyword candidates.
b) Keywords frequency
The more frequent a keyword is, the more likely the sentences that contain it are to be included in the summary.
c) Keyword co-occurrence
We start from the supposition that the words surrounding a keyword can themselves be considered important and should be included in the keyword list.
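Rules (a) and (b) can be sketched together: filter the tagged tokens down to nouns and verbs, then rank them by frequency. Rule (c), co-occurrence, is omitted here for brevity, and the `top_n` cutoff is an assumption for illustration:

```python
from collections import Counter

def extract_keywords(tagged_tokens, top_n=5):
    """Keep only nouns and verbs (rule a), count their frequency
    (rule b), and return the top_n most frequent as keywords.
    tagged_tokens is a list of (word, tag) pairs."""
    candidates = [w.lower() for w, tag in tagged_tokens
                  if tag in ("Noun", "Verb")]
    return [w for w, _ in Counter(candidates).most_common(top_n)]
```

Sentences can then be scored by how many of these keywords they contain.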
All the components presented above make up the sentence selection component of the summarization system. Sentence selection is done via a set of boolean features (for instance, the length of the sentence). These features F1, F2, F3, ..., Fn are used to construct a Bayesian classifier, following the idea of Kupiec, with modifications to the features and the feature selection.
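In the spirit of Kupiec-style classification, each sentence's score is proportional to the prior probability of inclusion times, for each feature present, the ratio of that feature's probability in summary sentences to its overall probability. A minimal sketch, assuming the probability tables come from training on document/summary pairs (the actual feature set and estimates in this system may differ):

```python
def sentence_score(features, p_feature_given_summary, p_feature, prior):
    """Naive-Bayes-style score:
    P(s in summary | F1..Fn) is proportional to
    P(s in summary) * product over present features of
    P(Fj | s in summary) / P(Fj).
    `features` is a list of booleans, one per feature."""
    score = prior
    for j, present in enumerate(features):
        if present:
            score *= p_feature_given_summary[j] / p_feature[j]
    return score
```

Sentences are then ranked by this score before the summary is assembled.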
These sentence features can be learned by training on a corpus of document/summary pairs or estimated by hand. In this case, some of the features were learned by training and others were estimated. The summarization system schema is presented below:
The summary is constructed by selecting the sentences that received the highest rank values, according to the summary compression value.
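The final cut-and-paste step can be sketched as follows, assuming the compression value is a fraction of the original sentence count (the downloadable application may define it differently):

```python
def build_summary(sentences, scores, compression=0.2):
    """Select the highest-scoring sentences until the summary reaches
    `compression` * original length, then re-emit them in their
    original document order (cut-and-paste extraction)."""
    n_keep = max(1, int(len(sentences) * compression))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:n_keep])  # restore document order
    return [sentences[i] for i in chosen]
```

Re-sorting the chosen indices keeps the extracted sentences in their original order, which is what makes a cut-and-paste summary readable.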
If you have any questions, comments or need some help, please feel free to contact me.