LinguistHelp

"We can only see a short distance ahead, but we can see plenty there that needs to be done" (Alan Turing)

Software and tools

Paraphrasing System for English Language

Download application
Language: English
Supported systems: Windows XP/Vista/7

The Idea...

Some time ago, I got an offer to write software that deals with sentence rephrasing. A nice way to build a plagiarism system. It's a nice idea to fool the teachers. Think of it! You take a document or term paper, paste each sentence into the software and, after the analysis, you get a completely new sentence with the same meaning but a different structure (without citing the sources).

Of course, this is not a bulletproof anti-plagiarism system, since, in my opinion, the idea of plagiarism (one might think it's bad, but depending on the situation you are in...) involves two aspects:

1) Plagiarism at a structural level: copying sentences or chunks of text from one paper to another (the Ctrl+C method);

2) Plagiarism at the level of ideas: copying the idea or ideas from a paper and rewriting them so that the structure of the new paper is different.

Of course, there are many forms of plagiarism, but these two probably have the highest frequency in students' papers. I don't know if this was the intended use of the software, but it is what came to my mind first. At first, I thought the task would be pretty difficult to accomplish, so I started breaking the problem down into smaller components/algorithms.

The mechanics of the application...

The first thing to notice is that we are dealing with natural language. Natural language processing is the object of study of Computational Linguistics (CL). CL contains several subfields, such as machine translation, automatic summarization, speech recognition, question answering, automatic text classification etc.

Why is natural language processing so difficult?

1. Let’s take the following sentences:

The race is tomorrow.

I will race tomorrow.

In the first sentence the word 'race' is a noun, while in the second one, it is a verb. The computer will have to correctly identify the POS (part of speech) for each word.

Why is the POS so important?

The POS is important if you want to build the sentence grammar, identify the correct sense of a word in a sentence, get the correct synonyms and antonyms of that word etc.

2. Let’s take the following sentences:

I will go to the bank tomorrow.

On the bank of this river there is a boat.

In the first sentence the word 'bank' refers to 'financial institution, financial organization', while in the second one it refers to 'sloping land (especially the slope beside a body of water)'. A human reader can easily identify the meaning of the word 'bank' in the two sentences. The computer, however, must apply a word sense disambiguation (WSD) algorithm in order to find the correct meaning. If the word 'bank' is not disambiguated, its synonyms, related words and antonyms will not be chosen correctly.
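To make the idea of word sense disambiguation more concrete, here is a minimal sketch based on gloss overlap (a simplified Lesk approach). This is an assumption for illustration only; the page does not say which WSD algorithm the application uses. The `SENSES` glosses, the `STOPWORDS` list and the function names are all hypothetical.

```python
# Toy sense inventory with hypothetical glosses (not the application's database).
SENSES = {
    "bank": {
        "financial institution": "institution that accepts money deposits and gives loans",
        "sloping land": "sloping land beside a river or other body of water",
    }
}

# Small hypothetical stopword list so that function words do not count as overlap.
STOPWORDS = {"the", "of", "a", "an", "to", "on", "is", "there", "this", "i",
             "will", "that", "and", "or", "other", "beside"}


def tokenize(text):
    """Lowercase the text, strip punctuation and drop stopwords."""
    words = (w.strip(".,;:!?'\"").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def disambiguate(word, sentence):
    """Pick the sense whose gloss shares the most words with the sentence.

    Ties (including zero overlap) fall back to the first-listed sense.
    """
    context = tokenize(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & tokenize(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("bank", "On the bank of this river there is a boat."))
print(disambiguate("bank", "I will deposit money at the bank tomorrow."))
```

Note that a sentence such as 'I will go to the bank tomorrow.' shares no content words with either gloss, so this sketch simply falls back to the first-listed sense. A real system needs richer context or sense frequencies to decide such cases.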

3. Language is continuously changing and new words appear every day. It is impossible for any dictionary to contain every word that appears each day.

These are just a few of the problems the application had to overcome. I will elaborate on others as the presentation continues.

The application database is built from free English dictionaries. The database we created contains approximately 100,000 words. The verbs and nouns have synonyms and related words that are organized according to word sense.

Let’s take the following sentence:

The cars are prepared for the race.

The nouns are 'cars' and 'race', and the verb is 'are prepared'. If we look up the word 'cars' in a dictionary (to extract its synonyms and related words), we will see that the word is not there. Why? Because 'cars' is the plural form of 'car'. If we search for 'car', we will find it. Thus, we need an algorithm to reduce a word to its base form (lemma). This process is called lemmatization. Lemmatization does not apply to nouns only, but also to verbs, adverbs etc.; in fact, it applies to any word. Starting from the Porter stemming algorithm, we developed a lemmatization algorithm.
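The difference between plain suffix stripping and lemmatization can be sketched as follows. This is not the Porter-derived algorithm used in the application, only a minimal illustration under stated assumptions: `LEXICON`, `IRREGULAR` and `SUFFIX_RULES` are hypothetical and far from complete. The key point is that a candidate base form is accepted only if it actually exists in the dictionary.

```python
# Hypothetical miniature lexicon standing in for the application's database.
LEXICON = {"car", "race", "prepare", "be", "city", "box", "go"}

# Irregular forms cannot be derived by suffix rules, so they are listed explicitly.
IRREGULAR = {"are": "be", "is": "be", "was": "be", "were": "be", "went": "go"}

# (suffix, replacement) pairs tried in order; illustrative only.
SUFFIX_RULES = [
    ("ies", "y"), ("es", ""), ("s", ""),
    ("ed", "e"), ("ed", ""), ("ing", "e"), ("ing", ""),
]


def lemmatize(word):
    """Return the dictionary base form of `word`, or the word itself if unknown."""
    word = word.lower()
    if word in LEXICON:
        return word
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            # Unlike a pure stemmer, accept the candidate only if it is a real entry.
            if candidate in LEXICON:
                return candidate
    return word


for w in ["cars", "prepared", "are", "race"]:
    print(w, "->", lemmatize(w))
```

Checking each candidate against the lexicon is what keeps 'prepared' from being reduced to a non-word such as 'prepar', which a bare stemmer would happily produce.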

As you can see the word 'race' is ambiguous here. It can either be a noun or a verb. At this point the computer does not know how to interpret it. In order to correctly decide if 'race' is a noun or verb we had to implement a POS (part of speech) tagging algorithm.

What does part of speech tagging mean?

POS tagging means that each word is tagged with the appropriate part of speech. For example:

The cars are prepared for the race.

After POS tagging we get:

The/A cars/N are prepared/VB for/I the/A race/N

(where A=article, N=noun, VB=verb, I=preposition)

To solve this problem, we implemented a POS tagging algorithm based on Brill's ideas. The accuracy of the algorithm is about 90% (out of 100 nouns or verbs, about 10 will be tagged incorrectly). Please take into account that we modified and simplified the tagging operation: we only tag nouns, verbs, articles, prepositions, adjectives and adverbs. Also keep in mind that, to my knowledge at least, no algorithm can correctly tag every sentence a user might input.
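The general shape of a Brill-style tagger can be sketched in two steps: first give every word its most frequent tag, then apply transformation rules that correct the tag in a given context. The sketch below is only an assumption about that general shape, not the application's actual tagger. The lexicon, the single rule, the `ADV` abbreviation for adverbs and the variation 'The cars will race tomorrow.' are hypothetical, and 'are' and 'prepared' are tagged separately here for simplicity.

```python
# Hypothetical lexicon: each known word maps to its most frequent tag.
# Tags follow the page's scheme (A=article, N=noun, VB=verb, I=preposition);
# ADV for adverbs is an assumed abbreviation.
MOST_FREQUENT_TAG = {
    "the": "A", "cars": "N", "are": "VB", "prepared": "VB", "for": "I",
    "race": "N", "will": "VB", "tomorrow": "ADV",
}

# Transformation rules: (from_tag, to_tag, previous_word).
# "Change N to VB when the previous word is 'will'."
RULES = [("N", "VB", "will")]


def tag(sentence):
    """Tag a sentence: initial most-frequent-tag guess, then contextual fixes."""
    words = sentence.rstrip(".").split()
    # Step 1: initial guess (unknown words default to noun).
    tags = [MOST_FREQUENT_TAG.get(w.lower(), "N") for w in words]
    # Step 2: apply each transformation rule from left to right.
    for from_tag, to_tag, prev_word in RULES:
        for i in range(1, len(words)):
            if tags[i] == from_tag and words[i - 1].lower() == prev_word:
                tags[i] = to_tag
    return list(zip(words, tags))


print(tag("The cars are prepared for the race."))
print(tag("The cars will race tomorrow."))
```

In the first sentence 'race' keeps its initial noun tag, while in the second one the rule retags it as a verb because it follows 'will'. In Brill's approach such rules are learned automatically from a tagged corpus rather than written by hand.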

Another major problem with POS tagging is unknown words (words that are not in the dictionary or, in our case, in the database). Since such words cannot be looked up, the algorithm tries to guess the grammatical category (noun, verb, preposition etc.) of the new word by assigning the most likely tag according to the words that surround it. For example:

The baaaaaah is very beautiful.

The word 'baaaaaah' is unknown, so the algorithm will try to guess its grammatical category. Since the word is preceded by the article 'the', the algorithm assumes it is a noun.
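A minimal sketch of this kind of guessing might look like the following. The `KNOWN_TAGS` table, the `GUESS_AFTER` context rules and the `ADV`/`ADJ` abbreviations are hypothetical and only illustrate the principle of using the preceding word's tag.

```python
# Hypothetical table of known words and their tags.
KNOWN_TAGS = {"the": "A", "is": "VB", "very": "ADV", "beautiful": "ADJ"}

# Hypothetical context rules: tag of the previous word -> guess for an unknown word.
GUESS_AFTER = {"A": "N", "ADV": "ADJ"}


def tag_with_guessing(sentence):
    """Tag known words from the table and guess unknown ones from the left context."""
    words = sentence.rstrip(".").split()
    tags = []
    for word in words:
        tag = KNOWN_TAGS.get(word.lower())
        if tag is None:
            previous = tags[-1] if tags else None
            # Fall back to noun when no context rule applies.
            tag = GUESS_AFTER.get(previous, "N")
        tags.append(tag)
    return list(zip(words, tags))


print(tag_with_guessing("The baaaaaah is very beautiful."))
```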

After the user inputs a sentence, the application checks the nouns, verbs, adverbs and conjunctions and extracts their synonyms and related words from the database according to the disambiguated sense. Then it checks the consistency of the sentence and applies some sentence transformations. For example:

The cars will participate in the race.

Plural transformation

The word 'cars' is in the plural form; this means that its synonyms and related words must be in the plural form as well.

The autos will participate in the race.

The automobiles will take part in the race.

The machines will take part in the race.

The motorcars will participate in the race.

etc.
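The plural transformation above can be sketched like this, assuming the lemma of the noun is already known. The `SYNONYMS` entry and the function names are hypothetical, and `pluralize` covers only regular English plurals.

```python
# Hypothetical synonym entry for the disambiguated sense of 'car'.
SYNONYMS = {"car": ["auto", "automobile", "machine", "motorcar"]}


def pluralize(noun):
    """Apply regular plural rules; irregular nouns would need a lookup table."""
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"


def paraphrase_noun(sentence, noun, lemma):
    """Replace `noun` with each synonym of `lemma`, keeping the original number."""
    plural = noun != lemma  # the surface form differs from the lemma
    results = []
    for synonym in SYNONYMS[lemma]:
        replacement = pluralize(synonym) if plural else synonym
        # Naive substring replacement; enough for this single-noun example.
        results.append(sentence.replace(noun, replacement))
    return results


for s in paraphrase_noun("The cars will participate in the race.", "cars", "car"):
    print(s)
```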

The same thing applies to verbs. The application scans the verb form, and the synonyms and related words are put into the same form.

Article transformation


I bought a car last week.

The synonyms and related words of 'car' will be preceded by the correct article according to English grammar.

I bought a machine last week.

I bought an auto last week.
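A minimal sketch of the article transformation, assuming the choice between 'a' and 'an' can be approximated from the first letter of the synonym. The function names are hypothetical, and sound-based exceptions such as 'an hour' or 'a university' would need an exception list.

```python
def indefinite_article(word):
    """Choose 'a' or 'an' from the first letter (a spelling-based approximation)."""
    return "an" if word[0].lower() in "aeiou" else "a"


def swap_noun(sentence, noun, synonym):
    """Replace `noun` with `synonym` and fix a preceding indefinite article."""
    words = sentence.split()
    for i, word in enumerate(words):
        if word.strip(".,;:!?") == noun:
            words[i] = word.replace(noun, synonym)  # keeps attached punctuation
            if i > 0 and words[i - 1].lower() in ("a", "an"):
                article = indefinite_article(synonym)
                if words[i - 1][0].isupper():
                    article = article.capitalize()
                words[i - 1] = article
    return " ".join(words)


print(swap_noun("I bought a car last week.", "car", "machine"))
print(swap_noun("I bought a car last week.", "car", "auto"))
```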

Tense transformation

The synonyms and related words of 'bought' will have the correct tense, according to English grammar.

I bought a machine last week.

I purchased an auto last week.
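The tense transformation can be sketched in the same spirit: reduce the verb to its lemma, look up the synonyms, and put each synonym back into the original tense. All tables below are hypothetical (the synonyms 'acquire' and 'get' are added only for illustration), and `past_tense` handles regular verbs only in a simplified way (no consonant doubling, as in 'stop' to 'stopped').

```python
# Hypothetical lookup tables standing in for the application's database.
IRREGULAR_PAST = {"buy": "bought", "get": "got", "take": "took"}
PAST_TO_LEMMA = {"bought": "buy"}
VERB_SYNONYMS = {"buy": ["purchase", "acquire", "get"]}


def past_tense(verb):
    """Form the simple past: irregular lookup first, then regular suffix rules."""
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    if verb.endswith("e"):
        return verb + "d"
    if verb.endswith("y") and len(verb) > 1 and verb[-2] not in "aeiou":
        return verb[:-1] + "ied"
    return verb + "ed"


def past_tense_synonyms(past_form):
    """Return the synonyms of a past-tense verb, inflected for the past tense."""
    lemma = PAST_TO_LEMMA[past_form]
    return [past_tense(synonym) for synonym in VERB_SYNONYMS[lemma]]


print(past_tense_synonyms("bought"))
```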

Proper name recognition

I talked to John yesterday.

Phrasal verbs/idiomatic expressions recognition

The paraphrasing algorithm also takes into account the use of phrasal verbs. Take, for instance, the following sentence:

The oil prices were cut down yesterday.

As you can see, 'cut down' has a meaning of its own. It cannot be broken down into the meanings of 'cut' and 'down'. Thus, the related words and synonyms have to take the form of phrasal verbs as well.

The oil prices were cut back yesterday.
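One way to sketch this is to treat a verb and its particle as a single unit when looking up synonyms. The `PHRASAL_VERBS` lexicon and the function names are hypothetical. The sketch also only matches adjacent verb and particle pairs and skips lemmatization, which works here because 'cut' is identical in its base and participle forms.

```python
# Hypothetical phrasal-verb lexicon: (verb, particle) -> phrasal synonyms.
PHRASAL_VERBS = {("cut", "down"): [("cut", "back")]}


def find_phrasal_verbs(tokens):
    """Return (index, verb, particle) for each adjacent pair found in the lexicon."""
    matches = []
    for i in range(len(tokens) - 1):
        pair = (tokens[i].lower(), tokens[i + 1].lower())
        if pair in PHRASAL_VERBS:
            matches.append((i, pair[0], pair[1]))
    return matches


def paraphrase_phrasal(sentence):
    """Replace each recognized phrasal verb with its phrasal synonyms."""
    tokens = sentence.rstrip(".").split()
    results = []
    for i, verb, particle in find_phrasal_verbs(tokens):
        for new_verb, new_particle in PHRASAL_VERBS[(verb, particle)]:
            new_tokens = tokens[:i] + [new_verb, new_particle] + tokens[i + 2:]
            results.append(" ".join(new_tokens) + ".")
    return results


print(paraphrase_phrasal("The oil prices were cut down yesterday."))
```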

Rule-based Paraphrasing

In addition to synonyms and related words, we implemented an algorithm using heuristic rules to paraphrase English sentences.

If you have any questions, comments or need some help, please feel free to contact me.

Application preview...