"We can only see a short distance ahead, but we can see plenty there that needs to be done" (Alan Turing)
Software and tools
Sentence Splitter for English...
Like the other tools presented on this website, a sentence splitter is a component of many natural language processing tools and systems. Whenever you want to implement, for instance, an automatic summarization system, a POS tagging tool or a question answering system, a sentence splitter is a must.
Probably the simplest sentence splitter follows this rule: any sequence of text that ends in one of the markers [!?.] is considered a sentence. For instance:
I go to the cinema every day. This is great! Should I go to the cinema tomorrow? This was a nice example.
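The simple rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name naive_split is made up for this example, and the sketch assumes that every marker is followed by whitespace.

```python
import re

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
# The lookbehind keeps the marker attached to the sentence it closes.
BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def naive_split(text):
    """Split text after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in BOUNDARY.split(text.strip()) if s]

text = ("I go to the cinema every day. This is great! "
        "Should I go to the cinema tomorrow? This was a nice example.")
for sentence in naive_split(text):
    print(sentence)
```

Each of the four sentences is printed on its own line.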
A nice example indeed. Anyone can easily see that there are four sentences. The task is easy here, but the rule is not so accurate on real-world data. Notice that in the text above, each sentence ends with a '.', '?' or '!', the marker is always followed by whitespace, and the next sentence begins with a capital letter. Let's complicate things a little:
Mr. John goes to the cinema every week.
If we were to build a sentence splitter with only the sentence markers above, it would fail to detect the correct sentence boundaries in this example, because it would split the text right after 'Mr.'. The abbreviation 'Mr.' is very troublesome for our application. Thus, before splitting a text into sentences, we have to detect all the abbreviations and eliminate them from the set of candidate boundaries.
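One possible way to do this is to check every token that ends in a marker against a list of known abbreviations. The sketch below is an assumption about how this could look, not a description of the tool itself: the ABBREVIATIONS set is a tiny hypothetical sample, and abbreviations are skipped as boundary candidates rather than removed from the text.

```python
# Hypothetical, deliberately tiny abbreviation list; a real one would be much larger.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "prof."}

def split_with_abbreviations(text):
    """Close a sentence at a token ending in '.', '!' or '?',
    unless that token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a final marker
        sentences.append(" ".join(current))
    return sentences

print(split_with_abbreviations("Mr. John goes to the cinema every week. He likes it."))
```

With this check, 'Mr.' no longer closes a sentence, so the example is split only after 'week.' and after 'it.'.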
But sometimes an abbreviation can mark the end of a sentence. For example:
He has a PhD. He is a physicist.
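A common heuristic for this case, shown here only as a sketch, is to look at the following word: an abbreviation closes the sentence when the next word starts with a capital letter. The two sets below are hypothetical samples, and the split between titles (assumed to always precede a name) and other abbreviations is an assumption of this example. The heuristic is still wrong whenever such an abbreviation is followed by a proper noun in the middle of a sentence.

```python
# Hypothetical sample lists, for illustration only.
TITLES = {"mr.", "mrs.", "ms.", "dr.", "prof."}   # assumed to precede a name
OTHER_ABBREVIATIONS = {"phd.", "etc.", "inc."}    # may also close a sentence

def is_boundary(token, next_token):
    """Decide whether a sentence ends after token."""
    if token[-1] not in ".!?":
        return False
    word = token.lower()
    if word in TITLES:
        return False
    if word in OTHER_ABBREVIATIONS:
        # Boundary only at the end of the text or before a capitalized word.
        return next_token is None or next_token[0].isupper()
    return True

def split_sentences(text):
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if is_boundary(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a final marker
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("He has a PhD. He is a physicist."))
print(split_sentences("Mr. John goes to the cinema every week."))
```

The first call yields two sentences, because 'PhD.' is followed by the capitalized 'He'; the second yields one, because 'Mr.' is treated as a title.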
Another difficulty in identifying sentence boundaries is that text is not always perfectly written. Sometimes, the full stop or other markers are simply missing.
Numbers also pose a problem for correctly finding sentence boundaries. As a rule, a full stop that follows a digit and is not followed by whitespace, as in a decimal number, is not to be considered a sentence boundary.
Another special case to treat is the presence of a full stop inside a sequence of letters that is not an abbreviation. For example:
Of course, these are just a few of the problems faced when building a sentence splitter algorithm.
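The rule for numbers can be sketched as a character scan. The helper names below are invented for this example, and the sketch deliberately ignores abbreviations and repeated markers such as '...'. Unlike a whitespace-based split, it still finds a boundary when the space after a sentence is missing.

```python
def is_inside_number(text, i):
    """True if text[i] is a '.' that follows a digit and is not followed
    by whitespace (and is not the last character), as in '3.50'."""
    return (text[i] == "." and i > 0 and text[i - 1].isdigit()
            and i + 1 < len(text) and not text[i + 1].isspace())

def split_respecting_numbers(text):
    """Split after every '.', '!' or '?' except dots inside numbers."""
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in ".!?" and not is_inside_number(text, i):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:  # trailing text without a final marker
        sentences.append(tail)
    return sentences

print(split_respecting_numbers("The ticket costs 3.50 euros. It was 2.75 last year."))
```

The dots in '3.50' and '2.75' are skipped, so the text is split into two sentences.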