Persian Sentence Segmenter and Tokenizer: SeTPer

SeTPer (Seraji, 2015, Chapter 4, pp. 88-90) is built on the modular software platform Uplug (Tiedemann, 2003), a system designed for the integration of text processing tools. The Uplug sentence segmenter and tokenizer is a rule-based program that can be adapted to various languages by using regular expressions to match common word and sentence boundaries. SeTPer treats the full stop, the question mark, and the exclamation mark as sentence boundaries. The token separators in SeTPer are: apostrophe, brackets, colon, semicolon, dash, exclamation mark, question mark, at sign, slash, backslash, percent, asterisk, and tilde. The tokenizer also handles numerical expressions, web URLs, abbreviations, acronyms, and titles.
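
The rule-based approach described above can be illustrated with a small Python sketch. This is not the SeTPer code, which consists of Perl scripts using Uplug-style regular expressions; the names `split_sentences` and `tokenize` and the exact patterns are assumptions made for illustration only. The sketch also assumes that "brackets" means (), [] and {}, adds the Arabic-script question mark (U+061F) alongside the ASCII one, and leaves out the special handling of numerical expressions, URLs, abbreviations, acronyms, and titles.

```python
import re

# Sentence-final marks named in the description: full stop, question mark,
# exclamation mark. Adding the Arabic-script question mark (U+061F) is an
# assumption of this sketch.
SENTENCE_END = ".?!\u061F"

# Token separators as listed in the description. "Brackets" is read here as
# (), [] and {}, and U+061F is added as a question mark (both assumptions).
SEPARATORS = "'()[]{}:;-!?\u061F@/\\%*~"

# Split at whitespace that directly follows a sentence-final mark.
_SENT_BREAK = re.compile("(?<=[" + re.escape(SENTENCE_END) + r"])\s+")

# A token is either one separator character or a run of characters that are
# neither whitespace nor separators.
_SEP_CLASS = re.escape(SEPARATORS)
_TOKEN = re.compile("[" + _SEP_CLASS + r"]|[^\s" + _SEP_CLASS + "]+")


def split_sentences(text):
    """Return the sentences of text, one string per sentence."""
    return [part for part in _SENT_BREAK.split(text.strip()) if part]


def tokenize(sentence):
    """Return the tokens of one sentence, separators as tokens of their own."""
    return _TOKEN.findall(sentence)


for sentence in split_sentences("Is this a test? Yes (it is)! Done."):
    print(tokenize(sentence))
```

Because the full stop is not in the list of token separators, this sketch leaves it attached to the preceding word; the actual scripts may treat it differently. A real segmenter also needs exceptions for abbreviations and titles, which would otherwise be split at their full stop.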

The tools

The tools were developed by Mojgan Seraji (mojgan@stp.lingfil.uu.se) in collaboration with Jörg Tiedemann (jorg.tiedemann@lingfil.uu.se) and are licensed under the GNU General Public License. The scripts use regular expressions similar to those in Uplug (Tiedemann, 2003), with extensions for Persian. To get the tools, click the following links:

Running SeTPer

You can run SeTPer by typing the following at the command line prompt:
prompt> perl fa_sent.pl < input_file.txt | perl fa_tok.pl > output_file.txt
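
When the pipeline has to be run from a program rather than typed at the prompt, it can be wrapped as in the following Python sketch. The function name `run_setper` and its parameters are hypothetical; the sketch assumes that `perl` is on the PATH and that `fa_sent.pl` and `fa_tok.pl` are in the current working directory.

```python
import subprocess


def run_setper(input_path, output_path,
               sent_cmd=("perl", "fa_sent.pl"),
               tok_cmd=("perl", "fa_tok.pl")):
    """Pipe input_path through the segmenter, then the tokenizer, into output_path.

    The default commands mirror the command line shown above; the script
    locations are assumptions.
    """
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        segmenter = subprocess.Popen(list(sent_cmd), stdin=src,
                                     stdout=subprocess.PIPE)
        tokenizer = subprocess.Popen(list(tok_cmd), stdin=segmenter.stdout,
                                     stdout=dst)
        # Close our copy of the pipe so that only the tokenizer holds the
        # read end and the segmenter is notified if the tokenizer exits early.
        segmenter.stdout.close()
        tok_status = tokenizer.wait()
        sent_status = segmenter.wait()
    if sent_status != 0 or tok_status != 0:
        raise RuntimeError(
            "pipeline failed: segmenter exit %d, tokenizer exit %d"
            % (sent_status, tok_status)
        )
```

Calling `run_setper("input_file.txt", "output_file.txt")` is then intended to have the same effect as the command line above. Files are opened in binary mode so that the bytes of the Persian text pass through unchanged.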

References

1. Tiedemann, Jörg. 2003. Recycling Translations - Extraction of Lexical Data from Parallel Corpora and their Application in Natural Language Processing. Doctoral dissertation, Uppsala University. Studia Linguistica Upsaliensia 1.

2. Seraji, Mojgan. 2015. Morphosyntactic Corpora and Tools for Persian. Doctoral dissertation, Uppsala University. Studia Linguistica Upsaliensia 16.

Updated: 2015-05-18