Persian Pre-processor: PrePer

PrePer (Seraji, 2015, Chapter 4, pp. 82-88) is a program written in Ruby for editing and cleaning up Persian texts. It uses the existing Virastar module for some formatting tasks (Bargi, 2011). PrePer itself handles miscellaneous cases and normalizes texts into a standard script suitable for computational processing. Through Virastar, PrePer also handles occurrences of mixed character encodings. During preprocessing, all letters written in Arabic style with Arabic Unicode characters are converted to Persian style by mapping them to the Persian Unicode encoding. In addition, Arabic and Western digits are all converted to Persian digits. PrePer also converts white space to a zero-width non-joiner (ZWNJ) between:

  • nouns and plural suffixes /-hâ/, /-ân/, /-ât/, and /-in/
  • nouns and the suffix /-i/ (or its variant used after the long vowel /u:/) denoting indefiniteness or abstractness, as well as the indefinite suffix used after silent h, when forming indefinite or abstract nouns
  • nouns and pronominal clitics
  • past participle verbs and copula enclitics
  • nouns and verbal stems in compound words
  • verbal stems and the suffix /-âk/
  • verbal stems and the suffixes /-âr/ or /-gâr/ when forming nouns of action
  • nouns and their adjacent suffixes when forming adjective-adverbs or adjective-nouns
  • the negative prefixes /nâ-/ and /bi-/ (im-, in-, un-, -less) and their adjacent words
  • the prefixes /su-/, /adam-/, /farâ-/, and their adjacent words when forming determinative juxtaposed nouns and adjectives.
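
The normalization steps described above can be illustrated with a minimal Ruby sketch. It is not PrePer's or Virastar's actual code: the names `normalize_letters`, `normalize_digits`, `attach_plural_ha`, and `preprocess` are hypothetical, only two Arabic-style letters are mapped, and only the plural suffix /-hâ/ is covered, using a purely orthographic rule. The rules PrePer actually applies are described in Seraji (2015) and are likely more elaborate.

```ruby
# Illustrative sketch only: these names and rules are assumptions,
# not PrePer's or Virastar's actual code.

ZWNJ = "\u200C" # zero-width non-joiner

# Two common Arabic-style letters and their Persian counterparts.
ARABIC_TO_PERSIAN = {
  "\u064A" => "\u06CC", # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
  "\u0643" => "\u06A9"  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}.freeze

# Replace Arabic-style letters with Persian-style ones.
def normalize_letters(text)
  text.gsub(/[\u064A\u0643]/, ARABIC_TO_PERSIAN)
end

# Convert Western (0-9) and Arabic-Indic (U+0660-U+0669) digits
# to Persian digits (U+06F0-U+06F9).
def normalize_digits(text)
  text.tr("0-9", "\u06F0-\u06F9").tr("\u0660-\u0669", "\u06F0-\u06F9")
end

PLURAL_HA = "\u0647\u0627" # the plural suffix /-hâ/

# Replace white space between a word and a detached /-hâ/ with ZWNJ.
# The lookahead leaves longer words that merely begin with these letters intact.
def attach_plural_ha(text)
  text.gsub(/(\p{L})[ \t]+(#{PLURAL_HA})(?![\p{L}\p{M}])/, "\\1#{ZWNJ}\\2")
end

# Apply the three steps in sequence.
def preprocess(text)
  attach_plural_ha(normalize_digits(normalize_letters(text)))
end
```

Calling `preprocess` on a string applies letter mapping, digit conversion, and ZWNJ insertion in that order. A purely orthographic rule like this one cannot distinguish a detached suffix from an independent word with the same spelling, which is one reason a real pre-processor needs more context than this sketch uses.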

    Download

    The program was developed by Mojgan Seraji and is licensed under the GNU General Public License. You need to install RubyGems (gem) for Ruby before running PrePer. PrePer can be downloaded below:

    Running PrePer

    You can run PrePer by typing the following at the command line prompt:
    prompt> ruby pre_per.rb input_file.txt > output_file.txt


    1. Bargi, A. A. 2011. Virastar.

    2. Seraji, Mojgan. 2013. PrePer: A Pre-processor for Persian. Presented at the Fifth International Conference on Iranian Linguistics (ICIL5), Bamberg, Germany.

    3. Seraji, Mojgan. 2015. Morphosyntactic Corpora and Tools for Persian. Doctoral dissertation, Uppsala University. Studia Linguistica Upsaliensia 16.


