UPPSALA UNIVERSITY : Department of Linguistics and Philology : Eva Forsbom : Resources by Eva Forsbom : Base vocabulary pool

Base vocabulary pool


Description

A base vocabulary pool is a ranked list of the most generally used lemmas, their wordforms, adjusted frequency, and contribution, according to a lemmatised and categorised corpus. From this pool, base vocabularies for various application needs can be extracted.
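Ranking lemmas by a dispersion-adjusted frequency can be illustrated with a small sketch. It assumes a Rosengren-style adjusted frequency (the squared sum of square roots of each category's corpus share times the lemma's count in that category), which may differ from what the package scripts actually compute. The function names and the toy counts are hypothetical, and the "contribution" column is not modelled here.

```python
from math import sqrt


def adjusted_frequency(counts, sizes):
    """Rosengren-style adjusted frequency for one lemma (an assumption,
    not necessarily the measure used in the BaseVocabulary scripts).

    counts[i] is the lemma's frequency in corpus category i,
    sizes[i] is the total number of tokens in category i.
    """
    total = sum(sizes)
    # Weight each category count by that category's share of the corpus.
    return sum(sqrt((size / total) * count)
               for count, size in zip(counts, sizes)) ** 2


def rank_pool(lemma_counts, sizes):
    """Return (lemma, adjusted frequency) pairs, highest score first."""
    scored = [(lemma, adjusted_frequency(counts, sizes))
              for lemma, counts in lemma_counts.items()]
    # Sort by descending score; break ties alphabetically by lemma.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


if __name__ == "__main__":
    # Hypothetical toy data: four equally sized categories.
    sizes = [100, 100, 100, 100]
    lemma_counts = {
        "even": [10, 10, 10, 10],   # spread over all categories
        "bursty": [40, 0, 0, 0],    # same raw frequency, one category only
    }
    for lemma, score in rank_pool(lemma_counts, sizes):
        print(f"{lemma}\t{score:.1f}")
```

Both toy lemmas have the same raw frequency, but the evenly dispersed one keeps its full frequency while the one confined to a single category is scaled down, so it ranks lower in the pool.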

More details on the base vocabulary pool can be found here:

The BaseVocabulary package includes a Swedish base vocabulary, based on the Stockholm-Umeå Corpus, an English base vocabulary, based on the Susanne corpus, and scripts for creating the base vocabularies and computing various frequency and dispersion measures. The scripts were written solely for the purpose of the paper, and have been tested only on Linux (kernel 2.4.22 on Mandrake 9.2, and kernel 2.6.14 on Fedora Core 4). The base vocabularies, however, are raw text files, and can be viewed in any editor.

The base vocabulary pool has been used by me in the following projects:

And elsewhere:

License

The package is licensed under the GNU General Public License.

Download

Download the BaseVocabulary package (a gzipped tar archive). Unpack it with tar -xzf BaseVocabulary.tgz, then follow the instructions in the BaseVocabulary/README file.

Files

Requirements

The base vocabularies are raw text files, and can be viewed in any editor.

The scripts, written in Perl (5.005) and XSLT, were all developed in a Linux environment using standard modules. They are probably portable to other environments (sorry, I have no way of testing), except for the shell scripts basevoc_suc.sh and basevoc_susanne.sh, which are batch scripts that glue the other scripts together. (Use them as examples rather than as turnkey scripts.)

The SUC corpus* can be obtained, subject to a license, from http://www.ling.su.se/dali/suc/suc2.0_info.html. The original corpus files can be converted from SGML format to valid XML format with parole2xml.pl.

* Stockholm-Umeå Corpus, version 2, 2002, Stockholm University, Department of Linguistics and Umeå University, Department of Linguistics.

The Susanne (R5) corpus can be downloaded from http://www.grsampson.net/RSue.html. Its annotation scheme and corpus compilation (excerpts from the Brown corpus) are described in the following book: Geoffrey Sampson. 1995. English for the Computer: The SUSANNE Corpus and analytic scheme. Clarendon Press, Oxford. ISBN 0-19-824023-6.

Version history