Ucto
A Catalogue of Free/Open Source Software for Translators
| Ucto | |
|---|---|
| Category: | Language Tools |
| Typology: | Tokenizer |
| Web site: | http://ilk.uvt.nl/ucto/ |
| Operating systems: | GNU/Linux, Mac OS X |
| Requirements: | ICU 3.6 or higher. You may also need to install recent versions of pkg-config and the autoconf toolkit. |
| Latest release: | 0.4.4 (2011/04/04) |
| License: | GNU General Public License |
| Affiliation: | The Netherlands Organisation for Scientific Research, Tilburg University |
| Available Resources | |
| Download page: | http://ilk.uvt.nl/ucto/download-ucto.php |
| Project Details | |
|---|---|
From the project's web-site:
Ucto tokenizes text files: it separates words from punctuation, and splits sentences. It offers several other basic preprocessing steps such as changing case that you can all use to make your text suited for further processing such as indexing, part-of-speech tagging, or machine translation.
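To make the description above concrete, here is a minimal Python sketch of what "separating words from punctuation and splitting sentences" means. It is an illustration only and not Ucto's rule set or API: the names `TOKEN_RE`, `SENTENCE_END`, and `tokenize` are hypothetical, and the sentence rule (end a sentence at `.`, `!`, or `?`) is a deliberate simplification.

```python
import re

# Hypothetical illustration; this is NOT how Ucto is implemented.
# A token is either a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Assumed, simplified set of sentence-final punctuation marks.
SENTENCE_END = {".", "!", "?"}


def tokenize(text):
    """Return a list of sentences, each a list of word and punctuation tokens."""
    sentences = []
    current = []
    for token in TOKEN_RE.findall(text):
        current.append(token)
        if token in SENTENCE_END:
            # Close the current sentence at sentence-final punctuation.
            sentences.append(current)
            current = []
    if current:
        # Keep trailing text that has no final punctuation.
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for sentence in tokenize("Hello, world! How are you?"):
        print(" ".join(sentence))
```

A naive rule like this would wrongly split at abbreviations such as "e.g.", which is one reason a rule-based tokenizer with language-specific settings is useful.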
Features
* Comes with tokenization rules for English, Dutch, French, Italian, and Swedish; easily extendible to other languages.
* Recognizes dates, times, units, currencies, abbreviations.
* Recognizes paired quote spans, sentences, and paragraphs.
* Produces UTF8 encoding and NFC output normalization, optionally accepts other encodings as input.
* Optional conversion to all lowercase or uppercase.
* Optionally produces FoLiA xml.
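
The encoding and case features listed above can be sketched in a few lines of Python. This is a hedged illustration of the general technique (decode the input, apply NFC normalization, optionally change case, emit UTF-8), not Ucto's own code; the function name `normalize` and its parameters are hypothetical.

```python
import unicodedata


def normalize(raw, encoding="utf-8", case=None):
    """Decode bytes, apply NFC normalization, optionally change case,
    and return the result encoded as UTF-8.

    `encoding` and `case` are illustrative parameters, not Ucto options.
    """
    text = raw.decode(encoding)
    # NFC composes base characters and combining marks where possible.
    text = unicodedata.normalize("NFC", text)
    if case == "lower":
        text = text.lower()
    elif case == "upper":
        text = text.upper()
    return text.encode("utf-8")


if __name__ == "__main__":
    # Latin-1 input is converted to lowercase UTF-8 output.
    print(normalize("Caf\u00e9".encode("latin-1"), encoding="latin-1", case="lower"))
```

Normalizing to NFC means that a letter followed by a combining accent and the equivalent precomposed letter end up as the same byte sequence, which matters for later steps such as indexing.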