The TEPROLIN Web Service

Radu Ion (radu@racai.ro)

Introduction

The TEPROLIN Web Service (WS) was developed and is maintained in the ReTeRom project. Its backend is the TEPROLIN text preprocessing platform, which incorporates several NLP applications and provides a unified access interface to them as a Python 3 object.

TEPROLIN currently offers 15 text preprocessing operations for Romanian, 13 of which are described in (Ion, 2018). These are:

  1. text-normalization
  2. diacritics-restoration
  3. word-hyphenation
  4. word-stress-identification
  5. word-phonetic-transcription
  6. numeral-rewriting
  7. abbreviation-rewriting
  8. sentence-splitting
  9. tokenization
  10. pos-tagging
  11. lemmatization
  12. named-entity-recognition (new)
  13. biomedical-named-entity-recognition (new)
  14. chunking
  15. dependency-parsing

Configuration options

GET requests return configuration information. Assuming that the WS is running on http://127.0.0.1:5000,

curl http://127.0.0.1:5000/operations

will return a JSON object with the list of 15 operations mentioned above:

[Figure: TEPROLIN supported ops]

A GET request naming one of TEPROLIN's operations, e.g.

curl http://127.0.0.1:5000/apps/pos-tagging

will return a JSON object with the list of the NLP apps that can perform it:

[Figure: TEPROLIN apps for pos-tagging]

The first NLP app is the default app to execute the operation. In the example above, pos-tagging is executed with nlp-cube-adobe.
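
The two configuration queries can also be issued from Python. The following is a minimal sketch using only the standard library. The helper names (operations_url, apps_url, get_json) are illustrative and not part of TEPROLIN, the base URL is the local address assumed in the curl examples, and nothing is assumed about the exact shape of the returned JSON.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed local address, as in the curl examples of this document.
BASE_URL = "http://127.0.0.1:5000"


def operations_url(base_url=BASE_URL):
    """Return the URL that lists the supported operations."""
    return base_url.rstrip("/") + "/operations"


def apps_url(operation, base_url=BASE_URL):
    """Return the URL that lists the NLP apps able to perform `operation`."""
    return base_url.rstrip("/") + "/apps/" + quote(operation, safe="")


def get_json(url, timeout=30):
    """Send a GET request and decode the JSON response body."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a running WS, get_json(operations_url()) and get_json(apps_url("pos-tagging")) would correspond to the two curl commands above.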

Here is the complete list of NLP apps that TEPROLIN currently incorporates, for each operation:

  1. text-normalization
    1. tnorm-icia: an in-house Python 3 class that replaces the old Romanian diacritics (ş and ţ) with their new variants (ș and ț), collapses multiple spaces and normalizes dash characters.
  2. diacritics-restoration
    1. diac-restore-icia: an in-house developed diacritic restoration algorithm based on word n-grams and Viterbi decoding. Developed by Tiberiu Boroș in Java, it has been ported to Python 3 and included in TEPROLIN.
  3. word-hyphenation
    1. tts-utcluj: developed in Python 3 by Stan et al. (2011). More information on http://romaniantts.com/.
  4. word-stress-identification
    1. tts-utcluj
  5. word-phonetic-transcription
    1. tts-utcluj
  6. numeral-rewriting
    1. expander-utcluj: see the references for tts-utcluj.
  7. abbreviation-rewriting
    1. expander-utcluj
  8. sentence-splitting
    1. ttl-icia: provided by the TTL Perl module (Ion, 2007).
    2. nlp-cube-adobe: provided by the NLP-Cube Python 3 module (Boroș et al., 2018).
  9. tokenization
    1. ttl-icia
    2. nlp-cube-adobe
  10. pos-tagging
    1. ttl-icia
    2. nlp-cube-adobe
  11. lemmatization
    1. ttl-icia
    2. nlp-cube-adobe
  12. named-entity-recognition
    1. ner-icia: provided by the web service developed by Vasile Păiș, available in this NER interface.
  13. biomedical-named-entity-recognition
    1. bioner-icia: provided by a previous version of the NLP-Cube Python 3 module (Boroș et al., 2018).
  14. chunking
    1. ttl-icia
  15. dependency-parsing
    1. nlp-cube-adobe
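
For client-side checks, the list above can be transcribed into a lookup table. This is only a snapshot of the list in this document, written as a hypothetical helper (APPS_BY_OPERATION, is_supported and operations_for are illustrative names). The running service, queried through /apps/<operation>, remains the authoritative source.

```python
# Snapshot of the operation -> NLP apps list given in this document.
APPS_BY_OPERATION = {
    "text-normalization": ["tnorm-icia"],
    "diacritics-restoration": ["diac-restore-icia"],
    "word-hyphenation": ["tts-utcluj"],
    "word-stress-identification": ["tts-utcluj"],
    "word-phonetic-transcription": ["tts-utcluj"],
    "numeral-rewriting": ["expander-utcluj"],
    "abbreviation-rewriting": ["expander-utcluj"],
    "sentence-splitting": ["ttl-icia", "nlp-cube-adobe"],
    "tokenization": ["ttl-icia", "nlp-cube-adobe"],
    "pos-tagging": ["ttl-icia", "nlp-cube-adobe"],
    "lemmatization": ["ttl-icia", "nlp-cube-adobe"],
    "named-entity-recognition": ["ner-icia"],
    "biomedical-named-entity-recognition": ["bioner-icia"],
    "chunking": ["ttl-icia"],
    "dependency-parsing": ["nlp-cube-adobe"],
}


def is_supported(operation, app):
    """Check whether `app` is listed for `operation` in the snapshot above."""
    return app in APPS_BY_OPERATION.get(operation, [])


def operations_for(app):
    """Return every operation that lists `app`, in document order."""
    return [op for op, apps in APPS_BY_OPERATION.items() if app in apps]
```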

Annotating text

To annotate text, send POST requests to the /process URL. TEPROLIN is a REST WS, meaning that nothing is saved between requests. If you want to use a different NLP app for a given operation, you must send the configuration option along with the text to be processed in every request. For a full list of which operations can be executed with which NLP apps, see the previous section.

The POST request uses the application/x-www-form-urlencoded MIME type. The body of the request may contain only the following key=value pairs, joined with the & character:

  1. text=<text to be annotated>
  2. <operation>=<NLP app> (e.g. pos-tagging=ttl-icia)
  3. exec=<operation>,<operation>,...

If exec is present, the requested operations are performed in the proper order, whatever order the client lists them in: TEPROLIN infers the sequence of function calls and the modules to run so that the requested annotations are returned to the client. If exec is not present, the full processing chain is executed (all 15 operations).

If any configuration option is present, then the specified operation(s) will be performed with the requested NLP app(s) (e.g. pos-tagging is performed with the ttl-icia NLP app).

Finally, text is the only required key; it contains the text to be processed.
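
The request body described in this section can be assembled with the Python standard library. The sketch below is an illustration, not TEPROLIN client code: build_process_body and build_process_request are hypothetical names, and the client-side check of operation names is an added convenience based on the 15 operations listed in this document.

```python
from urllib.parse import urlencode
from urllib.request import Request

# The 15 operation names listed in this document.
OPERATIONS = (
    "text-normalization", "diacritics-restoration", "word-hyphenation",
    "word-stress-identification", "word-phonetic-transcription",
    "numeral-rewriting", "abbreviation-rewriting", "sentence-splitting",
    "tokenization", "pos-tagging", "lemmatization",
    "named-entity-recognition", "biomedical-named-entity-recognition",
    "chunking", "dependency-parsing",
)


def build_process_body(text, exec_ops=None, apps=None):
    """Encode text, <operation>=<NLP app> pairs and exec as a form body."""
    if not text:
        raise ValueError("'text' is the only required key and must not be empty")
    pairs = [("text", text)]
    # Optional configuration options, e.g. {"pos-tagging": "ttl-icia"}.
    for operation, app in (apps or {}).items():
        if operation not in OPERATIONS:
            raise ValueError(f"unknown operation: {operation}")
        pairs.append((operation, app))
    # Optional comma-separated list of operations to execute.
    if exec_ops:
        unknown = [op for op in exec_ops if op not in OPERATIONS]
        if unknown:
            raise ValueError(f"unknown operation(s): {', '.join(unknown)}")
        pairs.append(("exec", ",".join(exec_ops)))
    # urlencode percent-encodes values (UTF-8) and joins pairs with '&'.
    return urlencode(pairs)


def build_process_request(base_url, text, exec_ops=None, apps=None):
    """Build (without sending) the POST request for the /process URL."""
    body = build_process_body(text, exec_ops, apps).encode("ascii")
    return Request(
        base_url.rstrip("/") + "/process",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send the request to a running WS. Note that urlencode writes spaces as + and commas as %2C, which a form-decoding server reads back as the original characters.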

The returned JSON object

TEPROLIN WS will respond with a JSON object containing two keys:

For example, the output for the command

curl http://127.0.0.1:5000/process -d "text=Diabetul zaharat se remarca prin valori crescute ale concentratiei glucozei in sange." -d "exec=biomedical-named-entity-recognition"

is the following:

[Figure: TEPROLIN output]

Getting statistics about platform usage

The TEPROLIN platform can offer statistics about the following types of events: tokens processed (tokens), characters processed (chars) and requests received (requests).

To get frequency information about the events above, send GET requests to the /stats URL prefix. To obtain the full URL, append a statistics type (one of tokens, chars or requests), a time period (one of year, month or day) and the size of the history to retrieve (an integer).

For example, to get a breakdown of the number of tokens processed in the past 5 days (including the present day), send this query:

curl http://127.0.0.1:5000/stats/tokens/day/5

In order to get the number of requests for the current month, send this query:

curl http://127.0.0.1:5000/stats/requests/month/1

TEPROLIN will respond with a JSON object that contains the list of counts for the specified statistics type. For the first request, the response looks like this:

[Figure: TEPROLIN statistics]
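
The /stats URL scheme can be captured in a small helper that rejects malformed queries before they are sent. This is a sketch under stated assumptions: stats_url is a hypothetical name, the default base URL is the local address used in the curl examples, and requiring the history size to be a positive integer is an assumption (the text above only says it is an integer).

```python
STAT_TYPES = ("tokens", "chars", "requests")
PERIODS = ("year", "month", "day")


def stats_url(stat_type, period, history, base_url="http://127.0.0.1:5000"):
    """Build /stats/<statistics type>/<time period>/<history size>."""
    if stat_type not in STAT_TYPES:
        raise ValueError(f"statistics type must be one of {STAT_TYPES}")
    if period not in PERIODS:
        raise ValueError(f"time period must be one of {PERIODS}")
    # Assumption: a history of at least one period; bool is excluded
    # explicitly because it is a subclass of int in Python.
    if isinstance(history, bool) or not isinstance(history, int) or history < 1:
        raise ValueError("history size must be a positive integer")
    return f"{base_url.rstrip('/')}/stats/{stat_type}/{period}/{history}"
```

For instance, stats_url("tokens", "day", 5) and stats_url("requests", "month", 1) produce the URLs of the two curl queries above.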

References

Boroș, Tiberiu and Dumitrescu, Ștefan Daniel and Burtica, Ruxandra. (2018). NLP-Cube: End-to-End Raw Text Processing With Neural Networks. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, pp. 171-179, October 2018.

Ion, Radu. (2018). TEPROLIN: An Extensible, Online Text Preprocessing Platform for Romanian. In Proceedings of the International Conference on Linguistic Resources and Tools for Processing Romanian Language (ConsILR 2018), November 22-23, 2018, Iași, România.

Stan, Adriana and Yamagishi, Junichi and King, Simon and Aylett, Matthew. (2011). The Romanian Speech Synthesis (RSS) corpus: building a high quality HMM-based speech synthesis system using a high sampling rate. Speech Communication, vol. 53, pp. 442-450, doi: 10.1016/j.specom.2010.12.002.