kotki
High-performance language translations without using the cloud.
- C/C++ 17 implementation
- x86_64, ARM
- Runs on the CPU
- AVX intrinsics support for x86 architectures
- NEON intrinsics support for ARM architectures
- Language models from the Mozilla extension Firefox Translations
- FOSS (OpenBLAS)
- Linux only
Quick start
Requirements
For Ubuntu:
sudo apt update && sudo apt upgrade
sudo apt install -y cmake ccache build-essential git pkg-config rapidjson-dev pybind11-dev libyaml-cpp-dev python3-dev python3-virtualenv libopenblas-dev libpcre2-dev libprotobuf-dev protobuf-compiler libsqlite3-dev
Python
pip install kotki -v
- Install language translation models (see Models)
Programmatically
import kotki
kotki.scan() # auto-find language translation models
# kotki.scan("/path/to/registry.json") # or supply the path
# English -> German
kotki.translate("Whenever I am at the office, I like to drink coffee.", "ende")
'Wann immer ich im büro bin, trinke ich gerne kaffee.'
# Bulgarian -> English
kotki.translate("Румънците получиха дълго чакани новини: пенсиите и минималната заплата ще бъдат увеличени от 2023 г.", "bgen")
'Romanians have received long-awaited news: pensions and minimum wages will be increased from 2023'
# Dutch -> English
kotki.translate("Auto begeeft het nadat man benzine steelt in Breda, blijkt dieselauto te zijn", "nlen")
'Car breaks after man steals gas in Breda, turns out to be diesel car'
# English -> Polish
kotki.translate("I am going outside to buy some Pierogi.", "enpl")
'Jadę na zewnątrz, żeby kupić Pierogi.'
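The model names used above ("ende", "bgen", "nlen", "enpl") look like the source and target language codes joined together. That pattern is inferred from the examples, not a documented rule, so treat the following as a minimal sketch. The helpers `model_name` and `translate_between` are hypothetical names, and the translate function is passed in as a parameter (for example `kotki.translate`):

```python
from typing import Callable


def model_name(source: str, target: str) -> str:
    """Join two language codes into a model name, e.g. ("en", "de") -> "ende".

    Assumption: model names are <source><target>, as the examples suggest.
    """
    source, target = source.strip().lower(), target.strip().lower()
    if len(source) != 2 or len(target) != 2:
        raise ValueError("expected two-letter language codes")
    return source + target


def translate_between(translate: Callable[[str, str], str],
                      text: str, source: str, target: str) -> str:
    """Call translate(text, model) with the model name built from the codes."""
    return translate(text, model_name(source, target))
```

With kotki installed and models available, this could be called as `translate_between(kotki.translate, "Good morning.", "en", "de")`.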
CLI
$ kotki-cli --help
Usage: kotki-cli [OPTIONS]
Translate some text.
Options:
-i, --input TEXT Text to translate [required]
-m, --model TEXT Model names. Use -l to list. Leave empty to guess
the input language automatically.
-r, --registry FILENAME Path to registry.json. Leave empty for auto-
detection of translation models.
-l, --list List available models.
-d, --debug Print debug log.
--help Show this message and exit.
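To drive the CLI from a script, the options listed above can be assembled into an argument list. This is a sketch only: `build_cli_args` and `run_cli` are hypothetical helpers, and `run_cli` assumes that kotki-cli prints the translation to standard output, which the help text does not state.

```python
import subprocess
from typing import List, Optional


def build_cli_args(text: str, model: Optional[str] = None,
                   registry: Optional[str] = None) -> List[str]:
    """Build an argv list for kotki-cli from the options in its --help output."""
    args = ["kotki-cli", "--input", text]
    if model:
        # Omit --model to let the CLI guess the input language.
        args += ["--model", model]
    if registry:
        # Omit --registry to use auto-detection of translation models.
        args += ["--registry", registry]
    return args


def run_cli(text: str, model: Optional[str] = None,
            registry: Optional[str] = None) -> str:
    """Run kotki-cli and return its stdout (assumed to hold the translation)."""
    result = subprocess.run(build_cli_args(text, model, registry),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

Passing the arguments as a list avoids shell quoting problems when the input text contains spaces or quotes.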
Self-hosted web-interface
Example: kotki.kroket.io
$ kotki-web --help
Usage: kotki-web [OPTIONS]
Exposes kotki via HTTP web interface and provide an API.
Options:
-h, --host TEXT bind host (default: 127.0.0.1) [required]
-p, --port INTEGER bind port (default: 7000) [required]
-d, --debug run Quart web-framework in debug
-r, --registry FILENAME Path to registry.json. Leave empty for auto-
detection of translation models.
--help Show this message and exit.
C++
Link against kotki-lib (CMake target, see src/demo/ for reference).
#include <iostream>
#include <string>
#include "kotki/kotki.h"
using namespace std;
int main(int argc, char *argv[]) {
  auto *kotki = new Kotki();
  kotki->scan();  // auto-find language translation models
  // auto loadedModels = kotki->listModels(); // show currently loaded language models
  cout << kotki->translate("This should work, in theory.", "ende") << endl; // English to German
  delete kotki;
  return 0;
}
Why
Kotki is aimed at developers who "just want to translate some text" in their C++ or Python applications without too much headache, as other translation frameworks are often big, difficult to compile, non-performant, etc.
Producing libkotki
The build produces either libkotki.so (shared) or libkotki.a (static).
Via CMake
Install marian-lite (and its dependencies) manually. If you are lazy, you can let kotki download the dependencies automatically via -DVENDORED_LIBS=ON, though your mileage may vary.
- STATIC - Produce static binary (TODO: doesn't work yet)
- SHARED - Produce shared binary
- BUILD_DEMO - Produce example demo application(s)
cmake -DBUILD_DEMO=ON -DSTATIC=OFF -DSHARED=ON -Bbuild .
make -Cbuild -j6
sudo make -Cbuild install # install into /usr/local/...
Via debian packaging
sudo apt install -y debhelper
dpkg-buildpackage -b -uc
Library usage (CMake)
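cmake_minimum_required(VERSION 3.16)
project(my_app CXX)
find_package(kotki REQUIRED)
add_executable(my_app main.cpp)  # main.cpp is a placeholder for your own sources
target_link_libraries(my_app PRIVATE kotki::kotki-lib)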
Models
The translation models are borrowed from the Mozilla Firefox Translations extension. You need to manually download these models. They are conveniently packaged into a single archive that can be downloaded over at kotki/releases.
Extract to ~/.config/kotki/models/ for automatic detection:
mkdir -p ~/.config/kotki/models/
wget https://github.com/kroketio/kotki/releases/download/v0.4.5/kotki_models_0.3.3.zip
unzip kotki_models_0.3.3.zip -d ~/.config/kotki/models
Or supply your own path: scan("/path/to/registry.json").
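Before calling scan(), it can help to check that the models were actually extracted. The sketch below searches the default models directory for registry.json. The helper `find_registry` is hypothetical, and because the location of registry.json inside the archive is not stated here, it searches recursively rather than assuming a fixed layout.

```python
from pathlib import Path
from typing import Optional

# Default directory used for automatic detection, per the instructions above.
DEFAULT_MODELS_DIR = Path.home() / ".config" / "kotki" / "models"


def find_registry(models_dir: Path = DEFAULT_MODELS_DIR) -> Optional[Path]:
    """Return the first registry.json found under models_dir, or None."""
    if not models_dir.is_dir():
        return None
    # Sorted so the result is deterministic if several files exist.
    matches = sorted(models_dir.rglob("registry.json"))
    return matches[0] if matches else None
```

If a path is returned, it can be passed explicitly, for example `kotki.scan(str(path))`. If the result is `None`, the models still need to be downloaded and extracted.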
Performance / footprint
Translations are fast: translating a simple sentence generally takes under 10 ms (except the first time, due to model loading). Translation models are loaded on demand, which means that model loading does not happen during scan() but during the first use of translate(). In addition, translations are done synchronously (and are thus blocking).
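Because the first translate() call pays the model-loading cost and every call blocks, an application may want to warm a model up at startup and keep translations off its main thread. The following is a minimal sketch of that idea, not part of kotki: `warm_up` and `translate_in_background` are hypothetical helpers, the translate function is passed in (for example `kotki.translate`), and a single worker thread is used on the assumption that concurrent calls into the library may not be safe.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable

Translate = Callable[[str, str], str]

# One worker serializes calls, so the translator is never entered concurrently.
_executor = ThreadPoolExecutor(max_workers=1)


def warm_up(translate: Translate, model: str, sample: str = "Hello.") -> float:
    """Translate a throwaway sentence to trigger model loading.

    Returns the elapsed time in seconds, which includes the load on first use.
    """
    start = time.perf_counter()
    translate(sample, model)
    return time.perf_counter() - start


def translate_in_background(translate: Translate, text: str, model: str) -> Future:
    """Submit a blocking translate call to the worker thread and return a Future."""
    return _executor.submit(translate, text, model)
```

A caller would run `warm_up(kotki.translate, "ende")` once after `kotki.scan()`, then use `translate_in_background(kotki.translate, text, "ende").result()` wherever the translated text is needed.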
Acknowledgements
This project was made possible through the combined effort of all researchers and partners in the Bergamot project (Jerin Philip, et al). The translation models are prepared as part of the Mozilla project. The translation engine used is bergamot-translator which is based on marian.
Bergamot-Translator
Kotki differs from Bergamot-Translator. The changes are specified below:
- Removed async/blocking worker pools
- Removed async/callback style translations
- Removed code related to parsing of HTML
- Work from a single JSON config file (registry.json)
- Dynamically generate marian configs 'on-the-fly'
- Simplified the example C++ CLI program (src/demo/kotki.cpp).
- Switch from marian-dev to marian-lite
- Simplified Python bindings
- Simplified the build system (cleaned up various CMakeLists.txt)
- Introduced automatic use of ccache for compilations
- Supply CMake configs for kotki (and its dependencies)
- Supply debian packaging for kotki (and its dependencies)
- Removed support for Apple, Microsoft, WASM (rip)
- Removed usage of proprietary libraries like CUDA, Intel MKL
- Removed unit tests
- Removed CI/CD definitions
- Introduced new dependency: rapidjson
- Removed Doxygen and other documentation
License
MPL 2.0