forked from friendica/friendica-addons
509 lines
12 KiB
Markdown
509 lines
12 KiB
Markdown
<img src="https://thiagoalessio.github.io/tesseract-ocr-for-php/images/logo.png" alt="Tesseract OCR for PHP" align="right" width="320px"/>
|
|
|
|
# Tesseract OCR for PHP
|
|
|
|
A wrapper to work with Tesseract OCR inside PHP.
|
|
|
|
[![CI][ci_badge]][ci]
|
|
[![AppVeyor][appveyor_badge]][appveyor]
|
|
[![Codacy][codacy_badge]][codacy]
|
|
[![Test Coverage][test_coverage_badge]][test_coverage]
|
|
<br/>
|
|
[![Latest Stable Version][stable_version_badge]][packagist]
|
|
[![Total Downloads][total_downloads_badge]][packagist]
|
|
[![Monthly Downloads][monthly_downloads_badge]][packagist]
|
|
|
|
## Installation
|
|
|
|
Via [Composer][]:
|
|
|
|
$ composer require thiagoalessio/tesseract_ocr
|
|
|
|
:bangbang: **This library depends on [Tesseract OCR][], version _3.02_ or later.**
|
|
|
|
<br/>
|
|
|
|
### ![][windows_icon] Note for Windows users
|
|
|
|
There are [many ways][tesseract_installation_on_windows] to install
|
|
[Tesseract OCR][] on your system, but if you just want something quick to
|
|
get up and running, I recommend installing the [Capture2Text][] package with
|
|
[Chocolatey][].
|
|
|
|
choco install capture2text --version 3.9
|
|
|
|
:warning: Recent versions of [Capture2Text][] stopped shipping the `tesseract` binary.
|
|
|
|
<br/>
|
|
|
|
### ![][macos_icon] Note for macOS users
|
|
|
|
With [MacPorts][] you can install support for individual languages, like so:
|
|
|
|
$ sudo port install tesseract-<langcode>
|
|
|
|
But that is not possible with [Homebrew][]. It comes only with **English** support
|
|
by default, so if you intend to use it for other language, the quickest solution
|
|
is to install them all:
|
|
|
|
$ brew install tesseract tesseract-lang
|
|
|
|
<br/>
|
|
|
|
## Usage
|
|
|
|
### Basic usage
|
|
|
|
<img align="right" width="50%" title="The quick brown fox jumps over the lazy dog." src="./tests/EndToEnd/images/text.png"/>
|
|
|
|
```php
|
|
use thiagoalessio\TesseractOCR\TesseractOCR;
|
|
echo (new TesseractOCR('text.png'))
|
|
->run();
|
|
```
|
|
|
|
```
|
|
The quick brown fox
|
|
jumps over
|
|
the lazy dog.
|
|
```
|
|
|
|
<br/>
|
|
|
|
### Other languages
|
|
|
|
<img align="right" width="50%" title="Bülowstraße" src="./tests/EndToEnd/images/german.png"/>
|
|
|
|
```php
|
|
use thiagoalessio\TesseractOCR\TesseractOCR;
|
|
echo (new TesseractOCR('german.png'))
|
|
->lang('deu')
|
|
->run();
|
|
```
|
|
|
|
```
|
|
Bülowstraße
|
|
```
|
|
|
|
<br/>
|
|
|
|
### Multiple languages
|
|
|
|
<img align="right" width="50%" title="I eat すし y Pollo" src="./tests/EndToEnd/images/mixed-languages.png"/>
|
|
|
|
```php
|
|
use thiagoalessio\TesseractOCR\TesseractOCR;
|
|
echo (new TesseractOCR('mixed-languages.png'))
|
|
->lang('eng', 'jpn', 'spa')
|
|
->run();
|
|
```
|
|
|
|
```
|
|
I eat すし y Pollo
|
|
```
|
|
|
|
<br/>
|
|
|
|
### Inducing recognition
|
|
|
|
<img align="right" width="50%" title="8055" src="./tests/EndToEnd/images/8055.png"/>
|
|
|
|
```php
|
|
use thiagoalessio\TesseractOCR\TesseractOCR;
|
|
echo (new TesseractOCR('8055.png'))
|
|
->allowlist(range('A', 'Z'))
|
|
->run();
|
|
```
|
|
|
|
```
|
|
BOSS
|
|
```
|
|
|
|
<br/>
|
|
|
|
### Breaking CAPTCHAs
|
|
|
|
Yes, I know some of you might want to use this library for the *noble* purpose
|
|
of breaking CAPTCHAs, so please take a look at this comment:
|
|
|
|
<https://github.com/thiagoalessio/tesseract-ocr-for-php/issues/91#issuecomment-342290510>
|
|
|
|
## API
|
|
|
|
### run
|
|
|
|
Executes a `tesseract` command, optionally receiving an integer as `timeout`,
|
|
in case you experience stalled tesseract processes.
|
|
|
|
```php
|
|
$ocr = new TesseractOCR();
|
|
$ocr->run();
|
|
```
|
|
```php
|
|
$ocr = new TesseractOCR();
|
|
$timeout = 500;
|
|
$ocr->run($timeout);
|
|
```
|
|
|
|
### image
|
|
|
|
Define the path of an image to be recognized by `tesseract`.
|
|
|
|
```php
|
|
$ocr = new TesseractOCR();
|
|
$ocr->image('/path/to/image.png');
|
|
$ocr->run();
|
|
```
|
|
|
|
### imageData
|
|
|
|
Set the image to be recognized by `tesseract` from a string, with its size.
|
|
This can be useful when dealing with files that are already loaded in memory.
|
|
You can easily retrieve the image data and size of an image object :
|
|
```php
|
|
//Using Imagick
|
|
$data = $img->getImageBlob();
|
|
$size = $img->getImageLength();
|
|
//Using GD
|
|
ob_start();
|
|
// Note that you can use any format supported by tesseract
|
|
imagepng($img, null, 0);
|
|
$size = ob_get_length();
|
|
$data = ob_get_clean();
|
|
|
|
$ocr = new TesseractOCR();
|
|
$ocr->imageData($data, $size);
|
|
$ocr->run();
|
|
```
|
|
|
|
### executable
|
|
|
|
Define a custom location of the `tesseract` executable,
|
|
if by any reason it is not present in the `$PATH`.
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->executable('/path/to/tesseract')
|
|
->run();
|
|
```
|
|
|
|
### version
|
|
|
|
Returns the current version of `tesseract`.
|
|
|
|
```php
|
|
echo (new TesseractOCR())->version();
|
|
```
|
|
|
|
### availableLanguages
|
|
|
|
Returns a list of available languages/scripts.
|
|
|
|
```php
|
|
foreach((new TesseractOCR())->availableLanguages() as $lang) echo $lang;
|
|
```
|
|
|
|
__More info:__ <https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages-and-scripts>
|
|
|
|
### tessdataDir
|
|
|
|
Specify a custom location for the tessdata directory.
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->tessdataDir('/path')
|
|
->run();
|
|
```
|
|
|
|
### userWords
|
|
|
|
Specify the location of user words file.
|
|
|
|
This is a plain text file containing a list of words that you want to be
|
|
considered as a normal dictionary words by `tesseract`.
|
|
|
|
Useful when dealing with contents that contain technical terminology, jargon,
|
|
etc.
|
|
|
|
```
|
|
$ cat /path/to/user-words.txt
|
|
foo
|
|
bar
|
|
```
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->userWords('/path/to/user-words.txt')
|
|
->run();
|
|
```
|
|
|
|
### userPatterns
|
|
|
|
Specify the location of user patterns file.
|
|
|
|
If the contents you are dealing with have known patterns, this option can help
|
|
a lot tesseract's recognition accuracy.
|
|
|
|
```
|
|
$ cat /path/to/user-patterns.txt'
|
|
1-\d\d\d-GOOG-441
|
|
www.\n\\\*.com
|
|
```
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->userPatterns('/path/to/user-patterns.txt')
|
|
->run();
|
|
```
|
|
|
|
### lang
|
|
|
|
Define one or more languages to be used during the recognition.
|
|
A complete list of available languages can be found at:
|
|
<https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages>
|
|
|
|
__Tip from [@daijiale][]:__ Use the combination `->lang('chi_sim', 'chi_tra')`
|
|
for proper recognition of Chinese.
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->lang('lang1', 'lang2', 'lang3')
|
|
->run();
|
|
```
|
|
|
|
### psm
|
|
|
|
Specify the Page Segmentation Method, which instructs `tesseract` how to
|
|
interpret the given image.
|
|
|
|
__More info:__ <https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#page-segmentation-method>
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->psm(6)
|
|
->run();
|
|
```
|
|
|
|
### oem
|
|
|
|
Specify the OCR Engine Mode. (see `tesseract --help-oem`)
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->oem(2)
|
|
->run();
|
|
```
|
|
|
|
### dpi
|
|
|
|
Specify the image DPI. It is useful if your image does not contain this information in its metadata.
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->dpi(300)
|
|
->run();
|
|
```
|
|
|
|
### allowlist
|
|
|
|
This is a shortcut for `->config('tessedit_char_whitelist', 'abcdef....')`.
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->allowlist(range('a', 'z'), range(0, 9), '-_@')
|
|
->run();
|
|
```
|
|
|
|
### configFile
|
|
|
|
Specify a config file to be used. It can either be the path to your own
|
|
config file or the name of one of the predefined config files:
|
|
<https://github.com/tesseract-ocr/tesseract/tree/master/tessdata/configs>
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->configFile('hocr')
|
|
->run();
|
|
```
|
|
|
|
### setOutputFile
|
|
|
|
Specify an Outputfile to be used. Be aware: If you set an outputfile then
|
|
the option `withoutTempFiles` is ignored.
|
|
Tempfiles are written (and deleted) even if `withoutTempFiles = true`.
|
|
|
|
In combination with `configFile` you are able to get the `hocr`, `tsv` or
|
|
`pdf` files.
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->configFile('pdf')
|
|
->setOutputFile('/PATH_TO_MY_OUTPUTFILE/searchable.pdf')
|
|
->run();
|
|
```
|
|
|
|
### digits
|
|
|
|
Shortcut for `->configFile('digits')`.
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->digits()
|
|
->run();
|
|
```
|
|
|
|
### hocr
|
|
|
|
Shortcut for `->configFile('hocr')`.
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->hocr()
|
|
->run();
|
|
```
|
|
|
|
### pdf
|
|
|
|
Shortcut for `->configFile('pdf')`.
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->pdf()
|
|
->run();
|
|
```
|
|
|
|
### quiet
|
|
|
|
Shortcut for `->configFile('quiet')`.
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->quiet()
|
|
->run();
|
|
```
|
|
|
|
### tsv
|
|
|
|
Shortcut for `->configFile('tsv')`.
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->tsv()
|
|
->run();
|
|
```
|
|
|
|
### txt
|
|
|
|
Shortcut for `->configFile('txt')`.
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->txt()
|
|
->run();
|
|
```
|
|
|
|
### tempDir
|
|
|
|
Define a custom directory to store temporary files generated by tesseract.
|
|
Make sure the directory actually exists and the user running `php` is allowed
|
|
to write in there.
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->tempDir('./my/custom/temp/dir')
|
|
->run();
|
|
```
|
|
|
|
### withoutTempFiles
|
|
|
|
Specify that `tesseract` should output the recognized text without writing to temporary files.
|
|
The data is gathered from the standard output of `tesseract` instead.
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->withoutTempFiles()
|
|
->run();
|
|
```
|
|
|
|
### Other options
|
|
|
|
Any configuration option offered by Tesseract can be used like that:
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->config('config_var', 'value')
|
|
->config('other_config_var', 'other value')
|
|
->run();
|
|
```
|
|
|
|
Or like that:
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->configVar('value')
|
|
->otherConfigVar('other value')
|
|
->run();
|
|
```
|
|
|
|
__More info:__ <https://github.com/tesseract-ocr/tesseract/wiki/ControlParams>
|
|
|
|
### Thread-limit
|
|
|
|
Sometimes, it may be useful to limit the number of threads that tesseract is
|
|
allowed to use (e.g. in [this case](https://github.com/tesseract-ocr/tesseract/issues/898)).
|
|
Set the maxmium number of threads as param for the `run` function:
|
|
|
|
```php
|
|
echo (new TesseractOCR('img.png'))
|
|
->threadLimit(1)
|
|
->run();
|
|
```
|
|
|
|
## How to contribute
|
|
|
|
You can contribute to this project by:
|
|
|
|
* Opening an [Issue][] if you found a bug or wish to propose a new feature;
|
|
* Placing a [Pull Request][] with code that fix a bug, missing/wrong documentation
|
|
or implement a new feature;
|
|
|
|
Just make sure you take a look at our [Code of Conduct][] and [Contributing][]
|
|
instructions.
|
|
|
|
## License
|
|
|
|
tesseract-ocr-for-php is released under the [MIT License][].
|
|
|
|
|
|
<h2></h2><p align="center"><sub>Made with <sub><a href="#"><img src="https://thiagoalessio.github.io/tesseract-ocr-for-php/images/heart.svg" alt="love" width="14px"/></a></sub> in Berlin</sub></p>
|
|
|
|
[ci_badge]: https://github.com/thiagoalessio/tesseract-ocr-for-php/workflows/CI/badge.svg?event=push&branch=main
|
|
[ci]: https://github.com/thiagoalessio/tesseract-ocr-for-php/actions?query=workflow%3ACI
|
|
[appveyor_badge]: https://ci.appveyor.com/api/projects/status/xwy5ls0798iwcim3/branch/main?svg=true
|
|
[appveyor]: https://ci.appveyor.com/project/thiagoalessio/tesseract-ocr-for-php/branch/main
|
|
[codacy_badge]: https://app.codacy.com/project/badge/Grade/a81aa10012874f23a57df5b492d835f2
|
|
[codacy]: https://www.codacy.com/gh/thiagoalessio/tesseract-ocr-for-php/dashboard
|
|
[test_coverage_badge]: https://codecov.io/gh/thiagoalessio/tesseract-ocr-for-php/branch/main/graph/badge.svg?token=Y0VnrqiSIf
|
|
[test_coverage]: https://codecov.io/gh/thiagoalessio/tesseract-ocr-for-php
|
|
[stable_version_badge]: https://img.shields.io/packagist/v/thiagoalessio/tesseract_ocr.svg
|
|
[packagist]: https://packagist.org/packages/thiagoalessio/tesseract_ocr
|
|
[total_downloads_badge]: https://img.shields.io/packagist/dt/thiagoalessio/tesseract_ocr.svg
|
|
[monthly_downloads_badge]: https://img.shields.io/packagist/dm/thiagoalessio/tesseract_ocr.svg
|
|
[Tesseract OCR]: https://github.com/tesseract-ocr/tesseract
|
|
[Composer]: http://getcomposer.org/
|
|
[windows_icon]: https://thiagoalessio.github.io/tesseract-ocr-for-php/images/windows-18.svg
|
|
[macos_icon]: https://thiagoalessio.github.io/tesseract-ocr-for-php/images/apple-18.svg
|
|
[tesseract_installation_on_windows]: https://github.com/tesseract-ocr/tesseract/wiki#windows
|
|
[Capture2Text]: https://chocolatey.org/packages/capture2text
|
|
[Chocolatey]: https://chocolatey.org
|
|
[MacPorts]: https://www.macports.org
|
|
[Homebrew]: https://brew.sh
|
|
[@daijiale]: https://github.com/daijiale
|
|
[HOCR]: https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage#hocr-output
|
|
[TSV]: https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage#tsv-output-currently-available-in-305-dev-in-master-branch-on-github
|
|
[Issue]: https://github.com/thiagoalessio/tesseract-ocr-for-php/issues
|
|
[Pull Request]: https://github.com/thiagoalessio/tesseract-ocr-for-php/pulls
|
|
[Code of Conduct]: https://github.com/thiagoalessio/tesseract-ocr-for-php/blob/main/.github/CODE_OF_CONDUCT.md
|
|
[Contributing]: https://github.com/thiagoalessio/tesseract-ocr-for-php/blob/main/.github/CONTRIBUTING.md
|
|
[MIT License]: https://github.com/thiagoalessio/tesseract-ocr-for-php/blob/main/MIT-LICENSE
|