
Commit ca721fd

Merge pull request #73 from companieshouse/feature/readme-update
readme update
2 parents 52b0205 + 884e9dc commit ca721fd

2 files changed, +26 -5 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@
 hs_err_pid*

 # IDE files
+.idea
 .classpath
 .project
 .settings

README.md

Lines changed: 25 additions & 5 deletions
@@ -1,14 +1,30 @@
-# `ocr-api`
+s-# `ocr-api`

 A microservice to extract text from images. This uses Tess4J, which is itself a small (Java Native Access) wrapper around Tesseract. As well as returning the extracted text, some metadata relating to this service is also returned [data returned](src/main/java/uk/gov/companieshouse/ocr/api/image/extracttext/ExtractTextResultDto.java).

 The `ocr-api` has one thread pool (with a blocking queue) that protects the system from being overloaded (implemented by a ThreadPoolTaskExecutor). In the normal running of this microservice this queue should have very few entries on it.

 Supported image types: TIFF

+## TL;DR for updating for dependency changes
+
+This project has not had any significant changes since its release in 2021 but needs updates to its dependencies for security
+fixes. This needs testing within a Docker volume.
+
+We are also now updating how we deploy it (see second confluence document below). Until this is done you should run tests for
+OCR conversion locally against a newly downloaded docker image or one that you have created yourself.
+
+## Confluence Documentation
+
+- [System overview for live running](https://companieshouse.atlassian.net/wiki/spaces/IncVal/pages/2699755729/OCR+Service+Live)
+- [Migration from EC2 to Fargate - WIP](https://companieshouse.atlassian.net/wiki/spaces/IncVal/pages/3067346945/Automated+builds+of+the+ocr-api+to+staging+and+live) ** MUST READ UNTIL WE COMPLETE
+- [Testing in AWS](https://companieshouse.atlassian.net/wiki/spaces/IncVal/pages/3396206692/Environments+and+Testing) -
+Testing in AWS
+MIGRATION.
+
 ## Call Types

-### Asynchronous
+### Asynchronous (CHIPS usage)

 Endpoint = `[server address]/ocr-api/api/ocr/image/tiff/extractTextRequest`

@@ -18,7 +34,7 @@ The request to the controller is first vetted and then handed off to an asynchro
 - Convert the image to text,
 - Send the results back via a callback URL provided in the OCR Request.

-### Synchronous
+### Synchronous (Automated test usage)

 Endpoint = `[server address]/ocr-api/api/ocr/image/tiff/extractText`
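
The two call types differ only in the last path segment, so the endpoint can be built from the server address. The sketch below is a minimal illustration, not part of `ocr-api`: the function names `ocr_endpoint` and `ocr_extract_text_sync` are hypothetical, and the form fields (`file`, `responseId`, `contextId`) are assumed from the "With Context ID" curl example in the "Using CHS docker" hunk of this diff.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: the names are illustrative and not part of ocr-api.

# Print the endpoint URL for a call type ("sync" or "async").
ocr_endpoint() {
  local server="${1%/}"   # strip one trailing slash, if present
  local call_type="$2"
  local base="${server}/ocr-api/api/ocr/image/tiff"
  case "$call_type" in
    sync)  printf '%s\n' "${base}/extractText" ;;
    async) printf '%s\n' "${base}/extractTextRequest" ;;
    *)     printf 'unknown call type: %s\n' "$call_type" >&2; return 1 ;;
  esac
}

# POST a TIFF to the synchronous endpoint as multipart form data.
# Field names are assumed from the README's curl example.
ocr_extract_text_sync() {
  local server="$1"
  local tiff="$2"
  local response_id="$3"
  local context_id="$4"
  local url
  url="$(ocr_endpoint "$server" sync)" || return 1
  curl -F "file=@${tiff}" \
       -F "responseId=${response_id}" \
       -F "contextId=${context_id}" \
       "$url"
}
```

With a CHS docker environment running, a call might look like `ocr_extract_text_sync http://api.chs.local src/test/resources/sample-articles-of-association.tif "curl test response id" SAMPLE_ARTICLES`. The asynchronous request body is not shown in this diff, so no wrapper for it is sketched here.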

@@ -59,6 +75,7 @@ To activate this project in development mode, run the following command before r
 - Run `chs-dev development enable ocr-api`

 The ocr-api should be accessible via http://api.chs.local/ocr-api/
+
 ## Tesseract Training data

 This is used by the Tesseract engine to help with text recognition. We store the currently used data within configuration management for consistency and speed of the docker build.
@@ -149,9 +166,12 @@ curl -w '%{http_code}' http://localhost:8080/ocr-api/statistics
 ## Using CHS docker

 ``` bash
-curl --noproxy '*' http://api.chs.local/ocr-api/healthcheck
+curl http://api.chs.local/ocr-api/healthcheck

-curl --noproxy '*' http://api.chs.local/ocr-api/statistics
+curl http://api.chs.local/ocr-api/statistics
+
+# With Context ID
+curl -F file=@"src/test/resources/sample-articles-of-association.tif" -F responseId="curl test response id" -F contextId="SAMPLE_ARTICLES" http://api.chs.local/ocr-api/api/ocr/image/tiff/extractText

 curl --noproxy '*' -w '%{http_code}' --header "Content-Type: application/json" \
 --request POST \
