Doc determination #210

Open
rdraviam wants to merge 32 commits into develop from doc-determination

Conversation

Collaborator

rdraviam commented May 27, 2026

Integration of new models:

  • Added two new transformer types: TransformersDocumentLevelClassifier and TransformersSplittingClassifier. (Renamed TransformersDocumentClassifier -> TransformersPageLevelClassifier.)
  • Added model classes: LayoutLMv3DocumentClassifier & LayoutLMv3DocumentSplitter. Converted to huggingface format (c533a53)
  • Use preprocessor_config.json from the model directory if it exists
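
The fallback in the last bullet could look roughly like the sketch below. It assumes the file is plain JSON; the helper name `load_preprocessor_config` and the `defaults` argument are hypothetical and are not taken from this PR.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def load_preprocessor_config(
    model_dir: str, defaults: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Return settings from <model_dir>/preprocessor_config.json if present.

    Hypothetical helper: when the file is absent, only `defaults` is returned.
    """
    config = dict(defaults or {})
    config_path = Path(model_dir) / "preprocessor_config.json"
    if config_path.is_file():
        # Values from the model directory override the built-in defaults.
        with config_path.open("r", encoding="utf-8") as fh:
            config.update(json.load(fh))
    return config
```

Merging over defaults, rather than replacing them, keeps any setting the file does not mention at its default value.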

Pipeline/Executor:

  • New Executor: DocDeterminationPipelineExecutor
    • Rotation: Added as endpoint.
      • Also added as a pipeline option (not used)
    • Splitting
    • Improved page filtering. Configurable in pipeline definition and overridable in features request.
      • Can specify page sets in addition to page lists.
        • E.g. {0: [1,2,3], 1: [4,6,7]} produces two classification results for that classification group in the final metadata.
      • Filter on previous classification results. Methods: Include, Exclude, Split. Example:
          - method: split
            type: classification
            group: split-boundary-group
            classifications: ["1"]
        
    • Collation node
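
One way to read the page-set and filter bullets above is sketched here. The semantics are assumptions, not the executor's actual behavior: a flat page list is treated as a single page set keyed by 0, and `split` starts a new page set at each page whose earlier classification matches. The function names `normalize_page_sets` and `filter_pages` are hypothetical.

```python
from typing import Dict, List, Sequence, Union

PageSelection = Union[Sequence[int], Dict[int, Sequence[int]]]


def normalize_page_sets(selection: PageSelection) -> Dict[int, List[int]]:
    """Treat a flat page list as a single page set keyed by 0."""
    if isinstance(selection, dict):
        return {key: list(pages) for key, pages in selection.items()}
    return {0: list(selection)}


def filter_pages(
    method: str,
    page_labels: Dict[int, str],
    classifications: Sequence[str],
) -> Dict[int, List[int]]:
    """Build page sets from earlier per-page labels (assumed semantics).

    include: keep pages whose label is listed.
    exclude: drop pages whose label is listed.
    split:   start a new page set at each page whose label is listed.
    """
    wanted = set(classifications)
    pages = sorted(page_labels)
    if method == "include":
        return {0: [p for p in pages if page_labels[p] in wanted]}
    if method == "exclude":
        return {0: [p for p in pages if page_labels[p] not in wanted]}
    if method == "split":
        page_sets: Dict[int, List[int]] = {}
        current = -1
        for p in pages:
            # The first page always opens a set, even if it is not a boundary.
            if page_labels[p] in wanted or current < 0:
                current += 1
                page_sets[current] = []
            page_sets[current].append(p)
        return page_sets
    raise ValueError(f"unknown filter method: {method!r}")
```

With labels from a boundary classifier such as `{0: "1", 1: "0", 2: "0", 3: "1", 4: "0"}` and `classifications=["1"]`, the `split` branch yields two page sets, so a downstream classification group would report two results, in line with the page-set example in the description.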

Additional:

Collaborator

@rsteele5 rsteele5 left a comment

NOTE: All comments listed as MINOR are recommendations, not required changes.

Comment thread marie/components/document_classifier/transformers.py Outdated
Comment thread marie/components/document_classifier/transformers.py
Comment thread marie/executor/pipeline/doc_determination_pipeline_executor.py Outdated
Comment thread marie/executor/pipeline/doc_determination_pipeline_executor.py Outdated
Comment thread marie/executor/pipeline/doc_determination_pipeline_executor.py Outdated
Comment thread marie/pipe/llm_pipeline.py Outdated
Comment thread marie/pipe/llm_pipeline.py Outdated
Comment thread marie/pipe/llm_pipeline.py
Comment thread marie/scheduler/repository/job_repository.py Outdated
Comment thread tests/integration/components/test_document_page_level_classifier.py
Comment thread marie/utils/image_utils.py Outdated
Comment on lines +367 to +382
# yappi.start()
enumerated_frames = tqdm(
    enumerate(frames), total=len(frames), desc="page rotation", unit="frame"
)
for i, frame in enumerated_frames:
    results = detect_page_rotation(frame)
    if rotation := results.get("rotate", 0):
        logger.info(f"Rotating frame {i} by {rotation} degrees")
        frames[i] = np.rot90(frame, k=-rotation // 90)
        changed, resized = ensure_max_page_size([frames[i]])
        if changed:
            logger.info(f"Resized frame {i} after rotation")
            frames[i] = resized[0]
        any_rotated = True
    page_rotation[i] = results
# yappi.stop()
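
For reference, the `k` argument in the snippet above can be checked in isolation. This is a standalone sketch, not code from the PR; `quarter_turns` is a hypothetical name that only mirrors the expression `-rotation // 90`. In `np.rot90`, a negative `k` rotates clockwise.

```python
def quarter_turns(rotation: int) -> int:
    """Same arithmetic as `k=-rotation // 90` in the snippet above.

    Unary minus binds tighter than `//`, so this is (-rotation) // 90.
    """
    return -rotation // 90


# Multiples of 90 map to whole clockwise quarter turns (negative k).
for degrees in (0, 90, 180, 270):
    print(degrees, quarter_turns(degrees))

# Floor division rounds toward negative infinity, so a non-multiple
# such as 45 still produces one full quarter turn.
print(45, quarter_turns(45))
```

The expression therefore appears to rely on `detect_page_rotation` reporting only multiples of 90; if other values are possible, an explicit check before rotating may be worth considering.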
Collaborator

Remove the commented-out yappi calls (# yappi.start() and # yappi.stop()).

Comment thread marie/executor/pipeline/doc_determination_pipeline_executor.py
Comment thread marie/executor/pipeline/doc_determination_pipeline_executor.py
rdraviam force-pushed the doc-determination branch 2 times, most recently from efff7ea to d6236ab (June 2, 2026 18:28)
rdraviam force-pushed the doc-determination branch from 388fef5 to 9ec2fe0 (June 3, 2026 18:35)
rdraviam force-pushed the doc-determination branch 2 times, most recently from 5d3a6e8 to dd8dce0 (June 3, 2026 22:03)
rdraviam force-pushed the doc-determination branch from 71bf3b1 to ce39d9b (June 5, 2026 17:19)
rdraviam force-pushed the doc-determination branch from ce39d9b to eaf8969 (June 5, 2026 17:23)

2 participants