[REVISION] Chapter: 3 - Images and Documents as Data #105

@srearl

What guide are you proposing revisions for?

  • EML Best Practices guide
  • Dataset Design For Special Cases
  • Domain Specific Guides

Chapter and Revision Information

Chapter Number: 3
Chapter Title: Images and Documents as Data
Current Version/Commit: Version 2 prerelease
File Path: guide-special-cases/images-and-docs.qmd
Reviewer(s): D. Bahauddin, S. Earl (GC), H. Krumbholz, I. Mohandas
Review Date: 2025-09-08

Revision overview

The reviewers were happy overall with the state of this chapter. They felt that the topic is reasonably straightforward and that the chapter generally covers it well. That said, they identified a few major issues that warrant attention: whether and how annotation examples are presented, nuances surrounding OCR, and how example datasets are curated. Regarding the latter, there was some uncertainty as to whether the example datasets actually reflect the topic of the chapter. Moreover, datasets are ephemeral, so pointing to specific datasets may not be sustainable. The reviewers suggest that the authors instead consider one or more dedicated dummy datasets that would serve as companions to the document and could be referenced without concern that they will become unavailable or irrelevant. Otherwise, the reviewers offer a few minor suggestions regarding formatting, naming, and broken links.

Detailed Content Feedback

Major Issues

Identify and describe any major content revisions needed for the chapter. Please identify by section header or line numbers, and itemize multiple proposed changes by adding to the numbered list below.

  1. Issue: Annotation example demands more context
  • Location: Documenting data packages/Ecological Metadata Language
  • Current text: …the annotation example…
  • Suggested change: This will depend on how annotations are treated in the document generally. If annotations are covered thoroughly elsewhere, this example could link to and draw upon those resources. Otherwise, the example should be reconfigured as a resource (tabular?) that provides annotation details for a richer set of data types relevant to this chapter.
  • Rationale: Incorporating annotations into EML remains a nebulous process, both technically and conceptually, and the topic warrants ample consideration in this best practices document generally. As it stands, the example provided may be more confusing than helpful. Am I expected to include an annotation for each data resource? What annotation do I use if I am not documenting an image or photograph? How do I find annotations, and how do I locate the appropriate annotation resource? Is there a relationship among annotations for the data type and subject matter?
  2. Issue: Suggestion to use OCR to clarify text may add confusion
  • Location: Resources/Optical Character Recognition (OCR)
  • Current text: Optical Character Recognition (OCR): When digitizing documents that include text, we recommend using scanning or other software with OCR capabilities (e.g., Adobe, ABBYY, Tesseract) to convert the text into machine readable characters so that the documents are searchable and thus, more usable. OCR does not work well for handwritten text, older fonts, or documents with busy backgrounds (speckled, dirty, faded, etc.).
  • Suggested change: Stress that both the input (source) and the OCR-processed products should be included in the data package. That a product is OCR-processed should also be stated clearly, at least in the entity description and elsewhere in the EML where relevant. Regarding workflow, the data creator should be involved in the OCR process where possible, to provide context and check for errors. The reviewers were unable to recommend an output format for OCR-processed products, but this should be considered.
  • Rationale: While powerful, OCR introduces additional considerations. First, there are now two products: the source document and the OCR-processed document. Should only the latter be included and documented, or both? Second, OCR is not perfect and will almost certainly introduce errors on anything less than perfectly clean, crisp text. How does one minimize and accommodate OCR-generated errors?
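
To make the OCR suggestion concrete, the following is a minimal Python sketch of how a data manager might check that every OCR-processed file in a package is accompanied by its source scan, and draft an entity description that states the OCR provenance. The `_ocr` file-name suffix, the function names, and the wording of the description are all hypothetical; none of them come from the chapter or from EML itself.

```python
from pathlib import PurePath

# Hypothetical naming convention: OCR products share the source file's
# stem plus this suffix (e.g., notebook_1987.tif -> notebook_1987_ocr.pdf).
OCR_SUFFIX = "_ocr"


def pair_ocr_products(filenames):
    """Map each OCR product to its source file, or to None if no source is present."""
    # Index non-OCR files by stem so sources can be looked up quickly.
    sources_by_stem = {}
    for name in filenames:
        stem = PurePath(name).stem
        if not stem.endswith(OCR_SUFFIX):
            sources_by_stem.setdefault(stem, name)

    pairs = {}
    for name in sorted(set(filenames)):
        stem = PurePath(name).stem
        if stem.endswith(OCR_SUFFIX):
            source_stem = stem[: -len(OCR_SUFFIX)]
            pairs[name] = sources_by_stem.get(source_stem)
    return pairs


def describe_ocr_entity(ocr_name, source_name, tool):
    """Draft text for an entity description that flags the file as OCR-processed."""
    return (
        f"{ocr_name}: text generated by OCR ({tool}) from the source "
        f"document {source_name}; may contain recognition errors."
    )


if __name__ == "__main__":
    files = ["notebook_1987.tif", "notebook_1987_ocr.pdf", "map_ocr.pdf"]
    for ocr_file, source in pair_ocr_products(files).items():
        if source is None:
            print(f"WARNING: {ocr_file} has no source document in the package")
        else:
            print(describe_ocr_entity(ocr_file, source, "Tesseract"))
```

A check like this only enforces the pairing; it does not address error checking of the OCR text itself, which still requires review by the data creator.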

Minor Issues

Identify and describe minor content issues that should be revised. Please identify by section header or line numbers.

  • Introduction: many broken links.
  • Documenting data packages/Ecological Metadata Language: the reference to EML Best Practices (currently v3) has a broken link. In addition, references should point only to the current BP document.
  • Documenting data packages/Data Inventory Table: clarify that the data inventory table itself would also need to be documented in the EML as a data resource.
  • Line 22: the reference (see Table 4.1) should be Table 1.
  • Line 30: the reference (e.g., see Example 4.1) should be Example 1.
  • Ecological Metadata Language/Data Inventory Table/Table 1: the presentation of this table is somewhat unclear. On one hand, the text seems to suggest that these are examples of attributes that should be included in a data inventory table; on the other, that this IS the structure of a data inventory table that should be used. This needs to be clearer. If the latter, consider making it a copyable/downloadable resource.
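
If the table is intended as a structure to be reused, a copyable resource could be as simple as a short script that generates the inventory. The Python sketch below is one possible illustration: it builds inventory rows from file contents and serializes them to CSV. The column names (`file_name`, `size_bytes`, `sha256`) are placeholders chosen for this sketch and are not the attributes listed in Table 1; the authors would substitute whichever attributes the chapter settles on.

```python
import csv
import hashlib
import io

# Placeholder columns for illustration only; not the attributes in Table 1.
FIELDNAMES = ["file_name", "size_bytes", "sha256"]


def inventory_rows(files):
    """Build one inventory row per file from a mapping of file name -> bytes."""
    rows = []
    for name in sorted(files):
        data = files[name]
        rows.append(
            {
                "file_name": name,
                "size_bytes": len(data),
                "sha256": hashlib.sha256(data).hexdigest(),
            }
        )
    return rows


def inventory_csv(rows):
    """Serialize inventory rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # In practice the bytes would be read from the image or document files.
    example = {"plot_02.jpg": b"example bytes", "plot_01.jpg": b""}
    print(inventory_csv(inventory_rows(example)))
```

The resulting CSV is itself a file in the package, which is why it would need its own entry in the EML, as noted above.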

Missing Content

Describe new content that should be added to the chapter. Include a suggested location for the addition, and if suggesting text include it in quotes.

  • File size is addressed in the Balance file size and number of files section of Considerations for data package structure, but additional information, such as maximum file size, would be helpful. This issue is not unique to this chapter, so a clear reference to where it is addressed elsewhere in the document would be sufficient (but important!).

Structure and Formatting

Describe any necessary restructuring or formatting changes for the chapter. This may include comments about organization, section/heading hierarchies, formatting consistency, use of figures/tables, citations and references, etc.

  • The tables can be difficult to read when the text in different cells is not easily distinguished. The reviewers suggest formatting all tables with subtle column/row borders to improve readability.

Labels: Chapter revision, Committee review needed, Design for Special Cases