Missing figures in the parsed data #31

@mreza-kiani

I have noticed that some tasks, e.g., UID0056, cannot be answered without the diagrams in the source document.

Task:

question: Between the years of 1950 and 1990, in which year did U.S personal saving rates (measured as household saving as a percent of after-tax income) peak?

source_docs: https://fraser.stlouisfed.org/title/treasury-bulletin-407/september-1991-7066?page=30&deep=true

source_files: treasury_bulletin_1991_09.txt

To answer this question, the model should have access to the second diagram on page 30:

[Image: the second diagram on page 30]

However, I've noticed it's not available in the parsed data:

treasury_bulletin_1991_09.txt (content of page 30)
16

Profile of the Economy

June, the deficit totaled $235 billion, or $192 billion excluding outlays as part of the savings and loan situation. For the first 9 months of fiscal 1991, the deficit was $177 billion, compared with about $163 billion a year earlier.

FEDERAL OUTLAYS AND RECEIPTS AS A SHARE OF GROSS NATIONAL PRODUCT

FISCAL YEARS

The Federal budget outlay share of GNP averaged approximately 19 percent during the earlier postwar years, then rose to 23 percent in the 1980s. It is projected to reach a postwar high of 25 percent in fiscal 1992, including spending to deal with the savings and loan situation. The share declines to 20.2 percent by 1996, based on budget projections. Receipts were equal to 19.1 percent of GNP in fiscal 1990, and are projected to stay at 19.1 percent in the current fiscal year and to rise to 19.4 percent by 1996.

PERSONAL SAVING

Household Saving as a Percent of After-Tax Income, Through First Half 1991

The personal saving rate rose from a post-Depression low of 2.9 percent in 1987 to 4.6 percent in both 1989 and 1990, but remained well below the 6.7-percent long-term average. Saving appeared to be rising in early 1990, averaging 4.9 percent in the first half of the year. However, in the second half it dropped to only 4.2 percent as the slowing economy and increasing inflation reduced real incomes. The rate dipped to 3.7 percent in the second quarter of 1991, allowing only a 4-percent average for the first half of the year.
treasury_bulletin_1991_09.json (page 30)

{
  "document": {
    "elements": [
      ...
      {
        "bbox": [{ "coord": [41, 44, 96, 82], "page_id": 30 }],
        "content": "16",
        "description": null,
        "id": 0,
        "type": "page_number"
      },
      {
        "bbox": [{ "coord": [488, 86, 967, 144], "page_id": 30 }],
        "content": "Profile of the Economy",
        "description": null,
        "id": 1,
        "type": "title"
      },
      {
        "bbox": [{ "coord": [53, 188, 1403, 259], "page_id": 30 }],
        "content": "June, the deficit totaled $235 billion, or $192 billion excluding outlays as part of the savings and loan situation. For the first 9 months of fiscal 1991, the deficit was $177 billion, compared with about $163 billion a year earlier.",
        "description": null,
        "id": 2,
        "type": "text"
      },
      {
        "bbox": [{ "coord": [332, 294, 1181, 390], "page_id": 30 }],
        "content": "FEDERAL OUTLAYS AND RECEIPTS AS A SHARE OF GROSS NATIONAL PRODUCT",
        "description": null,
        "id": 3,
        "type": "section_header"
      },
      {
        "bbox": [{ "coord": [182, 405, 1269, 903], "page_id": 30 }],
        "content": null,
        "description": null,
        "id": 4,
        "type": "figure"
      },
      {
        "bbox": [{ "coord": [639, 921, 864, 965], "page_id": 30 }],
        "content": "FISCAL YEARS",
        "description": null,
        "id": 5,
        "type": "text"
      },
      {
        "bbox": [{ "coord": [46, 996, 1405, 1120], "page_id": 30 }],
        "content": "The Federal budget outlay share of GNP averaged approximately 19 percent during the earlier postwar years, then rose to 23 percent in the 1980s. It is projected to reach a postwar high of 25 percent in fiscal 1992, including spending to deal with the savings and loan situation. The share declines to 20.2 percent by 1996, based on budget projections. Receipts were equal to 19.1 percent of GNP in fiscal 1990, and are projected to stay at 19.1 percent in the current fiscal year and to rise to 19.4 percent by 1996.",
        "description": null,
        "id": 6,
        "type": "text"
      },
      {
        "bbox": [{ "coord": [554, 1158, 935, 1211], "page_id": 30 }],
        "content": "PERSONAL SAVING",
        "description": null,
        "id": 7,
        "type": "text"
      },
      {
        "bbox": [{ "coord": [290, 1229, 1212, 1267], "page_id": 30 }],
        "content": "Household Saving as a Percent of After-Tax Income, Through First Half 1991",
        "description": null,
        "id": 8,
        "type": "text"
      },
      {
        "bbox": [{ "coord": [179, 1269, 1259, 1692], "page_id": 30 }],
        "content": null,
        "description": null,
        "id": 9,
        "type": "figure"
      },
      {
        "bbox": [{ "coord": [41, 1725, 1406, 1847], "page_id": 30 }],
        "content": "The personal saving rate rose from a post-Depression low of 2.9 percent in 1987 to 4.6 percent in both 1989 and 1990, but remained well below the 6.7-percent long-term average. Saving appeared to be rising in early 1990, averaging 4.9 percent in the first half of the year. However, in the second half it dropped to only 4.2 percent as the slowing economy and increasing inflation reduced real incomes. The rate dipped to 3.7 percent in the second quarter of 1991, allowing only a 4-percent average for the first half of the year.",
        "description": null,
        "id": 10,
        "type": "text"
      },
      ...
    ],
    "pages": [
      ...
      { "id": 30, "image_uri": null },
      ...
    ]
  },
  "error_status": null,
  "metadata": {}
}
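
To check how widespread this is, something like the following could scan a parsed file for figure elements that carry neither content nor a description. It is only a sketch and assumes the schema shown in the excerpt above (`document.elements`, `type`, `content`, `description`, `bbox[].page_id`); `find_empty_figures` and the `sample` dict are hypothetical names, not part of the dataset tooling.

```python
from typing import Any, Dict, List, Tuple


def find_empty_figures(parsed: Dict[str, Any]) -> List[Tuple[int, int]]:
    """Return (page_id, element_id) for each figure with no content and no description."""
    missing = []
    for element in parsed.get("document", {}).get("elements", []):
        if element.get("type") != "figure":
            continue
        # A figure that has either field populated is still usable by a text-only model.
        if element.get("content") or element.get("description"):
            continue
        for box in element.get("bbox", []):
            missing.append((box.get("page_id"), element.get("id")))
    return missing


# Trimmed-down stand-in for the page 30 excerpt above (assumed shape, not the full file).
sample = {
    "document": {
        "elements": [
            {"bbox": [{"coord": [332, 294, 1181, 390], "page_id": 30}],
             "content": "FEDERAL OUTLAYS AND RECEIPTS AS A SHARE OF GROSS NATIONAL PRODUCT",
             "description": None, "id": 3, "type": "section_header"},
            {"bbox": [{"coord": [182, 405, 1269, 903], "page_id": 30}],
             "content": None, "description": None, "id": 4, "type": "figure"},
            {"bbox": [{"coord": [179, 1269, 1259, 1692], "page_id": 30}],
             "content": None, "description": None, "id": 9, "type": "figure"},
        ]
    }
}

print(find_empty_figures(sample))
```

On a real file, `parsed` would be the result of `json.load` on the corresponding `.json` file.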

I’m wondering how the LLM with Oracle Parsed PDF Page(s) and Pre-parsed Full Corpus evaluations were run. Did the model see only the parsed data, or did it also have access to the no-OCR PDF version?

I’m also wondering why the figures weren’t embedded in the parsed data as base64, or why placeholders for figures (like [Stripped figure]) weren’t included in the compiled .txt format, so that the model would at least know that figures exist on those pages.
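
To make the placeholder idea concrete, here is a rough sketch of how a page could be compiled to text while marking stripped figures. I don't know how the .txt files are actually generated, so `compile_page_text`, `FIGURE_PLACEHOLDER`, and the blank-line separator are all assumptions; only the element fields come from the JSON above.

```python
from typing import Any, Dict, List

FIGURE_PLACEHOLDER = "[Stripped figure]"  # hypothetical marker text


def compile_page_text(elements: List[Dict[str, Any]], page_id: int) -> str:
    """Join the text of one page, emitting a placeholder for figures without content."""
    lines = []
    for element in elements:
        on_page = any(box.get("page_id") == page_id for box in element.get("bbox", []))
        if not on_page:
            continue
        content = element.get("content")
        if content:
            lines.append(content)
        elif element.get("type") == "figure":
            # Keep a visible marker so the reader knows a figure was here.
            lines.append(FIGURE_PLACEHOLDER)
    return "\n\n".join(lines)


# Small stand-in for the element list shown above (assumed shape).
sample_elements = [
    {"bbox": [{"page_id": 30}], "content": "Profile of the Economy", "type": "title"},
    {"bbox": [{"page_id": 30}], "content": None, "type": "figure"},
    {"bbox": [{"page_id": 30}], "content": "FISCAL YEARS", "type": "text"},
    {"bbox": [{"page_id": 31}], "content": "Text from another page", "type": "text"},
]

print(compile_page_text(sample_elements, 30))
```

With a marker like this in the compiled text, a model could at least tell that the answer may depend on a figure it cannot see.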

I’d appreciate it if you could clarify these points. Thank you.
