Skip to content

IntChron parser #115

Open
Open
@joeroe

Description

@joeroe

IntChron <https://intchron.org/> is listed in #2 as not being included because it requires web scraping. However, after spending some time playing with it, I wonder if this might be revisited.

Essentially, IntChron seems to do the same thing as c14bazAAR—systematically compile dates from existing databases—with a web-based API. An IntChron parser would be more complicated than the existing parsers because, as far as I can tell, there is no way to extract the entire database as a single file. But it should still be possible to get it without resorting to web scraping. The key is that every HTML page on IntChron can also be accessed in csv, json, or txt format. This includes the "index" pages that eventually lead you to individual date records. It think it could be worth the extra complexity for IntChron because it does seem to include a lot of dates (for example the entire ORAU database) and it's backed by the Oxford C14 Lab so it's likely to grow over time.

I can think of a few ways you could approach this, depending on how much flexibility you want to give the user. At the simplest, one could implement a multi-stage parser in c14bazAAR:

  1. Retrieve the list of "hosts" (https://intchron.org/host.csv)
  2. Retrieve the list of records-by-country for each host (e.g. https://intchron.org/oxa/record.csv)
  3. Retrieve the list of sites for each country (e.g. https://intchron.org/record/oxa/Jordan.csv)
  4. Retrieve the list of dates for each site (e.g. https://intchron.org/record/oxa/Jordan/Dhuweila.csv)
  5. Parse and collate the dates (actually quite easy because the IntChron format is similar to c14bazAAR's)

On the other end of the spectrum, one could write an R interface to IntChron as its own package, which c14bazAAR could then use as a dependency to retrieve either the entire database or a user-specific subset. That could be worthwhile if the IntChron standard does become widely used, but as things stand I'm not sure that it's worth the extra effort.

I'd be happy to put some work into this, but I thought I would first raise the issue and ask whether you think it is something that fits into c14bazAAR, and what the best approach to doing it might be.

Metadata

Metadata

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions