
Principle: Write only one algorithm to accomplish a task. #562

Draft · wants to merge 3 commits into main

Conversation

@jyasskin (Contributor) commented Mar 7, 2025

This explains why and when "polyglot" formats are a bad idea.

Fixes #239.

There's some overlap between this and the preceding section, Resolving tension between interoperability and implementability. Do y'all think it's ok, or are there bits we could refactor together?

I'd also like to give an example of parsing divergence yielding security bugs, but I didn't have any readily available. Ideas?


@jyasskin jyasskin requested review from hober and csarven March 7, 2025 00:52
index.bs Outdated
@@ -3488,6 +3505,52 @@ While the best path forward may be to choose not to specify the feature,
there is the risk that some implementations
may ship the feature as a nonstandard API.

<h3 id="multiple-algorithms">Write only one algorithm to accomplish a task</h3>
Member
Maybe "goal" instead of "task"? This immediately made me think of the event loop.


When specifying how to accomplish a task, write a single algorithm to do it,
instead of letting implementers pick between multiple algorithms.
It is very difficult to ensure that
Contributor
Suggested change
It is very difficult to ensure that
It is very difficult to ensure that

index.bs Outdated
two different algorithms produce the same results in all cases,
and doing so is rarely worth the cost.

Multiple algorithms seem particularly tempting when defining
Contributor

I don't think that you need this paragraph as long as an example mentions a file format.

index.bs Outdated
using either the [[HTML#the-xhtml-syntax|XHTML parsing]]
or [[HTML#syntax|HTML parsing]] algorithm.
Authors who tried to use this syntax tended to produce documents
that actually only worked with one of the two parsers.
Contributor
Suggested change
that actually only worked with one of the two parsers.
that only worked with one of the two parsers.


Note: While [[rfc6838#section-6|structured suffixes]] define that
a document can be parsed in two different ways,
they do not violate this rule because the results have different data models.
Contributor
Isn’t the real difference here that the suffix parsing produces an intermediate result?

I suspect that this is still insufficient, because it doesn't really get at why suffix parsers exist. That is still somewhat contested, but my view is that intermediate results can rarely be processed meaningfully, so they are limited to use in diagnostic tools and the like.

@msporny left a comment
Hmm, the principle seems too blunt to be useful. Some high-level thoughts to start; I'm still trying to think about what text would be useful:

  • Polyglot, as a term, is wrong -- this is not about a single system interpreting a data serialization using different algorithms. It's more about an ecosystem interpreting that same data serialization using different algorithms (which is useful, more on that below).
  • Yes, there are cases where this resulted in bad outcomes -- XHTML/HTML is a good example.
  • The comparison between VCDM and SD-JWT-VC is totally wrong, they're two totally different data models, using two totally different serializations, using two totally different algorithms -- and there are a number of us that think that whole thing is a massive standardization failure, so using that as an example of the right way to do something is not what we want to do. The only thing they have in common is the phrase "Verifiable Credential", and even that is being objected to by some of us.
  • The multiple suffixes thing is also contested -- in the IETF MEDIAMAN WG, we couldn't find broad-scale usage of suffix-based processing, what @martinthomson is saying is important here. I'll add that the suffix-based processing is also not a clear example of why this principle is good or bad.

Fundamentally, the principle seems misguided. Yes, at some level one data format and one algorithm is a good thing. However, what a traditional web crawler gets out of a web page is different from what a browser parsing the web page works with is different from what a frontier AI model gets out of a web page. The algorithms that each uses are quite different and useful and this principle seems to be arguing against that.

I think the only solid ground here is the XHTML/HTML example. You're going to get push back on the other items being mentioned if they continue to be mentioned in the way the current PR is written up.

I'll try to think of some constructive text, but wanted to get some preliminary thoughts down in an effort to help shape the PR into something more easily defensible.

@filip26 commented Mar 7, 2025

Algorithms + Data Structures = Programs (Niklaus Wirth).

It’s rational to avoid having two algorithms performing the same function, especially when considering costs like time and space complexity, and to recommend the one that best fits the criteria. However, if this change is based on the assumption:

use either JSON or JSON-LD to parse bytes into their data models.

then there is a misunderstanding of the basics of algorithmics. JSON and JSON-LD have different data models, as noted. They involve different data structures in the equation at the top, which means different algorithms are needed because they operate on different data structures.

From this perspective, calling for one algorithm to operate on different data structures does not make sense.

My recommendation would be to use a different argument when advocating for a single algorithm - considering factors such as time complexity, space complexity, etc.
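The point that JSON and JSON-LD involve different data structures can be illustrated with a deliberately simplified sketch. The `expand` helper below is a hypothetical stand-in for what a real JSON-LD processor (e.g. pyld) does during expansion; it only shows that the same bytes yield two differently shaped results.

```python
import json

raw = b'{"@context": {"name": "http://schema.org/name"}, "name": "Alice"}'

# Interpretation 1: plain JSON -- a tree of dicts, lists, and strings.
as_json = json.loads(raw)

# Interpretation 2: JSON-LD -- the @context maps terms to IRIs, producing
# a graph-oriented data model. This toy expansion handles only the
# trivial case of flat term-to-IRI mappings.
def expand(doc: dict) -> dict:
    ctx = doc.get("@context", {})
    return {ctx.get(key, key): value
            for key, value in doc.items() if key != "@context"}

as_jsonld = expand(as_json)

print(as_json["name"])                      # keyed by the literal term
print(as_jsonld["http://schema.org/name"])  # keyed by the expanded IRI
```

Same serialization, two data structures: the plain-JSON tree keeps the short term `name`, while the (sketched) JSON-LD view keys the value by a full IRI.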

@martinthomson (Contributor) commented

I don't agree with Manu about this being misguided. The point here is that the same HTML document is not seeking to express multiple distinct sets of semantics depending on how it is processed: there is just one HTML with one interpretation, and one data model that both the producer of the content and the consumer of the content can agree on. If they disagree, that is likely because one or the other is wrong.

This is because there is just a single specification for HTML and a single way to interpret HTML content according to that specification.

Obviously, what someone does once they have received HTML might differ, but those differences relate not to how the HTML itself was interpreted, but to how the content at the next layer (that is, the words and images and stuff like that) is interpreted. Sure, a human and an AI model will seek to do different things with the information they are presented with, but the interpretation is singular.

Where CID struggles a little is that there are two paths to the same interpretation. It manages that by giving implementations a choice and promising that the outcome will be the same either way. It's bad, because now there is a third place where a bug can result in a different interpretation (producer, consumer, and now spec), but it's not fundamentally a polyglot in the sense that there are multiple divergent interpretations possible.

The core message is that having divergent paths is undesirable. And yes, that means saying that seeking to have a pure-JSON vs. a JSON-LD interpretation of the same content is a bad idea, because divergence in data models means that there is no single interpretation of the content on which all potential recipients might agree.
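One well-known class of the security bugs the original post asked about is parser-differential behaviour on duplicate JSON keys: some parsers keep the first occurrence, others the last. The sketch below uses Python's real `json` module for the last-wins case and simulates a first-wins parser with `object_pairs_hook`; the "role" check is a contrived illustration, not any particular system.

```python
import json

raw = '{"role": "user", "role": "admin"}'

# Parser A (Python's json default): the LAST duplicate key wins.
consumer_view = json.loads(raw)

# Parser B: simulate a FIRST-wins parser. Reversing the pairs before
# building the dict makes the first original occurrence survive.
producer_view = json.loads(
    raw, object_pairs_hook=lambda pairs: dict(reversed(pairs)))

# If a gatekeeper validates the document with parser B but the backend
# acts on parser A, the same bytes are "user" in one place and "admin"
# in the other -- a classic validation-bypass shape.
print(producer_view["role"], consumer_view["role"])
```

Two conforming-looking parsers, one byte sequence, two interpretations: exactly the divergence a single specified algorithm is meant to rule out.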

to assign properties to particular objects than JSON does,
these specifications had to add extra rules to both kinds of parsers
in order to ensure that each input document had exactly one possible interpretation.
[[vc-data-model-2.0 inline]] and [[draft-ietf-oauth-sd-jwt-vc inline]] fixed the problem
Contributor
Manu is right that these are completely different (and that they likely represent standardization failure, though the question of where the failure occurred might be contested). In a sense, it is OK that they are completely different (that they are in competition is potentially bad if they address the same use cases, but there is no risk that one might be mistaken for the other).

I think that it would serve this example better to focus only on the CID case.

@gkellogg commented

This issue really gets at the heart of a basic divide at W3C: one that is browser-centric, vs. one which is data-centric. In fact, JSON-LD does parse JSON (and YAML and CBOR) into a common INFRA-based data structure (called the Internal Representation) which various algorithms operate over to perform different transformations, including to interpret as RDF. This is the core reason behind JSON-LD, which has become extremely widely used on the Web (in large part, due to schema.org).

HTML is also often processed differently, typically by interpreting the resulting DOM. This might be done to extract Microdata/RDFa, interpret the contents of script elements, or to perform extensive re-formatting through ReSpec or Bikeshed. Search engines interpret the DOM for their own uses, so a general principle would seem to settle on a data representation which different applications can use to suit their different use cases.

In the case of Verifiable Credentials, the basic failure would seem to be a lack of agreement on how to work with the data that is represented in the JSON. This is an area the TAG can help with for future specs, rather than getting into a reductionist view that Polyglot formats are fundamentally a bad idea.

these specifications had to add extra rules to both kinds of parsers
in order to ensure that each input document had exactly one possible interpretation.
[[vc-data-model-2.0 inline]] and [[draft-ietf-oauth-sd-jwt-vc inline]] fixed the problem
by defining different media types for JSON-LD vs JSON,
@BigBlueHat commented Mar 11, 2025

This statement is incorrect. The VCDMv2 only has a single media type: application/vc.

Likewise, SD-JWT-VC only has one: application/dc+sd-jwt. However, SD-JWT's base format is not parseable as JSON.

Both specifications have parsing algorithms unique to their media types--and specific to their tasks.

It remains unclear how these examples are "polyglot".

@hober (Contributor) left a comment

This looks really good to me on a first-pass look. I'll try to find the time to give it a closer read some other time, but please don't block on waiting for me to do so. :)

@pchampin

It's bad, because now there is a third place where a bug can result in a different interpretation (producer, consumer, and now spec).

I disagree: bugs in specs are always possible, whether or not the spec acknowledges them.

Also note that an algorithm is not an implementation, and no algorithm is entirely neutral: Javascript developers do not write algorithms the same way Rust (or Java, or Scala...) developers do. As a consequence, every implementer of a spec has to adapt the algorithms. The differences are not limited to programming languages: whether you are using a relational, key-value, document, or graph database, you will encode and handle a given data model differently. Ultimately, every implementation defines a different interpretation path, whether we like it or not.

In some ecosystems (browser APIs come to mind), this heterogeneity may be limited, and therefore the "only one algorithm" principle is probably a good enough way to ensure interoperability. In other ecosystems, where the heterogeneity is higher, it is probably better to acknowledge it and provide guidance to the different kinds of implementers. This is the strategy taken by the editors of CID and VCDM, and should not IMO be flagged as bad practice™.

@jyasskin jyasskin marked this pull request as draft March 15, 2025 17:26
@csarven (Member) left a comment

tl;dr: The concept of "polyglot format" lacks a clear definition. It doesn't serve as a helpful lens for this discussion. The current examples don't adequately support the argument. Either different and better examples are needed to distil a principle, or a different approach should be considered. I've opted for the latter and made a change suggestion that aims to provide a more agreeable principle, and the examples should be updated in any case.


An authoritative definition of "polyglot format" with explicit criteria would help clarify whether a format, profile, or data model qualifies as such. If such a definition exists and aligns with open standards principles, a reference would be helpful.


SVG and MathML can be written to be valid and processed in both standalone XML and inline HTML contexts.

Since draft-ietf-oauth-sd-jwt-vc and vc-data-model-1.1 are separate specifications, the example does not fully illustrate the intended point.

Other examples:

  • HTTP status codes: Servers use either 403 or 404 to prevent resource discovery.
  • Click event handling: addEventListener('click') and onclick both achieve the same goal but through different mechanisms.
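The 403-vs-404 example above — a server deliberately collapsing "forbidden" and "missing" into one observable response — can be sketched as follows; the resource and permission tables are invented purely for illustration:

```python
# Hypothetical store: which paths exist and which users may read them.
RESOURCES = {
    "/secret-report": {"owner"},
    "/public-page": {"owner", "anyone"},
}

def status_for(path: str, user: str) -> int:
    if path not in RESOURCES:
        return 404
    if user not in RESOURCES[path]:
        # Returning 404 instead of 403 hides the resource's existence,
        # so unauthorized callers cannot distinguish "missing" from
        # "forbidden" -- two server-side algorithms, one observable goal.
        return 404
    return 200

print(status_for("/missing", "anyone"))        # 404
print(status_for("/secret-report", "anyone"))  # 404: existence hidden
print(status_for("/public-page", "anyone"))    # 200
```

Here the two internal code paths are intentionally indistinguishable to the client, which is the opposite failure mode from polyglot parsing: the spec permits the divergence precisely because it is unobservable.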

I don't see the significance of noting structured suffixes any more than the relationship between the 'text' top-level type and its subtypes. text/html is specific, and interpreting it as plain text is merely a useful step in the process of parsing it as intended (as pointed out by @martinthomson here), not an alternative interpretation path. The same applies to application/ld+json, where the goal is to interpret it as JSON-LD, not merely as JSON. Treating JSON-LD purely as JSON is analogous to treating CSV or HTML as plain text: although a useful step in the whole process, as pointed out by @gkellogg here, it ignores part of the intended structure, semantics, and functionality. Similarly, application/json is intended to represent JSON, not the raw series of strings that make it up.
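The structured-suffix behaviour described here — the suffix licenses a generic intermediate parse, while full interpretation requires the specific subtype — can be sketched like this (per RFC 6838; the function name is made up for illustration):

```python
def structured_suffix(media_type: str) -> "str | None":
    """Return the structured suffix of a media type, if any."""
    subtype = media_type.split("/", 1)[1]
    return subtype.rsplit("+", 1)[1] if "+" in subtype else None

# A generic processor seeing application/ld+json may parse the bytes as
# JSON -- an intermediate, diagnostic-level result -- but interpreting
# them as JSON-LD requires knowing the full subtype, not just the suffix.
print(structured_suffix("application/ld+json"))  # json
print(structured_suffix("text/html"))            # None
```

The suffix names a step on the way to the intended interpretation, not a second, competing interpretation of the same document.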

CID's context injection states "[a]ny differences in semantics between documents processed in either mode are either implementation or specification bugs".

As I understand it, this is analogous to different representations of an HTTP resource being deemed equivalent in meaning. For example, when /dog depicts a dog with content negotiated as image/jpeg, it should also depict a dog, not a cat, when content negotiated as image/png. A resource having multiple equivalent representations is not deemed a bug in either HTTP or Web Architecture, and that obviously implies different algorithms to make sense of those different representations.

Sometimes reality is a bit more nuanced than "good" or "bad" =)

So, if a principle is to be stated beyond the obvious, it should encourage simplicity, clarity, and security, while accounting for the complexity and interoperability of different implementations.

Comment on lines +3508 to +3515
<h3 id="multiple-algorithms">Write only one algorithm to accomplish a goal</h3>

When specifying how to accomplish a goal, write a single algorithm to do it,
instead of letting implementers pick between multiple algorithms.

It is very difficult to ensure that
two different algorithms produce the same results in all cases,
and doing so is rarely worth the cost.
@csarven commented Mar 17, 2025

Suggested change
<h3 id="multiple-algorithms">Write only one algorithm to accomplish a goal</h3>
When specifying how to accomplish a goal, write a single algorithm to do it,
instead of letting implementers pick between multiple algorithms.
It is very difficult to ensure that
two different algorithms produce the same results in all cases,
and doing so is rarely worth the cost.
<h3 id="single-algorithm">Write only one algorithm to accomplish a goal</h3>
When defining how to achieve a feature, it's better to specify a single
approach rather than offering multiple options.
If multiple methods are allowed, they must be equivalent in conformance to
avoid unnecessary complexity, inconsistency, and security risks.

@TallTed commented Mar 17, 2025

This explains why and when "polyglot" formats are a bad idea.

It doesn't appear to do anything of the sort. I don't see a clear definition of what such a "polyglot" format is, nor how one might be worked with, nor how working with one might go bad (nor go well, so there's that).

I do see a number of apparent misunderstandings by the author which others have pointed out directly, so I won't go into those myself.

Principle: Write only one algorithm to accomplish a task.

This doesn't appear to be the focus of this writing. Perhaps it shouldn't be the title of the PR, either?

Development

Successfully merging this pull request may close these issues.

New principle: Discourage polyglot formats