Should parser extensibility be a design goal? #113

Closed
domenic opened this issue Jun 12, 2015 · 26 comments

Comments

@domenic
Collaborator

domenic commented Jun 12, 2015

Forked from #112 (comment)

@domenic
Collaborator Author

domenic commented Jun 12, 2015

My take:

  • It would be great from an "explaining the platform" point of view.
  • Void elements are very convenient. I would like custom void elements.
  • Parsing rules that restrict content to only certain other content, and push other content out (like <table>), seem kind of nice. (But also kind of horrible?)
  • On the other hand, the parser is basically a giant pile of accidental complexity. Do we really want to make it hookable? Maybe "explaining the parser" is more like "explaining sync XHR"---a bad idea, even if we theoretically could.

@travisleithead
Member

I can kinda see why some folks might want to have a void custom element (at least from a parsing perspective). As for the other craziness around <table>, <form>, etc. I'd steer clear of that. Just because it's part of the platform doesn't mean it's a good part of the platform.

@rniwa
Collaborator

rniwa commented Jun 13, 2015

Self-closing elements DO seem like a useful thing to have. I don't think we want to make parsing behavior polymorphic as is done for table and tr elements, though. It makes HTML parsing incompatible with an XML parser, and all sorts of insanity ensues. One way to mitigate this problem would be to let any element that has - in its name appear anywhere a template element could appear.

For script and style, I don't think we necessarily want the exact same parsing rule that takes care of <!--, -->, etc... since they're mostly historical artifacts. For those use cases, we may want to allow contents to be parsed as CDATA.

Note that upgrading makes even less sense once we allow the parser behavior to depend on the element definition, since we wouldn't know how to parse an element until the definition is available, and we certainly don't want to make the parsing rule racy.

@domenic
Collaborator Author

domenic commented Jun 13, 2015

One way to mitigate this problem would be to let any element that has - in its name appear anywhere a template element could appear.

Hmm, what are the pros and cons of this vs. letting elements with - appear anywhere a div can appear? I don't really have an opinion either way, but it's probably worth enumerating.

For script and style, I don't think we necessarily want the exact same parsing rule that takes care of <!--, -->, etc... since they're mostly historical artifacts. For those use cases, we may want to allow contents to be parsed as CDATA.

Could you clarify a bit? From what I recall of CDATA, it involves ugly <![CDATA[ ]]> pairs and is an XHTML-only thing. Whereas for script, the main magic is that it waits for the character sequence </script> to appear. Which of those models are you proposing? (I think this is just me not understanding the phrase "parsed as CDATA".)

Note that upgrading makes even less sense once we allow the parser behavior to depend on the element definition, since we wouldn't know how to parse an element until the definition is available, and we certainly don't want to make the parsing rule racy.

Definitely. I think that any elements which opt in to unusual parsing rules would either need a sentinel in their tag name (strawman: <voidelements-> vs. <need-closing></need-closing> vs. <script--esque> ... </script--esque>) or we could just error upon trying to define an element with an unusual parsing rule that already exists in the document, i.e. throw for document.registerElement("script-esque", { parsingType: "cdata" }). Upgrading and parser extensions are complementary in this sense, not at all contradictory.
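
To make the sentinel strawman concrete, here is a rough sketch of how a parser could pick a mode from the tag name alone, with no element definition available. The function name `parsingModeFromTagName` and the mode labels are made up for illustration, and the sentinel conventions are just the strawman spellings above, not anything specified.

```javascript
// Hypothetical sketch: choose a parsing mode purely from the tag name.
// The sentinels (trailing "-", embedded "--") are the strawman from this
// thread, not part of any spec.
function parsingModeFromTagName(tagName) {
  const name = tagName.toLowerCase();
  if (!name.includes("-")) return "builtin"; // not a custom element name
  if (name.endsWith("-")) return "void";      // e.g. <voidelements->
  if (name.includes("--")) return "rawtext";  // e.g. <script--esque>
  return "normal";                            // e.g. <need-closing>
}

console.log(parsingModeFromTagName("voidelements-"));
console.log(parsingModeFromTagName("script--esque"));
console.log(parsingModeFromTagName("need-closing"));
```

Because the decision depends only on the name, it gives the same answer before and after the element is defined, which is what keeps upgrading race-free.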

@rniwa
Collaborator

rniwa commented Jun 13, 2015

Could you clarify a bit? From what I recall of CDATA, it involves ugly <![CDATA[ ]]> pairs and is an XHTML-only thing. Whereas for script, the main magic is that it waits for the character sequence </script> to appear. Which of those models are you proposing? (I think this is just me not understanding the phrase "parsed as CDATA".)

No, script data ignores <!-- and --> as well. See https://html.spec.whatwg.org/multipage/syntax.html#script-data-state

What I'm proposing is to let an element implicitly begin a CDATA section, e.g. treat <my-script>~</my-script> as <my-script><![CDATA[~]]></my-script>. It won't work if the script contains the literal </my-script> but that's probably okay. But all of this seems like unwarranted complexity when the author is likely just going to use the src attribute instead.
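
A minimal sketch of that "implicit CDATA" idea, assuming the content is simply everything up to the first matching end tag. This is not the HTML spec's tokenizer algorithm, and `readRawText` and its return shape are hypothetical names for illustration only.

```javascript
// Hypothetical sketch: treat everything after the start tag as literal
// text until the first matching end tag. `start` is the index just past
// the start tag. No special handling of "<!--" or "-->".
function readRawText(input, tagName, start = 0) {
  // Escape the tag name so it can be embedded in a regular expression.
  const escaped = tagName.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // The end tag name must be followed by whitespace, "/" or ">".
  const endTag = new RegExp("</" + escaped + "(?=[\\s/>])[^>]*>", "gi");
  endTag.lastIndex = start;
  const match = endTag.exec(input);
  if (match === null) {
    // No end tag found: the rest of the input is the content.
    return { text: input.slice(start), end: input.length, closed: false };
  }
  return { text: input.slice(start, match.index), end: endTag.lastIndex, closed: true };
}

const open = "<my-script>";
const source = open + "if (a < b) { s = '<!--'; }" + "</my-script>tail";
console.log(readRawText(source, "my-script", open.length).text);
```

As noted above, content containing the literal end tag would terminate early under this model.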

Definitely. I think that any elements which opt-in to unusual parsing rules would either need a sentinel in their tag name (strawman: <voidelements-> vs. <need-closing></need-closing> vs. <script--esque> ... </script--esque>)

This is the only way.

or we could just error upon trying to define an element with an unusual parsing rule that already exists in the document, i.e. throw for document.registerElement("script-esque", { parsingType: "cdata" }).

This would be racy if the script, which contains the element definition, loads asynchronously.

Upgrading and parser extensions are complementary in this sense, not at all contradictory.

I disagree. Some custom elements having to be defined prior to the use and others being upgraded later would be super confusing for consumers of custom elements.

@domenic
Collaborator Author

domenic commented Jun 13, 2015

I disagree. Some custom elements having to be defined prior to the use and others being upgraded later would be super confusing for consumers of custom elements.

I don't really agree it would be that confusing, but the opt-in via a sentinel in the tag name neatly sidesteps the issue, since it means all elements (including those with unusual parsing rules) can be upgraded. Since you seem to prefer that solution anyway, I think we're in accord.

@annevk
Collaborator

annevk commented Jun 15, 2015

I think is="" has already shown that people want some hooks for the parser. Maybe it does not need to be fully extensible, but we want to at least cover the cases of <template>, <script>, and void elements I think.

@annevk
Collaborator

annevk commented Jun 15, 2015

(I don't think subclassing is the solution. Composition seems better since you might not want all the baggage that <script> brings, etc.)

@domenic
Collaborator Author

domenic commented Jun 15, 2015

That seems fair. Can we try to nail down:

  • Should the default parsing (in terms of places that the element can appear) for custom elements be <div>-esque or <template>-esque? What are the pros and cons of either?
  • What specific types of unusual parsing are desired? E.g. @annevk proposes void elements, <script>-esque, and <template>-esque. Is that the correct set? @rniwa says that <script>-esque seems like unwarranted complexity when the author is likely just going to use src attribute instead.
    • Can we settle this with appeal to example usages in the wild? E.g. as @annevk points out it seems like Polymer has examples for <template>-esque. Are those better suited for composition, or inheritance? What about <script>-esque? Can we find any examples there?
    • When you're talking about <template>, there are two main changes: that it's allowed to appear anywhere, and that its contents are inert. To me the latter is fundamentally part of <template>, and you should only be able to get it via inheritance. @annevk does that seem right?

@annevk
Collaborator

annevk commented Jun 15, 2015

I sort of feel like the default parsing should be <template>, since that can appear anywhere. Definitely not <div>, we don't want to close <p> and such.

While <script> has some complexity, it's not clear to me that introducing a new parser class is better than simply reusing the path <script> already takes.

I agree that <template> inertness is a different feature. Inheritance seems reasonable for that, though might also be composition in part since the parser then puts elements elsewhere. And if <template> ever gains data binding you might want all the parsing implications of <template>, but not data binding...

@domenic
Collaborator Author

domenic commented Jun 15, 2015

I sort of feel like the default parsing should be <template>, since that can appear anywhere. Definitely not <div>, we don't want to close <p> and such.

Hmm I didn't realize <div> and <asdf> parsed differently. (I hope <asdf> and <span> parse the same??) I guess in all above posts when I said <div> I meant <asdf>. Which raises the question of whether anyone wants <div> behavior. Hopefully not.

While <script> has some complexity, it's not clear to me that introducing a new parser class is better than simply reusing the path <script> already takes.

Well, there'd need to be some abstraction at least, to distinguish the character sequence </script> from </custom-thingy> or whatever. Which seems like it might be enough of a cost to just go with a new mode.

@annevk
Collaborator

annevk commented Jun 15, 2015

<span> is special inside foreign content whereas <asdf> would not be. Search for span in https://html.spec.whatwg.org/multipage/syntax.html to find that.

You don't think the scanner for </script> could be repurposed as a scanner for </{custom}>? Anyway, I guess I would be okay with a more simplified mode too. <script> does have an awful lot of states.

@tuespetre

Anything that you would want to parse differently than 'normal' HTML can be shoved into a

<script type="my-format">

and if a custom element author wanted to make use of that, they could choose how to do that for their particular element -- whether that means instructing consumers to make it a child element, or provide it the ID of the script element, or whatever.
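
As a rough illustration of both options (child element, or ID reference), a custom element could locate its payload roughly like this. The helper name `readPayload` and the `data-source` attribute are hypothetical; the DOM calls used (`getAttribute`, `getElementById`, `querySelector`, `textContent`) are the standard ones.

```javascript
// Hypothetical helper: find the non-HTML payload for a custom element.
// It looks for a script referenced by ID through a made-up "data-source"
// attribute, and otherwise falls back to a child <script type="my-format">.
// Browsers do not execute scripts with an unrecognized type, so the
// content stays available as plain text.
function readPayload(host, doc) {
  const id = host.getAttribute("data-source");
  const script = id
    ? doc.getElementById(id)
    : host.querySelector('script[type="my-format"]');
  return script ? script.textContent : null;
}
```

In a browser this might be called from the element's own setup code, e.g. `readPayload(this, this.ownerDocument)`.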

As for the ability to specify a valid content model, that's kind of what the shadow DOM already aims to provide. If a component author wants to move invalid children (i.e. children not matching a slot or content selector) to the outside of the element instead of just ignoring them (like what you mentioned with tables and forms) they can do that, although it seems pointless.

@zcorpan
Contributor

zcorpan commented Feb 20, 2016

Some thoughts...

  • script parsing is insane. Really. See pub quiz, and the correct answer. Why would anyone want this wacko weirdness for any other element where it's not needed for Web compat?? If you want CDATA, be like style, not like script.
  • Voidness might not be necessary if elements with a dash in them can be self-closed with /> syntax. (HTML should support self-closing tags in general whatwg/html#721)
  • template-like parsing means you don't get thrown out of a table and you can contain stuff like tr. It also means you're scoping; compare <p><asdf>1</p>2 to <p><template>1</p>2.
  • Most "new" HTML elements parse like <asdf> (e.g. video) or like <address> (e.g. figure). Parsing like <asdf> for "block-like" elements means omitting </p> tags doesn't really work (e.g. <asdf><p>1<p>2</asdf>3). (Already the case for a/audio/video/ins/del though.)

@annevk
Collaborator

annevk commented Apr 16, 2016

During the teleconference I think the only viable change to the parser that was mentioned was some kind of switch for the tree builder, so you can opt into a much simpler model. Anything that requires deep changes to the existing parser is unlikely to succeed.

@annevk annevk closed this as completed Apr 16, 2016
@domenic
Collaborator Author

domenic commented Apr 16, 2016

I think some people were still supportive of the idea of using rare or currently-invalid sigils to switch to a different parsing mode, e.g. <void-element--> or <void-element!>...

@annevk
Collaborator

annevk commented Apr 16, 2016

Okay, fair enough, if we still want to consider that.

@annevk annevk reopened this Apr 16, 2016
@zcorpan
Contributor

zcorpan commented Apr 17, 2016

(Trailing -- is a bad idea, because --> closes a comment)

@zcorpan
Contributor

zcorpan commented Apr 18, 2016

From ad-hoc reducing httparchive starting with all funny-looking characters on my keyboard, removing characters where the matches include things like regexps or scripts or other stuff, I was left with only ! (76 matches) and | (60 matches) as possible sigils.

Some have ! or | as the last character of an unquoted attribute value, like

<img src=/static/yeoman-character-sticker.c30c59fb9e.png class=Donation-sticker width=190 height=294 alt=Stickers!>

view-source:http://www.aujourdhui-en-france.fr/
uses <div ... !> and expects non-void.

view-source:http://ww1.jaruratmand.com/
uses <meta name=description content=This website is for sale! ... We hope you find what you are searching for!> (void)

view-source:http://gongbe.com/
uses <a ... |> and expects non-void.

view-source:http://www.philipmorrisdirect.co.uk/
uses <img |> (void)

None of the matches had element names with - in them.

For either of these, a space before it is required because otherwise the character becomes part of the element name or attribute name. The sigil could be /! or /| though I suppose that's pretty ugly.

In order to not break pages like the ones above, I suggest any sigil to change parser behavior be limited to elements that have - in the name.
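
Sketching that restriction, assuming the sigil arrives as a bare trailing attribute on an already-tokenized start tag (so `<x-foo a=b !>` yields attribute names `["a", "!"]`). The function name `isSigilVoid` and the input shape are assumptions for illustration, not parser internals.

```javascript
// Hypothetical sketch: decide whether a trailing "!" or "|" attribute
// should make a start tag void. Only names containing "-" are eligible,
// so existing pages using <div ... !> or <img |> keep parsing as before.
const SIGILS = new Set(["!", "|"]);

function isSigilVoid(tagName, attributeNames) {
  if (!tagName.includes("-")) return false; // built-in names are untouched
  const last = attributeNames[attributeNames.length - 1];
  return SIGILS.has(last);
}

console.log(isSigilVoid("x-foo", ["class", "!"]));
console.log(isSigilVoid("img", ["|"]));
```

An unquoted value such as alt=Stickers! never reaches this check, since there the ! is part of the attribute value rather than an attribute name.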

@annevk
Collaborator

annevk commented Apr 18, 2016

@zcorpan I've always assumed that @domenic's suggestion was about <void-element! attrs> not <void-element attrs !>.

@zcorpan
Contributor

zcorpan commented Apr 18, 2016

Oh, OK, right... As I commented in whatwg/html#721 (comment) I think that is less flexible. Also if ! or | is actually part of the name it makes it difficult to use in systems that have XML name assumptions baked in (e.g. createElement but also server-side round-tripping and other tools, e.g. https://checker.html5.org/ comes to mind).

@annevk
Collaborator

annevk commented Jul 21, 2016

@domenic do you still think that's realistic? My impression is still that nobody really wants to touch the parser.

@domenic
Collaborator Author

domenic commented Jul 21, 2016

I still think it's realistic... I don't know if there's appetite for it.

@dominiccooney @rniwa @travisleithead @smaug---- what do you think of the idea (as a "custom elements v2" future feature) of modifying the parser so that <x-foo! attrs> parses as a void element, for any custom element name x-foo?

Does it sound doable and reasonable, or is modifying the parser in such a way a bad idea that you would never endorse?

@domenic
Collaborator Author

domenic commented Sep 19, 2016

TPAC F2F conclusion: nobody really wants to do this. The use cases are better served by is="" or similar. One of the particular problems brought up was backward compatibility with browsers that don't implement this; is="" is much better for that.

@domenic domenic closed this as completed Sep 19, 2016
@devingfx

Hello,

Why not just plain old <icon-cog />? I never figured out why the HTML parser doesn't support self-closing tags.
Can the / char be part of an attribute name? <div /="">?

@rniwa
Collaborator

rniwa commented Jan 20, 2017

We might be able to support self-closing tags for an element with - in its name.

However, what we're discussing is changing the parsing mode to avoid the HTML parser's quirks or to customize its behavior, so the discussion here is a bit tangential to that. The syntax we've been discussing looks like a self-closing tag, which makes it a bit confusing.
