A compact syntax for XProc?

Markup is great. I love markup. I particularly like XML, but I’m not dogmatic about it. HTML and JSON and Markdown are all just awkward serializations of XML as far as I’m concerned. They’ve got more (or less) expressive power than other serializations. I get a lot of mileage out of Org-mode on a daily basis. I could use XML instead, but Org-mode has benefits. And, of course, it is XML.

If I have to write anything more substantial, I usually switch to DocBook. (Quelle surprise.) Writing in DocBook helps me to organize my thoughts, letting me identify what things are, instead of how they should be displayed. Then I can shake, stir, mix, remix, an…

Sometimes, markup is exactly the right tool for the job, as is the case for XSLT. Writing a long, complicated functional program in XML syntax may seem like an odd choice. Certainly, there are detractors. But, really, it’s perfect for this application. XSLT is a tree-to-tree transformation language; embedding it in an expression language that is explicitly a tree feels very natural. It also has a lot of the “code is data” feeling that makes Lisp so powerful.

But sometimes, XML markup can feel like it’s getting in the way. We can see how this plays out by looking at schema languages for XML. The historical schema language for XML, the document type definition (DTD), was not in XML (instance) syntax. It’s short and compact. (It’s also a little abstruse, but it’s not that hard to understand.)

<!ELEMENT doc (title, p+)>
<!ATTLIST doc status (draft|final) #IMPLIED>
<!ELEMENT title (#PCDATA)*>
<!ELEMENT p (#PCDATA|em)*>
<!ELEMENT em (#PCDATA)*>
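
For readers who do find the DTD abstruse, here is a rough sketch, in Python, of what those five declarations say. The function name is_valid_doc is made up for this illustration, and the check is only an approximation of DTD validation: it ignores namespaces, entities, and everything else a real validating parser handles.

```python
import xml.etree.ElementTree as ET


def is_valid_doc(xml_text: str) -> bool:
    """Hand-written approximation of the five DTD declarations."""
    doc = ET.fromstring(xml_text)

    # <!ELEMENT doc (title, p+)> and <!ATTLIST doc status (draft|final) #IMPLIED>
    if doc.tag != "doc" or set(doc.attrib) - {"status"}:
        return False
    if doc.get("status") not in (None, "draft", "final"):
        return False
    children = list(doc)
    if len(children) < 2 or children[0].tag != "title":
        return False
    if any(child.tag != "p" for child in children[1:]):
        return False
    # Element content: only whitespace may appear between the children.
    between = [doc.text] + [child.tail for child in children]
    if any(text and text.strip() for text in between):
        return False

    # <!ELEMENT title (#PCDATA)*>: text only, no attributes declared.
    title = children[0]
    if len(title) or title.attrib:
        return False

    # <!ELEMENT p (#PCDATA|em)*> and <!ELEMENT em (#PCDATA)*>
    for p in children[1:]:
        if p.attrib:
            return False
        for em in p:
            if em.tag != "em" or len(em) or em.attrib:
                return False
    return True


sample = '<doc status="draft"><title>T</title><p>Some <em>text</em>.</p></doc>'
print(is_valid_doc(sample))  # True
```

The declarations state the same constraints in far fewer lines, which is a large part of the appeal of a schema language.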

XML Schemas, on the other hand, are written in XML. This has all the advantages you’d expect: the ones I outlined above, plus code generation, semantic analysis, and easy mixing with other XML vocabularies. (It also supports a lot of features not available in DTDs, but that’s not relevant; I’m just looking at the syntax.) Short and compact, it isn’t.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified">
  <xs:element name="doc">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="title"/>
        <xs:element maxOccurs="unbounded" ref="p"/>
      </xs:sequence>
      <xs:attribute name="status">
        <xs:simpleType>
          <xs:restriction base="xs:token">
            <xs:enumeration value="draft"/>
            <xs:enumeration value="final"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="title" type="xs:string"/>
  <xs:element name="p">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element minOccurs="0" maxOccurs="unbounded" ref="em"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="em" type="xs:string"/>
</xs:schema>

Five lines have become 28. That’s a lot of trees for the amount of forest.

Another schema language is RELAX NG. Like XML Schema, it has an XML syntax:

<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <ref name="doc"/>
  </start>
  <define name="doc">
    <element name="doc">
      <optional>
        <attribute name="status">
          <choice>
            <value>draft</value>
            <value>final</value>
          </choice>
        </attribute>
      </optional>
      <ref name="title"/>
      <oneOrMore>
        <ref name="p"/>
      </oneOrMore>
    </element>
  </define>
  <define name="title">
    <element name="title">
      <text/>
    </element>
  </define>
  <define name="p">
    <element name="p">
      <zeroOrMore>
        <choice>
          <ref name="em"/>
          <text/>
        </choice>
      </zeroOrMore>
    </element>
  </define>
  <define name="em">
    <element name="em">
      <text/>
    </element>
  </define>
</grammar>

That’s even longer than the XML Schema version, clocking in at 41 lines! But RELAX NG also has a compact syntax. The compact syntax has exactly the same expressive power as the XML syntax. You can convert between them losslessly. The compact syntax version of our schema is just a fraction longer than the DTD (and I could make it fit in five lines if I tried!):

start = doc
doc = element doc {
  attribute status { "draft" | "final" }?,
  title, p+
}
title = element title { text }
p = element p { (em|text)* }
em = element em { text }

For lots of folks, that makes RELAX NG a lot easier to use than XML Schema. It results in schemas that are a lot shorter and easier to grasp. (There are technical differences, of course, and reasons why one or the other might be better for a particular application, but the fact that you can fit a whole schema in a few hundred lines rather than thousands can make a big difference to its comprehensibility.)

What does any of this have to do with XProc, I hear you ask?

Well, the question is, where does the XProc syntax sit in terms of usability as an XML vocabulary, and would a compact syntax be better? Unlike XSLT, I don’t think the XProc XML syntax is a perfect fit. Where XSLT is describing trees, XProc is describing graphs. That can’t be made to feel quite as natural because the underlying abstractions are different.
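
One way to see the tree/graph difference is to write down which steps read from which. The sketch below is in Python purely for illustration; the step names are modelled on the pipeline that appears later in this post, and the reads_from mapping is a simplified, assumed model of the connections, not anything defined by the XProc specification.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Tree view: XML nesting gives every step exactly one parent.
parent = {
    "load": "viewport",
    "encode": "viewport",
    "set-attributes": "viewport",
}

# Graph view: each step maps to the set of steps it reads from.
# (A simplified, assumed model of the connections, for illustration only.)
reads_from = {
    "viewport": set(),
    "load": {"viewport"},                      # href comes from the matched element
    "encode": {"load"},                        # implicit link to the preceding step
    "set-attributes": {"viewport", "encode"},  # source input plus option context
}

# Steps with more than one incoming connection are where nesting alone
# stops being enough and explicit names or pipes are needed.
multi_input = [step for step, sources in reads_from.items() if len(sources) > 1]

# A valid execution order follows the graph edges.
order = list(TopologicalSorter(reads_from).static_order())

print(parent["set-attributes"])  # the tree records one relationship per step
print(multi_input)
print(order)
```

The nesting can only say that set-attributes lives inside the viewport; the fact that it also reads from encode has to be expressed some other way, with names and references.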

In XProc 3.0, I actually think it’s a pretty good fit. A lot of effort went into finding better, easier ways to express steps, the connections between them, and other language features in ways that are compact and natural.

XProc 1.0 is more verbose and much less flexible. The XProc 1.0 XML syntax feels a lot more like it’s getting in the way instead of helping. Shortly after 1.0 came out, a fair bit of effort was poured into finding an equivalent “compact syntax” for XProc. Without success.

Later, the XProc Next Community Group that spun up to work on what became XProc 3.0 wasn’t especially interested in a non-XML syntax. But I’ve never really been able to completely let it go.

What’s new since the last time I poked at it is Invisible XML. That makes prototyping a lot easier. And it’s fun, of course.

This is not a serious proposal. I just had an idea and spent an evening pursuing it. Here’s a short, useful XProc pipeline:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:h="http://www.w3.org/1999/xhtml"
                xmlns:nw="http://nwalsh.com/ns/xproc-extensions"
                type="nw:standalone"
                name="main" version="3.1">
  <p:input port="source" content-types="xml html"/>
  <p:output port="result" content-types="xml html"/>

  <p:viewport name="vphref" match="h:link[@href and @rel='stylesheet']">
    <p:load href="{resolve-uri(h:link/@href, base-uri(.))}"/>
    <p:encode name="encoded"/>
    <p:set-attributes>
      <p:with-input pipe="@vphref"/>
      <p:with-option name="attributes"
                     select="map{'href': 'data:text/css;base64,' || .}">
        <p:pipe step="encoded"/>
      </p:with-option>
    </p:set-attributes>
  </p:viewport>

  <p:viewport name="vpsrc" match="h:img[@src] | h:script[@src]">
    <p:load href="{resolve-uri(*/@src, base-uri(.))}"/>
    <p:variable name="ctype" select="p:document-property(., 'content-type')"/>
    <p:encode name="encoded"/>

    <p:set-attributes>
      <p:with-input pipe="@vpsrc"/>
      <p:with-option name="attributes"
                     select="map{'src': 'data:' || $ctype || ';base64,' || .}">
        <p:pipe step="encoded"/>
      </p:with-option>
    </p:set-attributes>
  </p:viewport>

</p:declare-step>
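
Each viewport in this pipeline does the same three things: it loads the referenced resource, base64-encodes it, and writes the result back into the attribute as a data: URI. As a rough sketch of the first viewport outside XProc, here is an equivalent in Python. The names to_data_uri and inline_stylesheets are hypothetical, load is an assumed caller-supplied function that returns the bytes for an href, and resolving the href against the base URI is left out.

```python
import base64
import xml.etree.ElementTree as ET
from typing import Callable

XHTML = "http://www.w3.org/1999/xhtml"


def to_data_uri(content: bytes, content_type: str) -> str:
    """Build a base64 data: URI from raw bytes and a content type."""
    encoded = base64.b64encode(content).decode("ascii")
    return f"data:{content_type};base64,{encoded}"


def inline_stylesheets(root: ET.Element, load: Callable[[str], bytes]) -> None:
    """Rewrite every h:link[@href and @rel='stylesheet'] in place."""
    for link in root.iter(f"{{{XHTML}}}link"):
        href = link.get("href")
        if href is not None and link.get("rel") == "stylesheet":
            # 'load' is an assumed helper: it maps an href to the file's bytes.
            link.set("href", to_data_uri(load(href), "text/css"))


print(to_data_uri(b"hi", "text/plain"))  # data:text/plain;base64,aGk=
```

The second viewport follows the same shape, except that the content type comes from the loaded document instead of being fixed to text/css.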

The pipeline replaces all references to external CSS stylesheets, script sources, and images in an HTML document with their content, encoded as data: URIs; in other words, it makes an HTML document standalone. (I had an idea for a YAML syntax too. Don’t ask. I don’t think so.) Here’s what it might look like in a non-XML syntax:

namespace h="http://www.w3.org/1999/xhtml"
namespace nw="http://nwalsh.com/ns/xproc-extensions"

declare-step@main type="nw:standalone" version="3.1" {
  input source { content-types="xml html" }
  output result { content-types="xml html" }

  viewport@vphref(match="h:link[@href and @rel='stylesheet']") {
    load(href="{resolve-uri(h:link/@href, base-uri(.))}")

    encode@encoded()

    set-attributes(
      attributes = "map{'href': 'data:text/css;base64,' || .}"
      with context encoded) {
      source from @vphref
    }
  }

  viewport@vpsrc(match = "h:img[@src] | h:script[@src]") {
    load(href = "{resolve-uri(*/@src, base-uri(.))}")

    let ctype = "p:document-property(., 'content-type')"

    encode@encoded()

    set-attributes(
      attributes = "map{'src': 'data:' || $ctype || ';base64,' || .}"
      context encoded) {
      source from @vpsrc
    }
  }
}

I’m not going to try to explain the syntax. If you squint, you can see that all of the pieces are there. It’s only two lines shorter than the XML version and I’m not absolutely convinced that it’s easier to read. (It does parse: I can turn it back into XML with an iXML grammar, and I could turn that back into XProc with a bit of XSLT.) XProc 3.0 isn’t markedly improved by taking away the XML, as far as I can tell. Maybe I am ready to let it go.

Looking down the road at future versions, though, based on the XPath 4.0 family of specifications, it’s easy to see that these are going to be powerful tools for processing JSON and other data formats not expressed in angle brackets. Would a non-XML syntax make XProc more appealing for users who never use XML? Should I care?

I don’t know, but I still do, kinda.
