[OpenISO] "alpha" release: OOXML spec as HTML
Norbert Bollow
nb at bollow.ch
Sun Oct 7 02:00:32 CEST 2007
Claude Almansi <claude.almansi at gmail.com> wrote (Wed, 26 Sep 2007):
> On 9/26/07, Norbert Bollow <nb at bollow.ch> wrote:
> > Henrik Sundberg <storangen at gmail.com> wrote on 2007-09-16:
> >
> > > Does anyone know about OOXML saved in other formats than the pdf
> > > documents found at
> > > http://www.ecma-international.org/publications/standards/Ecma-376.htm
> > > ?
> >
> > Well, there's also an OOXML-formatted version of the spec.
> >
> > Maybe it would be a worthwhile test of a document format spec
> > whether it is possible to write, with reasonable effort, a
> > script which converts the spec into HTML format?
> >
> > Since according to Ecma, the OOXML spec is in the public domain,
> > we can publish the resulting HTML files on the OpenISO.org website.
>
> That would be great
Done: I've just put a rough first version of the conversion script
up at http://OpenISO.org/tools/ooxmlspec2html
The output is here:
http://OpenISO.org/Ecma/376/
> Wondering though: if you make such a version but have to use "geek"
> competencies, won't publishing this version be self-defeating? I mean
> ECMA and MS could say: look, OOXML does work for the production of
> files in other formats.
Well, if they try to claim that, my answer will be that the ability
to generate HTML from a markup format really doesn't say much about
that format.
If you could reasonably convert it to a genuine document format
standard, i.e. ODF, while preserving all the detailed markup
information, that'd genuinely be good news. But converting any
XML-based document markup format to straightforward HTML (without
trying to preserve the "look and feel" of the document, because that
would require abusing HTML for purposes that it was not designed for)
should be essentially trivial.
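To make "essentially trivial" concrete, here is a minimal sketch of the easy part: turning WordprocessingML paragraphs into plain HTML <p> elements. It is not the ooxmlspec2html script; the function name paragraphs_to_html and the sample input are made up for illustration, and it assumes the usual w:p / w:r / w:t nesting in the WordprocessingML namespace. All formatting, tables and pictures are ignored.

```python
import html
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraphs_to_html(document_xml: str) -> str:
    """Emit one <p> per non-empty w:p, concatenating its w:t text runs.

    Hypothetical helper: formatting, tables and images are dropped.
    """
    root = ET.fromstring(document_xml)
    out = []
    for p in root.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        if text:
            out.append(f"<p>{html.escape(text)}</p>")
    return "\n".join(out)


# Made-up sample input, not taken from the Ecma-376 spec files.
SAMPLE = (
    f'<w:document xmlns:w="{W}"><w:body>'
    "<w:p><w:r><w:t>Scope &amp; </w:t></w:r><w:r><w:t>purpose</w:t></w:r></w:p>"
    "<w:p/>"
    "</w:body></w:document>"
)

if __name__ == "__main__":
    print(paragraphs_to_html(SAMPLE))
```

Text-only conversion of this kind says nothing about how hard the rest of the format is, which is the point of the paragraph above.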
Nevertheless I ran into serious problems, much worse than the mere
ugliness of Microsoft's abuse of XML, where e.g. you have to extract
a bit from a hex-coded "table properties" bitmask in order to
determine whether the cells in the first row of a table should be
<th> or <td> cells.
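For illustration, the bit extraction looks roughly like the sketch below. It assumes that the "table properties" value is the hex string carried by the table's tblLook attribute and that mask 0x0020 is the "first row" flag; both are assumptions on my part and should be checked against the spec. The helper names are made up.

```python
# Assumption: 0x0020 is the "apply first-row formatting" flag in the
# hex-coded table look value. Verify against the spec before relying on it.
FIRST_ROW_MASK = 0x0020


def first_row_is_header(tbl_look_hex: str) -> bool:
    """Return True if the first-row bit is set in a hex string such as "01E0"."""
    return bool(int(tbl_look_hex, 16) & FIRST_ROW_MASK)


def cell_tag(tbl_look_hex: str, row_index: int) -> str:
    """Pick the HTML cell element for a row (row_index is 0-based)."""
    if row_index == 0 and first_row_is_header(tbl_look_hex):
        return "th"
    return "td"


if __name__ == "__main__":
    for value in ("01E0", "01C0"):
        print(value, cell_tag(value, 0), cell_tag(value, 1))
```

In a sane XML vocabulary this would be a boolean attribute or a child element, and no integer parsing would be needed at all.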
The biggest problem is the wide variety of nonstandard graphics
formats embedded in the specification text. For example, some of the
pictures are in Microsoft's "Windows Metafile" (WMF) format. I
tried to convert one by means of various conversion utilities, but
they all failed to produce the right result. At least in this case I
was able to extract a version of the picture from the pdf which looks
like it might be essentially correct.
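A first step towards dealing with this is simply to take inventory of which formats occur. The sketch below guesses the format of an embedded picture from its leading bytes. The function name sniff_image_format is made up, the list of signatures is deliberately short, and a match only says what the file claims to be, not whether any tool will render it correctly.

```python
def sniff_image_format(data: bytes) -> str:
    """Guess an image format from well-known leading byte signatures."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    # "Placeable" WMF: 32-bit key 0x9AC6CDD7, stored little-endian.
    if data.startswith(b"\xd7\xcd\xc6\x9a"):
        return "wmf-placeable"
    # EMF: first record has type 1, and " EMF" appears at byte offset 40.
    if data[:4] == b"\x01\x00\x00\x00" and data[40:44] == b" EMF":
        return "emf"
    # Plain WMF header: file type 1 or 2, followed by header size 9 (words).
    if data[:4] in (b"\x01\x00\x09\x00", b"\x02\x00\x09\x00"):
        return "wmf"
    return "unknown"


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            print(path, sniff_image_format(f.read(64)))
```

Run over the picture files unpacked from the OOXML package, this gives a quick count of how many need special handling.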
Maybe we should write a program which extracts all the pictures from
the pdf.
However, I have seen several examples of pictures which are rendered
obviously incorrectly in the official Ecma-376 pdf files. Therefore,
such an image-extraction program cannot be expected to produce
correct results in every case.
Greetings,
Norbert.
--
Norbert Bollow <nb at bollow.ch> http://Norbert.ch
President of the Swiss Internet User Group SIUG http://SIUG.ch
Working on establishing a non-corrupt and
truly /open/ international standards organization http://OpenISO.org