HTML, NPF and other markup formats

I've recently published html2tumblr, a website and web API that converts HTML into posts on the blogging service tumblr. While tumblr used to allow its users to submit posts as HTML, it has since switched to an alternative format called NPF.

HTML is a tree-based markup language: HTML elements can contain other HTML elements, which can contain more elements, with no limit. This allows websites to have complex layouts, such as sidebars, headers and footers, side-by-side content, and so on. However, for simple blog posts such as this one, the tree often has only two layers: one for paragraphs and elements such as images and videos, and a second layer for inline formatting and inline links.

In fact, the HTML for this post looks roughly like this:

<article>
    <p>
    One paragraph of rambling, 
    perhaps with an <a href="https://example.com">inline link</a>.
    </p>
    <p>
    Another paragraph of rambling.
    </p>
</article>

You'll notice that inside the <article> element, there are only two more layers of HTML tags.

The tumblr NPF format follows this two-layer approach, but instead of HTML, it uses JSON as its representation language. The <article> from above would look like this in NPF:

{
  "content": [
    {
      "type": "text",
      "text": "One paragraph of rambling, perhaps with an inline link.",
      "formatting": [
        {
          "start": 43,
          "end": 54,
          "type": "link",
          "url": "https://example.com"
        }
      ]
    },
    {
      "type": "text",
      "text": "Another paragraph of rambling."
    }
  ]
}

While HTML places inline formatting directly in the text itself, NPF places inline formatting separately from the text and uses the "start" and "end" properties to specify what part of the text should be formatted.
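
To make the "start" and "end" idea concrete, here is a minimal Python sketch that turns one NPF text block back into an HTML paragraph. The function name npf_text_block_to_html is made up for this post and is not part of tumblr's API or of any library. The sketch assumes that "start" is inclusive and "end" is exclusive (which matches the example above), that ranges don't overlap, and that the indices line up with Python's string indices, which is safe for plain ASCII text like this. It only handles links.

```python
from html import escape


def npf_text_block_to_html(block):
    """Render one NPF text block as an HTML paragraph (links only).

    Hypothetical helper: assumes non-overlapping ranges, with "start"
    inclusive and "end" exclusive.
    """
    text = block["text"]
    # Sort ranges by start so the text can be walked left to right.
    ranges = sorted(block.get("formatting", []), key=lambda r: r["start"])
    parts = []
    pos = 0
    for r in ranges:
        if r["type"] != "link":
            continue  # other formatting types are ignored in this sketch
        # Unformatted text before the range.
        parts.append(escape(text[pos:r["start"]]))
        # The formatted range itself, wrapped in an <a> element.
        parts.append(
            '<a href="{}">{}</a>'.format(
                escape(r["url"]), escape(text[r["start"]:r["end"]])
            )
        )
        pos = r["end"]
    # Whatever is left after the last range.
    parts.append(escape(text[pos:]))
    return "<p>" + "".join(parts) + "</p>"


block = {
    "type": "text",
    "text": "One paragraph of rambling, perhaps with an inline link.",
    "formatting": [
        {"start": 43, "end": 54, "type": "link", "url": "https://example.com"}
    ],
}
html_paragraph = npf_text_block_to_html(block)
print(html_paragraph)
```

For the first block of the example, this prints the paragraph with the words "inline link" wrapped in an <a> element again, because characters 43 to 53 of the text are exactly those words.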

Other markup formats

The site html2tumblr uses the deltaconvert library to convert HTML to NPF. The library also allows conversions to and from a few other formats, most notably the quill delta format. Quill is an open-source rich-text editor and uses its own format for representing rich text. The format is called delta because it can represent differences between documents with keywords such as insert, delete and retain. Representing a complete document comes down to representing the difference from an empty document. The article from above would look like this as a quill delta:

{
  "ops": [
    {
      "insert": "One paragraph of rambling, perhaps with an "
    },
    {
      "insert": "inline link",
      "attributes": {
        "link": "https://example.com"
      }
    },
    {
      "insert": ".\n\nAnother paragraph of rambling.\n\n"
    }
  ]
}

A big difference from NPF is that quill delta doesn't represent paragraphs as separate blocks. Instead, each run of text with the same formatting is its own insert operation. Multiple paragraphs with the same formatting become one operation, and a paragraph break is represented as two newlines.
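
The difference is easiest to see when converting from one format to the other. The following Python sketch regroups insert operations into NPF text blocks by splitting on two newlines and recomputing the "start" and "end" offsets for links. It is only an illustration written for this post, not the code deltaconvert actually uses: the name delta_to_npf is made up, only string inserts and the "link" attribute are handled, and a paragraph break that is split across two operations would not be detected.

```python
def delta_to_npf(delta):
    """Regroup quill delta inserts into NPF text blocks (links only).

    Hypothetical helper, not deltaconvert's implementation.
    """
    blocks = []
    text = ""        # text of the paragraph currently being built
    formatting = []  # link ranges of the paragraph currently being built

    def flush():
        # Finish the current paragraph and start a new, empty one.
        nonlocal text, formatting
        if text:
            block = {"type": "text", "text": text}
            if formatting:
                block["formatting"] = formatting
            blocks.append(block)
        text, formatting = "", []

    for op in delta["ops"]:
        insert = op.get("insert")
        if not isinstance(insert, str):
            continue  # embeds such as images are skipped in this sketch
        link = op.get("attributes", {}).get("link")
        # Two newlines mark a paragraph break inside an insert.
        for i, piece in enumerate(insert.split("\n\n")):
            if i > 0:
                flush()
            if piece and link:
                formatting.append({
                    "start": len(text),
                    "end": len(text) + len(piece),
                    "type": "link",
                    "url": link,
                })
            text += piece
    flush()
    return {"content": blocks}


delta = {
    "ops": [
        {"insert": "One paragraph of rambling, perhaps with an "},
        {"insert": "inline link", "attributes": {"link": "https://example.com"}},
        {"insert": ".\n\nAnother paragraph of rambling.\n\n"},
    ]
}
npf = delta_to_npf(delta)
```

Applied to the delta above, this produces two text blocks, the first with a link range from 43 to 54, which is the same structure as the NPF example earlier in this post.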

The deltaconvert library is open source under the Apache-2.0 license. You can find the code at https://jfhr.de/source/jfhr/deltaconvert or https://github.com/jfhr/deltaconvert.

The html2tumblr website is under the same license. You can find the code at https://jfhr.de/source/jfhr/html2tumblr or https://github.com/jfhr/html2tumblr.

I developed both on my own, and I'm far from perfect, so there are probably bugs. If you happen to find any, do let me know by writing an email or opening an issue in the GitHub repository.