
An XML Clipboard for Semi-Structured Data

The problem: Data is still frequently transmitted as semi-structured ASCII text, but applications require structured information. For example, I often want to enter an address into an electronic address book based on text from an email or web page, especially a signature. (Yes, there's vCard, but I think consistent usage is still rare, and using it requires extra steps.) Another example is entering financial transactions into a finance program from an on-line banking page - I now do much of my bill paying on-line using a web interface from my bank, and I hate having to laboriously (and error-pronely) copy each section of the transaction's output (HTML) into GNUCash by hand.

The solution I'm playing with has two parts: a) a general-purpose data recognizer clipboard that converts semi-structured text to XML, and b) support by applications for recognizing XML on the clipboard. The thought is that by converting to XML (i.e., semi-structured data that has been marked up), we've done the hard low-level data recognition work, leaving it to the applications to 'take what they can use'.

Usage scenario: consider copying the following email signature (from this example) to the clipboard:

Jim Frazier, President
The Gadwall Group, Ltd. - IT and Ebusiness Strategies
Batavia, Illinois 630-406-5861 jfrazier@gadwall.com
http://www.gadwall.com http://www.cynicalcio.com
Seminars and Training - Consulting - Publications

(I don't know him - it's just the first public signature I found.) It's easy to envision a straightforward regular-expression-based tool that pulls out the following:
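A minimal sketch of such a regular-expression-based recognizer, written in Python as an assumption (the post doesn't name an implementation language). The element names `phone`, `email`, and `url`, and the helper names `recognize` and `to_xml`, are illustrative rather than part of any standard, and the patterns only cover the shapes that appear in this one signature:

```python
import re
from xml.sax.saxutils import escape

# Illustrative patterns only; real signatures vary far more than this.
PATTERNS = [
    ("email", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("url", re.compile(r"https?://[^\s<>]+")),
    ("phone", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),  # US-style only
]

def recognize(text):
    """Return (tag, value) pairs for every pattern match, in text order."""
    found = []
    for tag, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), tag, match.group()))
    found.sort()  # order by position in the copied text
    return [(tag, value) for _, tag, value in found]

def to_xml(items):
    """Render recognized items as a flat XML fragment, one element per line."""
    return "\n".join(f"<{tag}>{escape(value)}</{tag}>" for tag, value in items)

signature = """Jim Frazier, President
The Gadwall Group, Ltd. - IT and Ebusiness Strategies
Batavia, Illinois 630-406-5861 jfrazier@gadwall.com
http://www.gadwall.com http://www.cynicalcio.com
Seminars and Training - Consulting - Publications"""

print(to_xml(recognize(signature)))
```

Names, titles, and city/state are deliberately left alone here; they need more than a simple pattern, which is exactly the kind of work a shared recognizer would save each application from redoing.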


Here's what the text copied from a bank's on-line statement might look like:

04/27/05 | Checking | Check 2067 | 2067 | $-45.00

where the columns are Date, Account, Description, Check #, and Amount. In this case the recognized data might be:

<text>Check 2067</text>
<number type="integer">2067</number>
<currency unit="dollars">-45.00</currency>

You get the point - basically it's just a set of lower-level recognized data. It's up to the application to put the pieces together in a more specific and meaningful way.
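For the bank line, the low-level recognition could be sketched roughly like this. The `text`, `number`, and `currency` elements follow the listing above; the `date` element name and the function names are my assumptions, and the date pattern only handles the two-digit `MM/DD/YY` shape shown in the example:

```python
import re
import xml.etree.ElementTree as ET

def recognize_field(field):
    """Map one column to an element, trying the most specific type first."""
    field = field.strip()
    if re.fullmatch(r"\d{2}/\d{2}/\d{2}", field):
        elem = ET.Element("date")  # hypothetical element name
        elem.text = field
    elif re.fullmatch(r"\$-?\d+\.\d{2}", field):
        elem = ET.Element("currency", unit="dollars")
        elem.text = field[1:]  # drop the leading "$"
    elif re.fullmatch(r"\d+", field):
        elem = ET.Element("number", type="integer")
        elem.text = field
    else:
        elem = ET.Element("text")  # fallback: unrecognized text
        elem.text = field
    return elem

def recognize_row(line):
    """Split a pipe-delimited statement line and recognize each column."""
    return [recognize_field(f) for f in line.split("|")]

row = "04/27/05 | Checking | Check 2067 | 2067 | $-45.00"
for elem in recognize_row(row):
    print(ET.tostring(elem, encoding="unicode"))
```

Note that the recognizer knows nothing about checks or accounts; it only labels each piece with the most specific low-level type it can find.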

  • It would be great to program custom rules using a nice scripting language, such as JavaScript (used to program Konfabulator widgets).
  • As a work-around to requiring applications to recognize XML, we might try the XML clipboard plus a mouse/keyboard macro playing program (e.g., QuicKeys).
  • There would need to be a DTD standard for this. (I'm sure one exists somewhere.)
  • Hasn't this been done (partly) by Apple's old Data Recognizers idea? A few references here and here.
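On the application side, 'take what they can use' might look something like the sketch below: an imaginary address book reads the XML from the clipboard, keeps the element types it understands, and ignores the rest. The `<clipboard>` wrapper element, the element names, and the function name are all assumptions for illustration; no such standard is implied.

```python
import xml.etree.ElementTree as ET

# Hypothetical clipboard contents; the <clipboard> root and the
# element names are assumptions, not an existing format.
CLIPBOARD_XML = """<clipboard>
  <phone>630-406-5861</phone>
  <email>jfrazier@gadwall.com</email>
  <url>http://www.gadwall.com</url>
  <currency unit="dollars">-45.00</currency>
</clipboard>"""

# The only element types this imaginary address book understands.
WANTED = {"phone", "email", "url"}

def take_what_you_can_use(xml_text, wanted=WANTED):
    """Collect values of the wanted element types; skip everything else."""
    root = ET.fromstring(xml_text)
    contact = {}
    for elem in root:
        if elem.tag in wanted:
            contact.setdefault(elem.tag, []).append(elem.text)
    return contact

print(take_what_you_can_use(CLIPBOARD_XML))
```

A finance program given the same clipboard would use a different `wanted` set (say, `currency` and `number`), which is the whole appeal: one recognizer, many consumers.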

Reader Comments (2)

This should be part of the OS. Although the idea of having a program parse text into XML is useful, it will not be universal and it will never be possible to make it work even most of the time since too often there are errors in text.

However, all forms should be copyable into XML, whether from applications or web pages. They should then be pastable anywhere. It needs to be part of the OS and it then needs to be part of web browsers.

July 19, 2010 | Unregistered Commenter Anonymous

The other thing that should be there is support for the basic way we think about and represent the world - as connections between things. I've wondered for a while now whether most of our world can be broken down into People, Places, Events, and Things. Replacing the standard file system with a semi-structured link-based one would be very cool. Yes, I care about files, but it's the context that gives them value. For example, an email message is fine, but it needs to be connected to my world. All things people are probably working on... Anyway, thanks for your comment!

July 19, 2010 | Unregistered Commenter matthewcornell
