Yesterday Don voiced his opinion that XML tagging is a broken proposition. One of his basic premises is that because tags, a.k.a. meta-data, are generated by people, and people aren't always, shall we say, obsessive about doing so, the entire system is broken. Obviously I disagree or I wouldn't be penning this post. :-)


Web 3.0, a.k.a. the Semantic Web, is going to require tagging, or meta-data if you prefer. Without it there's no good way to establish the relationships between pieces of content that form the basis for connections. But people are only part of the equation, and while they certainly form the basis for proper tagging, there are other issues that need to be addressed in order for this semantic web to work properly. While some proponents are doggedly attempting to create acceptable taxonomies from which the proper tags can be plucked, a semantic web is going to require more than just appropriate, consistent tagging.

The People Problem

Don's partially right (but only partially). Even with an extensive, consistent taxonomy the tags still need to be applied. People haven't done a great job with this thus far, so it's probably time to consider an automated, technological solution.

This isn't as far-fetched as it might sound. Most categories/tags are based on single, simple words like "performance" or "SOA" or "XML". Consider a transparent device between you and the server receiving your content that automatically inspects incoming content and compares it against a known list of tags. The device inspecting the content automatically builds a list of relevant tags and inserts the result set into the content before submitting it.

BIG-IP could do this today with an iRule, and it can do so transparently. In fact, this type of content enrichment is something that iRules is very well suited to accomplish, whether it's tagging or URI replacement or content filtering.
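The inspect-and-tag step described above can be sketched in a few lines. To be clear, this is plain Python and not an iRule; the tag list, the function names, and the "Tags:" line appended to the content are all assumptions made up for illustration, not anything BIG-IP does out of the box.

```python
import re

# Hypothetical tag vocabulary; a real deployment would load its own list.
KNOWN_TAGS = ["performance", "SOA", "XML"]

def suggest_tags(content, known_tags=KNOWN_TAGS):
    """Return the known tags that appear as whole words in the content."""
    # Compare case-insensitively, one word at a time, so "soapbox"
    # does not accidentally match the tag "SOA".
    words = {w.lower() for w in re.findall(r"[A-Za-z0-9]+", content)}
    return [tag for tag in known_tags if tag.lower() in words]

def enrich(content, known_tags=KNOWN_TAGS):
    """Append a tag line (an invented format) before passing content along."""
    tags = suggest_tags(content, known_tags)
    if not tags:
        return content
    return content + "\nTags: " + ", ".join(tags)
```

Calling enrich() on a post about XML parsing performance would append both "performance" and "XML" without the author lifting a finger, which is the whole point.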

Such a solution only addresses web-submitted content, like blogs and wiki pages. We store and ultimately search and deliver a lot more than just Web 2.0 content; sometimes we need to do the same thing for documents in multiple formats, such as PDF or Excel or PPT. Well, consider another device that transparently sits between you and the server on which you store that data, and the possibility that it could accomplish the same type of task as BIG-IP and iRules. Perhaps a device like Acopia's ARX?

But even if we use such a solution to automatically tag content there are still other issues that need to be addressed in order for the Semantic Web to be realized.

The Tomato Problem

One of the biggest hurdles that need to be overcome is the fact that while I say tuh-mey-toh, you might say tuh-mah-toh. Less pedantically, I say "tag", Bob says "class" and Alice says "category". Synonyms aren't always addressed by tagging, and they are not handled at all by search engines. The search technology that will drive Web 3.0 is semantically minded. It will understand that the meaning is more important than the actual word and search for the former rather than simply pattern matching on the latter.

Sure, everyone who practices SEO (Search Engine Optimization) makes certain to include as many synonyms as they can, but they can't (and don't) get them all. And it shouldn't be up to the author to worry about catching them all, especially as we're all essentially working off the same thesaurus anyway.

This is important. Even if we can automate or get everyone in the world into the habit of properly tagging their content, there will still be the need for "smarter" search capabilities that take the intrinsic properties of language into consideration.
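As a rough illustration of what "smarter" might mean, here is a minimal Python sketch of query expansion: the search side, not the author, looks up the synonyms. The synonym groups and function names are hypothetical, and a real semantic search would need far more than a lookup table, but it shows where the responsibility could sit.

```python
# Hypothetical synonym groups; a real system would draw on a full thesaurus.
SYNONYM_GROUPS = [
    {"tag", "class", "category"},
    {"dog", "canine"},
    {"cat", "feline"},
]

def expand(term):
    """Return the term together with any synonyms we know about."""
    term = term.lower()
    for group in SYNONYM_GROUPS:
        if term in group:
            return set(group)
    return {term}

def search(query, documents):
    """Return documents containing the query word or one of its synonyms."""
    terms = expand(query)
    return [doc for doc in documents if terms & set(doc.lower().split())]
```

With this in place, a search for "dog" and a search for "canine" return the same documents, regardless of which word the author happened to use.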

Forcing a single taxonomy isn't the solution, primarily because of the next issue that needs to be addressed: language.

The Language Problem

When discussing the problem of language some might immediately think of the differences between, say, Spanish and French and English. But the language problem is even more subtle than that. UK English and American English are not exactly the same, particularly in the area of spelling.

Keyword and pattern matching don't treat "localization" and "localisation" the same. But both are the same word, simply spelled a bit differently due to changes and localization in the language over centuries. There are a large number of these words that span the two versions of English; it isn't just a smattering, it's a large enough subset that people could not be relied upon to consistently recall all the exceptions in their own language.
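One way a search layer could paper over this is to normalize spelling variants before comparing words. The Python sketch below assumes a deliberately tiny, hypothetical UK-to-US mapping; the real list would be much longer, which is exactly why it belongs in software rather than in people's heads.

```python
# Hypothetical, deliberately tiny mapping of UK spellings to US spellings.
UK_TO_US = {
    "localisation": "localization",
    "colour": "color",
    "centre": "center",
}

def normalize(word):
    """Map a word to one canonical spelling (US, chosen arbitrarily here)."""
    word = word.lower()
    return UK_TO_US.get(word, word)

def same_word(a, b):
    """True when two words differ only by case or a known spelling variant."""
    return normalize(a) == normalize(b)
```

Here same_word("localisation", "Localization") comes back true, where a plain string comparison would call them different words.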

So not only do we have inconsistency between French and English, but we have internal language inconsistencies to deal with as well. We can hardly force the UK or the US to change their entire language to solve this problem, so some sort of mechanism to deal with these subtle issues as well as broader translation capabilities is necessary for Web 3.0 to actually come to fruition.

Tagging Isn't the Problem, It's Search That Needs Some Fixin'

Tagging itself isn't necessarily the issue. We can, and likely will, automate the process of applying meta-data for the purposes of categorization and search rather than continue to rely on the inconsistent application by users. But simply applying the meta-data consistently will not address the deeper issues prevalent in search technology today that prevent a truly semantic web from being a success.

There is a Search 3.0 on the horizon, but it remains to be seen who will figure it out first. The language barrier issue is one that's already been somewhat addressed, but only for languages that require translation. The subtle differences in English around the world have yet to be addressed, and no one appears to have presented a semantic search that addresses the synonym issue. Try it yourself. Search on any search engine first for "dog", and then "canine". Or "cat" and "feline". Note the differences in the results. Not similar at all, even though they probably should be.

It seems obvious (to me at least) that the language and synonym issues lie squarely in the realm of search. The problem is, of course, how to implement such technology without sacrificing the performance of current search technology.


Imbibing: Coffee