Thursday, October 18, 2007

Can the semantic web help with the ten challenges facing enterprise mashups?

I'm going to delay my review of QEDWiki yet again to comment on Dion Hinchcliffe's post, The 10 top challenges facing enterprise mashups. Hinchcliffe's blogs about Web 2.0 have been very influential over the past few years, and this excellent posting is no exception.

Fair warning: I’m going to use his post as an excuse to go off on a futurist binge and talk about the semantic web. Don't worry, though. I'm going to talk about the 'real' semantic web, not the ivory tower version.

I won't reiterate Hinchcliffe's points; you can, and should, read them for yourself. However, I do want to talk further about two of his challenges that I think are related, and relate directly to the power of the emerging semantic web. His #2 challenge is an immature services landscape. There just aren't enough services out there to provide mashable content. His #6 challenge relates to data quality and accuracy. How do mashers know whether the data are accurate and up-to-date?

I see these issues as interrelated. The lack of 'supported' services is driving people to create services for themselves using various tools, HTML screen scraping being the one I've been working with lately. Before you dismiss screen scraping as a viable content-creation strategy, note that the number of robots available from OpenKapow outstrips the number of services from StrikeIron and the number of APIs listed on Programmable Web combined. In short, the shortage of services is pushing people toward self-help methods that pull mashable content directly from web pages. Yet we all know that web pages often carry out-of-date data, or even absolutely trash data.

Do you know about The Greys? The Greys are a crossbreed between humans and an extraterrestrial reptilian species. By visiting this site I learned that there are over 70 distinct species of Greys. Wow! Good thing I have this website around to help me find such valuable information.

‘The Greys’ is an extreme example, but there are less silly ones. If you were scraping content from the US Open site about who was in the women's final, you would get one set of names for 2006 and another for 2007. Yet once the data is abstracted behind a service call and incorporated into a mashup, it won't be obvious that the 2006 data is out of date. Mashup users won't, and shouldn't, be able to tell where the data came from. Mashups are first and foremost about presenting a unified experience to the mashup user. Noting where each piece of data comes from makes the mashup less of a mashup and more like a plain old integration.

How can mashers solve this problem? One way is to create more supported services so mashers will depend less on tactics such as screen scraping to get their mashup content. I doubt this will work. By some estimates there are between 19 and 30 billion web pages today, and that doesn't even count dynamic pages such as search results from the Snap-on Tools site. We aren't going to create web services to expose reliable data for all of those pages. People who need mashable content are going to get it where they can, and that means web pages themselves.

Another way to help with the data reliability problem, and this is where I think the web is going, is to start leveraging the capabilities of the semantic web. I’m talking about the practical semantic web that is emerging from the likes of Facebook and Amazon, not the ivory tower semantic web with volumes of ontologies, deductive rules and AI searches. Some call this emerging web “Web 3.0,” and some say the ivory tower version of the semantic web is “Web 3.0.” Personally, I don’t care what we call it, but I’m excited about what it is, or rather, what it can become.

To backtrack, the semantic web is a way of structuring web content so that it can be consumed both by humans and by machines. Most web content today is only consumable by humans. (Irony. It's everywhere.) That’s why we get so many trash results even from the greatest search engines. In the ivory tower version, every web page has both semantic information (what the information on the page means) as well as content. The semantic information makes the page machine consumable. A phone number is a phone number is a phone number. Once a program knows the content is a phone number, it knows how to handle said content.

In theory, but not in reality, since there are many ways to tag and format a phone number.
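To make that concrete, here is a minimal sketch of what a scraper faces today. The two markup styles and the class/id names are invented for illustration; neither is a real standard. The point is that without shared semantics, every new markup convention needs new extraction code.

```python
import re

# Two hypothetical ways the same kind of data (a phone number)
# might be marked up on different sites. Both conventions are
# made up for this example.
microformat_style = '<span class="tel">+1 415 555 0100</span>'
ad_hoc_style = '<td id="phone">(415) 555-0100</td>'

def extract_phone(html):
    """Pull the text of any element hinted as a phone number.

    We must enumerate every convention we know about -- which is
    exactly the problem the semantic web is meant to solve.
    """
    patterns = [
        r'class="tel"[^>]*>([^<]+)<',   # microformat-like class hint
        r'id="phone"[^>]*>([^<]+)<',    # ad hoc id hint
    ]
    for pattern in patterns:
        match = re.search(pattern, html)
        if match:
            return match.group(1).strip()
    return None

print(extract_phone(microformat_style))  # +1 415 555 0100
print(extract_phone(ad_hoc_style))       # (415) 555-0100
```

A page using a third convention would silently return nothing, and the scraper's author would never know until a user complained.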

In practical terms today, web content is being slowly categorized by various tag clouds on social networking sites and blogging sites such as the one you’re visiting now.

Today these clouds are disaggregated, without any sort of consistency. However, while it is unlikely we will get universal acceptance on what amounts to a tag dictionary, it is highly likely we can get universal acceptance of a small number of tags. This has already happened in specialty areas such as research libraries. Imagine a rating tag being adopted by all tag clouds so site visitors can rate the quality of a web page à la Digg, or an expiration date so mashers know when content is out of date, or even a copyright tag telling mashers the page is off limits for scraping. Not that mashers would pay attention.

Imagine a world where a masher pulling content from a web page through HTML harvesting of some sort could be given a rating indicating how reliable the data is likely to be. And even with disparate tag clouds, it would be possible for mashing tools to suggest alternative content pages. Imagine a world where the mashup itself could warn users if content quality degrades below some acceptable level.
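A sketch of what that warning check might look like, assuming pages carried the kind of tags imagined above. The tag names `x-rating` and `x-expires`, and the sample page, are entirely hypothetical; no such standard exists today.

```python
import re
from datetime import date

# Hypothetical page carrying the community tags imagined above.
page = '''
<meta name="x-rating" content="4.2">
<meta name="x-expires" content="2007-09-09">
<p>Women's final: Henin d. Kuznetsova</p>
'''

def content_quality(html, min_rating=3.0, today=date(2007, 10, 18)):
    """Return (usable, reason) for scraped content, based on its tags."""
    rating = re.search(r'name="x-rating" content="([\d.]+)"', html)
    expires = re.search(
        r'name="x-expires" content="(\d{4})-(\d{2})-(\d{2})"', html)
    if rating and float(rating.group(1)) < min_rating:
        return False, "community rating too low"
    if expires and date(*map(int, expires.groups())) < today:
        return False, "content past its expiration date"
    return True, "ok"

usable, reason = content_quality(page)
print(usable, reason)  # False content past its expiration date
```

Here the page is well rated but expired, so a mashup tool could flag the 2006-style stale-data problem automatically instead of silently serving last year's final.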

Finally mashers and mashup users would be able to have some indication whether they are getting the latest scores, the most reliable news or the best information on extraterrestrial species.

OK, this is all for the future, but perhaps the not-too-distant future.

I’ll see what I can do to convince Serena Software to start thinking about the semantic web and how we can use it to help business mashers. Meanwhile, go give Hinchcliffe's blog post a thumbs-up vote.


Mike said...

I agree, I think semantic web technologies will become an important part of the enterprise mashup solution, but as you note, we have to get away from the ivory tower version, and get back to pragmatic applications of the technology.

I'm not sure I fully agree that the mashup user shouldn't know where the data came from, since mashups are very much an integration, but with a focus on usability. This ties back to the data quality and reliability issues Dion mentions. In order to trust and verify the data, someone has to be able to audit the flow, although not all end users will care. (I've written about this problem in detail at …)

I like the rating idea! As we see more data services appear inside and outside the firewall, end user feedback will probably become as important as IT endorsement of various sources, and there is potential synergy between them, since IT can also rate them, and might have greater authority.

Shaw said...

Thanks for your comment, Mike, and for your blog link. I enjoy your posts.

The data quality problem is a tough nut to crack. Should mashup users care about the quality of their data? Well, yes, in theory. Will they care? Likely not until they get bitten by bad data. Tony Baer wrote about this a few weeks back, and I commented on it here.

The traceback idea is interesting, but I suspect that in some cases at least, data will be mixed, remixed and mixed yet again until the original source is completely obscured. Even enterprise data is suspect, which is why enterprise data governance has become a hot issue lately. When we reach out to the whole world, I'm just not sure the traceback idea is practical.

Your central point is well taken, however. Mashup users should care about the source of their data, even if they want it to be seamlessly integrated across multiple sources. I retract that part of my post.