Blogging: how we watch our navels
Filed in: WWW2004, Web, Fri, May 21 2004 22:15 PT
This is a conference blog post about a presenter talking about blogging. Love my navel. Looooooove my navel. Who’s a pretty navel? You are!
Daniel Gruhl of IBM Research presented on “Information Diffusion Through Blogspace”. He described blogging as “just on the cusp of being absolutely universally well-known” and “the greatest thing for exhibitionism… since windows.” Some blogs die as soon as they’re created, and some are updated with “pathological intensity.”
The researchers gathered up 11,804 RSS feeds last September and October, comprising 401,021 entries. There were 14 news syndication sites included for comparison.
There’s related work in topic tracking and characterization, epidemiology (blogging is a virus!), network models and graphs, and how information is diffused, including threshold models (after how many of my connections mention a topic do I start talking about it?), and other things that are used in viral marketing. They tracked URL references, noting that while 100,000 distinct links were in the corpus, only 700 appeared 10 times or more. (Power laws.) They looked at recurring text sequences, but didn’t get much value out of that. Likewise names of known personalities. Proper nouns were more numerous, as were individual words, but they had to find value in those by comparing the daily frequency of the words against historical frequencies.
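To make the "daily frequency against historical frequency" idea concrete, here is a minimal sketch in Python. It is not the researchers' method: the trailing 7-day window, the z-score test, the threshold of 3, and the name `spike_days` are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def spike_days(daily_counts, window=7, threshold=3.0):
    """Return indices of days whose count is far above the trailing average.

    Hypothetical helper: the window size and threshold are illustrative
    assumptions, not values from the talk.
    """
    spikes = []
    for day in range(window, len(daily_counts)):
        history = daily_counts[day - window:day]
        baseline = mean(history)
        # Fall back to 1.0 so a perfectly flat history does not divide by zero.
        spread = pstdev(history) or 1.0
        if (daily_counts[day] - baseline) / spread >= threshold:
            spikes.append(day)
    return spikes

# Made-up daily mention counts for one word.
counts = [4, 5, 3, 4, 6, 5, 4, 5, 30, 12, 5, 4]
print(spike_days(counts))  # prints [8]
```

Only day 8 is flagged. Day 9 is still elevated, but by then the spike itself has inflated the trailing baseline, which is one reason a longer historical window might be a better choice in practice.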
They came up with a class of 340 “classic” topics. They broke down into spikes (one-shot deals, things that don’t get discussion over a longer period of time), chatter (low-volume but consistent topics, like “Alzheimer’s”), and “spiky chatter” (frequently mentioned, and likewise prone to spikes, such as “Microsoft”). Spiky chatter was most interesting to them. Once they’ve isolated one of these terms, they try to split out subtopics based on other terms, so they can try to normalize the overall graph. Spikes usually last no more than one or two weeks.
So, how do we bloggers infect one another? (Insert MeetUp joke here.) He described the echo chamber, and used a formal transmission graph to show how ideas propagate. IBM built a propagation model: each connection has two parameters, including probability that an idea-infectable person reads a given blog on a given day. If that doesn’t happen, it doesn’t get spread.
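A toy simulation may help picture that kind of propagation model. This is a sketch under stated assumptions, not IBM's implementation: the talk notes above name only the reading probability, so the second per-connection parameter used here (`copy_prob`, the chance that a reader who saw the topic then writes about it) is an assumption, as are the function name and the data layout.

```python
import random

def simulate_spread(edges, seed_blogs, days, rng):
    """Simulate a topic moving across blog-to-blog connections.

    edges maps (source, reader) -> (read_prob, copy_prob).
    read_prob: chance the reader reads the source on a given day.
    copy_prob: ASSUMED second parameter, the chance the reader then posts.
    Returns a dict of blog -> day on which it first posted the topic.
    """
    infected = {blog: 0 for blog in seed_blogs}
    for day in range(1, days + 1):
        newly = {}
        for (source, reader), (read_prob, copy_prob) in edges.items():
            if source not in infected or reader in infected or reader in newly:
                continue
            # If the reader does not read the source today, nothing spreads.
            if rng.random() >= read_prob:
                continue
            # The reader saw the topic; decide whether they write about it.
            if rng.random() < copy_prob:
                newly[reader] = day
        # Apply new infections only after the day ends, so a topic moves
        # at most one hop per day.
        infected.update(newly)
    return infected

# Hypothetical three-blog network with made-up probabilities.
edges = {("a", "b"): (0.5, 0.3), ("b", "c"): (0.2, 0.5), ("a", "c"): (0.1, 0.1)}
print(simulate_spread(edges, ["a"], days=14, rng=random.Random(2004)))
```

In this sketch a reader gets a fresh chance to pick up the topic every day the source is read, which is a simplification made for brevity.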
Is it possible to track the echo chamber? That is, can you determine one’s sphere of influence from the people who spread their memes? They presented an algorithm (and again, I napped. Sorry, I’m not the math guy.) But he did show a picture. Using a transmission graph, he showed how he’s trying to determine which of a number of bloggers infected a given person later on the timeline. The algorithm iterates over the data to arrive at the variables in the graph.
Then they applied it to the blog data. Items were restricted to things not found in the media to reduce noise. The top 100 blogs had a high correlation with blogstreet.com. Then the researchers went back and manually looked at certain bloggers to try to conclude whether indeterminate cases could be resolved (that is, whether two bloggers knew each other), and found that in 90% of cases this was true. The chart that resulted looks — shock! — like the power law curve. There was a measure of “fanout” in which the vast majority were lower than 1 (energy of a story dies before spreading), and very few with powerful capability to spread memes.
Question about out-of-band communication, like mailing lists and so on. He said he’d like to add an item of “the outside world” to the study (ouch. Sounds like he knows us.) Did he consider using Trackback? No. It wasn’t available broadly enough.
(This conference is going to have a serious blogging tilt in 2005. Everybody is seeing blogging as a usage pattern that they really, really want to study. And in the next year, they’ll be seeing just how huge it’s going to get. Start preparing your papers.)
Semantic Web applications
Filed in: WWW2004, Web, Thu, May 20 2004 16:58 PT
How to make a Semantic Web browser
Dennis Quan of IBM Research presented Haystack, which was featured in Tim Berners-Lee’s keynote.
The baby step before automated agents is a user agent that focuses on personalizing information, to use the Semantic Web to deal with info overload. (Yes, please.) A Semantic Web user agent would bring related information together and put it into perspective for the user by creating relevant visualizations. The lesson from the user agents of today is that you have to provide decentralized access to information, and make it easy enough for Grandma to use it.
From this, Haystack. It allows users to go and suck down lots of data and rearrange it into views. He demonstrated bioinformatics, which is a rich area for data (in fact, W3C just got a fellow from a group called I3C which is working on all kinds of bio-science stuff). He pulled up one chunk of data, being ordered visually so people can try to make sense of it. Then he graphed it. As it goes on, it pulls in more data as it builds.
There are public databases of knowledge already out there, like the TAP project at Stanford, and IRC bots recording data. You can go and play with things like this, if you have a gig of free disk space and half a gig of RAM. So, not quite ready for prime-time, and not quite user-friendly enough yet, but at least there is a vision of how to sort data in the open world.
Semantic Email
Luke McDowell of the University of Washington presented a paper on semantic email based on a project conducted at UW.
The group set a baseline for what it would take in order to make semantic-based email transactions work broadly. It needs to be instantly gratifying to users. It needs to accommodate gradual adoption. And it needs to be easy to use. Semantic email processes provide at least the first of these.
Challenge #1: Process creation. They use templates, based in RDF, that are created by a simple Web form.
Challenge #2: Facilitating responses. They use text-based email forms that are then managed by a central server.
Challenge #3: Human/Machine Interoperability. Enabling either humans or software agents to handle semantic email messages. They have both human-readable and coded messages to allow either to respond.
This has been used for RSVPs, first-come-first-served situations, potlucks, consensus finding, and other areas like voting.
Why semantic email? Well, it may be able to determine certain things, such as that so-and-so probably can’t attend at all, based on her calendar. That is, to predict responses. It can interpret responses, and it can recommend points where intervention may be necessary. Say, I have a potluck, and there are too many desserts. This will need to be resolved.
Responses that violate the constraints of a situation have to be rejected. But then there are issues to resolve in determining whether constraints have been violated. The paper explains how to fix that.
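As a rough illustration of rejecting a response that would violate a constraint, here is a minimal Python sketch of the potluck case. It is not the UW system: the per-category limit, the function name `handle_response`, and the list-of-dishes representation are all hypothetical.

```python
from collections import Counter

def handle_response(accepted, dish, limits):
    """Accept a potluck response unless it would exceed its category limit.

    accepted: list of dish categories accepted so far (mutated on success).
    limits: hypothetical per-category maximums, e.g. {"dessert": 2}.
    Returns True if the response was accepted, False if rejected.
    """
    counts = Counter(accepted)
    if counts[dish] + 1 > limits.get(dish, float("inf")):
        return False  # Rejected: accepting it would violate the constraint.
    accepted.append(dish)
    return True

accepted = []
limits = {"dessert": 2}
for dish in ["dessert", "dessert", "dessert", "salad"]:
    print(dish, handle_response(accepted, dish, limits))
```

The third dessert is turned away and the salad goes through. A flat rejection like this is exactly the behavior that ignores the cost of saying no, which is the gap the paper goes on to address.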
What’s missing is that it doesn’t really define success, only various situations for failure, it ignores the costs of rejection (if the boss isn’t in, maybe it’s worth meeting anyway, or not), and assumes good actors.
They’ve tried to fix that by specifying a reward based on outcome, adding suggestions to those who have rejected, and coming up with a probabilistic approach to get a positive outcome.
Topic-oriented blogging
Filed in: WWW2004, Web, 16:48 PT
Judit Bar-Ilan of Bar-Ilan University (what were the odds?) presented “An outsider’s view on ‘topic oriented’ blogging”.
(We’ll skip the “what is a blog” slide, k?)
Bar-Ilan credits Tim Berners-Lee with being the “father of the blogs” based on his status report from 1992. (I’m pretty sure Tim’s not concerned with the title.) “The blogging community is very much aware of itself.” (Amen, sister.) She was actually talking about things like Daypop, Technorati and Trackback. The blogosphere is good at disseminating information through the creation of RSS feeds and services like Blogdex and Feedster. And we permalink, which is good. (It’s even better when your permalinks are actually permanent, by the way.)
The Pew Internet Project says between 2% and 7% of American Internet users have blogs, and 11% read them. Meanwhile, Cyberatlas reports 2.4 to 2.9 million blogs as of June of 2003, but only half are active. Perseus, meanwhile, reports 4.12 million blogs, but only a third of them are active.
Bar-Ilan browsed “topic-oriented” blogs over the course of two months late last year. They included three library and information science blogs, three on networking and social science, three Web-oriented blogs, three computer-science blogs, and three usability blogs. She tracked statistics on updates and links. The average number of posts per day ranged from 0.11 to 2.74. Maximum number of posts in one day was 11. All of those sampled took at least a day off. (So we do have lives. huh.) Range of links per post went from 0.54 to 3.65, ranging up to 31. Comments were disabled in 7 blogs. Where in academia the saying goes “Publish or perish,” she suggests for the blogosphere: “Link and be linked or perish.” Only one single blog commonly didn’t link to anything or anyone else.
Sebastian Paquet’s blog (which was at the top of almost all of these ranges) had 335 inbound links, closely followed by Aaron Swartz’s Google blog with 325. For all but one, the majority of the posts were topic-oriented. Four of the fifteen had extensive postings, while the others were shorter.
Of the links, 13.4% linked to news. Another 12.9% linked to a content site or page, 12% to blogs, and 6.4% to specific blog entries. Some 27.6% of linking blog entries had a quote, and 35.8% had some kind of discussion or comment on the target.
In her conclusion, she says “We still do not know why people read blogs, but we can recommend reading professional, topic-oriented blogs.” (Man, I need to submit for next year.)
Conference etiquette
Filed in: WWW2004, vent, 16:44 PT
I am frequently irritated by people who do not know how to conduct themselves at a conference. When I get irritated, I have trouble paying attention. So, in an effort to cleanse myself of bad feelings so that I can listen to the speakers again, these are the people I’m targeting, and what I desire to happen to them.
People broadcasting computer-to-computer wireless networks should be punched in the head. Especially when they blow up everybody else’s network access.
People who eat with their mouths open while in a session, or who sniffle every ten seconds rather than simply blowing their damn nose, likewise deserve bad things.
And people who after 20 years of cell phone technology have not yet learned to at least put their phones on mute during presentations — or worse, who answer the phone and have a conversation inside the session — should have their talking privileges revoked by any means necessary.
HearSay Audio Browser
Filed in: WWW2004, Web, Wed, May 19 2004 20:00 PT
Amanda Stent reported on her work on a second-generation audio Web browser. HearSay is a prototype of an audio browser for users with visual disabilities.
Existing audio browsers have “several shortcomings”: they’re read out with little or no user-driven selection or ordering of content. Or they’re targeted to low-vision users, or use a walled-garden approach to get to a small number of sites.
HearSay’s approach is to determine function based on the structure of a document. It works well on template-based sites. It does a structural analysis of the document to discover semantic structures. (Cool.) It has been determined that semantically-related items exhibit consistency in presentation style and spatial locality. (And then the speaker showed mathematical equations, and I napped.)
If HearSay can find stuff in the DOM tree that it can use for semantics, it does that. If not, it looks for heuristics, and then annotates that portion of the DOM tree as a place to look. It then separates things into “partitions,” something which has been done successfully on a number of news and e-commerce sites.
When this was tested, it was found that task completion was even with a traditional browser. However, it took 4 times longer than a visual interface, because of rate of speed and mode errors. Users preferred it over JAWS, but would like non-speech input methods.
There are additional features being worked on like bookmarking items within a document (apparently including those without IDs) with a voice command. (Annotation is something the W3C has been working on for years. It’s good to see a decent application of this being requested by users.)
Web Accessibility: A “Broader” View
Filed in: WWW2004, Web, accessibility, 19:16 PT
John Richards of IBM Research presented under the title “Web Accessibility: A Broader View”. (This always makes an accessibility geek like me nervous, since “broader” views like this usually translate in the minds of listeners to, “great, these eggheads are solving the problem, so I’ll just stonewall when it comes to accessibility, and hope it gets fixed before the shit hits the fan.” Something similar happened earlier this year at the W3C Technical Plenary. So, to clarify: Yes, you as an author still have a responsibility to make your content accessible. Automated techniques are going to help, but they will not do your dirty work for you. Trust me. They’ve been trying for thirty years.)
Web accessibility is not just about recreational surfing. In some cases, it’s a lifeline (e.g., buying groceries when you can’t leave the house). In others, it’s about Web mail, messaging, and e-commerce. And we’re seeing the desktop and the browser morphing together.
The cost to comply with Web accessibility standards is high. (Well, it is and it isn’t. If you did it right to begin with, your cost is going to be low, and stay low.) Ownership of content has been haphazard, and the requirements have been on legal grounds, since there are “not that many” users with disabilities. (Oh, whatever. This is just an attempt to redefine disability to suit the approach cited in the paper.)
The goal, then, is to reduce the cost of remediation through technology. Moving from checking tools to automated repair tools, from source modification to on-the-fly transformation, and from hardware to software input adaptations. This would increase market adoption by addressing a larger body of users. (I interpret this to mean, companies don’t care about accessibility until they can buy it in a box, and if we can just abstract all these hairy, ugly problems called “users,” that would help us sell it to them. It’s safe to say I have issues with this.)
Let’s take older adults, because they have more to spend. There are benefits for limited dexterity, non-native-language speakers, and those with cognitive disabilities. And there’s situational disability, like poor visibility or constrained ranges of motion, that we all experience. Older adults are a wild card. They have multiple and fluctuating levels of disability, but don’t often consider themselves as being “disabled”. They want a standard browser, want to use the entire Web, but can’t specify their needs concretely.
Older adult users have the same problems, and they addressed them in the same way as with any other user with disabilities, but with an interface that has a less disability-oriented slant than usual. (While I disagree with the motives I can discern here, I have to say this is a really good idea. People should feel comfortable managing their Web sites, rather than treating them like they are printed in indelible ink and Not To Be Modified.)
The researcher’s project is in the form of an IE plugin that pops up menus for setting text size on the fly, color schemes, and so forth. (This is a good idea. Fonts on sites change, and the side effects of such do as well. It’s good to have a way to change things like this more interactively.) A lot of these changes can be done through the browser, but “how many people know how?” (Yup. Not a lot of people know how to set a user style sheet, unfortunately.)
The problem was, nobody could discover the settings icon they created. (Whee! Icon blindness!) So they had to create documentation to show how it’s done. (Which, unfortunately, has the same problem as setting the original preferences: nobody actually studies them. Much better to set a preference right at the start that says, “I don’t want any fonts smaller than x.”)
So they started working on a proxy. It was slow, error-prone, insecure, and had problems with copyright. And it didn’t work for a lot of people. 40% of users used the “speak text” feature; in fact, it was the most-used feature. The system leverages good, accessible content and user agents, rather than trying to boil the ocean of junky, broken content that’s out there. (Which is to say, make good accessible content and a system like this, once they make something that’s usable and effective, and fewer people will have problems. I always hate to say it, but accessibility is not going to go away until everybody picks up a shovel and starts digging. If you were hoping IBM fixed it for you, it’s really you who needs to take a “broader view.”)
Blogging WWW 2004
Filed in: WWW2004, Web, 16:00 PT
So, since people have been asking me if I’m going to blog the WWW 2004 conference, I guess that means I’m blogging the WWW 2004 conference.
The ground rules for reading these entries: Tangents may occur. Quotes are as accurate as I can remember or read from the slide. This is not an official transcript. But I will try to be as accurate as I can in the sessions I blog. My bias is apt to show through, but explicit statements of mine will be in parentheses. People whose points I’ve missed are welcome to smack me down on the conference grounds. Other than that, sit back and enjoy the ride.
WWW 2004 keynote: Tim Berners-Lee
Filed in: WWW2004, Web, 16:00 PT
The keynote, as usual, is my boss’s boss, Tim Berners-Lee. Feel free to browse the keynote slides.
Tim is talking about domain names of late. Now that the .mobi and .xxx TLDs, among others, have been proposed, though, he has a problem with their intent. Lots of new TLDs have been added, such as the commercialization of national TLDs, and the new .info and .biz. It’s resulted in a whole lot of domains getting bought just to redirect to one central domain. With the new proposed TLDs, companies would have to buy all these other ones just to maintain their brand (example: amazon.xxx).
Good domain management is governed fairly, technically sound, and has value for the overall community. It isn’t just to score easy cash.
One proposed TLD was .xxx. “We’ve been here before” with PICS. And they did it in a way such that one central authority (see: Washington DC) couldn’t determine what is and is not porn. There are other ways to skin this proverbial cat.
.mobi is another issue: they want another TLD to encourage the creation of content for mobile phones and PDAs. But there are also systems that do this without a TLD. “The important thing about the Web is its universality.” It’s hardware-independent, OS-independent, network-independent, independent of language, culture, disability, “quality” of information, and economics of the country of origin.
.mobi “breaks the Web” because it splits the Web into two parts. Say you find a resource on your phone. Can you reuse it on your laptop? Nope. Another question. What’s “mobile”? How small is a small screen? What happens when they get a keyboard? How short is a short attention span? How low is low-bandwidth? We’ve got medium-specific identifiers in CSS to let you design this stuff in one glorious unified format. There’s content negotiation to point different devices to different views, and CC/PP to set profiles of device capabilities. And there’s the separation of form and content we accessibility wonks complain about all the time. (See? There is a bonus for doing things accessibly!) Selling mobile sites is a marketing issue to be solved, not a call for a new TLD.
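To show roughly what content negotiation means here, the following Python sketch picks one representation of the same resource based on the media types a client says it accepts. It is a simplified illustration, not a full HTTP implementation: it ignores wildcards such as `*/*`, and the function name and file names are hypothetical.

```python
def choose_representation(accept_header, available):
    """Pick the available media type the client prefers most (highest q-value).

    accept_header: value of an HTTP Accept header.
    available: dict mapping media type -> representation (hypothetical files).
    Simplification: wildcard media types are not handled.
    """
    preferences = []
    for part in accept_header.split(","):
        fields = part.strip().split(";")
        media_type = fields[0].strip()
        quality = 1.0  # A missing q parameter means full preference.
        for field in fields[1:]:
            name, _, value = field.strip().partition("=")
            if name == "q":
                quality = float(value)
        preferences.append((quality, media_type))
    # sorted() is stable, so header order breaks ties between equal q-values.
    for quality, media_type in sorted(preferences, key=lambda p: -p[0]):
        if quality > 0 and media_type in available:
            return available[media_type]
    return None

available = {
    "text/html": "page.html",
    "application/vnd.wap.xhtml+xml": "page-mobile.xhtml",
}
# One URL, two devices, two views of the same resource.
print(choose_representation("application/vnd.wap.xhtml+xml, text/html;q=0.5", available))
print(choose_representation("text/html", available))
```

The point of the sketch is that the address stays the same on the phone and on the laptop; only the representation handed back differs.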
Separating form and content taken to the extreme: ship the raw data alone, and map it to the user interface. This is called the “Semantic Web”. Lots of papers are floating around this conference on the topic of Semantic Web browsers, where they can interpret ontologies, and traverse different relationships. For example, using FOAF, my address book, and a meeting ontology, you can set agendas without everybody having, say, the same version of Outlook for it all to work.
Tim’s challenge to the audience is an “extensible open framework for the Semantic Web browser” (emphasis in original). That’s not just an idle challenge: “I want them done!”
So far, in Phase 1 of the Semantic Web, “it was a time of constraint”: everything was represented in triples. The RDF and OWL languages have resulted in this phase, both becoming W3C Recommendations.
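For readers who haven't met triples before, here is a minimal Python sketch of the idea: every statement is a (subject, predicate, object) tuple, and a query is a pattern with wildcards. The names and data are made up, and real RDF uses URIs and proper parsers rather than bare strings, so treat this as an illustration only.

```python
# Each fact is one (subject, predicate, object) triple. Data is hypothetical.
triples = {
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "worksFor", "ExampleOrg"),
}

def match(store, subject=None, predicate=None, obj=None):
    """Return the sorted triples that fit the pattern; None is a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )

print(match(triples, predicate="knows"))
print(match(triples, subject="alice"))
```

Because everything has the same three-part shape, data from unrelated sources can be poured into one store and queried together, which is the constraint that makes the later, looser phase possible.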
Phase 2 is different: it’s a time of less constraint. Lots of RDF-based tools are stretching upward from the foundations, being used for various different applications, tied together with RDF and OWL.
A Semantic Web agent will handle only certain types of inference. It will have access to only certain sources of data. It will be aware of the provenance of the info it receives. It will be able to exchange data with anything, exchange rules of inference with similar agents, and exchange proofs of anything it learns.
What does it connect? Lots of things. He’s got a couple dozen different individual details which are daisy-chained together to show how things actually interoperate (or could, anyway).
Tim suggests a few ideas for bootstrapping a Semantic Web project. Put things together from data, not by marking things up in RDF by hand. Don’t change existing systems over to RDF immediately, but try some RDF adapters. Try combining data from existing systems that haven’t been connected. Then try running rules on them, or explore the data using OWL to find relationships you may not have known before. The challenge here is to show the “first genuinely serendipitous associations” of data.
He challenges academics and industry people to learn how to speak each other’s language to bridge the Semantic Web’s culture gap.
Tim puts forward the idea of an RDF clipboard that can determine the rules in disparate domains (e.g., dragging your bank statement onto the calendar) and having it figure out how to reconcile that in the interface (showing your purchases in a given time frame).
There’s a chart Tim shows called the Semantic Web bus, but this time, in addition, he showed an actual bus that will be driving around Spain in October, wrapped in the W3C logo, complete with workstations and WiFi, to show the Web’s full potential.
Questions (I love it when a keynote speaker actually takes questions!): One was about the formats of these data structures (like bank statements): Tim says people should be encouraged to work on the format the way they like it. If something else creates a different one, fine. You can decide that you’ll use that one later. No harm, no foul. The message it seems he’s putting forth is that it’s not the same kind of issue as the near-catholic (small c) adherence to today’s data file formats, since they’re all fluid, and putting them together patchwork-style should work, given the underpinnings of the Semantic Web.