Live Web, Real Time . . . Call It What You Will, It's Gonna Take A While To Get It

In "Live Web, Real Time . . . Call It What You Will, It's Gonna Take A While To Get It," published on June 30, 2009, Mary Hodder examines the persistent challenges and evolution of "live web search," also called "real-time search." Drawing on her extensive experience in the field since 1999, Hodder argues that despite the rebranding and the new startups, the fundamental problems of real-time search remain largely unsolved.

The Problem of Real-Time Search: Hodder defines "live web search" as search that incorporates time as a crucial element, a term she notes was coined by Allen Searls. She contrasts this with "recent search," which she believes is what many new startups actually offer in place of true real-time search. The core issue is filtering the vast, dynamic flow of information from the web and presenting it in a meaningful way.

Historical Context and Evolution: The author traces the history of this problem back to the late 1990s, mentioning early publishing systems like Blogger and Movable Type, and discovery systems like Technorati, Sphere, and Pubsub. She highlights her own work at UC Berkeley, culminating in her master's thesis on live web data search and topic discovery. Hodder points out that even in the early 2000s, the value of time as a missing element in traditional search was recognized, but the practical implementation remained elusive.

Critique of Current Approaches (as of 2009): Hodder criticizes the prevailing approaches of the time, noting that many systems offer a reverse chronological view, often mixing various data sources like Twitter, blogs, and photos. While some provide context (e.g., activity counts, trend duration, histograms), they often suffer from susceptibility to spam and a lack of quality filters. She observes that new companies seem to be repeating the mistakes of earlier ones, failing to build upon past learnings.
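The context signals mentioned above (activity counts, trend duration, histograms) can be derived from nothing more than item timestamps. The following is a minimal sketch, not a description of any particular 2009 system; the function names and the choice of hourly buckets are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def hourly_histogram(timestamps):
    """Count items per hour bucket (hypothetical helper, hourly granularity assumed)."""
    buckets = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    # Return buckets in chronological order so they can be plotted directly.
    return dict(sorted(buckets.items()))


def trend_duration(timestamps):
    """Time between the first and last item seen for a topic."""
    if not timestamps:
        return timedelta(0)
    return max(timestamps) - min(timestamps)
```

Note that neither signal says anything about quality: a spam burst produces the same histogram as genuine interest, which is the weakness Hodder points to.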

The Role of Time and Filtering: A central theme is the importance of time as a filter. Hodder explains that users are overwhelmed by the sheer volume of information and feel anxious without the editorial curation found in traditional media, and that better filters could alleviate both problems. She uses the example of searching for a "giant sea creature" versus a "massive squid" to illustrate the semantic leaps users must make, and the difficulties they face, during discovery.

Data Structure and APIs: Hodder emphasizes the advantage of structured data, citing Twitter's API as an example. Structured data, where the time of each entry is known, facilitates more efficient search and discovery compared to traditional web crawling, which only knows the time a page was spidered. She stresses the need for the right data models in backend databases to derive meaning and link metrics.
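The difference between a known entry time and a spidering time can be made concrete with a small data model. This is a sketch under stated assumptions: the `Entry` class, its field names, and the `in_window` helper are hypothetical and do not come from Hodder's article or from Twitter's API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Entry:
    source: str
    text: str
    crawled_at: datetime                      # when the system fetched the item
    published_at: Optional[datetime] = None   # known only for structured sources

    def effective_time(self) -> datetime:
        # Prefer the real publication time; fall back to the crawl time.
        if self.published_at is not None:
            return self.published_at
        return self.crawled_at

    def time_is_exact(self) -> bool:
        return self.published_at is not None


def in_window(entries, start, end):
    """Return entries whose effective time falls in [start, end), newest first."""
    hits = [e for e in entries if start <= e.effective_time() < end]
    return sorted(hits, key=lambda e: e.effective_time(), reverse=True)
```

An entry from a structured source lands in the window when it was actually written, while a crawled page lands in the window when it happened to be fetched. Keeping both fields, plus the `time_is_exact` flag, lets a ranking layer treat the two cases differently.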

Power Law Effects and Metrics: The article discusses the problem of "power law effects" in search metrics. If a system prioritizes metrics like click-through rates, popular items can stay at the top artificially, regardless of their actual relevance or quality. Hodder argues that metrics should not be applied uniformly across all topics or circumstances without context. She uses the example of Om Malik's authority on broadband issues versus other topics to illustrate how generic metrics can make information quality effectively random. She advocates for topic-specific communities and editorial filters driven by human judgment as time and energy savers.
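The contrast between a uniform popularity metric and a topic-aware one can be sketched in a few lines. Everything here is an illustrative assumption rather than Hodder's method: the authority table, its weights, the default weight, and the log dampening of clicks are all hypothetical.

```python
import math


def global_score(item):
    """Uniform metric: raw clicks, the same for every topic."""
    return item["clicks"]


def topic_score(item, topic, authority, default=0.1):
    """Topic-aware metric: clicks are dampened and weighted by per-topic authority.

    `authority` maps (author, topic) pairs to a weight; the values are hypothetical.
    """
    weight = authority.get((item["author"], topic), default)
    # log1p dampens very large click counts so popularity alone cannot dominate.
    return weight * math.log1p(item["clicks"])


def rank(items, score):
    """Sort items from highest to lowest score."""
    return sorted(items, key=score, reverse=True)
```

With a table that gives one author a high weight only for "broadband", that author's posts rise for broadband queries but get no boost elsewhere, whereas `global_score` would rank the most-clicked item first on every topic.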

Challenges and Future Outlook: Hodder acknowledges that solving live web search, real-time search, and discovery is a complex, ongoing challenge. She expresses a personal need for these solutions and sees a huge opportunity in algorithmically replicating human editorial filters while managing user-generated content. The core challenge lies in balancing the "mobs' activities" with effective filtering.

Key Takeaways:

Real-time search is a complex problem that has persisted for over a decade.
The distinction between "live web search" (time-centric) and "recent search" is important.
Effective filtering and context are crucial for managing information overload.
Structured data and APIs (like Twitter's) offer advantages over traditional web crawling for time-sensitive information.
Power law effects in metrics can skew search results; context-aware filtering is needed.
The ultimate goal is to algorithmically replicate human editorial judgment while managing user-generated content.