Thursday, September 24, 2009

Why Big Data & Real-Time Web Are Made For Each Other

There's been a lot of discussion lately about the real-time web and the problems it poses for incumbent search companies and technologies. Fast-moving trends and the availability of up-to-the-minute updates mean that purely historical answers are missing crucial information. Dealing with constantly growing information streams causes performance and scalability problems for existing systems and calls into question the mechanisms for compiling, vetting and presenting results to users.

While these challenges may sound new, game-changing performance and scalability problems are also being faced in the more traditional realm of data analytics and large-scale data management. Driven by network-centric businesses that track user behavior to a fine degree, there has been an explosion in the speed and amount of information that companies need to make sense of, and increasing pressure on them to do so faster than ever before. What needs to be recognized is that the inadequacies of existing systems in these two seemingly different environments stem from the same source: infrastructure built to handle static data simply doesn't scale to data that is continuously on the move.

The information stream driving the data analytics challenge is orders of magnitude larger than the streams of tweets, blog posts, etc. that are driving interest in searching the real-time web. Most tweets, for example, are created manually by people at keyboards or touchscreens, 140 characters at a time. Multiply that by the millions of active users and the result is indeed an impressive amount of information. The data driving the data analytics tsunami, on the other hand, is automatically generated. Every page view, ad impression, ad click and video view by every user on the web generates thousands of bytes of log information. Add in the data automatically generated by the underlying infrastructure (CDNs, servers, gateways, etc.) and you can quickly find yourself dealing with petabytes of data.

Batch processing

The commonality between real-time web search and big data analytics problems is rooted in the need to continuously and efficiently process huge streams of data. It turns out that traditional data analytics systems (such as database systems and data warehouses) and search engines are a poor match for this type of processing. These systems are built around batch processing, in which information is collected, processed and indexed, then made available for querying and analysis, often with a cycle time of a day or more. This is not unlike the way programming used to be done in the days of punch cards: create a card deck, wait for your turn, and come back the next day to see if it worked.

Batch processing, however, leads to two problems. First, and most obvious, is the time lag ("latency") inherent in such processing. Batch processing systems typically have high startup costs and overheads, so efficiency improves as you increase the batch size. Larger batch sizes also make it easier to exploit the resources of ever-larger clusters of servers. In a batch world, throughput is improved by delaying the processing of information, exactly the opposite of what's needed for real-time anything.

The second problem with the batch approach is that it wastes resources. For example, data warehouses typically ingest data through an ETL (Extract, Transform, Load) process that writes data into disk-based tables. Subsequent queries then hunt for that recently stored data and pull it back into memory to process it. All of this data movement is hugely expensive in terms of I/O, memory and networking bandwidth.

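To make that load-then-query pattern concrete, here is a minimal sketch in Python, using the built-in sqlite3 module as a stand-in for a disk-based warehouse table. The table, columns and hourly report query are hypothetical, chosen only to illustrate the round trip through storage, not taken from any particular product.

```python
import sqlite3

# "ETL" step: extract raw log records and load them into a disk-based table.
# Every record is written to disk before anyone can ask a question about it.
conn = sqlite3.connect("warehouse.db")  # stand-in for a warehouse table
conn.execute("CREATE TABLE IF NOT EXISTS page_views (ts INTEGER, url TEXT)")

def load_batch(records):
    """Append a batch of (timestamp, url) log records to the table."""
    conn.executemany("INSERT INTO page_views VALUES (?, ?)", records)
    conn.commit()

def hourly_view_counts():
    """A later query re-reads the stored rows to compute per-hour counts."""
    return conn.execute(
        "SELECT ts / 3600 AS hour, COUNT(*) FROM page_views GROUP BY hour"
    ).fetchall()

# Data goes to disk first, then is pulled back into memory to be counted.
load_batch([(1253800000, "/home"), (1253800005, "/buy"), (1253803700, "/home")])
print(hourly_view_counts())
```
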
The batch approach stems from viewing information as something that is stored rather than something that flows. The real-time web is a perfect example of where this way of thinking fails; the much larger information stream generated by all web activities is a less visible but even more extreme case.

A mindset shift

The big data problem has fed a surge of activity in data analytics systems. The flurry of new data warehousing and database vendors and the increasing adoption of the Google-inspired Hadoop stack are driven by these new data management challenges. While there have been some innovations in efficiency in these systems (such as highly compressed columnar storage and smart caching schemes), the basic approach has been to rely on increasing amounts of hardware to solve ever-bigger problems. Such systems have not addressed the fundamental mismatch between batch-oriented processing and the streaming nature of network data.

The excitement around the real-time web provides a great opportunity to reassess the way we think about information and how to make sense of it. While there will always be a need to store information and to search through historical data, many of the analysis and search tasks that users need to perform can be done in-stream. This type of processing has both efficiency and timeliness benefits. For example, real-time search and trend analysis of the tweetstream can be done continuously as tweets are being created.
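As an illustration of what in-stream processing looks like, here is a minimal sketch of continuous trend counting over a stream of tweets, kept entirely in memory over a sliding one-minute window. The window length, the whitespace tokenization and the names (TrendCounter, observe, top) are assumptions made for this example, not part of any specific system.

```python
from collections import Counter, deque

WINDOW_SECONDS = 60  # assumed sliding-window length

class TrendCounter:
    """Maintain term counts over the last WINDOW_SECONDS of tweets, in-stream."""

    def __init__(self):
        self.window = deque()    # (timestamp, terms) pairs still inside the window
        self.counts = Counter()  # current per-term counts

    def observe(self, timestamp, text):
        """Process one tweet the moment it arrives; nothing is written to disk."""
        terms = text.lower().split()
        self.window.append((timestamp, terms))
        self.counts.update(terms)
        # Expire tweets that have slid out of the window.
        while self.window and self.window[0][0] <= timestamp - WINDOW_SECONDS:
            _, old_terms = self.window.popleft()
            self.counts.subtract(old_terms)

    def top(self, n=3):
        """Current trending terms, available continuously as the stream flows."""
        return [(term, c) for term, c in self.counts.most_common(n) if c > 0]

# Metrics are up to date the moment each tweet is observed.
trends = TrendCounter()
trends.observe(0, "new album drops tonight")
trends.observe(10, "that album is great")
print(trends.top())  # e.g. [('album', 2), ...]
```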

This doesn't mean that the need for managing stored data is going away. In fact, most useful applications will need to combine streaming data with stored historical data, and in-stream processing is an extremely efficient way to compute metrics to be stored for later use. The point is that all processing that can be done in-stream should be.

And such processing should not be limited just to the emerging "real-time" web. Applications that can correlate activity on the real-time web with information about past and present user activity on the traditional web will perhaps be the most useful of all. For example, a spike in tweets about a particular band could be used as a predictor of demand at an online music store. Conversely, the real-time web could be monitored for explanations of an observed spike in user activity patterns, video popularity or music downloads.

The key to enabling such applications is to move from the "data as history" mindset to one of "data as streams." Fortunately, the real-time web is providing a great opportunity for all of us to rethink our approach to making sense of the ever-increasing amount of information available, no matter where it comes from.

Michael Franklin is the founder and CTO of Truviso, and a Professor of Computer Science at UC Berkeley.

No comments: