BigTable saves the Semantic Web

The last time Adam Bosworth talked about databases, he made a few statements implying that the Semantic Web was doomed by its complexity and that Atom and RSS would be the route to structured data. For a moment, I thought Google had given up on the SW and RDF, but a couple of quotes from his talk made it clear that what he really wanted was an RDF store.

Adam Bosworth: If you build an open source stack that delivers globally available information, how do you massively distribute it and cause it to scale? Bosworth said you need to limit your queries to those that can be easily implemented by everybody and those that can be handled by a single machine. This requires that your queries run at the item level. This might feel odd to those used to dealing with databases, as this means you are not likely to perform joins, aggregations, or subqueries. There is plenty of SQL that cannot be supported…. Bosworth concluded his keynote by saying the potential is that “you guys can handle hundreds of millions of queries per day and scale up and out in ways that Oracle can only dream of. You will be able to effortlessly support hard questions.”
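To picture what "queries at the item level" could mean in practice, here is a minimal Python sketch. It is only an illustration under my own assumptions: the `ShardedStore` class, the MD5-based shard choice, and the sample keys are all hypothetical and say nothing about how Google actually distributes data. The point is that a lookup by key touches exactly one shard, whereas a join or aggregation would have to visit all of them.

```python
import hashlib


class ShardedStore:
    """Toy key-value store split across a fixed number of in-memory shards.

    Hypothetical illustration only; each dict stands in for one machine.
    """

    def __init__(self, num_shards=4):
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash, so the same key always maps to the same shard.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, item):
        self._shard_for(key)[key] = item

    def get(self, key):
        # Item-level query: only one shard is consulted, no joins needed.
        return self._shard_for(key).get(key)


store = ShardedStore()
store.put("entry/1", {"title": "BigTable saves the Semantic Web"})
store.put("entry/2", {"title": "Another post"})
item = store.get("entry/1")
```

Anything that needs to combine items (a join, a count over all entries) would have to fan out to every shard, which is exactly the kind of query Bosworth suggests giving up.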

Now, in retrospect, we are starting to see that he wasn’t bluffing and that the demands he was making of the database community were already a reality at Google. I stumbled upon a post by Greg Linden about a talk given by Jeff Dean on BigTable.

BigTable is a system for storing and managing very large amounts of structured data. Data is organized into tables with rows and columns, but unlike a traditional database system, the row/column space can be sparse. Row keys and values are arbitrary strings, and the system allows each row/column cell to store not just a single value but a set of values with associated timestamps, simplifying analyses that examine how values have changed over time. Data in a single table is internally broken at arbitrary row boundaries to form contiguous regions of data called tablets.
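The data model described in that quote can be sketched in a few lines of Python. This is a toy, in-memory illustration and not BigTable's API: the `SparseTable` class, its method names, the sample row keys, and the fixed `rows_per_tablet` split are all my assumptions. It shows the three ideas from the description: sparse rows, cells that keep timestamped versions, and sorted row keys cut into contiguous ranges.

```python
from collections import defaultdict


class SparseTable:
    """Toy table: row key -> column name -> list of (timestamp, value)."""

    def __init__(self):
        # Only cells that were actually written take up space (sparse rows).
        self.rows = defaultdict(dict)

    def put(self, row, column, value, timestamp):
        versions = self.rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        # Keep the newest version first.
        versions.sort(key=lambda pair: pair[0], reverse=True)

    def get(self, row, column, at=None):
        """Newest value of a cell, or the newest one written at or before `at`."""
        for timestamp, value in self.rows.get(row, {}).get(column, []):
            if at is None or timestamp <= at:
                return value
        return None

    def tablets(self, rows_per_tablet):
        """Cut the sorted row keys into contiguous ranges (hypothetical sizing)."""
        keys = sorted(self.rows)
        return [keys[i:i + rows_per_tablet]
                for i in range(0, len(keys), rows_per_tablet)]


table = SparseTable()
table.put("post/1", "atom:title", "Draft title", timestamp=1)
table.put("post/1", "atom:title", "BigTable saves the Semantic Web", timestamp=2)
table.put("post/2", "dc:creator", "someone", timestamp=1)
table.put("post/3", "atom:title", "Third post", timestamp=1)

latest = table.get("post/1", "atom:title")
earlier = table.get("post/1", "atom:title", at=1)
```

Here `latest` is the newest title and `earlier` is the value as of timestamp 1, which is the "how values have changed over time" part of the description. Note that `post/2` simply has no `atom:title` cell at all.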

This is excellent news for the Semantic Web. Google is building the RDF database we’ve been trying to build. Even though we are conceptually on the right track, our implementations to date do not scale well enough to match even standard relational databases, which makes it very hard for real systems to adopt RDF as their platform. All of this is going to change with BigTable, but let’s pay attention to the details in the description above and in this summary from Andrew Hitchcock:
  • Storing and managing very large amounts of structured data
  • Row/column space can be sparse
  • Columns are in the form of “family:optional_qualifier”. RDF Properties, Yeah!
  • Columns have type information
  • Because of the design of the system, columns are easy to create (and are created implicitly)
  • Column families can be split into locality groups (Ontologies!)
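To make the “family:optional_qualifier” naming and the locality-group idea from the list above a bit more concrete, here is a small Python sketch. Everything in it is illustrative: the `parse_column` and `group_columns` helpers and the `LOCALITY_GROUPS` mapping are hypothetical names of mine, and the family names (`atom`, `dc`, `contents`, `foaf`) are just examples.

```python
def parse_column(name):
    """Split 'family:optional_qualifier' into (family, qualifier or None)."""
    family, _, qualifier = name.partition(":")
    return family, qualifier or None


# Hypothetical assignment of column families to locality groups.
LOCALITY_GROUPS = {
    "atom": "metadata",
    "dc": "metadata",
    "contents": "bulk",
}


def group_columns(columns):
    """Bucket column names by the locality group of their family."""
    groups = {}
    for name in columns:
        family, _ = parse_column(name)
        group = LOCALITY_GROUPS.get(family, "default")
        groups.setdefault(group, []).append(name)
    return groups


grouped = group_columns(["atom:title", "dc:creator", "contents:", "foaf:name"])
```

Columns whose families share a locality group end up in the same bucket, so reading all the "metadata" columns of a row would not have to touch the "bulk" ones.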
Why do I think this is an RDF database? In case you don’t know, one of the problems with existing relational database models is that they are not flexible enough. If a company like Amazon starts carrying a new type of product with attributes not currently built into its systems, it has to jump through hoops to recreate the tables that store and manage product information. RDF, as an extensible description framework, answers this problem because it allows a resource to have an unlimited number of properties associated with it. However, when we implement RDF stores atop an existing RDBMS, we end up using one row for each property/attribute that we want to store about a resource, which makes joins and other operations sub-optimal.

Here is where BigTable comes in: its row/column space can be sparse (not all rows/resources contain the same properties), and columns can be created at very little cost. Additionally, you can maintain locality for families of properties, which we call Ontologies, so if we wanted all the properties of a blog entry, we could get them quickly (i.e. a locality group for all Atom metadata columns).

Anyway, I have to get back to my school work, but I hope that everyone sees what I’m seeing and analyzes this talk further, with more attention to the technical details. I think that better times are coming for the SW and that we’ll soon be enjoying a whole new class of semantic services on the Internet. One final note, or maybe a whole separate post, will be on Bosworth’s comments about how we should limit our SQL queries in order to gain the performance we need in RDF databases.
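The contrast between a one-row-per-property triple table and a sparse, wide row per resource can be sketched in Python as follows. This is a rough illustration of the idea, not a real RDF store: the sample triples, the `example.org` URIs, and the helper names `properties_from_triples`, `to_wide_rows`, and `properties_in_family` are all made up for the example.

```python
# Triple-table layout: one (subject, property, value) row per property.
triples = [
    ("http://example.org/post/1", "atom:title", "BigTable saves the Semantic Web"),
    ("http://example.org/post/1", "atom:summary", "Why BigTable looks like an RDF store"),
    ("http://example.org/post/1", "dc:creator", "someone"),
    ("http://example.org/product/9", "shop:weight", "2kg"),
]


def properties_from_triples(triples, subject):
    """Triple-table lookup: every row is scanned to collect one resource."""
    return [(p, o) for s, p, o in triples if s == subject]


def to_wide_rows(triples):
    """Wide layout: one sparse row per resource, one column per property."""
    rows = {}
    for subject, prop, value in triples:
        rows.setdefault(subject, {}).setdefault(prop, []).append(value)
    return rows


def properties_in_family(rows, subject, family):
    """All columns of one family (e.g. 'atom') from a single row."""
    prefix = family + ":"
    return {column: values
            for column, values in rows.get(subject, {}).items()
            if column.startswith(prefix)}


rows = to_wide_rows(triples)
atom_metadata = properties_in_family(rows, "http://example.org/post/1", "atom")
```

In the wide layout, adding a new kind of property (say `shop:weight` for a new product type) is just a new column on the rows that need it, and fetching all Atom metadata for a post is a single-row read rather than a scan or self-join over the triple table.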
