A History of Silicon Valley
These are excerpts from Piero Scaruffi's book
The Selfies (2011-16)
The first companies to deal successfully with "big data" were probably the big two of the 2000s: Google and Facebook. It was becoming more and more apparent that their key contributions to technology were not so much the small features they added here and there as the capability to manage explosive amounts of data in real time.
A Facebook team led by Avinash Lakshman and Prashant Malik developed Cassandra, leveraging technology from Amazon and Google, to solve Facebook's data management problems. Facebook gifted it to the open-source Apache community in 2008. DataStax, founded in 2010 in Santa Clara by Jonathan Ellis and Matt Pfeil, took Cassandra and turned it into a mission-critical database management system capable of competing with Oracle, the field's superpower.
A Google team led by Jeff Dean and Sanjay Ghemawat (in about 2004) developed the parallel, distributed algorithm MapReduce to provide massive scalability across a multitude of servers, a real-life problem for a company managing billions of search queries and other user interactions. In 2005 Doug Cutting, a Yahoo! engineer, and Mike Cafarella implemented a MapReduce service and a distributed file system (HDFS), collectively known since 2006 as Hadoop, for storage and processing of large datasets on clusters of servers. Hadoop was used internally by Yahoo! and eventually became another Apache open-source framework. The first startups to graft SQL onto Hadoop were Cloudera, formed in 2008 in Palo Alto by three engineers from Google, Yahoo! and Facebook (Christophe Bisciglia, Amr Awadallah and Jeff Hammerbacher) and later joined by Doug Cutting himself (Intel became a major investor in Cloudera in 2014); and Hadapt, founded in 2011 in Boston by Yale students Daniel Abadi, Kamil Bajda-Pawlikowski and Justin Borgman. Other Hadoop-based startups included Qubole, founded in 2011 in Mountain View by two Facebook engineers, Ashish Thusoo and Joydeep Sen Sarma; and Platfora, founded in 2011 in San Mateo by Ben Werther. Qubole offered a cloud-based version of Apache Hive, the project that the founders ran at Facebook (since 2007) and that was made open-source in 2008. Hive sat on top of Hadoop to provide data analysis and SQL-like queries.
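The MapReduce model splits a job into a map phase that emits key-value pairs, a shuffle that groups values by key, and a reduce phase that aggregates each group. A minimal single-process sketch in Python of the classic word-count job (a real deployment shards each phase across many servers):

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework would
    # before routing each key to a reducer.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the values emitted for each key.
    return {key: sum(values) for key, values in grouped.items()}

docs = ["the web is a graph", "the web grows"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 2
```

Because each map call and each reduce call is independent, the framework can scatter them across thousands of machines and rerun any that fail, which is what made the model attractive at Google's scale.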
Meanwhile, Google developed its own "big data" service, Dremel, announced in 2010 (but used internally since 2006). The difference between Hadoop and Dremel was simple: Hadoop processed data in batch mode, Dremel did it in real time. Dremel was designed to query extremely large datasets on the fly. Following what Amazon had done with its cloud service, Google opened its BigQuery service, a commercial version of Dremel, to the public in 2012, selling storage and analytics at a price per gigabyte. Users of the service could analyze datasets using SQL-like queries. Dremel's project leader Theo Vassilakis went on to found Metanautix with a Facebook engineer, Apostolos Lerios, in 2012 in Palo Alto.
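BigQuery users analyzed datasets with SQL-like queries of the kind sketched below. The table and column names are invented for illustration, and an in-memory SQLite database stands in for the service itself; the point is the interactive, ad-hoc aggregation style that Dremel made possible over far larger datasets:

```python
import sqlite3

# Stand-in for a BigQuery table; schema and data are invented
# purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (country TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO pageviews VALUES (?, ?)",
    [("US", 120), ("US", 80), ("IN", 50)],
)

# The kind of ad-hoc aggregation a Dremel/BigQuery user would
# issue interactively, rather than as an overnight batch job.
rows = conn.execute(
    "SELECT country, SUM(views) FROM pageviews "
    "GROUP BY country ORDER BY SUM(views) DESC"
).fetchall()
print(rows)  # [('US', 200), ('IN', 50)]
```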
At the same time that it disclosed Dremel, Google published two more papers that shed some light on its internal technologies for handling big data. Caffeine (2009) was about building the index for the search engine. The other one (2010) was about Pregel, a "graph database" capable of fault-tolerant parallel processing of graphs; the idea being that graphs were becoming more and more pervasive and important (the Web itself is a graph and, of course, so are the relationships created by social media). MapReduce not being good enough for graph algorithms, and the existing parallel graph software not being fault-tolerant, Google proceeded to create its own. Google's Pregel, largely the creature of Grzegorz Czajkowski, used the Bulk Synchronous Parallel model of distributed computation introduced by Leslie Valiant at Harvard and codified in 1990. The Apache open-source community came up with its own variation on the same model, the Giraph project.
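In the Pregel/BSP style, a computation proceeds in synchronized "supersteps": each vertex updates its state from incoming messages, sends messages to its neighbors, and a barrier separates one superstep from the next. A toy single-machine sketch (the graph and the label-propagation task are invented for illustration; Pregel itself distributes the vertices across machines) that finds connected components:

```python
def pregel_components(graph):
    # Vertex-centric BSP: each vertex holds the smallest vertex id
    # it has seen; whenever that value improves, it messages its
    # neighbors. The loop body is one superstep.
    value = {v: v for v in graph}
    # Superstep 0: every vertex announces its id to its neighbors.
    messages = {v: [] for v in graph}
    for u in graph:
        for v in graph[u]:
            messages[v].append(value[u])
    while any(messages.values()):
        new_messages = {v: [] for v in graph}
        for v, inbox in messages.items():
            if inbox and min(inbox) < value[v]:
                value[v] = min(inbox)
                for neighbor in graph[v]:
                    new_messages[neighbor].append(value[v])
        messages = new_messages  # barrier between supersteps
    return value

# Two components: {0, 1, 2} and {3, 4}
g = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
print(pregel_components(g))  # {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}
```

The computation halts when a superstep produces no messages, which is how Pregel-style systems detect convergence.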
The open-source project Apache Mesos, inspired by the Borg system developed at Google by John Wilkes since 2004 to manage Google's own data centers, was conceived at UC Berkeley to manage large distributed pools of computers and was used and refined at Twitter. In 2014 in San Francisco a veteran of Twitter and Airbnb, Florian Leibert, founded Mesosphere to commercialize Mesos. Meanwhile at Google the old project Borg evolved into Omega. Apache Spark, a project started in 2009 by Matei Zaharia at UC Berkeley, was a platform for large-scale data processing. Zaharia later founded his own company, Databricks, but the open-source project survived and in fact grew. In 2015 IBM pledged 3,500 researchers to Apache Spark while open-sourcing its own SystemML machine-learning technology.
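Spark's core abstraction, the resilient distributed dataset (RDD), records transformations such as map and filter lazily and evaluates them only when an action such as collect is called, which lets the engine keep intermediate data in memory across a cluster. A toy single-machine sketch of that lazy-chaining style (the class below is a stand-in, not Spark's actual API):

```python
class ToyRDD:
    # A toy stand-in for Spark's RDD: transformations are recorded
    # lazily and only evaluated when an action (collect) is called.
    def __init__(self, data, ops=()):
        self._data = data
        self._ops = ops

    def map(self, f):
        # Transformation: no work happens yet, just record the op.
        return ToyRDD(self._data, self._ops + (("map", f),))

    def filter(self, p):
        return ToyRDD(self._data, self._ops + (("filter", p),))

    def collect(self):
        # Action: replay the recorded pipeline over the data.
        items = iter(self._data)
        for kind, f in self._ops:
            items = map(f, items) if kind == "map" else filter(f, items)
        return list(items)

rdd = ToyRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

Deferring evaluation this way is what lets the real Spark fuse a whole pipeline into one pass over the data and recompute only lost partitions after a failure.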
The old field of "business intelligence" kept mutating, or at least changing name. As "data mining" and "data analytics" became obsolete terms, a new one was coined: "data science". For example, Looker Data Sciences, founded in 2012 in Santa Cruz by Lloyd Tabb and Ben Porterfield, provided business-intelligence tools to dig into big data and make sense of it. At that point "big data" was mostly stored on high-performance data warehouses such as Amazon Redshift (2013, powered by technology acquired from ParAccel), Google BigQuery (2012), HP Vertica, IBM Netezza, and Teradata.
The world actually didn't have enough data, particularly from the developing world, a fact that skewed research and hampered remedies to problems. Premise, founded in 2012 in San Francisco by MetaMarkets' co-founder David Soloff and MetaMarkets' chief scientist Joe Reisinger, harnessed the power of the crowd to collect economic data around the world, provided in real time by ordinary individuals armed with smartphones.