Research
This category contains articles about my main research topic.

OLTPBenchmark
Written by Carlo A. Curino   
Saturday, 03 March 2012 06:03

Together with Andy Pavlo, Djellel Difallah, and Philippe Cudré-Mauroux, I just completed an interesting project: an extensible testbed for benchmarking relational databases.

This was motivated by the huge amount of time each one of us had wasted in our research building workloads: gathering real or realistic data and query loads, implementing driving infrastructures, collecting statistics, plotting statistics.

To spare others that wasted effort, and to promote repeatability and ease of comparison of scientific results, we pulled our independent efforts together and created an extensible driving infrastructure for OLTP/Web workloads, capable of:

  • Targeting all major relational DBMSs via JDBC (tested on MySQL, Postgres, Oracle, SQL Server, DB2, HSQLDB)
  • Precise rate control (lets you define, and change over time, the rate at which requests are submitted)
  • Precise transactional mixture control (lets you define, and change over time, the percentage of each transaction type)
  • Access distribution control (lets you emulate evolving hot-spots, temporal skew, etc.)
  • Trace-based execution (ideal for replaying real data)
  • Extensible design
  • Elegant management of SQL dialect translations (to target the various DBMSs)
  • Stored-procedure-friendly architecture
  • Ten bundled workloads:
    • AuctionMark
    • Epinions
    • JPAB
    • Resource Stresser
    • SEATS
    • TATP
    • TPCC
    • Twitter
    • Wikipedia
    • YCSB
If you are interested in using our testbed, or in contributing to it, visit: http://oltpbenchmark.com/
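
To make the "rate control" bullet concrete, here is a minimal, hypothetical sketch of the kind of open-loop, fixed-rate JDBC driver loop that the testbed provides out of the box. This is not the OLTPBenchmark API itself; the JDBC URL, credentials, table, and query are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Illustrative only: a fixed-rate JDBC submission loop, not OLTPBenchmark code.
public class FixedRateDriver {
    public static void main(String[] args) throws Exception {
        double targetTps = 100.0;                             // requests per second
        long periodNanos = (long) (1000000000L / targetTps);

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/bench", "bench", "bench"); // placeholder URL
             PreparedStatement stmt = conn.prepareStatement(
                 "SELECT balance FROM accounts WHERE id = ?")) {         // placeholder query

            long next = System.nanoTime();
            for (int i = 0; i < 10000; i++) {
                long sleep = next - System.nanoTime();
                if (sleep > 0) {                              // wait for the next scheduled slot
                    Thread.sleep(sleep / 1000000L, (int) (sleep % 1000000L));
                }
                stmt.setInt(1, 1 + (i % 1000));               // uniform keys; real workloads add skew
                stmt.executeQuery().close();
                next += periodNanos;                          // open-loop: schedule independent of latency
            }
        }
    }
}
```

The testbed generalizes this trivial loop with per-phase rates, transaction mixtures, and access distributions, which is exactly the plumbing none of us wanted to keep rewriting.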

 
A Relational Rant about HBase/Pig/Hadoop
Written by Carlo A. Curino   
Monday, 09 January 2012 01:37

Disclaimer: my background is heavily skewed in favor of Relational Databases, and I am fairly new to the entire Hadoop/Pig/Hbase world.

For a project I am involved in, I am in the process of setting up a 40-node Hadoop/Pig/HBase cluster. I am learning the tools as I go, and I feel like sharing the pain and the excitement of this phase.

While I am 100% sure that someone with more experience in this field would have avoided many of the dumb mistakes I made, I am an ideal example of how easy (or hard) it is to pick up Hadoop/Pig/HBase: zero background in Hadoop/HBase/Pig, strong background in related fields (databases and Java programming).

I am running HBase 0.90.3 on Hadoop 0.20.3 and Pig 0.9.1.

RANT 1: No documentation

My first rant is about the painful lack of documentation, examples, and tutorials. I was expecting to join a "modern" project run by a hip community, so I was expecting amazing tools and great documentation. Instead, the code is mostly barebones, there are only a few partial examples around, and most of the knowledge is trapped in mailing lists and bug-tracking software. Not a very pleasant experience, especially because the project evolves fast enough that most of what you find is already obsolete.

RANT 2: The classpath, what a nightmare

One of the most horrible things to deal with is, surprise surprise, the classpath! With Pig and Hadoop the effective classpath is born as a funky combination of environment variables, "register" statements in the Pig scripts, and directory scanning. This would not be a problem in itself; what made my life miserable is the following: you load the wrong .jar (e.g., a slightly different version of Hadoop, HBase, Pig, Guava, Commons, etc.) and the error that comes out is not something reasonable like "wrong class version" but rather various bizarre things like IndexOutOfBoundsException, EOFException, and many others. This is annoying and rather hard to debug (if you don't expect it), especially when it is some obscure Maven script sneaking the wrong jar into a directory you don't even know exists, much less suspect that every jar in that directory ends up on the classpath. Another interesting thing I observed is that Pig 0.9.1's "register" does not always work: sometimes you have to put the jar both in "register" and in the -cp you pass to Pig for it to take effect. Oh joy...

At least in my experience, a more systematic way to load and check the jars on the classpath would save a lot of time, especially when you first start (I am now very careful to always check the classpath any time there is an error).
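
As a concrete example of the kind of check I mean, a few lines of plain Java (no Hadoop/Pig/HBase API involved) can dump the effective classpath and report which jar a suspicious class was actually loaded from; the class names passed on the command line are whatever you happen to be debugging at the moment.

```java
import java.io.File;
import java.security.CodeSource;

// Illustrative classpath sanity check: print every classpath entry, then show
// which location each class named on the command line was actually loaded from.
public class ClasspathCheck {
    public static void main(String[] args) throws Exception {
        for (String entry : System.getProperty("java.class.path").split(File.pathSeparator)) {
            System.out.println("classpath entry: " + entry);
        }
        // e.g. java ClasspathCheck org.apache.hadoop.conf.Configuration
        for (String className : args) {
            Class<?> clazz = Class.forName(className);
            CodeSource src = clazz.getProtectionDomain().getCodeSource();
            System.out.println(className + " loaded from "
                    + (src == null ? "<bootstrap or unknown>" : src.getLocation()));
        }
    }
}
```

Running it with the same classpath you hand to Pig/Hadoop makes the "slightly different version" problem visible immediately, instead of surfacing as an EOFException three stack traces deep.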

RANT 3: Poor error propagation

I run in a very privileged setting: I am root on every box in the cluster, and I debugged many of my errors by looking at the logs of ZooKeeper, the JobTracker, Pig, the HBase master nodes, and whatnot. But the propagation of errors seems rather poor; in a normal "shared grid" environment, where you don't necessarily have access to all these logs, I would have had an even harder time debugging my code (e.g., the Pig script completes successfully, and the error is lost in the JobTracker log).

RANT 4: HBase is (not) easy to configure/run

I am doing the installation with the precious help of a skilled sysadmin, so I was expecting everything to run rather smoothly. After all, with HBase I am giving up almost everything I love in life (transactions, secondary indexes, joins) in exchange for something super-scalable, robust, and zero-admin. Well, not really... Getting HBase to work decently at scale seems to require quite a bit of manual pampering, convincing the thing that "it's ok" to go fast.

I am loading data into HBase via Pig's HBaseStorage and/or importtsv (after spending enough time to find a combination of versions that works properly together), and I was expecting the thing to scale linearly and trivially (and to load at some 10k rows per box per second). At my first attempt the Pig script failed because of too many waits on a region, or something along those lines. Even after I pre-split the table across many regions and generate keys as MD5(id)+id, there is a huge amount of skew (50% of the requests hit a single node); after some more parameter tuning the load at least completes. I will do more performance debugging in the future (I plan to generate HFiles directly and bulk load).
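
For concreteness, this is roughly the MD5(id)+id key construction I am talking about. It is only a sketch under my own formatting assumptions: the hex encoding and the '_' separator are illustrative choices, and the keys are assumed to be handed to the loader as plain strings.

```java
import java.security.MessageDigest;

// Sketch of a salted row key: prefix the id with its MD5 hex digest so that
// lexicographically sorted keys scatter across pre-split regions instead of
// all landing on the "latest" region.
public class SaltedKey {
    static String rowKey(String id) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(id.getBytes("UTF-8"));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.append('_').append(id).toString();
    }

    public static void main(String[] args) throws Exception {
        // prints 827ccb0eea8a706c4c34a16891f84e7b_12345
        System.out.println(rowKey("12345"));
    }
}
```

The obvious downside is that you lose the ability to scan ids in their natural order, and, as noted above, the salting alone did not eliminate the skew I am seeing.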

Bottom line: I am OK with giving up all the fancy relational transactional guarantees for a worry-free, super-scalable, zero-admin, self-balancing something... I am just not sure HBase is quite there yet. I feel there is a lot of hype associated with it.

Altogether I am a bit underwhelmed by HBase; I haven't seen anything I could not do about 5X faster with a set of MySQL boxes... OK, I know MySQL performance tuning much better than anything HBase... I will keep working on tuning and at some point I will run some proper comparisons, but I remain skeptical.

PROS:

Now that I have released some steam with my rants, let me say a few words about what's nice... What is very nice for me is Pig, because it allows me to write very few lines of code that automagically get distributed and run like crazy. Let me reformulate: it is Pig+Hadoop... fine, it is Pig+Hadoop+HBase. I have to admit that whenever you get the entire software stack to play nice, it is very exciting to hack up some code in a few minutes and have it parallelized across the entire cluster (and even more exciting on our main grid, where you can spawn thousands of mappers in parallel).

Altogether, I am probably spoiled by having worked in the relational world for a long time, where the systems are 30+ years old and many of these small issues have long been solved (while plenty of worse problems are lurking in the darkness of an unpredictable optimizer or a hard-to-scale data model, etc.). But this brings me to the moral of this post:

MORAL

My word of advice to people who plan to start using HBase/Pig/Hadoop: by all means get into it, it is an exciting world, but be ready to deal with a hackish, unpolished, untamed, kicking-and-screaming system... be ready to open the system's Java source code to figure out why things are not working, or how to use an API... be ready for interfaces that are as stable as butterflies... If you are ready for that, you will not be disappointed.

 

Last Updated on Monday, 09 January 2012 01:38
 
Schema Evolution: Datasets...
Written by Carlo A. Curino   
Monday, 18 April 2011 11:07

With the help of various students of Prof. Carlo Zaniolo at UCLA, I have gathered a long list of datasets for testing schema evolution systems. You can find a brief summary of the systems and links to their schemas at the following link: Schema Evolution Benchmark Datasets

Enjoy! And if you use this in a paper, don't forget to acknowledge/cite us...

Database in the Cloud
Written by Carlo A. Curino   
Tuesday, 17 November 2009 14:25
The advent of cloud computing and hosted software-as-a-service is creating a novel market for data management. Cloud-based DB services are starting to appear, and they have the potential to attract customers from very diverse sectors of the market, from small businesses aiming to reduce their total cost of ownership (perfectly suited to multi-tenancy solutions) to very large enterprises (seeking high-end solutions spanning potentially thousands of machines and capable of absorbing unexpected bursts of traffic). At the same time, the database community is producing new architectures and systems that tackle various classes of typical DB workloads in novel ways. These dedicated approaches often outshine traditional DBMSs, which chase an unreachable "one-size-fits-all" dream. This growth in the number of available data management products exacerbates the problem of selecting, deploying, and tuning in-house data management solutions.

In this context we are building a cloud-based database architecture enabling DB-as-a-service. We envision a database provider that (i) provides the illusion of infinite resources, by continuously meeting user expectations under evolving workloads, and (ii) minimizes the operational costs associated with each user, by amortizing administrative costs across users and by developing software techniques to automate the management of the many databases running inside a single data center. The traditional provisioning problem (what resources to allocate for a single database) thus becomes an optimization problem, where a large user base, multiple DBMS engines, and a very large data center provide an unprecedented opportunity to exploit economies of scale, smart load balancing, higher power efficiency, and principled overselling.

The "relational cloud" infrastructure has several strong points for database users:  (i) predictable and lower costs, proportional to the quality of service and actual workloads, (ii) reduced technical complexity, thanks to a unified access interface and the delegation of DB tuning and administration, and (iii) elasticity and scalability, providing the perception of virtually infinite resources ready at hand. In order to achieve this, we are working on harnessing many recent technological advances in data management by efficiently exploiting multiple DBMS engines (targeting different types of workloads) in a self-balancing solution that optimizes the assignment of resources of very large data centers to serve potentially thousands of users with very diverse needs.  Among the critical research challenges to be faced we have identified: (i) automatic database partitioning, (ii) multi-db, multi-engine workload balancing, (iii) scalable multi-tenancy strategies, (iv) high profile distributed transaction support, (v) automated replication and backup.   These and many others are the key ingredients of the RelationalCloud (http://relationalcloud.com) prototype currently under development.
 
Data Integration
Written by Carlo A. Curino   
Thursday, 11 September 2008 09:39

The current business reality induces frequent acquisitions and mergers of companies and organizations, and ever tighter interaction between the information systems of cooperating companies. We present our contributions to Context-ADDICT, a framework supporting on-the-fly, (semi-)automatic data integration, context modeling, and context-aware data filtering. Data integration is achieved in Context-ADDICT by means of automatic ontology-based data source wrapping and integration. While this approach rests on a solid theoretical foundation, it also provides heuristics to handle the practical aspects of data integration in a dynamic context.

More details on this research effort can be found at: http://kid.dei.polimi.it/

 

 