There’s a lot of buzz right now about Map/Reduce, what it is supposed to be, why you should use it. Some see it as a sort of a panacea in solving all through for scaling, a general database, a general clustering or cloud computing system. In short, before you even begin thinking of Hadoop and Map/Reduce, ask yourself:
- Is the data inherently impossible to treat as relational data without any drawbacks?
- Can the data be handled by just a Perl script?
- Is disk seek time the bottleneck (if it isn’t, Map/Reduce may be an solution but not in form of Hadoop, which through the use of Java, is not most efficient as a general “process-spawning” engine).
Map/Reduce with a distributed file system (and Hadoop as implementation of Map/Reduce layered on top of HDFS) is unique in that it is a way to distribute disk seek times, across a cluster of commodity hardware. Disk seek time is a resource for which parallelization on a single node (multiple cores, virtualization, threading) doesn’t work. What sort of tasks require disk seek times? Data processing and data access.
Map/Reduce, however, isn’t the only approach to data processing or distributing seek times. If the data is relational to start with, a much better approach is simply using a partitioned and replicated database cluster (MySQL or Postgres): write partitioning distributes seek times on writes, replication distributes seek times on reads. However, if your data is not relational to start with and can’t be inserted, rapidly, in a relational fashion (e.g. if you would need to log every single page view, or even emit multiple events per page view - thus you can’t afford to wait for a database statement which would insert a row (with a primary key value) into a table), you have to look elsewhere.
The simplest solution, however, if you need data logged rapidly and processed offline is to perform an asynchronous write (e.g. to syslog’s/syslog-ng’s UNIX domain socket, or implement UDP client/server) and then process the data from a Perl script executed through a cronjob. This solution can scale by storing the data on a file server or simply writing it to multiple log servers (built in feature of syslog, could also be done using UDP broadcast or multicast). The time when a Perl script won’t cut it, however, is when reading the data into from file and performing sorting/data distribution (to child processes for computation) takes so long, that by the time the script is finished the data is no longer useful.
If your data is stored in a relational database, using Map/Reduce would add redundant steps. If the database solution can’t scale (by slowing down response times and adding database contention), replace it by using hadoop (or another non-relational process) first and then bulk loading the processed data into the database (e.g. MySQL LOAD DATA INFILE).
If disk time itself is not a source of contention, but computation and/or network I/O is, it is better to distribute the task using threading or regular UNIX forking (if there’s intensive computation) or using event-loops (select() or epoll()) if there’s lot of network I/O (for a general discussion of concurrent network I/O, see The c10k problem. Java processes (which is what hadoop would use for every running map task) are fairly expensive to spawn and context switch through (compared to threads or traditional UNIX processes).