Making XML readable with Emacs and Perl

March 2nd, 2009

I’ve found I often have to parse through and examine XML output from web services, configuration files and other sorts of resources where the XML is optimized for compactness rather than readability. Since I almost always have an Emacs window already open and ready to accept the cut/paste buffer, I wrote these snippets of Perl and Emacs Lisp to help me deal with the situation.

Add this to your .emacs:

(defun xmltidy-region ()
  "Tidy up the current XML region."
  (interactive)
  (save-excursion
    ;; Pipe the region through the script and replace it with the output.
    (shell-command-on-region (region-beginning) (region-end)
                             "~/t/xml_indent.pl" nil t)))

Where ~/t/xml_indent.pl contains this:

#!/usr/bin/perl -w
 
use strict;
 
use XML::Simple;
 
my $xs = XML::Simple->new(KeepRoot=>1,ForceArray=>1);
my $data = '';
while (my $line = <STDIN>) {
    $data .= $line;
}
my $x = $xs->XMLin($data);
print $xs->XMLout($x) . "\n";

Now, simply paste the XML into a buffer (using nxml-mode helps), select the XML mark-up and enter M-x xmltidy-region.
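
One caveat with the script above: XML::Simple keeps child elements in Perl hashes, so the order of sibling elements may change on the round trip. If order matters, a filter along the following lines could be swapped in. This is only a sketch in Python using the standard library's xml.dom.minidom; the function name tidy_xml is made up for illustration.

```python
from xml.dom import minidom


def tidy_xml(text):
    """Return an indented copy of the XML document held in `text`.

    minidom keeps child nodes in document order, so siblings are not reordered.
    """
    document = minidom.parseString(text)
    return document.toprettyxml(indent="  ")
```

To use it from xmltidy-region, the script would read the region from standard input and write tidy_xml(sys.stdin.read()) to standard output, exactly as the Perl version does.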

Should you use Hadoop?

August 11th, 2008

There’s a lot of buzz right now about Map/Reduce: what it is supposed to be and why you should use it. Some see it as a sort of panacea for every scaling problem, or as a general database, or as a general clustering or cloud computing system. In short, before you even begin thinking of Hadoop and Map/Reduce, ask yourself:

  • Is it impossible to treat the data as relational data without drawbacks?
  • Can the data be handled by just a Perl script?
  • Is disk seek time the bottleneck? (If it isn’t, Map/Reduce may still be a solution, but not in the form of Hadoop, which is not very efficient when used just as a general “process-spawning” engine.)

Map/Reduce with a distributed file system (and Hadoop as an implementation of Map/Reduce layered on top of HDFS) is unique in that it is a way to distribute disk seek times across a cluster of commodity hardware. Disk seek time is a resource for which parallelization on a single node (multiple cores, virtualization, threading) doesn’t work. What sort of tasks require disk seek times? Data processing and data access.
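
For readers who haven't seen the model itself, here is a single-process sketch of the map, shuffle and reduce steps, using word counting as the job. It is not Hadoop's API and all the function names are illustrative. In a real cluster the map step runs in many tasks at once, ideally on the nodes that already hold the input blocks, and that is where the distribution of seek times comes from.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs for one input record: here, one pair per word."""
    for word in record.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all the values seen for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over an iterable of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling run_job(["a b a", "B c"]) counts "a" and "b" twice each and "c" once.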

Map/Reduce isn’t the only approach to data processing or to distributing seek times. If the data is relational to start with, a much better approach is simply to use a partitioned and replicated database cluster (MySQL or Postgres): write partitioning distributes seek times on writes, replication distributes seek times on reads. However, if your data is not relational to start with and can’t be inserted rapidly in a relational fashion, you have to look elsewhere. That is the case, for example, if you need to log every single page view, or even emit multiple events per page view, and so can’t afford to wait for a database statement to insert a row (with a primary key value) into a table.
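
As a rough illustration of write partitioning, the application can pick the database server for each row by hashing its key, so that writes (and their seeks) spread evenly across machines. The sketch below assumes a fixed number of shards and uses CRC32 only because it is stable across runs; the function name and key format are hypothetical.

```python
import zlib


def shard_for(key, shard_count):
    """Map a row key to a shard number in the range 0..shard_count-1."""
    return zlib.crc32(key.encode("utf-8")) % shard_count
```

The same key always lands on the same shard, so a later read knows where to look (or which replica set to ask).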

The simplest solution, if you need data logged rapidly and processed offline, is to perform an asynchronous write (e.g. to syslog’s/syslog-ng’s UNIX domain socket, or through your own UDP client/server) and then process the data with a Perl script run from a cron job. This solution can scale by storing the data on a file server or simply by writing it to multiple log servers (a built-in feature of syslog; it could also be done using UDP broadcast or multicast). A Perl script stops being enough when reading the data in from a file and performing the sorting and data distribution (to child processes for computation) takes so long that by the time the script is finished the data is no longer useful.
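
The "asynchronous write" can be as small as a single UDP datagram sent without waiting for any reply. Below is a minimal sketch of the sending side, plus a helper for draining a listening socket. The host, the port 5140 and both function names are assumptions made for the example, not part of syslog or of any setup described here.

```python
import socket


def send_event(message, host="127.0.0.1", port=5140):
    """Send one log line as a UDP datagram and return immediately.

    UDP gives no delivery guarantee, which is the price of never blocking
    the page view on the logging step.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(message.encode("utf-8"), (host, port))
    finally:
        sock.close()


def receive_events(sock, count):
    """Read `count` datagrams from an already bound UDP socket."""
    return [sock.recvfrom(65535)[0].decode("utf-8") for _ in range(count)]
```

A log server would bind a socket once, append whatever receive_events returns to a file, and leave the processing to the cron job.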

If your data is already stored in a relational database, using Map/Reduce would add redundant steps. If the database solution can’t scale (response times slow down and contention builds up), process the data with Hadoop (or another non-relational process) first and then bulk load the processed data into the database (e.g. with MySQL’s LOAD DATA INFILE).

If disk seek time itself is not a source of contention, but computation and/or network I/O is, it is better to distribute the task using threading or regular UNIX forking (if there’s intensive computation), or using event loops (select() or epoll()) if there’s a lot of network I/O (for a general discussion of concurrent network I/O, see The c10k problem). Java processes (which is what Hadoop would use for every running map task) are fairly expensive to spawn and context switch through (compared to threads or traditional UNIX processes).

How to lose weight: iterative approach

August 5th, 2008

(Warning and a disclaimer: I am not a doctor or a personal trainer; this is a personal experience, so do your own research and don’t just take my word for it. I am not responsible for any damage to your health you may incur as a result of my advice, nor do I guarantee my results will be duplicated.)

There was an “Ask YC” post on Hacker News asking how to lose weight. I posted a response, describing my own experience. Since then I’ve decided to write more on the subject, perhaps tailoring this as a “weight loss for hackers” article.

From ~January 2006 to ~November 2007 I lost over fifty pounds (going from 200+ lbs. to <150 lbs.). The difference in appearance has been noticeable, and people (including waiters at restaurants I frequent) have been asking me what my secret is.

To put it simply, there is no secret other than exercise. Is weight loss possible without exercise? Maybe, but for most people that tactic is actually more difficult than exercising. Dieting is tricky, and one wrong move could put your body into a self-preservation mode where anything you eat turns into fat. Furthermore, a “hard-core” diet isn’t fun; you’ll constantly think about it. Should some changes be made to what you eat? If you’re already overweight, then likely yes (I will discuss them later on), but the most important change is to begin exercising.*

From a hacker’s point of view, weight gained or lost is a function of caloric intake and the energy used. There’s the energy we expend by going about our daily chores and there’s additional energy we can expend through exercise. In the last fifty years, in the United States, we’ve undertaken a massive move from walkable cities to suburbs. Getting to work (or school) requires less walking than it did fifty years ago. In addition, the work most of us do is less menial, and previously menial household tasks are now more automated.

As a result, we have to either diet or exercise to maintain a constant weight (“stay in shape”). Both of these seem like daunting tasks, but the trick here is to break them down into smaller deliverable chunks, each with its own visible results, rather than hoping for an overnight change (and burning out due to frustration).


Odd Hadoop problem

June 29th, 2008

When running Hadoop against a relatively large (~100,000 file) dataset, I found that Hadoop would permanently enter a state where no new tasks could be executed, issuing the following error and failing all tasks (whether native Java tasks or streaming tasks):

ERROR org.apache.hadoop.mapred.TaskTracker:
Caught exception: java.net.SocketTimeoutException: timed out waiting for rpc response

After a long search, I found that the issue is due to the “userlogs/” directory in $HADOOP_HOME/logs/ filling up. My guess is that this is due to running out of available file descriptors (“df -i” didn’t yield any indication of running out of inodes). Simply removing the directory with “rm -rf” and restarting Hadoop fixed it.
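
Whatever the underlying limit turns out to be, it is easy to keep an eye on how full the directory is getting and clean it up before the TaskTracker stalls. The following is a sketch of such a check; the threshold of 10000 entries is an arbitrary assumption, not a known Hadoop limit, and the function names are made up.

```python
import os


def count_entries(directory):
    """Count the entries directly inside a directory without building a full list."""
    with os.scandir(directory) as entries:
        return sum(1 for _ in entries)


def needs_cleanup(directory, threshold=10000):
    """Report whether the directory holds more entries than the chosen threshold."""
    return count_entries(directory) > threshold
```

Run from cron against $HADOOP_HOME/logs/userlogs, this could send a warning or trigger the cleanup described above.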

Emacs compile command

June 18th, 2008

I’ve found a snippet online that lets one use the M-x compile Emacs command with custom language-sensitive settings. Unfortunately I don’t have the original URL that I found this on, but here is the relevant snippet, with example settings for Perl and PHP.

(require 'compile)

(defvar compile-guess-command-table
  '((cperl-mode . "perl -c %s")
    (php-mode . "php -l %s")))

(defun compile-guess-command ()
  (let ((command-for-mode (cdr (assq major-mode
                                     compile-guess-command-table))))
    (if (and command-for-mode
             (stringp buffer-file-name))
        (let* ((file-name (file-name-nondirectory buffer-file-name))
               (file-name-sans-suffix (if (and (string-match "\\.[^.]*\\'"
                                                             file-name)
                                               (> (match-beginning 0) 0))
                                          (substring file-name
                                                     0 (match-beginning 0))
                                        nil)))
          (if file-name-sans-suffix
              (progn
                (make-local-variable 'compile-command)
                (setq compile-command
                      (if (stringp command-for-mode)
                          ;; Optimize the common case.
                          (format command-for-mode
                                  file-name file-name-sans-suffix)
                        (funcall command-for-mode
                                 file-name file-name-sans-suffix)))
                compile-command)
            nil))
      nil)))

(add-hook 'cperl-mode-hook (function compile-guess-command))
(add-hook 'php-mode-hook (function compile-guess-command))

Back-propagation Neural Networks, Perl example

June 15th, 2008

I’ve been looking for a simple example of a Back Propagation Neural Network to base a class project on. The example I found was bpnn.py by Neil Schemenauer. A more thorough, Python-based introduction can be found here. For class (and personal) purposes, I’ve transliterated bpnn.py into Perl (see Bpnn.pm). Here is a quick example of how a five-node Neural Network can be trained to recognize the XOR of two inputs:

#!/usr/bin/perl -w
 
package XorBpnn;
 
use strict;
use warnings;
 
use lib qw(.);
 
use Bpnn;
 
use base qw(Bpnn);
 
sub new {
    my $class = shift;
    my $self = $class->SUPER::new(2,2,1);
 
    my $patterns = [
            [[0,0], [0]],
            [[0,1], [1]],
            [[1,0], [1]],
            [[1,1], [0]]
           ];
    $self->train($patterns);
 
    return $self;
}
 
sub round {
    my $number = shift;
 
    return int($number + .5 * ($number <=> 0));
}
 
sub calculate {
    my $self = shift;
    my ($x,$y) = @_;
 
    return round($self->update([$x,$y])->[0]);
}
 
1;
 
package main;
 
my $ann = XorBpnn->new();
 
print $ann->calculate(1,1) . "\n";
print $ann->calculate(0,0) . "\n";
print $ann->calculate(1,0) . "\n";
print $ann->calculate(0,1) . "\n";
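
The round helper is needed because the network's output is a real number that only approaches 0 or 1 after training, never an exact integer. It rounds half away from zero: the <=> comparison supplies the sign of the number, and int truncates toward zero. The same logic, written out in Python purely for illustration (the function name is made up):

```python
def round_half_away(number):
    """Round to the nearest integer, with halves going away from zero."""
    sign = (number > 0) - (number < 0)  # -1, 0 or 1, like Perl's ($number <=> 0)
    return int(number + 0.5 * sign)     # int() truncates toward zero, as in Perl
```

An output of 0.93 therefore becomes 1 and an output of 0.07 becomes 0.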

New colocation

May 16th, 2008

I’ve pretty much transitioned most hosted sites to a new machine colocated at the Applied Operations - Layer42 datacenter. Big thanks go to their team for getting all of the gear set up properly. The move was from 32-bit FreeBSD 5.4-RELEASE to 64-bit Ubuntu Feisty. The general strlen.net site still goes to the FreeBSD machine, pending a transition (and possibly a re-write).