Saturday, January 28, 2012

DERBY-5584

This week I've been spending a bit of time working on DERBY-5584.

This is a bug that I introduced into Derby in October, 2009, when I re-wrote Derby's GROUP BY engine to add multi-level aggregation support (DERBY-3002).

The new bug is rather complicated, because it involves (a) multiply-scanned result sets and (b) DISTINCT aggregates.

Let's see if I can explain those two concepts, starting first with DISTINCT aggregates.

Aggregates are a feature of the SQL language. A SQL statement normally reports on individual rows in your table(s): each row in your result can be traced back to a particular row in a particular table (possibly multiple tables, joined together). Aggregates, however, are functions which collapse the detail of individual rows of data into coarser-grained summarizations.

The classic aggregates in the core SQL language are MIN, MAX, COUNT, SUM, and AVG, and these have been part of the language for decades. Most DBMS implementations add additional aggregate functions, but it's sufficient for our purposes to consider these basic five.

An example of an aggregate query would be to figure out the average mileage of flights between San Francisco and San Antonio:

SELECT AVG(flight_miles) 
   FROM flights
   WHERE (origin = 'SFO' AND destination = 'SAT')
      OR (destination = 'SFO' AND origin = 'SAT')

When performing these sorts of aggregate queries, it's often quite useful to break things down by grouping the results based on some other fields. For example, suppose we wanted to know whether the distance flown tends to vary depending on the day of the week (do weekend flights get to take a shorter flight path?):

SELECT departure_day_of_week, AVG(flight_miles) 
   FROM flights
   WHERE (origin = 'SFO' AND destination = 'SAT')
      OR (destination = 'SFO' AND origin = 'SAT')
   GROUP BY departure_day_of_week

A query such as this is executed (in Derby, as in many other databases) by sorting the data on the grouping column(s) and then processing it in order. So we'd: (a) find all the rows matching the WHERE clause, (b) pull the departure_day_of_week and flight_miles columns out of each row, (c) sort the rows by departure_day_of_week, and (d) pump all that data through the aggregation engine.

The aggregation engine would then be able to see the rows in a single pass: first all the "FRI" rows, then all the "MON" rows, then the "SAT" rows, the "SUN" rows, the "THU" rows, the "TUE" rows, and lastly the "WED" rows. For each group of rows, the engine computes the AVG(flight_miles) in the obvious straightforward fashion (computing the total sum and the number of instances and dividing). The result of the query is 7 rows, one for each day, with that day's average.
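
To make that single pass concrete, here's a minimal sketch of the grouping pass in Java (Derby happens to be written in Java, but this is purely illustrative and is not Derby's code; the Row class, the hard-coded mileages, and the printing are my own inventions):

import java.util.Arrays;
import java.util.List;

// A sketch of the single grouping pass, NOT Derby's actual code.  The rows
// are assumed to arrive already sorted by departure_day_of_week, so each
// group's rows are adjacent and one pass is enough to compute AVG.
public class GroupByAvgSketch {

    static class Row {
        final String day;      // departure_day_of_week
        final double miles;    // flight_miles
        Row(String day, double miles) { this.day = day; this.miles = miles; }
    }

    public static void main(String[] args) {
        List<Row> sortedRows = Arrays.asList(
            new Row("FRI", 1480.0), new Row("FRI", 1520.0),
            new Row("MON", 1490.0), new Row("MON", 1510.0), new Row("MON", 1500.0));

        String currentGroup = null;
        double sum = 0.0;
        long count = 0;
        for (Row row : sortedRows) {
            if (!row.day.equals(currentGroup)) {
                if (currentGroup != null) {
                    System.out.println(currentGroup + "  " + (sum / count));  // emit finished group
                }
                currentGroup = row.day;   // start a new group
                sum = 0.0;
                count = 0;
            }
            sum += row.miles;             // accumulate AVG's running total
            count++;
        }
        if (currentGroup != null) {
            System.out.println(currentGroup + "  " + (sum / count));          // emit the last group
        }
    }
}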

Note that the author of the query generally includes the grouping columns in the result, to make the result easy to read.

You can GROUP BY multiple columns, but the overall principle is the same: the columns used to sort the records are exactly the columns used to group the results.

DISTINCT aggregates add some additional complexity. Suppose that, if two flights happen to have exactly the same flight_miles, we want to count that value in the average only once; that is, we want to compute the average over only the unique values of flight_miles for each particular departure_day_of_week.

(I admit, this is very contrived, but hopefully it's still obvious what we're talking about here.)

In this case, the programmer can use the word DISTINCT to specify this:

SELECT departure_day_of_week, AVG(DISTINCT flight_miles) 
   FROM flights
   WHERE (origin = 'SFO' AND destination = 'SAT')
      OR (destination = 'SFO' AND origin = 'SAT')
   GROUP BY departure_day_of_week

This adds a small additional wrinkle to the query execution, as we need to refine the results so that each unique combination of departure_day_of_week and flight_miles is considered only once.

Derby accomplishes this by including the DISTINCT aggregate column in the sorting process, so that the rows are sorted by the compound key (departure_day_of_week, flight_miles). This way, all the rows which contain duplicate values for this pair of columns arrive together, and we can consider only the first such row and discard the others.
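
Here's a similarly hedged sketch of that duplicate-elimination step (again, just an illustration of the idea, not Derby's code; the Row class and the sample values are made up). With the rows sorted on the compound key, duplicate (departure_day_of_week, flight_miles) pairs arrive adjacent to one another, and only the first of each run is passed along to the aggregate:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A sketch of duplicate elimination for a DISTINCT aggregate, NOT Derby's
// actual code.  The rows are assumed sorted on the compound key
// (departure_day_of_week, flight_miles), so duplicate pairs are adjacent and
// only the first of each run is kept for the aggregate.
public class DistinctAggregateSketch {

    static class Row {
        final String day;
        final double miles;
        Row(String day, double miles) { this.day = day; this.miles = miles; }
        public String toString() { return day + "/" + miles; }
    }

    static List<Row> keepDistinct(List<Row> sortedRows) {
        List<Row> distinct = new ArrayList<Row>();
        Row previous = null;
        for (Row row : sortedRows) {
            boolean duplicate = previous != null
                    && previous.day.equals(row.day)
                    && previous.miles == row.miles;
            if (!duplicate) {
                distinct.add(row);   // first occurrence: feed it to the aggregate
            }
            previous = row;          // later occurrences of the same pair are discarded
        }
        return distinct;
    }

    public static void main(String[] args) {
        System.out.println(keepDistinct(Arrays.asList(
            new Row("MON", 1500.0), new Row("MON", 1500.0), new Row("MON", 1510.0))));
    }
}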

When I re-wrote the GROUP BY engine as part of DERBY-3002, I considered this problem, and implemented it, but I made a mistake. Note that, above, I observed that "the columns used to sort the records are exactly the columns used to group the results". However, with DISTINCT aggregates, this isn't precisely true, as there is one extra column in the sort key (the DISTINCT aggregate column) which isn't used to group the results, just to sort them.

In DERBY-3002, I handled that special case by removing the DISTINCT aggregate column from the sort key after the sort, but before the grouping pass.

This worked, but ...

For most queries, the natural flow of query execution processes each intermediate result once. This is efficient and so the database works hard to do this whenever possible.

However, there are some cases in which an intermediate result must be processed multiple times. One example is the so-called "Cartesian Product" query:

SELECT a.*, b.* FROM a, b

In this query, the database produces all possible combinations of rows from tables a and b: for each row in table a, and each row in table b, there is a row in the result containing those values from a's row and b's row.

In such a query, Derby uses a brute force technique and simply implements the query as: for each row in a, read each row of b, and emit the values.

This means that we read the inner table (table b) multiple times, once per row in table a.
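
In pseudo-Java, the plan looks roughly like the sketch below (the InnerResultSet class is a toy stand-in I made up, not a Derby class; the point is just that the inner result set is re-opened and re-scanned from the beginning for every outer row):

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// A sketch of the brute-force nested-loop plan for SELECT a.*, b.* FROM a, b,
// NOT Derby's actual code.  The inner result set is re-opened and re-scanned
// from the start once for every row of the outer table.
public class NestedLoopSketch {

    // A toy stand-in for an inner result set that can be opened repeatedly.
    static class InnerResultSet {
        private final List<String> rows = Arrays.asList("b1", "b2");
        private Iterator<String> cursor;

        void open() { cursor = rows.iterator(); }                   // restart the scan
        String next() { return cursor.hasNext() ? cursor.next() : null; }
    }

    public static void main(String[] args) {
        List<String> outerRows = Arrays.asList("a1", "a2", "a3");
        InnerResultSet inner = new InnerResultSet();

        for (String a : outerRows) {
            inner.open();               // the second and later opens re-run the whole inner scan
            for (String b; (b = inner.next()) != null; ) {
                System.out.println(a + ", " + b);
            }
        }
    }
}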

This was where my bug came in: it is possible that the inner table is a GROUP BY query of its own, which includes a DISTINCT aggregate.

When this happens, my code that removed the DISTINCT aggregate column from the sort key was causing problems. The first time we read the inner table, everything was fine, but then the next time, the ordering/grouping/sorting columns were all messed up (I removed the column from the key, but didn't add it back for the next go-round).

In DERBY-5584, you can see a clear and simple example of a query which demonstrates the bug.

The fix, I believe, is that instead of physically removing the column from the sort key, we need to teach the grouping logic that it may need to consider only a leading subset of the columns in the sort key as the grouping columns.
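
Very roughly, the idea is that the group-change test compares only the leading grouping columns of the sort key, while the sort key itself is left intact so that later re-scans of the result set still sort and group consistently. Here's a tiny sketch of that comparison (hypothetical names, not the actual patch):

import java.util.Arrays;
import java.util.List;

// A sketch of the idea behind the fix, with hypothetical names (NOT the
// actual patch).  The sort key is (departure_day_of_week, flight_miles), but
// only the leading groupingColumnCount columns decide whether a new group has
// started; the sort key itself is never modified, so re-opening and
// re-scanning the result set keeps working.
public class LeadingPrefixGroupingSketch {

    static boolean sameGroup(List<Object> previousKey, List<Object> currentKey,
                             int groupingColumnCount) {
        // Compare only the leading subset of the sort key columns.
        return previousKey.subList(0, groupingColumnCount)
                .equals(currentKey.subList(0, groupingColumnCount));
    }

    public static void main(String[] args) {
        List<Object> prev = Arrays.<Object>asList("MON", Double.valueOf(1500.0));
        List<Object> curr = Arrays.<Object>asList("MON", Double.valueOf(1510.0));

        System.out.println(sameGroup(prev, curr, 1));   // true: same day, so same group
        System.out.println(sameGroup(prev, curr, 2));   // false: the full keys differ
    }
}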

I've implemented that change and am hoping it solves the bug.

Friday, January 27, 2012

It's not just a game, it's an obsession

The search https://www.google.com/#q=skyrim+site:youtube.com now returns more than 24 million videos, nearly all of them fan-contributed.

For example, here's a particularly nice one: http://www.youtube.com/watch?v=IL5K09mqwZc.

Now, back to that bandit camp...

Advances in weather instrumentation

I dug this great post from Cliff Mass about the state of modern weather observation instrumentation.

As I read it, I was remembering a discussion I'd had with my sailing friend Nick about whether the real-time wind reports at http://www.sailflow.com could/should include information from the boats themselves (to start with, from large commercial vessels or government vessels such as the USCG small boats, but eventually perhaps each sailboat itself could report on the wind conditions it's observing).

So it was a kick to read about how large-scale weather forecasting in fact does include such information, and to come across this subtle observation about the feedback loop that occurs with that process:

As part of the Volunteer Observing System (VOS), mariners take observations every six hours. The light blue dots on this chart show where ships were reporting at one particular time:

...

[But] as forecasts get better, the ships avoid the areas we really need data---in the middle and near major storms. When forecasts were bad, ships would get trapped in dangerous conditions--now they can get out of the way.

Technology improves, and the way that humans use that technology improves as well, and the whole inter-connected world evolves.

Thursday, January 26, 2012

"Then the screams began."

All right, I know, I said I was not writing about these topics, but, really, have you read this report in today's New York Times about the conditions in the Chinese factories that build the world's high-tech gadgetry?

Those accommodations were better than many of the company’s dorms, where 70,000 Foxconn workers lived, at times stuffed 20 people to a three-room apartment, employees said. Last year, a dispute over paychecks set off a riot in one of the dormitories, and workers started throwing bottles, trash cans and flaming paper from their windows, according to witnesses. Two hundred police officers wrestled with workers, arresting eight. Afterward, trash cans were removed, and piles of rubbish — and rodents — became a problem.

It's a long and detailed article; you need to read it all. But it concludes with this observation:

“You can either manufacture in comfortable, worker-friendly factories, or you can reinvent the product every year, and make it better and faster and cheaper, which requires factories that seem harsh by American standards,” said a current Apple executive.

“And right now, customers care more about a new iPhone than working conditions in China.”

Go. Read. The. Article.

Also worth reading: Making It in America, by Adam Davidson of NPR's Planet Money team.

These meetings can lead the company to move dozens of jobs to another country or, in some cases, to create new jobs in the U.S. When Standard decided to increase its fuel-injector production, it chose to do that in the U.S., and staffed up accordingly (that’s how Maddie got her job). Standard will not drop a line in the U.S. and begin outsourcing it to China for a few pennies in savings. “I need to save a lot to go to China,” says Ed Harris, who is in charge of identifying new manufacturing sources in Asia. “There’s a lot of hassle: shipping costs, time, Chinese companies aren’t as reliable. We need to save at least 40 percent off the U.S. price. I’m not going to China to save 10 percent.” Yet often, the savings are more than enough to offset the hassles and expense of working with Chinese factories. Some parts—especially relatively simple ones that Standard needs in bulk—can cost 80 percent less to make in China.

These are complex issues. I'm pleased that journalists are taking the time to really dig into them, and to help educate us all, for as Dan Lyons said: "The problem I’m having isn’t with Apple, but with me."

Wednesday, January 25, 2012

Location, location, location

In computing, as in everything else, we care about physical location:

  • First, do you know where, I mean exactly where, the Web was invented? In case you don't, David Galbraith helps us track down the precise location, complete with a Google-mapped satellite view and a short interview with TBL himself:
    I wrote the proposal, and developed the code in Building 31. I was on the second (in the European sense) floor, if you come out of the elevator (a very slow freight elevator at the time anyway) and turn immediately right you would then walk into one of the two offices I inhabited. The two offices (which of course may have been rearranged since then) were different sizes: the one to the left (a gentle R turn out of the elevator) benefited from extra length as it was by neither staircase nor elevator. The one to the right (or a sharp R turn out of the elevator) was shorter and the one I started in. I shared it for a long time with Claude Bizeau. I think I wrote the memo there.
    I love the photo of the nice young fellow who currently occupies the office...!

  • Secondly, this is all-too-faddish, but you've got to check out this delightful video full of the latest Silicon Valley cliches.
    "Who has a party in Palo Alto?"

    ...

    "It's like Pandora ... for cats!"

    Great stuff, and it really is quite accurate and well-written.

  • Lastly, this brought back great memories of my time in New England just after we got married. We lived in Boston (Brighton, to be precise); I worked in Kendall Square in Cambridge; many of my friends switched jobs and worked at "the old Lotus building", which at the time was the new Lotus building, back when Lotus was its own company and was the hot new place to be; I rode my bike to and from the office along Western Avenue and through Central Square; and I went to grad school 5 stops down the Red Line at UMass Boston (while my wife was taking a film studies class at BU).

I miss those days in Boston.

But the point of Feld's essay, and the point, really, of all three essays/films, is that location matters. It really does. I loved my time in Boston, and even though Brad Feld is right that East Cambridge is one of the most remarkable and wonderful spots in the country, the Bay Area is even better, which is why we (reluctantly) switched coasts in 1988.

Tuesday, January 24, 2012

Back to the hard-core tech stuff ...

Here are three pointers that are (hopefully) as interesting to you as they were to me:

  1. Over at Real World Technologies, David Kanter has posted this great writeup of Intel's latest mobile-oriented microprocessor platform: Medfield, Intel's x86 Phone Chip
    Medfield is a credible SoC for smartphones and is good enough to begin the process of building vendor and carrier relationships for Intel. This is particularly true, given Intel’s attractive roadmap. In 2013, Intel will ship a 22nm FinFET SoC with the new, power-optimized Silvermont CPU and the recently announced PowerVR Series 6 graphics. The rest of the world will ramp 20/22nm in 2014 at the earliest, a gap of 6-12 months. Judging by Intel’s plans for 14nm SoCs based on the Airmont CPU core in 2014, this process technology advantage is only likely to grow over time. Whether that advantage will yield a significant smart phone market share for Intel is uncertain, but Medfield clearly demonstrates that it is possible.
    Even if Intel doesn't succeed, their continued presence in the marketplace and competition for market share will spur the big players (Qualcomm, TI, ARM, etc.) to improve their own systems, not just rest on their laurels.

  2. The HTTP Working Group, which is (I think) a joint effort of the W3C and the IETF, is considering a new effort to try to define a HTTP 2.0 specification.
    Why here? This mailing list is the best approximation of the HTTP community; it has participation (or at least presence) from most implementations, including browsers, servers, intermediaries, CDNs, libraries, tools, etc. I firmly believe that as HTTP evolves, it needs to accommodate the entire community, not just the selected needs of a subset, so rather than creating a new WG or having a private collaboration, it should happen here.
    This won't be easy, but it's great to see them trying. I suspect that these topics are some of what they'll be talking about.

  3. And if you don't have enough to read yet, here's a very nice, compact, and well-rounded Distributed Systems Reader. As it turns out, I was already familiar with all but two of these papers, but hey! Two new papers on the foundations of Distributed Systems; what's not to like!

CAPTCHA factories

It seems to be the week for me to write about social issues in computing, so I'll toss one more in. Then, I promise, I'm done with that slant for a while, because it's really not my sort of thing.

Anyway, I can't help but note the fascinating (in a can't-look-away-from-the-abject-horror sort of way) recent research done by a team at the University of California San Diego: Re: CAPTCHAs -- Understanding CAPTCHA-Solving Services in an Economic Context.

CAPTCHAs, of course, are those strange little pictures full of numbers and letters, generally semi-garbled or semi-distorted, that various web pages ask you to type in in order to prove you're a human being, not a robot. The idea is to dissuade people who are writing programs to manipulate web pages that are only intended to be manipulated by human beings. The acronym stands for "Completely Automated Public Turing test to tell Computers and Humans Apart".

Do they work? Well, according to the UCSD study, they actually do work fairly well, according to the strict interpretation of their stated goal. After testing several specialized programs designed to be able to "break" CAPTCHAs, the researchers found that the automated solvers generally were not successful:

We observed an accuracy of 30% for the 2008-era test set and 18% for the 2009-era test set using the default setting of 613 iterations, far lower than the average human accuracy for the same challenges (75–90% in our experiments).

So CAPTCHAs are working, right?

Well, not so fast.

It turns out that there is an immense industry of CAPTCHA-solving, and the solvers are actual human beings, not computer programs:

there exists a pool of workers who are willing to interactively solve CAPTCHAs in exchange for less money than the solutions are worth to the client paying for their services.

These people apparently sit in front of computers for hours at a time, doing nothing but solving CAPTCHAs that are displayed to them by Internet-based solving services, which then turn around and sell the solutions to clients willing to pay for them:

Since solving is an unskilled activity, it can easily be sourced, via the Internet, from the most advantageous labor market—namely the one with the lowest labor cost. We see anecdotal evidence of precisely this pattern as advertisers switched from pursuing laborers in Eastern Europe to those in Bangladesh, China, India and Vietnam

How much do these people end up getting paid? Almost nothing, but still enough to attract workers:

on Jan. 1st, 2010, the average monthly payout to the top 100 earners decreased to $47.32. In general, these earnings are roughly consistent with wages paid to low-income textile workers in Asia [12], suggesting that CAPTCHA-solving is being outsourced to similar labor pools

What do the authors conclude from all of this? The answer is that you can view the whole arrangement in an economics framework:

Put simply, a CAPTCHA reduces an attacker’s expected profit by the cost of solving the CAPTCHA. If the attacker’s revenue cannot cover this cost, CAPTCHAs as a defense mechanism have succeeded. Indeed, for many sites (e.g., low PageRank blogs), CAPTCHAs alone may be sufficient to dissuade abuse. For higher-value sites, CAPTCHAs place a utilization constraint on otherwise “free” resources, below which it makes no sense to target them. Taking e-mail spam as an example, let us suppose that each newly registered Web mail account can send some number of spam messages before being shut down. The marginal revenue per message is given by the average revenue per sale divided by the expected number of messages needed to generate a single sale. For pharmaceutical spam, Kanich et al. [14] estimate the marginal revenue per message to be roughly $0.00001; at $1 per 1,000 CAPTCHAs, a new Web mail account starts to break even only after about 100 messages sent.
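
Just to restate that break-even arithmetic from the quote (no new numbers here; these are exactly the figures the authors give), a few lines of Java:

// A back-of-the-envelope restatement of the quoted break-even arithmetic;
// the figures come straight from the quote above, nothing here is new data.
public class CaptchaBreakEven {
    public static void main(String[] args) {
        double revenuePerMessage = 0.00001;    // Kanich et al.'s estimated revenue per spam message
        double costPerCaptcha = 1.0 / 1000.0;  // $1 per 1,000 solved CAPTCHAs

        // Messages a freshly registered account must send before the cost of
        // the CAPTCHA solved to register it is recovered.
        System.out.println(costPerCaptcha / revenuePerMessage);   // prints 100.0
    }
}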

It's a cold, calculating, unfeeling analysis, but it's an absolutely fascinating paper, easy to read and full of lots of examples and descriptions of the details behind this corner of the Internet. I never knew this existed, and I'm wiser now that I do.

However, I still felt the irresistible urge to go and wash my hands after learning all this. :(