Friday, August 13, 2010

Trinity Alps backpacking

Herewith, a short report on a delightful 5-day backpacking trip to the Trinity Alps.

The Trinity Alps Wilderness Area is a large protected wilderness located inside the Shasta-Trinity National Forest in far northern California. It is one of the largest wilderness areas in California, and also one of the least visited.

We drove up from the Bay Area on Saturday afternoon, and spent the night in Weaverville at the Red Hill Cabins. Sunday morning we arose early, picked up sandwiches and coffee at the local Tops market, and headed up California 3 toward the Trinity Alps. At Trinity Center, where Swift Creek enters Trinity Lake (formerly Clair Engle Lake, which is a story all to itself), we left the paved road and drove 8 miles on Forest Service dirt roads up to the Poison Canyon trailhead. Trinity Center is at about 2,500 feet elevation, while the trailhead is at 4,100 feet elevation; the Forest Service road took almost 45 minutes to drive.

When we got to the trailhead, we were a trifle concerned when we noticed that the brand new Toyota 4Runner parked there had apparently been attacked by a bear: there were paw marks on the windows; two fiberglass side panels were torn off the car and lying on the road; and when we looked into the car window, a cooler was sitting open on the back seat. We had no food to leave in the truck, so we carefully moved our overnight bags of spare clothes into the center-front of the truck bed, covered them with a grey blanket, closed the truck up tight, and hoped for the best.

The Poison Canyon trail leads upward along the North Fork of Swift Creek, staying close enough to the creek to hear the water, but not close enough to see it. The trail is steep: it climbs 1,800 feet in just 2 miles, which is about a 17% average grade. Happily, the trail leads mostly through shaded woods and pleasant meadows, so even though it is steep and unrelenting, we took our time and enjoyed the walk. Poison Canyon is apparently named for a type of grass that grows there and is indigestible to livestock. Of course, we backpackers refrained from chewing the plants...

As the trail rose through the canyon, we started to see views of towering Ycatapom Peak, the 7,600-foot primary landmark of the area.

After a long break for lunch, we arrived at Lilypad Lake in mid-afternoon, and spent a few hours selecting the best campsite and setting up camp. As the guide books had predicted, we had the canyon to ourselves, and so we picked a great campsite high on the boulders above the lake, near the feeder stream with its pleasant running water. Firewood is abundant in this area of the Alps, and we had an enjoyable campfire before an early night, as we were all quite tired.

Monday morning we headed up the Thumb Rock trail above Lilypad Lake. Again, this trail is very steep, but since we were only carrying light day packs it was much more tolerable, and we made good progress. The trail takes you quickly up 600 feet above the lake, where the views of Mount Shasta are tremendous! We climbed a bit further, past the junction with the Deer Flat trail, to a knoll above Poison Canyon with a spectacular view, at about 7,000 feet. After lunch, we hiked a bit further, to the crest of the Parker Divide, where we got a 360-degree view of the eastern Trinity Alps, including the Parker Creek canyon, Cub Wallow, and Deer Flat. We split up, with Roger and Rich heading off to climb Thumb Rock's neighbor peak (elev. 7,735 feet), while Mike, Deb, and I headed back down the switchbacks to Lilypad Lake for some rest, book reading, and wildlife study.

The shooting stars were great on Monday night! I think we were 3 days early for the Perseids, but we still saw lots of meteors.

Tuesday morning dawned cold and grey, with heavy clouds overhead. Tired and a bit discouraged, we cancelled our plans to take the 8-mile (round-trip) hike over to Boulder Lake, and instead decided to play campground games: horseshoe golf, poker, and rock skipping were all favorites. Unfortunately Rich didn't get to use his fishing pole; maybe next year!

Wednesday we broke camp and headed back down to the trailhead. Happily, Mr. Bear had decided that there was nothing worth fighting for in the truck, and we sailed back down the dirt road, through Weaverville, back to I-5, and (eventually) made it home to the Bay Area for the conclusion of another great trip.

For more pictures, see my online web album!

Friday, August 6, 2010

Bug-fixing and broken windows

People often ask me why I get so passionate about fixing bugs, and why, in particular, I routinely take the time to fix "minor", "unimportant", or "harmless" bugs. Sometimes these discussions get heated, since many programmers, quite justifiably, are concerned about the risk of bug fixing, and about how to manage that risk effectively. Every bug fix introduces volatility into the code base, and there is no denying that seemingly simple fixes can turn out to be more complicated than they appear, break other behaviors unexpectedly, or have unintended side effects. And the volatility itself can be frustrating, because when you are working in a team environment, churn in the code carries a cost of its own.

But I'm a very strong believer in fixing all bugs, even minor bugs, if at all possible, and, for me at least, this belief is deeply rooted in a theory of social behavior known as the Broken Windows Theory. The theory dates from a famous article published in The Atlantic nearly 30 years ago by James Q. Wilson and George L. Kelling, in which they report on several long and detailed studies of neighborhood policing. In the article, Wilson and Kelling observe:

Social psychologists and police officers tend to agree that if a window in a building is broken and is left unrepaired, all the rest of the windows will soon be broken. This is as true in nice neighborhoods as in rundown ones. Window-breaking does not necessarily occur on a large scale because some areas are inhabited by determined window-breakers whereas others are populated by window-lovers; rather, one unrepaired broken window is a signal that no one cares, and so breaking more windows costs nothing.


This is an incredibly important observation, and I believe that it's deeply and fundamentally true.

Moreover, I believe that it's deeply and fundamentally true about software, just as it is about neighborhoods:

one unrepaired broken window is a signal that no one cares

Replace "broken window" with "software defect", in that conclusion, and you'll see what I mean.

So, that's why I get passionate about fixing bugs. But, how do we manage the risk?

Well, here is where I get up on my soapbox with my second passionate opinion about software: test Test TEST! As Kent Beck says,

Never write a line of code without a failing test.


It's very hard to follow this discipline, but it gets easier the more you do it. And, in the end, you'll find that it makes your programming more fun, as well as more successful and productive, if you just follow these basic rules, in the spirit of Martin Fowler (there's a small test-first sketch just after the list):

  • Don't be afraid to change the software

  • Integrate early, integrate often

  • Always write a test

  • Always run your tests

  • Always pay attention when your tests fail

  • Repair your broken windows
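
To make the "always write a test" rule concrete, here's a minimal sketch of the test-first rhythm in Java with JUnit 4. The Account class and its withdraw method are hypothetical names invented purely for this example, not code from any real project; the point is just the order of operations: test first, then code.

import org.junit.Test;
import static org.junit.Assert.*;

// Step 1: write the test first. It fails (it won't even compile) until
// Account exists and enforces the rule we care about.
public class AccountTest {
    @Test
    public void overdraftIsRejected() {
        Account account = new Account(100);   // opening balance of 100
        boolean ok = account.withdraw(150);   // try to overdraw
        assertFalse("overdraft should be rejected", ok);
        assertEquals(100, account.balance()); // balance must be untouched
    }
}

// Step 2: now, and only now, write just enough code to make the test pass.
class Account {
    private int balance;

    Account(int openingBalance) { this.balance = openingBalance; }

    boolean withdraw(int amount) {
        if (amount > balance) {
            return false;      // reject the overdraft; the test goes green
        }
        balance -= amount;
        return true;
    }

    int balance() { return balance; }
}

The same rhythm applies to bug fixing: first write a test that reproduces the bug and fails, then fix the bug, and the test stays behind to keep that particular window from being broken again.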



Climbing down off my soapbox now, and heading for the mountains to think about things other than software for a week...

May 6th Flash Crash is still not well understood

Today is the three-month anniversary of the "Flash Crash", the peculiar trading behavior that occurred during the afternoon of May 6th, wildly disrupting the markets and resulting in thousands of cancelled trades.

Today, the Wall Street Journal has a fascinating, long, and detailed article about what is known, and what still isn't known, about the Flash Crash. If you aren't a subscriber, you may be able to get to the article by searching for it via Google, since something about the redirect-via-Google-search triggers the WSJ to allow you to read the full article rather than stopping at the paywall.

Some of the details in the article were quite new to me, for example:

At one point, Apple traded for nearly $100,000 a share on Arca, according to NYSE officials, after a buy order for 5,000 shares entered the market and only 4,105 shares were available. When Arca's computers saw that no more shares were available to sell, the system automatically assigned a default price of $99,999 to the remaining 895 shares. Those trades later were cancelled.

A default price of one hundred thousand dollars per share? That's ridiculous, and is just as ridiculous as the one penny per share "stub quotes" that also occurred during the crash.

According to the WSJ, the investigation continues, and "the SEC and Commodity Futures Trading Commission expect to issue a final report on the flash crash within a few months." So hopefully we will soon know more about the details and can study them and discuss what they mean.

In the meantime, the WSJ article has some fascinating tidbits about the interactions between humans and their computers during the crash:

Rumors swirled about of an erroneous "fat-finger" order by a trader at Citigroup Inc.—that the trader mistakenly entered extra zeros, turning millions into billions. Citigroup and regulators later said such an errant trade did not appear to have taken place.

But the rumor helped stabilize the market. If the massive decline was the result of a mistake and not some terrible news, that meant there were bargains to be had.


I'm intrigued by the notion of the rumor stabilizing the market. It is very interesting to reflect on what that tells us about how we reason about our computers, and our algorithms, and our observations of systematic behavior.

The WSJ also confirms what others said about the disturbing behavior of so-called "stop-loss" orders during the Flash Crash:

Some of the biggest ETF traders are firms that try to profit from discrepancies between prices of ETFs and the stocks that they track. But as questions mounted about pricing of individual stocks, these firms pulled back from trading. This hurt small investors who had placed "stop-loss orders," aimed at protecting against big losses by automatically selling once prices fell below a certain level. Those orders hit the market when there were virtually no buyers to be found.

Unfortunately, as the article notes,

Exchanges are unlikely to be able to prevent high-frequency trading firms or statistical-arbitrage firms from bailing out of the market en masse.


So it's still quite unclear exactly what went wrong, and what should be done to improve market behavior, other than continuing to educate the public about how modern markets operate, and requiring consumer-facing financial firms to provide appropriate tools, and appropriate training, for the individual investors who are trying to participate in the market themselves (e.g., me, with my Rollover IRA).


Also, if you're still listening, the gang over at Nanex are continuing to explore visualizations of some of the algorithmic trading behaviors that they observed during the Flash Crash. I'm not quite sure what to make of this, but the displays are quite intriguing to scroll through. They also maintain a short links page with some further reading.

Tuesday, August 3, 2010

Jeremy Manson on the complexities of writing a truly accurate Java profiler

If you're even slightly interested in Java performance tuning, don't miss Jeremy Manson's great post on why it is incredibly hard to make a Java performance profiling tool which gives the right answers.

For years I was frustrated by the behavior that Manson describes:

there are many profilers that report a great deal of your time spent in blocking mechanisms, simply because threads frequently become scheduled as they exit blocking functions. However, those sorts of threads are all "ready to be scheduled" rather than "actually running". As a result, CPU time is charged to those threads inaccurately.


Manson describes the fundamental mechanism that his team at Google is investigating in order to build a better Java profiler, and he also points to a new paper from a team at CU-Boulder describing their study of a collection of current state-of-the-art Java profilers:

If two profilers disagree, at least one must be incorrect.

Definitely words to live by.

Hopefully this will lead to some improvement in the Java profiling tools world. Building a better Java profiler is a very leveraged activity, given the amount of Java code in the world and the number of people who are sweating away trying to figure out what their code is doing and how to improve it.

A little bit of Internet trivia that I hadn't known

The Mike Godwin who is General Counsel for the Wikimedia Foundation (the organization behind Wikipedia) is the same Mike Godwin who coined Godwin's Law.

Monday, August 2, 2010

Keeping precious data safe

Does your software trust the other software it uses? Does your software trust the hardware it runs on? Does your software trust the users who use it?

At my day job, we've been engaged in a fascinating discussion recently regarding the limits of software trust. Our product is used to store objects of incredible value, and our users depend completely on us to preserve and retain that value. And when you try to build software that approaches the level of reliability required to preserve tens of millions of files holding dozens of petabytes of content, you start encountering situations where simple measures of trust break down.

If your program writes a block of data to the disk, how confident are you that the block was actually written, with exactly the contents you provided, to exactly the location you specified?

More intriguingly, if your program reads a block of data from the disk, how confident are you that it is the same block of data that your program wrote there, some time ago?

Unfortunately, hardware fails, software fails, and users make mistakes. Bits rot, gamma rays and voltage spikes occur, cables fray and wiggle, subtle bugs afflict drivers, and well-meaning administrators mis-type commands while signed on with high privileges. So once you have fixed all the easy bugs, and closed all the simple design holes, and built all the obvious tests, and created all the routine infrastructure, a program that really, really cares about reliability finds itself facing these core issues of trust.

There's only one technique for addressing these problems, and it's two-part:

  1. Provide recoverability of your data via backups and/or replication,

  2. and have a way to detect damaged data.


The first part ensures that you are able to repair damaged data; the second part ensures that you know which data need to be repaired.

Most people know a lot about backups and replication, but by themselves they're not enough: if you can't detect damaged data, then your backups have no value.

Most organizations that I've dealt with depend, without really realizing it, on detecting damaged data at the time that it's next used. Unfortunately, that is far, far too late:

  • By then, the damage is impacting your business; you've got downtime due to the damage.

  • Worse, by then, the damaged data may have already made it into your backups; your backups themselves may be damaged. And then you're done for.




The sooner you detect damaged data, and the closer to the source of the damage that you detect it, the faster you can repair it, and the more likely you are to be able to track down the source of the damage (buggy software, user error, intermittent power fluctuation, overheating DIMM board, etc.).

Computer science has one basic technique for detecting damaged data, though it goes by several names: checksums, CRCs, hashes, digests, fingerprints, and signatures among others. Here's a nice summary from Wikipedia:

A cryptographic hash function is a deterministic procedure that takes an arbitrary block of data and returns a fixed-size bit string, the (cryptographic) hash value, such that an accidental or intentional change to the data will change the hash value. The data to be encoded is often called the "message", and the hash value is sometimes called the message digest or simply digest.


The basic idea, sketched in code just after this list, is:

  • At some point, when you are confident that your data is valid, you compute the checksum and save it away.

  • Later, when you want to know if your data has been damaged, you re-compute the checksum and compare it against the checksum that you saved before. If they don't match, your data has been damaged (or, perhaps, your checksum has been damaged).
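
Here's a minimal sketch of that compute-then-verify cycle in Java, using the standard java.security.MessageDigest class with SHA-256. The file names are made-up examples, and this is just an illustration of the idea, not production code; a real system also has to decide where to store the checksum so that it doesn't get damaged along with the data.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.Arrays;

public class ChecksumDemo {
    // Compute a SHA-256 digest over the full contents of a file.
    static byte[] checksum(Path file) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        return sha256.digest(Files.readAllBytes(file));
    }

    public static void main(String[] args) throws Exception {
        Path data = Paths.get("precious-data.bin");           // hypothetical data file
        Path digest = Paths.get("precious-data.bin.sha256");  // where the checksum is saved

        // At a moment when we trust the data, compute the checksum and save it away.
        Files.write(digest, checksum(data));

        // Much later: re-compute the checksum and compare it with the saved one.
        byte[] expected = Files.readAllBytes(digest);
        byte[] actual = checksum(data);

        if (Arrays.equals(expected, actual)) {
            System.out.println("Data verifies: checksums match.");
        } else {
            System.out.println("Damage detected (in the data or the checksum): repair from backup.");
        }
    }
}

The same pattern gives you the symmetry described below: checksum on write and verify on read, or checksum on send and verify on receive.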



Note that there is a natural symmetry to the use of checksums:

  • Checksum the data when you write it out, then verify the checksum when you read it back in.

  • Checksum the data when you send it over the net, then verify the checksum when you receive the data over the net.



Given that reliable, free, widely available checksum libraries are out there, this should be the end of the discussion. But of course life is never so simple. Programs fail to checksum data at all, or they checksum it but don't later verify the checksum, or they checksum the data after the damage has already happened. Or they use a weak checksum, which doesn't offer very extensive protection. Or they do all the checksumming properly, but don't tie the error checking into the backup-recovery system, so they don't have a way to prevent damaged data from "leaking" into the backups, leading to damaged and useless backups and permanent data loss. There is a world of details, and it takes years to work through them and get them all right.


If you've read this far, and you're still interested in learning more about this subject, here are two more resources to explore:

  • Stone and Partridge's detailed analysis of error patterns in TCP traffic on the Internet: When The CRC and TCP Checksum Disagree. Here's an excerpt from the conclusion:

    In the Internet, that means that we are sending large volumes of incorrect data without anyone noticing. ... The data suggests that, on average, between one packet in 10 billion and one packet in a few millions will have an error that goes undetected. ... Our conclusion is that vital applications should strongly consider augmenting the TCP checksum with an application sum.

  • Valerie Aurora's position paper, An Analysis of Compare-by-Hash, on the use of cryptographic hashes as surrogates for data, particularly in places such as disk de-dup, file replication, and backup. Here's an excerpt from her paper:

    On a more philosophical note, should software improve on hardware reliability or should programmers accept hardware reliability as an upper bound on total system reliability? What would we think of a file system that had a race condition that was triggered less often than disk I/O errors? What if it lost files only slightly less often than users accidentally deleted them? Once we start playing the game of error relativity, where do we stop? Current software practices suggest that most programmers believe software should improve reliability -- hence we have TCP checksums, asserts for impossible hardware conditions, and handling of I/O errors. For example, the empirically observed rate of undetected errors in TCP packets is about 0.0000005%. We could dramatically improve that rate by sending both the block and its SHA-1 hash, or we could slightly worsen that rate by sending only the hash.



Have a good time out there, and keep that precious data safe!

Sunday, August 1, 2010

Raiders of Unusual Size

You know it's almost time for Tackle Football here in America when you start reading stories like this one:

Tackles Mario Henderson and Langston Walker: These are your starters on Day One in Napa for an offensive line that Yahoo! Sports recently called the worst in the league.

...

Walker is back for another stint, and the Cal product has dropped some weight and is down to 360 pounds. He looked fresh at minicamp and quickly moved into the lineup, dispatching Khalif Barnes to get some work at guard. Walker's emergence could also signal a lot more power blocking under new offensive coordinator Hue Jackson, in addition to the zone blocking that Cable favored.


Power Blocking! Yee Haw! Let's play some football!

Ah, yes, it's the Raiders pre-season. The best time of all!