By Jim, on May 23rd, 2013 Whatever else it is, randomness is peculiar. More peculiar still is our perception of what is or is not random. Consider, for example, the case of flipping a coin five times. Suppose you do that and get the sequence T,H,H,T,H (T is tails, H is heads). That seems perfectly reasonable, right? Random? Imagine instead that you got H,T,H,T,H. You’d probably think the alternating heads and tails is pretty weird, but you still got almost the same number of tails as heads which is pretty much the definition of what people think of as “random.”
If you got H,H,H,H,H, you’d probably think that the coin was rigged. It could be, but it could also be a perfectly fair coin.
Given five flips of a fair coin, there are 25 (32) possible sequences that can occur. You’re equally likely to get all heads as you are to get something like the first sequence I showed, or all tails, or any other sequence. The all heads result just seems “not random” because it’s so distinctive. But there is a one in 32 chance of that or any other result.
Pick up your coin again and flip it five times, writing down each result. Was the sequence T,T,H,T,H? For approximately one out of 32 people who do this exercise, it will be.
Think of it this way. Write down a number from 1 to 32, go to random.org and have it generate a number in that range. Would you be very surprised if it picked your number on the first try? More or less surprised than if you flipped a coin five times and got all tails?
Instead of getting random.org to generate a number for you, you could ask somebody to guess what it is. But that wouldn’t be a fair test because people don’t pick numbers randomly. Asked to select a number in some range, people are more likely to pick an odd number than an even number, more likely to pick primes, and the chosen number is usually in the bottom third or top third of the range. See for example, Is 17 the “most random” number? and Probability, algorithmic complexity, and subjective randomness. Or just do a Google search for “psychology of randomness.”
Because you probably picked a number using the same criteria, other people have a better chance of guessing your number than does a computer that gives each number equal weight. Of course, now that you know what most people do, you’ll probably pick a number that others are less likely to guess.
What most people think of as “randomness” is really complexity. The more complex a sequence looks, the more “random” we perceive it. THHTH appears more complex than HHHHH, and therefore more “random” in our eyes. It’s what researchers call subjective randomness.
Not only are people terrible at perceiving randomness, they’re also terrible at generating randomness. Asked to flip a coin 100 times and write down the results, many college students will “cheat” and forego flipping the coin. They’ll just write down what they think is a random sequence. It’s usually easy to catch them because the idea of run of four or five tails just seems “not random.” But getting five heads or five tails in a row is very common given 100 flips of a fair coin.
Let me demonstrate with a smaller set.
There are 32 possible sequences of five flips, which I’ve reproduced below. I’ve substituted ’0′ for ‘T’ and ’1′ for ‘H’ because 0 and 1 are more visually distinct than T and H, and because I’m accustomed to using 0 and 1 for binary results.
00000 01000 10000 11000
00001 01001 10001 11001
00010 01010 10010 11010
00011 01011 10011 11011
00100 01100 10100 11100
00101 01101 10101 11101
00110 01110 10110 11110
00111 01111 10111 11111
The sequence 00 occurs in 19 of those 32 results. So the likelihood of getting two tails in a row in a sequence of five flips is 19/32, or about 59%. The odds of not getting two tails in a row, then, is 13/32, or 41%. Getting two tails in a row wouldn’t be strange. Neither would getting two heads. What would be strange is not getting either two heads or two tails in a row. There are only two sequences that don’t have a matching pair (01010 and 10101), so in 30 out of 32 cases (94% of the time), there will be either two heads in a row or two tails in a row.
It’s not uncommon, in fact, to see a run of three tails. In the results above, there are eight sequences with 000, so 25% of the time you’ll see a run of three tails. In just five flips of the coin!
You can actually compute the probability of not getting a run of k tails in n tosses using something called a Fibonacci k-step number. I used that formula to verify the results I derived above.
There are 2100 (a really big number) possible sequences of 100 flips. I can’t list them all, but I can use the formula to determine the likelihood of not getting a run of five tails in a sequence of 100 flips. I just had to compute the 102nd item in the 5-step sequence. Here’s the method I used.
private BigInteger KStep(int k, int n)
{
// Calculate the nth value in the Fibonacci k-step number number sequence
BigInteger[] all = new BigInteger[n];
all[0] = 1;
all[1] = 1;
for (int i = 2; i < n; ++i)
{
BigInteger rslt = all[i - 1];
for (int j = i - 2; j >= 0 && j >= i - k; --j )
{
rslt = rslt + all[j];
}
all[i] = rslt;
}
return all[n - 1];
}
That’s not the most efficient implementation possible, but it did the job. The result I got for KStep(5, 102) is 240,714,680,556,315,819,945,145,376,976, or about 2.4129. I have to divide that by 2100 (1.2730), the result of which is 0.1899. So given 100 flips, 19% of the time I won’t see a run of five tails.
Because I don’t have a trillion lifetimes to generate and examine all the possible 100-flip sequences, I wrote a little function that will generate a million different n-flip sequences, each time checking to see if it got a run of k tails. It keeps track of how many times there isn’t a run, and reports the results. Called with Flipper(5, 100), it reports a value pretty close to 19% every time.
private const int NTries = 1000000;
private void Flipper(int k, int n)
{
var rnd = new Random();
string testString = new string('T', k);
int runCount = 0;
// how often do we NOT get k tails in n flips?
for (int i = 0; i < NTries; ++i)
{
var sb = new StringBuilder(n);
for (int flip = 0; flip < n; ++flip)
{
var x = rnd.Next(2) == 0;
sb.Append(x ? 'H' : 'T');
}
if (!sb.ToString().Contains(testString))
{
++runCount;
}
}
Console.WriteLine("{0:N0} out of {1:N0}: {2:P2}",
runCount, NTries, (double)runCount/NTries);
}
If you play with the program a bit, you’ll find that runs of six tails happen more than half the time, runs of seven happen about a third of the time, and you’re twice as likely to get a run of 10 than not get a run of four. File that one away in the “Math that just don’t feel right” drawer.
By Jim, on May 17th, 2013 I’ve recently had occasion to write LINQ versions of two functions I’d written some time back. It proved instructive.
Random numbers that sum to N
Doing some testing, I needed an array of 100 random floating point numbers that sum to 1.0. That is, given an array a, (a[0] + a[1] + a[2] ... + a[99]) = 1. The easiest way I know of to do that is to generate an array of 100 random values, sum them, and then divide each number by the sum.
const int ArraySize = 100;
Random rnd = new Random();
double[] a = new double[ArraySize];
double sum = 0;
// generate the values
for (int i = 0; i < a.Length; ++i)
{
a[i] = rnd.NextDouble();
sum += a[i];
}
// now divide each item by the sum
for (int i = 0; i < a.Length; ++i)
{
a[i] /= sum;
}
The math behind this is pretty simple. What you end up with is (a[0]/sum + a[1]/sum + a[2]/sum ... + a[99]/sum), which works out to sum/sum, which is 1.
Somebody asked me for a LINQ version.
var temp = (from d in Enumerable.Range(0, ArraySize)
select rnd.NextDouble()).ToArray();
var sum = temp.Sum();
var a = temp.Select(d => d/sum).ToArray();
An obvious extension to this is generating an array of N values that sum to some arbitrary value. The first part of the algorithm is the same: generate N numbers, and sum them. Then, rather than dividing each number by the sum, divide by (sum/desired_value). Or, you can multiply by (desired_value/sum). So the final LINQ expression becomes:
var a = temp.Select(d => d*(desiredValue/sum)).ToArray();
And don’t worry about the division in the inner loop there. The compiler (either the C# compiler, or the JIT compiler) will move it out of the loop.
Finding palindromes
At a previous job, one part of the interview process for entry-level programmers was having them solve some simple programming puzzles on the white board. I called it The White Board Inquisition. A common exercise was to write a function that determines if an input string is a palindrome. We’d keep it simple and have them look for an exact match. So “amanaplanacanalpanama” would be a palindrome, but “A man, a plan, a canal, Panama!” would not be. Stripping out the non-alphabetic characters and doing case-insensitive matching is a simple refinement, but we wanted to see how the candidate approached the primary problem.
The simplest way, of course, is to reverse the string and do a comparison, but we wouldn’t allow them to call a reverse method. If they wanted to reverse the string, they’d have to do it themselves. Many did. Another common approach is to push the characters onto a stack and then pop them, comparing the popped values with the string characters, in order.
Few candidates tripped to the in-place method of comparing the characters starting from the outside and moving in, like this.
private bool IsPalindrome(string input)
{
int i = 0;
int j = input.Length - 1;
while (i < j)
{
if (input[i++] != input[j--])
return false;
}
return true;
}
Somebody recently asked me for a LINQ version of a palindrome finder. The obvious solution is to employ the Reverse extension method and then compare the sequences. But it’s easy enough to duplicate the above algorithm.
private bool IsPalindrome(string input)
{
return
Enumerable.Range(0, input.Length / 2)
.Select(i => input[i] == input[input.Length - i - 1])
.All(b => b);
}
A common misconception is that the LINQ code will do those three operations in sequence: generate a list of indexes, then generate a list of results, and then compare the results. That most definitely is not how the generated code works. If you enable runtime library source stepping, you’ll see that the sequence of operations follows the looping construct very closely. The LINQ solution is slower, of course, because it’s working with iterators and such, but it isn’t stupid.
The more I work with LINQ, the more impressed I am by what I can do with it. It’s not always the fastest executing solution, but for most of what I do that doesn’t matter. I find that the expressiveness leads to much shorter, more easily understood, and more reliable code than writing large looping constructs. I imagine that interviewers these days expect candidates to demonstrate a thorough understanding of LINQ. I know that I certainly would.
By Jim, on May 16th, 2013 There is the question of how to handle null parameters in extension methods. Consider this extension method that returns the number of digits in a string.
public static class MyExtensions
{
public static int CountDigits(this string input)
{
var count = 0;
foreach (var c in input)
{
if (char.IsDigit(c))
{
++count;
}
}
return count;
}
}
Understand, this is just an example. I realize that I can replace all that code with a single function call, and in fact I wouldn’t actually write such a simple extension method to replace that one line of code in my program. Bear with me.
After running the code below, result will be equal to 10.
string foo = "(555)123-4567";
var result = foo.CountDigits(); // returns 10
So what should happen if I call that method when foo == null? Let’s see what the Framework does when I call a string method on a null reference.
string foo = null;
string lowerFoo = foo.ToLower();
Not surprisingly, that throws NullReferenceException. And, not surprisingly, the CountDigits method also throws NullReferenceException when foo == null. The difference, though, is important. If you look at the exception detail you’ll find that the call to foo.ToLower reports the exception as being thrown on that line. The stack trace for the call to CountDigits looks like this.
System.NullReferenceException was unhandled
HResult=-2147467261
Message=Object reference not set to an instance of an object.
Source=sotesto
StackTrace:
at Test.MyExtensions.CountDigits(String input) in c:\Dev\Jim\Testo2012\sotesto\Program.cs:line 37
at Test.Program.DoStuff() in c:\Dev\Jim\Testo2012\sotesto\Program.cs:line 28
The exception is thrown on the line containing the foreach statement.
If you understand how extension methods work, this shouldn’t be surprising. After all, the code foo.CountDigits(); is really just shorthand for MyExtensions.CountDigits(foo);.
The exception is saying that the error exists in the extension method when in reality the error exists at the call site. This is an important distinction. Imagine that you’re calling an extension method that exists in a library for which you have no source. If that extension method throws NullReferenceException, you will likely assume that the error is in the extension method. And nothing in the stack trace will tell you otherwise. There is nothing to indicate that it threw that exception because you passed a null parameter.
That is incorrect behavior.
Don’t believe me? Let’s see what happens if I rewrite that extension method.
public static class MyExtensions
{
public static int CountDigits(this string input)
{
return input.Count(char.IsDigit);
}
}
Yes, that duplicates the functionality of the previous version. But if you call it with a null parameter, the result is quite different.
System.ArgumentNullException was unhandled
HResult=-2147467261
Message=Value cannot be null.
Parameter name: source
Source=System.Core
ParamName=source
StackTrace:
at System.Linq.Enumerable.Count[TSource](IEnumerable`1 source, Func`2 predicate)
at Test.MyExtensions.CountDigits(String input) in c:\Dev\Jim\Testo2012\sotesto\Program.cs:line 36
at Test.Program.DoStuff() in c:\Dev\Jim\Testo2012\sotesto\Program.cs:line 28
Here we see that the Enumerable.Count extension method threw ArgumentNullException, and actually told me why: the source parameter cannot be null. But, again, it looks like the error is due to the CountDigits extension method passing a null argument.
The answer to the question of what to do with null parameters to extension methods is simple: check them and throw ArgumentNullException.
public static int CountDigits(this string input)
{
if (input == null)
{
throw new ArgumentNullException("input", "Value cannot be null.");
}
return input.Count(char.IsDigit);
}
Now you know exactly why the error occured: the code that called CountDigits passed a null parameter.
The above is how the Framework methods handle null parameters. It’s considered a best practice. More importantly, this is the only way to know for sure where the real error lies.
By Jim, on May 15th, 2013 The two intersections near my office have traffic lights that are unlike any others I’ve seen, and I fear that they are confusing to drivers and therefore dangerous.
On the north and south sides of the intersection there are three lanes: a left turn lane, a through lane, and a right turn lane. On the east and west sides, there are four lanes (two through lanes). All four sides of the intersection have the normal red/yellow/green through lights and a left turn light.
A traditional traffic light is green, turns yellow for a short period, and then turns red. They have been this way for decades–certainly for the 40+ years I’ve been aware of the way that traffic lights work. Left turn lights work much the same way. The left turn arrow might work on a different schedule than the through lights, but it traditionally follows the green to yellow to red pattern. Or, it might have only green and off, meaning that you have to yield to oncoming traffic when the light is off, but the oncoming traffic has a red light when the left turn light is green.
The left turn lights at this intersection are different. And confusing. They blink yellow to mean, “You may turn after yielding to oncoming traffic.” After the blinking yellow period, they turn green. So drivers have to know that solid yellow means “red is soon to come,” but blinking yellow means, “green is soon to come.”
It doesn’t help, of course, that years of experience have taught drivers that in practice a yellow left turn signal means, “Hurry up through the intersection before the light turns red.”
Every day I see drivers not fully understanding what that blinking yellow means. Some, when they see the blinking yellow, dash through the intersection heedless of oncoming traffic. I can only conclude that they don’t understand that the rules have changed and a light might start blinking yellow. Other drivers see the blinking yellow and hesitate when it’s clear that there is no oncoming traffic and they’re fully within their rights to proceed.
I realize that blinking yellow means “proceed with caution.” But traditionally that’s done with all of the lanes at an intersection (going the same way, of course). But drivers aren’t accustomed to seeing blinking yellow on one light, and solid red or green on the other lights. And they’re certainly not prepared for a blinking yellow light to suddenly go green.
I’ll grant that I’ve only seen two crashes (after the fact–I wasn’t present for the actual impact) at these intersections in the three months I’ve been here. But I’ve seen too many close calls to believe that this unorthodox use of the signal lights is a good idea. I can only hope that those in charge of the traffic lights will see this dangerous situation for what it is and reprogram the lights back to something more reasonable.
By Jim, on May 13th, 2013 Suppose you wanted a function that would replace multiple instances of a character with a single instance. So the string “+++This+is+++++a++test++” would become “+This+is+a+test+”. The easiest way to do it in C# is to employ regular expressions.
string testo = "+++This+is+++++a++test++";
string result = Regex.Replace(testo, @"\++", "+");
The leading slash is there because '+' is a special character in regular expressions. The slash “escapes” the character so that it will be treated as a literal.
By the way, you can’t do this reliably with String.Replace. If you wrote:
string result = testo.Replace("++", "+");
You would end up with “++This+is+++a+test+”. String.Replace makes a replacement and moves on, so it will miss when the replacement string plus the following characters is equal to the search string.
Curious, I thought I’d write my own function to remove duplicate characters and see how it fares against the regular expression. The result is this RemoveDupes method.
public static string RemoveDupes(string text, char dupeChar)
{
if (string.IsNullOrEmpty(text))
{
return text;
}
var result = new StringBuilder(text.Length);
bool dupe = false;
for (int x = 0; x < text.Length; ++x)
{
char c = text[x];
if (c == dupeChar)
{
if (!dupe)
{
result.Append(c);
dupe = true;
}
}
else
{
result.Append(c);
dupe = false;
}
}
return result.ToString();
}
That's the most straightforward way I could think of to write it, so I wouldn't be surprised if there are some optimizations that would make it faster.
I created the regular expression like this:
var re = new Regex(@"\++", RegexOptions.Compiled);
And in the timed loop, called it with:
result = re.Replace(testString, "+");
That should make the regular expression as fast as it can be.
Performance testing was a little bit surprising. The simple RemoveDupes method was 5 to 8 times as fast as the regular expression version, depending on the length of the strings and the percentage of repeated '+' characters in the string.
That the custom code was faster didn't particularly surprise me; I've proven many times that a custom parser can outperform regular expressions. Two or three times as fast is what I expected. Up to eight times as fast came as a surprise.
One million iterations with a string 100 bytes long took the regular expression version 4,767 milliseconds, or 0.004768 ms per call. The same string run through the RemoveDupes method took 793 ms, or 0.000793 ms per call. The custom method is six times as fast, but the absolute difference for those million iterations is only four seconds.
Granted, the absolute differences get larger as the number of iterations or the length of the string increases. The timings (and the differences) increase linearly. If it takes RemoveDupes some N seconds to complete, it will take the regular expression approximately 6*N seconds.
But does it really matter? Most likely not. In most cases, sanitizing the input data takes up a very small percentage of the total CPU time. Optimizing it just doesn't yield meaningful performance gains. Unless removing duplicated characters from strings is the most time consuming part of the program, this huge boost in performance would mean nothing.
Would you really notice a forty second difference in the running time of a program that takes an hour?
The other practical matter has to do with time to implement. The regular expression code already works. It took some (small) amount of time to write and test my RemoveDupes method, and the result is a very narrowly-targeted solution: it can remove duplicated characters from a string. Even the seemingly simple modification of letting it de-dupe 2-character strings would be a much more involved job, especially when you take into account the weird edge cases inherent in Unicode. Writing and debugging that would be difficult.
I think I'll stick with regular expressions for this kind of job, even though they're slower than the custom parser. Unless I really need the speed.
By Jim, on May 8th, 2013 In reading about C# performance optimization, I often encounter the conventional wisdom that jagged arrays are faster than rectangular arrays. That is, given these two arrays,
byte[,] rectArray = new byte[NumRows, NumCols];
byte[][] jaggedArray = new byte[NumRows][];
The assertion is that accessing jaggedArray will be faster.
I’ve rarely seen any kind of benchmark to back up that assertion, and of the benchmarks I have seen many are fundamentally flawed. I’m exploring a project that will make heavy use of arrays, so I thought I’d look into the relative performance of both array types.
Before I do that, though, I should mention that often you don’t have a choice of what type of array to use. If your job is to process a two-dimensional array of bytes that somebody passes to your program, you’re most likely better off processing it in the format that it was given to you. Converting the data to your preferred “faster” format and converting it back to the native format is going to take a lot of time: probably more time than you’ll save with your “faster” processing code.
Even if you’re writing new code, the best array representation is almost always the one that best fits your model. A rectangular array (i.e. byte[,]) usually leads to simpler code, but jagged arrays (i.e. byte[][]) can be better in many situations. Jagged arrays are almost always better when you need an “array of arrays.” That is, when the second dimension is variable.
That said, let’s benchmark some simple operations with the two different types of arrays. The platform I’m using is:
- Windows Server 2008 R2
- Visual Studio 2012, Update 1
- Intel Xeon X5350 @ 2.67 GHz
- Program compiled in Release mode with “Any CPU”, targeting .NET 4.5
- Run in the 64-bit runtime with debugger detached
I have a simple test harness that runs the test method once to eliminate JIT effects (the time taken for the JIT compiler to compile the method), and then starts a Stopwatch and executes the test method the number of times requested. The reported time is the average of all those runs. Here’s the test harness:
private void Test(Action a, string text, int numIterations)
{
Console.Write("{0} ... ", text);
// do it once to eliminate JIT effects
a();
// Now time the iterations.
var sw = Stopwatch.StartNew();
for (var i = 0; i < numIterations; ++i)
{
a();
}
sw.Stop();
Console.WriteLine("{0:N0} ms", sw.ElapsedMilliseconds/numIterations);
}
In these tests, I’m using arrays that have 2,736 rows and 10,944 columns. This corresponds to the number of pixels in a 10 megapixel image, with each pixel taking up 3 bytes. The timings I report below are the average of running each method 1,000 times.
The first test is just a simple traversal of the array that sums all the values. I first wrote it like this:
private int SimpleSum;
private void SimpleSumRect(byte[,] array, int rows, int cols)
{
SimpleSum = 0;
for (var i = 0; i < rows; ++i)
{
for (var j = 0; j < cols; ++j)
{
SimpleSum += array[i, j];
}
}
}
private void SimpleSumJagged(byte[][] array, int rows, int cols)
{
SimpleSum = 0;
for (var i = 0; i < rows; ++i)
{
for (var j = 0; j < cols; ++j)
{
SimpleSum += array[i][j];
}
}
}
After initializing the arrays, I run the tests like this:
Test(() => SimpleSumRect(rectArray, ArrayRows, ArrayCols), "Simple Sum rect", NumIterations);
Test(() => SimpleSumJagged(jaggedArray, ArrayRows, ArrayCols), "Simple Sum jagged", NumIterations);
The result was rather interesting:
Simple Sum rect ... 70 ms
Simple sum jagged ... 70 ms
No difference? I wasn’t really surprised by that, and almost dismissed the conventional wisdom as so much myth. But I was also curious because I have great respect for some of the people making the “jagged is faster” claim. So I modified the test slightly. Rather than summing to a variable at class scope, I made it a local variable, like this:
private int SimpleSumRect(byte[,] array, int rows, int cols)
{
int SimpleSum = 0;
for (var i = 0; i < rows; ++i)
{
for (var j = 0; j < cols; ++j)
{
SimpleSum += array[i, j];
}
}
return SimpleSum;
}
That result was surprising:
Simple Sum rect ... 39 ms
Simple sum jagged ... 24 ms
I had no idea that accessing a class-scope variable was so much more expensive than accessing a local variable. I’d have to look at the generated code (both IL and the JIT-generated machine code) to say for sure what’s going on here, but apparently accessing that class-scope variable prevents the compiler from making some optimizations. Making the sum a local variable (that the compiler probably keeps in a register) shows the real difference in the arrays’ performance. I thought the jagged array would be a little faster, but I certainly didn’t expect that much of a difference.
It’s pretty clear here that traversing a jagged array sequentially is hugely faster than traversing an equivalent rectangular array. The primary reason, it turns out, is array bounds checking. When accessing array[i, j], the runtime has to check the bounds of both indexes. You would think that it would have to do the same thing for the jagged array, but the compiler optimizes the code to something akin to this:
private int SimpleSumJagged2(byte[][] array, int rows, int cols)
{
var SimpleSum = 0;
for (var i = 0; i < rows; ++i)
{
var theRow = array[i];
for (var j = 0; j < cols; ++j)
{
SimpleSum += theRow[j];
}
}
return SimpleSum;
}
So the inner loop only has to do one bounds check per iteration.
In addition, a simple array (i.e. byte[]) is treated differently. It is assumed to be 0-based, so that indexing is a matter of arrayBase + (index * size). The jagged array is a simple array of simple arrays, making indexing faster. See this Stack Overflow answer for a little more detail.
Rectangular arrays, on the other hand, can have non-zero bases. Not in C#, but the C# compiler generates code to be used by the runtime Array class implementation. Indexing into one of these arrays requires a little more math: arrayBase + ((i - dim1Base) * (size * cols)) + ((j - dim2Base) * size) (where dim1Base and dim2Base are the base indexes for each dimension). The compiler can optimize some of that, but it should be obvious that it can’t eliminate all of the extra work.
It’s important not to dismiss the first test as irrelevant. Whereas the second test shows that jagged arrays can be faster than rectangular arrays, the first test shows that other factors can easily defeat any potential gain realized by using the “more efficient” data structure. In this case, simply accessing a class-scope variable in the inner loop made the compiler optimizations meaningless.
I ran another test that modified each byte in the array. The idea here was to eliminate as much as possible all external factors. Here are the test methods.
private void SimpleTraversalRect(byte[,] array, int rows, int cols)
{
for (var i = 0; i < rows; ++i)
{
for (var j = 0; j < cols; ++j)
{
array[i, j] = (byte)(255 - array[i, j]);
}
}
}
private void SimpleTraversalJagged(byte[][] array, int rows, int cols)
{
for (var i = 0; i < rows; ++i)
{
for (var j = 0; j < cols; ++j)
{
array[i][j] = (byte)(255 - array[i][j]);
}
}
}
The timings are pretty much what you expect: 40 ms for the rectangular array and 24 ms for the jagged array.
Then I thought I’d do a little optimization myself by venturing into the world of unsafe code and pointers. The result looks more like C than C#, but it does show that you can optimize access to a rectangular array.
private void OptimizedTraversalRect(byte[,] array, int rows, int cols)
{
long count = (((long)rows) * cols)/3;
unsafe
{
fixed (byte* pArray = array)
{
byte* p = pArray;
while (count-- > 0)
{
*p++ = (byte)(255 - *p);
*p++ = (byte)(255 - *p);
*p++ = (byte)(255 - *p);
}
}
}
}
private void OptimizedTraversalJagged(byte[][] array, int rows, int cols)
{
for (var i = 0; i < rows; ++i)
{
unsafe
{
int count = cols/3;
fixed (byte* pArray = array[i])
{
byte* p = pArray;
while (count-- > 0)
{
*p++ = (byte)(255 - *p);
*p++ = (byte)(255 - *p);
*p++ = (byte)(255 - *p);
}
}
}
}
}
The timings here are exactly what I expected: 24 ms for the rectangular array and 24 ms for the jagged array. Actually, the rectangular array version is slightly faster, on average, because it doesn’t have to do as much work. Interestingly, if I don’t unroll the inner loop (i.e. only do one operation per iteration), both versions take longer than the original, un-optimized rectangular array version. The compiler is smarter than you think. I was able to outperform the compiler here only because I know that the array’s width is a multiple of three, and it was safe to unroll the loop.
Note that there’s a reason that code is considered unsafe: using the pointer, I circumvent any array bounds checking. Between that and eliminating the indexing calculations, I made traversing the rectangular array a simple sequential walk through memory.
So jagged arrays are faster, right? Well … maybe. What happens if you access them (semi-) randomly?
private const int Increment = 1009;
private void RandomRect(byte[,] array, int rows, int cols)
{
int count = rows*cols;
int row = 0;
int col = 0;
while (count-- > 0)
{
row = (row + Increment) % rows;
col = (col + Increment) % cols;
array[row, col] = (byte)(255 - array[row, col]);
}
}
private void RandomJagged(byte[][] array, int rows, int cols)
{
int count = rows * cols;
int row = 0;
int col = 0;
while (count-- > 0)
{
row = (row + Increment) % rows;
col = (col + Increment) % cols;
array[row][col] = (byte)(255 - array[row][col]);
}
}
The timings here are telling: 310 ms for the rectangular array and 412 ms for the jagged array. The reason for the huge difference has to do with how the two array types are represented in memory. The rectangular array is one big block of memory. The jagged array is one block of memory for an array of row references, and a separate block of memory for each row. Accessing an arbitrary element in a jagged array, then, requires two separate lookups in memory that can be scattered all over. The result is approximately twice as many cache misses for the jagged array when compared to the rectangular array.
I understand, of course, that my “random” traversal is hardly random. It does, however, provide scattered access to the array elements in a manner that simulates what truly random access would do. I actually wrote a version that used a random number generator, but it took much longer to run and gave almost the same behavior as this simpler version.
So the answer to the question of which type of array is faster is, as you’ve probably figured out by now, “It depends.”
If you can guarantee that external factors (like accessing a class-scope variable) won’t negate the compiler’s optimizations, then using a jagged array can be 40% faster than using an equivalent rectangular array. But if you’re accessing the array randomly, a jagged array can be 33% slower than a rectangular array. The question of which is faster is meaningless without context.
The machine I ran these tests on is pretty old. I’ve also run the tests on a Core i7 processor and obtained similar results. The absolute times were faster (the machine is running at 3.4 GHz), but the percentage difference in the speed of the two array types are similar.
Non-speed considerations
“Efficiency” means more than just raw speed. When discussing performance, we often find that speed and memory efficiency are inversely related. That is, doing something faster requires using more memory. That’s not always true, but it often is. If you’re concerned about efficiency you need to know not just how fast things are, but also how much memory they take and how that memory is allocated.
In addition to having different performance characteristics, jagged arrays and rectangular arrays are stored differently in memory. A rectangular array consists of a single allocation of size rows * cols * element-size, plus about 50 bytes of overhead for array metadata. A jagged array requires somewhat more memory. It consists of an allocation of size rows * sizeof(IntPtr) (plus metadata), and there is a separate allocation for each row, of size cols * element-size. So consider the array I showed above, which is 2,736 x 10,944. The difference in this large array is about 150 kilobytes, or less than one percent in this 30 megabyte array. Imagine, though, if the rows were only 50 bytes long. The jagged array would consume twice the memory of the rectangular array because the per-row overhead of 50 bytes still applies.
There are advantages to that distributed array representation, however. Prior to .NET 4.5, a rectangular array couldn’t occupy more than two gigabytes (slightly less, in reality). So an array of bytes with a million rows (actually, 2^20 rows) and 2,048 columns couldn’t be created. That is, byte[10^20, 2048] would produce OutOfMemoryException, regardless of how much memory you have on your machine.
But you could easily create a short[][] with 10^20 rows and 2,048 columns, because each row would be a paltry two kilobytes in size.
.NET 4.5 introduced the gcAllowVeryLargeObjects configuration file element that lets you allocate objects larger than 2 gigabytes, which alleviates some of the problems, but there are still restrictions as pointed out in the documentation. And it’s often impossible to get a huge chunk of contiguous memory, which the rectangular array needs. So there are still benefits to the jagged array.
So, which to use? Whatever best fits your model. I prefer rectangular arrays in most cases because I find them easier to work with (try initializing a jagged array) and easier to reason about. I also know that I can optimize my array handling code if I need to, using unsafe code and pointers. But every situation is different, and I’ll happily use a jagged array if the situation calls for it.
Only you can decide which array representation is appropriate for your application. You have to weigh the costs and benefits and pick the one that best meets your needs. Don’t immediately fall for the simplistic “jagged is faster” hype that’s all too often spouted as a matter of faith without understanding the wider implications.
By Jim, on May 7th, 2013 Last time, I showed how compiler optimization can cause problems with polling for cancellation, and discussed a few alternatives to solving the problem. At the end of the article I mentioned that my simple program is the uncommon case. Very often (perhaps most often?), your program’s background processing is done by a method in a completely different class, and it’s common for there to be multiple background tasks running. It’s likely that you want a single method call or variable setting to signal cancellation for all of the background tasks.
Before I get to that, I want to address a few comments I received on Facebook in response to the last post.
Ron asks why this isn’t considered a compiler bug. On the face of it, it does look like the compiler is being overly aggressive. However, the compiler doesn’t know that the loop is executing in a separate thread. The compiler doesn’t understand that ThreadPool.QueueUserWorkItem starts a new background thread and that it should therefore widen its scope when examining the cancelRequested variable. Expecting the compiler to understand that is unreasonable, and forcing the compiler to assume that a local variable is accessible by multiple threads would eliminate a large number of perfectly valid optimizations. The compiler is behaving correctly here.
Readers Eric and Pete advocate moving the variable to class scope and adding the volatile qualifier. I agree that that’s the easiest solution and that it would work. I’d probably even use it if I were writing a simple program like the one in the last post. But I consider it a hack. Whereas Pete is right that an anonymous function closing over local variables muddies the concept of local variables, promoting the variable to class scope completely destroys any concept of information hiding. More importantly, there are the problems with volatile that I pointed out in the article. Those problems are irrelevant in this simple case, but I wouldn’t want to deal with them in a production program.
More importantly, as you’ll see below, volatile isn’t a good general solution.
Let’s move on from that toy demonstration program to a more likely real world design.
In the program below, the code that’s executed in the background thread is in a separate class that has no knowledge of or access to the code that calls it. It can’t access a public cancelRequested flag, so the flag is passed by reference.
public class Program
{
private static void Main(string[] args)
{
var p = new Program();
p.DoStuff();
Console.WriteLine("Press Enter to end program.");
Console.ReadLine();
}
private void DoStuff()
{
bool cancelRequested = false;
// Start tasks
ThreadPool.QueueUserWorkItem((s) =>
{
var task1 = new Task1Class();
task1.DoTask1(ref cancelRequested);
});
Console.WriteLine("Press Enter to stop thread.");
Console.ReadLine();
// set cancel flag
cancelRequested = true;
Console.WriteLine("Cancel requested!");
}
}
public class Task1Class
{
public void DoTask1(ref bool cancelRequested)
{
Console.WriteLine("Task1 started.");
while (!cancelRequested)
{
// do stuff
}
Console.WriteLine("Task1 exit.");
}
}
If you run that in the debugger, it will work as expected. But run it without the debugger attached and the thread never exits! Once again, the compiler has optimized things.
Moving the cancelRequested flag to the outer scope doesn’t help, nor does moving it to the outer scope and marking it as volatile. In fact, adding volatile results in this compiler warning:
“a reference to a volatile field will not be treated as volatile.”
As with the example in my previous post, doing certain things in the loop will prevent the compiler from applying the optimization that reveals this programming bug. But do you really want to count on knowing what those rules are? For every platform your code will be running on, in all versions past, present, and future? I didn’t think so.
Passing the cancel flag by reference is a latent bug. Don’t do it.
I have to admit here that I gave some bad advice to my friend who first asked me about this. I was convinced that the compiler wouldn’t perform that optimization on a reference parameter. I was wrong, as this little example shows. I still don’t fully understand the optimization rules, but I’ve come to the conclusion that the compiler assumes single-threaded code in most situations. That makes sense, actually. The compiler would have to avoid a whole host of optimizations if it couldn’t assume single-threaded code.
If you’re writing multi-threaded code, you need to use synchronization primitives in order to control things. Prior to .NET 4.0, a common technique was to use a ManualResetEvent, like this:
private void DoStuff()
{
ManualResetEvent cancelRequested = new ManualResetEvent(false); // <----
// Start tasks
ThreadPool.QueueUserWorkItem((s) =>
{
var task1 = new Task1Class();
task1.DoTask1(cancelRequested); // <----
});
Console.WriteLine("Press Enter to stop thread.");
Console.ReadLine();
// set cancel flag
cancelRequested.Set();
Console.WriteLine("Cancel requested!");
}
public class Task1Class
{
public void DoTask1(ManualResetEvent cancelRequested) // <----
{
Console.WriteLine("Task1 started.");
while (!cancelRequested.WaitOne(0)) // <----
{
// do stuff
}
Console.WriteLine("Task1 exit.");
}
}
The lines with “// <----” comments are the only ones that change.
You can still use that technique in .NET 4.0 or later. You can get slightly better performance using ManualResetEventSlim. I see only two problems with using the events this way.
First, the expression !cancelRequested.WaitOne(0) isn’t exactly clear. It’s not as bad as the Interlocked.CompareExchange example from my last post, but it’s definitely not as clear as !cancelRequested.
More importantly, any thread can call cancelRequested.Set to cancel the entire operation. Granted, background tasks shouldn’t do that, but they could. It would be better if we could pass a read-only token of some sort.
.NET 4.0 formalized the concept of Cancellation, implemented with the CancellationTokenSource and CancellationToken classes. Using these classes, you can implement cancellation in a standard, safe, and reliable manner.
The basic idea is that the main program creates a CancellationTokenSource instance, and passes the Token member (a CancellationToken) to the methods that want to check for cancellation. I’ll let you read up on the details.
Here’s an example.
private void DoStuff()
{
CancellationTokenSource cancel = new CancellationTokenSource();
// Start tasks
ThreadPool.QueueUserWorkItem((s) =>
{
var task1 = new Task1Class();
task1.DoTask1(cancel.Token);
});
Console.WriteLine("Press Enter to stop thread.");
Console.ReadLine();
// set cancel flag
cancel.Cancel();
Console.WriteLine("Cancel requested!");
}
public class Task1Class
{
public void DoTask1(CancellationToken cancelToken)
{
Console.WriteLine("Task1 started.");
while (!cancelToken.IsCancellationRequested)
{
// do stuff
}
Console.WriteLine("Task1 exit.");
}
}
That looks a lot cleaner to me. I can tell just by looking at the code exactly what is going on. Furthermore, there is no way that one of the tasks can inadvertently cancel the entire process because there’s no way for them to set the cancel flag. They can only query it.
Another advantage of this method (and the one that uses events) over the reference flag (which doesn’t work, but even if it did) is that you can pass the CancellationToken to a class’s constructor so that the class can cache it. That way, it doesn’t have to pass the cancellation token to every internal method that might need it. For example:
private void DoStuff()
{
CancellationTokenSource cancel = new CancellationTokenSource();
// Start tasks
ThreadPool.QueueUserWorkItem((s) =>
{
var task1 = new Task1Class(cancel.Token);
task1.DoTask1();
});
Console.WriteLine("Press Enter to stop thread.");
Console.ReadLine();
// set cancel flag
cancel.Cancel();
Console.WriteLine("Cancel requested!");
}
public class Task1Class
{
private readonly CancellationToken _cancelToken;
public Task1Class(CancellationToken cancelToken)
{
_cancelToken = cancelToken;
}
public void DoTask1()
{
Console.WriteLine("Task1 started.");
DoSubtask1();
if (!_cancelToken.IsCancellationRequested)
{
DoSubtask2();
}
Console.WriteLine("Task1 exit.");
}
private void DoSubtask1()
{
while (!_cancelToken.IsCancellationRequested)
{
// do stuff
}
}
private void DoSubtask2()
{
while (!_cancelToken.IsCancellationRequested)
{
// do stuff
}
}
}
You can’t store a reference in C# (you can store an instance of a reference type, but that’s a different thing), so there wouldn’t be any way to do this using a simple Boolean flag. You can’t write, for example:
private ref bool cancelRequested; // this doesn't compile!
The most important thing I learned from this investigation is:
If you’re writing multi-threaded code, use synchronization primitives for all inter-thread communication.
Furthermore, I’ve shown that volatile is not a synchronization primitive.
If you’re writing for .NET 4.0 or later, then take advantage of Cancellation. Otherwise, use a ManualResetEvent or ManualResetEventSlim to approximate that functionality.
By Jim, on April 30th, 2013 Imagine you start a thread to perform some long-running task that you want the ability to cancel. For example, you might start that thread while your main program is waiting on user input, and then cancel that thread once the user has pressed Enter. It’s an easy thing to code up.
private static void Main(string[] args)
{
bool cancelRequested = false;
Console.WriteLine("Starting thread.");
ThreadPool.QueueUserWorkItem((s) =>
{
Console.WriteLine("Thread started.");
int counter = 0;
while (!cancelRequested)
{
// do some work here
++counter;
}
Console.WriteLine("Thread stopped: {0}", counter);
});
Console.WriteLine("Press Enter to stop thread.");
Console.ReadLine();
cancelRequested = true;
Console.WriteLine("Press Enter again to exit the program: ");
Console.ReadLine();
}
If you run that program under the debugger, you’ll find that it works just fine in Debug and in Release mode. When you press Enter, you’ll see the “Thread stopped” message as expected. However, if you run without the debugger attached, you’ll find that the thread never terminates! What’s going on?
Simply put, compiler optimization. The JIT compiler sees that nothing in the loop can modify the cancelRequested variable, so it generates code to load the value and keep it in a register. The thread never references that variable again.
Interestingly, placing a Thread.Sleep(0) in the loop disables that optimization, as does adding a lock statement. Other function calls inside the loop might also prevent that optimization, as will sufficiently complex calculations. If the exact rules for what prevents that optimization are written anywhere, I’ve been unable to find them. The C# Language Specification hints at some things, but I was unable to find any explicit language.
Absent explicit language in the specification, I’m hesitant to rely on the current behavior because it could change in the next version of the compiler or on a different platform.
There are various ways to force the compiler to examine memory at every iteration of the loop, but the options are limited when working with a Boolean. Interlocked.CompareExchange came immediately to mind, except that there isn’t an overload that takes a ref bool. I could make the variable an int, and write:
while (Interlocked.CompareExchange(ref cancelRequested, 1, 1) == 0)
Besides being ugly, it’s not at all clear what’s going on there. I’ve written stuff like that and six months later had to come back and figure out what’s happening. If you had to go look up Interlocked.CompareExchange and puzzle out what’s happening, I think you’ll agree that this is a less than ideal solution.
Those of you with some more experience might say, “Make it volatile!” There are some problems with that. First, you can’t mark a local variable with the volatile keyword. I’d have to promote the variable to class scope. That works, sure, but it’s dissatisfying having to move a variable to an outer scope just so I can make it work the way I want. It just feels wrong.
A major problem with volatile is that it applies different semantics to variable access, but does it at the declaration site. I don’t know, when I’m writing code to access a variable, that doing so will involve acquire semantics. It would be like stepping back to the bad old days of Pascal, where passing a reference (var) parameter looked the same as passing a parameter by value. That is, given this procedure:
procedure Frob(var fooby: Integer);
The code to call it doesn’t explicitly state that the parameter is passed by reference:
Frob(foo);
I rather like having to specify ref in C#, and I don’t like the implicit change in memory access semantics that comes along with volatile.
At least Interlocked.CompareExchange tells you that there’s something special about that cancelRequested variable, even if the logic behind what’s happening is confusing.
More importantly, people who know a lot more than I do about C#, .NET, and multithreading in general caution against the use of volatile for several reasons, one of which is that it probably doesn’t work the way you think it works or do what you think it does. See, for example, Eric Lippert’s Atomicity, volatility and immutability are different, part three and Joe Duffy’s Sayonara volatile.
Based on those comments, volatile is not the thing I want to use when trying to write portable and reliable code. I would suggest that you, too, avoid it.
All that said, there are other ways I could solve this problem. But the simple program I showed at the start of this entry is a pretty rare case. Real programs often have multiple background tasks running, all of which are subject to cancellation either individually or as a group, and a simple Boolean variable is not a very good way to control such things.
There is a much more flexible and reliable way, which I’ll talk about in my next post.
By Jim, on April 24th, 2013 Saturday evening, Debra and I were over at a friend’s place. There were eight or ten of us, total, sitting or standing around, just conversing. I was also whittling on a piece of Ash, making another little dog. I don’t know if my attention lapsed (possible) or if the knife just slipped (more likely), but suddenly I was cutting into my left hand.
Surprisingly enough, it didn’t really hurt. I stared at it for a second, surprised, then stood up and grabbed a paper towel to stop the bleeding. Then to the kitchen sink to rinse it, thinking I’d just clean it a bit, bandage it up, and go back to my carving. I’ve had plenty of minor cuts, so I wasn’t terribly worried.
I only needed a brief look at the laceration while I was running it under the faucet to know that this one would need stitches. My friend handed me a wash cloth, we secured it with duct tape, and Debra drove me to the emergency room.
The second surprise (the first was that I cut myself so badly) was that the emergency room was not full at 9:00 on a Saturday night. But then, Round Rock is kind of a sleepy place. I walked up to the check-in desk. The triage nurse was safely ensconced behind a window.
Nurse: “How can I help you?”
Me (holding up left hand): “I cut myself rather badly.”
Nurse: “The duct tape didn’t fix it?”
Me (laughing): “No, but it’s holding things together for now.”
Less than a minute later, I was sitting in an exam room with a RN removing an impromptu bandage while a nursing student looked on. When she got it uncovered she said, “That’s not so bad, but recovery is gonna suck.” I had what is (apparently commonly) called a suturable laceration. It was still bleeding a bit, so she had me apply a little pressure while we waited for the nurse practitioner to come along and stitch me up.
And waited. … And waited.
I’m not sure what he used to deaden the area, but it sure hurt like heck going in. I couldn’t feel the needle going in, but it sure hurt when he pushed the contents into my hand. But within a minute my thumb felt like it’d been shot up by the dentist, and I couldn’t feel a thing when he started stitching.
The cut, by the way, is about 7 centimeters (2-3/4 inches) long, running from the base of my thumb to the center of my wrist, and almost a centimeter deep. If you’re morbidly curious, you can see the result.
Then we got to wait a even longer. Debra and I got some amusement out of all the forms they brought in, and I laughed out loud when I had to sign a document saying that I received a copy of the hospital’s no smoking policy. Oh, and they gave me a 10 mg Norco tablet and told me not to drive, and a prescription for antibiotics and more Norco. I filled the antibiotic prescription but decided against the Norco. It didn’t hurt that bad.
All in all it took about 45 minutes from the time we walked in the door until my hand was stitched up. It took another 45 minutes or an hour before they finally brought the bill, took my payment, and then came and put on the bandage before I could walk out the door. Gotta love bureaucracy.
And all because I wasn’t wearing my carving glove. The glove is Kevlar, and probably wouldn’t have completely prevented the cut, but it most likely would have made the difference between a surface scratch and a trip to the emergency room.
I’d been lax, not wearing the glove because I thought I was good enough not to need it. Only takes one slip of the blade, though, to ruin your day. I’ll be wearing the glove from now on.
By Jim, on April 24th, 2013 Many Stackoverflow questions amount to nothing more than, “Which of these is faster?” or “Why is this so slow?” Very often, the person posting the question will include code that outputs some timing results. More often than not, the timing code is just wrong. Sometimes fatally so.
By the way, most “Which is faster” questions are non-starters. You shouldn’t ask before you’ve done the benchmark yourself. Read on.
If you want accurate timing information, you have to do it the right way. If you do it the wrong way, your results will be at best unreliable. Fortunately, doing things the right way is pretty easy.
But before I go any further, I feel obligated to ask the most important question: Why do you care? Before you start adding code to profile small parts of your application, you need to ask yourself if it’s really that important. Or are you just wasting your time worrying about the performance of something that is already “fast enough?” I strongly suggest that you read Eric Lippert’s, Which is faster?
Now, if you really need to time your code. . .
The information below is for .NET, and the code samples are in C#. You should follow the same techniques if you’re writing in Visual Basic .NET or Managed C++, and probably any other .NET language. The general techniques are true of pretty much any other platform as well, although the specifics will be much different.
When doing benchmarks, run them against a release build with no debugger attached. In Visual Studio, be sure to select the Release build, and either press Ctrl+F5 to run, or select Debug -> Start Without Debugging. Timings gathered from a debug build or from having the debugger attached (even in a Release build) will be inaccurate. You’ll often find that there is little difference between Debug and Release builds when the debugger is attached.
Do not use DateTime.Now, DateTime.UtcNow, or anything else based on DateTime as your time source. There are two reasons for this. First, the resolution of DateTime is somewhat questionable, but probably not much better than 15 milliseconds. Secondly, the clock can change on you. I once had a bit of code take -54 minutes (yes, it completed 54 minutes before it started) to execute because it started four minutes before 2:00 AM on “Fall back” day (Daylight Saving Time autumn adjustment date) and finished six minutes later. So start time was 01:56:00 and end time was 01:02:00. You don’t control the system clock, so don’t depend on it.
Use the System.Diagnostics.Stopwatch class for all elapsed time measurements. That’s based on the Windows high performance timer, which is the most accurate interval timer you’re likely to get on a Windows box unless you have special hardware. Accept no substitutes.
With Stopwatch, timings are easy. First, add this using statement to the top of your source file:
using System.Diagnostics;
Then, surround the code you want to time with a Stopwatch:
Stopwatch sw = Stopwatch.StartNew(); // creates and starts the stopwatch
//
// put the code you want to time here
//
sw.Stop(); // stop the watch.
Console.WriteLine("Elapsed time = {0} milliseconds", sw.ElapsedMilliseconds);
That’s all there is to it!
I usually like to report times in milliseconds (1/1000 second), simply because the usable range is reasonable. If I’m timing something that takes less than a millisecond to run, I need to do it many times to get a good average. A million iterations of any non-trivial loop will likely take longer than a few milliseconds. On the same note, a full day is 86,400 seconds or 86,400,000 milliseconds: a number that is still reasonably easy to work with. If I’m timing something that takes longer than a full day to run, I might decide to use seconds as my comparison.
One other nice thing about Stopwatch is that you can have many of them, each timing a discrete part of your program. For example:
Stopwatch totalTime = Stopwatch.StartNew();
Stopwatch pass1Time = Stopwatch.StartNew();
DoPass1();
pass1Time.Stop();
Stopwatch pass2Time = Stopwatch.StartNew();
DoPass2();
pass2Time.Stop();
totalTime.Stop();
You then have separate values for the total time and the time to execute each pass.
If you’ve done all of that and you still haven’t found an answer, then post a question. But include your timing code so that those of us who are willing to help you find an answer aren’t groping around in the dark. And be sure to mention that you ran in Release mode without the debugger attached.
By Jim, on April 17th, 2013 I’ve seen some pretty crazy code over the years, and this snippet that’s supposed to strip the extension from a Windows file name ranks right up there with the best of them.
string GetFileNameWithoutExtension(string filename)
{
string result = filename;
char[] charArray = filename.ToCharArray();
Array.Reverse(charArray);
string s1 = new string(charArray);
string[] ext = s1.Split(new char[] {'.'}, 2);
if (ext.Length > 1)
{
char[] charArray2 = ext[1].ToCharArray();
Array.Reverse(charArray2);
result = new string(charArray2);
}
return result;
}
Perhaps the most important thing to understand about this code is that it’s totally unnecessary. There is a method, Path.GetFileNameWithoutExtension, which has been in the .NET Framework since version 1.1, that does exactly what the name implies.
But assuming the author didn’t know about that, why would he reverse, split, fold, spindle, and mutilate the string just to get the part after the last period?
When I was a kid, I was fascinated by that expression, “fold, spindle, or mutilate.” I knew what fold and mutilate were, but I just couldn’t figure out spindle. One day I was at the dry cleaners with my mom and a new employee was learning how to work the cash register. The experienced employee said, “…and then you put the ticket on the spindle,” and pushed the ticket onto the sharp pointy thing there on the counter. I still remember the feeling of the light going on in my head.
Anyway, back to the topic at hand.
This is the wonkiest code I’ve ever seen to do such a simple job. In a nutshell, it turns a filename, say “c:\folder\subdir\filename.ext” into “txe.emanelif\ridbus\redlof\:c”. Then, the call to string.Split splits it on the dot, giving two strings:
"txe"
"emanelif\ridbus\redlof\:c"
The code then takes the second of the two strings, reverses it, and that’s the result.
Convoluted in the extreme. Idiotically clever.
I guess the author had never heard of String.LastIndexOf, either. With that method, you can duplicate the code above in two lines:
int dotPos = fileName.LastIndexOf('.');
result = (dotPos == -1) ? fileName : fileName.SubString(0, dotPos);
Yep. That’s all his code does.
By the way, there’s a bug in his code and in mine. Given a file name like “c:\folder\subdir.001\filename_without_extension”, both will give a result of “c:\folder\subdir”.
To do it right, you have to take the slashes into account, too:
string GetFileNameWithoutExtension(string filename)
{
int dotPos = filename.LastIndexOf('.');
// if there's no dot, there's no extension
if (dotPos == -1)
{
return filename;
}
int slashPos = filename.LastIndexOf('\\');
// if the last slash is after the last dot, then there's no extension
if (slashPos > dotPos)
{
return filename;
}
return filename.Substring(0, dotPos);
}
I sure hope the guy who wrote that original code isn’t being paid to write computer programs. I pity his poor clients, living with those ticking time bombs in their code. Frightening.
Or perhaps I should hope for more people writing such code. I’ve made a lot of money over the years fixing crazy stuff like that.
By Jim, on April 12th, 2013 I suffer from the old programmer’s “anti-bloat” mentality: get rid of things that are unnecessary. After all, the file that’s never used is just taking up precious hard drive space, and is potentially confusing. There’s no need to keep it, having it clutter up the project directories and revision control system, and sitting there unused in the distribution directory. It’s just good sense to delete it. Right?
I used to complain, when creating a new C# program, that Visual Studio would barf all over my hard drive, creating a bunch of files and directories that I just didn’t want. I wanted the tool to create projects my way, not the way some goon at Microsoft decided it should be done. It took me a long time to realize that I’d be a lot better off if I just accepted the defaults. Binaries go in bin\Debug and bin\Release. The obj directory has stuff that I don’t need to care about, etc. It took a while, but after a time I realized that the Visual Studio project structure works. It’s perhaps not what I would have designed, but it works just fine.
And then I created an MVC Web application. And I thought a new Windows Forms app barfed all over my hard drive? Ha! An empty MVC application has so much crap that I’d surprised if any one person could say with certainty what all of it is for.
I created an MVC Web API project a while back. Apparently, most Web API applications also have a browser-based interface. The MVC Wizard creates the site so that browser requests go to /app/controller/action/, and API requests go to /app/api/controller/action. That seems reasonable, but I was told that my application wouldn’t have a browser interface and that all my API requests should go to /app/controller/action. That was easy enough to change, and I finished the project. Then I made a big mistake.
As I said, the MVC Wizard barfs out all kinds of unnecessary stuff: help pages, JavaScript, references to Entity Framework, service providers, and I don’t know what all else. I figured I could save myself a lot of trouble with versioning and deployment if I removed all that stuff. And it worked. My project was lean and mean, easy to deploy and with no extraneous files.
But then I was told to add some browser-accessible pages.
I won’t regale you with all the things I attempted over the day I spent trying to add all that stuff back to my project. I went home very frustrated that evening, wondering how I was going to accomplish the task. In the shower the next morning, I decided that my best course of action was to start over with a new MVC application, let the Wizard barf all over my hard drive again, and then just make the modifications I need, leaving all the extraneous crap that I have no use for. It took me approximately an hour to duplicate my application that way.
When it came time to modify another application that I had “optimized,” I didn’t even try to re-add the stuff I’d deleted. I just created a new MVC application and made my changes. 30 minutes later, I was done.
The Wizard knows best. Let it do its thing. If it creates stuff you think you don’t need, just grumble if it makes you feel better, and then get over it. Do not try to “optimize” by removing from the project those things you think you won’t ever need. If you don’t want them cluttering up your distribution directory, then modify your deployment script so that they’re not copied. But don’t remove them from the project. You’ll be sorry if you do. Trust me.
By Jim, on April 11th, 2013 I encountered the participants of the Ride 2 Recovery Texas Challenge while I was out on a ride last year. I told myself that I’d join the ride this year.
The Challenge started in San Antonio on Monday, and yesterday was the leg from Austin to Ft. Hood. I signed up for the one-day ride, and took the day off from work so that I could participate.
The weather report for Wednesday was for rain, followed by a cold front. It was raining when I got up at 5:30 AM, but it was about 70 degrees outside. I don’t especially like riding in the rain, but I’ll do it if I have to.
When we started the ride at 9:00, it was about 55 degrees and still raining. The cold front had arrived. 55 degrees isn’t bad. 55 and rain is uncomfortable. 55, with rain and wind is miserable. And then the wind picked up and the temperature dropped. By the time we’d been on the road an hour, it was about 40 degrees and the wind was 15 to 20 MPH with higher gusts. I was so cold that I was shivering uncontrollably. I’ve been pretty uncomfortable on the bike before, but that was unbearable. We made a scheduled stop for a bathroom break, and the organizers gave us some extra time to warm up. We were inside where it really was warm, but I was unable to stop shivering. After about 30 minutes, they gave participants the option of continuing with the ride or taking a bus to the finish line. Considering that I couldn’t get warm and the weather hadn’t changed, I elected to cut the ride short. I’ll do a lot for the cause, and goodness knows I like a challenge, but I draw the line at hypothermia.
I can’t think of a time I’ve ever been colder. That was the most miserable bike ride I’ve ever been on, although it did have a few high points. The organizers arranged for us to ride by a number of elementary schools. The kids were all out waving American flags, cheering us on and holding out their hands. We rode by slowly and touched their hands. The kids got a real kick out of it, and many of the veterans I was riding with were touched by the gesture. Nice to see such support from the kids and also from adults who braved the cold and wet, not only near the schools but in many places along the route. We even had people get out of their cars at intersections and cheer us on as we passed.
The weather was beautiful for the Monday and Tuesday legs of the ride, and this morning was a little chilly but would not have been uncomfortable. And the forecast for the next two days is very nice. It just happened that Wednesday, the day I wanted to ride, was an unusual day. Disappointing, but that’s the way it goes. This whole month of April so far has felt more like a typical March. Dang weather, anyway.
Next year will be better, right?
By Jim, on April 7th, 2013 Yesterday was the 9th annual Spokes ‘n Spurs ride, benefiting Spirit Reins Ranch. I’ve participated in this ride for the last seven years as a volunteer, providing event communications with the Williamson County Amateur Radio Club. The ride organizers decided to use a different means of communication this year, so I was free to ride the course.
I chose to ride the 100 kilometer route, and had set an informal goal of finishing in under four hours. I didn’t want to push too hard because I’m a little out of shape. I didn’t ride over the winter, and just started getting serious about it again in March. I figured this would be a good introduction. The weather forecast called for highs around 80, and 10 to 15 MPH winds from the south, increasing to 20 or 25 in the afternoon. I didn’t figure the wind would be too much of a problem, since the ride is primarily east/west.
Whoever laid out the course must have been a fan of M.C. Escher. Or perhaps they had the spirit of Escher lay out the course. That’s the only way they could have created a circular course that has more uphill than down.
Except they did one better. They also somehow managed a constant headwind.
Joking aside, the wind was bad. It was 15 MPH to start, and the temperature was about 55 degrees. I wore my vest and arm warmers. The vest came off at 23 miles and the arm warmers at about 40. Good thing those cycling jerseys have very large pockets.
I did have a new experience on this ride: I was the leader at the very beginning. When we lined up, I was toward the front, and when we took off I somehow managed to be in the lead. For a bout the first mile. The fast people soon went by me, and I was smart enough not to try keeping up with them. That heart rate monitor is a valuable piece of equipment, provided I pay attention to what it’s telling me.
A rather large group came pedaling by at about six miles, lead by members of the Austin Flyers women’s cycling team. I let them pass, then latched on to the back, figuring I could let them pull me along for a while. That lasted until about mile 13, where we made a turn and started up a hill. Those ladies are strong. They left me and quite a few others gasping for breath as we struggled up the hill.
After that I was pretty much on my own. There were always riders nearby, and I’d chat briefly with those who passed me or whom I’d pass, but I didn’t form a group or get into a group with anybody else.
I stopped at mile 20 or so to refill my water bottles, get a few PB&J sandwiches (the benefits of an organized ride!) and peel off the vest. Then it was off on the worst stretch of the ride. From that rest stop, the course does a 17 mile loop, heading south and west to Burnet, then back to the same rest stop. There’s a lot of open space there with relatively few (compared to the rest of the course) hills, and nothing to block the wind. The wind had picked up and was about 20 MPH by then, with higher gusts. I had to work to keep from being blown over by the direct crosswind. Not fun. But I made it back to the rest stop, now at mile 40.
I’d never ridden the course on a bicycle before, but I was familiar with it from riding in the SAG wagons for several years as a volunteer. I knew that there was a series of hills from about mile 50 to about mile 55, with the first being a very short but steep (10 to 12 percent grade) little monster. I’d been dreading that stretch all day, and every time I saw my heart rate monitor reading above 80%, I’d back it off knowing that I’d need the energy later.
There was another rest stop at about mile 47. Its primary purpose, I think, is to be the midpoint rest for the 42 mile riders. I didn’t stop there, but I did notice that it was placed about 50 yards from the Oatmeal Cemetery. Coincidence?
I was right to be concerned. That first hill was tough. There were several people walking their bikes up it as I approached (riders from the 42 mile route, I think). I geared down and began ascending. When I stood up to put a little more power into the pedals, my left leg nearly locked up. I had to be very careful not to push too hard. Fortunately, the hill is short and I managed to ascend it without blowing a gasket. My heart rate did get to 86% of max, though. From there, there are several long climbs, but not nearly as steep. There’s another rest stop at mile 53 or so, but I had plenty of water and food, and I just wanted to finish the ride.
I finished the route, 60.28 miles, in 4 hours and 10 minutes. That’s an overall average of 14.5 MPH. My moving average was 15.1 MPH. I spent a total of 11 minutes at my two rest stops. It’s not my best time, but I’m okay with that. The wind was brutal, and the course did feature over 3,000 feet of climbing. Contrast that with the Waco Wild West Century, which had only 2,400 feet of climbing in 100 miles.
I’ve been impressed with the ride’s organization since the first year I volunteered for it, and I was still impressed as a participant. These people know how to put on an event. The start/finish area is well laid out, the course is well marked, the rest stops are fully stocked with food, water, and friendly people, and there is plenty of SAG support. It’s a small ride–maximum 1,000 people–but it’s really well done. And everybody I talked to on the road had nothing but good things to say about it.
If you’re an Austin area cyclist, you should seriously consider this ride next year. It’s a well organized ride on some of the most scenic roads in the area. Well worth getting up a little early and driving north to Liberty Hill.
By Jim, on March 27th, 2013 Date and time calculations done with local times can produce erroneous results. For example, consider this code, which adds one day to March 9, 2013 at 2:01 AM.
DateTime localStart = new DateTime(2013, 03, 09, 02, 01, 00); // 2:00 AM on 03/09/2013
DateTime localResult = localStart.AddDays(1.0);
Console.WriteLine("Local start date/time = {0}", localStart);
Console.WriteLine("Result of AddDays(1.0) = {0}", localResult);
The result will be March 10, 2013 2:01 AM. That seems right, but in my local time zone it’s incorrect. 2:01 AM never existed on that date. When 2:00 AM rolled around, the clock was zipped forward an hour for the annual Daylight Saving Time “spring ahead” adjustment.
The documentation for DateTime says:
Conversion operations between time zones (such as between UTC and local time, or between one time zone and another) take daylight saving time into account, but arithmetic and comparison operations do not.
The italics are mine.
The same thing happens with DateTimeOffset.
If you want to do date and time calculations, you should convert the values to UTC first. Perform your calculations and then convert the results back to local time. For example, this code produces the correct result:
DateTime utcStart = new DateTime(2013, 03, 09, 02, 01, 00).ToUniversalTime();
DateTime utcResult = utcStart.AddDays(1.0);
Console.WriteLine("UTC start date/time = {0}", utcStart);
Console.WriteLine("Result of AddDays(1.0) = {0}", utcResult);
Console.WriteLine();
Console.WriteLine("UTC result converted to local time = {0}", utcResult.ToLocalTime());
This is just one of the many problems that can occur when working with dates and times. I’ve found through hard experience that if you’re storing dates and times, you should store them as UTC. And when a user inputs a date and time, you should convert it to UTC as soon as possible. All datetime calculations and all datetime persistence should be UTC.
You could, if you wanted to keep track of where a particular data item was collected, persist times as DateTimeOffset, with the proper offset applied. Remember, though, to convert to UTC before doing any calculations.
Whichever you choose, be consistent. Otherwise you’re going to lose information and produce erroneous results.
By Jim, on March 26th, 2013 I’ve been known to say, “Static classes are evil.” That’s overstating my position somewhat, but a static class will definitely raise a red flag in a code review. The plain truth is that in a language like C# there’s seldom a need for a static class, and there are compelling reasons not to use them. They have their uses, but in my experience you should favor class instances over static classes whenever reasonable.
In C#, static class constructors run immediately before the first static method on its class is called, or (in the case of an instance class that has static members) immediately before the first instance of its class is created. Eric Lippert explains this in detail in his article series about static constructors.
It’s important to understand that, in general, you have no control over when the constructor runs. In addition, you can’t pass parameters to the static constructor. The constructor must have all the information it needs in order to fully initialize the class and all of its members. If the constructor can’t fully initialize things, then what you have is equivalent to a partially initialized object at global scope.
Let me illustrate why this is a bad thing.
First, suppose you have this (non-static) class:
public class Foo
{
public Foo() {} // parameterless constructor
public void Initialize(string bar); // sets up the object
// DoSomeWork does something that depends on Initialize being called.
public void DoSomeWork();
}
And you use it in code like this:
private void SomeMethod()
{
var f = new Foo();
Foo.Initialize("xyzzy");
Foo.DoSomeWork();
}
The key here is that if you don’t call Initialize first, then DoSomething is going to fail. This API is somewhat fragile because nothing stops the programmer from forgetting to initialize. In almost all cases, it would be better to have the constructor initialize the object:
public class Foo
{
public Foo(string bar) {} // create and initialize the object
public void DoSomeWork();
}
Now, it’s impossible to create an uninitialized object, meaning that it’s always safe to call DoSomeWork.
As bad as this API is, it’s not too bad because you control the scope of the uninitialized object. It’s not like code in some other method can reach in and access your Foo instance.
Now imagine you create a static class with a similar interface:
public static class Foo
{
static Foo() {} // static class constructor
public static void Initialize(string bar); // sets up the object
// DoSomeWork does something that depends on Initialize being called.
public static void DoSomeWork();
}
Remember, the static class is at global scope. Any part of your program can reference the namespace and call methods on this class. They don’t have to construct it. It’s all too easy to write somewhere in your program:
Foo.DoSomeWork();
If that’s the first reference to Foo in your program, the constructor will run, Initialize won’t be called, and your static class will either crash and burn or, probably worse, do something with the default values. You better hope that your defaults aren’t dangerous.
If your static class requires any explicit initialization on your part (i.e. you have to call an Initialize method), then you probably should revisit your design. It’s almost a certainty that you should re-code that to be an instance class with a constructor that accepts a parameter and creates a fully initialized object. I think you’ll discover that your code is cleaner without the static classes.
By Jim, on March 18th, 2013 In my post, Don’t depend on what you don’t control, I explained why you shouldn’t depend on the system clock for elapsed time measurements. That is, consider this code:
var startTime = DateTime.Now;
// perform some long-running operation
var endTime = DateTime.Now;
// compute elapsed time
var elapsed = endTime - startTime;
That code will produce erroneous values if the system clock changes between time checks. Imagine your surprise when you discover that it took -56 minutes to run a 5-minute job, because you just happened to start the job at one minute before 2:00 AM on the day we set our clocks back for Daylight Saving Time.
The system clock can change at any time. You don’t control it.
There’s another way this can trip you up. Say you want your program to do something at 4:00 AM. You start the program some hours before that and set a timer to wake up when the prescribed amount of time has elapsed:
TimeSpan delayTime = Settings.WakeupTime - DateTime.Now;
System.Threading.Timer t = new System.Threading.Timer(
TimerTickHandler, null, delayTime.TotalMilliseconds, -1);
That creates a one-shot timer that will execute TimerTickHandler when the specified amount of time has elapsed.
But it’s “spring ahead” weekend. You wanted the program to wake up at 4:00 AM, but the clock skipped ahead an hour so it’s 5:00 AM before the timer tick happens. Oops.
The cause is the same: assuming that the system clock increases one second for every second of time that elapses. It’s an incorrect assumption.
When the system clock changes due to an external event, any time interval based on the previous setting is invalid.
There are two ways around this problem. One is to create a timer that ticks every minute (or every second, depending on how accurate you want to be). The timer’s callback then checks the current system clock against the desired wakeup time and does whatever it’s supposed to do when they match. This works well, but I find it intellectually dissatisfying because it’s polling. Polling for time is such a waste of resources. (Never mind that a few instructions every second won’t be noticed.)
A better solution is to use a Waitable Timer. Unfortunately, the .NET Framework doesn’t have a Waitable Timer object. But one is available. See my article Waitable Timers in .NET with C#. Full source code is available at http://www.mischel.com/pubs/waitabletimer.zip.
By Jim, on March 14th, 2013 The C# compiler will convert expressions that use the “+” operator on strings into calls to the string.Concat function. So, for example, in this code:
string a = "hello, ";
string b = "world";
string s = a + b;
The compiler changes the last line to the equivalent of:
string s = string.Concat(a, b);
Seems reasonable, right?
Consider also Console.WriteLine, which treats its arguments as objects, and calls ToString on them. So this code:
int a = 10;
int b = 27;
Console.WriteLine(a + b);
Gets converted to the equivalent of:
int temp = a + b;
Console.WriteLine(temp.ToString());
So what happens when you write this?
Console.WriteLine(a + b + "foo");
You probably won’t be surprised to see the output “37foo”.
How about this one?
Console.WriteLine(a + (b + "foo"));
The result? “1027foo”!!
So how does that work? The compiler converted that code to:
string s = String.Concat(a, b, "foo");
Console.WriteLine(s);
You see, there are String.Concat overloads that treat the arguments as type object. Those methods just call ToString on each argument, and concatenate the results.
That was pretty surprising. Consider this:
int a = 10;
int b = 27;
string s1 = a + b; // error: can't convert int to string
string s2 = (a + b) + "foo"; // no problem, "37foo"
string s3 = a + (b + "foo"); // no problem, "1027foo"
I understand why this happens. I don’t understand why the compiler writers thought it was a good thing. It leads to some very confusing results:
int m = 10;
object o = new object();
string f = "foo";
var s1 = m + o + f; // error: cannot apply operator "+" to types int and object
var s2 = m + f + o; // hey, no problem.
That’s just wrong. It’s logically consistent, once you understand the logic. But it’s pretty darned confusing when you first look at it.
By Jim, on February 23rd, 2013 I mentioned yesterday that I’d run across a program that I’d written in 1982. I spent a little time looking it over and thought I’d post my impressions of the code. You can download the program itself and follow along if you like: RENUMBER.BAS
I’ll admit up front that being objective might prove difficult. After all, I wrote the program 30 years ago. I might be too easy on myself in some places and too hard on myself in others. I do have to keep in mind that some of what we would consider poor programming practice today was de rigueur at the time, and much was a result of the tool. BASIC wasn’t a particularly good language for writing structured code. Today’s BASIC (even Visual Basic 6) is a much different beast than the line-oriented BASIC of 1982.
At less than 400 lines, the program isn’t particularly large or, as it turns out, terribly complicated. My recollection, before I saw this code again the other day was of a much larger and more complex program. I do recall that it was something of a challenge at the time, although I never doubted that I could write it. In a modern block-structured language, the program would easily be 1,200 lines of code. That old BASIC could be terse, and there were definite benefits to placing multiple statements on a single line.
General operation
The program operates in two passes.
Pass 1
For each input line
- Strip high bit from each character // required because the editor used
// would sometimes set the high bit.
- Parse line number (if it exists) and save
- Write fixed output line to a temp file
- Compute new line numbers
Line numbers are stored in a two-dimensional array called LINE.NUMBERS. After the first pass is done, the first column contains the old line numbers and the second column contains the new line numbers. Given that structure, it’s easy to look up an old line number (using binary search) and get the new line number.
Pass 2
For each input line (from the temp file)
- Parse the old line number, look it up in the line numbers array, and
replace with the new line number.
- Parse the line looking for GOTO, GOSUB,
and other constructs that make line number references.
- As each such construct is found, get the old line number, and replace
it with the new line number.
- Write the modified line to the output file.
That’s the general operation. The option to move a block of code rather than renumber the entire program adds some wrinkles, but the basic idea remains the same.
First impressions
After I got past the visual assault of all-capital BASIC and such dense code, I tried to focus on the overall structure of the program. I thought I did a creditable job of modularizing the code at a high level, given the constraints of the tool. Each major subroutine is preceded by a short comment describing what it does, and those subroutines are reasonably short and focused.
Although I did place multiple statements per line in some places, most of that was due to the lack of a multi-line IF...ELSE construct. You'll often see
IF <condition> THEN <stmt1>:<stmt2>
ELSE <stmt3>:<stmt4>
I created that visual line break by inserting a LF character in the editor. That little bit of trickery made it look like a multi-line construct, although the entire thing was considered a single logical line by the interpreter. Doing that made the code more readable.
For reasons that I don’t quite recall, I didn’t like using the DEFINT A-Z statement to make all the variables implicitly integers. I think I got burned by relying on implicit definitions at some point and went on an explicit kick. The variable FOUND%, for example, is explicitly made an integer by the % suffix. Similarly, strings are identified by the $ character. Considering that the program doesn’t deal with floating point numbers at all, there’s no good reason not to have used DEFINT. It probably would have made the code more readable. I’ll dock myself on style for this one, especially because I didn’t apply it consistently.
As shown here, the program won’t handle more than 1,200 numbered lines. That’s not as bad as it sounds, though. The compiled BASIC didn’t require a number on every line–just those that were the targets of GOTO, GOSUB, etc. I don’t recall how large Dean’s source code (the program I wrote this to handle) was, but I do remember that 1,200 was far more line numbers than we actually needed.
An observation: it took me a few minutes to remember what PRINT #2 is supposed to do. It writes text to file number 2. The number is assigned in the OPEN statement. My first thought on seeing that was, “Some comments would have been nice.” But in truth, the problem was that I’m unfamiliar with the language. My 21-year-old self would rightfully object if I told him to add a comment like:
' WRITE LINES TO OUTPUT FILE
PRINT#2, A$
Doing so would be akin to commenting a MyStream.WriteLine(line) today.
I was too enamoured with short variable names. Whereas it’s true that code ran faster with shorter variable names, this wasn’t a particularly performance sensitive application. My usage here was inconsistent. There are plenty of variables like LINE.NUMBERS and LASTLINE, but also too many like MT%.
On a side note, I have to say that this is the only language I recall ever using that allowed the period to be part of a variable name. Other languages use it as a scope separator. LINE.NUMBERS, for example, would refer to the NUMBERS field in the LINE structure. But MBASIC didn’t support the underscore or any other such separator character.
I also made some poor choices in variable names. FIND for the line number to find is a particularly good example, as is FOUND for the result. Other examples are the variable MOVE, and using CHAR% and CHAR$ in the same context, and plenty more.
It’s impossible not to use global variables in a BASIC program, since all variables are global and there are no parameters to subroutines. Communication consists of setting global variables before calling, and retrieving the results from other global variables after the call returns. Still, it’s considered good practice to keep variables as local as possible. I don’t see any egregious misuses of the global data model, but I probably could have made things more explicit by marking data transfer variables somehow.
I’d give myself good marks for the overall structure of the program and trying to maintain locality of reference. About average for variable naming. A big black mark for failing to use DEFINT and for being inconsistent with the type suffixes.
A few nasties
Take a look at this construct:
610 IF MOVE%=0 THEN
WHILE NOT(EOF(1))
ELSE WHILE NOT(EOF(1) AND EOF(3))
When I saw that the other day, I remembered writing it. More to the point, I remember thinking that it was incredibly clever. And it gets worse. The loop is terminated starting at line 980:
980 IF MOVE%=1 THEN WEND:GOTO 1000
990 WEND
That BASIC allowed such a wonky construct is kind of surprising. Imagine being able to write this in a C-like language:
if (move == 0)
while (!eof(1)) {
else
while (!(eof(1) && eof(3)) {
// do stuff here
if (move == 1)
}; // ends the loop for one case
else
}; // ends the loop for the other case
I could have written that much more clearly and avoided the brain bending wonkiness. BASIC didn’t short circuit Boolean evaluation, but there were other alternatives.
Major demerits for that bit of cleverness, youngster. You knew better.
The first time I saw “REFRENCES”, I thought I’d made a typo. But there are four uses with that spelling, and no uses of the correct spelling, “REFERENCES.” Apparently, I didn’t know how to spell.
The verdict
Overall, I think I did a good job on that program. The most important part was that it worked. As I recall, there were a few minor bugs and one showstopper after the program was in production, but after that the program got regular use for several years. I don’t recall what any of the bugs were, and I don’t know if this version of the program includes the fixes.
It’s tempting to rewrite the program in C#, just to see how it would look. But not quite tempting enough to actually do it.
By Jim, on February 21st, 2013 Today a coworker and I were discussing the first programs we ever sold. He and I have been in the business about the same amount of time (right at 30 years), so it was a nostalgic although unfortunately brief discussion.
After dinner this evening, I got to wondering if I still had that program lying around somewhere. Thanks to the magic of multi-terabyte drives and Windows Explorer’s search facility, it was a matter of just a few minutes to locate a copy of that program. Man, I’m a pack rat.
10 'RENUMBER.BAS
PROGRAM RENUMBERS BASIC PROGRAMS.
BASIC PROGRAM MUST BE SAVED IN ASCII FORMAT OR EDITED WITH A
WORD PROCESSOR SUCH AS WORDSTAR.
This was the fall of 1982. I’d left college after two years and was doing odd jobs to get spending money, but mostly I was broke. My friend Dean had a fairly large compiler BASIC program that he had to renumber because … well, that’s what happened to BASIC programs. You thought you put enough space between line numbers and sections, but inevitably you’d get squeezed.
That wasn’t a problem with interpreted BASIC, because there was a RENUM command. No such thing in the compiler BASIC, though, and you couldn’t load the compiler BASIC program into an interpreter because the code didn’t have a line number on every line.
One feature of Microsoft’s compiler BASIC is that not every line needed to be numbered. Only lines that were the targets of GOTO, GOSUB, and other such statements required line numbers.
Dean mentioned this one night and, eager for a challenge (or not smart enough to keep my mouth shut), I said that I’d write him a program to renumber his BASIC program. After all, I’d been writing BASIC programs for three whole years. I’d even taken a few college courses on programming. Of course I could whip this thing out in no time.
I honestly don’t remember how long it took me to write the program. Probably on the order of a week of evenings spent at Dean’s shop standing at one of his computer terminals (a Televideo 950 hooked to a machine running MP/M, supporting four users with a 4 MHz Z-80 processor and a whopping 256 kilobytes of RAM).
The funny thing is that I took on the project as a challenge: something fun to do while spending time with my friend. Dean might have mentioned that he’d pay me for it, and he did when I finished it, but that wasn’t my primary motivation.
I always thought it was funny that I used an interpreted BASIC program to renumber his compiler BASIC source file. In the days before source level debuggers for compiled code, the BASIC interpreter was a much friendlier environment for prototyping things. I probably would have written the program in Pascal if I’d had a compiler for it at the time. The only other language I knew at the time was Z80 assembly, and I wasn’t looking for that much of a challenge.
The program is surprisingly small–less than 400 lines of code. Looking it over brought back some memories. Surprisingly, the code is pretty good. A little difficult to follow, as even small BASIC programs are. But that’s due more to the nature of the language than my meager programming skill at the time. There also are some reasonably mature insights in the code.
More than anything, looking at that code makes me appreciate languages that have real subroutines that require parameters and return values. Calling line numbers (GOSUB) and communicating with globals is terribly difficult to follow. And the language lets you do (sometimes forces you to do) wonky things.
Think I’ll study this code a little more and do something of a code review. Might even post the program here. That should be fun. More next time.
|
|
|