Cook Computing

June 2009

Linq and Functional Programming

I'm writing some code to display some old blog posts. Each post is retrieved as a single string containing CR-LF separated lines, so I needed to split the string into individual lines and wrap each line in a <p> tag. I could split the string using String.Split, but I wondered whether it was possible to use a StringReader to generate an enumeration of the lines in the string. It turns out that StringReader doesn't support this directly, so I wrote an extension method to supply the required functionality:

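using System.Collections.Generic;
using System.IO;

public static partial class Extensions
{
  // Lazily yields each line of the reader; ReadLine() returns null at the end.
  public static IEnumerable<string> Lines(this TextReader textReader)
  {
    string line;
    while ((line = textReader.ReadLine()) != null)
      yield return line;
  }
}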

The extension method is defined on TextReader, the parent class of StringReader, so it also works for StreamReader. This makes it possible, for example, to generate an enumeration of the lines in a file. You could use File.ReadAllLines, but that builds an array of every line in the file when it is called, whereas an iterator reads data from the file only as each line is requested.
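
As a minimal sketch of the file case, the following applies the same iterator to a StreamReader. The class names LinesDemoExtensions and LinesDemo and the method FirstNonBlankLine are made up for this illustration, and the extension method is repeated so that the sketch compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LinesDemoExtensions
{
  // Same iterator as above, repeated so the sketch is self-contained.
  public static IEnumerable<string> Lines(this TextReader textReader)
  {
    string line;
    while ((line = textReader.ReadLine()) != null)
      yield return line;
  }
}

// Hypothetical demo class, not part of the original code.
public static class LinesDemo
{
  // Returns the first non-blank line of a file, or null if there is none.
  // No further lines are requested from the reader once a match is found.
  public static string FirstNonBlankLine(string path)
  {
    // The reader has to stay open while the sequence is enumerated,
    // so the query is evaluated inside the using block.
    using (StreamReader rdr = new StreamReader(path))
    {
      return rdr.Lines().FirstOrDefault(line => line.Trim() != "");
    }
  }

  public static void Main()
  {
    string path = Path.GetTempFileName();
    File.WriteAllText(path, "\r\nfirst\r\nsecond\r\n");
    Console.WriteLine(FirstNonBlankLine(path));
    File.Delete(path);
  }
}
```

One thing to watch with this approach is that the sequence is only valid while the reader is open, so returning rdr.Lines() itself from inside the using block would fail when the caller enumerated it.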

I also needed to concatenate the strings in the sequence of lines so I wrote another extension method:

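using System.Collections.Generic;
using System.Text;

public static partial class Extensions
{
  // Joins the strings in the sequence, placing separator between items.
  public static string Concatenate(this IEnumerable<string> strings,
    string separator)
  {
    StringBuilder strbldr = new StringBuilder();
    bool first = true;
    foreach (string str in strings)
    {
      // Track the first item explicitly; testing strbldr.Length would
      // lose the separator after a leading empty string.
      if (!first)
        strbldr.Append(separator);
      strbldr.Append(str);
      first = false;
    }
    return strbldr.ToString();
  }
}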

This then allows me to write code like this:

string txt = @"one
two
three";
StringReader rdr = new StringReader(txt);
string output = rdr.Lines()
  .Where(line => line != "")
  .Select(line => "<p>" + line + "</p>")
  .Concatenate(Environment.NewLine);
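
For comparison, the String.Split route mentioned at the start might look something like the sketch below. It uses only String.Split and String.Join from the framework; the class SplitDemo and the method WrapLines are names invented for this example.

```csharp
using System;
using System.Linq;

// Hypothetical demo class, not part of the original code.
public static class SplitDemo
{
  // Wraps each non-empty CR-LF separated line of txt in <p>...</p>.
  public static string WrapLines(string txt)
  {
    // Split builds the whole array of lines up front.
    string[] lines = txt.Split(new[] { "\r\n" },
      StringSplitOptions.RemoveEmptyEntries);
    return string.Join(Environment.NewLine,
      lines.Select(line => "<p>" + line + "</p>").ToArray());
  }

  public static void Main()
  {
    Console.WriteLine(WrapLines("one\r\n\r\ntwo\r\nthree"));
  }
}
```

This is shorter, but it materialises every line before any of them is processed, which is the trade-off the iterator version avoids.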

I am finding that with the influence of Linq I am using a more functional style of coding, not just for manipulating data in a database but also for in-memory objects such as arrays. Treating an array as a sequence to which you can apply functions means you can write higher level code which is easier to understand, and which is less likely to have bugs, because the code is focused on the required functionality rather than how to implement it.

Bill Venners describes a similar experience in How Scala Changed My Programming Style. His example translates to C# as follows. The imperative version:

var nameHasUpperCase = false; 
for (int i = 0; i < name.Length; i++)
{
  if (char.IsUpper(name[i]))
  {
    nameHasUpperCase = true;
    break;
  }
}

And the functional version:

var nameHasUpperCase = name.Any(c => char.IsUpper(c));

Of course, as Raganwald says in Why Why Functional Programming Matters Matters, speaking of how functional code expresses a lot more what and a lot less how, this doesn't come for free:

In general, we think this is a good thing. But it isn't free: somewhere else there is a mass of code that supports your brevity. When that extra mass of code is built into the programming language, or is baked into the standard libraries, it is nearly free and obviously a Very Good Thing. A language that doesn't just separate the concern of how but does the work for you is very close to "something for nothing" in programming.

In the case of the code above I had to write the extension methods but in a more functionally oriented language it might not be necessary to write anything extra.

In general, I wonder if a language designed to be inherently more functional, such as F#, is worth learning; not necessarily as a language for day-to-day use (it may be some time before it achieves widespread commercial usage, if ever) but because it might feed back into using C# more effectively, moving towards a more functional style of coding where possible.

Posted by Charles Cook at 01:33 PM.

Why No Top-Level Functions in C#

I've speculated for a long time about why C# doesn't have top-level functions, for example in this post, where the solution to the problem is rather ugly because the static functions have to be qualified by their class name, i.e. instead of this:

Rgx.Expr e = 
  Rgx.Seq(Rgx.Char('c'), 
    Rgx.Seq(Rgx.Plus(Rgx.Alt(Rgx.Char('a'), Rgx.Char('d'))), 
      Rgx.Char('r')));

It would have been nicer to write this:

Expr e = 
  Seq(Char('c'), 
    Seq(Plus(Alt(Char('a'), Char('d'))), 
      Char('r')));

So I was interested to read Eric Lippert's post Why Doesn't C# Implement "Top Level" Methods? Eric discusses the cost-benefit analysis of implementing this feature, in particular:

In this particular case, the clear user benefit was in the past not large enough to justify the complications to the language which would ensue. By restricting how different language entities nest inside each other we (1) restrict legal programs to be in a common, easily understood style, and (2) make it possible to define "identifier lookup" rules which are comprehensible, specifiable, implementable, testable and documentable.

By restricting method bodies to always be inside a struct or class, we make it easier to reason about the meaning of an unqualified identifier used in an invocation context; such a thing is always an invocable member of the current type (or a base type).

He describes how C# was originally intended to be a component-oriented language designed for large-scale application development, but that with the increasing popularity of REPL languages like F#, top-level functions are being considered for a future version of C# (with the emphasis on being considered).

Interestingly, a comment on the post notes that Java has the static import construct which allows unqualified access to static members. This allows you to import static class members either individually:

import static java.lang.Math.PI;

Or en masse:


import static java.lang.Math.*;

You can then use the imported members without qualification:

double r = cos(PI * theta);

The motivation for static import was to provide a way of avoiding the constant interface antipattern, described here by Joshua Bloch. This technique involves defining an interface which contains only static final fields. A class using these constants implements the interface and so code within the class doesn't need to qualify the constant names with a class name. Bloch provides this example:

// Constant interface antipattern - do not use!
public interface PhysicalConstants {
  // Avogadro's number (1/mol)
  static final double AVOGADROS_NUMBER   = 6.02214199e23;
  // Boltzmann constant (J/K)
  static final double BOLTZMANN_CONSTANT = 1.3806503e-23;
  // Mass of the electron (kg)
  static final double ELECTRON_MASS      = 9.10938188e-31;
}

Which is used like this:

class hello implements PhysicalConstants {
  public static void main(String[] args) {
    System.out.println("Avogadro's number is " + AVOGADROS_NUMBER);
  }
}

Fortunately or not, depending on your viewpoint, this antipattern cannot be used in C# because interfaces cannot contain fields.

Posted by Charles Cook at 09:17 AM.

Evening in Upper Dearne Woodlands

One of my favourite evening hikes after a day working at home is to park at Denby Dale, walk up to Upper Denby through Hagg Wood, then round to Square Wood Reservoir and up to the old Quaker settlement of High Flatts, then down to the Upper Dearne Woodlands via New House, and so back to Denby Dale. Beautiful views of the Dearne valley during the first half, then back through the woods which are particularly beautiful in the evening sunlight. About 4 miles.

Posted by Charles Cook at 08:47 AM.

Functional style regex engine in F#

Nick Palladinos has done a follow-up to my post Functional Style Regex Engine in C# Revisited. In his post Functional style regex engine in F# he presents just that, an F# version of my C# code:

let char c (s : string) = seq { if s.Length > 0 && s.[0] = c then yield s.Substring(1) }

let (=>) l r s = seq { for sl in l s do for sr in r sl -> sr }

let (<|>) l r s = seq { yield! l s; yield! r s }

let rec (<*>) e s = seq { yield s; yield! (e => (<*>) e) s }

let (<+>) e = e => (<*>) e

// example c(a|d)+r
let pattern = char 'c' => (<+>) (char 'a' <|> char 'd') => char 'r'

An interesting difference to my C# version is the use of custom operators, particularly the => and <|> infix operators, which make for a much nicer syntax. Compare the definition of the function pattern above to this:

Rgx.Expr e = Rgx.Seq(Rgx.Char('c'), 
               Rgx.Seq(Rgx.Plus(Rgx.Alt(Rgx.Char('a'), Rgx.Char('d'))), 
                 Rgx.Char('r')));

It's also nice to be able to define functions without having to put them in a class.

Posted by Charles Cook at 12:12 PM.

Running VMware Fusion on iMac

I've been running Windows 7 RC as a VMware Fusion guest machine for a week or so now. Windows 7 requires 1GB of memory and I was experiencing a lot of paging on my iMac, which only has 2GB of memory, when I was trying to run several other applications in Mac OS X at the same time. So I ordered a 2GB SO-DIMM and have just installed it. Things are running much more smoothly now.

Posted by Charles Cook at 07:07 PM.

NOptFunc

A few days ago Simon Willison posted about his optfunc command line parsing program written in Python:

Command line parsing libraries in Python such as optparse frustrate me because I can never remember how to use them without consulting the manual. optfunc is a new experimental interface to optparse which works by introspecting a function definition (including its arguments and their default values) and using that to construct a command line argument parser.

This is the example he provides:

import optfunc
    
def upper(filename, verbose = False):
    "Usage: %prog <file> [--verbose] - output file content in uppercase"
    s = open(filename).read()
    if verbose:
        print "Processing %s bytes..." % len(s)
    print s.upper()
 
if __name__ == '__main__':
    optfunc.run(upper)

And this is the resulting command-line interface:

$ ./demo.py --help
Usage: demo.py <file> [--verbose] - output file content in uppercase
    
Options:
  -h, --help show this help message and exit
  -v, --verbose 

I've recently been experimenting with C# 4.0 and I realized that the new optional parameter and default parameter value features make possible a similar style of command line parsing. So I wrote some code to do this and the result is the NOptFunc project. Using NOptFunc the code above can be written like this in C#:

using System;
using System.IO;
using CookComputing;

class Program
{
  static void Main(string[] args)
  {
    try
    {
      NOptFunc.Run(typeof(Program).GetMethod("Run"), args);
    }
    catch (Exception ex)
    {
      Console.Error.WriteLine(ex.Message);
    }
  }

  public static void Run(string filename, bool verbose = false)
  {
    string s = File.ReadAllText(filename);
    if (verbose)
      Console.WriteLine("Processing {0} bytes...", s.Length);
    Console.WriteLine(s.ToUpper());
  }
}

In comparison to the Python code the invocation of NOptFunc.Run() is quite ugly and also suffers from potentially failing at runtime if the wrong method name is supplied. It would be nice to be able to write something like this:

      NOptFunc.Run(methodinfo(Program.Run), args);

i.e. assuming that C# had a methodinfo operator along the lines of typeof, returning an instance of MethodInfo instead of Type (overloaded methods would complicate matters, requiring something like methodinfo(Program.Run(int, string))). Ian Griffiths discussed this in his post Getting a MethodInfo From a Method Token:

So I get a relatively warm fuzzy feeling about using typeof - I like code that will only be able to run if it can't fail. All other things being equal, I prefer this to code that has potential runtime failure modes.

I've always been mildly perplexed that there's no equivalent way of retrieving a MethodInfo object. E.g. a hypothetical methodinfo(SomeClass.SomeMethod) operator. It's not up the top of my list of language features I want added, it just seems mildly inconsistent to have the operator for getting Type objects but not the corresponding FieldInfo and MethodInfo objects. (Interestingly, there doesn't seem to be a direct way to retrieve an EventInfo in IL, so I can't really object to that one not being in the language.)

Until recently, I had never looked into the details of this. I wasn't previously sure if this missing feature was just something C# chooses not to do, or whether it's because it can't be done. But I recently had reason to generate some IL that does exactly this, so I can now say with confidence that it's possible, and it's just that C# doesn't supply a corresponding operator.
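
In the absence of such an operator, one workaround that keeps the compile-time check is to go through a delegate and read its Method property. The sketch below is only an illustration of that idea, not something NOptFunc does; MethodInfoDemo and GetRunMethod are hypothetical names, and Run is a stand-in for the method in the example above.

```csharp
using System;
using System.Reflection;

// Hypothetical demo class, not part of NOptFunc.
public static class MethodInfoDemo
{
  // Stand-in for the Run method in the NOptFunc example.
  public static void Run(string filename, bool verbose = false)
  {
  }

  // The method group conversion is checked by the compiler, so a
  // misspelt or renamed method is a build error, not a runtime failure.
  public static MethodInfo GetRunMethod()
  {
    Action<string, bool> run = Run;
    return run.Method;
  }

  public static void Main()
  {
    MethodInfo mi = GetRunMethod();
    Console.WriteLine("{0}.{1}", mi.DeclaringType.Name, mi.Name);
  }
}
```

The cost is that a delegate is allocated at runtime just to reach the MethodInfo, and the delegate type has to spell out the parameter list, so it is a workaround and not a substitute for a real methodinfo operator.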

I've still got a lot of tidying up to do with NOptFunc, for example throwing exceptions with more useful messages and supporting --help, but the basic functionality is working.
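
To give a rough idea of how optional parameters can drive this kind of parsing, the sketch below uses reflection to build a usage string from a method's parameters. It is an assumption about the general technique and not the actual NOptFunc implementation; OptionalParamsDemo and Usage are names invented for the example.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical demo class, not part of NOptFunc.
public static class OptionalParamsDemo
{
  // Stand-in for the Run method in the example above.
  public static void Run(string filename, bool verbose = false)
  {
  }

  // Builds a usage string: required parameters become <name>,
  // optional ones (those with a default value) become [--name].
  public static string Usage(MethodInfo method)
  {
    var parts = method.GetParameters()
      .Select(p => p.IsOptional ? "[--" + p.Name + "]" : "<" + p.Name + ">");
    return string.Join(" ", parts.ToArray());
  }

  public static void Main()
  {
    MethodInfo run = typeof(OptionalParamsDemo).GetMethod("Run");
    Console.WriteLine(Usage(run)); // prints "<filename> [--verbose]"
  }
}
```

ParameterInfo also exposes the default value itself through its DefaultValue property, which a parser could use for any option missing from the command line.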

Posted by Charles Cook at 04:03 PM.

List<T> Enumerator Gotcha

I find I am using sequences and iterators much more these days because of the influence of Linq. Even when manipulating arrays this often results in more robust code because you don't have to worry about boundary conditions with indices, and it is easier to pass around an iterator than an array plus a reference to the current position within it. Or so I thought until I came across an issue with List<T>.GetEnumerator() which is not immediately obvious and which may apply to other collection classes.

I had refactored some code into a separate function, passing in the instance of List<string>.Enumerator I was using. This code illustrates the problem:

using System;
using System.Collections.Generic;

class Program
{
  static void Main(string[] args)
  {
    var list = new List<string>() { "text" };
    var iterator = list.GetEnumerator();
    iterator.MoveNext();
    Console.Write("{0} ", iterator.Current ?? "null");
    Foo(iterator);
    Console.Write("{0}", iterator.Current ?? "null");
  }

  private static void Foo(List<string>.Enumerator iterator)
  {
    iterator.MoveNext();
    Console.Write("{0} ", iterator.Current ?? "null");
  }
}

I expected that after the return from the function the state of the iterator would reflect the call to MoveNext() made in the function, i.e. in this example the output would be "text null null". But the output is actually "text null text". This seemed inexplicable until I discovered that the List<T>.Enumerator type returned by GetEnumerator() is a struct, which means a copy of the struct is passed to the function. The position of the iterator is held inside the struct itself (presumably as an index into the list's underlying array), so any change the function makes to the position affects only its own copy and not the original instance in the calling function.

The solution is to box the struct by casting it to an interface:

using System;
using System.Collections.Generic;

class Program
{
  static void Main(string[] args)
  {
    // ...
    var iterator = list.GetEnumerator() as IEnumerator<string>;
    // ...
  }

  private static void Foo(IEnumerator<string> iterator)
  {
    // ...
  }
}

This ensures that the called function works on the same boxed copy of the struct as the calling function, because what gets passed is a reference to the box and not the struct's value.
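
An alternative that avoids boxing, offered here as a sketch and not as something from the original code, is to pass the enumerator by reference so that the function operates on the caller's struct. RefEnumeratorDemo and Trace are hypothetical names; Trace collects the three values that the earlier program wrote to the console.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo class, not part of the original code.
public static class RefEnumeratorDemo
{
  // Taking the struct by reference means MoveNext() advances the
  // caller's enumerator and not a copy of it.
  private static string Foo(ref List<string>.Enumerator iterator)
  {
    iterator.MoveNext();
    return iterator.Current ?? "null";
  }

  public static string Trace()
  {
    var list = new List<string>() { "text" };
    var iterator = list.GetEnumerator();
    iterator.MoveNext();
    string before = iterator.Current ?? "null";
    string inside = Foo(ref iterator);
    string after = iterator.Current ?? "null";
    return before + " " + inside + " " + after;
  }

  public static void Main()
  {
    Console.WriteLine(Trace());
  }
}
```

With the ref parameter this should give the "text null null" that was originally expected. The drawback compared with the interface version is that the function is tied to List<string>.Enumerator and cannot accept an enumerator from any other collection.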

Posted by Charles Cook at 03:29 PM.