The evolution of an idea

What's this all about?

This post assumes some knowledge of Clojure, and is basically a first-person view of what I'm thinking about as I have an idea for an interesting function, write up a prototype, and gradually refine it. So it's probably most interesting if you want to see how Clojure code is often developed and how to take advantage of the language's features and idioms to produce quality functions. It's pretty in-depth, so feel free to skim if there are parts you think you already have well in hand - I won't mind, I promise!

Inception

Working my day job at HubPages, I find myself wanting to solve the following problem in PHP: I have a big ol' list of URLs, and I want to split it into two lists. The first list should have no more than two URLs from a given domain, and the second should have whatever's left over from the original list that didn't fit into the first.

The "obvious" solution in PHP is to iterate over the original list with a foreach loop, but there are several reasons not to like that:

  1. It interweaves application logic (what domain a URL belongs to) with bookkeeping code (maintaining two lists and iterating over a third).
  2. Mostly because of (1), it's error-prone and hard to debug. It includes a lot of boilerplate for walking over arrays, and if I mistype something or forget a step it will be hard to spot.
  3. It's not reusable at all: if tomorrow I want to search a list of users and find the first five whose name starts with each letter, I'll have to either start from scratch or copy-paste the code and (probably) make a mistake when editing it to fit into the new context.

Enter Clojure

PHP has support for functional programming that I would call "tolerable". It's possible to write functional code that is composable and reusable, but it's ugly and verbose, often so much so that the low-level imperative solution is the best compromise for the small job at hand.

Clojure makes it much easier to write concise, maintainable functional code, and having a REPL available dramatically speeds up development, so I often prototype solutions in Clojure. This lets me try things out and improve the basic algorithm from the instant feedback the REPL gives me, before doing it "for real" in PHP.

Preliminary groundwork

Writing code is fun, but before re-inventing the wheel just because coding is more fun than thinking (don't tell me you've never done this), a good engineer looks around for tools that already exist and can either be learned from or used directly for the problem at hand. In Clojure, the primary part of this work is breaking the task up into a set of smaller primitive tasks. So what is it I really want to do?

Well, first of all I notice that the task looks an awful lot like distinct, except that (a) I want to filter based not on the list elements themselves, but based on some property they have, and (b) I want to allow duplicates, but only up to N of them. So I can't use distinct directly, but I will keep it in mind, and keep its source code handy to see if there are cleverer ways of doing some of my subtasks.

My goal is to yield a sequence of items, and I'll be working my way through an input sequence. Map and reduce are the ideal candidates for this sort of job; in this case reduce will be my primary tool because I'll need to "remember" work from elements 0..N-1 in order to decide what element N is.

So what kind of data structure do I want to keep track of in the reduce phase's "accumulator"? I want to know how many of a given element I've already put into the output sequence, so that sounds like a hash-map of (element, count) pairs. But I also need to maintain the output sequence itself, since reduce returns a single object; so I'll keep a pair: [{elt1 count1, elt2 count2...} return-value].

A rough draft

(second (reduce (fn [[counts ret :as old] item]
                (if ((fnil > 0) (counts item) 1)
                  old
                  [(update-in counts [item] (fnil inc 0))
                   (conj ret item)]))
              [{} []]
              [1 2 2 2 2 3 0 1 1]))

;; results in [1 2 2 3 0 1]

In fact this problem has several parts, so I've broken it up a little more: for my first draft I'm not worrying about the type of data I'm working with, or about which "group" an element falls into: I've basically written distinct-2, for allowing at most two copies of each item in the list. I also haven't wrapped it in a function at all: I've hard-coded the maximum of 2, and the list to process; a benefit of the REPL is that I can build an expression for transforming data the way I want, and then later put it into a function with parameters for the parts I'll want to change.

I won't explain the whole function (though you should make sure to understand it before moving on), but I'd like to highlight a few points of interest:

  • Because reduce is accumulating a (working-data, result) pair, I've (a) wrapped it with (second) so that the caller only gets the end result, and (b) destructured reduce's first argument into [counts ret]. This is a very useful idiom for keeping track of internal data that won't be part of the end result when processing data with reduce (or iterate).
  • update-in. Whenever you're working with maps, ask yourself if you've remembered to consider update-in.
  • fnil. This little-used function is a fantastic partner to update-in. I want to get an element out of the counts map, or zero if no such item exists. I could have written (> (or (counts item) 0) 1), which is only a little less readable than the fnil version, but for "add one to the current entry" fnil is a huge win. The alternative is writing something horrible like (assoc counts item (inc (or (counts item) 0))) or (update-in counts [item] #(if % (inc %) 1))).

...in PHP?

This took all of three minutes to write in Clojure, but I'm already cringing at the idea of translating to PHP and I've only solved a simplified version of the problem. I give up and write the foreach loop - hopefully I won't have to rewrite this code for some other kind of data, like the user list mentioned earlier. But it was definitely an interesting experiment, so I shelve the Clojure for further work when I get home.

Solving the original problem

(defn keep-only [max classify seq]
  (second
   (reduce (fn [[counts ret :as old] item]
             (if ((fnil > 0) (counts (classify item)) (dec max))
               old
               [(update-in counts [(classify item)] (fnil inc 0))
                (conj ret item)]))
           [{} []]
           seq)))

(keep-only 2 identity [1 2 2 2 2 3 0 1 1])
[1 2 2 3 0 1] ; yup, haven't broken the first version

(keep-only 3 #(rem % 3) [1 1 2 4 5 92 4381 184 9123 658])
[1 1 2 4 5 92 9123] ; 9123 was the only one divisible by 3!

Cool, it works. Here I've mostly just wrapped it up in a function, and added the classify feature, so that the caller can say "I want at most N items in each bucket, and here's how you decide what bucket to put stuff in". Note that counts always deals with (classify item), and ret always gets given the original item: the caller doesn't care what bucket we put things into, they just want a trimmed version of the list.

Not much else of interest in this version, except to verify that when passed 2 and identity, keep-only behaves exactly like the original simplified version.

Paring it down

But there are several ideas that are repeated in this code, and I try to avoid repeating anything when I'm coding, especially in such a functional language as Clojure. There's also a little more intermixing of ucky math- and nil-handling code with the core "count and bucket" algorithm than I'd like, so I'll try separating it out as well. And maybe this isn't a good idea, but while I'm doing all this I'll do some premature optimization by factoring out repeated computations like (fnil inc 0).

(let [[< inc] (map #(fnil % 0) [< inc])]
  ;; TODO write a with-adjustments(?) macro for the above
  (defn keep-only [max classify seq]
    (second
     (reduce (fn [[counts ret :as old] [item class]]
               (if (< (counts class) max)
                 [(update-in counts [class] inc)
                  (conj ret item)]
                 old))
             [{} []]
             (map (juxt identity classify) seq)))))

Now I've made fnil only appear once in the code, and only be called twice no matter how many times the function is called, but more importantly the reduction code no longer deals explicitly with nil: it just calls < and inc, and zeros are magically substituted because I've stealthily replaced < and inc with specialized versions for this function. But now I've written < and inc a lot of times each, and I've left myself a note about this repetition: if I see this pattern coming up again, I'll write a macro that I can use in both contexts, something like (with-adjustments #(fnil % 0) [< inc] (...body...)).1

Before you look at changes to the reduction function (maybe it's called a reductor?), skip down to the bottom to see that (map ...) call. Because map is lazy, I can classify all the items once at the start, instead of making the reductor track the class of each item. I use (juxt identity classify) to get a sequence of [item, class] pairs, and then when I get to the reductor I can just destructure instead of calling classify.

I've also snuck in a small change in how I use max: instead of (> blah (dec max)), I've written (< blah max) and reversed the order of the if clauses. This has such a trivial impact on performance and readability that if one ordering of the if/else made more sense than the other I'd use that one, but since they seem about equal to me I might as well use the simpler and faster version.

This third revision has introduced some readability improvements, and some performance improvements, and because it introduced them both at once it's debatable whether this is actually a better version than the second draft - it may be sacrificing readability for frankly negligible performance gains. So if I were writing this for a public project or library, I would make another pass and decide which of the two is more important, figure out which "readability" changes are actually less readable, and add some more comments.

Or I could do something interesting

But this project started because it was fun, and it won't be fun to apply that professional level of polish. Instead, I'll try something more challenging yet still useful: can I make this lazy? A problem with reduce is that it consumes the entire input sequence to produce one output value; this means I can't use my keep-only with an infinite sequence like (range), for example to find the first ten numbers that are divisible by four and the first ten that aren't with (take 20 (keep-only 10 #(zero? (rem % 4)) (range))).

But making this lazy is hard, because I can't use reduce anymore. I might be able to use reductions, but it's not clear to me how because it will produce one output value for each input value, when I want to completely discard some of the inputs. Will I have to use lazy-seq directly? That seems really unpleasant, and there must be a better way. What about iterate? I could iterate over a data structure like [next-output counts remaining-input], with a function that consumes remaining-input until it finds something suitable and then returns a new tuple. It will be a pain to have my iterator function ignore its first argument, though, so maybe I'll pass it last and use a destructuring form that doesn't even bind it.

Laziness in all its "glory"

;; Now it's lazy, but getting disgusting. Find a
;; reasonable way to factor out next-elt, and maybe
;; tidy up the horrible destructuring used to iterate
;; over two things (counts and class/item pairs) while
;; actually returning a third (items accepted).
;; Wonder how this would look using (unfold)?
(let [[< inc] (map #(fnil % 0) [< inc])]
  (defn keep-only [max classify seq]
    (let [seq (map (juxt identity classify) seq)
          next-elt (fn [[[counts items]]]
                     (when-let [[item class] (first items)]
                       (let [count (counts class)]
                         (if (< count max)
                           [[(assoc counts class (inc count))
                             (rest items)]
                            item]
                           (recur [[counts (rest items)]])))))]
      (->> [[{} seq]]
           (iterate next-elt)
           (rest)
           (take-while identity)
           (map second)))))

This code is not easy to read. Don't worry too much if you have trouble with it - I'm programming at the limit of my competence2 and have a hard time following some of it myself. And that leads to quite a few valuable lessons:

  1. Challenge yourself to write code that makes your brain hurt. It's good for you, and with practice you'll be able to use these confusing, complex tools more and more comfortably. Usually they're confusing because they're more powerful than what you're used to, so this will be worth it.
  2. Code quality is one of the many factors you need to consider when making trade-offs while developing. Sometimes having code that (probably) works an hour from now is better than having super-elegant code in a week; sometimes you need code that will still be legible and maintainable five years from now.
  3. But know you are making that choice. Don't just write crap code because it's easy: leave yourself a note saying how dreadful it is and how you can improve it if that becomes necessary. Often just writing that note will shame you into doing it right.

A review: other possibilities

Notice I left myself a note about using unfold instead of writing iterate/take-while and such myself. It seems like this is the sort of problem unfold is perfect for, the opposite of a reduction: turn one value (the input list) into a lazy sequence of output values. Well, I don't have a lot of experience with unfold (I've written my own version, but it's not a Clojure built-in), so it didn't occur to me when I was considering existing tools I could use.

I did try rewriting it with unfold, and concluded that it's definitely possible, but there are some really awkward things I would have to do to make it work, so I stopped. Maybe someone with more experience than I have would have found a way, but I'm satisfied with my iterative solution.

And while I was writing this article, I realized there was yet another tool that might have been appropriate for the job, which would have made the whole task trivial if I were willing to accept some limitations: group-by. I could just write:

(defn keep-only [max classify seq]
  (mapcat (comp (partial take max) val)
          (group-by classify seq)))

This can't be lazy, and it loses ordering of the original sequence, but it works perfectly if you're willing to accept that trade-off.

Source

Hammock time

This review really emphasized, for me, the value of what Rich Hickey calls "hammock time": the time you spend away from the computer thinking, instead of pounding out code as soon as it comes into your head. It results in much better code eventually, often at a net time savings: spend ten minutes thinking and five coding, instead of thirty coding. Sometimes this is tough to do because unenlightened managers equate typing with working, but I'm fortunate not to work in that kind of environment.

So I hope you've enjoyed your foray into my own hammock; if you have a better understanding of how to go about writing code, or you've learned some things not to do, then I'm delighted. Stick around, and leave comments if there are any specific topics you'd like me to cover, or I'll just keep writing about whatever strikes my fancy.

Footnotes and updates

1I wrote this macro after all, and (pardon me while I proselytize a bit) molding the language to suit my problem was as easy as could be, because of the power of Lisp. About thirty seconds of typing, and I have a new language just like Clojure, but better at seamlessly using specialized versions of the built-in functions. Here's the whole thing, and a sample usage:

(defmacro with-adjustments [adjuster funcs & body]
  `(let [~funcs (map ~adjuster ~funcs)]
     ~@body))

(inc nil)
java.lang.NullPointerException

(with-adjustments #(fnil % 0) [inc <]
  (< nil (inc nil)))
true ; yep, 0 is less than 1

2Ray Miller has contributed a simpler and much more readable lazy version in the comments below. You can see his original version at https://gist.github.com/848069. and a version I improved slightly at https://gist.github.com/848084. Ray has kindly helped me prove that no matter how much thinking I do about an idea, there's always a better way to do it!

More by this Author


Comments 4 comments

Ray Miller 5 years ago

I enjoyed this article - thanks - but I couldn't help thinking that such a simple problem should have a simple solution. I know the usual advice is to use higher-order functions in preference to lazy-seq, but sometimes it does make life easier. Here's my attempt at implementing keep-only using lazy-seq:

(defn keep-only

[n bucket-fn s]

(letfn [(my-filter

[s seen]

(when (seq s)

(let [e (first s)

b (bucket-fn e)

c (inc (get seen b 0))]

(if (> c n)

(recur (rest s) seen)

(cons e (lazy-seq (my-filter (rest s) (assoc seen b c))))))))]

(my-filter s {})))


Ray Miller 5 years ago

Apologies for the poor formatting of the code in my last comment. Please see this gist instead: https://gist.github.com/848069


amalloy profile image

amalloy 5 years ago from Los Angeles Author

Thanks, Ray. This is a much more readable lazy version than mine, and I'll add a footnote pointing to it. You do have a small error, though: the sequence isn't consumed as lazily as it should be. Eg, even if no element is ever requested from (keep-only), the first element of the underlying seq is consumed. I've forked your gist to https://gist.github.com/848084 with the adjustment made.


Simone Smith profile image

Simone Smith 5 years ago from San Francisco

Hoooooly crap. No wonder you take time to write Hubs... you could practically print and bind this thing! What a beaut of a Hub! Gosh, I wish I could understand more of it. Perhaps someday I will, hahaa!

Wonderfully written, wonderfully illustrated, wonderfully formatted. My hat is off to you, sir.

    Sign in or sign up and post using a HubPages Network account.

    0 of 8192 characters used
    Post Comment

    No HTML is allowed in comments, but URLs will be hyperlinked. Comments are not for promoting your articles or other sites.


    Click to Rate This Article
    working