
How are Haskell data structures represented so that they don't need to be copied every time I make a small change to them?

4 Answers
Jim Pivarski
I want to answer this because I think the technique is just too clever.  :)

The details differ for every data structure, but let's start by considering a tree.  A singly linked list is a "unary" tree: each node has zero or one children, zero if it is the end of the list.  Records of records (without cross-references) are trees with a number of children that depends on the type of record.

If you want to replace a big tree T with a bigger tree T' that consists of a new root node whose only child is T, you can just create the new node and set its "child" pointer to T.  You don't have to copy T; you just reference it.  If T is actually a linked list, you have just prepended an element to the head of the list.  The old list works just as it did before (as long as it is singly linked).
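A minimal sketch of this in Haskell (the list values here are illustrative): prepending with `:` allocates one new cons cell whose tail pointer is the old list itself, so the old list is shared, not copied.

```haskell
main :: IO ()
main = do
  let xs  = [2, 3, 4]  -- the original list T
      xs' = 1 : xs     -- T': one new cons cell pointing at T; T is not copied
  print xs'            -- the new, longer list
  print xs             -- the old list still works, unchanged
```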

If you want to replace a big tree T with a same-size tree T'' that has one leaf replaced, then you have to do some surgery.  You start by replacing the leaf and its immediate parent, but when constructing the parent, you can just reference unchanged structures in the old T.  Continuing this process recursively up the tree, you find that you have to create new nodes all along the path from the replaced leaf to the root, but none of the others.  For a well-balanced tree with N elements, you have to replace about log N nodes.  If T is a linked list of length N, you have to replace all N of the nodes.
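Here is a sketch of that "path copying" on a binary tree (the `Tree` type and `setLeftmost` are illustrative names, not a standard API): only the nodes on the path to the changed leaf are rebuilt, and every subtree off that path is shared with the original.

```haskell
data Tree a = Leaf | Node (Tree a) a (Tree a) deriving Show

-- Replace the leftmost element.  Only nodes on the path to it are
-- rebuilt; every right subtree 'r' is shared with the original tree.
setLeftmost :: a -> Tree a -> Tree a
setLeftmost _ Leaf            = Leaf
setLeftmost x (Node Leaf _ r) = Node Leaf x r
setLeftmost x (Node l v r)    = Node (setLeftmost x l) v r

main :: IO ()
main = do
  let t  = Node (Node Leaf 1 Leaf) 2 (Node Leaf 3 Leaf)
      t' = setLeftmost 0 t
  print t'  -- new tree with the leftmost element replaced
  print t   -- the original tree is untouched
```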

For linked lists, this is a huge asymmetry: changing the head node costs almost nothing, but changing the last node costs as much as creating a new list.  For balanced trees, it is a good trade-off: log N is small compared to N, and you get the safety and portability of (effectively) making defensive copies everywhere, as well as a complete undo history, if you want it.  When working with immutable data, you really need to think about your data structures, so that, for example, you only work at the cheap end of a list.

There are at least two things that could break this.  (1) If your data are not *all* strictly immutable, this structural sharing would create a confusing mess of cross-references: mutations would cause hard-to-predict action at a distance.  (2) The data structures must not contain any loops (they must be directed acyclic graphs, or DAGs).  Structural sharing would break a doubly linked list, because while the "next" pointer would be correct in the new list, the "prev" pointer would be wrong.

Note that Haskell is not the only language that does this.  I'm mostly familiar with Scala's immutable collections, and I know that the same thing has been implemented in Clojure (with a nice "mutable until frozen" feature), and I'm pretty sure that all of these ideas were developed in the ML family of languages in the 70's and 80's.
Andrea Ferro
When you think about Haskell, you should not think in terms of mutable data structures; you should think in terms of data values.  These are, effectively, immutable and persistent data structures.  Only when you dig deep into a specific Haskell implementation will you eventually find that some of those values are represented internally in terms of mutable data structures.

You may want to google "persistent data structures" for a general understanding of the concept of how this works. I'll answer your question more directly in terms of how they are represented in a Haskell implementation.

The internal representation of data is, for the most part, as boxed values.  Each boxed value is in fact a small block of heap-allocated memory, and therefore has a unique address, which is what is used to refer to that value.  This memory contains a tag and zero or more references (addresses) to other boxed values.  (Note: not all data is actually boxed, so a boxed value may in fact also contain zero or more unboxed data words.)

Let's look at your examples.

When you mention records, it is unclear whether you mean algebraic data types in general or Haskell's record syntax.  If you mean the latter, you should understand that it is mere syntactic sugar.  There is actually no such thing as a record in Haskell: once the syntax is desugared, it becomes ordinary code that manipulates ordinary data types.
You therefore cannot really do something like "update a single field".  What you actually do is create a new value.  The fields are just components of the data type, which are themselves values, and are therefore stored as references.  Since these are all boxed, you are "copying" the "structure" and, at the same time, "not copying" the actual values: you are copying the references.
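A small sketch of this (the `Person` type and its fields are illustrative names): the record-update syntax builds a brand-new value, but the field values themselves are only referenced, not copied.

```haskell
data Person = Person { name :: String, age :: Int } deriving Show

main :: IO ()
main = do
  let p  = Person { name = "Ada", age = 36 }
      p' = p { age = 37 }  -- a new Person value; the String for the
                           -- name is shared by reference, not copied
  print p'
  print p                  -- the original value is unchanged
```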

You mention lists.  Lists in Haskell enjoy special status, and several syntactic conveniences make working with them more pleasant, but under the hood lists are just an ordinary polymorphic type with two constructors.  One constructor has no arguments and represents the empty list (it's known as Nil in functional jargon); the other has two arguments, an element and a list, and represents non-empty lists (it's known as Cons in functional jargon).

Given a list and a new element, you can trivially create a list with the element prepended to the given list: your new list is the boxed representation of a list built with the Cons constructor.  The memory for the box has three words: a tag recording that the Cons constructor was used, a reference (pointer) to the element, and a reference (pointer) to the original list.
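The built-in list type is equivalent to the ordinary two-constructor type below (`List`, `Nil`, and `Cons` are illustrative names; the real constructors are `[]` and `(:)`). Prepending allocates exactly one Cons box whose second field is the old list.

```haskell
-- An ordinary polymorphic type with two constructors, just like [a].
data List a = Nil | Cons a (List a) deriving Show

-- Prepending is just applying the Cons constructor: one new box,
-- whose tail reference is the original list.
prepend :: a -> List a -> List a
prepend = Cons

main :: IO ()
main = print (prepend 1 (Cons 2 (Cons 3 Nil)))
```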

One important thing you should remember, however, is that the above does not happen when you think it does.  Haskell types are in fact lifted types, and evaluation is lazy.  Therefore what is probably generated as the representation of the new list is just an unevaluated closure, known as a thunk.  This is itself represented as a box (internally it is just like any other value) with an appropriate tag and references to a function and to its arguments.

Only when you actually use that list is the thunk forced (i.e. evaluated), and its representation is overwritten with the actual representation of the list that I described above.
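A tiny sketch of this laziness: the tail of the list below is a thunk that would crash if it were ever forced, yet taking the head completes fine, because only the outer cons cell is evaluated.

```haskell
main :: IO ()
main = print (head (1 : error "boom"))
-- head forces only the first cons cell; the crashing tail thunk
-- is never evaluated
```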

Furthermore, Haskell compilers employ global cross-module inlining and optimizations.  A lot of data is in fact never built and never allocated at all.  Many list values (emphasis on value) actually represent loops, not data structures (emphasis on structure), and are therefore never represented as data: they become the corresponding loop code.  A lot of data is used only once and is never allocated on the heap; it just stays in registers.  Things are therefore not at all what you think they are, once you dig down to the gory low-level details.
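For example, in a pipeline like the one below, GHC's foldr/build ("fusion") optimization typically eliminates the intermediate lists entirely when compiled with -O, turning the whole expression into a counting loop over machine integers (this is a compile-time effect; the program behaves the same either way).

```haskell
main :: IO ()
main = print (sum (map (* 2) [1 .. 1000 :: Int]))
-- with optimizations, neither [1..1000] nor the mapped list
-- is usually ever allocated as heap data
```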

There are several references you can read that explain things in much more detail than I have here.  For the GHC Haskell compiler you can read https://ghc.haskell.org/trac/ghc... but I warn you: it has so many details that you may get lost.

It's probably much more valuable for you to study the graph-reduction evaluation model as implemented by the Spineless Tagless G-machine, which (in a more modern form) is what GHC does.  The old paper is Implementing Lazy Functional Languages on Stock Hardware: The Spineless Tagless G-machine (the link to the actual paper as a zipped postscript is on the right).  It describes not only the evaluation engine but also what is needed in the data representation.

For better understanding on how persistent data can have very good performance you may want to read Chris Okasaki's "Purely Functional Data Structures".
According to An intro to lists - Learn You a Haskell:

When you put together two lists (even if you append a singleton list to a list, for instance: [1,2,3] ++ [4]), internally, Haskell has to walk through the whole list on the left side of ++.  That's not a problem when dealing with lists that aren't too big.  But putting something at the end of a list that's fifty million entries long is going to take a while.  However, putting something at the beginning of a list using the : operator (also called the cons operator) is instantaneous.
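The asymmetry in the quote can be seen directly (small example lists for illustration): `:` allocates one cell, while `++` must rebuild every cell of its left argument.

```haskell
main :: IO ()
main = do
  print (0 : [1, 2, 3])     -- O(1): one new cons cell
  print ([1, 2, 3] ++ [4])  -- O(n) in the left list: its whole
                            -- spine is copied to reach the end
```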
Tae Lim Kook
If you have a small data structure, you can just copy it.

If you have a larger structure, then split it into smaller parts so changes are localized.

For your problem of appending an element to the list, you can try using doubly-linked lists.
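As an alternative to the doubly-linked-list suggestion above (which is hard to share structurally, as noted in the first answer), `Data.Sequence` from the standard containers package is a purely functional structure with O(1) appends at both ends:

```haskell
import Data.Sequence (fromList, (|>))
import Data.Foldable (toList)

main :: IO ()
main = print (toList (fromList [1, 2, 3] |> 4))
-- (|>) appends on the right in O(1) amortized time,
-- with no mutation and full sharing of the old sequence
```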