Deep copying data structures is surprisingly undocumented.

Definition

First, let us define a deep copy cpy of an object orig: Any actions on cpy must have the same results as on orig and actions on cpy must not change the state of orig.

This means, deep copy must consider the graph-shape of references. If there are multiple references to an identical object within orig, the references in cpy must point to an identical object as well. This requirement includes the problem of cyclic data structures.

A deep copy can simply return its argument if the object is immutable, since the state of immutable data never changes.

Requirements

There are usual more practical requirements. For example, it should work for everything by default, since it becomes tedious to implement a copy method for each and every class. Nevertheless, the copy operation should fail for specific classes, which cannot be copied, like Lock, Thread, or singletons.

Some people think additional modifications should be possible, for example converting a LinkedList into an ArrayList. While this fulfills the definition above, a linked list behaves differently than an array. Random access becomes more expensive, while adding elements becomes cheaper.

Algorithm

The basic algorithm of deep copy must remember already copied objects for the multiple-references-to-identical-object requirement. Let a map be an additional argument to deep copy and initialize it as an empty map.

deepCopy(Object orig, Map<int,Object> m)
  id := orig.hashCode()
  if id in m, return m[id]
  cpy := alloc(sizeof(orig))
  m[id] := cpy
  for each (non static) field f of orig:
    cpy.f := deepCopy(orig.f, m)
  return cpy

For native data types the field is copied by value. References to other objects are handled recursively.

Approaches

To implement a deep copy operation, various approaches are possible.

In dynamically typed languages like Python a reflection-based method is a good solution. Using runtime type information you can easily get a list of all fields from an object. It is possible in a language like Java as well. However, Java prominently provides a shallow copy through Object.clone, which can be implemented as a memcpy.

A second variant, which is often recommended in Java due to its (relative) simplicity, is to rely on the serialization mechanisms. Serializing and deserializing provides you with a legitimate deep copy. The algorithm of serialization is mostly the same, as it needs a map as well. Performance suffers of course.

As a third option, a compiler could generate this loop (in an unrolled form) and provide an internal fieldCopy method like this:

class MyObj
  int foo
  static MyObj bar
  MyObj baz

MyObj::fieldCopy(MyObj cpy, Map<int,Object> m)
  cpy.foo = this.foo                // native type
                                    // 'bar' is static
  cpy.baz = deepCopy(this.baz, m)   // reference

The literature for this approach is not found under deep copying, but marshalling or serialization in high-performance computing.

Constructor Behavior

Obviously, creating a copy constructs a new object. Therefore, the operation should take place in a constructor. Just think about final fields, which can only be modified in constructors. So, whatever implementation approach we take it should be implemented in the copy constructor. It is unspecified whether the copy constructor should create a shallow or a deep copy. Hence, the copy constructor should actually take an additional argument to distinguish between shallow and deep copy. For a different behavior (neither shallow nor deep), the programmer is free to implement a copy constructor without an additional argument.

© 2012-06-21