Saturday, April 12, 2014

Climbing down the tree: framework and performance

Continuing our study on optimizing lexicographical comparison as used within binary search algorithms, we now provide a fuller description of the supporting C++ framework and measure actual performance with several test sequences in different environments.
Let Compare be a binary relation inducing a strict weak ordering on values of a type T, cmp a value of Compare and x a value of T. A binary search binder for cmp and x is any object cbind for which the expressions cbind.less(y) and cbind.greater(y) are valid whenever cmp(y,x) and cmp(x,y) are valid, respectively, and return the same boolean values as those expressions. Moreover, when the range being searched is sorted (i.e. not merely partitioned with respect to x), the implementation of a binary search binder can assume the following usage restrictions:
  • If either the invocation cbind.less(y) returns true or the invocation cbind.greater(y) returns false, subsequent invocations on cbind will be done against values not less than y under the ordering relation induced by Compare.
  • If either the invocation cbind.less(y) returns false or the invocation cbind.greater(y) returns true, subsequent invocations on cbind will be done against values not greater than y.
Standard C++ binary search algorithms such as lower_bound or binary_search can be implemented by replacing each usage of the given Compare object with the equivalent invocation of a single associated binder obtained through comp_bind, which is defined as:
template<typename Comp,typename T>
struct comp_binder
{
  comp_binder(Comp cmp,const T& x):cmp(cmp),x(x){}

  template<typename Q> bool less(const Q& y){return cmp(y,x);}
  template<typename Q> bool greater(const Q& y){return cmp(x,y);}

  Comp     cmp;
  const T& x;
};

template<typename Comp,typename T>
comp_binder<Comp,T> comp_bind(Comp cmp,const T& x)
{
  return comp_binder<Comp,T>(cmp,x);
}
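Since less and greater are member templates, a binder produced by comp_bind also works when the searched value and the elements it is compared against have different types. Here is a minimal sketch of such heterogeneous use (the function and variable names are illustrative only):

#include <functional>
#include <string>

void heterogeneous_example()
{
  std::string x="hello";

  // std::less<void>, the C++14 transparent comparator, accepts
  // mixed argument types; the binder simply forwards to it.
  auto cbind=comp_bind(std::less<>(),x);

  const char* y="abc";
  bool b1=cbind.less(y);    // ~ std::less<>()(y,x)
  bool b2=cbind.greater(y); // ~ std::less<>()(x,y)
  (void)b1;(void)b2;
}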
(Note that comp_binder is compatible even with polymorphic comparison types such as C++14 transparent operator functors, as the sketch above illustrates.) While remaining backwards compatible, this framework introduces new customization opportunities via specialization of comp_binder; this is precisely what prefix_less does:
template<typename T>
struct prefix_less:std::less<T>
{
};

template<typename T>
struct comp_binder<prefix_less<T>,T>
{
  bool less(const T& y);
  bool greater(const T& y);
};
T must be either an instantiation of std::basic_string or a container with lexicographical less-than semantics. prefix_less<T> is functionally equivalent to std::less<T>, but its associated binder takes advantage of the special binary search context where it is being used to provide better algorithmic complexity. The actual implementation can be consulted in the test program provided below. From the point of view of the end user, prefix_less works exactly as std::less:
std::vector<std::string> v;
...
auto it=std::lower_bound(
  v.begin(),v.end(),x,prefix_less<std::string>());
except that, if the algorithm internally uses comp_bind, the resulting performance might improve (or degrade); we will now see how the real thing behaves.
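Before getting to the numbers, here is what a comp_bind-compliant binary_search might look like: a sketch mirroring the lower_bound rewrite of the previous entry, not necessarily the exact code used in the measurements:

#include <iterator>

template<typename ForwardIterator,typename T,typename Compare>
bool binary_search(
  ForwardIterator first,ForwardIterator last,
  const T& x,Compare comp)
{
  auto cbind=comp_bind(comp,x);
  typename std::iterator_traits<
    ForwardIterator>::difference_type count,step;
  count=std::distance(first,last);
  while(count>0){          // usual lower_bound descent, via the binder
    ForwardIterator it=first;
    step=count/2;
    std::advance(it,step);
    if(cbind.less(*it)){   // ~ comp(*it,x)
      first=++it;
      count-=step+1;
    }
    else count=step;
  }
  // ~ !comp(x,*first): the lower bound, if any, is equivalent to x
  return first!=last&&!cbind.greater(*first);
}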
Measurements
I have written a program that compares the execution speeds of std::less and prefix_less with comp_bind-compliant versions of lower_bound and binary_search under the following testing scenarios:
template<typename Sequence,typename Compare>
void run_lower_bound(const Sequence& s,Compare cmp)
{
  for(const auto& x:s){
    lower_bound(s.begin(),s.end(),x,cmp);
  }
}

template<typename Sequence,typename Compare>
void run_binary_search(const Sequence& s,Compare cmp)
{
  for(const auto& x:s){
    binary_search(s.begin(),s.end(),x,cmp);
  }
}
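Timing can be done with a plain <chrono> harness; the following helper is a hypothetical sketch (measure_us_per_element is not part of the actual test program) of how the microseconds/element figures reported below may be obtained:

#include <chrono>
#include <cstddef>

template<typename F>
double measure_us_per_element(F f,std::size_t n,int reps=100)
{
  using clock=std::chrono::high_resolution_clock;
  auto t0=clock::now();
  for(int i=0;i<reps;++i)f(); // run the scenario reps times
  auto t1=clock::now();
  return std::chrono::duration<double,std::micro>(
    t1-t0).count()/(double(reps)*n);
}

A call such as measure_us_per_element([&]{run_lower_bound(s,cmp);},s.size()) then yields one data point.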
An assortment of sequences is tested (in what follows S_{N,L} denotes the sorted range of the strings of length L composed with the N characters '0', '1', ...; for instance, S_{2,3} is the range ["000", "001", "010", "011", "100", "101", "110", "111"]):
  • S_{2,4} (size: 16), S_{2,10} (size: 1,024), S_{2,20} (size: ~1,000,000), S_{10,4} (size: 10,000), S_{10,6} (size: 1,000,000), S_{20,4} (size: 160,000),
  • The sorted set of ~23,000 unique words within Cervantes' Don Quijote de la Mancha
Several representation types are also used to encode the elements of these sequences:
  • std::string,
  • std::wstring,
  • std::vector<udchar>, where udchar is a simple wrapper around char (see the sketch after this list),
  • std::vector<std::string>, i.e. each character is encoded as a separate string of length 1.
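For reference, udchar can be as simple as the following sketch (the exact definition is in the test program; the relevant point is that, unlike char, it is not a memcmp-compatible type, so the library cannot fall back on std::memcmp):

struct udchar
{
  udchar(){}
  udchar(char c):c(c){}

  bool operator==(const udchar& y)const{return c==y.c;}
  bool operator< (const udchar& y)const{return c<y.c;}

  char c;
};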
MSVC
Tests were built with Microsoft Visual Studio 2012 using default release mode settings and run on a Windows box with an Intel Core i5-2520M CPU @2.50GHz. Depicted values are microseconds/n, where n is the number of elements in the sequence tested.
[Figures: execution times for std::string, std::wstring, std::vector<udchar> and std::vector<std::string>.]
It is hard to beat the default implementation of lexicographical comparison for a type as simple as std::string, where comparison usually reduces to a call to the extremely efficient std::memcmp; moreover, the theoretical gain of prefix_less is dwarfed by the constant bookkeeping cost of tracking the length of the discarded common prefix. With these considerations in mind, prefix_less performs only marginally better than std::less when the average string length is small and the representation type is simple, and excels in the opposite situations. std::vector<udchar> is an anomaly for which I do not have a clear explanation: since udchar is not a memcmp-compatible type, one would expect prefix_less to behave unconditionally better than std::less, just as with std::vector<std::string>, but in fact this is not the case; maybe prefix_less meets some inlining threshold that the simpler code of std::less manages to avoid.
GCC
Thomas Goyne has built the tests with GCC 4.9 20140330, release settings -O3 -std=c++11, and run them on an OS X 10.9.1 box with an Intel Core i7-3720QM @2.6GHz.
[Figures: execution times for std::string, std::wstring, std::vector<udchar> and std::vector<std::string>.]
Unlike with MSVC, here prefix_less performs consistently better than std::less except for S_{2,4}, where string length is too small to compensate for the extra constant cost incurred by prefix_less. Reductions in execution times range from 10% to 50%.
Clang
Results also provided by Thomas Goyne: Clang 503.0.38 (3.4svn) with libc++, release settings -O3 -std=c++11, same OS X machine as with GCC.
[Figures: execution times for std::string, std::wstring, std::vector<udchar> and std::vector<std::string>.]
Performance gains are similar to those obtained with GCC, except when the representation type is std::vector<udchar>, in which case std::less is generally faster, again for reasons I cannot satisfactorily explain.
Conclusions
C++ binary search algorithms can be extended in a backwards-compatible way to support mechanisms for lexicographical comparison with lower complexity and, generally speaking, better performance than regular std::less instantiations provide. Although we have focused on the comparison of containers and strings, an additional category of entities, namely tuples and similar types, could also benefit from this framework: this we might explore in future entries.

Saturday, April 5, 2014

Climbing down the tree

Within the execution of binary search through a range of strings, the complexity of lexicographical comparison, which in the general case is almost independent of the length of the strings, degrades to O(length of strings). We want to devise a mechanism that restores the efficiency of lexicographical comparison by feeding contextual information about the binary search execution back into the comparison routine.
Binary search can be regarded as the process of climbing down a binary tree whose nodes are the elements of the sorted range being searched through.
Say we are looking for a given value x and denote by x_1, x_2, ... the sequence of elements compared against x during the process. The following is a fundamental property of partition-based algorithms such as binary search:
Property. For each x_i visited during the process,
  • if x < x_i then x_j < x_i for all j > i,
  • if x > x_i then x_j > x_i for all j > i.
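The property is easy to check empirically; the following sketch (illustrative code, not part of the framework) records the elements std::lower_bound compares against x and asserts the two implications above:

#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

int main()
{
  std::vector<int> s={1,2,3,5,8,13,21,34};
  int x=13;
  std::vector<int> visited;
  std::lower_bound(s.begin(),s.end(),x,
    [&](int y,int z){visited.push_back(y);return y<z;}); // y is x_i
  for(std::size_t i=0;i<visited.size();++i){
    for(std::size_t j=i+1;j<visited.size();++j){
      if(x<visited[i])assert(visited[j]<visited[i]);
      if(x>visited[i])assert(visited[j]>visited[i]);
    }
  }
}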
So the search zeroes in on x by fencing it into a narrowing interval [l_i, u_i] ("l" and "u" stand for lower and upper bound, respectively) that is implicitly calculated as
  • [l_0, u_0] = [smallest string, largest string],
  • if x < x_i then [l_i, u_i] = [l_{i-1}, x_i],
  • if x > x_i then [l_i, u_i] = [x_i, u_{i-1}].
The key observation in connection with lexicographical comparison is this: if, at stage i, l_i and u_i are strings sharing a common prefix p_i, then all subsequent comparisons will be done against strings beginning with this same prefix (for instance, if l_i = "abcd" and u_i = "abcz", every string visited from then on starts with "abc"). An intelligent exploitation of this fact spares us the need to re-check the prefix, so that comparison can start at mid-string position.
Let us put the idea in C++. The canonical implementation of std::lower_bound looks like this:
template<typename ForwardIterator,typename T,typename Compare>
ForwardIterator lower_bound(
  ForwardIterator first,ForwardIterator last,
  const T& x,Compare comp)
{
  ForwardIterator it;
  typename std::iterator_traits<
    ForwardIterator>::difference_type count,step;
  count=std::distance(first,last);
  while(count>0){
    it=first;
    step=count/2;
    std::advance(it,step);
    if(comp(*it,x)){
      first=++it;
      count-=step+1;
    }
    else count=step;
  }
  return first;
}
where comp is a functor comparing T values for inequality (typically std::less<T>). This interface does not serve our purposes: when comp(x,y) is invoked, nothing indicates which argument is the element being searched for and which is the range element it is compared against. So let us tweak the code a bit to make this information explicit:
{
  ...
  auto cbind=comp_bind(comp,x);
  while(count>0){
    it=first;
    step=count/2;
    std::advance(it,step);
    if(cbind.less(*it)){
      first=++it;
      count-=step+1;
    }
    else count=step;
  }
  return first;
}
We will not delve now into the implementation of the framework comp_bind relies on: suffice it to say that it passes x to a comparison object with the following interface: 
template<typename Comp,typename Value>
struct comp_binder
{
  comp_binder(const Comp& cmp,const Value& x);

  bool less(const Value& y);    // ~ cmp(y,x)
  bool greater(const Value& y); // ~ cmp(x,y)
};
This is then a mere change of interface with respect to that of std::less, but one that allows us to keep relevant contextual information on the execution of lower_bound within the comparison object itself, information that an optimized lexicographical comparison object can take advantage of:
template<typename T>
struct comp_binder<prefix_less<T>,T>
{
  comp_binder(const prefix_less<T>&,const T& x):
    x(x),pref_left(0),pref_right(0)
  {}

  bool less(const T& y)
  {
    auto n=std::min(pref_left,pref_right);
    auto m=std::min(x.size(),y.size());
    auto it1=x.begin()+n,it2=y.begin()+n;
    for(;n!=m&&*it1==*it2;++n,++it1,++it2);
    return (n!=m?*it2<*it1:n<x.size())?
      (pref_left=n,true):
      (pref_right=n,false);
  }

  bool greater(const T& y); // similar to less

  const T&              x;
  typename T::size_type pref_left,pref_right;
};
The object keeps in pref_left the length of the common prefix between x and the latest lower-bounding element, and similarly in pref_right for upper-bounding elements. By the fundamental property of partition-based algorithms, further comparisons are guaranteed to be done only against strings sharing with x a common prefix of length ≥ min{pref_left, pref_right}, a fact we use to shorten the checking loop.
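greater, declared above as similar to less, can be completed by swapping the roles of the two bounds; a possible out-of-class definition (a sketch consistent with the class above, not necessarily the test program's exact code) would be:

template<typename T>
bool comp_binder<prefix_less<T>,T>::greater(const T& y)
{
  auto n=std::min(pref_left,pref_right); // prefix known to match
  auto m=std::min(x.size(),y.size());
  auto it1=x.begin()+n,it2=y.begin()+n;
  for(;n!=m&&*it1==*it2;++n,++it1,++it2);
  // greater(y) ~ cmp(x,y): a true result makes y the new upper
  // bound (update pref_right), a false one the new lower bound
  return (n!=m?*it1<*it2:n<y.size())?
    (pref_right=n,true):
    (pref_left=n,false);
}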
I have put together an instrumented version of this scheme in a test program that calculates the resulting complexity of optimized lexicographical comparison for the sorted range of the N^L strings of length L constructible with an alphabet of N symbols, denoted in what follows by E[C^opt_{N,L}]:
[Figure: E[C^opt_{N,L}] as a function of L.]
Remember that E[C_{N,L}] is the average number of symbols checked when strings are randomly extracted from the range and E[C'_{N,L}] is the corresponding value when comparison happens within binary search. We then have

E[C_{N,L}] ≤ E[C^opt_{N,L}] ≤ E[C'_{N,L}],
E[C^opt_{N,L}] is bounded as N, L → ∞

(we won't prove this); so we have restored O(1) complexity for lexicographical comparison, though the figures are still around two times higher than in the "free-floating" scenario.
In a later entry we will elaborate on the definition of comp_bind to support a backwards-compatible framework for less-based standard binary search algorithms and will also measure actual execution times with and without this optimization.

Wednesday, April 2, 2014

Complexity of lexicographical comparison: part II

Let S_{N,L} be the equiprobable sample space of strings of length L ≥ 1 over an alphabet A of N ≥ 2 symbols. For instance, S_{2,2} with A = {a, b} (which particular symbols A consists of is immaterial to our discussion) runs over the strings
aa, ab, ba, bb,
each with probability 1/4. According to the representation model introduced in an earlier entry, S_{N,L} is characterized by
L(n) = 0 for n < L,
L(n) = 1 for n ≥ L,
T(p,q) = (1 − T(p,0))/N for p ∈ A*, q ∈ A,
and the average number of steps it takes to lexicographically compare two independent outcomes of S, which we proved for the general case to be

E[C] = Σ_{n≥0} (1/N)^n (1 − L(n−1))^2,

reduces here to

E[C_{N,L}] = Σ_{0≤n≤L} (1/N)^n = (N^{L+1} − 1)/(N^{L+1} − N^L),
tending to N/(N − 1) as L → ∞. The figure shows E[C_{N,L}] as a function of L for various values of N.
[Figure: E[C_{N,L}] as a function of L.]
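For completeness, the closed form above is a straightforward geometric sum; in LaTeX notation:

\sum_{0\le n\le L}\left(\frac{1}{N}\right)^{n}
  =\frac{1-N^{-(L+1)}}{1-1/N}
  =\frac{N^{L+1}-1}{N^{L+1}-N^{L}}
  \;\longrightarrow\;\frac{N}{N-1}\quad(L\to\infty).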
But lexicographical comparison does not perform so well in other, very common scenarios. Suppose we form a sequence s = (s_1, ..., s_{N^L}) with the sorted values of S_{N,L} and perform a binary search on it for some s_i extracted at random:

bs(s_i, s).

This operation does approximately log_2(N^L) = L·log_2 N comparisons between strings in s. We now want to calculate the average number of steps (i.e. symbols checked) these comparisons take, which we denote by E[C'_{N,L}]. A simple C++ program exercising std::lower_bound helps us obtain the figures:
[Figure: E[C'_{N,L}] as a function of L.]
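Such a program can, for instance, wrap the comparator so that it tallies the symbols it inspects; a hypothetical sketch (counting_less is illustrative, not the actual instrumentation used):

#include <algorithm>
#include <cstddef>
#include <string>

struct counting_less
{
  counting_less(unsigned long long& steps):steps(steps){}

  bool operator()(const std::string& a,const std::string& b)const
  {
    std::size_t m=std::min(a.size(),b.size()),i=0;
    while(i<m&&a[i]==b[i])++i;
    steps+=i+1; // symbols examined, counting the deciding position
    return i<m?a[i]<b[i]:a.size()<b.size();
  }

  unsigned long long& steps; // shared by the copies algorithms make
};

Averaging steps over many std::lower_bound calls with a counting_less comparator gives the estimates of E[C'_{N,L}] plotted above.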
E[C'_{N,L}] is indeed different from E[C_{N,L}] and in fact grows linearly with the length of the strings in s. The reason is that the algorithm for searching s_i iteratively touches on strings more and more similar to s_i, that is, sharing an increasingly long common prefix with s_i, which imposes a penalty on lexicographical comparison. We can make a crude estimate of E[C'_{N,L}]: as each step of binary search gains one extra bit of information, the common prefix grows by 1/(log_2 N) symbols per step, yielding an average common prefix length per comparison of

(Σ_{1 ≤ n ≤ L·log_2 N − 1} n/(log_2 N)) / (L·log_2 N) = (1/2)(L − 1/(log_2 N)),

to which an additional term 1 ≤ c < N/(N − 1) must be added, accounting for the comparison of the remaining string suffixes.
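Spelling this estimate out under the stated crude assumptions (M = L·log_2 N comparison steps, common prefix of n/log_2 N symbols at the n-th step), in LaTeX notation:

\frac{1}{M}\sum_{n=1}^{M-1}\frac{n}{\log_2 N}
  =\frac{(M-1)M}{2M\log_2 N}
  =\frac{L\log_2 N-1}{2\log_2 N}
  =\frac{1}{2}\left(L-\frac{1}{\log_2 N}\right),
\qquad M=L\log_2 N.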
Lexicographical comparison as used in binary searching is then a rather inefficient O(length of strings). We will see in a later entry how to improve complexity by incorporating contextual information to the execution of the search algorithm.