2013-05-10

Simple memory allocations

The great convenience of standard collections for a developer is that they free him/her from extra efforts on managing underlying memory allocations.
It’s so much convenient to just write
{
    std::vector v(n);
    for (int i = 0; i < n; ++i)
        v[i] = i;
}

or likewise for OCC fixed size array:
{
    TColStd_Array1OfReal a (0, n-1);
    for (int i = 0; i < n; ++i)
        v(i) = sqrt (i);
}

and be done. Who cares where the memory gets allocated for underlying elements ? Not a big deal, right ?

Well, it can become a big deal if you have to work with multiple allocations/deallocations which become the hotspots.

I was recently improving thread-safety of a few core Open CASCADE algorithms: approximation, intersection, etc (see #23952) and observed an interesting pattern that motivated posting this blog.

Some OCC developer(s) realized in the past that memory allocations can be expensive in hot cycles and worked around this by using allocation in static memory. Excerpt from ApproxInt_ImpPrmSvSurfaces::Compute() (ApproxInt_ImpPrmSvSurfaces.gxx):

static math_Vector BornInf(1,2),BornSup(1,2),F(1,1),
  X(1,2), Tolerance(1,2);
static math_Matrix D(1, 1, 1, 2);


This would result in a single allocation (upon first entry into this function) which would stay until the program termination.

Obviously that does not work in multi-threaded environment. This would create data races – as two or more threads concurrently calling the same method – and results would be unpredictable (wrong and/or unstable results, sporadic crashes, etc).

Substituting with local variables
math_Vector BornInf(1,2),BornSup(1,2),F(1,1),
  X(1,2),Tolerance(1,2);
math_Matrix D(1, 1, 1, 2);


would resolve the reentrancy problem but would create another one – continuous memory allocation/deallocation would undermine performance as each constructor and destructor would call new[] and delete[] operators.

The good news is that Open CASCADE provides constructors for math_* and TCollection_Array* classes that accept an extra parameter – an address of allocated memory. In this case they simply reuse that memory and only provide access to it. Neither do they attempt to free it upon own destruction. This provides a possibility to allocate data on stack (which has essentially zero performance cost) and make respective objects use it, preserving source compatibility with the rest of the code.

Here is a fix that uses that:
  Standard_Real aBornInf[2],aBornSup[2],aF[1],aX[2],
    aTolerance[2];
  math_Vector BornInf(aBornInf,1,2),BornSup(aBornSup,1,2),
    F(aF,1,1), X(aX,1,2),Tolerance(aTolerance,1,2);
  Standard_Real aD[1][2];
  math_Matrix D(aD,1, 1, 1, 2);  


You might want to take advantage of this simple technique whenever dealing with small objects and knowing the required sizes in advance.

2012-11-22

Standard allocator. Part 2

(continued)

Having reviewed available NCollection allocators, we can create a C++ standard compliant allocator that would wrap them.

Please have a look at the NCollection_StdAllocator.hxx which is available in the git repository, branch CR23569. Here is a corresponding tracker report.

You can feed it with any instance of NCollection_BaseAllocator and then provide it to any STL container, for instance:

    Handle(NCollection_IncAllocator) anIncAlloc = new NCollection_IncAllocator;
    NCollection_StdAllocator aSAlloc (anIncAlloc);
    std::list > aL (aSAlloc);
    TopoDS_Solid aSolid = BRepPrimAPI_MakeBox (10., 20., 30.);
    aL.push_back (aSolid);

Default constructor uses NCollection_BaseAllocator::CommonBaseAllocator() and thus redirects all allocation calls to Open CASCADE central allocator interface. The following will do exactly this:
    std::list > aL2;

This makes NCollection_StdAllocator a superset of Standard_StdAllocator suggested in Part1 and thus self-sufficient.

Moreover, as allocators for one object type can be casted to another type (part of the C++ standard requirement), you might end up using a single type:

typedef NCollection_StdAllocator AllocatorType;

AllocatorType anAlloc (new NCollection_IncAllocator);
std::list aL3 (anAlloc);
std::vector aV (anAlloc);

and also:
typedef std::basic_string, AllocatorType> StringType;
StringType aS1; //defaults to central allocator interface
StringType aS2 (anAlloc); //uses pool allocator
StringType aS3 (AllocatorType (NCollection_HeapAllocator::GlobalHeapAllocator())); //directly uses OS allocator and is effectively equivalent to std::string


Now, with NCollection_StdAllocator you can take advantage of available Open CASCADE allocators and use STL, Boost or any other containers or template classes that accept standard allocator. By the way, this flexibility enabled me to start gradually moving from NCollection to Boost, but that's a subject for a separate post ;-)...

NCollection allocators

The previous post has introduced a C++ standard compliant allocator that wraps Open CASCADE allocator interface (Standard::Allocate(), ::Free(), ::Reallocate()) and thus can be used in any containers or other template classes that accept such allocator.

Before we move forward to consider another use case of standard allocator, let's make a step aside and review another allocator flavors offered by Open CASCADE.

Those of you who happened to use templates from NCollection or just attentively read the source code employing it might notice NCollection_BaseAllocator base class. NCollection containers sort of tried to mimic interfaces of STL containers and as such introduced allocator in its inception in early 2000es. However, unlike STL where allocator is part of the type (e.g. std::list is a different type than std::list) NCollection allows to dynamically supply an allocator, for instance:

NCollection_List aVec1, aVec2 (new NCollection_IncAllocator);

There are three types of allocator in NCollection:
-    NCollection_BaseAllocator, which is a base class that also implements a wrapper over Standard::Allocate() and ::Free(). There is a singletone returned by NCollection_BaseAllocator::CommonBaseAllocator().
-    NCollection_HeapAllocator, which wraps standard OS allocator (malloc/free). Similar to a base class there is a singletone returned by NCollection_HeapAllocator::GlobalHeapAllocator().
-    NCollection_IncAllocator, which implements a pool allocator.

The former two are straightforward and probably only NCollection_IncAllocator deserves some details.

Pool allocators (see for instance Wikipedia) are most helpful to manage temporary objects, especially when their destruction can be deferred to some common point in time, e.g. completion of an algorithm that created them. Implementation of a pool allocator is very straightforward – it preallocates a large chunk of memory (e.g. from a system heap or another allocator), and whenever there is a new temporary object to be allocated, just takes a required size from that large chunk's head pointer and shifts that pointer. Deallocation is void, i.e. no memory gets really freed after the object destructor has been called. Real deallocation of large chunk(s) happens upon own pool allocator destruction.

This simple implementation allows to greatly improve performance when dealing with numerous temporary objects due to avoiding fragmentation and/or complicated techniques to manage individual smaller memory chunks.

Here is a usage example which creates a temporary hash table:
class PartnerShapeHasher
{
public:
    static Standard_Integer HashCode (const TopoDS_Shape& theKey, const Standard_Integer theUpper)
    {
        return ::HashCode (theKey.TShape(), theUpper);
    }
    static Standard_Boolean IsEqual(const TopoDS_Shape& theKey1, const TopoDS_Shape& theKey2)
    {
        return theKey1.IsPartner (theKey2);
    }
};

//! Returns a number of partner subshapes of a given type
/*! A partner shape is a shared instance regardless of a location (transformation matrix).*/
static size_t NumberOfPartnerSubshapes (const TopoDS_Shape& theShape, const TopAbs_ShapeEnum theType)
{
    NCollection_Map aMap (1 /*default number of buckets*/, new NCollection_IncAllocator);
    for (TopExp_Explorer anExp (theShape, theType); anExp.More(); anExp.Next()) {
        const TopoDS_Shape& aSubshape = anExp.Current();
        aMap.Add (aSubshape);
    }
    return aMap.Size();
}

If your temporary objects' life span is not necessarily aligned with the end of the algorithm scope (e.g. some objects are destroyed earlier) then you could either have a separate pool allocator for such objects or just accept a trade-off of temporary larger peak memory footprint.

NCollection_IncAllocator's constructor accepts a parameter theBlockSize which specifies a size of large memory chunk(s) preallocated from the heap. I ended up with having a few predefined sizes via the following enumeration:

enum NCollection_IncAllocator_Size {
    NCollection_IncAllocator_Tiny = 1000,
    NCollection_IncAllocator_Small = 4000,
    NCollection_IncAllocator_Medium = 8000,
    NCollection_IncAllocator_Large = 16000,
    NCollection_IncAllocator_Huge = 32000
};

and use respective value depending on the most probable case, e.g.:
Handle(NCollection_BaseAllocator) anAlloc = new NCollection_IncAllocator (NCollection_IncAllocator_Tiny);
NCollection_List aList (anAlloc);

This should help avoid extra fragmentatoin in the OS allocator due to reuse of the same size blocks over and over again, yet providing a hint instead of default constructor parameter (around 26K).

If the allocator is requested to allocate an object of a larger size than the constructor's theBlockSize parameter then a new larger chunk is allocated. Whenever a new large chunk is allocated then the remaining unused free bytes in a previous chunk are wasted. So some attention should be made to minimize unnecessary waste, if it can be substantial.

Another point to remember is that NCollection_IncAllocator is not thread-safe (though reentrant), so access to the same instance should be synchronized outside, if it is used from multiple threads. For thread-safe scalable implementations you might want to check Intel Threading Building Blocks memory pools.

Now with the review of NCollection allocator completed, let's get back to the standard interface.

2011-11-24

Standard allocator interface.

As you know, Open CASCADE has its own memory allocation mechanism, which entry points are Standard::Allocate() and Standard::Free(). They forward (de-)allocation requests to a current memory manager, which can either be a). default system allocator (if environment variable MMGT_OPT=0), b). own Open CASCADE's (MMGT_OPT=1), or c). Intel TBB (MMGT_OPT=2).

Most Open CASCADE classes (e.g. all which are defined in .cdl files) redefine operators new and delete to use Standard::Allocate() and Standard::Free() respectively – look at any .hxx file.

Using common memory allocation mechanism allows to decrease memory footprint and/or increase performance, especially when using TBB allocator (in multi-threaded apps). However, you may easily miss this advantage in your application in the following typical cases:
1. You dynamically allocate objects of your classes and their new/delete operators do not use Standard::Allocate()/::Free().
2. You use standard containers (std::vector, std::map, std::list,...).
#1 is easily addressed by redefining new/delete operators in your base class(es) in a way similar to OCC.
#2 is more tough, as standard collections by default use standard allocator (std::allocator<T>). So all auxiliary elements (e.g. list nodes) are allocated outside of OCC-governed memory allocator.

This was until now. Hereby I would like to share a code that implements the standard allocator interface as defined by the ISO C++ Standard 2003 (and also conforming to a recently published C++11). Download Standard_StdAllocator.hxx.

This is a pure header implementation, so no .cxx files and linking with libraries is needed; just copy to your project and start using. Implementation is derived from Intel tbb::tbb_allocator, which itself just follows the C++ standard requirements.

The file is intentionally made copyright-free and put into a Public domain, the most permissive way. I also hope that the OCC team will be willing to pick up this file and integrate into OCC.

The simplest example can be as follows:

typedef Standard_StdAllocator<void> allocator_type;

std::vector<TopoDS_Shape, allocator_type> aSVec;
TopoDS_Solid aSolid = BRepPrimAPI_MakeBox (10., 20., 30.);
aSVec.push_back (aSolid);

std::list<int_Shape, allocator_type> aList;
aList.push_back (1);


There is also a unit test (download Standard_StdAllocatorTest.cxx). The code is based on Qt test framework and should be self-explaining to port to any other test framework.

Those who are curious enough may offer similar standard allocator implementation for NCollection_BaseAllocator which is used in NCollection containers. It is as straightforward as this one.