jemalloc.git - jemalloc is a general purpose malloc(3) implementation that emphasizes fragmentation avoidance and scalable concurrency support.

	Commit message	Author	Age	Files	Lines
*	Lookup extent once per time during tcache_flush_small / _large.	Qi Wang	2017-03-28	1	-14/+28
\| \| \| \|	Cache the extents on the stack to avoid redundant lookup overhead.
*	Move arena_slab_data_t's nfree into extent_t's e_bits.	Jason Evans	2017-03-28	2	-20/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Compact extent_t to 128 bytes on 64-bit systems by moving arena_slab_data_t's nfree into extent_t's e_bits. Cacheline-align extent_t structures so that they always cross the minimum number of cacheline boundaries. Re-order extent_t fields such that all fields except the slab bitmap (and overlaid heap profiling context pointer) are in the first cacheline. This resolves #461.
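As a rough illustration of what moving a small counter into a packed bit word involves, the sketch below stores a free-region count in a bit range of a 64-bit value. The field position and width (`NFREE_SHIFT`, `NFREE_WIDTH`) and the function names are hypothetical; the commit message does not state the real layout of e_bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout: a 10-bit nfree field starting at bit 20.
 * The actual positions and widths used in extent_t's e_bits are
 * not given in the commit message.
 */
#define NFREE_SHIFT 20
#define NFREE_WIDTH 10
#define NFREE_MASK  (((UINT64_C(1) << NFREE_WIDTH) - 1) << NFREE_SHIFT)

/* Return the nfree value stored in the packed word. */
static inline unsigned
bits_nfree_get(uint64_t bits) {
	return (unsigned)((bits & NFREE_MASK) >> NFREE_SHIFT);
}

/* Return a copy of bits with the nfree field replaced; other bits are kept. */
static inline uint64_t
bits_nfree_set(uint64_t bits, unsigned nfree) {
	assert(nfree < (1U << NFREE_WIDTH));
	return (bits & ~NFREE_MASK) | ((uint64_t)nfree << NFREE_SHIFT);
}
```

Packing the counter this way removes a separate struct field, which is the kind of saving that lets the remaining fields fit in fewer cachelines.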
*	Remove BITMAP_USE_TREE.	Jason Evans	2017-03-27	1	-78/+0
\| \| \| \| \| \| \| \| \| \|	Remove tree-structured bitmap support, in order to reduce complexity and ease maintenance. No bitmaps larger than 512 bits have been necessary since before 4.0.0, and there is no current plan that would increase maximum bitmap size. Although tree-structured bitmaps were used on 32-bit platforms prior to this change, the overall benefits were questionable (higher metadata overhead, higher bitmap modification cost, marginally lower search cost).
*	Force inline ifree to avoid function call costs on fast path.	Qi Wang	2017-03-25	1	-2/+2
\| \| \| \| \|	Without ALWAYS_INLINE, sometimes ifree() gets compiled into its own function, which adds overhead on the fast path.
*	Use a bitmap in extents_t to speed up search.	Jason Evans	2017-03-25	1	-11/+30
\| \| \| \| \|	Rather than iteratively checking all sufficiently large heaps during search, maintain and use a bitmap in order to skip empty heaps.
*	Implement bitmap_ffu(), which finds the first unset bit.	Jason Evans	2017-03-25	2	-7/+22
\|
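A "find first unset" search over a flat bitmap might look like the sketch below. It is a simplified illustration, not jemalloc's bitmap_ffu(): the name `ffu_sketch`, its signature, and the 64-bit word layout are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the first unset (0) bit at position >= min_bit in a
 * flat bitmap of nbits bits, or nbits if every such bit is set.
 * Hypothetical helper; not jemalloc's bitmap_ffu() signature.
 */
static size_t
ffu_sketch(const uint64_t *bitmap, size_t nbits, size_t min_bit) {
	assert(min_bit <= nbits);
	size_t i = min_bit;
	while (i < nbits) {
		uint64_t word = bitmap[i / 64];
		if ((i % 64) == 0 && word == UINT64_MAX) {
			i += 64;	/* Whole word is set; skip it. */
			continue;
		}
		if ((word & (UINT64_C(1) << (i % 64))) == 0) {
			return i;
		}
		i++;
	}
	return nbits;
}
```

With one bit per heap, a search of this kind lets a caller jump directly to the next candidate index instead of probing each heap in turn.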
*	Use first fit layout policy instead of best fit.	Jason Evans	2017-03-25	1	-12/+42
\| \| \| \| \| \| \| \| \|	For extents which do not delay coalescing, use first fit layout policy rather than first-best fit layout policy. This packs extents toward older virtual memory mappings, but at the cost of higher search overhead in the common case. This resolves #711.
*	Profile per arena base mutex, instead of just a0.	Qi Wang	2017-03-23	2	-5/+6
\|
*	Refactor mutex profiling code with x-macros.	Qi Wang	2017-03-23	3	-210/+180
\|
*	Switch to nstime_t for the time related fields in mutex profiling.	Qi Wang	2017-03-23	2	-14/+16
\|
*	Added custom mutex spin.	Qi Wang	2017-03-23	1	-2/+14
\| \| \| \| \| \| \|	A fixed max spin count is used -- with benchmark results showing it solves almost all problems. As the benchmark used was rather intense, the upper bound could be a little bit high. However it should offer a good tradeoff between spinning and blocking.
*	Added extents_dirty / _muzzy mutexes, as well as decay_dirty / _muzzy.	Qi Wang	2017-03-23	2	-37/+55
\|
*	Added "stats.mutexes.reset" mallctl to reset all mutex stats.	Qi Wang	2017-03-23	4	-156/+215
\| \| \| \|	Also switched from the term "lock" to "mutex".
*	Added JSON output for lock stats.	Qi Wang	2017-03-23	2	-42/+116
\| \| \| \|	Also added option 'x' to malloc_stats() to bypass lock section.
*	Added lock profiling and output for global locks (ctl, prof and base).	Qi Wang	2017-03-23	4	-74/+155
\|
*	Add arena lock stats output.	Qi Wang	2017-03-23	4	-42/+246
\|
*	Output bin lock profiling results to malloc_stats.	Qi Wang	2017-03-23	4	-34/+80
\| \| \| \| \|	Two counters are included for the small bins: lock contention rate, and max lock waiting time.
*	First stage of mutex profiling.	Qi Wang	2017-03-23	1	-0/+43
\| \| \| \|	Switched to trylock and update counters based on state.
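The trylock-first pattern can be sketched as follows. A C11 `atomic_flag` stands in for the real mutex, and the type `prof_lock_t`, its counters, and the function names are hypothetical; the point is only that a failed first attempt is what marks an acquisition as contended.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical profiled lock; a test-and-set flag stands in for a mutex. */
typedef struct {
	atomic_flag locked;
	uint64_t n_lock_ops;	/* Total acquisitions. */
	uint64_t n_contended;	/* Acquisitions whose first attempt failed. */
} prof_lock_t;

/* Try to take the lock once; return true on success. */
static bool
prof_trylock(prof_lock_t *l) {
	return !atomic_flag_test_and_set_explicit(&l->locked,
	    memory_order_acquire);
}

static void
prof_lock(prof_lock_t *l) {
	bool contended = !prof_trylock(l);
	if (contended) {
		/* Slow path: keep trying until the lock is acquired. */
		while (!prof_trylock(l)) {
		}
	}
	/* The counters are only touched while the lock is held. */
	l->n_lock_ops++;
	if (contended) {
		l->n_contended++;
	}
}

static void
prof_unlock(prof_lock_t *l) {
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Because the uncontended path is a single trylock plus a plain increment, the bookkeeping adds little to the fast path.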
*	Push down iealloc() calls.	Jason Evans	2017-03-23	3	-120/+91
\| \| \| \| \|	Call iealloc() as deep into call chains as possible without causing redundant calls.
*	Remove extent dereferences from the deallocation fast paths.	Jason Evans	2017-03-23	6	-54/+34
\|
*	Remove extent arg from isalloc() and arena_salloc().	Jason Evans	2017-03-23	3	-21/+18
\|
*	Embed root node into rtree_t.	Jason Evans	2017-03-23	2	-71/+22
\| \| \| \|	This avoids one atomic operation per tree access.
*	Incorporate szind/slab into rtree leaves.	Jason Evans	2017-03-23	3	-55/+100
\| \| \| \| \| \|	Expand and restructure the rtree API such that all common operations can be achieved with minimal work, regardless of whether the rtree leaf fields are independent versus packed into a single atomic pointer.
*	Split rtree_elm_t into rtree_{node,leaf}_elm_t.	Jason Evans	2017-03-23	2	-140/+287
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This allows leaf elements to differ in size from internal node elements. In principle it would be more correct to use a different type for each level of the tree, but due to implementation details related to atomic operations, we use casts anyway, thus counteracting the value of additional type correctness. Furthermore, such a scheme would require function code generation (via cpp macros), as well as either unwieldy type names for leaves or type aliases, e.g. `typedef struct rtree_elm_d2_s rtree_leaf_elm_t;`. This alternate strategy would be more correct, and with less code duplication, but is probably not worth the complexity.
*	Remove binind field from arena_slab_data_t.	Jason Evans	2017-03-23	1	-5/+5
\| \| \| \| \|	binind is now redundant; the containing extent_t's szind field always provides the same value.
*	Convert extent_t's usize to szind.	Jason Evans	2017-03-23	5	-148/+139
\| \| \| \| \| \| \| \|	Rather than storing usize only for large (and prof-promoted) allocations, store the size class index for allocations that reside within the extent, such that the size class index is valid for all extents that contain extant allocations, and invalid otherwise (mainly to make debugging simpler).
*	Do not re-bind iarena when migrating between arenas.	Qi Wang	2017-03-21	1	-1/+0
\|
*	Refactor tcaches flush/destroy to reduce lock duration.	Jason Evans	2017-03-16	1	-6/+13
\| \| \| \|	Drop tcaches_mtx before calling tcache_destroy().
*	Propagate madvise() success/failure from pages_purge_lazy().	Jason Evans	2017-03-16	1	-3/+3
\|
*	Implement two-phase decay-based purging.	Jason Evans	2017-03-15	6	-244/+523
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Split decay-based purging into two phases, the first of which uses lazy purging to convert dirty pages to "muzzy", and the second of which uses forced purging, decommit, or unmapping to convert pages to clean or destroy them altogether. Not all operating systems support lazy purging, yet the application may provide extent hooks that implement lazy purging, so care must be taken to dynamically omit the first phase when necessary.
	The mallctl interfaces change as follows:
	- opt.decay_time --> opt.{dirty,muzzy}_decay_time
	- arena.<i>.decay_time --> arena.<i>.{dirty,muzzy}_decay_time
	- arenas.decay_time --> arenas.{dirty,muzzy}_decay_time
	- stats.arenas.<i>.pdirty --> stats.arenas.<i>.p{dirty,muzzy}
	- stats.arenas.<i>.{npurge,nmadvise,purged} --> stats.arenas.<i>.{dirty,muzzy}_{npurge,nmadvise,purged}
	This resolves #521.
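The two phases can be pictured as a small state machine. The sketch below is a simplified model, not the arena code: the enum, the function `purge_step`, and the `lazy_supported` flag are hypothetical names, and the "destroy altogether" outcome is folded into the clean state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical page states for the two-phase model. */
typedef enum {
	PAGES_DIRTY,
	PAGES_MUZZY,
	PAGES_CLEAN
} page_state_t;

/*
 * Advance pages by one purge phase. When lazy purging is unavailable, the
 * first phase is omitted and dirty pages go straight to the second (forced)
 * phase.
 */
static page_state_t
purge_step(page_state_t state, bool lazy_supported) {
	switch (state) {
	case PAGES_DIRTY:
		return lazy_supported ? PAGES_MUZZY : PAGES_CLEAN;
	case PAGES_MUZZY:
		return PAGES_CLEAN;
	default:
		return PAGES_CLEAN;
	}
}
```

In this model, each phase would run on its own decay schedule, which is why the single decay_time setting splits into dirty and muzzy variants.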
*	Move arena_t's purging field into arena_decay_t.	Jason Evans	2017-03-15	1	-5/+4
\|
*	Refactor decay-related function parametrization.	Jason Evans	2017-03-15	1	-86/+96
\| \| \| \| \| \| \|	Refactor most of the decay-related functions to take as parameters the decay_t and associated extents_t structures to operate on. This prepares for supporting both lazy and forced purging on different decay schedules.
*	Convert remaining arena_stats_t fields to atomics	David Goldblatt	2017-03-14	2	-47/+83
\| \| \| \| \| \| \|	These were all size_ts, so we have atomics support for them on all platforms, so the conversion is straightforward. Left non-atomic is curlextents, which AFAICT is not used atomically anywhere.
*	Switch atomic uint64_ts in arena_stats_t to C11 atomics	David Goldblatt	2017-03-14	2	-41/+99
\| \| \| \| \| \|	I expect this to be the trickiest conversion we will see, since we want atomics on 64-bit platforms, but are also always able to piggyback on some sort of external synchronization on non-64 bit platforms.
*	Prefer pages_purge_forced() over memset().	Jason Evans	2017-03-14	2	-16/+30
\| \| \| \| \| \|	This has the dual advantages of allowing for sparsely used large allocations, and relying on the kernel to supply zeroed pages, which tends to be very fast on modern systems.
*	Add alignment/size assertions to pages_*().	Jason Evans	2017-03-14	1	-0/+15
\| \| \| \| \|	These sanity checks prevent what otherwise might result in failed system calls and unintended fallback execution paths.
*	Fix pages_purge_forced() to discard pages on non-Linux systems.	Jason Evans	2017-03-14	1	-1/+8
\| \| \| \| \|	madvise(..., MADV_DONTNEED) only causes demand-zeroing on Linux, so fall back to overlaying a new mapping.
*	Convert rtree code to use C11 atomics	David Goldblatt	2017-03-13	1	-16/+34
\| \| \| \| \| \| \|	In the process, I changed the implementation of rtree_elm_acquire so that it won't even try to CAS if its initial read (getting the extent + lock bit) indicates that the CAS is doomed to fail. This can significantly improve performance under contention.
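The read-before-CAS idea can be sketched with standard C11 atomics. This is an illustration under assumptions, not the rtree code: the element is modeled as a pointer-sized word whose low bit is the lock bit, and `elm_try_acquire` / `elm_release` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding: the low bit of the element word is the lock bit. */
#define LOCK_BIT ((uintptr_t)1)

/* Make one attempt to set the lock bit; return true on success. */
static bool
elm_try_acquire(atomic_uintptr_t *elm) {
	uintptr_t cur = atomic_load_explicit(elm, memory_order_relaxed);
	if (cur & LOCK_BIT) {
		/* Already locked: a CAS expecting the unlocked value would fail. */
		return false;
	}
	return atomic_compare_exchange_strong_explicit(elm, &cur,
	    cur | LOCK_BIT, memory_order_acquire, memory_order_relaxed);
}

/* Clear the lock bit, leaving the remaining bits untouched. */
static void
elm_release(atomic_uintptr_t *elm) {
	atomic_fetch_and_explicit(elm, ~LOCK_BIT, memory_order_release);
}
```

Under contention, the plain load avoids taking the cacheline in exclusive mode for an operation that cannot succeed, which is the likely source of the reported speedup.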
*	Convert arena_t's purging field to non-atomic bool.	Jason Evans	2017-03-10	1	-4/+5
\| \| \| \|	The decay mutex already protects all accesses.
*	Implement per-CPU arena.	Qi Wang	2017-03-09	5	-29/+150
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The new feature, opt.percpu_arena, determines thread-arena association dynamically based on CPU id. Three modes are supported: "percpu", "phycpu" and disabled. "percpu" uses the current core id (with help from sched_getcpu()) directly as the arena index, while "phycpu" will assign threads on the same physical CPU to the same arena. In other words, "percpu" means # of arenas == # of CPUs, while "phycpu" has # of arenas == 1/2 * (# of CPUs). Note that no runtime check on whether hyperthreading is enabled has been added yet. When the feature is enabled, threads will be migrated between arenas when a CPU change is detected. In the current design, to reduce overhead from reading CPU id, each arena tracks the thread that accessed it most recently. When a new thread comes in, we will read CPU id and update arena if necessary.
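The mapping from CPU id to arena index might be sketched as below. The commit message only states the arena counts for each mode; the assumption that hyperthread siblings are numbered `cpu` and `cpu + ncpus/2` is made here purely for illustration, as are the enum and function names. The CPU id is passed in as a parameter rather than read with sched_getcpu().

```c
#include <assert.h>

/* Hypothetical mode enum mirroring the three documented settings. */
typedef enum {
	PERCPU_DISABLED,
	PERCPU,
	PHYCPU
} percpu_mode_t;

/*
 * Map a CPU id to an arena index, or return -1 when the feature is disabled.
 * Assumption (not from the commit message): with an even ncpus, logical CPUs
 * cpu and cpu + ncpus/2 share one physical core.
 */
static int
arena_index_for_cpu(percpu_mode_t mode, unsigned cpu, unsigned ncpus) {
	assert(cpu < ncpus);
	switch (mode) {
	case PERCPU:
		return (int)cpu;
	case PHYCPU: {
		unsigned half = ncpus / 2;
		return (int)(cpu >= half ? cpu - half : cpu);
	}
	default:
		return -1;
	}
}
```

On an 8-CPU machine this yields 8 arenas in "percpu" mode and 4 in "phycpu" mode, matching the counts stated above.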
*	Fix arena_prefork lock rank order for witness.	Qi Wang	2017-03-09	2	-12/+30
\| \| \| \| \| \| \| \|	When witness is enabled, lock rank order needs to be preserved during prefork, not only for each arena, but also across arenas. This change breaks arena_prefork into further stages to ensure valid rank order across arenas. Also changed test/unit/fork to use a manual arena to catch this case.
*	Convert extents_t's npages field to use C11-style atomics	David Goldblatt	2017-03-09	1	-6/+23
\| \| \| \| \| \|	In the process, we can do some strength reduction, changing the fetch-adds and fetch-subs to be simple loads followed by stores, since the modifications all occur while holding the mutex.
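The strength reduction described here can be sketched with standard C11 atomics. The function names are hypothetical, and the sketch assumes, as the commit message states, that every writer holds the same mutex while readers may be lock-free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Add delta to *npages. The caller must hold the mutex that serializes all
 * writers, so a relaxed load followed by a relaxed store is sufficient and
 * no atomic read-modify-write is needed.
 */
static void
npages_add_locked(atomic_size_t *npages, size_t delta) {
	size_t cur = atomic_load_explicit(npages, memory_order_relaxed);
	atomic_store_explicit(npages, cur + delta, memory_order_relaxed);
}

/* Lock-free reader: sees some value that a writer stored, never a torn one. */
static size_t
npages_read(atomic_size_t *npages) {
	return atomic_load_explicit(npages, memory_order_relaxed);
}
```

The field stays atomic so that unlocked readers remain well defined; only the cost of the update changes.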
*	Store associated arena in tcache.	Qi Wang	2017-03-07	3	-9/+11
\| \| \| \| \| \|	This fixes tcache_flush for manual tcaches, which wasn't able to find the arena the tcache was associated with. Also changed the decay test to cover this case (by using manually created arenas).
*	Use any-best-fit for cached extent allocation.	Jason Evans	2017-03-07	1	-5/+8
\| \| \| \| \| \|	This simplifies what would be pairing heap operations to the equivalent of LIFO queue operations. This is a complementary optimization in the context of delayed coalescing for cached extents.
*	Perform delayed coalescing prior to purging.	Jason Evans	2017-03-07	2	-46/+133
\| \| \| \| \| \| \| \| \|	Rather than purging uncoalesced extents, perform just enough incremental coalescing to purge only fully coalesced extents. In the absence of cached extent reuse, the immediate versus delayed incremental purging algorithms result in the same purge order. This resolves #655.
*	Change arena to use the atomic functions for ssize_t instead of the union strategy	David Goldblatt	2017-03-07	1	-6/+2
\|
*	Disentangle assert and util	David Goldblatt	2017-03-06	1	-11/+31
\| \| \| \| \| \| \| \| \|	This is the first header refactoring diff, #533. It splits the assert and util components into separate, hermetic, header files. In the process, it splits out two of the large sub-components of util (the stdio.h replacement, and bit manipulation routines) into their own components (malloc_io.h and bit_util.h). This is mostly to break up cyclic dependencies, but it also breaks off a good chunk of the catch-all-ness of util, which is nice.
*	Optimize malloc_large_stats_t maintenance.	Jason Evans	2017-03-04	1	-29/+6
\| \| \| \| \| \| \| \| \| \|	Convert the nrequests field to be partially derived, and the curlextents to be fully derived, in order to reduce the number of stats updates needed during common operations. This change affects ndalloc stats during arena reset, because it is no longer possible to cancel out ndalloc effects (curlextents would become negative).
*	Introduce a backport of C11 atomics	David Goldblatt	2017-03-03	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This introduces a backport of C11 atomics. It has four implementations; ranked in order of preference, they are:
	- GCC/Clang __atomic builtins
	- GCC/Clang __sync builtins
	- MSVC _Interlocked builtins
	- C11 atomics, from <stdatomic.h>
	The primary advantages are:
	- Close adherence to the standard API gives us a defined memory model.
	- Type safety: atomic objects are now separate types from non-atomic ones, so that it's impossible to mix up atomic and non-atomic updates (which is undefined behavior that compilers are starting to take advantage of).
	- Efficiency: we can specify ordering for operations, avoiding fences and atomic operations on strongly ordered architectures (example: `atomic_write_u32(ptr, val);` involves a CAS loop, whereas `atomic_store(ptr, val, ATOMIC_RELEASE);` is a plain store).
	This diff leaves in the current atomics API (implementing them in terms of the backport). This lets us transition uses over piecemeal.
	Testing: This is by nature hard to test. I've manually tested the first three options on Linux on gcc by futzing with the #defines manually, on FreeBSD with gcc and clang, on MSVC, and on OS X with clang. All of these were x86 machines though, and we don't have any test infrastructure set up for non-x86 platforms.
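The efficiency point can be illustrated with the standard <stdatomic.h> interface rather than the backport's own names, which are not reproduced here. Both functions below are hypothetical: the first models a write done through a CAS loop, the second a write done as a store with explicit release ordering.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Write val by retrying a compare-and-swap: an atomic read-modify-write. */
static void
write_via_cas(_Atomic uint32_t *p, uint32_t val) {
	uint32_t old = atomic_load_explicit(p, memory_order_relaxed);
	while (!atomic_compare_exchange_weak_explicit(p, &old, val,
	    memory_order_seq_cst, memory_order_relaxed)) {
		/* On failure, old is reloaded with the current value; retry. */
	}
}

/* Write val as a store with release ordering; no read-modify-write. */
static void
write_via_release_store(_Atomic uint32_t *p, uint32_t val) {
	atomic_store_explicit(p, val, memory_order_release);
}
```

Both leave the same value in memory; the difference is that the second form lets the compiler emit an ordinary store instruction on strongly ordered architectures such as x86.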
*	Immediately purge cached extents if decay_time is 0.	Jason Evans	2017-03-03	2	-38/+34
\| \| \| \| \| \| \| \|	This fixes a regression caused by 54269dc0ed3e4d04b2539016431de3cfe8330719 (Remove obsolete arena_maybe_purge() call.), as well as providing a general fix. This resolves #665.