* rm unused arena wrangling from xallocx  (Daniel Micay, 2014-10-31; 1 file, -16/+8)
    It has no use for the arena_t since, unlike rallocx, it never makes a new memory allocation. It's just an unused parameter in ixalloc_helper.
* Miscellaneous cleanups.  (Jason Evans, 2014-10-31; 3 files, -10/+10)
* avoid redundant chunk header reads  (Daniel Micay, 2014-10-31; 2 files, -45/+42)
    * use sized deallocation in iralloct_realign
    * iralloc and ixalloc always need the old size, so pass it in from the caller where it's often already calculated
* mark huge allocations as unlikely  (Daniel Micay, 2014-10-31; 4 files, -16/+16)
    This cleans up the fast path a bit more by moving more code out of it.
* Fix prof_{enter,leave}() calls to pass tdata_self.  (Jason Evans, 2014-10-30; 1 file, -19/+24)

* Use JEMALLOC_INLINE_C everywhere it's appropriate.  (Jason Evans, 2014-10-30; 4 files, -15/+15)
* Merge pull request #154 from guilherme-pg/implicit-int  (Jason Evans, 2014-10-20; 1 file, -1/+1)
    Fix variable declaration with no type in the configure script.

  * Fix variable declaration with no type in the configure script.  (Guilherme Goncalves, 2014-10-20; 1 file, -1/+1)
* Merge pull request #151 from thestinger/ralloc  (Jason Evans, 2014-10-16; 2 files, -2/+2)
    use sized deallocation internally for ralloc

  * use sized deallocation internally for ralloc  (Daniel Micay, 2014-10-16; 2 files, -2/+2)
      The size of the source allocation is known at this point, so reading the chunk header can be avoided for the small size class fast path. This is not very useful right now, but it provides a significant performance boost with an alternate ralloc entry point taking the old size.
* Initialize chunks_mtx for all configurations.  (Jason Evans, 2014-10-16; 1 file, -4/+3)
    This resolves #150.
* Purge/zero sub-chunk huge allocations as necessary.  (Jason Evans, 2014-10-16; 1 file, -24/+51)
    Purge trailing pages during shrinking huge reallocation when the resulting size is not a multiple of the chunk size. Similarly, zero pages if necessary during growing huge reallocation when the resulting size is not a multiple of the chunk size.
* Add small run utilization to stats output.  (Jason Evans, 2014-10-15; 1 file, -16/+34)
    Add the 'util' column, which reports the proportion of available regions that are currently in use for each small size class. Small run utilization is the complement of external fragmentation. For example, utilization of 0.75 indicates that 25% of small run memory is consumed by external fragmentation, in other (more obtuse) words, 33% external fragmentation overhead.

    This resolves #27.
* Thwart compiler optimizations.  (Jason Evans, 2014-10-15; 1 file, -0/+12)

* Fix line wrapping.  (Jason Evans, 2014-10-15; 1 file, -10/+10)

* Fix huge allocation statistics.  (Jason Evans, 2014-10-15; 5 files, -160/+252)

* Update size class documentation.  (Jason Evans, 2014-10-15; 1 file, -26/+84)
* Add per size class huge allocation statistics.  (Jason Evans, 2014-10-13; 10 files, -338/+724)
    Add per size class huge allocation statistics, and normalize various stats:
    - Change the arenas.nlruns type from size_t to unsigned.
    - Add the arenas.nhchunks and arenas.hchunks.<i>.size mallctl's.
    - Replace the stats.arenas.<i>.bins.<j>.allocated mallctl with stats.arenas.<i>.bins.<j>.curregs.
    - Add the stats.arenas.<i>.hchunks.<j>.nmalloc, stats.arenas.<i>.hchunks.<j>.ndalloc, stats.arenas.<i>.hchunks.<j>.nrequests, and stats.arenas.<i>.hchunks.<j>.curhchunks mallctl's.
* Fix a prof_tctx_t/prof_tdata_t cleanup race.  (Jason Evans, 2014-10-12; 2 files, -5/+11)
    Fix a prof_tctx_t/prof_tdata_t cleanup race by storing a copy of thr_uid in prof_tctx_t, so that the associated tdata need not be present during tctx teardown.
* Remove arena_dalloc_bin_run() clean page preservation.  (Jason Evans, 2014-10-11; 2 files, -74/+13)
    Remove code in arena_dalloc_bin_run() that preserved the "clean" state of trailing clean pages by splitting them into a separate run during deallocation. This was a useful mechanism for reducing dirty page churn when bin runs comprised many pages, but bin runs are now quite small.

    Remove the nextind field from arena_run_t now that it is no longer needed, and change arena_run_t's bin field (arena_bin_t *) to binind (index_t). These two changes remove 8 bytes of chunk header overhead per page, which saves 1/512 of all arena chunk memory.
* Add --with-lg-tiny-min, generalize --with-lg-quantum.  (Jason Evans, 2014-10-11; 6 files, -16/+105)

* Add AC_CACHE_CHECK() for pause instruction.  (Jason Evans, 2014-10-11; 1 file, -3/+4)
    This supports cross compilation.
* Don't fetch tsd in a0{d,}alloc().  (Jason Evans, 2014-10-11; 2 files, -11/+8)
    Don't fetch tsd in a0{d,}alloc(), because doing so can cause infinite recursion on systems that require an allocated tsd wrapper.
* Add configure options.  (Jason Evans, 2014-10-10; 16 files, -136/+277)
    Add:
        --with-lg-page
        --with-lg-page-sizes
        --with-lg-size-class-group
        --with-lg-quantum

    Get rid of STATIC_PAGE_SHIFT, in favor of directly setting LG_PAGE.

    Fix various edge conditions exposed by the configure options.
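A hypothetical invocation combining some of the new options. The values below are illustrative, not recommendations; the accepted syntax for each option is defined by the configure script itself:

```shell
# Illustrative values only: 4 KiB pages (lg 12), 16-byte quantum (lg 4),
# and four size classes per size doubling (lg 2).
./configure \
    --with-lg-page=12 \
    --with-lg-quantum=4 \
    --with-lg-size-class-group=2
```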
* Don't configure HAVE_SSE2.  (Jason Evans, 2014-10-09; 2 files, -11/+4)
    Don't configure HAVE_SSE2 (on behalf of SFMT), because its dependencies are notoriously unportable in practice.

    This resolves #119.
* Avoid atexit(3) when possible, disable prof_final by default.  (Jason Evans, 2014-10-09; 4 files, -14/+26)
    atexit(3) can deadlock internally during its own initialization if jemalloc calls atexit() during jemalloc initialization. Mitigate the impact by restructuring prof initialization to avoid calling atexit() unless the registered function will actually dump a final heap profile. Additionally, disable prof_final by default so that this land mine is opt-in rather than opt-out.

    This resolves #144.
* Fix a recursive lock acquisition regression.  (Jason Evans, 2014-10-08; 1 file, -11/+16)
    Fix a recursive lock acquisition regression, which was introduced by 8bb3198f72fc7587dc93527f9f19fb5be52fa553 (Refactor/fix arenas manipulation.).
* Use regular arena allocation for huge tree nodes.  (Daniel Micay, 2014-10-08; 5 files, -15/+29)
    This avoids grabbing the base mutex, as a step towards fine-grained locking for huge allocations. The thread cache also provides a tiny (~3%) improvement for serial huge allocations.
* Refactor/fix arenas manipulation.  (Jason Evans, 2014-10-08; 13 files, -347/+740)
    Abstract arenas access to use arena_get() (or a0get() where appropriate) rather than directly reading e.g. arenas[ind]. Prior to the addition of the arenas.extend mallctl, the worst possible outcome of directly accessing arenas was a stale read, but arenas.extend may allocate and assign a new array to arenas.

    Add a tsd-based arenas_cache, which amortizes arenas reads. This introduces some subtle bootstrapping issues, with tsd_boot() now being split into tsd_boot[01]() to support tsd wrapper allocation bootstrapping, as well as an arenas_cache_bypass tsd variable which dynamically terminates allocation of arenas_cache itself.

    Promote a0malloc(), a0calloc(), and a0free() to be generally useful for internal allocation, and use them in several places (more may be appropriate).

    Abstract arena->nthreads management and fix a missing decrement during thread destruction (recent tsd refactoring left arenas_cleanup() unused).

    Change arena_choose() to propagate OOM, and handle OOM in all callers. This is important for providing consistent allocation behavior when the MALLOCX_ARENA() flag is being used. Prior to this fix, it was possible for an OOM to result in allocation silently allocating from a different arena than the one specified.
* Fix a prof_tctx_t destruction race.  (Jason Evans, 2014-10-06; 1 file, -18/+32)
* Normalize size classes.  (Jason Evans, 2014-10-06; 16 files, -474/+557)
    Normalize size classes to use the same number of size classes per size doubling (currently hard coded to 4), across the entire range of size classes. Small size classes already used this spacing, but in order to support this change, additional small size classes now fill [4 KiB .. 16 KiB). Large size classes range from [16 KiB .. 4 MiB). Huge size classes now support non-multiples of the chunk size in order to fill (4 MiB .. 16 MiB).
* Fix a docbook element nesting nit.  (Jason Evans, 2014-10-05; 1 file, -4/+4)
    According to the docbook documentation for <funcprototype>, its parent must be <funcsynopsis>; fix accordingly. Nonetheless, the man page processor fails badly when this construct is embedded in a <para> (which is documented to be legal), although the html processor does fine.
* Attempt to expand huge allocations in-place.  (Daniel Micay, 2014-10-05; 10 files, -41/+118)
    This adds support for expanding huge allocations in-place by requesting memory at a specific address from the chunk allocator. It's currently only implemented for the chunk recycling path, although in theory it could also be done by optimistically allocating new chunks.

    On Linux, it could attempt an in-place mremap. However, that won't work in practice since the heap is grown downwards and memory is not unmapped (in a normal build, at least).

    Repeated vector reallocation micro-benchmark:

        #include <string.h>
        #include <stdlib.h>

        int main(void) {
            for (size_t i = 0; i < 100; i++) {
                void *ptr = NULL;
                size_t old_size = 0;
                for (size_t size = 4; size < (1 << 30); size *= 2) {
                    ptr = realloc(ptr, size);
                    if (!ptr) return 1;
                    memset(ptr + old_size, 0xff, size - old_size);
                    old_size = size;
                }
                free(ptr);
            }
        }

    The glibc allocator fails to do any in-place reallocations on this benchmark once it passes the M_MMAP_THRESHOLD (default 128k) but it elides the cost of copies via mremap, which is currently not something that jemalloc can use. With this improvement, jemalloc still fails to do any in-place huge reallocations for the first outer loop, but then succeeds 100% of the time for the remaining 99 iterations. The time spent doing allocations and copies drops down to under 5%, with nearly all of it spent doing purging + faulting (when huge pages are disabled) and the array memset.

    An improved mremap API (MREMAP_RETAIN - #138) would be far more general but this is a portable optimization and would still be useful on Linux for xallocx.

    Numbers with transparent huge pages enabled:

        glibc (copies elided via MREMAP_MAYMOVE): 8.471s
        jemalloc: 17.816s
        jemalloc + no-op madvise: 13.236s
        jemalloc + this commit: 6.787s
        jemalloc + this commit + no-op madvise: 6.144s

    Numbers with transparent huge pages disabled:

        glibc (copies elided via MREMAP_MAYMOVE): 15.403s
        jemalloc: 39.456s
        jemalloc + no-op madvise: 12.768s
        jemalloc + this commit: 15.534s
        jemalloc + this commit + no-op madvise: 6.354s

    Closes #137
* Fix OOM-related regression in arena_tcache_fill_small().  (Jason Evans, 2014-10-05; 1 file, -1/+12)
    Fix an OOM-related regression in arena_tcache_fill_small() that caused cache corruption that would almost certainly expose the application to undefined behavior, usually in the form of an allocation request returning an already-allocated region, or somewhat less likely, a freed region that had already been returned to the arena, thus making it available to the arena for any purpose.

    This regression was introduced by 9c43c13a35220c10d97a886616899189daceb359 (Reverse tcache fill order.), and was present in all releases from 2.2.0 through 3.6.0.

    This resolves #98.
* Add missing header includes in jemalloc/jemalloc.h.  (Jason Evans, 2014-10-05; 2 files, -2/+4)
    Add stdlib.h, stdbool.h, and stdint.h to jemalloc/jemalloc.h so that applications only have to #include <jemalloc/jemalloc.h>.

    This resolves #132.
* Fix prof regressions.  (Jason Evans, 2014-10-04; 1 file, -16/+23)
    Fix prof regressions related to tdata (main per thread profiling data structure) destruction:
    - Deadlock. The fix for this was intended to be part of 20c31deaae38ed9aa4fe169ed65e0c45cd542955 (Test prof.reset mallctl and fix numerous discovered bugs.) but the fix was left incomplete.
    - Destruction race. Detaching tdata just prior to destruction without holding the tdatas lock made it possible for another thread to destroy the tdata out from under the thread that was on its way to doing so.
* Don't disable tcache for lazy-lock.  (Jason Evans, 2014-10-04; 1 file, -2/+0)
    Don't disable tcache when lazy-lock is configured. There already exists a mechanism to disable tcache, but doing so automatically due to lazy-lock causes surprising performance behavior.
* Avoid purging in microbench when lazy-lock is enabled.  (Jason Evans, 2014-10-04; 1 file, -0/+9)

* Silence a compiler warning.  (Jason Evans, 2014-10-04; 1 file, -1/+1)

* Make prof-related inline functions always-inline.  (Jason Evans, 2014-10-04; 1 file, -9/+9)

* Don't force TLS on behalf of heap profiling.  (Jason Evans, 2014-10-04; 1 file, -5/+0)
    Revert 6716aa83526b3f866d73a033970cc920bc61c13f (Force use of TLS if heap profiling is enabled.). No existing tests indicate that this is necessary, nor does code inspection uncover any potential issues. Most likely the original commit covered up a bug related to tsd-internal allocation that has since been fixed.
* Fix tsd cleanup regressions.  (Jason Evans, 2014-10-04; 12 files, -147/+137)
    Fix tsd cleanup regressions that were introduced in 5460aa6f6676c7f253bfcb75c028dfd38cae8aaf (Convert all tsd variables to reside in a single tsd structure.). These regressions were twofold:

    1) tsd_tryget() should never (and need never) return NULL. Rename it to tsd_fetch() and simplify all callers.
    2) tsd_*_set() must only be called when tsd is in the nominal state, because cleanup happens during the nominal-->purgatory transition, and re-initialization must not happen while in the purgatory state. Add tsd_nominal() and use it as needed. Note that tsd_*{p,}_get() can still be used as long as no re-initialization that would require cleanup occurs. This means that e.g. the thread_allocated counter can be updated unconditionally.
* Fix install_lib target (incorrect jemalloc.pc path).  (Jason Evans, 2014-10-04; 1 file, -1/+1)

* Skip test_prof_thread_name_validation if !config_prof.  (Jason Evans, 2014-10-04; 1 file, -0/+2)

* Implement/test/fix prof-related mallctl's.  (Jason Evans, 2014-10-04; 11 files, -65/+544)
    Implement/test/fix the opt.prof_thread_active_init, prof.thread_active_init, and thread.prof.active mallctl's.

    Test/fix the thread.prof.name mallctl.

    Refactor opt_prof_active to be read-only and move mutable state into the prof_active variable. Stop leaning on ctl-related locking for protection.

* Convert to uniform style: cond == false --> !cond  (Jason Evans, 2014-10-03; 20 files, -115/+111)

* Remove obsolete comment.  (Jason Evans, 2014-10-03; 1 file, -6/+0)

* Test prof.reset mallctl and fix numerous discovered bugs.  (Jason Evans, 2014-10-03; 5 files, -76/+405)
* Refactor permuted backtrace test allocation.  (Jason Evans, 2014-10-02; 10 files, -56/+60)
    Refactor permuted backtrace test allocation that was originally used only by the prof_accum test, so that it can be used by other heap profiling test binaries.
* Implement in-place huge allocation shrinking.  (Daniel Micay, 2014-10-01; 1 file, -27/+62)
    Trivial example:

        #include <stdlib.h>

        int main(void) {
            void *ptr = malloc(1024 * 1024 * 8);
            if (!ptr) return 1;
            ptr = realloc(ptr, 1024 * 1024 * 4);
            if (!ptr) return 1;
        }

    Before:

        mmap(NULL, 8388608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcfff000000
        mmap(NULL, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcffec00000
        madvise(0x7fcfff000000, 8388608, MADV_DONTNEED) = 0

    After:

        mmap(NULL, 8388608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1934800000
        madvise(0x7f1934c00000, 4194304, MADV_DONTNEED) = 0

    Closes #134