path: root/src
...
* Attempt to expand huge allocations in-place. (Daniel Micay, 2014-10-05, 3 files, -28/+101)

  This adds support for expanding huge allocations in-place by requesting memory at a specific address from the chunk allocator. It's currently only implemented for the chunk recycling path, although in theory it could also be done by optimistically allocating new chunks. On Linux, it could attempt an in-place mremap. However, that won't work in practice since the heap is grown downwards and memory is not unmapped (in a normal build, at least).

  Repeated vector reallocation micro-benchmark:

      #include <string.h>
      #include <stdlib.h>

      int main(void) {
          for (size_t i = 0; i < 100; i++) {
              void *ptr = NULL;
              size_t old_size = 0;
              for (size_t size = 4; size < (1 << 30); size *= 2) {
                  ptr = realloc(ptr, size);
                  if (!ptr) return 1;
                  memset(ptr + old_size, 0xff, size - old_size);
                  old_size = size;
              }
              free(ptr);
          }
      }

  The glibc allocator fails to do any in-place reallocations on this benchmark once it passes the M_MMAP_THRESHOLD (default 128k) but it elides the cost of copies via mremap, which is currently not something that jemalloc can use. With this improvement, jemalloc still fails to do any in-place huge reallocations for the first outer loop, but then succeeds 100% of the time for the remaining 99 iterations. The time spent doing allocations and copies drops down to under 5%, with nearly all of it spent doing purging + faulting (when huge pages are disabled) and the array memset.

  An improved mremap API (MREMAP_RETAIN - #138) would be far more general but this is a portable optimization and would still be useful on Linux for xallocx.

  Numbers with transparent huge pages enabled:

      glibc (copies elided via MREMAP_MAYMOVE): 8.471s
      jemalloc: 17.816s
      jemalloc + no-op madvise: 13.236s
      jemalloc + this commit: 6.787s
      jemalloc + this commit + no-op madvise: 6.144s

  Numbers with transparent huge pages disabled:

      glibc (copies elided via MREMAP_MAYMOVE): 15.403s
      jemalloc: 39.456s
      jemalloc + no-op madvise: 12.768s
      jemalloc + this commit: 15.534s
      jemalloc + this commit + no-op madvise: 6.354s

  Closes #137
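  Aside: the xallocx() mentioned above is jemalloc's public non-moving resize entry point, so a caller can already attempt in-place growth and fall back to a moving reallocation itself. A minimal sketch, assuming the standard mallocx/xallocx/rallocx/dallocx API and arbitrary sizes (not code from this commit):

      #include <stdlib.h>
      #include <jemalloc/jemalloc.h>

      /* Grow an allocation to at least `want` bytes, preferring in-place
       * expansion via xallocx() and falling back to rallocx(), which may
       * move (and copy) the allocation. Error handling is elided. */
      static void *grow(void *ptr, size_t want) {
          if (xallocx(ptr, want, 0, 0) >= want)
              return ptr;               /* expanded in place */
          return rallocx(ptr, want, 0); /* moved; old pointer invalid */
      }

      int main(void) {
          void *buf = mallocx(4096, 0);
          if (buf == NULL)
              return 1;
          buf = grow(buf, 1 << 20);
          if (buf != NULL)
              dallocx(buf, 0);
          return 0;
      }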
* Fix OOM-related regression in arena_tcache_fill_small(). (Jason Evans, 2014-10-05, 1 file, -1/+12)

  Fix an OOM-related regression in arena_tcache_fill_small() that caused cache corruption that would almost certainly expose the application to undefined behavior, usually in the form of an allocation request returning an already-allocated region, or somewhat less likely, a freed region that had already been returned to the arena, thus making it available to the arena for any purpose.

  This regression was introduced by 9c43c13a35220c10d97a886616899189daceb359 (Reverse tcache fill order.), and was present in all releases from 2.2.0 through 3.6.0.

  This resolves #98.
* Fix prof regressions. (Jason Evans, 2014-10-04, 1 file, -16/+23)

  Fix prof regressions related to tdata (main per thread profiling data structure) destruction:

  - Deadlock. The fix for this was intended to be part of 20c31deaae38ed9aa4fe169ed65e0c45cd542955 (Test prof.reset mallctl and fix numerous discovered bugs.) but the fix was left incomplete.

  - Destruction race. Detaching tdata just prior to destruction without holding the tdatas lock made it possible for another thread to destroy the tdata out from under the thread that was on its way to doing so.
* Silence a compiler warning. (Jason Evans, 2014-10-04, 1 file, -1/+1)

* Fix tsd cleanup regressions. (Jason Evans, 2014-10-04, 5 files, -88/+53)

  Fix tsd cleanup regressions that were introduced in 5460aa6f6676c7f253bfcb75c028dfd38cae8aaf (Convert all tsd variables to reside in a single tsd structure.). These regressions were twofold:

  1) tsd_tryget() should never (and need never) return NULL. Rename it to tsd_fetch() and simplify all callers.

  2) tsd_*_set() must only be called when tsd is in the nominal state, because cleanup happens during the nominal-->purgatory transition, and re-initialization must not happen while in the purgatory state. Add tsd_nominal() and use it as needed. Note that tsd_*{p,}_get() can still be used as long as no re-initialization that would require cleanup occurs. This means that e.g. the thread_allocated counter can be updated unconditionally.
* Implement/test/fix prof-related mallctl's. (Jason Evans, 2014-10-04, 4 files, -50/+198)

  Implement/test/fix the opt.prof_thread_active_init, prof.thread_active_init, and thread.prof.active mallctl's.

  Test/fix the thread.prof.name mallctl.

  Refactor opt_prof_active to be read-only and move mutable state into the prof_active variable. Stop leaning on ctl-related locking for protection.
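  Aside: these knobs are all driven through the usual mallctl() interface. A rough usage sketch, assuming a build with profiling enabled (the thread name is arbitrary; not code from this commit):

      #include <stdbool.h>
      #include <stdio.h>
      #include <jemalloc/jemalloc.h>

      int main(void) {
          bool init;
          size_t sz = sizeof(init);

          /* Read the default per-thread profiling activation state. */
          if (mallctl("opt.prof_thread_active_init", &init, &sz, NULL, 0) == 0)
              printf("prof_thread_active_init: %d\n", (int)init);

          /* Enable heap profiling for the calling thread... */
          bool active = true;
          mallctl("thread.prof.active", NULL, NULL, &active, sizeof(active));

          /* ...and name it so it can be identified in heap profile dumps. */
          const char *name = "worker-0";
          mallctl("thread.prof.name", NULL, NULL, &name, sizeof(name));

          return 0;
      }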
* Convert to uniform style: cond == false --> !cond (Jason Evans, 2014-10-03, 12 files, -91/+90)

* Test prof.reset mallctl and fix numerous discovered bugs. (Jason Evans, 2014-10-03, 1 file, -64/+149)
* Implement in-place huge allocation shrinking. (Daniel Micay, 2014-10-01, 1 file, -27/+62)

  Trivial example:

      #include <stdlib.h>

      int main(void) {
          void *ptr = malloc(1024 * 1024 * 8);
          if (!ptr) return 1;
          ptr = realloc(ptr, 1024 * 1024 * 4);
          if (!ptr) return 1;
      }

  Before:

      mmap(NULL, 8388608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcfff000000
      mmap(NULL, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcffec00000
      madvise(0x7fcfff000000, 8388608, MADV_DONTNEED) = 0

  After:

      mmap(NULL, 8388608, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1934800000
      madvise(0x7f1934c00000, 4194304, MADV_DONTNEED) = 0

  Closes #134
* Mark malloc_conf as a weak symbol (Dave Rigby, 2014-09-29, 1 file, -1/+1)

  This fixes issue #113 - je_malloc_conf is not respected on OS X.
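  Aside: the weak-symbol treatment matters because applications are expected to override malloc_conf (je_malloc_conf in a prefixed build, as on OS X) with their own definition. A minimal sketch of that usage; the option string is just an example:

      #include <stdlib.h>

      /* Application-provided jemalloc options; this overrides the
       * allocator's weak definition. In a prefixed build the symbol is
       * named je_malloc_conf instead. */
      const char *malloc_conf = "narenas:4,lg_tcache_max:16";

      int main(void) {
          void *p = malloc(64);
          free(p);
          return 0;
      }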
* Move small run metadata into the arena chunk header. (Jason Evans, 2014-09-29, 1 file, -194/+153)

  Move small run metadata into the arena chunk header, with multiple expected benefits:

  - Lower run fragmentation due to reduced run sizes; runs are more likely to completely drain when there are fewer total regions.

  - Improved cache behavior. Prior to this change, run headers were always page-aligned, which put extra pressure on some CPU cache sets. The degree to which this was a problem was hardware dependent, but it likely hurt some even for the most advanced modern hardware.

  - Buffer overruns/underruns are less likely to corrupt allocator metadata.

  - Size classes between 4 KiB and 16 KiB become reasonable to support without any special handling, and the runs are small enough that dirty unused pages aren't a significant concern.
* Implement compile-time bitmap size computation. (Jason Evans, 2014-09-28, 1 file, -15/+3)

* Fix profile dumping race. (Jason Evans, 2014-09-25, 1 file, -1/+9)

  Fix a race that caused a non-critical assertion failure. To trigger the race, a thread had to be part way through initializing a new sample, such that it was discoverable by the dumping thread, but not yet linked into its gctx by the time a later dump phase would normally have reset its state to 'nominal'.

  Additionally, lock access to the state field during modification to transition to the dumping state. It's not apparent that this oversight could have caused an actual problem due to outer locking that protects the dumping machinery, but the added locking pedantically follows the stated locking protocol for the state field.
* Convert all tsd variables to reside in a single tsd structure. (Jason Evans, 2014-09-23, 10 files, -496/+546)

* Fix prof regressions. (Jason Evans, 2014-09-12, 1 file, -1/+22)

  Don't use atomic_add_uint64(), because it isn't available on 32-bit platforms.

  Fix forking support functions to manage all prof-related mutexes.

  These regressions were introduced by 602c8e0971160e4b85b08b16cf8a2375aa24bc04 (Implement per thread heap profiling.), which did not make it into any releases prior to these fixes.
* Fix irallocx_prof() sample logic. (Jason Evans, 2014-09-12, 1 file, -3/+3)

  Fix irallocx_prof() sample logic to only update the threshold counter after it knows what size the allocation ended up being. This regression was caused by 6e73dc194ee9682d3eacaf725a989f04629718f7 (Fix a profile sampling race.), which did not make it into any releases prior to this fix.

* Apply likely()/unlikely() to allocation/deallocation fast paths. (Jason Evans, 2014-09-12, 4 files, -83/+85)
* Fix mallocx() to always honor MALLOCX_ARENA() when profiling. (Jason Evans, 2014-09-11, 1 file, -2/+1)

* mark some conditions as unlikely (Daniel Micay, 2014-09-11, 1 file, -21/+21)

  * assertion failure
  * malloc_init failure
  * malloc not already initialized (in malloc_init)
  * running in valgrind
  * thread cache disabled at runtime

  Clang and GCC already consider a comparison with NULL or -1 to be cold, so many branches (out-of-memory) are already correctly considered as cold and marking them is not important.
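  Aside: the likely()/unlikely() annotations used here and in the fast-path commit above are thin wrappers around the compiler's branch-prediction hint. A minimal sketch of the usual definition and of marking a cold out-of-memory branch (the zalloc() helper is purely illustrative, not jemalloc code):

      #include <stdlib.h>
      #include <string.h>

      #ifdef __GNUC__
      #  define likely(x)   __builtin_expect(!!(x), 1)
      #  define unlikely(x) __builtin_expect(!!(x), 0)
      #else
      #  define likely(x)   (x)
      #  define unlikely(x) (x)
      #endif

      /* Illustrative fast path: the OOM branch is annotated as cold. */
      static void *zalloc(size_t n) {
          void *p = malloc(n);
          if (unlikely(p == NULL))
              return NULL;
          memset(p, 0, n);
          return p;
      }

      int main(void) {
          void *p = zalloc(128);
          free(p);
          return 0;
      }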
* Fix a profile sampling race. (Jason Evans, 2014-09-10, 2 files, -53/+91)

  Fix a profile sampling race that was due to preparing to sample, yet doing nothing to assure that the context remains valid until the stats are updated.

  These regressions were caused by 602c8e0971160e4b85b08b16cf8a2375aa24bc04 (Implement per thread heap profiling.), which did not make it into any releases prior to these fixes.

* Fix prof_tdata_get()-related regressions. (Jason Evans, 2014-09-09, 1 file, -25/+20)

  Fix prof_tdata_get() to avoid dereferencing an invalid tdata pointer (when it's PROF_TDATA_STATE_{REINCARNATED,PURGATORY}).

  Fix prof_tdata_get() callers to check for invalid results besides NULL (PROF_TDATA_STATE_{REINCARNATED,PURGATORY}).

  These regressions were caused by 602c8e0971160e4b85b08b16cf8a2375aa24bc04 (Implement per thread heap profiling.), which did not make it into any releases prior to these fixes.
* Fix sdallocx() assertion. (Jason Evans, 2014-09-09, 1 file, -16/+18)

  Refactor sdallocx() and nallocx() to share inallocx(), and fix an sdallocx() assertion to check usize rather than size.
* Add support for sized deallocation. (Daniel Micay, 2014-09-09, 1 file, -0/+44)

  This adds a new `sdallocx` function to the external API, allowing the size to be passed by the caller. It avoids some extra reads in the thread cache fast path. In the case where stats are enabled, this avoids the work of calculating the size from the pointer.

  An assertion validates the size that's passed in, so enabling debugging will allow users of the API to debug cases where an incorrect size is passed in.

  The performance win for a contrived microbenchmark doing an allocation and immediately freeing it is ~10%. It may have a different impact on a real workload.

  Closes #28
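  Aside: a short usage sketch of the new entry point together with nallocx(), which reports the usable size a request would be rounded up to (the sizes are arbitrary; not code from this commit):

      #include <stdio.h>
      #include <jemalloc/jemalloc.h>

      int main(void) {
          size_t req = 100;
          /* nallocx() reports the usable size for a request of `req`
           * bytes without performing an allocation. */
          size_t usable = nallocx(req, 0);
          printf("request %zu -> usable %zu\n", req, usable);

          void *p = mallocx(req, 0);
          if (p == NULL)
              return 1;

          /* Sized deallocation: passing the original request size (or
           * the usable size) spares the allocator a size lookup. */
          sdallocx(p, req, 0);
          return 0;
      }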
* Optimize [nmd]alloc() fast paths. (Jason Evans, 2014-09-07, 3 files, -102/+141)

  Optimize [nmd]alloc() fast paths such that the (flags == 0) case is streamlined, flags decoding only happens to the minimum degree necessary, and no conditionals are repeated.

* Whitespace cleanups. (Jason Evans, 2014-09-05, 1 file, -7/+7)
* Refactor chunk map. (Qinfan Wu, 2014-09-05, 3 files, -104/+116)

  Break the chunk map into two separate arrays, in order to improve cache locality. This is related to issue #23.

* Remove junk filling in tcache_bin_flush_small(). (Qinfan Wu, 2014-08-27, 1 file, -4/+0)

  Junk filling is done in arena_dalloc_bin_locked(), so arena_alloc_junk_small() is redundant. Also, we should use arena_dalloc_junk_small() instead of arena_alloc_junk_small().
* Test for availability of malloc hooks via autoconf (Sara Golemon, 2014-08-22, 1 file, -1/+3)

  __*_hook() is glibc, but on at least one glibc platform (homebrew), the __GLIBC__ define isn't set correctly and we miss being able to use these hooks.

  Do a feature test for it during configuration so that we enable it anywhere the hooks are actually available.
* Implement per thread heap profiling. (Jason Evans, 2014-08-20, 5 files, -439/+939)

  Rename data structures (prof_thr_cnt_t-->prof_tctx_t, prof_ctx_t-->prof_gctx_t), and convert to storing a prof_tctx_t for sampled objects.

  Convert PROF_ALLOC_PREP() to prof_alloc_prep(), since precise backtrace depth within jemalloc functions is no longer an issue (pprof prunes irrelevant frames).

  Implement mallctl's:

  - prof.reset implements full sample data reset, and optional change of sample interval.

  - prof.lg_sample reads the current sample interval (opt.lg_prof_sample was the permanent source of truth prior to prof.reset).

  - thread.prof.name provides naming capability for threads within heap profile dumps.

  - thread.prof.active makes it possible to activate/deactivate heap profiling for individual threads.

  Modify the heap dump files to contain per thread heap profile data. This change is incompatible with the existing pprof, which will require enhancements to read and process the enriched data.
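  Aside: the new sampling controls are exercised through mallctl(). A rough sketch, assuming a build with profiling compiled in and activated (not code from this commit):

      #include <stdio.h>
      #include <jemalloc/jemalloc.h>

      int main(void) {
          size_t lg_sample;
          size_t sz = sizeof(lg_sample);

          /* Read the current sample interval (log base 2 of bytes). */
          if (mallctl("prof.lg_sample", &lg_sample, &sz, NULL, 0) == 0)
              printf("lg_sample: %zu\n", lg_sample);

          /* Reset all sample data and switch to sampling roughly every
           * 2^20 bytes of allocation activity. */
          size_t new_lg_sample = 20;
          mallctl("prof.reset", NULL, NULL, &new_lg_sample,
              sizeof(new_lg_sample));

          return 0;
      }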
* Dump heap profile backtraces in a stable order. (Jason Evans, 2014-08-20, 1 file, -52/+105)

  Also iterate over per thread stats in a stable order, which prepares the way for stable ordering of per thread heap profile dumps.

* Directly embed prof_ctx_t's bt. (Jason Evans, 2014-08-20, 1 file, -51/+18)

* Convert prof_tdata_t's bt2cnt to a comprehensive map. (Jason Evans, 2014-08-20, 1 file, -50/+17)

  Treat prof_tdata_t's bt2cnt as a comprehensive map of the thread's extant allocation samples (do not limit the total number of entries). This helps prepare the way for per thread heap profiling.
* Fix arena.<i>.dss mallctl to handle read-only calls. (Jason Evans, 2014-08-15, 1 file, -23/+29)

* Fix and refactor runs_dirty-based purging. (Jason Evans, 2014-08-14, 1 file, -104/+80)

  Fix runs_dirty-based purging to also purge dirty pages in the spare chunk.

  Refactor runs_dirty manipulation into arena_dirty_{insert,remove}(), and move the arena->ndirty accounting into those functions.

  Remove the u.ql_link field from arena_chunk_map_t, and get rid of the enclosing union for u.rb_link, since only rb_link remains.

  Remove the ndirty field from arena_chunk_t.
* arena->npurgatory is no longer needed since we drop arena's lock after stashing all the purgeable runs. (Qinfan Wu, 2014-08-12, 1 file, -12/+3)

* Remove chunks_dirty tree, nruns_avail and nruns_adjac since we no longer need to maintain the tree for dirty page purging. (Qinfan Wu, 2014-08-12, 1 file, -177/+10)

* Purge dirty pages from the beginning of the dirty list. (Qinfan Wu, 2014-08-12, 1 file, -165/+70)
* Add dirty page counting for debug (Qinfan Wu, 2014-08-12, 1 file, -4/+29)

* Maintain all the dirty runs in a linked list for each arena (Qinfan Wu, 2014-08-12, 1 file, -0/+47)

* Fix the cactive statistic. (Jason Evans, 2014-08-07, 1 file, -3/+3)

  Fix the cactive statistic to decrease (rather than increase) when active memory decreases. This regression was introduced by aa5113b1fdafd1129c22512837c6c3d66c295fc8 (Refactor overly large/complex functions) and first released in 3.5.0.
* Reintroduce the comment that was removed in f9ff603. (Qinfan Wu, 2014-08-06, 1 file, -1/+5)

* Fix the bug that caused the lowest-addressed free run not to be allocated. (Qinfan Wu, 2014-08-06, 1 file, -3/+7)

* Ensure the default purgeable zone is after the default zone on OS X (Mike Hommey, 2014-06-10, 1 file, -9/+25)
* Add check for madvise(2) to configure.ac. (Richard Diamond, 2014-06-03, 1 file, -2/+5)

  Some platforms, such as Google's Portable Native Client, use Newlib and thus lack access to madvise(2). In those instances, pages_purge() is transformed into a no-op.
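  Aside: the effect is that the purge path compiles down to nothing where madvise(2) is missing. A schematic sketch of a pages_purge()-style helper; the JEMALLOC_HAVE_MADVISE guard name is an assumption about what the configure check defines:

      #include <stddef.h>
      #ifdef JEMALLOC_HAVE_MADVISE
      #  include <sys/mman.h>
      #endif

      /* Tell the kernel the pages are disposable where madvise(2)
       * exists; otherwise (e.g. Newlib/Native Client) do nothing. */
      void purge_pages(void *addr, size_t length) {
      #ifdef JEMALLOC_HAVE_MADVISE
          madvise(addr, length, MADV_DONTNEED);
      #else
          (void)addr;
          (void)length;
      #endif
      }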
* Fix -Wsometimes-uninitialized warnings (Chris Peterson, 2014-06-02, 1 file, -1/+3)

* Fix -Wsign-compare warnings (Chris Peterson, 2014-06-02, 2 files, -4/+4)

* Don't catch fork()ing events for Native Client. (Richard Diamond, 2014-06-02, 1 file, -1/+1)

  Native Client doesn't allow forking, thus there is no need to catch fork()ing events for Native Client. Additionally, without this commit, jemalloc will introduce an unresolved pthread_atfork() in PNaCl Rust bins.
* Try to use __builtin_ffsl if ffsl is unavailable. (Richard Diamond, 2014-06-02, 2 files, -3/+3)

  Some platforms (like those using Newlib) don't have ffs/ffsl. This commit adds a check to configure.ac for __builtin_ffsl if ffsl isn't found. __builtin_ffsl performs the same function as ffsl, and has the added benefit of being available on any platform utilizing a GCC-compatible compiler.

  This change does not address the use of ffs in the MALLOCX_ARENA() macro.
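  Aside: the fallback amounts to aliasing the bit-scan to the compiler builtin when libc lacks ffsl(3). A minimal sketch; the JEMALLOC_INTERNAL_FFSL name is illustrative of how configure could wire this up:

      #include <stdio.h>

      /* Configure would select ffsl() or the builtin; hard-wired here to
       * the GCC/Clang builtin for platforms (e.g. Newlib) without ffsl(3). */
      #ifndef JEMALLOC_INTERNAL_FFSL
      #  define JEMALLOC_INTERNAL_FFSL(x) __builtin_ffsl(x)
      #endif

      int main(void) {
          long x = 0x50;  /* lowest set bit is bit 5, counting from 1 */
          printf("ffsl(0x%lx) = %d\n", x, JEMALLOC_INTERNAL_FFSL(x));
          return 0;
      }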
* Add size class computation capability. (Jason Evans, 2014-05-29, 1 file, -29/+33)

  Add size class computation capability, currently used only as validation of the size class lookup tables. Generalize the size class spacing used for bins, for eventual use throughout the full range of allocation sizes.

* Refactor huge allocation to be managed by arenas. (Jason Evans, 2014-05-16, 8 files, -237/+256)

  Refactor huge allocation to be managed by arenas (though the global red-black tree of huge allocations remains for lookup during deallocation). This is the logical conclusion of recent changes that 1) made per arena dss precedence apply to huge allocation, and 2) made it possible to replace the per arena chunk allocation/deallocation functions.

  Remove the top level huge stats, and replace them with per arena huge stats.

  Normalize function names and types to *dalloc* (some were *dealloc*).

  Remove the --enable-mremap option. As jemalloc currently operates, this is a performance regression for some applications, but planned work to logarithmically space huge size classes should provide similar amortized performance. The motivation for this change was that mremap-based huge reallocation forced leaky abstractions that prevented refactoring.