Skip to content

Latest commit

 

History

History
55 lines (43 loc) · 5.05 KB

general-tasks.md

File metadata and controls

55 lines (43 loc) · 5.05 KB

General

Note: Some of these items may be out of date.

Get rid of unsafe pointer magic WithSidebands (cost: high)

What needs to be done is to transform the &mut StreamingPeekableIter into a child future, and when exhausted, it must be transformed back into the &mut _ that created it. That way, only a single mutable reference to said Iter is present at any time. Unfortunately the generated futures (using async) don't support that as we would have to keep both the future and the parent that created it inside of our own struct. Instead of hiding this using pointers, one could implement the magical part by hand, a custom future, which happily dissolves into its mutable parent iter ref. That would be quite some work though.

Potential for improving performance

gix-object

NLL/Borrowcheck limitation gix-odb::(compound|linked)::Db cause additional code complexity

  • Once polonius is available with production-ready performance, we should simplify the locate(…) code in (compound|linked)::Db() respectively. Currently these first have to obtain an index, and when found, access the data to avoid having the borrowchecker fail to understand our buffer usage within a loop correctly. Performance itself it probably not reasonably affected.

Pack Decoding

  • Pack decoding takes 5x more memory than git on the android-base repository.
  • On ARM64 on MacOS the SHA1 implementation of the sha-1 crate is capped at about 550MB/s, half the speed of what I saw on Intel and about 50% slower than what's implemented in libcorecrypto.dylib. Get that fast and the decoding stage will be able to beat git on fewer cores. See this comment for more. Right now we only do when scaling beyond what git can do due to lock contention.
    • This should work once the asm feature can be enabled in the sha-1 crate, which currently fails but is tracked in this issue.
  • pack::cache::lru::Memory all copy input data in individual allocations. Could a pre-allocated arena or slab be faster?
    • Probably not, as allocation performance might not be the issue here. Even though there definitely is a lot of effectively useless copying of data and deallocation happening if caches are not used after all.
  • Add more control over the amount of memory used for the less-memory algorithm of pack-verify to increase cache hit rate at the cost of memory. Note that depending on this setting, it might not be needed anymore to iterated over sorted offsets, freeing 150MB of memory in the process that could be used for the improved cache. With the current cache and no sorted offsets, the time nearly triples.
  • progress measuring costs when using 96 cores (see this comment)
    • potential savings: low
  • Add '--chunk|batch-size' flag to pack-verify and pack-index-from-data to allow tuning sizes for large amounts of cores
    • @joshtriplett write: "I did find that algorithm when I was looking for the chunk size, though I didn't dig into the details. As a quick hack, I tried dropping the upper number from 1000 to 250, which made no apparent difference in performance."
    • potential savings: medium unclear
  • On 96 core machines, it takes visible time until all threads are started and have work. Is it because starting 100 threads takes so long? Or is it contention to get work?
  • Improve cache hit rate of lookup pack traversal by using partial DAGs build with help of the index
    • @joshtriplett writes: "Would it be possible, with some care, to use the index to figure out in advance which objects will be needed again and which ones won't? Could you compute a small DAG of objects you need for deltas (without storing the objects themselves), and use that to decide the order you process objects in?"
    • Note that there is tension between adding more latency to build such tree and the algorithms ability to (otherwise) start instantly.
    • potential savings: unknown