Swift/ec approval pointer idea

EC "approval pointer" Idea

History: We started calling this a manifest, but decided that was a bad name after we fleshed out the idea. Now it's an "approval pointer". Have fun with that.


Paul: This is an alternative to the ec_hashes pickle

  • Which set of EC fragments should be considered the durable version to be maintained?
  • How do we know which fragments can be cleaned up?
  • Can we optimize for small object EC?

When the proxy receives enough successful responses from writing the fragment archives, it knows which set of fragments the system must maintain, and it writes the "approval pointer" data (i.e., it tells a set of object servers to write that data).
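The proxy-side step above can be sketched roughly as follows. This is only an illustration under stated assumptions: `ApprovalPointer`, `approve_put`, and the `quorum` parameter are hypothetical names, and the notes do not say what counts as "sufficient success", so the threshold is left to the caller.

```python
from typing import NamedTuple, Optional, Sequence


class ApprovalPointer(NamedTuple):
    """Hypothetical in-memory form of the pointer: a timestamp plus a UUID."""
    timestamp: float
    fa_set_uuid: str


def approve_put(fa_set_uuid: str,
                timestamp: float,
                write_results: Sequence[bool],
                quorum: int) -> Optional[ApprovalPointer]:
    """Return the pointer to write, or None if too few fragment writes succeeded.

    write_results holds one boolean per fragment archive write.
    quorum is an assumed threshold; the notes only say "sufficient success".
    """
    successes = sum(1 for ok in write_results if ok)
    if successes < quorum:
        # Not durable: no pointer is written, so these fragments are never approved.
        return None
    return ApprovalPointer(timestamp=timestamp, fa_set_uuid=fa_set_uuid)
```

Because the pointer is produced only after the fragment writes are counted, a failed or partial PUT leaves the previously approved set untouched.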

What is the "approval pointer" object?

  • It is an object (zero-sized)
  • It is replicated in the system.
  • It contains a timestamp and a UUID that uniquely identifies the set of fragment archives to maintain.
  • It acts as a timestamp marker that "at this time, there was a durable PUT"

Why do we store a UUID to refer to a set of fragment archives?

  • It removes (on-disk) name conflicts in the case of [partial] overwrites
  • The "approval pointer" update is atomic and only done after the EC write is successful.
  • The timestamps are different anyway. The UUID also allows the fragment archives (FAs) to be stored on different nodes than the "approval pointer"

Where is the "approval pointer" stored?

  • Option 1: stored (as a separate on-disk file) on the same nodes as the fragment archives
    • stored in the on-disk dir tree
      • /mnt/part/suff/hash/ts.data
      • /mnt/part/suff/hash/uuid/ts.data
  • Option 2: stored 3x in the system (whatever the replica count is)
    • stored as objects like normal
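For Option 1, the two example paths can be built as in the sketch below. It assumes one reading of the layout that the notes do not state outright: the file directly under the hash directory is the "approval pointer", and the file under the `uuid` subdirectory is the fragment archive. The function names are hypothetical, and the path components are the placeholders used in the list above.

```python
import posixpath


def approval_pointer_path(mnt, part, suff, obj_hash, ts):
    """Path of the pointer file, assumed to sit directly in the hash directory."""
    return posixpath.join(mnt, part, suff, obj_hash, f"{ts}.data")


def fragment_archive_path(mnt, part, suff, obj_hash, fa_uuid, ts):
    """Path of a fragment archive, assumed to sit in a per-UUID subdirectory."""
    return posixpath.join(mnt, part, suff, obj_hash, fa_uuid, f"{ts}.data")


if __name__ == "__main__":
    print(approval_pointer_path("/mnt", "part", "suff", "hash", "ts"))
    print(fragment_archive_path("/mnt", "part", "suff", "hash", "uuid", "ts"))
```

Keeping each set of fragment archives in its own UUID directory is what avoids on-disk name conflicts when an overwrite only partially succeeds.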

What about the reconstructor?

Option 1 (next to FA): <-- doesn't have as much reconstructor network usage

The reconstructor finds a FA and then checks the local EC "approval pointer". If the "approval pointer" refers to the same UUID as the FA, then we're good locally and can proceed with checking that the entire EC object is correct...

If the "approval pointer" refers to a different UUID, then the reconstructor needs to compare the timestamps of the "approval pointer" and the local FA. If the local FA has an older timestamp, then the local FA can be deleted. If the local FA has a newer timestamp than the "approval pointer", then it can only be deleted after a time has passed (one week?).
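The two reconstructor cases above can be written as a small decision function. This is a sketch, not the reconstructor's actual code: `Action`, `reconstructor_action`, and `grace` are hypothetical names, the one-week value is only the tentative figure from the text, and measuring the waiting period from the FA's own timestamp is an assumption.

```python
from enum import Enum


class Action(Enum):
    VERIFY = "verify"   # UUIDs match: go on to check the whole EC object
    DELETE = "delete"   # local FA is safe to remove
    KEEP = "keep"       # newer than the pointer and still inside the waiting period


ONE_WEEK = 7 * 24 * 60 * 60  # seconds; the notes only suggest "one week?"


def reconstructor_action(fa_uuid, fa_ts, ptr_uuid, ptr_ts, now, grace=ONE_WEEK):
    """Decide what to do with one local FA given the local approval pointer."""
    if fa_uuid == ptr_uuid:
        return Action.VERIFY
    if fa_ts < ptr_ts:
        # Older than the approved set: it has been superseded.
        return Action.DELETE
    # Newer than the pointer (equal timestamps are treated the same way, as a
    # conservative assumption): it may belong to a PUT that is not approved
    # yet, so it is deleted only once the waiting period has passed.
    if now - fa_ts >= grace:
        return Action.DELETE
    return Action.KEEP
```

The asymmetry is deliberate: an older FA can never become the approved version, while a newer one might still be waiting for its pointer to arrive.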

Option 2 (3x replica of "approval pointer"): <-- might be used for small-object optimization

This is the same as Option 1 except there _IS_ a network operation for finding the "approval pointer" data. The local copy of the "approval pointer" can be used if there is a local replica.

Hard problem: <-- this is only necessary for Option 2 and you used that to solve small objects

How do we store 3x replicas and 14-way EC in the same policy? Two rings? Some subset of the 14-way ring?

Strong recommendation:

Use Option 1 (store the "approval pointer" with every fragment archive) and solve small-object EC optimization later.


  • If we figure out Option 2 under "Where is the "approval pointer" stored?", then we have a very good place to add small-file optimization.
  • Store "approval pointer" data as metadata?
  • Store the "approval pointer" as a hardlink to the uuid directory?
  • On the GET path, the object server doesn't return the 0-byte "approval pointer"; instead it returns the referred-to FA. If there is no local "approval pointer", go ahead and return the latest FA. If the correct (referred-to) FA doesn't exist, the object server can return 404. The object server still needs to return the other available timestamps.
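The GET-path idea in the last bullet might look like the sketch below. It is an illustration only: `select_fragment_archive` is a hypothetical name, local FAs are modeled as a plain dict from UUID to timestamp, and returning 404 when there is neither a pointer nor any FA is an assumption the notes do not cover.

```python
def select_fragment_archive(local_fas, pointer_uuid=None):
    """Pick which local FA an object server would serve on GET.

    local_fas maps FA UUID -> timestamp for every FA stored locally.
    pointer_uuid is the UUID named by the local approval pointer, or None
    if there is no local pointer.
    Returns (status, chosen_uuid, all_timestamps).
    """
    # The other available timestamps are always reported back.
    timestamps = sorted(local_fas.values())
    if pointer_uuid is None:
        if not local_fas:
            return 404, None, timestamps
        # No local pointer: fall back to the latest FA.
        latest = max(local_fas, key=local_fas.get)
        return 200, latest, timestamps
    if pointer_uuid in local_fas:
        return 200, pointer_uuid, timestamps
    # The pointer refers to an FA that is not stored here.
    return 404, None, timestamps
```

For example, with `{"u1": 1.0, "u2": 2.0}` stored locally, a pointer naming `"u1"` serves `"u1"` even though `"u2"` is newer, while a missing pointer serves `"u2"`.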