Large Object Promisors
======================

Ever since Git was created, users have complained about issues with
storing large files in Git. Some solutions have been created to help,
but they haven't helped much with some of the issues.

Git currently supports multiple promisor remotes, which could help
with some of these remaining issues, but they are very hard to use for
that purpose, because a number of important features are missing.

The goal of the effort described in this document is to add these
important features.

We will call a "Large Object Promisor", or "LOP" for short, a promisor
remote which is used to store only large blobs and which is separate
from the main remote that should store the other Git objects and the
rest of the repos.

By extension, we will also call "Large Object Promisor", or LOP, the
effort described in this document to add a set of features to make it
easier to handle large blobs/files in Git by using LOPs.

This effort aims primarily to improve things on the server side, and
especially for large blobs that are already compressed in a binary
format.

This effort aims to provide an alternative to Git LFS
(https://git-lfs.com/) and similar tools like git-annex
(https://git-annex.branchable.com/) for handling large files, even
though a complete alternative would very likely require other efforts,
especially on the client side, where it would likely help to implement
a new object representation for large blobs, as discussed in:

https://lore.kernel.org/git/xmqqbkdometi.fsf@gitster.g/

0) Non goals
------------

- We will not discuss those client side improvements here, as they
  would require changes in different parts of Git than the ones this
  effort focuses on.
+
So we don't pretend to fully replace Git LFS with only this effort,
but we nevertheless believe that it can significantly improve the
current situation on the server side, and that other separate
efforts could also improve the situation on the client side.

- In the same way, we are not going to discuss all the possible ways
  to implement a LOP or its underlying object storage, or to optimize
  how LOPs work.
+
Our opinion is that the simplest solution for now is for LOPs to use
object storage through a remote helper (see section II.2 below for
more details) to store their objects. So we consider that this is the
default implementation. If there are improvements on top of this,
that's great, but our opinion is that such improvements are not
necessary for LOPs to already be useful. Such improvements are likely
a different technical topic, and can be taken care of separately
anyway.
+
So in particular we are not going to discuss pluggable ODBs or other
object database backends that could chunk large blobs, dedup the
chunks and store them efficiently. Sure, that would be a nice
improvement to store large blobs on the server side, but we believe
it can just be a separate effort as it's also not technically very
related to this effort.
+
We are also not going to discuss data transfer improvements between
LOPs and clients or servers. Sure, there might be some easy and very
effective optimizations there (as we know that objects on LOPs are
very likely incompressible and not deltifying well), but this can be
dealt with separately in a separate effort.

In other words, the goal of this document is not to talk about all the
possible ways to optimize how Git could handle large blobs, but to
describe how a LOP based solution can already work well and alleviate
a number of current issues in the context of Git clients and servers
sharing Git objects.

Even if LOPs are not used very efficiently, they can still be useful
and worth using in some cases, as we will see in more detail later in
this document:

- they can make it simpler for clients to use promisor remotes and
  therefore avoid fetching a lot of large blobs they might not need
  locally,

- they can make it significantly cheaper or easier for servers to
  host a significant part of the current repository content, and
  even more to host content with larger blobs or more large blobs
  than currently.

I) Issues with the current situation
------------------------------------

- Some statistics gathered from GitLab repos have shown that more than
  75% of the disk space is used by blobs that are larger than 1MB and
  often in a binary format.

- So even if users could use Git LFS or similar tools to store a lot
  of large blobs outside their repos, in practice they don't do it as
  much as they probably should.

- Ideally, the server should be able to decide for itself how it
  stores things. It should not depend on users deciding whether or not
  to use tools like Git LFS on some blobs.

- It's much more expensive to store large blobs that don't delta
  compress well on regular fast seeking drives (like SSDs) than on
  object storage (like Amazon S3 or GCP Buckets). Using fast drives
  for regular Git repos makes sense though, as serving regular Git
  content (blobs containing text or code) needs drives where seeking
  is fast, but the content is relatively small. On the other hand,
  object storage for Git LFS blobs makes sense as seeking speed is not
  as important when dealing with large files, while costs are more
  important. So the fact that users don't use Git LFS or similar tools
  for a significant number of large blobs likely has a negative impact
  on the cost of repo storage for most Git hosting platforms.

- Having large blobs handled in the same way as other blobs and Git
  objects in Git repos instead of on object storage also has a cost in
  increased memory and CPU usage, and therefore decreased performance,
  when creating packfiles. (This is because Git tries to use delta
  compression or zlib compression, which is unlikely to work well on
  already compressed binary content.) So it's not just a storage cost
  increase.

- When a large blob has been committed into a repo, it might not be
  possible to remove this blob from the repo without rewriting
  history, even if the user then decides to use Git LFS or a similar
  tool to handle it.

- In fact Git LFS and similar tools are not very flexible in letting
  users change their minds about which blobs these tools should
  handle.

- Even when users are using Git LFS or similar tools, they often
  complain that these tools require significant effort to set up,
  learn and use correctly.

II) Main features of the "Large Object Promisors" solution
----------------------------------------------------------

The main features below should give a rough overview of how the
solution may work. Details about the needed elements can be found in
the following sections.

Even if each feature below is very useful for the full solution, it is
very likely to also be useful on its own in some cases where the full
solution is not required. However, we'll focus primarily on the big
picture here.

Also, each feature doesn't need to be implemented entirely in Git
itself. Some could be scripts, hooks or helpers that are not part of
the Git codebase. It would be helpful if those could be shared and
improved on collaboratively though, so we want to encourage sharing
them.

1) Large blobs are stored on LOPs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Large blobs should be stored on special promisor remotes that we will
call "Large Object Promisors" or LOPs. These LOPs should be additional
remotes dedicated to containing large blobs, especially those in
binary format. They should be used along with main remotes that
contain the other objects.

Note 1
++++++

To clarify, a LOP is a normal promisor remote, except that:

- it should store only large blobs,

- it should be separate from the main remote, so that the main remote
  can focus on serving other objects and the rest of the repos (see
  feature 4) below) and can use the LOP as a promisor remote for
  itself.

Note 2
++++++

Git already makes it possible for a main remote to also be a promisor
remote storing both regular objects and large blobs for a client that
clones from it with a filter on blob size. But here we explicitly want
to avoid that.

Rationale
+++++++++

LOPs aim to be good at handling large blobs, while main remotes are
already good at handling other objects.

Implementation
++++++++++++++

Git already has support for multiple promisor remotes, see
link:partial-clone.html#using-many-promisor-remotes[the partial clone documentation].

Also, Git already has support for partial clone using a filter on the
size of the blobs (with `git clone --filter=blob:limit=<size>`). Most
of the other main features below are based on these existing features
and are about making them easy and efficient to use for the purpose of
better handling large blobs.
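
For illustration, a client can already set up this kind of
configuration by hand with current Git versions: clone with a blob
size filter and then add a separate promisor remote for the large
blobs. The URLs, the remote name and the 1MB threshold below are only
placeholders:

----
$ git clone --filter=blob:limit=1m https://example.com/repo.git
$ cd repo
$ git remote add lop https://lop.example.com/repo.git
$ git config remote.lop.promisor true
$ git config remote.lop.partialclonefilter blob:limit=1m
----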

2) LOPs can use object storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

LOPs can be implemented using object storage, like an Amazon S3 or GCP
Bucket or MinIO (which is open source under the GNU AGPLv3 license) to
actually store the large blobs, and can be accessed through a Git
remote helper (see linkgit:gitremote-helpers[7]) which makes the
underlying object storage appear like a remote to Git.

Note
++++

A LOP can be a promisor remote accessed using a remote helper by
both some clients and the main remote.

Rationale
+++++++++

This looks like the simplest way to create LOPs that can cheaply
handle many large blobs.

Implementation
++++++++++++++

Remote helpers are quite easy to write as shell scripts, but it might
be more efficient and maintainable to write them using other languages
like Go.

Some already exist under open source licenses, for example:

- https://github.com/awslabs/git-remote-s3
- https://gitlab.com/eric.p.ju/git-remote-gs
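
For illustration, assuming one of these helpers is installed on the
machine using the LOP (for example `git-remote-s3`, which by the usual
remote helper convention handles URLs of the form
`s3://<bucket>/<prefix>`), a LOP backed by a bucket could be
configured as a promisor remote roughly like this; the bucket name and
the threshold are only placeholders:

----
$ git remote add lop s3://my-bucket/my-repo
$ git config remote.lop.promisor true
$ git config remote.lop.partialclonefilter blob:limit=1m
----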

Other ways to implement LOPs are certainly possible, but the goal of
this document is not to discuss how to best implement a LOP or its
underlying object storage (see the "0) Non goals" section above).

3) LOP object storage can be Git LFS storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The underlying object storage that a LOP uses could also serve as
storage for large files handled by Git LFS.

Rationale
+++++++++

This would simplify the server side if it wants to both use a LOP and
act as a Git LFS server.

4) A main remote can offload to a LOP with a configurable threshold
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On the server side, a main remote should have a way to offload to a
LOP all its blobs with a size over a configurable threshold.

Rationale
+++++++++

This makes it easy to set things up and to clean things up. For
example, an admin could use this to manually convert a repo not using
LOPs to a repo using a LOP. On a repo already using a LOP but where
some users would sometimes push large blobs, a cron job could use this
to regularly make sure the large blobs are moved to the LOP.

Implementation
++++++++++++++

Using something based on `git repack --filter=...` to separate the
blobs we want to offload from the other Git objects could be a good
idea. The missing part is to connect to the LOP, check if the blobs we
want to offload are already there, and if not, send them.
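
For illustration, here is a rough sketch of what such an offload could
look like on a bare main repo, assuming a 1MB threshold and a LOP
configured as the "lop" remote (both are placeholders). How the large
blobs are actually checked for and sent to the LOP depends on the LOP
and its remote helper, so that step is only outlined:

----
# List the blobs larger than 1MB currently in the repo.
$ git cat-file --batch-all-objects \
	--batch-check='%(objecttype) %(objectsize) %(objectname)' |
	awk '$1 == "blob" && $2 > 1048576 { print $3 }' >large-blobs.txt

# Check which of these blobs the LOP already has, and send it the
# missing ones (this is the part that depends on the LOP).

# Repack the main repo without the large blobs, writing them to a
# separate directory that can be removed once the LOP has them all.
$ git repack -a -d --filter=blob:limit=1m --filter-to=../filtered-out
----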

5) A main remote should try to remain clean from large blobs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A main remote should try to avoid containing a lot of oversize
blobs. For that purpose, it should offload as needed to a LOP and it
should have ways to prevent oversize blobs from being fetched, and
also perhaps pushed, into it.

Rationale
+++++++++

A main remote containing many oversize blobs would defeat the purpose
of LOPs.

Implementation
++++++++++++++

The way to offload to a LOP discussed in 4) above can be used to
regularly offload oversize blobs. About preventing oversize blobs from
being fetched into the repo, see 6) below. About preventing oversize
blob pushes, a pre-receive hook could be used.
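
For illustration, here is a minimal sketch of such a pre-receive hook,
rejecting pushes that contain new blobs larger than 1MB. The threshold
and the overall shape are just an example; a production hook would
need more robust handling:

----
#!/bin/sh

limit=1048576

while read old new ref
do
	# Skip ref deletions (the new value is the null object ID).
	expr "$new" : '0*$' >/dev/null && continue

	# Look for pushed blobs that are over the limit.
	big=$(git rev-list --objects "$new" --not --all |
		cut -d' ' -f1 |
		git cat-file --buffer \
			--batch-check='%(objecttype) %(objectsize) %(objectname)' |
		awk -v limit="$limit" '$1 == "blob" && $2 > limit { print $3 }')

	if test -n "$big"
	then
		echo "rejected: blobs larger than $limit bytes:" >&2
		echo "$big" >&2
		exit 1
	fi
done
----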

Also there are different scenarios in which large blobs could get
fetched into the main remote, for example:

- A client that doesn't implement the "promisor-remote" protocol
  (described in 6) below) clones from the main remote.

- The main remote gets a request for information about a large blob
  and is not able to get that information without fetching the blob
  from the LOP.

It might not be possible to completely prevent all these scenarios
from happening. So the goal here should be to implement features that
make the fetching of large blobs less likely. For example, adding a
`remote-object-info` command in the `git cat-file --batch` protocol
and its variants might make it possible for a main repo to respond to
some requests about large blobs without fetching them.
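
To illustrate, `git cat-file --batch-command` already has an `info`
command that shows an object's type and size without printing its
content. The idea would be to add a similar command, with a
hypothetical syntax like the one below, that asks a promisor remote
(here one named "lop") for that information instead of fetching the
blob:

----
# Already available: information about an object present locally.
$ echo "info <oid>" | git cat-file --batch-command
<oid> blob 104857600

# Proposed: the same information for an object that is only on the
# "lop" promisor remote, without fetching it.
$ echo "remote-object-info lop <oid>" | git cat-file --batch-command
<oid> blob 104857600
----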

6) A protocol negotiation should happen when a client clones
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When a client clones from a main repo, there should be a protocol
negotiation so that the server can advertise one or more LOPs and so
that the client and the server can discuss if the client could
directly use a LOP the server is advertising. If the client and the
server can agree on that, then the client would be able to get the
large blobs directly from the LOP and the server would not need to
fetch those blobs from the LOP to be able to serve the client.

Note
++++

For fetches instead of clones, a protocol negotiation might not always
happen, see the "What about fetches?" FAQ entry below for details.

Rationale
+++++++++

Security, configurability and efficiency of setting things up.

Implementation
++++++++++++++

A "promisor-remote" protocol v2 capability looks like a good way to
implement this. The way the client and server use this capability
could be controlled by configuration variables.

Information that the server could send to the client through that
protocol could be things like: LOP name, LOP URL, filter-spec (for
example `blob:limit=<size>`) or just a size limit that should be used
as a filter when cloning, a token to be used with the LOP, etc.
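
For illustration, with such a capability the setup could boil down to
a few configuration variables like the following. The variable names
below follow the proposed implementation and might still change, and
the URL is a placeholder:

----
# On the main remote: configure the LOP as one of its own promisor
# remotes and advertise it to clients.
$ git config remote.lop.url https://lop.example.com/repo.git
$ git config remote.lop.promisor true
$ git config promisor.advertise true

# On the client: accept an advertised LOP, here only when a remote
# with the same name is already configured locally.
$ git config promisor.acceptfromserver KnownName
----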

7) A client can offload to a LOP
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When a client is using a LOP that is also a LOP of its main remote,
the client should be able to offload some large blobs it has fetched,
but might not need anymore, to the LOP.

Note
++++

It might depend on the context whether it should be OK or not for
clients to offload large blobs they have created, rather than fetched,
directly to the LOP without the main remote checking them in some way
(possibly using hooks or other tools).

This should be discussed and refined when we get closer to
implementing this feature.

Rationale
+++++++++

On the client, the easiest way to deal with unneeded large blobs is to
offload them.

Implementation
++++++++++++++

This is very similar to what 4) above is about, except on the client
side instead of the server side. So a good solution to 4) could likely
be adapted to work on the client side too.

There might be some security issues here, as there is no negotiation,
but they might be mitigated if the client can reuse a token it got
when cloning (see 6) above). Also, if the large blobs were fetched
from a LOP, it is likely, and can easily be confirmed, that the LOP
still has them, so that they can just be removed from the client.
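
For example, once the LOP is configured as a promisor remote on the
client and it is known to still have the large blobs, dropping the
local copies could be as simple as a filtering repack (the 1MB
threshold is a placeholder); the blobs can then be lazily re-fetched
from the LOP if they are ever needed again:

----
$ git repack -a -d --filter=blob:limit=1m
----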

III) Benefits of using LOPs
---------------------------

Many benefits are related to the issues discussed in "I) Issues with
the current situation" above:

- No need to rewrite history when deciding which blobs are worth
  handling separately from other objects, or when changing or removing
  the size threshold.

- If the protocol between client and server is developed and secured
  enough, then many details might be set up on the server side only,
  and all the clients could then easily get all the configuration
  information and use it to set themselves up mostly automatically.

- Reduced storage costs on the server side.

- Reduced memory and CPU needs on main remotes on the server side.

- Reduced storage needs on the client side.

IV) FAQ
-------

What about using multiple LOPs on the server and client side?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

That could perhaps be useful in some cases, but for now it's more
likely that in most cases a single LOP will be advertised by the
server and should be used by the client.

A case where it could be useful for a server to advertise multiple
LOPs is if a LOP is better for some users while a different LOP is
better for other users. For example some clients might have a better
connection to a LOP than others.

In those cases it's the responsibility of the server to have some
documentation to help clients. It could say for example something like
"Users in this part of the world might want to pick only LOP A as it
is likely to be better connected to them, while users in other parts
of the world should pick only LOP B for the same reason."

When should we trust or not trust the LOPs advertised by the server?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In some contexts, like in a corporate setup where the server and all
the clients are part of an internal company network where admins have
full rights on every system, it's OK, and perhaps even a good thing,
if the clients fully trust the server, as it can help ensure that all
the clients are on the same page.

There are also contexts in which clients trust a code hosting platform
serving them some repos, but might not fully trust other users
managing or contributing to some of these repos. For example, the code
hosting platform could have hooks in place to check that any object it
receives doesn't contain malware or otherwise bad content. In this
case it might be OK for the client to use a main remote and its LOP if
they are both hosted by the code hosting platform, but not if the LOP
is hosted elsewhere (where the content is not checked).

In other contexts, a client should just not trust a server.

So there should be different ways to configure how the client should
behave when a server advertises a LOP to it at clone time.

As the basic elements that a server can advertise about a LOP are a
LOP name and a LOP URL, the client should base its decision about
accepting a LOP on these elements.

One simple way for the client to be very strict about the LOPs it
accepts is, for example, to check that the LOP is already configured
locally with the same name and URL as what the server advertises.

In general, default and "safe" settings should require that the LOP be
configured on the client separately from the "promisor-remote"
protocol, and that the client accept a LOP only when information about
it from the protocol matches what has already been configured
separately.
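
For illustration, a strict client setup could consist in configuring
the LOP at clone time and only accepting an advertised LOP whose name
and URL both match that configuration. The variable names below follow
the proposed implementation and might still change, and the URLs are
placeholders:

----
$ git clone -c remote.lop.url=https://lop.example.com/repo.git \
	-c remote.lop.promisor=true \
	-c promisor.acceptfromserver=KnownUrl \
	https://example.com/repo.git
----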

What about LOP names?
~~~~~~~~~~~~~~~~~~~~~

In some contexts, for example if the clients sometimes fetch from each
other, it can be a good idea for all the clients to use the same names
for all the remotes they use, including LOPs.

In other contexts, each client might want to be able to give the name
it wants to each remote, including each LOP, it interacts with.

So there should be different ways to configure whether the client
accepts the LOP name the server advertises.

If a default or "safe" setting is used, then, as such a setting should
require that the LOP be configured separately, the name would also be
configured separately, and there is no risk that the server could
dictate a name to a client.

Could the main remote be bogged down by old or paranoid clients?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Yes, it could happen if there are too many clients that are either
unwilling to trust the main remote or that just don't implement the
"promisor-remote" protocol because they are too old or not fully
compatible with the 'git' client.

When serving such a client, the main remote has no other choice than
to first fetch from its LOP, to then be able to provide to the client
everything it requested. So the main remote, even if it has cleanup
mechanisms (see section II.4 above), would be burdened at least
temporarily with the large blobs it had to fetch from its LOP.

Not behaving like this would break backward compatibility, and could
be seen as segregating clients. For example, it might be possible to
implement a special mode that allows the server to just reject clients
that don't implement the "promisor-remote" protocol or aren't willing
to trust the main remote. This mode might be useful in a special
context like a corporate environment. There is no plan to implement
such a mode though, and this should be discussed separately later
anyway.

A better way to proceed is probably for the main remote to show a
message telling clients that don't implement the protocol or are
unwilling to accept the advertised LOP(s) that they would get faster
clones and fetches by upgrading their client software or by properly
setting it up to accept LOP(s).

Waiting for clients to upgrade, monitoring these upgrades and limiting
the use of LOPs to repos that are not very frequently accessed might
be other good ways to make sure that some benefits are still reaped
from LOPs. Over time, as more and more clients upgrade and benefit
from LOPs, using them in more and more frequently accessed repos will
become worth it.

Corporate environments, where it might be easier to make sure that all
the clients are up-to-date and properly configured, could hopefully
benefit more and earlier from using LOPs.

What about fetches?
~~~~~~~~~~~~~~~~~~~

There are different kinds of fetches. A regular fetch happens when
some refs have been updated on the server and the client wants the ref
updates and possibly the new objects added with them. A "backfill" or
"lazy" fetch, on the contrary, happens when the client needs to use
some objects it already knows about but doesn't have because they are
on a promisor remote.

Regular fetch
+++++++++++++

In a regular fetch, the client will contact the main remote and a
protocol negotiation will happen between them. It's a good thing that
a protocol negotiation happens every time, as the configuration on the
client or the main remote could have changed since the previous
protocol negotiation. In this case, the new protocol negotiation
should ensure that the new fetch will happen in a way that satisfies
the new configuration of both the client and the server.

In most cases though, the configurations on the client and the main
remote will not have changed between two fetches or between the
initial clone and a subsequent fetch. This means that the result of a
new protocol negotiation will be the same as the previous result, so
the new fetch will happen in the same way as the previous clone or
fetch, using, or not using, the same LOP(s) as last time.

"Backfill" or "lazy" fetch
++++++++++++++++++++++++++

When there is a backfill fetch, the client doesn't necessarily contact
the main remote first. It will try to fetch from its promisor remotes
in the order they appear in the config file, except that a remote
configured using the `extensions.partialClone` config variable will be
tried last. See
link:partial-clone.html#using-many-promisor-remotes[the partial clone documentation].

This is not new with this effort. In fact this is how multiple
promisor remotes have already been working for around 5 years.

When using LOPs, having the main remote configured using
`extensions.partialClone`, so that it's tried last, makes sense, as
missing objects should only be large blobs that are on LOPs.
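
For example, this could translate into a client configuration like the
following (the URL and the threshold are placeholders), where the main
remote "origin" is the one designated by `extensions.partialClone`, so
that backfill fetches try the LOP first and the main remote last:

----
# Promisor remotes are tried in the order they appear in the config
# file, except that the one named by extensions.partialClone is tried
# last. A partial clone from "origin" normally sets the last two
# variables below already.
$ git remote add lop https://lop.example.com/repo.git
$ git config remote.lop.promisor true
$ git config remote.lop.partialclonefilter blob:limit=1m
$ git config remote.origin.promisor true
$ git config extensions.partialClone origin
----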

This means that a protocol negotiation will likely not happen, as the
missing objects will be fetched from the LOPs, and then there will be
nothing left to fetch from the main remote.

To secure that, it could be a good idea for LOPs to require a token
from the client when it fetches from them. The client could get the
token when performing a protocol negotiation with the main remote (see
section II.6 above).

V) Future improvements
----------------------

It is expected that at the beginning using LOPs will be mostly worth
it either in a corporate context where the Git version that clients
use can easily be controlled, or on repos that are infrequently
accessed. (See the "Could the main remote be bogged down by old or
paranoid clients?" section in the FAQ above.)

Over time, as more and more clients upgrade to a version that
implements the "promisor-remote" protocol v2 capability described
above in section II.6, it will be worth it to use LOPs more widely.

A number of other improvements may also help LOPs be used more widely.
Some of these improvements are within the scope of this document, like
the following:

- Implementing a "remote-object-info" command in the
  `git cat-file --batch` protocol and its variants to allow main
  remotes to respond to requests about large blobs without fetching
  them. (Eric Ju has started working on this based on previous work
  by Calvin Wan.)

- Creating better cleanup and offload mechanisms for main remotes
  and clients to prevent accumulation of large blobs.

- Developing more sophisticated protocol negotiation capabilities
  between clients and servers for handling LOPs, for example adding
  a filter-spec (e.g., blob:limit=<size>) or size limit for
  filtering when cloning, or adding a token for LOP authentication.

- Improving security measures for LOP access, particularly around
  token handling and authentication.

- Developing standardized ways to configure and manage multiple LOPs
  across different environments. Especially in the case where
  different LOPs serve the same content to clients in different
  geographical locations, there is a need for replication or
  synchronization between LOPs.

Some improvements, including some that have been mentioned in the
"0) Non goals" section of this document, are outside the scope of this
document:

- Implementing a new object representation for large blobs on the
  client side.

- Developing pluggable ODBs or other object database backends that
  could chunk large blobs, dedup the chunks and store them
  efficiently.

- Optimizing data transfer between LOPs and clients/servers,
  particularly for incompressible and non-deltifying content.

- Creating improved client side tools for managing large objects
  more effectively, for example tools for migrating from Git LFS or
  git-annex, or tools to find which objects could be offloaded and
  how much disk space could be reclaimed by offloading them.

Some improvements could be seen as part of the scope of this document,
but might already have their own separate projects from the Git
project, like:

- Improving existing remote helpers to access object storage or
  developing new ones.

- Improving existing object storage solutions or developing new
  ones.

Even though all the above improvements may help, this document and the
LOP effort should try to focus, at least at first, on a relatively
small number of improvements, mostly those that are in its current
scope.

For example, introducing pluggable ODBs and a new object database
backend is likely a multi-year effort on its own that can happen
separately in parallel. It has different technical requirements,
touches other parts of the Git code base and should have its own
design document(s).