Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash common fields for message propagation #792

Open
3 of 6 tasks
Kubuxu opened this issue Dec 12, 2024 · 30 comments
Open
3 of 6 tasks

Hash common fields for message propagation #792

Kubuxu opened this issue Dec 12, 2024 · 30 comments
Assignees

Comments

@Kubuxu
Copy link
Contributor

Kubuxu commented Dec 12, 2024

The most significant offender here is ECChain, which is the same (or similar) for many messages and can end up large.

The idea is to hash the ECChain before sending the message and let the receiver resolve it from the hash.
In many scenarios, we will know which ECChain corresponds to the hash because we will have the same one, but for the less common case, we need to propagate these ECChains through pubsub. This can be done on a separate topic (or maybe even on the same topic).

That propagation channel should limit the propagations to only the current instance, so the ECChain message should carry the instance number.

There are a couple of architectural questions:

  1. Should GPBFT be aware of it, or should it continue to operate on GMessage as it is today?
    Decision: GPBGT should be unaware
  2. What should we do with messages that ECChain hashes we cannot resolve?
    Decision: Buffer them, and the queue should reconstitute them when ECChain arrives. Have a metric about it.
  3. How do we limit the ECChain lookup buffer?
    Partial decision: Hard limit, but we need strategies to keep important ones if we get to the limit.
  4. Do we need a separate topic, or is it better to use the same one?

Notes:
After receiving an ECChain, we should add all prefix chains to the lookup buffer. This provides more information and improves efficiency. We should also preload the buffer with our own ECChain.

Tasks

Preview Give feedback
  1. Kubuxu
@github-project-automation github-project-automation bot moved this to Todo in F3 Dec 12, 2024
@Kubuxu Kubuxu changed the title Hash common fields for propagation Hash common fields for message propagation Dec 12, 2024
@masih
Copy link
Member

masih commented Dec 12, 2024

@Stebalien what is your take on question No. 4? Do you think it's better to separate the topic or use the same one with a message ID function that's smart enough?

@masih
Copy link
Member

masih commented Dec 12, 2024

Summary of discussions in standup:

  • Because the chains advertised in QUALITY phase are the superset of values that could ever be decided in an instance, we could limit publication of ec chains to QUALITY proposals only.
  • Each node can then calculate the hash of every prefix of those proposals to then infer other hashes that the instance may be swayed towards.
  • This then means we do not need to change the chain that is being broadcased on the ec chain topic relative to the gpbft progress.

@Stebalien
Copy link
Member

what is your take on question No. 4? Do you think it's better to separate the topic or use the same one with a message ID function that's smart enough?

IMO, use a separate topic. Having to decode messages in the message ID function is, IMO, a bad idea in general. But I don't feel strongly about this.

Should GPBFT be aware of it, or should it continue to operate on GMessage as it is today?

Don't we need quality information to support the ECChains? Otherwise, an attacker can just spam EC Chains and force everyone to store every possible EC chain because they'll have no idea which ones are actually useful.

That is, I think what we need is an extra phase after quality.

@Kubuxu
Copy link
Contributor Author

Kubuxu commented Dec 12, 2024

Don't we need quality information to support the ECChains? Otherwise, an attacker can just spam EC Chains and force everyone to store every possible EC chain because they'll have no idea which ones are actually useful.

We can limit it and discard chains aggressively after quality.

@Stebalien
Copy link
Member

I don't really see how to do this securely (e.g., a flood can easily mask all real values). Especially before/during quality.

Are we relying on some form of continuous rebroadcast?

@Kubuxu
Copy link
Contributor Author

Kubuxu commented Dec 13, 2024

You are correct that it is tricky, but I think it is possible with code that is defensive enough.
Some raw thoughts:

  • wait to expand the ECChain into sub-hashes until the ECChain reference appears in a GMessage. This significantly reduces the memory amplification the spam adversary can get.
  • proactively notify ECChain receiver that we are looking for some ECChain hash based on a mostly valid GMessage, so even with a significant spam, we can pick out good values if they get re-broadcasted
  • We can definitely re-broadcast these with rebroadcasts. Another option is to broadcast the quality first and the ECChain a bit later

@masih
Copy link
Member

masih commented Dec 13, 2024

This significantly reduces the memory amplification the spam adversary can get.
proactively notify ECChain receiver that we are looking for some ECChain hash based on a mostly valid GMessage, so even with a significant spam, we can pick out good values if they get re-broadcasted

How would we know if a message is valid at all. Not knowing the chain when GMessage arrives opens up another attack vector about invalid messages that cannot be validated fully until hashes arrive. The adversary can also immediately publish both gmessage with the hashes and the value associated to it across both topics.

@masih
Copy link
Member

masih commented Dec 13, 2024

An alternative design:

  • Keep the QUALITY messages as they are: include full chain in the QUALITY message. That phase has a timeout + early completion on strong quorum of proposal.
    • On receipt of a QUALITY message, each node would calculate sub-hashes of the chain it has received.
  • From then on, consecutive messages as successive phases would use hash of the chain only.
  • Nodes that have received the QUALITY messages can infer every hash they see from then on.
  • Nodes that have not would rely on QUALITY re-broadcast to receive them and can only progress wit base decision until they do so.
    • If they never receive the QUALITY message then base decision is most likely the right decision anyway? considering the prior assumption is that no quorum of proposal for QUALITY phase was possible from the point of view of those nodes.
    • If they do then they will proceed with the regular swaying etc. as implemented today.
  • No secondary topic required.

This way the spam attack verctor would disappear (i.e. won't be any worse than it is today). The catch, of course is: larger messages in the QUALITY phase. This approach in terms of bandwidth consumption would certainly be far better than what we have today.

Now, the question would become: would the QUALITY phase in this scenario be efficient enough to not put us back at square one considering the sensitivity of gpbft progress to it? I think yes (but we can test this). Because:

  • We now have compression which should reduce the size of QUALITY messages significantly
  • Because successive phases use far less bandwidth the chances are QUALITY rebroadcasts will get delivered more effectively.
  • If still not sufficiently omptimised, we have the option to introduce a dedicated timeout for QUALITY phase that is independent of Delta.
    • In passive testing I did observe that with high delta quality phase did not wait for full delta timeout and indeed progressed once finding strong quorum of proposal. Which leaves me to believe that congestion as a whole played a bigger role in the lack of overall instance progress. Data is there we can analyse it more closely to say for sure this was the case.

@Kubuxu
Copy link
Contributor Author

Kubuxu commented Dec 13, 2024

Addressing just this:

How would we know if a message is valid at all. Not knowing the chain when GMessage arrives opens up another attack vector about invalid messages that cannot be validated fully until hashes arrive.

If the ECChain hash is the same as used in the signing process, we can check the signature.

@BigLep BigLep moved this from Todo to In progress in F3 Dec 13, 2024
@Kubuxu
Copy link
Contributor Author

Kubuxu commented Dec 13, 2024

From @masih: We have data on how quality has progressed. Was it sooner than delta, then the quorum was reached.
Or also differently, round 0 prepares with the non-base proposal.

@Kubuxu
Copy link
Contributor Author

Kubuxu commented Dec 13, 2024

The investigation into Quality and Prepare messages revealed that we could not reliably propagate quorum of large Quality messages within 3min.

@masih
Copy link
Member

masih commented Dec 13, 2024

Here is an example query to use in terms of checking to what extent propagation of QUALITY messages was an issue:

SELECT DISTINCT (Vote).Instance
FROM messages
WHERE (Vote).Phase = 'PREPARE'
  AND (Vote).Round = 0
  AND array_length((Vote).Value) > 1
  AND networkname = 'filecoin/45'

The issue we have with the data is that only network 44 progressed at a reasonable velocity past bootstrap and left to run at scale for a day. In that network the delta was sufficiently large to dilute the answer to the question being asked.

So we are back at investigating how to combat the spam attack vector in the 2-topic design.

@masih
Copy link
Member

masih commented Dec 13, 2024

Summary of discussions from standup:

  • We have two topics: one as is for gmessage, and one that is dedicated to propagation of ECChain values.
  • The gmessages contain the hashes of the values, and need to discover what ECChain they correspond to in order to fully validate a messge; but messages can partially be validaed without the mapping.
  • The participant maintains a map of hash -> ECChain with a capped size (actual size TBD but small enough to not risk memory exhaustion)
  • The map gets populated as ECChain mapping messages on the secondary topic arrive... but once a gmessage arrives for a known hash, the priority of hash -> ECChain storage for that hash increases:
    • Effectively, the participant prioritises on storing the mappings that it knows it needs to know because of the presence of a gmessage.
    • In the absense of gmessages mapping storage of all other hashes is opportunistic.
  • If there was a hash that was indeed useful but got evicted before gmessage arrived, then we will primarily rely on re-broadcast of messages on the ECChain topic to find out about it.
    • Now, because in normal operation the variety across ECChains is limited we think that relying on re-broadcast is sufficient because, the chances are there are a group of nodes that are publishing the same ECChain. Therefore, it does not matter who we hear the hash mapping from as long as there is someone who publishes it and it eventually gets to us.
    • Re-broadcast interval/intensity in the ECChain topic must be determined relative to delta, such that ideally ECChain messages arrive before the phase end within an instance.
    • To work around pubusb deduplication the broadcasted messages contain a rounded timestamp in addition to instance + ECChain. No messages from a future timestamp are allowed and will be rejected.
      • This is to protect against an advanced adversary trying to exploit the order of ECChain message arrival such that a participant can never learn the true chain by dropping good ECChain messages too quickly once the buffer is full.
  • The priority of mappings stored in the capped buffer changes depending on the arrival of gmessages: a seen hash has higher priority in storage than an unseen one. This can be further optimised by taking into account the current phase, and its proposal.
  • As redundancy measure, when gpbft rebroadcasts a gmessage, its corresponding ecchain will also be rebroadcasted.
    • note that upon arrival of quality ecchain sub chain hashes can be calculated locally. but we do re-broadcast chains just for resilience.

@masih
Copy link
Member

masih commented Dec 13, 2024

@Stebalien could you read though the comment above and let us know what you think please ?

@Stebalien
Copy link
Member

I'd like to tackle all the spam vectors we currently have, not add more. But as long as we have rebroadcast, this should work.

Re-broadcast interval/intensity in the ECChain topic must be determined relative to delta, such that ideally ECChain messages arrive before the phase end within an instance.

Why?

The priority of mappings stored in the capped buffer changes depending on the arrival of gmessages: a seen hash has higher priority in storage than an unseen one. This can be further optimised by taking into account the current phase, and its proposal.

I'm confused. Either we know we need it or we don't know we need it. If we need it, we shouldn't put it into some space limited cache, we should save it permanently. This permanent storage will be limited to 100 tipsets per valid participant.


Honestly, I'm convinced we should do one of:

  1. Masih's proposal to send tipsets only in quality but optimized further to only send the hash of each tipset (max 3200 bytes).
  2. Do quality first, then broadcast ECChains. This will cost an extra round but will solve the spam issue entirely. I assume we'll also be able to reduce the delta if we do this, so it may not actually cost us any extra time.

@Kubuxu
Copy link
Contributor Author

Kubuxu commented Dec 13, 2024

Direct responses, I will think about your proposed options.

Re-broadcast interval/intensity in the ECChain topic must be determined relative to delta, such that ideally ECChain messages arrive before the phase end within an instance.

We need to resolve ECChains within the quality round otherwise everyone could PREPARE on base.

I'm confused. Either we know we need it or we don't know we need it. If we need it, we shouldn't put it into some space limited cache, we should save it permanently. This permanent storage will be limited to 100 tipsets per valid participant.

We expect very few ECChains per instance, which protects against an adversary who is a sybil-ed participant. The upper size of the cache could be smaller than 100N. However, participants can also get swayed, so it would probably have to be 100 N* Rounds* Phases if we had it on a per participant basis.

@Stebalien
Copy link
Member

We need to resolve ECChains within the quality round otherwise everyone could PREPARE on base.

After quality, before prepare.

We expect very few ECChains per instance, which protects against an adversary who is a sybil-ed participant. The upper size of the cache could be smaller than 100N. However, participants can also get swayed, so it would probably have to be 100 N* Rounds* Phases if we had it on a per participant basis.

It protects against nothing because we have ho idea which ECChains are valid and which are invalid. What am I missing here?

@Stebalien
Copy link
Member

What if we took a hybrid approach and only split out the chains during catch-up, including them in QUALTIY otherwise? Basically:

  • If we have more than N tipsets in quality, perform quality on a hash.
  • If at any point we see 2/3rds of the power vote for some hash, we start an ECChain round.

I haven't thought this through completely so I know there are likely some hidden complexities, but I don't think it's any more complex than the current proposal.

@Kubuxu
Copy link
Contributor Author

Kubuxu commented Dec 13, 2024

It protects against nothing because we have ho idea which ECChains are valid and which are invalid. What am I missing here?

During quality, we learn which chains are valuable, so we proposed this "within-phase rebroadcast". During normal operations, ECChains land in the buffer, and are utilised by quality messages. If an adversary decided to spam, we get GMessage, we don't know it's ECChain, we note the hash of it until first round of ECChain rebroadcasts hits then we are able to resolve the ECChain hash and proceed.

Each time an ECChain proves useful in a valid message, its priority would get bumped up, so it won't get discarded. That is to protect against 33% adversary that decided to start spamming valid GMessages, each with unique ECChain.

@Kubuxu
Copy link
Contributor Author

Kubuxu commented Dec 13, 2024

If at any point we see 2/3rds of the power vote for some hash, we start an ECChain round.

I don't see what exactly that buys us, it is all the complexity, plus more with having to support both ways of doing things.


Honestly, I'm convinced we should do one of:

  1. Masih's proposal to send tipsets only in quality but optimized further to only send the hash of each tipset (max 3200 bytes).
  2. Do quality first, then broadcast ECChains. This will cost an extra round but will solve the spam issue entirely. I assume we'll also be able to reduce the delta if we do this, so it may not actually cost us any extra time.

Re 1. I will look if we can convince ourselves that the gain there will be significant enough to solve the bandwidth, it could be an option, although it brings in a lot of the complexity of having to communicate hashes for smaller gain.
Re 2: That would remove the fast-forward in quality, but it should be okay if it allows us to bring the delta down. It also reduces complexity significantly.

@Stebalien
Copy link
Member

Each time an ECChain proves useful in a valid message, its priority would get bumped up, so it won't get discarded. That is to protect against 33% adversary that decided to start spamming valid GMessages, each with unique ECChain.

But it's not possible to spam quality messages.

Each time an ECChain proves useful in a valid message, its priority would get bumped up, so it won't get discarded. That is to protect against 33% adversary that decided to start spamming valid GMessages, each with unique ECChain.

I'm still not understanding the probability part of this. Either it's included in a valid quality message and is therefore required or it's not and is therefore spam.

@Stebalien
Copy link
Member

If at any point we see 2/3rds of the power vote for some hash, we start an ECChain round.

I don't see what exactly that buys us, it is all the complexity, plus more with having to support both ways of doing things.

My main proposal is to run a quality round first, collecting at most one quality message from all participants. Then run an EC chain round, accepting all EC chains that match hashes we expect from the quality round. This avoids having to try to "guess" which EC chains are valid and completely solves all the spam issues.

Unfortunately, this means adding an additional round. We can avoid that additional round by including short ECChains inside quality messages, skipping the ECChain round if it looks like it's unnecessary.

@masih
Copy link
Member

masih commented Dec 16, 2024

While we are finalising the design, I am going to start work on this by implementing hash based gpbft messaging and wiring in the on-the-side lookup from hash to chain.

@masih
Copy link
Member

masih commented Dec 16, 2024

My main proposal is to run a quality round first, collecting at most one quality message from all participants. Then run an EC chain round, accepting all EC chains that match hashes we expect from the quality round.

This makes sense to me. My thoughts:

  • Most likely we want some chain-rebroadcast still in case the chain does not get through the network in time.

  • Shortcut: Instances can progress without any propagation of chain if:

    1. The hash is known to the current node, i.e. is the hash it has proposed to the network.
    2. There is a strong quorum for that hash.

    For nodes that do not know the hash, the can blindly proceed with the instance as long as they observe strong quorum for the hash as required by GPBFT, and there is no swaying mid instance. They can then learn about the mapping of hash -> chain via cert exchange.

@masih
Copy link
Member

masih commented Dec 16, 2024

Another idea: Adaptive Poll-based Chain exchange protocol: chainex

Utilise a polling mechanism similar to cert exchange but for chain, that populates a look-aside lookup table to map a hash to a chain proposal for hashes discovered from the received GPBFT messages.

The Gist:

  • GPBFT message progress/propagate as they are, with one exception: ec chain proposal is replaced by its hash.
  • In parallel, nodes poll chains from one-another to fill the gaps in the lookup table of hash -> chain.
  • Because it is the nodes that polls we have control over what to ask, when to ask it etc. So nodes can be highly optimal on what they ask based on the progress of GPBFT/what they observe of the network.
    • The polling candidate selection can be further optimised by asking nodes that sent GPBFT messages containing the hash, i.e. an implicit assumption that if a node sends a GPBFT message containing a hash it most likely knows the chain that corresponds to it.
  • This way:
    • no modification to core protocol is necessary: we treat an "unknown mapping of a hash -> chain" simply as "message has not arrived yet".
      • ... with the caveat that we can potentially shortcut things similarly to what I proposed in my previous comment above (but this could totally be a YAGNI; keep reading).
    • no spamming, because we poll.
    • It can potentially be more efficient than any of the designs suggested so far. Because of the inversion of control in discovering a chain associated to a hash.
    • This could work with the existing delta values (order of seconds) because of parallel polling of chain. And if it does not, the tuning knob to adjust it is simple: increase delta to give time for parallel polling to do its job.
      • If still not enough, we can increase the degree of parallel polling based on some node selection heuristic (e.g. historical latency, participation in signed certs and so on).
      • Key take away here is that: tuning knobs are less/simpler and therefore easier to find an optimal one using passive testing.

@masih masih self-assigned this Dec 16, 2024
@Kubuxu
Copy link
Contributor Author

Kubuxu commented Dec 16, 2024

In parallel, nodes poll chains from one-another to fill the gaps in the lookup table of hash -> chain.

This one small point encapsulates a lot of complexity and possible latency.

@masih
Copy link
Member

masih commented Dec 16, 2024

encapsulates a lot of complexity and possible latency

I'd like to flesh those out in the standup and dig deeper into the comparison between approaches.

@Kubuxu
Copy link
Contributor Author

Kubuxu commented Dec 16, 2024

My main proposal is to run a quality round first, collecting at most one quality message from all participants. Then run an EC chain round, accepting all EC chains that match hashes we expect from the quality round. This avoids having to try to "guess" which EC chains are valid and completely solves all the spam issues.

Unfortunately, this means adding an additional round. We can avoid that additional round by including short ECChains inside quality messages, skipping the ECChain round if it looks like it's unnecessary.

I can get behind an extra phase, or maybe an extra long quality round, as ECChains "round" could be transparent to gpbft.

@Stebalien
Copy link
Member

Most likely we want some chain-rebroadcast still in case the chain does not get through the network in time.

I agree.

Shortcut: Instances can progress without any propagation of chain if:

As long as it doesn't introduce timing issues... I think it's fine because we need to see a strong quorum of votes to time out in prepare. Although it does increase our chances of failing in prepare (slightly).

I can get behind an extra phase, or maybe an extra long quality round, as ECChains "round" could be transparent to gpbft.

I think we can do that. Basically, have a separate module that receives both quality messages and EC Chains:

  1. Each time it receives a quality message, it records that we've received a quality message from that specific participant for the instance to prevent spam.
  2. If the quality message has no known chain, it buffers it and waits for an EC chain.
  3. Each time it receives a chain, it looks up any quality messages waiting for said chain. If found, it records the chain in the set of known chains and immediately forwards any buffered quality messages. Otherwise, it drops the EC chain and relies on rebroadcast later.
  4. Finally, after some timeout (delta?), this module broadcasts the EC chain from our proposal if we didn't send it inline with the initial quality message.

This should be fine because PREPARE will pause until we receive some quorum, or evidence that we cannot reach quorum.

@BigLep
Copy link
Member

BigLep commented Dec 18, 2024

2024-12-18 (some) standup notes:

masih added a commit that referenced this issue Dec 19, 2024
Implement chain exchange protocol over pubsub as a mechanism to
propagate `ECChain` across the network with reasonable spam protection.

To protect against spam the mechanism employs two separate caches for
chains that are generally discovered across the network and the ones
explicitly looked up or broadcasted by the local node. Both caches are
capped LRU, where the LRU recent-ness is used as a way to prioritise
chains we cache while keeping the total memory footprint fixed. This
approach is not the most memory efficient but is simpler to implement
as the LRU encapsulates a lot of the complexity.

The code has a lot of TODOs as places to improve or question to the
reviewer. To action most of the TODOs further refactoring across the
code is needed which is intended to be actioned in separate commits.

The code path introduced here is not integrated into F3 host; future PRs
will iteratively integrate the mechanism across F3 host and other
places.

Part of #792
masih added a commit that referenced this issue Dec 19, 2024
Implement chain exchange protocol over pubsub as a mechanism to
propagate `ECChain` across the network with reasonable spam protection.

To protect against spam the mechanism employs two separate caches for
chains that are generally discovered across the network and the ones
explicitly looked up or broadcasted by the local node. Both caches are
capped LRU, where the LRU recent-ness is used as a way to prioritise
chains we cache while keeping the total memory footprint fixed. This
approach is not the most memory efficient but is simpler to implement
as the LRU encapsulates a lot of the complexity.

The code has a lot of TODOs as places to improve or question to the
reviewer. To action most of the TODOs further refactoring across the
code is needed which is intended to be actioned in separate commits.

The code path introduced here is not integrated into F3 host; future PRs
will iteratively integrate the mechanism across F3 host and other
places.

Part of #792
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In progress
Development

No branches or pull requests

4 participants