[Proposal] Change to points based system for S*Node Uptime calculation


#1

Rather than use an absolute time-based (92%/98%) system for calculating payments, use a points-based system. That makes payments easy to calculate AND makes it easy for the team to acknowledge system errors affecting nodes within a payment period, by retroactively applying positive bonus points to counteract the effect of those errors.

  • Each downtime is still tracked with a start time for the 15-minute block, allowing it to be assigned (or not) to a payment master. A column could also be added to the page, since based on the start time we should already know the PM id, or at least the 576-block period, in which a particular downtime/exception occurred.
  • When rechecked after 15 minutes, if the downtime is still open, another points block is generated.

Some examples:
Negative points accrual:

  • -1 point for sys/zend downtimes for each 15 minutes of downtime
  • -1 point for a failed challenge (chal(max)), no further penalties until a subsequent failed challenge from the server (auto-rechallenge)
  • -1 point for cert, peers, node or stake balance for each 15 minutes of downtime
  • -8 points for stkdup, zenver, trkver for each 15 mins of downtime

Note: this means:

  • Securenodes: 8 individual short errors, or combinations of longer or more important errors, account for 2 hours of downtime, which is roughly the allowance under the 92% rule (about 1h55m over a 576-block period)
  • Supernodes: 2 individual short errors, or combinations of longer or more important errors, account for 30 minutes of downtime, which is roughly the allowance under the 98% rule (about 28m over a 576-block period)

Positive points accrual, issued across all nodes rather than to individuals:

  • +1 point for known CloudFlare/Server issues for each 15 minutes of downtime
  • +1 point for each 15 minutes beyond 2.5 minutes per block when long block times are observed due to hashrate/difficulty phenomena (this also buffers against the joinsplit/txid problems such periods cause)
  • +100 points when system is completely borked and all nodes with a valid stake get rewarded (it pays to be generous when you are at fault)
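
Putting the negative and positive examples above together, here is a minimal sketch of how the per-period tally could work - purely illustrative, with the event names, penalty values and qualification thresholds taken as assumptions from the examples above rather than from any actual tracker code:

```typescript
// Illustrative sketch only: event names, penalties and thresholds are assumptions
// based on the examples above, not actual tracker code.

type NodeType = "secure" | "super";

interface PointsEvent {
  kind: string;        // e.g. "sys", "zend", "chal", "cert", "stkdup", or "bonus"
  blocks15min: number; // how many 15-minute blocks the event spans
}

// Negative points per 15-minute block, per the example scoring above.
const PENALTY: Record<string, number> = {
  sys: -1, zend: -1, chal: -1, cert: -1, peers: -1, node: -1, stkbal: -1,
  stkdup: -8, zenver: -8, trkver: -8,
};

// Assumed qualification limits: -8 points ~ 2h for securenodes (92%),
// -2 points ~ 30m for supernodes (98%).
const LIMIT: Record<NodeType, number> = { secure: -8, super: -2 };

function qualifiesForPayment(nodeType: NodeType, events: PointsEvent[]): boolean {
  let points = 0;
  for (const e of events) {
    points += e.kind === "bonus"
      ? e.blocks15min                           // +1 per 15-min block of system credit
      : (PENALTY[e.kind] ?? -1) * e.blocks15min;
  }
  return points >= LIMIT[nodeType];
}

// Example: a securenode with 45 minutes of zend downtime and one failed challenge,
// offset by a 15-minute system-wide CloudFlare credit: -3 - 1 + 1 = -3, still qualifies.
console.log(qualifiesForPayment("secure", [
  { kind: "zend", blocks15min: 3 },
  { kind: "chal", blocks15min: 1 },
  { kind: "bonus", blocks15min: 1 },
]));
```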

#2

Work in progress; happy to hear examples of other points systems as well as constructive criticism.


#3

For reference, this is in response to a post I made in Discord, as seen below:

This latest server change, combined with the recent blockchain disruptions, has been a bit of a nightmare for those in the node servicing business (including myself) as well as for individual node owners. As I host more nodes than average, I tend to see the full spectrum of problems that occur, and I have taken feedback from others before writing this.

As a result of yesterday’s server change these are the observed problems:

  • multiple nodes in downtime with multiple sys errors despite active connections, which won’t close even if the tracker software is restarted (need to change to an NA server instead).
  • even when solved for some nodes, the problem appears to be ongoing, showing up suddenly on other nodes (seemingly linked to a restart of the tracker software, which then triggers an attempt to connect to ts4.eu).

Additional common problems that are not accounted for within the payment system:

  • intermittent connection problems with/caused by CloudFlare or its upstream providers
  • phantom downtimes that appear to be caused by problems with the client/server architecture that node owners have no control over
  • long block times that cause failed challenges (tx not found) AND apparently cause joinsplit problems that need to be solved by a rescan (which must be performed manually)
  • DNS issues on the tracker servers that cause not only failures in hostname/ip resolution but also cause cert validation errors (the fallback to using node connections is clearly not up to the task)
  • the auto-rechallenge system is broken such that you can have an open chal or chalmax error for 4+ hours with no auto-rechallenge (and the node itself believes the original challenge is/was successful)

#4

This is also in response to statements from the dev team that it is hard to determine which nodes are affected by specific issues. I would argue this is the wrong approach and requires too much granularity in decision making.

  • If there is a system issue, act as if all nodes are affected. If nodes are truly down it doesn’t make any difference if they get a few bonus points, because over the 576-block period they will still lose enough points to not qualify (a node that is down all day accrues at least around -96 points across the 96 fifteen-minute blocks, so a handful of bonus points won’t rescue it).
  • Forget about focussing on bad actors for punishment (stick) and instead reward stable owners with a sensible and understandable system for calculating the effect of node-owner-caused and system errors (carrot).

#5

I agree checks should be done periodically rather than based on an ‘always on’ TCP connection. This is especially valid bearing in mind that Cloudflare sits between the nodes and the tracking server and regularly generates false positives and false negatives. The reason I haven’t raised this so far is that Cloudflare seems to only mask relatively short outages (1-2 minutes or less) and the system is still in beta - not a final version.

This raises a different question though - will these periodic tests still be done over 1) an ‘always on’ connection initiated by the node, or 2) will the nodes listen on a port and respond to tracker requests?
In case 1, the whole thing will still be going through Cloudflare as the central server needs DDOS protection. My concern is that the daily false positives of a minute or two could turn into 15 minute chunks with enough bad luck (when the 15 min check coincides with a CF issue). It would be silly for the tracker to take points away for a 15 minute downtime when the information is there that the connection was up shortly before and after the check.
In case 2, the Cloudflare element is removed from the equation and it should all be a lot more accurate and a lot easier on the central servers. I did make the case for this type of system during the initial discussion about the system but I guess it was not feasible to implement.
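
On the case 1 worry (a one-to-two-minute CF blip coinciding with the 15-minute check), a simple mitigation would be to only record a downtime block when a quick re-check also fails. A minimal sketch, with the delay value as an assumption:

```typescript
// Sketch: only count a 15-minute downtime block if the check fails and a
// re-check a couple of minutes later also fails. Purely illustrative.

type CheckFn = () => Promise<boolean>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function confirmedDown(check: CheckFn, recheckDelayMs = 2 * 60 * 1000): Promise<boolean> {
  if (await check()) return false; // node answered: no downtime recorded
  await sleep(recheckDelayMs);     // wait out a short CloudFlare blip
  return !(await check());         // only count it if the node is still unreachable
}
```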

About the point calculations:

That is OK with a couple of caveats:

  1. a re-challenge should be issued during the next 15 min period. If still done daily, the suggested point scoring system would pay nodes that never complete challenges successfully.
  2. there should be a rate limit for re-challenges, to prevent draining the challenge address in case some type of error causes the address to be charged the transaction fee while the challenge still fails - most commonly a timeout, I guess.

I think penalties for challenge failures should depend on the type of the failure. If it is a timeout, apply a point for the 15 minute period when the challenge failed, and another point for every 15 min after that until a new challenge is requested and passed. This would go well with some form of key authenticated API for automatic challenge request submission that does not require email verification.
If it is some other type of failure, which is not necessarily down to node hardware but a result of some more temporary factor, then issue 3 re-challenges every 15 minutes (for a total of 4), then 2 more every 30 min with double negative points for each. If all of these fail the node does not get paid for the day, and challenges are issued again at the start of the next payment period.
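
A minimal sketch of that retry schedule plus a rate limit, as I read it - the step values and the per-period cap are illustrative assumptions, not a concrete proposal for exact numbers:

```typescript
// Sketch of the schedule described above: 3 retries at 15-minute intervals
// (4 challenges total), then 2 more at 30-minute intervals with double penalty,
// then no payment for the day. Values are illustrative only.

interface RetryStep { delayMin: number; penalty: number }

const RETRY_SCHEDULE: RetryStep[] = [
  { delayMin: 15, penalty: -1 },
  { delayMin: 15, penalty: -1 },
  { delayMin: 15, penalty: -1 },
  { delayMin: 30, penalty: -2 },
  { delayMin: 30, penalty: -2 },
];

// Per-period cap so fee-charging failures (e.g. timeouts) cannot drain the
// challenge address: initial challenge + 5 retries, then stop until next period.
const MAX_CHALLENGES_PER_PERIOD = 1 + RETRY_SCHEDULE.length;

// failedAttempts = number of challenges already failed in this payment period.
function nextRetry(failedAttempts: number): RetryStep | null {
  if (failedAttempts >= MAX_CHALLENGES_PER_PERIOD) return null; // node skips payment
  return RETRY_SCHEDULE[failedAttempts - 1] ?? null;
}
```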

Happy with the first point in general, but not sure how Cloudflare problems will become ‘known’ in practice. Do these need to be announced on the CF site, or what objective way of determining CF status will this be based on?
With the second point, I am not sure in what way long block times affect node errors. What outages have you seen that are attributable to this?
As for the last point I agree everyone with a valid stake should be paid, except long term inactive/dead nodes, or ones which were out on the day before and after the system fault. The assigning of points is symbolic and the number doesn’t really matter. I guess it is a given that any points assigned under this scenario are only valid for the current payment period and do not carry over to the next.


#6

Yes, you are right that this should be addressed in one way or another. My preference though would be to solve it at the root by having the tracker system query the nodes directly by initiating a connection which is not intercepted by third parties. Has anyone looked into the security implications of this interception? I am not saying Cloudflare is in any way malicious, but why do we need to put up with a third party intercepting the TLS connections when we can have a more efficient system that does not require paid DDOS protection? I strongly hope the new system which is being developed addresses this, and wish there was more information available on what is being done in terms of design decisions.

These in my experience are mostly down to the above Cloudflare issues, but I have occasionally seen short-lived zend failures which I can’t find the reason for in the logs etc. Agree these need to be addressed too but again would prefer the work to be targeted at fixing the root cause. Of course, if some issues are found to be difficult to solve then a system which compensates for these faults would be helpful while the core issues are being resolved.

This is a good example of an error where a workaround is needed. Long block times will always exist, so if the ‘tx not found’ error can’t be fixed it should be offset by the tracker software. This workaround does not have to affect other reasons for failed challenges.
I have seen joinsplit errors in the zend debug log but haven’t linked them to a failure to start zend and the need for a rescan. Of course that doesn’t mean the issue is not there, and it is one that is particularly unpleasant as rescans take a long time, and will take even longer as the chain grows.

From what I have seen in the white paper this should be addressed by the new system, where peer tracking will be entirely based on node connections. How much effort should be spent on working around any DNS issues depends on how soon the new system is expected to go live.

Yes - I didn’t even know there was an auto-rechallenge system in place… it does nothing from what I can tell, but maybe I am missing some details on how the system operates. That last one is another major issue for me with the current tracker system - the code is not on the official github repo so I don’t really know what it does. I could only find the code for the node tracker app there.


#7

In a way, punishing bad actors rewards the good ones as the reward pool gets re-distributed. We need to have both carrot and stick :slight_smile:


#8

Wow thanks for the detailed response @nikmit!

Must admit my proposal was the work of 1 hour (and several months of irritation) so I never intended it to be fully formed from the get-go, something of a fresh-water pearl :slight_smile:

Let’s see if I can discuss it all; mostly I just agree with what you said.

CloudFlare:

  • Totally agree with your earlier and later points. I don’t understand this part either, because it can be demonstrated at any time, even when there aren’t upstream problems, that running through CF loses 3-7% of packets, which is clearly unsuitable for always-on connections. In bad times this rises above 10%.
  • also agree with the MITM argument; CF is likely a good actor in the system but compromises are possible.

15 minute periods:

  • Could be 5, could be 1, doesn’t really matter as long as you scale the positive points appropriately also.
  • Agreed, creeping over the period barrier is the biggest downside I see to this system.

Always on vs periodic check:

  • This is also what I have discussed with @devman. I just don’t think that the node.js socket.io stuff is robust enough to make it through a 24 hour period with no dropped connections, especially when routed through CF; however, there is a 45 second buffer before a downtime is registered, so this should be fine.
  • However there is a second issue, unrelated to CF (and I think this relates to another of your points): there are also times when the tracker seems not to catch that the node is actually connected and raises an error when there shouldn’t be one, and the node socket itself doesn’t seem to recognise the “hangup” either. For this reason I would prefer periodic connections to check in; unfortunately I believe @devman has said the original security/threat model of the payment system called for an always-on connection so…

Auto-rechallenge:

  • Correct, it should, but it doesn’t currently - and whose responsibility is that? I actually think this is a great way to demonstrate the need for a robust system that does what it says on the box, and conversely a terrible way to “punish” people for what is potentially just a dropped connection due to CloudFlare or long block times due to blockchain/miner behaviour.
  • Absolutely. The fact that the tracker will reissue challenges when a joinsplit is detected but does not have a mechanism to restart zend with -rescan is worrying; still, at the rate it is currently issuing rechallenges it is not such a big deal, but it would be worse if that were fixed.

This is actually already available in the form “https://securenodes$homeServer.zensystem.io/$tAddrOfNodeNotStake/$nodeId/send
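
For anyone who hasn’t used it, a hypothetical usage sketch of that URL pattern - the placeholder values, and the assumption that a simple request to the endpoint is enough, are mine and unverified, since the tracker-side code isn’t public:

```typescript
// Hypothetical usage of the challenge-send URL quoted above. Placeholder values,
// and the HTTP method/response format, are unverified assumptions.
const homeServer = "1";                              // placeholder home-server suffix
const tAddr = "znXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"; // node t-address (NOT the stake address)
const nodeId = "12345";                              // placeholder node id

const url = `https://securenodes${homeServer}.zensystem.io/${tAddr}/${nodeId}/send`;

// e.g. with the built-in fetch in newer Node versions:
// fetch(url).then(r => r.text()).then(console.log);
```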

The same way we test when this becomes really obvious: have regular multi-pings from different places over IPv4 and IPv6, with a threshold above which an auto credit is issued; this could be managed fully automatically.
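
A rough sketch of what that automatic credit could look like - aggregate probe results from several vantage points per 15-minute block and flag any block where the failure rate crosses a threshold (the 10% figure and the data shape are illustrative assumptions):

```typescript
// Sketch: collect probe results from multiple vantage points and return the
// 15-minute blocks for which every node should receive a +1 credit.
// Threshold and data shape are illustrative assumptions.

interface ProbeResult { vantage: string; timestampMs: number; ok: boolean }

const FAILURE_THRESHOLD = 0.10; // e.g. more than 10% of probes failed

function creditBlocks(results: ProbeResult[], blockMs = 15 * 60 * 1000): number[] {
  const byBlock = new Map<number, { ok: number; fail: number }>();
  for (const r of results) {
    const block = Math.floor(r.timestampMs / blockMs);
    const counts = byBlock.get(block) ?? { ok: 0, fail: 0 };
    if (r.ok) counts.ok++; else counts.fail++;
    byBlock.set(block, counts);
  }
  return [...byBlock.entries()]
    .filter(([, c]) => c.fail / (c.ok + c.fail) > FAILURE_THRESHOLD)
    .map(([block]) => block);
}
```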

There are actually two parts, each of which is a “normal” phenomenon: long block times because of hash/difficulty swings and opportunistic mining pools (wish we could punish this behaviour too), and orphaned tx due to minor chain-splits, which by themselves should be expected but when coupled with long rechallenge periods are disastrous.

Long block times:

  • The tracker expects a reply tx within x minutes of being notified that it has been sent; if it does not receive it because the block time is 1 hour, an exception is raised. The tracker should instead recognise that there is a long block time and cancel the challenge (or at least not raise a ‘chal’ exception) - see the sketch after this list. This happened to my nodes a lot recently, so it is not an edge case while long block times remain common.
  • If the block time is so long that a rechallenge (either manual or auto) puts both challenge tx into the same block, and one passes while the other fails, an exception is often raised - but this really is an edge case (again though, I have seen it happen).
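
A minimal sketch of the workaround suggested in the first bullet - before raising a ‘chal’ exception, check whether recent block intervals are abnormally long (the threshold and input format are illustrative assumptions):

```typescript
// Sketch: suppress the 'chal' exception while the chain itself is slow.
// Threshold and input format are illustrative assumptions.

const TARGET_BLOCK_SEC = 150; // Zen targets 2.5-minute blocks

// recentBlockTimes: unix timestamps (seconds) of the last few blocks, oldest first.
function longBlockTimesObserved(recentBlockTimes: number[]): boolean {
  if (recentBlockTimes.length < 2) return false;
  const intervals = recentBlockTimes.slice(1).map((t, i) => t - recentBlockTimes[i]);
  const avg = intervals.reduce((a, b) => a + b, 0) / intervals.length;
  return avg > 2 * TARGET_BLOCK_SEC; // e.g. average interval above 5 minutes
}

function shouldRaiseChalException(replyTxSeen: boolean, recentBlockTimes: number[]): boolean {
  if (replyTxSeen) return false;
  // Don't penalise the node owner while block times are abnormally long.
  return !longBlockTimesObserved(recentBlockTimes);
}
```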

Orphaned tx:

  • This happens a lot and results in “no tx found” errors I believe; again, when coupled with long auto-rechallenge times it is disastrous. Whose fault is it if the challenge tx is issued to the shorter chain? Certainly not the node owner’s, so they shouldn’t be punished for it.

Those are not your joinsplit errors; they are other people’s attempted z transactions, which your node is rejecting as double-spend attempts. They have no effect on your node.

I wish I shared your optimism; we have been through several rounds of tweaking on this and it is still happening. (Originally the system used Google DNS, which had rate limiting - this was very bad at one point, with nearly all nodes showing cert errors. Fortunately this is better now.)

Lol, which just goes to show how effective it is.

Yep, it is, in my opinion, short-sighted.

Oh for sure, we can’t assume 100% of people are good actors, but what a game-theory study of this system should show is that the majority are, and then your system should not punish those unfairly. I think there is too much emphasis on catching bad actors.


#9

I thought your post was very good and on an important subject I care about so maybe I got carried away a bit :slight_smile:
You clearly have done good work analysing CF, blockchain and tracker issues, thanks for the clarifications about long block times and the pointer at the automatic challenge sending mechanism!

I exchanged a couple of posts with @devman in this thread about the design of the system when it was initially being created. I think we need to know what the “security/threat model” is in detail as well as how the system handles it; security through obscurity can be useful temporarily but shouldn’t be a final solution. Especially for an open source project… I still can’t see how the current system is more secure than one which periodically polls the nodes but some weaknesses are becoming obvious.

Without having tested this myself, I will only say that ping is usually handled differently to TCP traffic - on routers, switches and probably CF servers too. You would need to use something like the TCPPing probe from Smokeping for better accuracy, or some custom-made probe that does a SYN/SYN-ACK/FIN sequence. And even then there will likely be a range of issues which can’t be detected, e.g. anything affecting long-lived connections, problems with the connection forwarding, or CF mislabelling a node connection as a DDoS attack. The real solution to this is to stop trying to monitor nodes through a brick wall, but I can’t for a second blame you for suggesting a decent workaround to a problem which affects node operators daily.
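
For illustration, a rough TCPPing-style probe in node - it measures how long a full TCP connect (SYN/SYN-ACK) takes rather than relying on ICMP; host, port and timeout are placeholders, and this is a sketch, not the Smokeping probe itself:

```typescript
// Sketch of a TCP connect-time probe as an alternative to ICMP ping.
// Host/port/timeout are placeholders; illustrative only.
import * as net from "net";

function tcpConnectTimeMs(host: string, port: number, timeoutMs = 3000): Promise<number | null> {
  return new Promise((resolve) => {
    const start = Date.now();
    const socket = net.connect({ host, port });
    const finish = (rtt: number | null) => { socket.destroy(); resolve(rtt); };
    socket.setTimeout(timeoutMs);
    socket.once("connect", () => finish(Date.now() - start)); // SYN/SYN-ACK completed
    socket.once("timeout", () => finish(null));               // no answer in time
    socket.once("error", () => finish(null));                 // refused / unreachable
  });
}

// e.g. tcpConnectTimeMs("example.com", 443).then(ms => console.log(ms ?? "unreachable"));
```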

I hope we get some attention here from the leadership and dev teams as I am sure the work on the new tracker system can benefit from some scrutiny from you, me and the Zen community in general.


#10

Yes of course you are correct; however, this method has been used to demonstrate problems with upstream providers of CF previously, and a suite of regular tests would no doubt be even more effective (and could easily be automated).

There is attention, without commitment, and there is a range of things they are already working on to improve the infrastructure and gain more control over how/when/what/why happens to the servers, so I think we will see improvements in reliability.


#11

So far you have only run these tests when there were issues, to show that they were caused by CF - and sometimes they were. Once you start doing it constantly, it can easily happen that you will find ‘CF problems’ without mass node failures, for reasons already mentioned. I guess there will be some trialling before this goes live, where mass node errors are monitored alongside CF so the correlation can become obvious?


#12

Could be done for sure; using the all-nodes API plus a regular suite of tests we could certainly do correlations.
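
A small sketch of what such a correlation could look like - per 15-minute block, take the share of nodes reporting errors (from the all-nodes API) and the share of failed probes, then compute a plain Pearson correlation between the two series (the data shape is an assumption):

```typescript
// Sketch: Pearson correlation between per-block node error rates and per-block
// probe failure rates. A value near 1 would suggest mass node errors line up
// with CF problems. Data shape is an illustrative assumption.

function pearson(xs: number[], ys: number[]): number {
  const n = Math.min(xs.length, ys.length);
  const mean = (a: number[]) => a.slice(0, n).reduce((s, v) => s + v, 0) / n;
  const mx = mean(xs), my = mean(ys);
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - mx) * (ys[i] - my);
    dx += (xs[i] - mx) ** 2;
    dy += (ys[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

// e.g. pearson(nodeErrorRatePerBlock, probeFailureRatePerBlock)
```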