Server health tests are a little unfair due to absolute timing values #43

Closed
opened 2018-04-03 22:22:16 +02:00 by realkinetix · 63 comments
realkinetix commented 2018-04-03 22:22:16 +02:00 (Migrated from github.com)

As per [this thread](https://social.isurf.ca/display/c443a55c425ac2a2a2c3863252458336), servers that are geographically further from a directory server get lower health scores due to the latency involved.

There's been some discussion on how to adjust the absolute curl timing value, such as:

  • Grab the average or median value of a series of pings to the server in question, and subtract that from the curl time (perhaps with a correction factor - maybe double that ping value before subtracting due to TCP handshake response time, etc.)

There may be more that could be done, but some fairly simple 'corrective' factors applied to that absolute curl process time would help with health-scoring the servers that are further away network-wise.

MrPetovan commented 2018-04-03 22:32:27 +02:00 (Migrated from github.com)

Here's my plan to change the policy:

  • Ping all the known reachable servers
  • Divide the average response time by the average ping latency
  • Create break points based on percentiles on the resulting data
  • Grade servers on a curve based on this data so that the absolute value doesn't matter as much.

My first impulse would be to do it once, take the resulting break points and hard-code them in the health scoring process, but ideally the break points should be updated regularly. The issue is that this requires pinging every known server, which takes a significant amount of time, especially with 5 tries.

Maybe it could be a rolling thing where the ping is checked at the same time as the server health, and the break points are moved less regularly.

Another idea would be to have a fully proportional score change based on min/max values; this wouldn't require any break points.
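
A minimal sketch of what that proportional idea could look like (illustrative only, not existing directory code; the function name and the -10..+10 range are assumptions):

```php
<?php
/**
 * Illustrative sketch: map a server's measured time onto a -10..+10 score
 * proportionally between the fastest and slowest known servers, so no fixed
 * break points are needed.
 */
function proportionalSpeedScore(float $value, float $min, float $max): int
{
    if ($max <= $min) {
        return 0; // No spread in the data set, treat everything as neutral.
    }

    // 0.0 for the fastest server, 1.0 for the slowest.
    $position = ($value - $min) / ($max - $min);

    // Fastest gets +10, slowest gets -10, linear in between.
    return (int) round(10 - 20 * $position);
}
```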

AndyHee commented 2018-04-27 17:35:59 +02:00 (Migrated from github.com)

The following is a direction proposed by a trained econometrician, whom I consulted about how to conceptually handle the issue of determining meaningful values for "server health".

### Conditional distribution

**request_time** (Y) is a random variable and we are interested in how it relates to another variable called **avg_ping** (X).

The most we can know about how X affects Y is contained in the conditional distribution of Y given X. This information is summarised by the conditional probability function, defined by:
![image](https://user-images.githubusercontent.com/28751375/39370173-78d9741a-4a68-11e8-86fc-51b2e53fa775.png)
for all values of x such that:
![image](https://user-images.githubusercontent.com/28751375/39370199-80ec443e-4a68-11e8-9e03-2f915639cb9a.png)

The interpretation is most easily seen when X and Y are discrete. Then,
![image](https://user-images.githubusercontent.com/28751375/39370291-b93a0a7e-4a68-11e8-964a-b322ccab8059.png)
where the right-hand side is read as "the probability that Y = y given that X = x."
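
In standard textbook notation (a reconstruction of the formulas in the images above), the definitions read:

```latex
f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x,y)}{f_X(x)}
\quad\text{for all } x \text{ such that } f_X(x) > 0;
\qquad
\text{discrete case: } f_{Y|X}(y \mid x) = P(Y = y \mid X = x).
```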

### Skewness

Once we understand the probabilities, we need to test whether the data set is skewed or normal. If it is skewed, we use a log function to normalise it.
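
As a minimal illustration of that step (an assumed helper, not actual directory code):

```php
<?php
// Illustrative only: compress the long right tail of the request_time / ping
// ratio with a natural log before defining zones.
function logNormalise(float $requestTime, float $avgPing): float
{
    $ratio = $requestTime / max($avgPing, 0.001); // Guard against a zero ping.
    return log($ratio);
}
```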

### Zoning

With a normal distribution we can easily define different zones to the left and right of the bell curve's crest. Here we can establish the cut-off points for the different levels of healthiness of the server.

![image](https://user-images.githubusercontent.com/28751375/39371767-c7cb15ac-4a6c-11e8-9189-ebaa9d753439.png)

AndyHee commented 2018-04-27 17:47:34 +02:00 (Migrated from github.com)

Here are the native math functions: http://php.net/manual/en/book.math.php
For statistics, it's a separate module: http://php.net/manual/en/book.stats.php

MrPetovan commented 2018-04-27 17:56:16 +02:00 (Migrated from github.com)

Thanks for asking and for the work. I have a question: the bell curve studies only one variable (the response time in our case), so how do you factor in the ping? Through the potential log function?

AndyHee commented 2018-04-27 18:29:07 +02:00 (Migrated from github.com)

I'm not entirely clear on this point yet. But I think once we understand the probability of how the ping (X) value affects the response time (Y), we should know how to "factor" it.

The log function -- and here I understand this a little better -- is merely to normalise the curve of your factored values. If you were to divide Y/X and your curve consequently looks like this, you cannot really do the zoning properly.
![image](https://user-images.githubusercontent.com/28751375/39373251-37bc7e56-4a71-11e8-866c-40592221523f.png)

So if you use log(Y/X), you could already do some zoning, even if Y/X is just a random factoring. The above graph should then look less skewed.

MrPetovan commented 2018-04-27 18:54:22 +02:00 (Migrated from github.com)

Are we waiting on your friend then, or do we have to do anything by ourselves?

AndyHee commented 2018-04-27 19:18:45 +02:00 (Migrated from github.com)

I'm going to test the log function for the time being. Next Friday, I'll see him again. He will try to get someone who will do the empirical part for us on a spreadsheet. Once we know the factor formula and the benchmark values, it can be coded.

Today's discussion was very brief (after some other meeting) and was merely conceptual. I put up the concept for people to comment on. But I understand this is quite advanced and also partially beyond my own expertise.

MrPetovan commented 2018-04-27 19:25:42 +02:00 (Migrated from github.com)

Beyond my own as well!

AndyHee commented 2018-04-30 16:03:25 +02:00 (Migrated from github.com)

So the log-normal function works as predicted. Here's a before and after on the "request_time" variable:

*Skewed* (before)
![image](https://user-images.githubusercontent.com/28751375/39430512-3c58e460-4cb8-11e8-95d0-400864521ec4.png)

*Normalized* (after)
![image](https://user-images.githubusercontent.com/28751375/39430448-0f8c8f86-4cb8-11e8-8ab2-ece33117eb02.png)

This would be ready for zoning if it had the ping factored in. I'll try to give it a go even before the next meeting.

--
But there is a remaining issue, as we have an incomplete data set. I think we need all known nodes, because the inclusion of these 257 nodes out of more than 1000 is rather arbitrary.

@MrPetovan do you think you could obtain data for all known nodes? All we need to ensure is that there is no duplication (e.g. http vs https).

MrPetovan commented 2018-04-30 16:09:05 +02:00 (Migrated from github.com)

There’s a difference between “known node” and “active node”. I voluntarily limited the dataset to active nodes only because ping and request times don’t make sense for dead nodes.

AndyHee commented 2018-04-30 16:13:01 +02:00 (Migrated from github.com)

That's good. Sorry for my confusion!

Yes, if that's all active nodes out there that's great.

Do you have a script for generating the data?

MrPetovan commented 2018-04-30 16:29:31 +02:00 (Migrated from github.com)

No, it was a simple SQL query against the `site-health` table, including only nodes with a positive health score.

AndyHee commented 2018-04-30 16:40:20 +02:00 (Migrated from github.com)

Sorry, this is the part I don't understand.

In my `site-health` table I have 25 entries with a positive value. The remaining 400 servers have a negative score. This includes, for example, my own server, your server, and libranet.de.

So my question is: how do you make that leap from a negative health score to being inactive?

MrPetovan commented 2018-04-30 16:42:55 +02:00 (Migrated from github.com)

Because the score penalty for connection timeout is very high, so anything with the lowest health score (-100) is unlikely to be active.

AndyHee commented 2018-04-30 17:15:28 +02:00 (Migrated from github.com)

But could this not be part of the latency problem we're trying to solve?

Could you have a look at my table taken from this side of the world? There are some active nodes with -100.

[andy-site-health.csv.zip](https://github.com/friendica/dir/files/1961046/andy-site-health.csv.zip)

--
It's not very important to have the complete dataset now, because I hope that we can do the health levels dynamically in the future based on percentage zones with relevant cut-off points on each site (see here: https://github.com/friendica/dir/issues/43#issuecomment-385007692). But it *may* help us to define the zones properly up front. It would rule out any distortion.

We can do this any time once we have the correct probability formula and agreed on the different zones in a spreadsheet for checking.

In the future, each directory might once in a while automatically calculate the cut-off values for the health levels based on our predefined zones. This presumably will be done on the basis of all known nodes in the DB.

MrPetovan commented 2018-04-30 17:34:08 +02:00 (Migrated from github.com)

The important field for available nodes is `dt_last_seen`. It's the last time the directory got a successful response. Compare this value with `dt_last_probed`, which marks the last time the directory tried to probe the server. Any time there's a difference between those two dates, a score penalty is applied to the server.

Also remember that a high response time will also penalize servers; this is why you have `mass-trespass.uk` at -20 despite a successful last probe. And since you're far from most servers in this list, your scores will be overall lower than what I would get with a directory hosted in France.

This is why you should probably take into account nodes where `dt_last_seen` and `dt_last_probed` have a value and whose values aren't more than X [time unit] apart.

AndyHee commented 2018-04-30 18:09:52 +02:00 (Migrated from github.com)

Yes, that makes sense. Failed probes, like outdated code, will reduce the overall score of course.

When I said dynamically calculated levels, I purely meant the value of `response_time` moderated by `ping_avg`.

My own node running on a piece of junkware is a good example; it has a very high response time but a relatively low ping (from my directory) and a high one (from yours). The probability distribution formula will actually reduce my node's `response_time` value because of the low ping, but probably only by a tiny bit. My node will still very likely get a low score because of its high RT/ping value. That's the part we are interested in: seeing how each percentage zone pans out along the curve, at the crest and at the base.

Are you able to see why I'm interested in the whole dataset of all known nodes? Do you think I could easily generate this myself for my directory by running a simple query?

MrPetovan commented 2018-04-30 18:34:11 +02:00 (Migrated from github.com)

Of course I see the interest, but I'm wary of inactive nodes skewing the data set in a specific direction.

The other table to consider is `site-probe`, where you'll find the results of successful probes. You can use average values to get a single set of ping/request time figures per server, and you can join `site-health` so that even currently inactive nodes' historical data is included, increasing the sample size without skewing the data set.

The query to extract the data would go along those lines (I can't test until 8 PM EST):

SELECT `base_url`, AVG(`request_time`), AVG(`avg_ping`)
FROM `site-probe`
JOIN `site-health` ON `site-health`.`id` = `site-probe`.`site_health_id`
WHERE `request_time` IS NOT NULL
AND `avg_ping` IS NOT NULL
GROUP BY `site-health`.`id`
AndyHee commented 2018-04-30 19:20:41 +02:00 (Migrated from github.com)

Thanks! No rush at all.

The conditional probability equation will not produce a value for any node that has a zero ping.

The very long request_times that just end up flat-lining at the base will turn into a steep drop through the log-normal function. See the before-and-after graphs (here: https://github.com/friendica/dir/issues/43#issuecomment-385407405).

We will make a purely qualitative judgment with the zoning as to how narrow each zone will be. This will ensure that very slow machines can never get a medium or high rating.

tobiasd commented 2018-04-30 20:13:53 +02:00 (Migrated from github.com)

How can there be a ping of zero? Or do you mean a missing value?

MrPetovan commented 2018-05-01 01:05:44 +02:00 (Migrated from github.com)

Less than a millisecond ping. Could happen.

MrPetovan commented 2018-05-01 05:37:12 +02:00 (Migrated from github.com)

Here's the latest data from dir.friendica.social with the above query:
[20180430-site-health.zip](https://github.com/friendica/dir/files/1962930/20180430-site-health.zip)

AndyHee commented 2018-05-01 06:59:59 +02:00 (Migrated from github.com)

Thanks MrPetovan for the data!!

@tobiasd highlights an important point. I think for the equation to give a probability, the ping value must be greater than zero. A superfast 0.001ms ping would still work.

In practice, this means admins who block pings will not have a health value. There are currently nodes that return a fast request_time but have a "0" ping, presumably because it failed.

Rather than giving these nodes a bad health score, they should be given a special status such as "unknown health" or something like this. All the nodes that I saw fitting this category were not open for registration. So this would be the trade-off: if you run an open-registration node, you need to be pingable; otherwise you don't get a score.

tobiasd commented 2018-05-01 07:12:11 +02:00 (Migrated from github.com)

One could use the average ping time for those nodes that block ping. I mean a fixed value we find during the evaluation of the health determination round.

AndyHee commented 2018-05-01 07:50:37 +02:00 (Migrated from github.com)

Yes, that's a possibility, but it's open to manipulation. So if you have a slow `request_time` and a faster-than-average ping, you just block your ping and instantly get a better score.

It's a kind of shared-risk issue. There are many legitimate reasons why people block ping, but if everyone were to block ping then the directories would not work any more.

Of course we would still provide individual outputs for nodes without ping, similar to the screenshot below. But there would not be a final overall score; instead of the heart being green or whatever colour, the heart would just not be shown or something like this, while clearly indicating the node is active. Mostly it wouldn't matter, because only nodes open for registration are *listed* in ../servers.

![screenshot_2018-05-01_12-31-34](https://user-images.githubusercontent.com/28751375/39462549-733493ce-4d3c-11e8-8714-7427eb7980d4.png)

tobiasd commented 2018-05-01 08:00:17 +02:00 (Migrated from github.com)

If someone is gaming the health test that way, we can just introduce a penalty by a horrible factor.

The fixed time could also be at the lower end of the bell curve; the average plus half of the full-width-at-half-maximum value, or so. Then the fixed time is more likely not a value gamers would want.

AndyHee commented 2018-05-01 08:08:03 +02:00 (Migrated from github.com)

Yes, that would work. But then it would make the node's health look worse than it really is.

Do you think, it's important to have an overall score in such cases?

tobiasd commented 2018-05-01 08:27:10 +02:00 (Migrated from github.com)

I don't think there will be gaming of the value, hence my suggestion of the average value, to be neutral in that metric. I don't really have an opinion about a "no health" value in the listing and whether that would be better or worse for the node in terms of "advertising" it to new users.

AndyHee commented 2018-05-01 09:02:30 +02:00 (Migrated from github.com)

That's a good point! "Unknown health" can sound worse than "below average health" for some people. We probably need better explanations that would help people to interpret the finer points, either way.

If we give those nodes an overall score, I think a penalty would be good, mainly to encourage people to allow pings where this is possible. Mainly a penalty for not sharing the collective risk of contributing to the average and standard-deviation ping values.

What exactly that penalty would look like (so that it's not an invitation to fiddle the system), we need to see from the data and the calculated probabilities. Something like you said: the average or an above-average ping value, or even a simple 0 = 1.

AndyHee commented 2018-05-03 17:02:50 +02:00 (Migrated from github.com)

Hypolite, I have a question about the database, in particular about the ping value.

Looking at the site-probe table, I cannot see any ping value. What am I missing here?

I am able to query the db for the `request_time` and to join it with `site-health` by omitting `avg_ping`.

MrPetovan commented 2018-05-03 17:06:05 +02:00 (Migrated from github.com)

Either:

  • You need to update your database according to the dfrndir.sql file since there's no automatic update.
    Or:
  • I need to update the dfrndir.sql file as the ping could be a dir.friendica.social-only feature at the moment. 😅

(Pretty sure it's the second one)

AndyHee commented 2018-05-03 17:11:05 +02:00 (Migrated from github.com)

> ping could be a dir.friendica.social-only feature

Ahh... it's not in the source yet, you're saying. Oh well. 😀

AndyHee commented 2018-05-03 17:18:32 +02:00 (Migrated from github.com)

What's the easiest way? Should I update my db and add a table? If you give me a hint, I might be able to do it.

But does the current code actually collect pings or only your unpublished version?

MrPetovan commented 2018-05-03 17:20:53 +02:00 (Migrated from github.com)

Only my unpublished version as well. The easiest way is for me to commit my code.

MrPetovan commented 2018-05-03 17:34:20 +02:00 (Migrated from github.com)

Sorry for the untimeliness :/

MrPetovan commented 2018-05-04 14:23:24 +02:00 (Migrated from github.com)

The MrPetovan/dir:master branch has been updated. You can use it to test the ping feature. Additionally, you can now use a CLI console tool to trigger a probe on a specific domain or site-health-id manually.

AndyHee commented 2018-05-04 16:13:41 +02:00 (Migrated from github.com)

I am outlining here what seems like a workable solution. @Ken-Ko, one of the team members and the main driver behind this, has just joined us on GitHub and will follow the implementation through and advise us further if necessary.

The initially proposed idea of treating this as a conditional probability problem has been rejected, as it would lead to unnecessary levels of complexity. Instead, the more plausible way is to tackle this with a linear regression, i.e. ordinary least squares (OLS).

This will give a "discounted_request_time" based on the request time in relation to the ping value. We will only need basic mathematical operations (addition, subtraction, division, multiplication) and exponentiation. This also avoids the problem of a zero/failed ping, as 0 will get no discount. Here is the basic equation:

`discounted_request_time` = `request_time` - (`avg_ping` * Coefficient)

The Coefficient based on the current dataset (dir.friendica.social-20180430-site-health) is 3.49712908930528.

This results in the following hard-coded equation:
`discounted_request_time` = `request_time` - (`avg_ping` * 3.49712908930528)
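
As a sketch, the discounting step could be coded roughly like this (the function name is illustrative, not existing directory code; the constant is the one quoted above):

```php
<?php
// Illustrative sketch of the proposed discount; a zero/failed ping gets no discount.
function discountedRequestTime(float $requestTime, float $avgPing): float
{
    $coefficient = 3.49712908930528; // From the dir.friendica.social 2018-04-30 data set.
    return $requestTime - ($avgPing * $coefficient);
}
```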

In the next post, I'll outline how to calculate the Coefficient, so we can do this dynamically as new nodes come online or disappear, or hardware changes.

For the moment, I would like you to look at the actual data and inspect Column H (Discounted Y_a). There you see all nodes sorted by "speed" (`discounted_request_time`). The faster the node, the lower the value. Some fast/superfast nodes have negative values. We are not really interested in the actual value itself, but in which (of the six) zones each node falls.

There is a very nice chart in the Excel file that shows the distribution and will give us an idea about the different zones. Unfortunately, the chart is only visible in a specific proprietary version (which I think is MS Excel 2016). I hope to show this chart here once we get an exported graphic file. If you happen to have that particular version, you can already see the preview. Otherwise, just look at the ODS file for the data.

The zoning will be done dynamically too in the future. I'm working on this at the moment, but we can also hard-code the values for the time being. This means with the above equation we can have this up and running for testing in no time.

[dir-friendica-social-20180430-site-health-OLS.zip](https://github.com/friendica/dir/files/1974710/dir-friendica-social-20180430-site-health-OLS.zip)

[propitiatory_copy.zip](https://github.com/friendica/dir/files/1974709/propitiatory_copy.zip)

MrPetovan commented 2018-05-04 17:01:42 +02:00 (Migrated from github.com)

Wow, thank you and your team for your work!

AndyHee commented 2018-05-05 05:32:34 +02:00 (Migrated from github.com)

**Scoring (Q1-3)**

The data gives us the following three quartiles:

First Quartile: 75
Second Quartile: 118
Third Quartile: 251

In a boxplot the whole data looks like this:

![image](https://user-images.githubusercontent.com/28751375/39659289-58f94700-504f-11e8-9882-1bc0833ad1ba.png)

Here in more detail with the outliers removed, generated by BoxPlotR:

![image](https://user-images.githubusercontent.com/28751375/39659275-feeb7076-504e-11e8-8c16-665fd4e83db3.png)
http://shiny.chemgrid.org/boxplotr/

So the quick hard-coded fix will look like this:

	//Speed scoring.
	if (intval($time) > 0) {
		//Penalty / bonus points.
		if ($time > 520) {
			$current -= 10; //Bad speed.
		} elseif ($time > 251) {
			$current -= 5; //Still not good.
		} elseif ($time > 181) {
			$current += 0; //This is normal.
		} elseif ($time > 71) {
			$current += 5; //Good speed.
		} else {
			$current += 10; //Excellent speed.
		}
	}
MrPetovan commented 2018-05-05 06:41:49 +02:00 (Migrated from github.com)

I love everything about this.

AndyHee commented 2018-05-05 12:38:13 +02:00 (Migrated from github.com)

### AVG(`avg_ping`)

I'm looking at the average ping values as they come in at dir.hubup.pro.

There seems to be an error in how the average of `avg_ping` is calculated. E.g. https://libranet.de in Germany gets 0.0894, and https://social.isurf.ca in California gets 0.9265, as measured from Thailand.

Is it that a certain number of 0 values are ending up in the average?

![image](https://user-images.githubusercontent.com/28751375/39662332-219e52ba-508a-11e8-9b8d-cefa1b237712.png)

MrPetovan commented 2018-05-05 18:03:17 +02:00 (Migrated from github.com)

I don't know, I'm just taking the raw result from the ping command. Have you tried running a manual ping command against the same domains to see if there's a difference?

AndyHee commented 2018-05-06 07:18:32 +02:00 (Migrated from github.com)

Looking at the `avg_ping` column in `site-probe`, everything is correct! The ping for https://libranet.de is 236 and for https://social.isurf.ca 227, identical to the values taken manually.

The problem occurs when running a query against the database to generate an output. I can actually see some error messages in addition to the distorted output.

So the ping seems to work without any issues. I'll check what's going on with my db or phpmyadmin setup.

Just for clarification: when we pull and push as defined in `sync-targets`, we only transfer information about profiles and not any health-related server information? Correct?

AndyHee commented 2018-05-06 12:34:13 +02:00 (Migrated from github.com)

I managed to trace the error that occurred when querying the database. It was a result of having modified the table index when I added the two new columns. It's all fixed now and works as expected. 🙂

MrPetovan commented 2018-05-06 14:08:52 +02:00 (Migrated from github.com)

> Just for clarification: when we pull and push as defined in `sync-targets`, we only transfer information about profiles and not any health-related server information? Correct?

Yes, health is computed locally.

AndyHee commented 2018-05-08 18:19:42 +02:00 (Migrated from github.com)

I'm analysing the data that dir.hubup.pro has generated over the last few days (since we started collecting `avg_ping`). I currently have only 176 nodes, about 100 fewer than the last dir.friendica.social dataset.

Preliminarily, it looks promising. I think there are similarities in the quartiles, even though the actual `discounted_request_time` values are very different. I'll wait a few more days to get more data and will then show a detailed box chart comparing both directories.

I'm currently looking into ways to calculate the quantiles and extreme values. I found something that looks like a potential direction for how to do quantiles in PHP (see below).

Once we know Q1–Q3, we can also calculate the lower "extreme value" (i.e. Q3 + 1.5 × IQR).

![image](https://user-images.githubusercontent.com/28751375/39768888-becf4062-5314-11e8-8463-002106e651db.png)

function Median($Array) {
  return Quartile_50($Array);
}
 
function Quartile_25($Array) {
  return Quartile($Array, 0.25);
}
 
function Quartile_50($Array) {
  return Quartile($Array, 0.5);
}
 
function Quartile_75($Array) {
  return Quartile($Array, 0.75);
}

https://blog.poettner.de/2011/06/09/simple-statistics-with-php/
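
For completeness, the `Quartile()` helper these wrappers rely on could look roughly like the following (an assumed sketch along the lines of the linked post, using linear interpolation between the two nearest ranks):

```php
<?php
// Assumed sketch of the missing Quartile() helper so the wrappers above are runnable.
function Quartile(array $Array, float $Quartile): float
{
    sort($Array);
    $pos  = (count($Array) - 1) * $Quartile;
    $base = (int) floor($pos);
    $rest = $pos - $base;

    if (isset($Array[$base + 1])) {
        return $Array[$base] + $rest * ($Array[$base + 1] - $Array[$base]);
    }

    return (float) $Array[$base];
}
```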

MrPetovan commented 2018-05-08 18:31:08 +02:00 (Migrated from github.com)

This looks good, but I'm not sure what it would be for. Determining the cutoff points?

AndyHee commented 2018-05-08 18:42:41 +02:00 (Migrated from github.com)

Here I'm just showing again the proposed zoning, based on the data for dir.friendica.social 20180430:

![image](https://user-images.githubusercontent.com/28751375/39770411-289e3d3c-5319-11e8-8afe-7acff9d458fe.png)

**A** = Excellent
**B** = Good
**C** = Normal
**D** = Still not good
**E** = Bad

![image](https://user-images.githubusercontent.com/28751375/39770490-5eda4d00-5319-11e8-8172-de01ef557cac.png)

AndyHee commented 2018-05-08 18:43:17 +02:00 (Migrated from github.com)

> This looks good, but I'm not sure what it would be for. Determining the cutoff points?

Yes. See above.

Quartile_25 = Q1
Quartile_50 = Q2
Quartile_75 = Q3

AndyHee commented 2018-05-08 19:07:56 +02:00 (Migrated from github.com)

Here is the code with fixed values:

//Speed scoring.
if (intval($time) > 0) {
	//Penalty / bonus points.
	if ($time > 515) {
		$current -= 10; //Bad speed.
	} elseif ($time > 251) {
		$current -= 5; //Still not good.
	} elseif ($time > 181) {
		$current += 0; //This is normal.
	} elseif ($time > 71) {
		$current += 5; //Good speed.
	} else {
		$current += 10; //Excellent speed.
	}
}

And here with dynamic break points based on all `discounted_request_time` values, sorted:

//Speed scoring.
if (intval($time) > 0) {
	//Penalty / bonus points.
	if ($time > $lower_iqr) {
		$current -= 10; //Bad speed.
	} elseif ($time > $Quartile_75) {
		$current -= 5; //Still not good.
	} elseif ($time > $Quartile_50) {
		$current += 0; //This is normal.
	} elseif ($time > $Quartile_25) {
		$current += 5; //Good speed.
	} else {
		$current += 10; //Excellent speed.
	}
}

`lower_iqr` = `Quartile_75` + 1.5 × (`Quartile_75` - `Quartile_25`)
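
Putting it together, the dynamic break points could be derived roughly like this (illustrative data; `Quartile()` is the helper sketched earlier):

```php
<?php
// $times would hold the discounted_request_time values of all pingable nodes.
$times = [45.0, 71.0, 118.0, 251.0, 515.0]; // Illustrative values only.

$Quartile_25 = Quartile($times, 0.25);
$Quartile_50 = Quartile($times, 0.50);
$Quartile_75 = Quartile($times, 0.75);

// Cut-off beyond which a node counts as an outlier ("Bad speed").
$lower_iqr = $Quartile_75 + 1.5 * ($Quartile_75 - $Quartile_25);
```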

MrPetovan commented 2018-05-08 20:10:29 +02:00 (Migrated from github.com)

I'll admit I didn't expect to have so much fun when I agreed to maintain the Friendica Directory.

AndyHee commented 2018-05-12 09:26:22 +02:00 (Migrated from github.com)

Some preliminary results. Here is a comparison of the two directories, one in Western Europe and the other in Southeast Asia.

The datasets were taken at different times and have different total numbers of nodes (181 vs. 270).

Results based on this equation:

discounted_request_time = request_time - (avg_ping * coefficient)

The coefficient:
dir.friendica.social-20180430 = 3.49712908930528
dir.hubup.pro-20180512 = 0.146000348540368

![index](https://user-images.githubusercontent.com/28751375/39954833-1eaeb07a-55f0-11e8-9865-c932f3c75166.png)
![39659275-feeb7076-504e-11e8-8c16-665fd4e83db3](https://user-images.githubusercontent.com/28751375/39954834-243844b6-55f0-11e8-98b9-617882bdbd18.png)

AndyHee commented 2018-05-12 09:29:01 +02:00 (Migrated from github.com)

@MrPetovan the explanation I gave for the coefficient is incorrect: https://github.com/friendica/dir/issues/43#issuecomment-386648866

I'll try to give the correct version shortly. Hope you have not already coded this.

AndyHee commented 2018-05-12 10:24:40 +02:00 (Migrated from github.com)

## Duplication of nodes

I have noticed there are some duplications among the `base_url` entries.

Something like:
http://meld.de/
https://meld.de/

But we are quite sure there is only one node running there, despite the difference in protocols.

Even more concerning are duplications of entries with identical protocols. For instance, in Hypolite's dataset there are ten (10!) entries for https://libranet.de and about 15 for https://friendica.ladies.community, each with different `request_time` values.

What's going on there and how to fix this?

MrPetovan commented 2018-05-12 13:28:43 +02:00 (Migrated from github.com)

The behavior is even stranger than you expect. These are the only 15 redundant base_urls in the dir.friendica.social database:

| base_url | COUNT(*) |
| -- | -- |
| https://box25.it | 7990 |
| http://localhost | 302 |
| https://salesnet.tomeetu.de | 22 |
| https://friendica.ladies.community | 17 |
| https://libranet.de | 10 |
| https://privet.su | 9 |
| http://192.168.244.183 | 8 |
| https://www.ladies.community | 7 |
| http://192.168.178.208 | 5 |
| https://friendica.me | 3 |
| http://172.16.0.10 | 2 |
| http://social.pelikancms.pl | 2 |
| https://social.gl-como.it | 2 |
| https://social.retr.co | 2 |
| https://friendica.christsmith.ca | 2 |

The first issue is that there isn't a UNIQUE key on the base URL. The second issue is that there's no reduction to a normalized URL (without the scheme), which would allow us to rule out HTTP/HTTPS duplicates.
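
A possible (purely illustrative) normalization, assuming stripping the scheme and any trailing slash is enough; combined with a UNIQUE index on the normalized value, the duplicates listed above could not reoccur:

```php
<?php
// Illustrative sketch, not the directory's actual code: reduce a base URL to a
// scheme-less, trailing-slash-less form so http/https duplicates collapse.
function normaliseBaseUrl(string $baseUrl): string
{
    $url = strtolower(trim($baseUrl));
    $url = preg_replace('#^https?://#', '', $url); // Drop the scheme.
    return rtrim($url, '/');                       // Drop any trailing slash.
}
```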

AndyHee commented 2018-05-12 14:15:51 +02:00 (Migrated from github.com)

Ohh.. what did we let ourselves in for here... 😲

I think this issue has affected the stats somehow. The coefficients are too different. Could you run this query again with:

WHERE `request_time` IS NOT NULL
AND `avg_ping` IS NOT NULL

We would like to run some further tests. Thanks.

MrPetovan commented 2018-05-12 14:34:22 +02:00 (Migrated from github.com)

Here you are:

[2018-05-12-site-probe.csv.zip](https://github.com/friendica/dir/files/1997599/2018-05-12-site-probe.csv.zip)

I did deduplicate `base_url`, but I didn't add the `nurl` column. It shouldn't skew the data too much.

MrPetovan commented 2018-05-14 14:16:04 +02:00 (Migrated from github.com)

Of course, go ahead!

AndyHee commented 2018-05-14 15:31:22 +02:00 (Migrated from github.com)

[OK, I deleted some of my redundant posts above]

Ko tested the two new datasets for us and we found some interesting developments. I'm summarising a three-page report here and will give the practical implications.

For "dir.frienica.social" the removal of duplicated nodes seemed to make the relationship between request_time and avg_ping even stronger. This is good.

However, for "dir.hubup.pro" the data showed there was no relationship between request_time and avg_ping . The above (see https://github.com/friendica/dir/issues/43#issuecomment-388536649) rather different coefficients were already some indication of this. These two graphs might give you further some idea of the problem.
screenshot_2018-05-14_19-50-59

After removing all servers with a zero `avg_ping` value (and some outliers), we have now established that in the Thai dataset there is also a significant relationship between `request_time` and `avg_ping`. This is good, because it allows us to use the OLS equations as planned.

discounted_request_time = request_time - (avg_ping * coefficient)

Here are the coefficients (plus p-values for the likelihood of no relationship):
"dir.friendica.social" = 3.316080104 (p = 0.0)
"dir.hubup.pro" = 4.965583808 (p = 0.001)

### Practical implication

The calculation of the coefficient and the Q1, Q2, Q3, and IQR values (see here: https://github.com/friendica/dir/issues/43#issuecomment-387473588) must exclude all nodes with a zero `avg_ping`. These nodes (provided they have a non-zero `request_time`) will of course still get a health score, but will not contribute to determining the coefficient and speed zones.

tobiasd commented 2018-05-14 15:42:54 +02:00 (Migrated from github.com)

But then each directory server has to automatically recalculate the coefficients from time to time--right?

AndyHee commented 2018-05-14 15:44:48 +02:00 (Migrated from github.com)

Correct, and also its speed score zones.

Here an example for zones: https://github.com/friendica/dir/issues/43#issuecomment-387473588

AndyHee commented 2018-05-14 16:18:11 +02:00 (Migrated from github.com)

OK, here is the coefficient. Please excuse the non-standard notation; I hope it makes sense.

`Coefficient` = SUM of all x*y / SUM of all x^2

x = `avg_ping` - (AVERAGE of all `avg_ping` WHERE `avg_ping` is NOT zero)
y = `request_time` - (AVERAGE of all `request_time` WHERE `avg_ping` is NOT zero)
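
A sketch of that calculation in PHP (illustrative names, not existing directory code; only nodes with a non-zero `avg_ping` contribute, as described above):

```php
<?php
// Illustrative sketch: OLS slope on mean-centred values, excluding zero pings.
function speedCoefficient(array $avgPings, array $requestTimes): ?float
{
    $pings = [];
    $times = [];
    foreach ($avgPings as $i => $ping) {
        if ($ping > 0) { // Nodes with a zero/failed ping are excluded.
            $pings[] = $ping;
            $times[] = $requestTimes[$i];
        }
    }

    if (count($pings) < 2) {
        return null; // Not enough data to estimate a slope.
    }

    $meanPing = array_sum($pings) / count($pings);
    $meanTime = array_sum($times) / count($times);

    $numerator   = 0.0;
    $denominator = 0.0;
    foreach ($pings as $i => $ping) {
        $x = $ping - $meanPing;
        $y = $times[$i] - $meanTime;
        $numerator   += $x * $y;
        $denominator += $x * $x;
    }

    return $denominator > 0.0 ? $numerator / $denominator : null;
}
```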

MrPetovan commented 2018-11-12 05:52:45 +01:00 (Migrated from github.com)
Moved to https://github.com/friendica/friendica-directory/issues/4