Spark Reputation Scoring

Each incoming web transaction will be graded on the following characteristics to ultimately assign each Spark node a reputation score.

  1. Completeness: This aspect evaluates whether the data is whole or if there are missing elements. It assesses if the data set includes all the necessary data points for the intended use case.

  2. Consistency: This dimension checks for uniformity in the data across different data sets or within the same data set over time.

  3. Timeliness: This measures whether the data is up-to-date and available when needed.

  4. Availability: This assesses the degree to which the data from each node can be made available.

To turn these characteristics into a single reputation score for each node, we define a formula that combines all four. Let R denote the reputation score, and denote the characteristics as follows:

  • C for Completeness

  • N for Consistency

  • T for Timeliness

  • A for Availability

$$R = w_C \cdot C + w_N \cdot N + w_T \cdot T + w_A \cdot A$$
  • $w_C$, $w_N$, $w_T$ and $w_A$ are the weights assigned to Completeness, Consistency, Timeliness, and Availability, respectively, and are used to normalize the reputation score.

$$w_C + w_N + w_T + w_A = 1$$
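
As a rough illustration, the weighted sum could be computed as in the Python sketch below. The function name `reputation_score`, the dictionary keys, the example values, and the assumption that each characteristic is scored on a 0 to 1 scale are illustrative only and are not specified on this page.

```python
import math


def reputation_score(scores, weights):
    """Weighted sum R = w_C*C + w_N*N + w_T*T + w_A*A.

    `scores` and `weights` are dicts keyed by "C", "N", "T" and "A".
    """
    keys = ("C", "N", "T", "A")
    # The weights must sum to 1 so that R stays on the same scale as the inputs.
    total = sum(weights[k] for k in keys)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights must sum to 1, got {total}")
    return sum(weights[k] * scores[k] for k in keys)


# Hypothetical values, assuming each characteristic is scored in [0, 1].
weights = {"C": 0.5, "N": 0.2, "T": 0.2, "A": 0.1}
scores = {"C": 1.0, "N": 0.5, "T": 0.5, "A": 1.0}
print(reputation_score(scores, weights))
```

Rejecting weights that do not sum to 1 keeps scores comparable between nodes and between time periods.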

Each variable described:

  • $R$: Reputation Score - The final score assigned to each node, calculated based on the weighted sum of the four characteristics.

  • $C$: Completeness - Evaluates if the data set is complete, including all necessary data points.

  • $N$: Consistency - Checks for uniformity in the data, ensuring consistency across different sets or over time.

  • $T$: Timeliness - Measures if the data is current when needed.

  • $A$: Availability - Assesses the likelihood that data will be available from each node.

  • $w_C$, $w_N$, $w_T$, $w_A$: Weights - The relative importance assigned to each characteristic, determining how much each one influences the final reputation score.

Dynamic allocation of weights:

$w_C = 0.5$ (Completeness, $C$, is highly prioritized).

For a time period $t_1$ to $t_2$:

$N_{t_1, t_2}$ is the set of all observed values of $N$ from each active node in the time period.

$T_{t_1, t_2}$ is the set of all observed values of $T$ from each active node in the time period.

$A_{t_1, t_2}$ is the set of all observed values of $A$ from each active node in the time period.

$\sigma(N_{t_1, t_2})$ is the standard deviation of $N_{t_1, t_2}$; it measures how much variability there is in consistency across all active nodes during the time period $t_1$ to $t_2$.

$\sigma(T_{t_1, t_2})$ is the standard deviation of $T_{t_1, t_2}$; it measures how much variability there is in timeliness across all active nodes during the time period $t_1$ to $t_2$.

$\sigma(A_{t_1, t_2})$ is the standard deviation of $A_{t_1, t_2}$; it measures how much variability there is in availability across all active nodes during the time period $t_1$ to $t_2$.

The remaining total weight of 0.5 is distributed across $w_N$, $w_T$ and $w_A$ linearly according to their standard deviations.
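
One possible reading of this rule is that each of the three weights receives a share of the remaining 0.5 in proportion to the standard deviation of its characteristic, for example $w_N = 0.5 \cdot \sigma(N_{t_1, t_2}) / (\sigma(N_{t_1, t_2}) + \sigma(T_{t_1, t_2}) + \sigma(A_{t_1, t_2}))$. The Python sketch below follows that reading. The function name `allocate_weights`, the use of the population standard deviation, and the equal split when all three deviations are zero are assumptions, not part of the description above.

```python
from statistics import pstdev


def allocate_weights(n_values, t_values, a_values, w_c=0.5):
    """Return weights for C, N, T and A that sum to 1.

    Each *_values argument holds the values observed from all active
    nodes during the time period t_1 to t_2.
    """
    # Assumption: population standard deviation over the observed values.
    sigmas = {
        "N": pstdev(n_values),
        "T": pstdev(t_values),
        "A": pstdev(a_values),
    }
    remaining = 1.0 - w_c
    total_sigma = sum(sigmas.values())
    if total_sigma == 0:
        # Assumed fallback: no variability anywhere, so split evenly.
        shares = {k: remaining / 3 for k in sigmas}
    else:
        # Each weight gets a share proportional to its standard deviation.
        shares = {k: remaining * s / total_sigma for k, s in sigmas.items()}
    return {"C": w_c, **shares}


# Hypothetical observations from two active nodes.
example = allocate_weights([0.0, 1.0], [0.25, 0.75], [0.5, 0.5])
print(example)
```

Under this reading, a characteristic whose values vary more across nodes in the period receives a larger weight, and a characteristic on which all nodes agree receives none.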
