Why ProofSet size doesn’t really matter

Let the dataset have size N (call it N nodes).
Suppose a fraction \alpha of the nodes is missing.
- This means that \alpha*N nodes cannot be proven and (1-\alpha)*N nodes can be proven if hit by a challenge.
A challenge picks a node uniformly at random. What matters for detection is the fraction of bad indices in the population, not the absolute count.
If you issue K independent challenges, you catch the prover unless all K challenges hit nodes that can be proven.

Analysis

Given that each challenge is sampled uniformly at random among the N indices, independently of the other challenges, the probability that a single challenge hits a node that the prover can prove is (1-\alpha).

This means that for K challenges, the probability that a prover who has lost an \alpha fraction of the data is not caught is

$p = (1-\alpha)^K$

Because the analysis is stated in terms of the fraction of storage lost (i.e. \alpha), the soundness error p depends only on \alpha and K, not on N.
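As a sanity check on the formula above, here is a minimal simulation sketch. The function names (`evasion_probability`, `simulate_evasion`) and the choice to treat the first \alpha*N indices as the lost ones are illustrative assumptions, not part of any ProofSet implementation. It estimates the evasion rate for several values of N and compares it with the closed form.

```python
import random


def evasion_probability(alpha: float, k: int) -> float:
    """Closed-form probability that k independent challenges all miss the lost data."""
    return (1.0 - alpha) ** k


def simulate_evasion(n: int, alpha: float, k: int, trials: int, seed: int = 0) -> float:
    """Estimate the evasion probability by simulation.

    Assumption: nodes 0 .. missing-1 are the lost ones. Which indices are lost
    does not matter, because challenges are uniform over all n nodes.
    """
    rng = random.Random(seed)
    missing = round(alpha * n)
    evaded = 0
    for _ in range(trials):
        # The prover evades only if every one of the k challenges hits a provable node.
        if all(rng.randrange(n) >= missing for _ in range(k)):
            evaded += 1
    return evaded / trials


if __name__ == "__main__":
    alpha, k = 0.05, 5
    print(f"closed form: {evasion_probability(alpha, k):.5f}")
    for n in (1_000, 100_000, 10_000_000):
        estimate = simulate_evasion(n, alpha, k, trials=100_000)
        print(f"N = {n:>10,}: simulated {estimate:.5f}")
```

Up to sampling noise, the simulated rate should stay close to the closed form for every N, which is the point of the argument: only \alpha and K enter.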

Of course, the absolute amount of data lost (\alpha*N nodes) does depend on ProofSet size, but this is not considered in the security analysis.

Assuming one proof per day with K challenges each, the soundness error after T days decreases as

$$ p_{\text{day } T} = (1-\alpha)^{K\cdot T} $$
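The multi-day formula can also be inverted to ask how many days are needed before the evasion probability drops below a chosen threshold. The sketch below assumes a hypothetical `target` parameter for that threshold and solves $(1-\alpha)^{K\cdot T} \le \text{target}$ for T, which gives $T \ge \ln(\text{target}) / (K \ln(1-\alpha))$. The function names are illustrative only.

```python
import math


def evasion_after_days(alpha: float, k: int, days: int) -> float:
    """Probability of passing `days` daily proofs of k challenges each undetected."""
    return (1.0 - alpha) ** (k * days)


def days_until_evasion_below(alpha: float, k: int, target: float) -> int:
    """Smallest whole number of days T with (1 - alpha)^(k*T) <= target."""
    if not (0.0 < alpha < 1.0 and 0.0 < target < 1.0 and k > 0):
        raise ValueError("need 0 < alpha < 1, 0 < target < 1, k > 0")
    # Closed-form bound: T >= ln(target) / (k * ln(1 - alpha)).
    t = math.ceil(math.log(target) / (k * math.log(1.0 - alpha)))
    # Guard against floating-point rounding at exact boundaries.
    while evasion_after_days(alpha, k, t) > target:
        t += 1
    while t > 1 and evasion_after_days(alpha, k, t - 1) <= target:
        t -= 1
    return t


if __name__ == "__main__":
    for alpha in (0.01, 0.05, 0.20):
        days = days_until_evasion_below(alpha, k=5, target=0.01)
        print(f"alpha = {alpha:.0%}: {days} days to push evasion below 1%")
```

Since the bound scales with $1/K$, doubling the challenges per proof roughly halves the number of days needed.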

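If we call \epsilon the "probability of evasion", we have that $\epsilon = (1-\alpha)^{K\cdot T}$. The table below evaluates it for K = 5 challenges per day, at T = 1 and T = 30 days.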

Detection probability vs data loss fraction

| α (fraction of data lost) | Per-day evasion, (1−α)^5 | Per-day detection | 30-day evasion, (1−α)^150 | 30-day detection |
|---|---|---|---|---|
| 1% | 0.95099 | 4.901 % | 0.22145 | 77.855 % |
| 5% | 0.77378 | 22.622 % | 0.00046 | 99.954 % |
| 20% | 0.32768 | 67.232 % | 2.91 × 10⁻¹⁵ | ≈ 100 % |
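The figures in the table can be recomputed with a few lines of Python. This is a small sketch assuming K = 5 challenges per daily proof and a 30-day window, as implied by the exponents 5 and 150 in the column headers. The helper name `table_rows` is hypothetical.

```python
K = 5       # challenges per daily proof (assumed from the exponent 5 in the table)
DAYS = 30   # length of the longer window (150 = 5 * 30)


def table_rows(alphas, k=K, days=DAYS):
    """Return (alpha, per-day evasion, per-day detection, multi-day evasion, multi-day detection)."""
    rows = []
    for alpha in alphas:
        per_day = (1.0 - alpha) ** k
        multi_day = (1.0 - alpha) ** (k * days)
        rows.append((alpha, per_day, 1.0 - per_day, multi_day, 1.0 - multi_day))
    return rows


if __name__ == "__main__":
    for alpha, e1, d1, e30, d30 in table_rows([0.01, 0.05, 0.20]):
        print(f"{alpha:>4.0%}  {e1:.5f}  {d1:8.3%}  {e30:.3g}  {d30:.3%}")
```

Changing `K` or `DAYS` shows how the detection guarantee trades off against challenge count and observation window for a given loss fraction.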