This post explains the sampling method introduced in the paper On the Cost of Type-Tag Soundness
Quick Reference: How to apply the method
- Find an untyped program, measure its running time.
- Define a granularity for type annotations (by-function, by-module, by-program, ….).
- Define a sample size s and number of samples r.
- Randomly select s configurations uniformly at random, measure their running time.
- Repeat the previous step r times.
- Pick a positive real number D.
- Count the proportion of configurations in each sample with running time less-than-or-equal-to D
- Build a 95% confidence interval for the r proportions computed in the previous step
- Conclusion: there is a good chance that your interval contains the true proportion of configurations with running time less-than-or-equal-to D
Background: what to measure
A migratory typing system adds static typing to a dynamically-typed (or, untyped) language. The recipe for “adding static typing” has a few steps:
- add a syntax for type annotations
- add a static type checker
- add a semantics for statically-typed parts of the program
If the semantics for statically-typed parts of the program is not the same as the semantics for dynamically-typed parts, then it is important to measure performance.
The key question is: how does adding type annotations affect the running time of a working program? We do not know how to answer this question directly.
An easier question, that we can answer, is: for a few programs each with one full set of type annotations, how does adding or removing the chosen type annotations affect the running time of these programs?
The next two sections give two methods for answering this question.
One way to answer our easier question is to remove type annotations one “unit” at a time and measure the running time of all these partially-typed programs. We call the “unit” the granularity of the performance evaluation. For example, some choices for granularity are to remove types one module at a time, to remove types one function at a time, or to remove types one variable at a time. We call the “partially-typed programs” the configurations of the original dynamically-typed program. Note that the number of configurations depends on the choice of granularity — I can’t just use the word configurations without telling you the granularity I have in mind.
After measuring the running time of all configurations, we can summarize the results. One way to summarize is to pick a number D and count the number of configurations that run at most D times slower than the original dynamically-typed program. If this number is large, then the takeaway is: if you are willing to accept at most a Dx slowdown, and you add your own type annotations to your own program, then there’s some hope that your configuration runs at most D times slower than your original program.
Simple Random Approximation Method
The method above does not scale to large programs or fine granularities because it asks for an exponential number of measurements. E.g., if there are 20 units to add or remove types from, then there are 1 million configurations to measure. Exponentials are bad.
On the Cost of Type-Tag Soundness, suggests a method based on simple random sampling that answers a similar question. Instead of measuring the true proportion of configurations that run at most D times slower than the original dynamically-typed program, we:
- pick a sample size s (in the paper, we used s = 10M where M is the number of units),
- pick a number of samples r (in the paper, we used r = 10),
- and build a 95% confidence interval for the true proportion of configurations that run at most D times slower than the original program (from the r proportions of configurations that run at most D times slower than the original program in each of the r samples).
The method is outlined above, described in the paper, and validated in that paper’s appendix. Please let us know if you have more questions.
Maybe you’re wondering, “gee why do they keep writing out ‘configurations that run at most ….’ instead of something shorter?”. Well, the short version is D-deliverable and it was introduced in the Is Sound Gradual Typing Dead? paper. Unfortunately, (1) that paper instantiated D to 3-deliverable in order to explain a few graphs and (2) at least two published papers (paper 1, paper 2) now cite us as saying 3x overhead is the cutoff between a good migratory typing system and a bad one.
If we can’t trust scientists to understand, then we definitely can’t trust you folks on the internet.
Q. What is the sampling method useful for?
- Making a confidence interval for the true proportion of configurations that run at most D times slower than some baseline, for your favorite value of D.
Q. What is the sampling method not useful for?
- Finding the slowest configuration.
- Finding the average running time of all configurations.
- Evaluations where “removing types” might involve changing List[Int] to List[Dyn], etc.
- Situations where its wrong to assume that a programmer will start from untyped and pick a configuration uniformly at random
- …. many more ….
Q. Why is it okay to choose D after collecting the samples?
The “quick reference” at the top of this post suggests choosing a value for D (the cutoff between good and bad performance) after sampling configurations and measuring their running time. This may sound strange, because (1) the value of D affects our bottom-line judgment about the proportion of configurations with good performance, and (2) shouldn’t and value that affects the bottom line be fixed before taking samples? (To avoid accidental data dredging.)
The reason it is ok to pick D after taking the sample is that the running times in the sample are independent of the choice of D.
For example, if one person chose D=3 and a second person chose D=9, both would follow the same protocol independent of D to take samples.
Q. How does migratory typing relate to gradual typing?
Q. Do you have code I can use to plot sampling data?
Yes, start here:
Please ask questions and open issues if you have trouble. The source is here:
Q. Where is code for the sampling paper?
Source is here:
Statistics is easy to do wrong. Please let us know if you think our method is doing bad statistics.