Computer Sciences Colloquium: New Tools for Scalable Weighted Sampling: Frequency Capping, Multi-Objective, and more

Edith Cohen

18 October 2015, 11:00 
Schreiber Building, Room 006 

Abstract: 

Data modeled as elements with {\em keys} (cookies, users, queries) and positive numeric values is prevalent. Analytics on such data typically uses statistics expressed as a sum, over keys in a specified segment, of a function $f$ applied to the weight of the key. The segment can, for example, specify users from a certain region, language, age group, or demographic, or IP flows with certain characteristics. Common functions $f$ include {\em Distinct}, which counts the active keys in the segment; {\em Sum}, which sums their weights; {\em frequency moments}, which raise each weight to a power $p$; {\em frequency capping} statistics, which cap each weight at a parameter $T$; and threshold functions.

The data is often streamed or distributed, and efficient computation requires one or a few passes over the data while maintaining state much smaller than the number of keys. We say that the data is in {\em aggregated} form when each key appears in at most one element. When multiple elements can have the same key, we say the data is {\em unaggregated} and interpret the weight of a key as the sum of the weights of its elements. A very effective tool for working with very large data is computing a smaller sample from which statistics can be estimated. To obtain optimal tradeoffs between sample size and estimation quality for $f$-statistics, we need to sample keys with probability proportional to $f$ applied to their weight.

The first tool we present applies to aggregated data and addresses applications that require estimates with statistical guarantees for a set of different $f$-statistics. We design a {\em multi-objective} sampling scheme that provides guarantees for all monotone non-decreasing $f$ (this includes all moments, capping, and threshold functions). The sampling can be performed on streamed or distributed data using state proportional to the sample size. We show that the sample is larger than a respective dedicated sample by only a factor logarithmic in the number of keys, and that this size-quality tradeoff is tight.

The second tool we present, and our main technical result, is a novel sampling framework for unaggregated data. Our framework unifies classic stream sampling schemes dedicated to distinct or sum statistics (distinct reservoir sampling and sample-and-hold). Importantly, we provide the first solution for general frequency capping statistics with size-quality tradeoffs that nearly match the best possible with aggregated data. We also motivate capping statistics by discussing their use in online advertising.
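
To make the setting concrete, the following is a minimal illustrative sketch (not one of the schemes presented in the talk) of sampling aggregated data with inclusion probability proportional to the capped weight $\min(w, T)$, together with the standard inverse-probability (Horvitz-Thompson) estimator for a capped statistic over a segment. The function and parameter names (pps_sample, estimate_capped_sum, sample_size) are hypothetical and chosen only for illustration.

```python
import random

def cap(w, T):
    """Capped weight: min(w, T)."""
    return min(w, T)

def pps_sample(weights, T, sample_size):
    """Poisson PPS-style sample of aggregated data.

    weights: dict mapping key -> positive weight (each key appears once).
    Each key is included independently with probability proportional to
    cap(w, T); the sampled keys are stored with their weight and
    inclusion probability.
    """
    total = sum(cap(w, T) for w in weights.values())
    sample = {}
    for key, w in weights.items():
        p = min(1.0, sample_size * cap(w, T) / total)  # inclusion probability
        if random.random() < p:
            sample[key] = (w, p)
    return sample

def estimate_capped_sum(sample, T, segment):
    """Horvitz-Thompson estimate of sum over keys in the segment of cap(w, T).

    Each sampled key in the segment contributes its capped weight divided
    by its inclusion probability, which makes the estimate unbiased.
    """
    return sum(cap(w, T) / p
               for key, (w, p) in sample.items() if key in segment)

# Example usage with toy data.
weights = {"user1": 3.0, "user2": 50.0, "user3": 0.5, "user4": 7.0}
sample = pps_sample(weights, T=10.0, sample_size=2)
print(estimate_capped_sum(sample, T=10.0, segment={"user1", "user2"}))
```

A dedicated sample of this kind targets a single choice of $f$ (here, capping with a fixed $T$); the multi-objective scheme discussed in the talk instead provides guarantees simultaneously for all monotone non-decreasing $f$ at a modest increase in sample size.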
