We consider the problem of repeatedly choosing policies to maximize social welfare. Welfare is a weighted sum of private utility and public revenue. Earlier outcomes inform later policies. Utility is not observed but is inferred indirectly. Response functions are learned through experimentation.
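As a rough formalization of this objective (the notation here is ours, not taken from the abstract), the planner's per-period welfare and cumulative regret might be written as:

```latex
% A minimal sketch of the setup described above. Notation is illustrative:
% p is the chosen policy, U(p) is private utility, R(p) is public revenue,
% and \lambda \ge 0 weights revenue against utility.
\[
  SW(p) \;=\; U(p) \;+\; \lambda\, R(p).
\]
% Over T rounds the planner picks p_1, \dots, p_T and incurs regret
\[
  \mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \Bigl( \max_{p}\, SW(p) - SW(p_t) \Bigr).
\]
```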
We derive a lower bound on regret, and a matching adversarial upper bound for a variant of the Exp3 algorithm. Cumulative regret grows at a rate of $T^{2/3}$. This implies that (i) welfare maximization is harder than the multi-armed bandit problem (which has a rate of $T^{1/2}$ for finite policy sets), and (ii) our algorithm achieves the optimal rate. For the stochastic setting, if social welfare is concave, we can achieve a rate of $T^{1/2}$ (for continuous policy sets), using a dyadic search algorithm.
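For readers unfamiliar with Exp3, the sketch below shows the standard Exp3 algorithm over a finite policy set; it is not the paper's variant, and the function names, the `reward_fn` oracle, and the parameter choices are our own illustrative assumptions.

```python
import numpy as np

def exp3(n_policies, T, reward_fn, gamma=0.1, rng=None):
    """Standard Exp3 over a finite policy set (a sketch, not the paper's variant).

    reward_fn(t, i) must return a reward in [0, 1] for policy i at round t;
    here it stands in for the indirectly inferred welfare signal.
    """
    rng = rng or np.random.default_rng()
    weights = np.ones(n_policies)
    chosen = []
    for t in range(T):
        # Mix the weight-proportional distribution with uniform exploration.
        probs = (1 - gamma) * weights / weights.sum() + gamma / n_policies
        i = rng.choice(n_policies, p=probs)
        chosen.append(i)
        reward = reward_fn(t, i)
        # Importance-weighted estimate: only the pulled arm is updated,
        # which keeps the estimator unbiased under bandit feedback.
        est = reward / probs[i]
        weights[i] *= np.exp(gamma * est / n_policies)
    return chosen

# Toy usage with a Bernoulli reward oracle (purely illustrative):
# means = np.array([0.2, 0.5, 0.8])
# picks = exp3(3, 10_000, lambda t, i: np.random.binomial(1, means[i]))
```

Standard Exp3 attains regret of order $\sqrt{TK\log K}$ over $K$ arms in the adversarial bandit setting; the $T^{2/3}$ rate above reflects the harder feedback structure of welfare maximization, not a limitation of Exp3 itself.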
We analyze an extension to nonlinear income taxation and sketch an extension to commodity taxation. We compare our setting to monopoly pricing (which is easier) and to price setting for bilateral trade (which is harder).