Aspiring Data Science

#ml #regression #astroml #cylowess #locfit #gaussianprocess

How to make Linear, Ridge, and Lasso regression nonlinear by mapping the inputs onto a Gaussian basis.
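A minimal sketch of the idea, assuming 1-D inputs, evenly spaced centers, and a width equal to the center spacing (all of these are illustrative choices, not taken from the linked notebook). Ridge is written in closed form with NumPy only; the helper names `gaussian_basis`, `fit_ridge`, and `predict` are hypothetical.

```python
import numpy as np

def gaussian_basis(x, centers, width):
    """Map 1-D inputs to Gaussian (RBF) features, one column per center."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return np.exp(-0.5 * ((x - centers.reshape(1, -1)) / width) ** 2)

def fit_ridge(features, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    design = np.column_stack([np.ones(len(features)), features])
    penalty = alpha * np.eye(design.shape[1])
    penalty[0, 0] = 0.0  # do not shrink the intercept
    return np.linalg.solve(design.T @ design + penalty, design.T @ y)

def predict(features, weights):
    """Apply fitted weights (intercept first) to a feature matrix."""
    return weights[0] + features @ weights[1:]

# Synthetic data: a noisy sine that a straight line cannot follow
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0.0, 10.0, 80))
y_train = np.sin(x_train) + 0.1 * rng.normal(size=x_train.size)

centers = np.linspace(0.0, 10.0, 12)  # assumption: evenly spaced centers
width = centers[1] - centers[0]       # assumption: width equals the spacing

weights = fit_ridge(gaussian_basis(x_train, centers, width), y_train, alpha=1e-3)

x_test = np.linspace(0.5, 9.5, 50)
y_true = np.sin(x_test)
y_basis = predict(gaussian_basis(x_test, centers, width), weights)
rmse_basis = float(np.sqrt(np.mean((y_basis - y_true) ** 2)))

# Baseline: the same linear model on the raw input
line = fit_ridge(x_train.reshape(-1, 1), y_train, alpha=1e-3)
y_line = predict(x_test.reshape(-1, 1), line)
rmse_linear = float(np.sqrt(np.mean((y_line - y_true) ** 2)))
```

The model stays linear in its weights, so the same feature matrix can be handed to any linear estimator; only the inputs are transformed.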

https://notebook.community/DanielAndreasen/Programmers-Club/notebook/Regression%20-%20an%20example%20notebook

97 views, Anatoly Alekseev, 12:57

#bayesian #optimization #python #gaussianprocess

https://medium.com/@okanyenigun/step-by-step-guide-to-bayesian-optimization-a-python-based-approach-3558985c6818

Medium

Step-by-Step Guide to Bayesian Optimization: A Python-based Approach

Building the Foundation: Implementing Bayesian Optimization in Python

88 views, Anatoly Alekseev, 15:51

#bayes #gaussianprocess #kernels

https://peterroelants.github.io/posts/gaussian-process-kernels/

peterroelants.github.io

Gaussian processes (3/3) - exploring kernels

Explore the Gaussian process kernels fitted by the previous post by using various visualizations.

98 views, Anatoly Alekseev, 05:52

#hpt #hpo #bayesian #gaussianprocess

The talk covers the shortcomings of GPs:

1) sensitivity to the scale of the parameters
2) sensitivity to the chosen kernel (in effect, the prior encodes the guesses about the shape of the true black-box function)
3) poor handling of categorical parameters (Hamming kernel?)

When the speaker was asked about parallelization, I think he slipped up by saying that random search has the advantage here because it can run in parallel while a GP cannot. He had just shown the Expected Improvement plots himself, so what stops us from taking a batch of the top points, argsorted()[:batch_size], instead of a single argmax?
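A rough sketch of that batch idea, assuming a minimization problem and a surrogate that already returns a posterior mean and standard deviation per candidate. The numbers below are made up for illustration, and `expected_improvement` and `top_batch` are hypothetical helper names. Note that a plain argsort is ascending, so the EI values are negated to get the largest first.

```python
import math
import numpy as np

def expected_improvement(mu, sigma, best_y):
    """Closed-form EI for minimization, given posterior mean and std."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.maximum(np.asarray(sigma, dtype=float), 1e-12)
    z = (best_y - mu) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (best_y - mu) * cdf + sigma * pdf

def top_batch(ei, batch_size):
    """Indices of the batch_size largest EI values, best first."""
    return np.argsort(-ei)[:batch_size]

# Hypothetical posterior over six candidate points (made-up numbers)
mu = np.array([0.6, 0.4, 0.5, 1.2, 0.45, 0.8])
sigma = np.array([0.05, 0.10, 0.30, 0.60, 0.02, 0.01])
best_y = 0.5  # best loss observed so far

ei = expected_improvement(mu, sigma, best_y)
batch = top_batch(ei, batch_size=3)
```

One caveat with this naive top-k: on a dense grid the highest EI values tend to sit next to each other around one peak, so the batch may contain near-duplicates unless some diversity rule is added.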

https://www.youtube.com/watch?v=jtRPxRnOXnk

YouTube

Thomas Huijskens - Bayesian optimisation with scikit-learn

Filmed at PyData London 2017

81 views, Anatoly Alekseev, edited 20:49

#gaussianprocess #optimization #global

Honestly, so far it seems that the "Bayesian-ness" of Gaussian processes does not buy that much in global optimization. Yes, an uncertainty is computed at every point, but it:

1) often reflects the real behaviour of the target function very poorly
2) grows with the distance to the nearest explored points, so it can be estimated for "classical" ML algorithms as well.

There is also the argument that acquisition functions are computed analytically for a GP, but they can be replaced with a heuristic. We will find out soon.
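A sketch of what such a heuristic could look like, assuming minimization: the distance to the nearest explored point serves as an uncertainty proxy and is combined with the predictions of any regressor in a lower-confidence-bound style. The function names, the `kappa` weight, and the constant stand-in predictions are all illustrative assumptions.

```python
import numpy as np

def nearest_distance(candidates, explored):
    """Euclidean distance from each candidate to its nearest explored point."""
    diffs = candidates[:, None, :] - explored[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)

def heuristic_acquisition(predicted, candidates, explored, kappa=1.0):
    """Prediction minus kappa times the distance proxy; lower is better."""
    return predicted - kappa * nearest_distance(candidates, explored)

explored = np.array([[0.0], [1.0], [4.0]])
candidates = np.linspace(0.0, 4.0, 9).reshape(-1, 1)  # 0.0, 0.5, ..., 4.0

# Stand-in for the predictions of any regressor (hypothetical constant model)
predicted = np.full(len(candidates), 0.5)

uncertainty = nearest_distance(candidates, explored)
scores = heuristic_acquisition(predicted, candidates, explored, kappa=1.0)
next_index = int(np.argmin(scores))
next_point = float(candidates[next_index, 0])
```

With a constant prediction the score is driven purely by exploration, so the chosen point is the candidate farthest from everything tried so far. With a real model, `kappa` trades exploitation against exploration.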

81 views, Anatoly Alekseev, edited 23:47

#smbo #hpo #hpt #gaussianprocess #parzen #tpe #cmaes #constantliar

Algorithms for Hyper-Parameter Optimization

A paper on Sequential Model-Based Global Optimization (SMBO), co-authored by Yoshua Bengio among others, which the hyperopt docs cite.

"EI functions are usually optimized with an exhaustive grid search over the input space, or a Latin Hypercube search in higher dimensions. However, some information on the landscape of the EI criterion can be derived from simple computations [16]:

1) it is always non-negative and zero at training points from D,
2) it inherits the smoothness of the kernel k, which is in practice often at least once differentiable, and noticeably,
3) the EI criterion is likely to be highly multi-modal, especially as the number of training points increases.

The authors of [16] used the preceding remarks on the landscape of EI to design an evolutionary algorithm with mixture search, specifically aimed at optimizing EI, that is shown to outperform exhaustive search for a given budget in EI evaluations."
As the paper shows, people even try to maximize Expected Improvement (EI) itself with a genetic algorithm. I just don't see why, since exhaustive search over it runs very quickly. UPD: although further on the paper reports an odd running time for the algorithm:

"Applying the GP to the problem of optimizing DBN performance, we allowed 3 random restarts to the CMA-ES algorithm per proposal x∗, and up to 500 iterations of conjugate gradient method in fitting the length scales of the GP. The squared exponential kernel [14] was used for every node. The CMA-ES part of GPs dealt with boundaries using a penalty method, the binomial sampling part dealt with it by nature. The GP algorithm was initialized with 30 randomly sampled points in H. After 200 trials, the prediction of a point x∗ using this GP took around 150 seconds."

"Finally, we remark that all hyper-parameters are not relevant for each point. For example, a DBN with only one hidden layer does not have parameters associated to a second or third layer. Thus it is not enough to place one GP over the entire space of hyper-parameters. We chose to group the hyper-parameters by common use in a tree-like fashion and place different independent GPs over each group. As an example, for DBNs, this means placing one GP over common hyper-parameters, including categorical parameters that indicate what are the conditional groups to consider, three GPs on the parameters corresponding to each of the three layers, and a few 1-dimensional GPs over individual conditional hyper-parameters, like ZCA energy (see Table 1 for DBN parameters)."
The emphasis is on the difficulty of sampling parameters one at a time.
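To make the conditional structure from the quote concrete, here is a small sketch of a tree-structured space in which per-layer parameters exist only when that layer is present. The parameter names and ranges are hypothetical, not the paper's Table 1, and `split_by_group` only collects the data each independent surrogate would see; it does not fit any model.

```python
import random

def sample_config(rng):
    """Draw one configuration from a tree-structured (conditional) space."""
    config = {
        "n_layers": rng.choice([1, 2, 3]),
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform draw
    }
    # Per-layer parameters exist only for layers that are actually present.
    for layer in range(1, config["n_layers"] + 1):
        config[f"layer{layer}_units"] = rng.choice([128, 256, 512, 1024])
    return config

def split_by_group(configs):
    """Collect, per parameter group, the trials in which that group is active."""
    groups = {"common": [], "layer1": [], "layer2": [], "layer3": []}
    for config in configs:
        groups["common"].append((config["n_layers"], config["learning_rate"]))
        for layer in (1, 2, 3):
            key = f"layer{layer}_units"
            if key in config:
                groups[f"layer{layer}"].append(config[key])
    return groups

rng = random.Random(0)
trials = [sample_config(rng) for _ in range(200)]
groups = split_by_group(trials)
```

Deeper groups receive fewer observations than the common one, which is part of why a single model over the whole space fits poorly.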

"The tree-structured Parzen estimator (TPE) models p(x|y) by transforming that generative process, replacing the distributions of the configuration prior with non-parametric densities. In the experimental section, we will see that the configuation space is described using uniform, log-uniform, quantized log-uniform, and categorical variables. In these cases, the TPE algorithm makes the following replacements: uniform → truncated Gaussian mixture, log-uniform → exponentiated truncated Gaussian mixture, categorical → re-weighted categorical."
In brief, how Parzen estimators work.
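A heavily simplified 1-D sketch of the TPE idea: split the trials at a loss quantile, build one Parzen (kernel) density ℓ(x) over the good points and another g(x) over the rest, then pick the candidate with the largest ℓ(x)/g(x). The fixed bandwidth, untruncated Gaussians, the `gamma` value, and the function names are assumptions made for brevity; the paper and hyperopt use truncated, adaptively scaled mixtures.

```python
import numpy as np

def kde(points, x, bandwidth):
    """Gaussian Parzen-window density of `points`, evaluated at `x`."""
    diffs = (x[:, None] - points[None, :]) / bandwidth
    norm = len(points) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * diffs ** 2).sum(axis=1) / norm

def tpe_suggest(xs, ys, rng, gamma=0.25, n_candidates=64, bandwidth=0.3):
    """Suggest the candidate with the largest l(x) / g(x) ratio."""
    threshold = np.quantile(ys, gamma)
    good, bad = xs[ys <= threshold], xs[ys > threshold]
    # Candidates are drawn from l(x): a good point plus kernel noise.
    draws = rng.choice(good, size=n_candidates)
    candidates = draws + bandwidth * rng.normal(size=n_candidates)
    ratio = kde(good, candidates, bandwidth) / (kde(bad, candidates, bandwidth) + 1e-12)
    return float(candidates[np.argmax(ratio)])

def objective(x):
    """Toy loss with its minimum at x = 2."""
    return (x - 2.0) ** 2

rng = np.random.default_rng(0)
xs = rng.uniform(-5.0, 5.0, 40)  # past trials
ys = objective(xs)               # their losses
suggestion = tpe_suggest(xs, ys, rng)
```

Because candidates come from random draws rather than an argmax over a fitted surface, repeated calls give different suggestions, which is the stochasticity the paper relies on for parallel runs.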

"For the GP approach, the so-called constant liar approach was used: each time a candidate point x∗ was proposed, a fake fitness evaluation equal to the mean of the y’s within the training set D was assigned temporarily, until the evaluation completed and reported the actual loss f(x
∗). For the TPE approach, we simply ignored recently proposed points and relied on the stochasticity of draws from ℓ(x) to provide different candidates from one iteration to the next. The consequence of parallelization is that each proposal x∗ is based on less feedback. This makes search less efficient, though faster in terms of wall time."
The phrase "constant liar" immediately brings to mind either the balding one or the mustachioed one )
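The constant liar scheme from the quote can be sketched roughly as follows, assuming a 1-D problem, a grid of candidates, and a tiny NumPy GP with a fixed squared exponential kernel (unit variance, unit length scale, no hyper-parameter fitting). The function names and toy data are illustrative, not the paper's setup.

```python
import math
import numpy as np

def rbf(a, b, length_scale=1.0):
    """Squared exponential kernel between two sets of 1-D points."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean and std of a GP fitted to mean-centered targets."""
    offset = y_train.mean()
    k_train = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    k_cross = rbf(x_query, x_train)
    mean = offset + k_cross @ np.linalg.solve(k_train, y_train - offset)
    var = 1.0 - np.einsum("ij,ji->i", k_cross, np.linalg.solve(k_train, k_cross.T))
    return mean, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mean, std, best_y):
    """Closed-form EI for minimization."""
    z = (best_y - mean) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (best_y - mean) * cdf + std * pdf

def propose_batch(x_train, y_train, grid, batch_size):
    """Constant liar: each pending proposal is temporarily assigned the mean
    of the observed losses, so the next proposal is pushed elsewhere."""
    lie = y_train.mean()
    best_y = y_train.min()
    x_fake, y_fake = x_train.copy(), y_train.copy()
    batch = []
    for _ in range(batch_size):
        mean, std = gp_posterior(x_fake, y_fake, grid)
        pick = grid[np.argmax(expected_improvement(mean, std, best_y))]
        batch.append(float(pick))
        x_fake = np.append(x_fake, pick)  # fake evaluation at the pending point
        y_fake = np.append(y_fake, lie)
    return batch

x_train = np.array([-4.0, -1.0, 0.5, 3.0])  # toy evaluated points
y_train = np.sin(x_train)                   # toy losses
grid = np.linspace(-5.0, 5.0, 201)
batch = propose_batch(x_train, y_train, grid, batch_size=3)
```

Once the real losses arrive, the fake entries would be replaced by the actual evaluations before the next round of proposals.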

https://proceedings.neurips.cc/paper_files/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf

83 views, Anatoly Alekseev, edited 00:18
