Aspiring Data Science

#lightgbm #bugs

LightGBM can mangle categorical inputs at prediction time (even though it shouldn't modify its inputs at all). This bug cost me so much blood... I kept wondering where those damn zeros were coming from, since I never change the dataset.

But what about the army of Kagglers who use ensembles? Why did nobody notice this and report it long ago?

Now I just need to find a bug in XGBoost, and I'll have closure.

UPD.

"jmoralez commented 18 minutes ago
Hey, thanks for using LightGBM and sorry for the troubles. We used to take a shallow copy there but it wasn't obvious that the predict step depended on that and a recent refactor removed it. We'll work on a fix."

Strange, it didn't take them half a year to respond. CatBoost/Yandex, take notes.

https://github.com/microsoft/LightGBM/issues/6195

GitHub

LightGBM corrupts categorical columns with unseen values on prediction · Issue #6195 · microsoft/LightGBM

Description In predict_proba of LGBMClassifier at least, if the input is a pandas dataframe, in a categorical column, when a value is met not seen while fitting, entire column becomes corrupt. Repr...
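Until a fix lands, one defensive option is to never hand the original frame to the model. Below is a minimal sketch of that idea: it runs any predict function on a deep copy and reports whether the copy was modified. The helper name `predict_without_side_effects` and the `buggy_predict` stand-in are hypothetical; the stand-in only imitates an input-mutating predictor and is not LightGBM's actual code path.

```python
import pandas as pd


def predict_without_side_effects(predict_fn, X: pd.DataFrame):
    """Run predict_fn on a deep copy of X; return (predictions, was_copy_mutated)."""
    X_copy = X.copy(deep=True)
    preds = predict_fn(X_copy)
    # If the predictor touched its input, the copy no longer equals the original.
    mutated = not X_copy.equals(X)
    return preds, mutated


def buggy_predict(df: pd.DataFrame):
    """Hypothetical stand-in for a predictor that corrupts a categorical column."""
    # Dropping the unseen category "b" turns its rows into NaN.
    df["cat"] = df["cat"].cat.set_categories(["a"])
    return [0] * len(df)


X = pd.DataFrame({"cat": pd.Categorical(["a", "b"]), "x": [1.0, 2.0]})

preds, mutated = predict_without_side_effects(buggy_predict, X)
_, clean_mutated = predict_without_side_effects(lambda df: [0] * len(df), X)

print(mutated)        # the stand-in modified its input copy
print(clean_mutated)  # a well-behaved predictor leaves it alone
print(X["cat"].tolist())  # the original frame is untouched either way
```

With a real model, `predict_fn` would be something like `model.predict_proba`. The deep copy costs memory, so this fits debugging better than a hot prediction path.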


99 views · Anatoly Alekseev, edited 01:17


#lightgbm #improvements

I'm fed up with LightGBM meddling in how I name my features.

https://github.com/microsoft/LightGBM/issues/6202

GitHub

Lift restrinctions on feature names ("LightGBMError: Do not support special JSON characters in feature name") · Issue #6202 · …

Summary Currently, it can be hard to plug in LightGBM into existing ML system because of it's selectivity to feature naming. Underscores, or even non-english language symbols trigger "Ligh...
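While the restriction stands, a common workaround is to rename columns before training and keep a mapping back to the original names. A minimal sketch follows; the helper `sanitize_feature_names` is hypothetical, and the rule it applies (keep only ASCII letters, digits and underscores) is a deliberately conservative assumption, not the exact set of characters LightGBM rejects.

```python
import re


def sanitize_feature_names(names):
    """Map each original name to a unique ASCII-safe name.

    Returns a dict {safe_name: original_name} in the input order.
    """
    mapping = {}
    for name in names:
        # Assumption: anything outside [0-9A-Za-z_] is replaced with "_".
        safe = re.sub(r"[^0-9A-Za-z_]", "_", str(name))
        candidate, n = safe, 1
        # Resolve collisions by appending a numeric suffix.
        while candidate in mapping:
            candidate = f"{safe}_{n}"
            n += 1
        mapping[candidate] = name
    return mapping


names = ["price, usd", "price: usd", "цена", "ok_name"]
mapping = sanitize_feature_names(names)
safe_names = list(mapping)
print(safe_names)
```

For a pandas frame, assigning `df.columns = list(mapping)` before fitting and looking names up in `mapping` afterwards (for example when reading feature importances) restores the original labels.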


94 views · Anatoly Alekseev, 05:30


#lightgbm

I suddenly found out that Microsoft's boosting library can require a lot of memory. It turns out that if the input pandas frame has int32/uint32 columns, it converts everything to float64. I wrote this little utility to stay in float32 at the cost of some precision:

import logging
import numpy as np
import pandas as pd

logger = logging.getLogger(__name__)

def ensure_dataframe_float32_convertability(df: pd.DataFrame) -> None:
    """LightGBM uses np.result_type(*df_dtypes) to pick the array dtype when converting pandas input,
    which results in float64 for int32 and above. To keep memory usage reasonable, cast such columns
    to float32 (in place) before training; integers above 2**24 may lose precision."""
    for precise_dtype in ("uint32", "int32"):
        tmp = df.select_dtypes(precise_dtype)
        if tmp.shape[1] > 0:
            logger.info(f"Converting {tmp.shape[1]:_} {precise_dtype} columns to float32")
            df[tmp.columns] = tmp.astype(np.float32)
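The dtype promotion behind this can be checked with NumPy alone. The sketch below only demonstrates `np.result_type` and the float32 precision limit; the variable names are illustrative, and it does not reproduce LightGBM's internal conversion code.

```python
import numpy as np

# 16-bit integers fit into float32, so mixing them with float32 keeps float32.
small_int_mix = np.result_type(np.float32, np.int16)

# 32-bit integers do not fit, so NumPy promotes the common dtype to float64.
int32_mix = np.result_type(np.float32, np.int32)
uint32_mix = np.result_type(np.float32, np.uint32)

print(small_int_mix, int32_mix, uint32_mix)

# The price of forcing float32: its significand holds 24 bits,
# so 2**24 + 1 is not representable and rounds to 2**24.
precision_lost = np.float32(2**24 + 1) == np.float32(2**24)
print(precision_lost)

# Memory per value doubles when everything is promoted to float64.
bytes_f32 = np.dtype(np.float32).itemsize
bytes_f64 = np.dtype(np.float64).itemsize
print(bytes_f32, bytes_f64)
```

So a single int32 column is enough to double the size of the dense array built from an otherwise float32 frame, while the cast to float32 is only lossless for integers up to 2**24 in magnitude.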


125 views · Anatoly Alekseev, 12:21
