Aspiring Data Science

#pandas #bugs #bollocks

I had barely started working with financial data when pandas choked on it. How is that even possible? It's so slow, and yet so battle-tested by time and by hundreds of thousands of coders.

https://github.com/pandas-dev/pandas/issues/52505

BUG: incorrect reading of CSV containing large integers · Issue #52505 · pandas-dev/pandas
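
One common way large integers get silently corrupted while loading a CSV is a detour through float64, which represents integers exactly only up to 2**53. Whether that is the mechanism behind issue #52505 is an assumption here; the post does not say. The sketch below uses only the standard library, and the column name `amount`, the sample values, and the helper `lossy_rows` are all hypothetical.

```python
import csv
import io

# Hypothetical CSV with integers above 2**53, the float64 exact-integer limit.
RAW = "amount\n9007199254740993\n12345678901234567890\n42\n"

def lossy_rows(text, column):
    """Return (exact, via_float) pairs for values that change when routed through float."""
    bad = []
    for row in csv.DictReader(io.StringIO(text)):
        exact = int(row[column])             # Python ints are arbitrary precision
        via_float = int(float(row[column]))  # what a float64 detour would keep
        if via_float != exact:
            bad.append((exact, via_float))
    return bad

damaged = lossy_rows(RAW, "amount")
for exact, via_float in damaged:
    print(f"{exact} would become {via_float}")
```

With pandas itself, one way to sidestep lossy type inference for such columns is to read them as strings (`dtype=str` in `read_csv`) and convert explicitly afterwards.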

42 views, 01:33

Aspiring Data Science

#numpy #bugs

NumPy fell over on this project too :) It asked for 4 exbibytes of memory.

https://github.com/numpy/numpy/issues/23564

BUG: Memory Overflow in np.histogram with bins="auto" · Issue #23564 · numpy/numpy

Describe the issue: Something is wrong with the "auto" option. Reproduce the code example: import numpy as np hist, bin_edges = np.histogram( np.array( [ -4.24264069e00, -5.55111512e-17, ...
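
NumPy documents `bins="auto"` as a combination of the Sturges and Freedman-Diaconis estimators. A plausible (assumed, not confirmed by the post) way to end up with an absurd allocation is a near-zero interquartile range next to a wide data range: the Freedman-Diaconis width becomes tiny and the bin count explodes. The sketch below is a rough standard-library model of that rule, not NumPy's actual code; the sample values and the `capped_bin_count` guard are hypothetical.

```python
import math
import statistics

def fd_bin_count(data):
    """Bin count implied by the Freedman-Diaconis width 2 * IQR * n**(-1/3)."""
    n = len(data)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    width = 2.0 * (q3 - q1) * n ** (-1.0 / 3.0)
    if width == 0.0:
        return None  # the rule is undefined for a zero IQR
    return math.ceil((max(data) - min(data)) / width)

def capped_bin_count(data, max_bins=10_000):
    """Hypothetical guard: fall back to max_bins when the estimate explodes."""
    bins = fd_bin_count(data)
    if bins is None or bins > max_bins:
        return max_bins
    return bins

# Hypothetical sample: two outliers around a cluster that is almost exactly zero.
sample = [-4.0, -1e-16, -5e-17, 0.0, 5e-17, 1e-16, 4.0]
estimated = fd_bin_count(sample)
print(f"FD rule asks for {estimated:.3e} bins")
print(f"at 8 bytes per count that is about {estimated * 8 / 2**60:.2f} EiB")
print("capped:", capped_bin_count(sample))
```

Passing an explicit integer bin count to `np.histogram` instead of `"auto"` avoids the estimator altogether when the data may contain such clusters.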

49 views, edited 00:53

Aspiring Data Science

#ml #catboost #metrics #bugs

Spent the morning in heated arguments about precision. I found a suspected bug in how CatBoost computes precision.

https://github.com/catboost/catboost/issues/2422

Precision calculation error in Early Stopping. Request to add pos_label. · Issue #2422 · catboost/catboost

Problem: catboost version: 1.2 Operating System: Win CPU: + GPU: + I think that somewhere in the catboost code that computes precision, the predictions and the true values are swapped, so early stopping on prec...
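
To see why swapped arguments matter, here is a minimal pure-Python sketch (not CatBoost's code; the function names and the toy labels are made up for illustration). Precision is TP / (TP + FP); if predictions and true values trade places, the same formula yields TP / (TP + FN), which is recall.

```python
def precision(y_true, y_pred, pos_label=1):
    """TP / (TP + FP): share of predicted positives that are truly positive."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == pos_label]
    if not predicted_pos:
        return 0.0
    return sum(t == pos_label for t in predicted_pos) / len(predicted_pos)

def recall(y_true, y_pred, pos_label=1):
    """TP / (TP + FN): share of actual positives that were found."""
    actual_pos = [p for t, p in zip(y_true, y_pred) if t == pos_label]
    if not actual_pos:
        return 0.0
    return sum(p == pos_label for p in actual_pos) / len(actual_pos)

# Toy data: a cautious model that flags one positive and is right about it.
y_true = [1, 1, 1, 0]
y_pred = [1, 0, 0, 0]

correct = precision(y_true, y_pred)   # arguments in the intended order
swapped = precision(y_pred, y_true)   # arguments accidentally swapped

print("precision:", correct)
print("precision with swapped arguments:", swapped)
print("recall:", recall(y_true, y_pred))
```

The `pos_label` parameter mirrors the request in the issue title: it selects which class counts as positive, so `precision(y_true, y_pred, pos_label=0)` scores the other class.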

121 views, Anatoly Alekseev, edited 11:11

Aspiring Data Science

#lightgbm #bugs

LightGBM can mangle categorical inputs at prediction time (even though it should never modify its inputs at all). This bug cost me so much blood... I kept wondering where those damn zeros were coming from, since I never touch the dataset.

But what about the army of Kagglers who use ensembles? Why has nobody noticed this and reported it long ago?

Now all that's left is to find a bug in XGBoost, and I can close that gestalt.

UPD.

"jmoralez commented 18 minutes ago
Hey, thanks for using LightGBM and sorry for the troubles. We used to take a shallow copy there but it wasn't obvious that the predict step depended on that and a recent refactor removed it. We'll work on a fix."

Funny, it didn't take them half a year to respond. CatBoost/Yandex, take notes.

https://github.com/microsoft/LightGBM/issues/6195

LightGBM corrupts categorical columns with unseen values on prediction · Issue #6195 · microsoft/LightGBM

Description In predict_proba of LGBMClassifier at least, if the input is a pandas dataframe, in a categorical column, when a value is met not seen while fitting, entire column becomes corrupt. Repr...
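
The maintainer's reply about a removed shallow copy can be illustrated with a small stand-in. This is not LightGBM's code: a plain dict of lists plays the role of the DataFrame, and `encode_in_place` / `encode_on_copy` are hypothetical names. The point is only that rebinding a column on a shallow copy leaves the caller's object alone, while doing it on the original does not.

```python
import copy

def encode_in_place(frame, column, known):
    """Buggy pattern: overwrites the caller's column with category codes."""
    frame[column] = [known.index(v) if v in known else None for v in frame[column]]
    return frame

def encode_on_copy(frame, column, known):
    """Safer pattern: a shallow copy lets us rebind the column privately."""
    local = copy.copy(frame)  # new dict, same underlying lists
    local[column] = [known.index(v) if v in known else None for v in local[column]]
    return local

known_categories = ["a", "b"]
# "c" stands for a value that was never seen during fitting.
data = {"cat": ["a", "b", "c"], "x": [1.0, 2.0, 3.0]}
snapshot = copy.deepcopy(data)

encode_on_copy(data, "cat", known_categories)
unchanged_after_copy = data == snapshot

encode_in_place(data, "cat", known_categories)
unchanged_after_in_place = data == snapshot

print("input intact after encode_on_copy:", unchanged_after_copy)
print("input intact after encode_in_place:", unchanged_after_in_place)
```

The same snapshot-and-compare check works as a quick regression test around any `predict` call that is suspected of touching its input.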

99 views, Anatoly Alekseev, edited 01:17

Aspiring Data Science

#parquet #pyarrow #bugs

Managed to track down a very nasty bug in pyarrow (the engine pandas uses by default when reading Parquet).
When reading large files with mixed column dtypes, it used twice as much memory as needed and never released it. A genuine leak. It definitely occurs on Windows; I don't know about *nix.
I first saw it a year or two ago but didn't report it, assuming it would get fixed without me.

https://github.com/apache/arrow/issues/38736

Memory leak on Windows when reading parquet with mixed dtypes via Pyarrow · Issue #38736 · apache/arrow

Describe the bug, including details regarding any error messages, version, and platform. I've been noticing a memory leak for several years now. When reading a big parquet file, pyarrow lib or ...

120 views, Anatoly Alekseev, edited 23:04
