Aspiring Data Science

#python #books

Ссылки на посты по книжке "Л. Рамальо. Python – к вершинам мастерства: Лаконичное и эффективное программирование" (в оригинале - Fluent Python, 2nd Edition). Содержат материал, который показался мне интересным и вошёл в категорию #codegems.

Затрагиваются механизмы сопоставления (match), классы данных и их аналоги, аннотирование типами, инструменты itertools, работа с классами/ООП, генераторы, контекстные менеджеры, асинхронка, дескрипторы классов.

Пробегитесь по темам, если есть незнакомые слова, возможно, есть смысл перечитать актуальную доку Питон )

1. [], {} и ()/match/ChainMap/MappingProxyType
2. class init/dict/json
3. unicode: NFC/NDF, strxfrm/NamedTuple/dataclass
4. more dataclass/typehints
5. weakrefs/functional programming/more typehints
6. Any/|/TypeVar/TypeAlias/typing.Protocol
7. positional-only/closures/singledispath/decorator via class
8. getattr/reduce via initializer/zip,zip_longest/principle of failing fast
9. goose typing/vurtual subclass/Hashable/ABC/Decimal
10. UserDict, UserList, UserString/MRO/mixin/get_annotations
11. (sub)generator/coprogram/type: ignore/with/@contextmanager
12. else in for,while,try/scientific sins/GIL/getswitchinterval/asyncio
13. asyncio.to_thread/asyncpg/asyncio.Semaphore/async with/keyword.iskeyword
14. property/vars/metaprogramming
15. class descriptors

Если решите читать книгу - ТОЛЬКО в оригинале, русский перевод плох.

#python #codegems

По умолчанию любой экземпляр пользовательского класса считается истинным, но положение меняется, если реализован хотя бы один из методов bool или len. Функция bool(x), по существу, вызывает x.bool() и использует полученный результат. Если…

⚡1

851 viewsAnatoly Alekseev, edited 08:22

Aspiring Data Science

#sklearn #codegems

Прошло больше года, и только сейчас в сайкит-лёрн замёрджили ГОТОВУЮ ветку, которая втрое ускоряет расчёт classification_report.

Adrin Jalali закодил решение в течение суток с момента, как я нашёл дублирующие вызовы в sklearn. Остальное время заняли бюрократические согласования.

Вывод: open-source это страшная бюрократия, не надейтесь, что вам что-то быстро исправят, делайте сами. Вместе с тем, не ленитесь создавать хотя бы багрепорты, чтобы следующие поколения DS-ов хоть немного могли почиллить 😅

Примерно с той же эффективностью можно оставлять багрепорты или запросы функциональности для catboost - да, сделают в течение года-полутора (что по сути вечность), но я таким путём за последние N лет пропихнул улучшений 4-5 туда.

GitHub

Speed up classification_report · Issue #26808 · scikit-learn/scikit-learn

Describe the workflow you want to enable I'm concerned with slow execution speed of the classification_report procedure which makes it barely suitable for production-grade workloads. On a 8M sa...

🔥3

102 viewsAnatoly Alekseev, edited 22:46

Aspiring Data Science

#numpy #codegems #vectorize

Будьте осторожны с vectorize )

💯1

115 viewsAnatoly Alekseev, 22:15

Aspiring Data Science

#gpt #llms #codegems #openai

Красивый способ извлечь текстовые данные в структурированном виде. Пример Extracting data from research papers using Structured Outputs.

from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class ResearchPaperExtraction(BaseModel):
    title: str
    authors: list[str]
    abstract: str
    keywords: list[str]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "You are an expert at structured data extraction. You will be given unstructured text from a research paper and should convert it into the given structure."},
        {"role": "user", "content": "..."}
    ],
    response_format=ResearchPaperExtraction,
)

research_paper = completion.choices[0].message.parsed

Example response:

{
"title": "Application of Quantum Algorithms in Interstellar Navigation: A New Frontier",
"authors": [
"Dr. Stella Voyager",
"Dr. Nova Star",
"Dr. Lyra Hunter"
],
"abstract": "This paper investigates the utilization of quantum algorithms to improve interstellar navigation systems. By leveraging quantum superposition and entanglement, our proposed navigation system can calculate optimal travel paths through space-time anomalies more efficiently than classical methods. Experimental simulations suggest a significant reduction in travel time and fuel consumption for interstellar missions.",
"keywords": [
"Quantum algorithms",
"interstellar navigation",
"space-time anomalies",
"quantum superposition",
"quantum entanglement",
"space travel"
]
}

111 viewsAnatoly Alekseev, 07:06

Aspiring Data Science

#codegems #seedir

Классная утилита seedir:

import seedir as sd

sd.seedir(
    DATA_FOLDER,
    style="lines",
    itemlimit=10,
    depthlimit=3,
    exclude_folders=".ipynb_checkpoints",
    sort=True,
)

data/
├─categories/
│ └─categories/
│ ├─unique.item_brand.parquet
│ ├─unique.item_category.parquet
│ ├─unique.item_id.parquet
│ ├─unique.item_shop.parquet
│ ├─unique.user_age.parquet
│ ├─unique.user_brands.parquet
│ ├─unique.user_categories.parquet
│ ├─unique.user_consumption_2.parquet
│ ├─unique.user_gender.parquet
│ └─unique.user_geography.parquet
├─processed/
│ ├─train/
│ │ ├─_file_list.txt
│ │ ├─_metadata
│ │ ├─_metadata.json
│ │ ├─part_0.parquet
│ │ └─schema.pbtxt
│ └─valid/
│ ├─_file_list.txt
│ ├─_metadata
│ ├─_metadata.json
│ ├─part_0.parquet
│ └─schema.pbtxt
├─train/
│ └─part.0.parquet
├─valid/
│ └─part.0.parquet
└─workflow/
├─categories/
│ ├─unique.item_brand.parquet
│ ├─unique.item_category.parquet
│ ├─unique.item_id.parquet
│ ├─unique.item_shop.parquet
│ ├─unique.user_age.parquet
│ ├─unique.user_brands.parquet
│ ├─unique.user_categories.parquet
│ ├─unique.user_consumption_2.parquet
│ ├─unique.user_gender.parquet
│ └─unique.user_geography.parquet
├─metadata.json
└─workflow.pkl

140 viewsAnatoly Alekseev, 14:13

Aspiring Data Science

#codegems #frameworks

С заменой sklearn/pytorch я не согласен, это глупость. selectolax я не знаю, а вот под всем остальным скорее подпишусь. Лучше начинать проекты с рекомендуемых тулз вместо исторически самых популярных.

https://python.plainenglish.io/5-overrated-python-libraries-and-what-you-should-use-instead-106bd9ded180

149 viewsAnatoly Alekseev, edited 04:49

Aspiring Data Science

#python #codegems #chainmap

https://towardsdev.com/what-99-of-python-developers-dont-know-about-chainmap-and-why-it-s-a-game-changer-d9e749f965df

Medium

What 99% of Python Developers Don’t Know About ChainMap (And Why It’s a Game-Changer

“Knowledge is power, but knowledge without implementation is useless.” — Charles F. Kettering

138 viewsAnatoly Alekseev, 06:17

Aspiring Data Science

#python #codestyle #codegems

Есть о чём задуматься, типа конструкций for elem in mylist or []:

https://levelup.gitconnected.com/12-production-grade-python-code-styles-ive-picked-up-from-work-ad32d8ae630d

Medium

12 Production-Grade Python Code Styles I’ve Picked Up From Work

Read Free…

91 viewsAnatoly Alekseev, edited 05:24

Aspiring Data Science

#polars #pandas #codegems

Что в поларс НЕ понравилось.

Активное переименование методов от версии к версии. Это сводит с ума все AI, которые пробуешь использовать для генерации кода.

Мало примеров в доке к сложным методам типа .over и .group_by, .group_by_dynamic. Реально непонятно, как и на каких принципах оно отработает. В документации расписано слабовато.

Иногда polars начинает жрать всю память на относительно простых операциях, где мог бы быть более экономным. Иногда проскакивают ошибки/панические атаки, зачастую сложновоспроизводимые.

И самое главное. Если у вас большие фреймы и много операций (математических, group_by), поларс отработает быстро, но гарантированно забьёт вам всю оперативку. То же сделает и пандас, но за ним неиспользованную память можно хотя бы (если повезёт) подчистить вызовом ctypes.CDLL("libc.so.6").malloc_trim(0)

Поларс же, похоже, создаёт какие-то маленькие и разбросанные в адресном пространстве арены памяти, которые таким способом не подчищаются. И вообще никаким не подчищаются, только завершением процесса. К примеру, у меня после интенсивных расчётов и джойнов фрейм в 100 гигов, а занято памяти на терабайт. И освободить её на линуксе никак нельзя, я уже и аллокаторы пробовал альтернативные типа jemalloc, tmalloc - без толку. Это я считаю большой проблемой вычислений и big data, и странно, что никого это не колышет особо. По факту сбора мусора в линукс нет, если вы не знали, ребята. По крайней мере после отработки поларса. "Спасибо" странным авторам аллокаторов памяти.

Что забавно, на винде этот вопрос решается!!! )) Используйте

def trim_windows_process_memory(pid: int = None) -> bool:
    """Causes effect similar to malloc_trim on -nix."""

    # Define SIZE_T based on the platform (32-bit or 64-bit)
    if ctypes.sizeof(ctypes.c_void_p) == 4:
        SIZE_T = ctypes.c_uint32
    else:
        SIZE_T = ctypes.c_uint64

    # Get a handle to the current process
    if not pid:
        pid = ctypes.windll.kernel32.GetCurrentProcess()

    # Define argument and return types for SetProcessWorkingSetSizeEx
    ctypes.windll.kernel32.SetProcessWorkingSetSizeEx.argtypes = [
        ctypes.wintypes.HANDLE,  # Process handle
        SIZE_T,  # Minimum working set size
        SIZE_T,  # Maximum working set size
        ctypes.wintypes.DWORD,  # Flags
    ]
    ctypes.windll.kernel32.SetProcessWorkingSetSizeEx.restype = ctypes.wintypes.BOOL

    # Define constants for SetProcessWorkingSetSizeEx
    QUOTA_LIMITS_HARDWS_MIN_DISABLE = 0x00000002

    # Attempt to set the working set size
    result = ctypes.windll.kernel32.SetProcessWorkingSetSizeEx(pid, SIZE_T(-1), SIZE_T(-1), QUOTA_LIMITS_HARDWS_MIN_DISABLE)

    if result == 0:
        # Retrieve the error code
        error_code = ctypes.windll.kernel32.GetLastError()
        logger.error(f"SetProcessWorkingSetSizeEx failed with error code: {error_code}")
        return False
    else:
        return True

и спите спокойно.

Следующая жёсткая проблема - производительность с dtype=category. Не храните паркетные файлы с этим типом данных, лучше конвертируйте в string (с compression='zstd', конечно). Иначе потом они будут склеиваться ВЕЧНОСТЬ, при любых настройках StringCache.

Stack Overflow

Polars DF takes up lots of RAM

I'm running a python script that analyses a dataframe (loaded from a parquet file).
I've ran my program using the memory_profiler package to see the mem signature it has - on 2 DataFrames with a si...

175 viewsAnatoly Alekseev, edited 21:34

Aspiring Data Science

#python #codegems

Как узнать, кто тянет вас на дно?! )

python -X importtime -c "from mymodule import *" 2>import_times.txt

101 viewsAnatoly Alekseev, edited 07:07

Aspiring Data Science

#polars #deltalake #orjson #codegems

Попробовал deltalake в решении по сбору данных. отстой, лучше бы любую СУБД заюзал типа постгре или даже монго. Некоторые выводы из мини-проекта:

1) orjson is x20 faster than json

2) xxhash.xxh128 is x6 faster than hashlib.sha256

3) deltalake package is (at least so far) the toy solution. does not support concurrent writes, I had to use manual locking. with many small updates, requires frequent tables "re-optimizing". i just needed a "primary key" functionality from it - and it's slow, while spending LOTS of CPU. I should have better used any RDBMS, or mongo, instead.

В каком случае deltalake можно использовать: когда записываете данные редко, и с таблицей работает один поток. Либо хочется хостить данные в облачном хранилище типа gcp напрямую в паркете. Еще можно воспользоваться полуручным локом на время операций с дельта таблицей:

import os
import logging
from urllib.parse import urlparse
from filelock import FileLock, Timeout

logger = logging.getLogger(__name__)


def is_local_path(path: str) -> bool:
    parsed = urlparse(path)
    # If there's no scheme or it's explicitly "file"
    if parsed.scheme in ("", "file"):
        return not path.startswith(("s3://", "azure://"))

    # Special case: Windows drive letter (e.g., "R:\...")
    if os.name == "nt" and len(parsed.scheme) == 1 and parsed.scheme.isalpha():
        return True

    return False


def safe_delta_write(path: str, delta_op_func, *, lock_timeout: int = 120, lock_suffix=".lock"):
    """
    Wraps any Delta Lake operation (write_deltalake, merge+execute) with local file locking.

    Parameters:
        path (str): Delta table path.
        delta_op_func (callable): A function that performs the actual Delta operation.
        lock_timeout (int): How many seconds to wait for the lock before skipping.
        lock_suffix (str): Suffix for the lock filename.

        Usage Examples
        🔁 For .merge().when_not_matched_insert_all().execute():

        def merge_ads_static():
            return DeltaTable(ADS_STATIC_PATH).merge(
                static_df,
                predicate="t.id = s.id",
                source_alias="s",
                target_alias="t",
                writer_properties=DELTALAKE_OPTIONS.get("writer_properties")
            ).when_not_matched_insert_all().execute()

        safe_delta_write(ADS_STATIC_PATH, merge_ads_static)

        📝 For write_deltalake() appends:

        def write_market_ads():
            return write_deltalake(
                MARKET_ADS_PATH,
                market_df,
                mode="append",
                partition_by=["date"],
                **DELTALAKE_OPTIONS
            )

        safe_delta_write(MARKET_ADS_PATH, write_market_ads)
    """
    if is_local_path(path):
        lock_file = os.path.join("/tmp", f"{os.path.basename(path).replace('/', '_')}{lock_suffix}")
        lock = FileLock(lock_file)

        try:
            with lock.acquire(timeout=lock_timeout):
                logger.debug(f"Acquired lock for local Delta path: {path}")
                return delta_op_func()
        except Timeout:
            logger.warning(f"Timeout while waiting for lock on {path}. Skipping operation.")
        except Exception as e:
            logger.exception(f"Delta operation failed on {path}: {e}")
            raise (e)
    else:
        logger.warning(f"Delta operation on non-local path: {path}. Proceeding without lock.")
        try:
            return delta_op_func()
        except Exception as e:
            logger.exception(f"Delta operation failed on {path}: {e}")

132 viewsAnatoly Alekseev, edited 18:46

Aspiring Data Science

#algorithms #bloomfilter #hyperloglog #minhash #lshforest #fastquantile #rollingquantiles #python #codegems

https://youtu.be/rOi0fybeUf8?si=FvW-LWy8Bo0QFIer

YouTube

Артур Соловьёв | Fast and approximate responses

Спикер: Артур Соловьёв, HFT, Senior Engineer

Data Fest 2025: https://ods.ai/events/datafest2025
Презентацию к докладу Вы можете скачать в треке секции OptimalDL: https://ods.ai/tracks/df25-optimaldl
______
Наши соц.сети:
Telegram: https://t.me/datafest
Вконтакте:…

91 viewsAnatoly Alekseev, edited 23:01

Aspiring Data Science

#python #codegems

https://python.plainenglish.io/6-python-libraries-so-modular-i-stopped-using-frameworks-bdaa939cffd9

Medium

6 Python Libraries So Modular, I Stopped Using Frameworks

Build only what you need — without getting locked in

81 viewsAnatoly Alekseev, edited 02:33

About

Blog

Apps

Platform