Блог*

@dereference_pointer_there

1.83K subscribers

3.23K photos

127 videos

15 files

3.52K links

A blog with an asterisk.

Lots of reposts, a little programming.

A small, fun community: @decltype_chat_ptr_t
Author: @insert_reference_here

A lovely story of how Rockstar parsed a 10 MB JSON file in GTA V in the name of Satan. There is even a small patch for avid gamers. An unofficial one. https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/ It reminded me of https://t.me/oleg_log/3970

#prog #performancetrap #article

It Can Happen to You

Or how one person, smart enough to write an incredibly fast 3D viewer for STL files, made exactly the same mistake, so that parsing the text form of the format took longer than all the other rendering stages combined.
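As far as the two linked articles describe it, the shared mistake is a C parsing routine (sscanf) that quietly measures the whole remaining buffer before reading each number, which turns a linear parse into a quadratic one. The Rust sketch below only models that cost by counting bytes "touched"; the function names and the counting scheme are illustrative assumptions, not code from either article.

```rust
/// Reads one whitespace-delimited token starting at byte offset `pos`.
/// Returns the token and the offset just past it, or None at end of input.
fn next_token(input: &str, pos: usize) -> Option<(&str, usize)> {
    let rest = &input[pos..];
    let start = rest.find(|c: char| !c.is_whitespace())?;
    let rest = &rest[start..];
    let len = rest.find(char::is_whitespace).unwrap_or(rest.len());
    Some((&rest[..len], pos + start + len))
}

/// Models a parser whose per-number routine first walks to the end of the
/// buffer (a strlen-like scan). Returns the values and the bytes touched.
fn parse_rescanning(input: &str) -> (Vec<f32>, usize) {
    let mut values = Vec::new();
    let mut touched = 0usize;
    let mut pos = 0usize;
    while let Some((token, next)) = next_token(input, pos) {
        // The hidden cost: everything that is still left gets scanned again.
        touched += input.len() - pos;
        values.push(token.parse::<f32>().expect("input contains only numbers"));
        pos = next;
    }
    (values, touched)
}

/// Single pass over the input: every byte is looked at a constant number of times.
fn parse_single_pass(input: &str) -> (Vec<f32>, usize) {
    let values = input
        .split_whitespace()
        .map(|t| t.parse::<f32>().expect("input contains only numbers"))
        .collect();
    (values, input.len())
}

fn main() {
    // Hypothetical input: 10_000 numbers in text form.
    let input = "0.25 ".repeat(10_000);
    let (a, slow) = parse_rescanning(&input);
    let (b, fast) = parse_single_pass(&input);
    assert_eq!(a, b);
    println!("bytes touched: rescanning = {slow}, single pass = {fast}");
}
```

Both functions return the same numbers; only the modeled amount of work differs, and the gap grows with the square of the input size.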

#prog #performancetrap #article (two of them) from the well-known Daniel Lemire

Mispredicted branches can multiply your running times

Benchmarking is hard: processors learn to predict branches

(thanks @al_tch)

#prog #rust #performancetrap #article

A performance retrospective using Rust (part 3)

Or the case where ? turned out to be slower than try!.

agourlay.github.io

#prog #rust #performancetrap #article

Contention on multi-threaded regex matching

Or on the pitfalls of using regexes in Rust from multiple threads

More Stina Blog!

Let’s say you need to match the same regex across a large number of strings – perhaps you’re applying a grep-like filter to data generated or received by your program. This toy ex…

#prog #rust #performancetrap #article

Read the documentation, programmers.

Upgradable parking_lot::RwLock might not be what you expect

In essence, the article boils down to a single paragraph from the parking_lot documentation (specifically, for the RwLock::upgradable_read method):

Locks this rwlock with upgradable read access, blocking the current thread until it can be acquired.

The calling thread will be blocked until there are no more writers or other upgradable reads which hold the lock. There may be other readers currently inside the lock when this method returns.

More Stina Blog!

Let’s say we’re building a simple table indexed by integers starting with 0. Although the keys are contiguous, the table is loaded from key-value pairs that arrive in arbitrary order. T…

#prog #rust #performancetrap #article

The stable HashMap trap

You read about faster hash functions and switch to one. Most of your code gets the expected speed boost, but some parts mysteriously get slower – much slower, especially when dealing with large hashmaps. If this sounds familiar, you might have encountered the stable HashMap trap.

More Stina Blog!

#prog #performancetrap #video

"Performance Matters" by Emery Berger

Essentially a presentation of two performance analysis tools.

The first is Stabilizer. Program performance depends to a considerable degree on how data is laid out in memory and on the environment in which the program runs. The speaker cites a paper showing that the effect of these variables can be quite significant and can outweigh even the difference between optimized and unoptimized code. At runtime, Stabilizer re-randomizes the layout of code and of heap data every half second, which makes it possible to take a performance profile that accounts for all possible layout effects. Because the central limit theorem applies here, the overall effect of layout is described (given a sufficiently large number of samples) by a normal distribution, so statistical methods can be used to measure how much of a performance change is due to changes in the code. Unfortunately, this tool is no longer actively developed.

The second tool (the livelier one) is coz, a causal profiler. At the cost of a few source annotations, this profiler lets you estimate how strongly a change in the performance of one component affects the performance of the system as a whole. Since you cannot simply make code run faster, coz achieves the required effect by slowing down all the other components. The video covers how coz helped in real cases, which unexpected bottlenecks it pointed to, and how well the measured performance gains matched the tool's predictions.

Funny: I had already watched this video, and Danya mentioned coz on his channel, but only now did I come across it again and post it here.

Performance clearly matters to users. For example, the most common software update on the AppStore is "Bug fixes and performance enhancements." Now that Moore's Law has ended, programmers have to work hard to get high performance for their applications. But…

мне не нравится реальность ("I don't like reality")

cursed-fact-of-the-day: binary search over an array of 2^20 elements is roughly 20% slower than binary search over an array of 2^20 + 123 elements. The reason: https://en.algorithmica.org/hpc/cpu-cache/associativity/ Source: twitter@sergey_slotin

#prog #article #performancetrap

Gallery of Processor Cache Effects

#prog #performancetrap

#prog #rust #article #performancetrap

SQLx Compile Time Woes, or how compilation of projects that use sqlx was sped up significantly (and these changes landed about half a year ago).

TL;DR: caching.

cosmichorror.dev

A curious case of climbing compile-times

#prog #article #performancetrap

The 'premature optimization is evil' myth

I have heard the “premature optimization is the root of all evil” statement used by programmers of varying experience at every stage of the software lifecycle, to defend all sorts of choices, ranging from poor architectures, to gratuitous memory allocations, to inappropriate choices of data structures and algorithms, to complete disregard for variable latency in latency-sensitive situations, among others.

Mostly this quip is used to defend sloppy decision-making, or to justify the indefinite deferral of decision-making. In other words, laziness. It is safe to say that the very mention of this oft-misquoted phrase causes an immediate visceral reaction to commence within me… and it’s not a pleasant one.

In this short article, we’ll look at some important principles that are counter to what many people erroneously believe this statement to be saying. To save you time and suspense, I will summarize the main conclusions: I do not advocate contorting oneself in order to achieve a perceived minor performance gain. <...> What I do advocate is thoughtful and intentional performance tradeoffs being made as every line of code is written. Always understand the order of magnitude that matters, why it matters, and where it matters. And measure regularly! <...> Given the choice between two ways of writing a line of code, both with similar readability, writability, and maintainability properties, and yet interestingly different performance profiles, don’t be a bozo: choose the performant approach. Eschew redundant work, and poorly written code. And lastly, avoid gratuitously abstract, generalized, and allocation-heavy code, when slimmer, more precise code will do the trick.

<...>

These kinds of “peanut butter” problems add up in a hard to identify way. Your performance profiler may not obviously point out the effect of such a bad choice so that it’s staring you in your face. Rather than making one routine 1000% slower, you may have made your entire program 3% slower. Make enough of these sorts of decisions, and you will have dug yourself a hole deep enough to take a considerable percentage of the original development time just digging out.

<...>

There aren’t many ways to introduce a multisecond delay into your program at a moment’s notice. But I/O can do just that.

Code with highly variable latency is dangerous, because it can have dramatically different performance characteristics depending on numerous variables, many of which are driven by environmental conditions outside of your program’s control. As such it is immensely important to document where such variable latency can occur, and to program defensively against it happening.

<...>

I can’t tell you how many times I’ve seen programmers employ unsafe pointer arithmetic to avoid the automatic bounds checking generated by the CLR JIT compiler. It is true that in some circumstances this can be a win. But it is also true that most programmers who do this never bothered to crack open the resulting assembly to see that the JIT compiler does a fairly decent job at automatic bounds check hoisting. This is an example where the cost of the optimization outweighs the benefits in most circumstances. The cost to pin memory, the risk of heap corruption due to a failure to properly pin memory or an offset error, and the complication in the code, are all just not worth it. Unless you really have actually measured and found the routine to be a problem.

<...>

I’m not saying Knuth didn’t have a good point. He did. But the “premature optimization is the root of all evil” pop-culture and witty statement is not a license to ignore performance altogether. It’s not justification to be sloppy about writing code.

Joe Duffy - The 'premature optimization is evil' myth

Joe Duffy's Blog | Adventures in the high-tech underbelly

#prog #rust #article #performancetrap

Why my Rust benchmarks were wrong, or how to correctly use std::hint::black_box?

Guillaume Endignoux
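A minimal sketch of the kind of usage the title refers to, assuming a hand-rolled timing loop rather than the article's actual benchmark harness. The function sum_to and the iteration counts are made up for illustration; std::hint::black_box itself is a real standard-library function, documented as a best-effort hint.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical workload: cheap enough that an optimizer could fold it away.
fn sum_to(n: u64) -> u64 {
    (0..n).sum()
}

/// Runs the workload `iters` times and returns the last result with the elapsed time.
fn time_it(iters: u32) -> (u64, Duration) {
    let start = Instant::now();
    let mut last = 0;
    for _ in 0..iters {
        // black_box on the input hides the constant from the optimizer;
        // black_box on the output makes the result count as used.
        last = black_box(sum_to(black_box(1_000)));
    }
    (last, start.elapsed())
}

fn main() {
    let (result, elapsed) = time_it(100_000);
    println!("result = {result}, elapsed = {elapsed:?}");
}
```

Without the two black_box calls the compiler is free to precompute the sum or drop the loop entirely, which is one way a benchmark can end up reporting a near-zero time.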

In a previous blog post, I described some benchmarks I wrote for a program written in Rust. While presenting the results, I mentioned a strange behavior: things that should have been very fast (a few nanoseconds) were reported as instantaneous. I wrote that…

#prog #erlang #article #performancetrap

Elixir RAM and the Template of Doom

Or how implementation details of the Erlang virtual machine leak out as an effect on the performance of string output

www.evanmiller.org

Elixir RAM and the Template of Doom – Evan Miller

#prog #rust #python #performancetrap #article

Rust-Python FFI

Or on some of the problems that can come up in Rust and Python interop, including performance problems and ergonomics problems.

#prog #rust #python #article #suckassstory #performancetrap

Rust std fs slower than Python!? No, it's hardware!

A rare case where a bug was tracked down and confirmed to really be in the hardware. Links to the glibc patches are included.

TL;DR: both versions of the code use an mmap'ed region as the buffer for reading from the file, but in Python this buffer is used at some offset. On some processors, including the one in the author's machine, the rep movsb instruction used in the memcpy implementation is, paradoxically, an order of magnitude slower when working with an aligned buffer.

#prog #rust #performancetrap #article

Identifying Rust's collect::<Vec<_>>() memory leak footgun

TL;DR: Vec::from_iter has several specializations that in some cases allow the allocated memory to be reused when the iterator chain starts with vec::IntoIter, even if the elements of the resulting vector are smaller than the elements of the original vector, and this can produce vectors with a large amount of unused capacity.
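A small sketch of the pattern in question, assuming a made-up helper thin_out and arbitrary sizes. Whether the allocation is actually reused, and how much capacity the result keeps, depends on the version of the standard library, so the program prints the numbers instead of promising them.

```rust
/// Keeps every value divisible by `step` and narrows it to u8. The chain starts
/// with vec::IntoIter, the shape the Vec::from_iter specializations look for.
fn thin_out(v: Vec<u64>, step: u64) -> Vec<u8> {
    v.into_iter()
        .filter(|x| x % step == 0)
        .map(|x| x as u8)
        .collect()
}

/// Same conversion, followed by an explicit request to release spare capacity.
fn thin_out_tight(v: Vec<u64>, step: u64) -> Vec<u8> {
    let mut out = thin_out(v, step);
    out.shrink_to_fit();
    out
}

fn main() {
    let loose = thin_out((0..1_000_000).collect(), 1_000);
    let tight = thin_out_tight((0..1_000_000).collect(), 1_000);
    // The lengths are equal; the capacities may differ a lot if the
    // original allocation was reused for the result.
    println!("loose: len = {}, capacity = {}", loose.len(), loose.capacity());
    println!("tight: len = {}, capacity = {}", tight.len(), tight.capacity());
}
```

Calling shrink_to_fit (or collecting into a boxed slice) is one way to stop long-lived vectors from pinning a large buffer, at the price of a possible reallocation.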

Considerations on Codecrafting

Over the weekend, I was working on a personal Rust project when I ran into an excessive memory usage problem. After an evening of trial and error, I found a workaround to fix the memory usage, but I still didn’t understand how the issue was even possible…

#cpp #article #performancetrap

The performance of unordered containers depends, naturally, on the data structure used to implement them. However, API requirements can substantially restrict the possible implementation choices and, with them, the performance potential.

Perhaps the most telling example in C++ is std::unordered_map. Unlike std::map, this container is not required to store its elements in sorted order. In theory, that should free the hands of container authors. However, the C++ standard also imposes serious reference-stability constraints on unordered_map. Namely, the standard prescribes that map operations invalidate references to keys and elements only when those elements are erased; all other operations, including everything that triggers rehashing, keep references valid. In practice this means that any conforming implementation has to put each stored element into a separate heap-allocated node that is linked to neighbouring elements by pointers. Not only does this cost roughly 2 * sizeof(ptr) of extra memory per element; because an allocator is involved, the memory for elements is handed out in a relatively arbitrary order, which prevents proper use of the CPU cache. Alternative implementations, such as absl::flat_hash_map, can be an order of magnitude faster thanks to data structures that use the CPU cache better, at the price of giving up reference stability, of course.
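A loose Rust analogy for the trade-off described above, not an implementation of either C++ container: a Vec<Box<u64>> stands in for a node-based table, a Vec<u64> for a flat one, and growing the vector stands in for rehashing. All names here are illustrative assumptions.

```rust
/// Address of a stored value, as an integer, so it can be compared later.
fn addr(value: &u64) -> usize {
    value as *const u64 as usize
}

/// Node-based layout: each element lives in its own heap allocation.
/// Growing the table moves the Box pointers, not the nodes they point to.
fn node_address_is_stable(n: u64) -> bool {
    let mut table: Vec<Box<u64>> = Vec::with_capacity(1);
    table.push(Box::new(0));
    let before = addr(&table[0]);
    for i in 1..n {
        table.push(Box::new(i));
    }
    before == addr(&table[0])
}

/// Flat layout: elements are stored inline, so growth may move them.
/// The result is allocator-dependent and is only reported, never relied on.
fn flat_address_is_stable(n: u64) -> bool {
    let mut table: Vec<u64> = Vec::with_capacity(1);
    table.push(0);
    let before = addr(&table[0]);
    for i in 1..n {
        table.push(i);
    }
    before == addr(&table[0])
}

fn main() {
    println!("node-based, address unchanged: {}", node_address_is_stable(100_000));
    println!("flat, address unchanged:       {}", flat_address_is_stable(100_000));
}
```

The boxed variant keeps element addresses fixed no matter how the table grows, which is the guarantee the standard asks for; the flat variant stores elements contiguously and is free to move them.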

But there are subtler traps too. std::multiset, an ordered container that allows an element to occur more than once, has the methods lower_bound, upper_bound, and equal_range. The first two return iterators to the first occurrence of the given element in the ordered sequence and to the position just past its last occurrence, and the third returns a pair of iterators delimiting all elements equal to the given one. C++11 added an unordered counterpart of this container, std::unordered_multiset. For obvious reasons lower_bound and upper_bound make no sense for this container, but equal_range is there, apparently to ease migration from std::multiset. There is a small problem, though: this method returns a pair of ordinary iterators, which implies that all equal elements are stored adjacently in the container. Consequently, insert has to do extra work to find the insertion point, and it gets slower and slower as the number of elements grows. In the article unordered_multiset’s API affects its big-O, the author modified the libc++ implementation of unordered_multiset by removing the equal_range method. Measurements showed that this makes it possible to change the code so that insert becomes substantially faster.
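To make the cost concrete, here is a toy model of a single hash bucket in Rust. It is not libc++'s code: the bucket is a plain Vec instead of a linked chain, and the function names are invented. It only shows that keeping equal values adjacent forces insert to search, while equal_range gets to return one contiguous range.

```rust
use std::ops::Range;

/// Inserts while keeping equal values adjacent, as equal_range requires.
/// Returns how many existing elements were inspected to find the spot.
/// (A linked chain would splice the node in place of Vec::insert's shifting.)
fn insert_grouped(bucket: &mut Vec<u32>, value: u32) -> usize {
    match bucket.iter().position(|&x| x == value) {
        Some(i) => {
            bucket.insert(i, value);
            i + 1
        }
        None => {
            bucket.push(value);
            bucket.len() - 1
        }
    }
}

/// Inserts without any grouping guarantee: nothing needs to be inspected.
fn insert_anywhere(bucket: &mut Vec<u32>, value: u32) -> usize {
    bucket.push(value);
    0
}

/// Thanks to adjacency, all elements equal to `value` form one index range.
fn equal_range(bucket: &[u32], value: u32) -> Range<usize> {
    let start = bucket.iter().position(|&x| x == value).unwrap_or(bucket.len());
    let len = bucket[start..].iter().take_while(|&&x| x == value).count();
    start..start + len
}

fn main() {
    let (mut grouped, mut loose) = (Vec::new(), Vec::new());
    let (mut grouped_steps, mut loose_steps) = (0, 0);
    for value in (0..2_000u32).chain(0..2_000) {
        grouped_steps += insert_grouped(&mut grouped, value);
        loose_steps += insert_anywhere(&mut loose, value);
    }
    println!("inspected: grouped = {grouped_steps}, ungrouped = {loose_steps}");
    println!("range of 7 in grouped bucket: {:?}", equal_range(&grouped, 7));
}
```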

abseil / Abseil Containers

An open-source collection of core C++ library code

#prog #article (#performancetrap?)

Strangely, Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data! (translation)

TL;DR: GPUs multiply matrices filled with identical data faster. The explanation is that multiplying more varied data causes more state switching in the transistors that implement the logic, which raises power consumption and leads to clock throttling.

#prog #article (and #performancetrap, apparently?)

The RAM myth (translation)

Or on optimization tricks that make sharding data in RAM significantly faster than the naive approach (spoiler: they make better use of the cache). The benchmarks show clearly how inadequate the picture of RAM as linear memory with constant-time access is.
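A rough sketch of the general idea, assuming a modulo-based shard function and a simple two-pass split; it is not the article's implementation and makes no claim about the speedup on any particular machine. The naive version writes to all shards at once, while the two-pass version touches only a handful of destinations per pass.

```rust
/// Hypothetical key-to-shard mapping; a real system would likely use a hash.
fn shard_of(key: u32, shards: usize) -> usize {
    (key as usize) % shards
}

/// Naive approach: append every element straight to its shard.
/// With many shards, consecutive writes land in unrelated cache lines.
fn shard_naive(data: &[u32], shards: usize) -> Vec<Vec<u32>> {
    let mut out = vec![Vec::new(); shards];
    for &x in data {
        out[shard_of(x, shards)].push(x);
    }
    out
}

/// Two-pass variant: first split into `fanout` coarse groups of adjacent
/// shards, then split each coarse group into its final shards. Each pass
/// writes to only a small set of destinations at a time.
fn shard_two_pass(data: &[u32], shards: usize, fanout: usize) -> Vec<Vec<u32>> {
    let per_group = (shards + fanout - 1) / fanout; // shards per coarse group
    let mut coarse = vec![Vec::new(); fanout];
    for &x in data {
        coarse[shard_of(x, shards) / per_group].push(x);
    }
    let mut out = vec![Vec::new(); shards];
    for group in &coarse {
        for &x in group {
            out[shard_of(x, shards)].push(x);
        }
    }
    out
}

fn main() {
    // Deterministic pseudo-random keys from a simple LCG (illustrative only).
    let mut state = 12345u32;
    let data: Vec<u32> = (0..100_000)
        .map(|_| {
            state = state.wrapping_mul(1664525).wrapping_add(1013904223);
            state >> 8
        })
        .collect();
    let naive = shard_naive(&data, 1024);
    let two_pass = shard_two_pass(&data, 1024, 32);
    assert_eq!(naive, two_pass);
    println!("largest shard: {:?}", naive.iter().map(Vec::len).max());
}
```

Both functions produce identical shards in identical order, so the second can replace the first; whether it is faster has to be measured on the target hardware and data size.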

purplesyringa's blog

The RAM myth is a belief that modern computer memory resembles perfect random-access memory. Cache is seen as an optimization for small data: if it fits in L2, it’s going to be processed faster; if it doesn’t, there’s nothing we can do.
Most likely, you believe…

#prog #article #performancetrap

Use Fast Data Algorithms

As an engineer who primarily works with data and databases I spend a lot of time moving data around, hashing it, compressing it, decompressing it and generally trying to shovel it between VMs and blob stores over TLS. I am constantly surprised by how many systems only support slow, inefficient, and expensive ways of doing these operations.

In my experience, these poor algorithm choices are orders of magnitude slower than modern alternatives.