Bitcoin Core Github
42 subscribers
126K links
Download Telegram
💬 furszy commented on issue "GUI crashes in regtest":
(https://github.com/bitcoin-core/gui/issues/918#issuecomment-3665898521)
Can you check the directory permissions please
🤔 l0rinc requested changes to a pull request: "Add initial vectorized chacha20 implementation for 2-3x speedup"
(https://github.com/bitcoin/bitcoin/pull/34083#pullrequestreview-3586783659)
I went through the code roughly, I like that we're focusing on this and looking forward to specializing other parts of the code that are this critical (looking at you, SipHash!).

My biggest objection currently is that it's broken on big-endian systems - left a suggestion how to reproduce and fix it. This also reveals the lack of testing - we have to find a way to selectively enable and disable these optimizations to make sure we have tests that compare their outputs.
I agree with AJ that we
...
💬 l0rinc commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2626253726)
can we assume that all big endian systems have `__builtin_bswap32` available.
https://github.com/bitcoin/bitcoin/blob/432b18ca8d0654318a8d882b28b20af2cb2d2e5d/src/compat/byteswap.h#L14-L16 indicates we could use our existing helpers here instead:
```C++
ALWAYS_INLINE void vec_byteswap(vec256& vec)
{
if constexpr (std::endian::native == std::endian::big) {
for (size_t i = 0; i < 8; ++i) {
vec[i] = internal_bswap_32(vec[i]);
}
}
}
```
A small, fixed-
...
💬 l0rinc commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2627477691)
Hmmm, it seems to me we're doing heavy recursive template instantiationshere to unroll loops.
My understanding (and experience with the mentioned Obfuscation PR) is that modern compilers are good at unrolling fixed-bound loops.
Can you please try if this also works and results in the same speedup?

```suggestion
template <size_t BITS, size_t I>
ALWAYS_INLINE void arr_add_xor_rot(std::array<vec256, I>& arr0, const std::array<vec256, I>& arr1, std::array<vec256, I>& arr2)
{
for (size_t
...
💬 l0rinc commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2627310095)
Forcing vectorized operations on an emulated big-endian system (otherwise it falls back to the scalar implementation due to the safety checks):

```patch
diff --git a/src/crypto/chacha20_vec_base.cpp b/src/crypto/chacha20_vec_base.cpp
index 9fda9452a1..a2aa1e5552 100644
--- a/src/crypto/chacha20_vec_base.cpp
+++ b/src/crypto/chacha20_vec_base.cpp
@@ -20,7 +20,7 @@
# define CHACHA20_VEC_DISABLE_STATES_8
# define CHACHA20_VEC_DISABLE_STATES_6
# define CHACHA20_VEC_DISABLE_STATES_4
...
💬 l0rinc commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2626256140)
It seems to me the existing standard test vectors are too short (mostly < 128 bytes) to trigger the vectorized optimization - can we extend the tests to runa and compare the new specializations (could show as `skipped` when the given architecture isn't available locally)?
💬 l0rinc commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2627488702)
Could we use https://github.com/bitcoin/bitcoin/blob/e16c22fe025f82166c7f3f15a37c96bf4a06e4cf/src/attributes.h#L19-L25 instead?
💬 l0rinc commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2627561084)
I haven't used this before but the internets tells me we could the attribute syntax here, something like:
```suggestion
template <typename T, size_t N>
using VectorType [[gnu::vector_size(sizeof(T) * N)]] = T;

using vec256 = VectorType<uint32_t, 8>;
```
💬 l0rinc commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2627508017)
There's a lot of repetition here - could we extract that to an `ALWAYS_INLINE` lambda and have something like:
```C++
if constexpr (ENABLE_16) process_blocks(16);
else if constexpr (ENABLE_8) process_blocks(8);
...
```

I haven't implemented it locally, maybe it's naive, let me know what you think.
💬 l0rinc commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2627510708)
+1 for `if constexpr`, should simplify the code a lot (especially if we migrate from recursive templates as well)
💬 sipa commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2627635541)
I believe it's possible to use compile-time loops here instead of recursive templates (if runtime loops don't get optimized sufficiently):

```c++
/** Store a vector in all array elements */
template <size_t I>
ALWAYS_INLINE void arr_set_vec256(std::array<vec256, I>& arr, const vec256& vec)
{
[&]<size_t... ITER>(std::index_sequence<ITER...>) {
((std::get<ITER>(arr) = vec),...);
}(std::make_index_sequence<I>());
}
```
📝 fanquake opened a pull request: "[30.x] Finalise v30.1"
(https://github.com/bitcoin/bitcoin/pull/34092)
Finalise `v30.1`.
💬 fanquake commented on pull request "log: Use `__func__` for -logsourcelocations":
(https://github.com/bitcoin/bitcoin/pull/34088#issuecomment-3666065446)
https://github.com/bitcoin/bitcoin/actions/runs/20307712389/job/58333931476?pr=34088#step:11:3126:
```bash
/home/admin/actions-runner/_work/_temp/src/test/logging_tests.cpp:135:120: error: no member named 'function_name' in 'SourceLocation'
135 | expected.push_back(tfm::format("[%s:%s] [%s] %s%s", util::RemovePrefix(loc.file_name(), "./"), loc.line(), loc.function_name(), prefix, msg));
|
...
💬 sipa commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2627673584)
Alternatively, with a compile-time loop:

```c++
/** Perform add/xor/rotate for the round function */
template <size_t BITS, size_t I>
ALWAYS_INLINE void arr_add_xor_rot(std::array<vec256, I>& arr0, const std::array<vec256, I>& arr1, std::array<vec256, I>& arr2)
{
[&]<size_t... ITER>(std::index_sequence<ITER...>) {
((
arr0[ITER] += arr1[ITER],
arr2[ITER] ^= arr0[ITER],
vec_rotl<BITS>(arr2[ITER])
), ...);
}(std::make_index_sequence<I>())
...
💬 theuni commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2627688908)
See branch here: https://github.com/theuni/bitcoin/commits/chacha20-vectorized/

I decided to exclude the impls which require runtime detection from this PR, as I think how that should be done is a separate conversation.

To answer your question more specifically: some impls may require different compilation flags (`-mavx2`/`-mavx512vl`), which have to be in their own compilation units.
💬 0xB10C commented on pull request "log: Use `__func__` for -logsourcelocations":
(https://github.com/bitcoin/bitcoin/pull/34088#issuecomment-3666122528)
Concept ACK! I've turned off `-logsourcelocations` since it wasn't practical to use anymore for me after #32604. This would allow me to turn it back on.
💬 theuni commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2627732760)
` if constexpr` certainly makes more sense where possible. I'll have a look, thanks!
💬 stickies-v commented on pull request "log: Use `__func__` for -logsourcelocations":
(https://github.com/bitcoin/bitcoin/pull/34088#discussion_r2627752316)
> This seems like it could silently make some places use the wrong function name.

Not with `function_name_short()`, though? I think `SourceLocation::function_name()` behaving identical to `std::source_location::function_name()` is desirable, letting users explicitly choose between the stdlib behaviour or our custom behaviour?
👍 hebasto approved a pull request: "[30.x] Finalise v30.1"
(https://github.com/bitcoin/bitcoin/pull/34092#pullrequestreview-3588583767)
ACK 2a21824b1190968ee8ba3120400c00e56be10591, I have reviewed the code and it looks OK.
💬 stickies-v commented on pull request "log: Use `__func__` for -logsourcelocations":
(https://github.com/bitcoin/bitcoin/pull/34088#discussion_r2627773419)
Eh, good point. It compiles for me but unsure how it'll behave across platforms.
👋 fanquake's pull request is ready for review: "[30.x] Finalise v30.1"
(https://github.com/bitcoin/bitcoin/pull/34092)