Bitcoin Core Github
42 subscribers
126K links
Download Telegram
💬 l0rinc commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2627310095)
Forcing vectorized operations on an emulated big-endian system (otherwise it falls back to the scalar implementation due to the safety checks):

```patch
diff --git a/src/crypto/chacha20_vec_base.cpp b/src/crypto/chacha20_vec_base.cpp
index 9fda9452a1..a2aa1e5552 100644
--- a/src/crypto/chacha20_vec_base.cpp
+++ b/src/crypto/chacha20_vec_base.cpp
@@ -20,7 +20,7 @@
# define CHACHA20_VEC_DISABLE_STATES_8
# define CHACHA20_VEC_DISABLE_STATES_6
# define CHACHA20_VEC_DISABLE_STATES_4
...
💬 l0rinc commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2626256140)
It seems to me the existing standard test vectors are too short (mostly < 128 bytes) to trigger the vectorized optimization - can we extend the tests to runa and compare the new specializations (could show as `skipped` when the given architecture isn't available locally)?
💬 l0rinc commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2627488702)
Could we use https://github.com/bitcoin/bitcoin/blob/e16c22fe025f82166c7f3f15a37c96bf4a06e4cf/src/attributes.h#L19-L25 instead?
💬 l0rinc commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2627561084)
I haven't used this before but the internets tells me we could the attribute syntax here, something like:
```suggestion
template <typename T, size_t N>
using VectorType [[gnu::vector_size(sizeof(T) * N)]] = T;

using vec256 = VectorType<uint32_t, 8>;
```
💬 l0rinc commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2627508017)
There's a lot of repetition here - could we extract that to an `ALWAYS_INLINE` lambda and have something like:
```C++
if constexpr (ENABLE_16) process_blocks(16);
else if constexpr (ENABLE_8) process_blocks(8);
...
```

I haven't implemented it locally, maybe it's naive, let me know what you think.
💬 l0rinc commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2627510708)
+1 for `if constexpr`, should simplify the code a lot (especially if we migrate from recursive templates as well)
💬 sipa commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2627635541)
I believe it's possible to use compile-time loops here instead of recursive templates (if runtime loops don't get optimized sufficiently):

```c++
/** Store a vector in all array elements */
template <size_t I>
ALWAYS_INLINE void arr_set_vec256(std::array<vec256, I>& arr, const vec256& vec)
{
[&]<size_t... ITER>(std::index_sequence<ITER...>) {
((std::get<ITER>(arr) = vec),...);
}(std::make_index_sequence<I>());
}
```
📝 fanquake opened a pull request: "[30.x] Finalise v30.1"
(https://github.com/bitcoin/bitcoin/pull/34092)
Finalise `v30.1`.
💬 fanquake commented on pull request "log: Use `__func__` for -logsourcelocations":
(https://github.com/bitcoin/bitcoin/pull/34088#issuecomment-3666065446)
https://github.com/bitcoin/bitcoin/actions/runs/20307712389/job/58333931476?pr=34088#step:11:3126:
```bash
/home/admin/actions-runner/_work/_temp/src/test/logging_tests.cpp:135:120: error: no member named 'function_name' in 'SourceLocation'
135 | expected.push_back(tfm::format("[%s:%s] [%s] %s%s", util::RemovePrefix(loc.file_name(), "./"), loc.line(), loc.function_name(), prefix, msg));
|
...
💬 sipa commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2627673584)
Alternatively, with a compile-time loop:

```c++
/** Perform add/xor/rotate for the round function */
template <size_t BITS, size_t I>
ALWAYS_INLINE void arr_add_xor_rot(std::array<vec256, I>& arr0, const std::array<vec256, I>& arr1, std::array<vec256, I>& arr2)
{
[&]<size_t... ITER>(std::index_sequence<ITER...>) {
((
arr0[ITER] += arr1[ITER],
arr2[ITER] ^= arr0[ITER],
vec_rotl<BITS>(arr2[ITER])
), ...);
}(std::make_index_sequence<I>())
...
💬 theuni commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2627688908)
See branch here: https://github.com/theuni/bitcoin/commits/chacha20-vectorized/

I decided to exclude the impls which require runtime detection from this PR, as I think how that should be done is a separate conversation.

To answer your question more specifically: some impls may require different compilation flags (`-mavx2`/`-mavx512vl`), which have to be in their own compilation units.
💬 0xB10C commented on pull request "log: Use `__func__` for -logsourcelocations":
(https://github.com/bitcoin/bitcoin/pull/34088#issuecomment-3666122528)
Concept ACK! I've turned off `-logsourcelocations` since it wasn't practical to use anymore for me after #32604. This would allow me to turn it back on.
💬 theuni commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2627732760)
` if constexpr` certainly makes more sense where possible. I'll have a look, thanks!
💬 stickies-v commented on pull request "log: Use `__func__` for -logsourcelocations":
(https://github.com/bitcoin/bitcoin/pull/34088#discussion_r2627752316)
> This seems like it could silently make some places use the wrong function name.

Not with `function_name_short()`, though? I think `SourceLocation::function_name()` behaving identical to `std::source_location::function_name()` is desirable, letting users explicitly choose between the stdlib behaviour or our custom behaviour?
👍 hebasto approved a pull request: "[30.x] Finalise v30.1"
(https://github.com/bitcoin/bitcoin/pull/34092#pullrequestreview-3588583767)
ACK 2a21824b1190968ee8ba3120400c00e56be10591, I have reviewed the code and it looks OK.
💬 stickies-v commented on pull request "log: Use `__func__` for -logsourcelocations":
(https://github.com/bitcoin/bitcoin/pull/34088#discussion_r2627773419)
Eh, good point. It compiles for me but unsure how it'll behave across platforms.
👋 fanquake's pull request is ready for review: "[30.x] Finalise v30.1"
(https://github.com/bitcoin/bitcoin/pull/34092)
💬 theuni commented on pull request "Add initial vectorized chacha20 implementation for 2-3x speedup":
(https://github.com/bitcoin/bitcoin/pull/34083#discussion_r2627863562)
Because they're compiler-specific, the code makes no assumptions about the size/structure/alignment of vec256. The only accesses are via `operator[]`. So afaik, there's no need to check for this.
💬 maflcko commented on pull request "move-only: MAX_BLOCK_TIME_GAP to src/qt":
(https://github.com/bitcoin-core/gui/pull/919#issuecomment-3666304646)
Looks like the valgrind fuzz times out, but this seems unrelated. Probably needs the timeout bumped.
🤔 ismaelsadeeq reviewed a pull request: "kernel: Separate UTXO set access from validation functions"
(https://github.com/bitcoin/bitcoin/pull/32317#pullrequestreview-3587725625)
Concept ACK

ACK first four commits, I think the remaining commits can even be made better.
comments are inline
💬 ismaelsadeeq commented on pull request "kernel: Separate UTXO set access from validation functions":
(https://github.com/bitcoin/bitcoin/pull/32317#discussion_r2627058813)
In "consensus: Use Coin span in GetP2SHSigOpCount" 5709d073e06280c915e7d74e8f49e4904598c348

nit: here and other places, these new variables should be snake case?