Daniel Lemire's blog
In C, how do you know if the dynamic allocation succeeded?

In the C programming language, we allocate memory dynamically (on the heap) using the malloc function. You pass malloc a size parameter corresponding to the number of bytes you need. The function returns either a pointer to the allocated memory or the NULL pointer if the memory could not be allocated. Or so you may think. Let us write a program that allocates 1 terabyte of memory and then tries to write to this newly allocated memory:

#include <stdio.h>
#include <stdlib.h>

int main() {
  size_t large = 1099511627776;
  char *buffer = (char *)malloc(large);
  if (buffer == NULL) {
    printf("error!\n");
    return EXIT_FAILURE;
  }
  printf("Memory allocated\n");
  for (size_t i = 0; i < large; i += 4096) {
    buffer[i] = 0;
  }
  free(buffer);
  return EXIT_SUCCESS;
}

After compiling and running this program, you would expect to either get the message “error!”, in which case the program terminates immediately… or else you might expect to see “Memory allocated” (if 1 terabyte of memory is available), in which case the program should terminate successfully. Under both macOS/clang and Linux/GCC, I find that the program…

https://lemire.me/blog/2021/10/27/in-c-how-do-you-know-if-the-dynamic-allocation-succeeded/
“Hello world” is slower in C++ than in C (Linux)

A simple C program might print ‘hello world’ on screen:

#include <stdio.h>
#include <stdlib.h>

int main() {
  printf("hello world\n");
  return EXIT_SUCCESS;
}

You can write the equivalent in C++:

#include <iostream>

int main() { std::cout…
Catching sanitizer errors programmatically

The C and C++ languages offer little protection against programmer errors. Errors do not always show up where you expect. You can silently corrupt the content of your memory, which can make bugs difficult to track. To solve this problem, I am a big fan of programming in C and C++ using sanitizers. They slow your program down, but they check that memory accesses are safe, for example. Thus if you iterate over an array and access elements that are out of bounds, a sanitizer will immediately catch the error:

int array[8];
for (int k = 0;; k++) {
  array[k] = 0;
}

The sanitizer reports the error, but what if you would like to catch the error and store it in some log? Thankfully, GCC and LLVM sanitizers call a function (__asan_on_error()) when an error is encountered, allowing us to log it. Of course, you need to record the state of your program. The following is an example where the state is recorded in a string.

#include <string>

std::string message;

extern "C" {
void __asan_on_error() {…

https://lemire.me/blog/2022/08/20/catching-sanitizer-errors-programmatically/
Optimizing compilers reload vector constants needlessly

Modern processors have powerful vector instructions which allow you to load several values at once, and operate (in one instruction) on all these values. Similarly, they allow you to have vector constants. Thus if you wanted to add some integer (say 10001) to all integers in a large array, you might first load a constant with 8 times the value 10001, then you would load elements from your array, 8 elements at a time, add the vector constant (thus doing 8 additions at once), and then store the result. Everything else being equal, this might be 8 times faster. An optimizing compiler might even do this optimization for you (a process called ‘auto-vectorization’). However, for more complex code, you might need to do it manually using “intrinsic” functions (e.g., _mm256_loadu_si256, _mm256_add_epi32, etc.). Let us consider the simple case I describe, but where we process two arrays at once… using the same constant:

#include <stdint.h>
#include <x86intrin.h>

void process_avx2(const uint32_t *in1, const uint32_t *in2, size_t len) {
  // define the constant, 8 x 10001
  __m256i c = _mm256_set1_epi32(10001);
  const uint32_t *finalin1…

https://lemire.me/blog/2022/12/06/optimizing-compilers-reload-vector-constants-needlessly/
Counting cycles and instructions on ARM-based Apple systems

In my blog post Counting cycles and instructions on the Apple M1 processor, I showed how we could have access to “performance counters” to count how many cycles and instructions a given piece of code took on ARM-based mac systems. At the time, we only had access to one Apple processor, the remarkable M1. Shortly after, Apple came out with other ARM-based processors, and my current laptop runs on the M2 processor. Sadly, my original code only works for the M1 processor. Thanks to the reverse engineering work of ibireme, a software engineer, we can generalize the approach. We have further extended my original code so that it works under both Linux and on ARM-based macs. The code has benefited from contributions from Wojciech Muła and John Keiser. For the most part, you set up a global event_collector instance, and then you surround the code you want to benchmark with collector.start() and collector.end(), pushing the results into an event_aggregate:

#include "performancecounters/event_counter.h"

event_collector collector;

void f() {
  event_aggregate aggregate{};
  for (size_t i = 0; i < repeat; i++) {
    collector.start();
    function(); //…

https://lemire.me/blog/2023/03/21/counting-cycles-and-instructions-on-arm-based-apple-systems/
Under Linux, libSegFault and addr2line are underrated

Many modern programming languages like Java produce useful error messages when they fail. Some indicate where in the program (e.g., down to the line of source code) the failure occurred. Slowly, the C++ language is moving in this direction. A particularly nasty problem in C and C++ is segmentation faults: they occur when you are trying to access a memory region that is not allocated. For example, this will occur if you are dereferencing a null pointer. You can mostly avoid segmentation faults in C++ by avoiding pointers, but there are instances when pointers are useful or necessary. And even if you are not causing them yourself, the software that you are using might be triggering them. Thankfully, you can often get a useful message when you have a segmentation fault by linking with the libSegFault library under Linux. Let me illustrate. The following program has a segmentation fault on line six:

#include <stdio.h>
#include <stdlib.h>

int go() {
  int x = 0;
  x += *((int *)NULL);
  return x;
}

int main() {…


https://lemire.me/blog/2023/05/01/under-linux-libsegfault-and-addr2line-are-underrated/
Parsing time stamps faster with SIMD instructions

In software, it is common to represent time as a time-stamp string. It is usually specified by a time format string. Some standards use the format %Y%m%d%H%M%S meaning that we print the year, the month, the day, the hours, the minutes and the seconds. The current time as I write this blog post would be 20230701205436 as a time stamp in this format. It is convenient because it is short, easy to read and if you sort the strings lexicographically, you also sort them chronologically. You can generate time stamps using any programming language. In C, the following program will print the current time (universal, not local time):
#include <time.h>
#include <stdio.h>

int main() {
  char buffer[15];
  struct tm timeinfo;
  time_t rawtime;
  time(&rawtime);
  gmtime_r(&rawtime, &timeinfo);
  size_t len = strftime(buffer, 15, "%Y%m%d%H%M%S", &timeinfo);
  buffer[14] = '\0';
  puts(buffer);
}
We are interested in the problem of parsing these strings. In practice, this means that we want to convert them to an integer representing the number of seconds since the Unix epoch. The Unix epoch is January 1st…

https://lemire.me/blog/2023/07/01/parsing-time-stamps-faster-with-simd-instructions/
Packing a string of digits into an integer quickly

Suppose that I give you a short string of digits, containing possibly spaces or other characters (e.g., "20141103 012910"). We would like to pack the digits into an integer (e.g., 0x20141103012910) so that the lexicographical order over the string matches the ordering of the integers. We can use the fact that in ASCII, the digits have byte values 0x30, 0x31 and so forth. Thus as a sequence of bytes, the string "20141103 012910" is actually 0x32, 0x30, 0x31… so we must select the least significant 4 bits of each byte, discarding the rest. Intel and AMD processors have a convenient instruction which allows us to select any bits from a word to construct a new word (pext). A problem remains: Intel and AMD processors are little endian, which means that if I load the string in memory, the first byte becomes the least significant, not the most significant. Thankfully, there is an instruction to reverse the order of the bytes. In C, the desired function using Intel intrinsics might look like this:
#include <x86intrin.h> // Windows: <intrin.h> #include


https://lemire.me/blog/2023/07/07/packing-a-string-of-digits-into-an-integer-quickly/
C23: a slightly better C

One of the most established and popular programming languages is the C programming language. It is relatively easy to learn, and highly practical. Maybe surprisingly, the C programming language keeps evolving, slowly and carefully. If you have GCC 13 or LLVM (Clang) 16, you already have a compiler with some of the features from the latest standard (C23).
// Only include stdio.h if it exists
#if __has_include (<stdio.h>)
#include <stdio.h>
#endif
#include <stdlib.h>

[[deprecated]] void f() {}

[[nodiscard]] int g(int x) { return x + 1; }

int main() {
  f(); // compile-time warning: 'f' is deprecated
  g(1); // compile-time warning: ignoring
        // return value of function declared with 'nodiscard' attribute
  auto x = 0b1111;
  typeof(x) y = 1'000'000; // type of y is the same as x
  printf("%d\n", x); // prints 15
  printf("%d\n", y); // prints 1000000
  constexpr int N = 10;
  // compile-time asserts using static_assert
  static_assert (N == 10, "N must be 10");
  bool a[N]; // array of N booleans
  for (int i = 0; i < N; ++i) { a[i] = true; }
  printf("%d\n",…


https://lemire.me/blog/2024/01/21/c23-a-slightly-better-c/
C++ web app with Crow: early scalability results

Last year, I looked at writing small “hello world” web applications in various programming languages (Go, JavaScript, Nim…). Go, using nothing but the standard library, did well. In these benchmarks, I am just programming an http route that returns a small string (e.g., ‘hello world’). The query is from the host itself. The intent behind such a benchmark is to measure how well a web application might scale in the best of cases. I call such a benchmark ‘simplistic’ because nobody ever returns just a short string and you do not usually query the server from the host. At the time, I had wanted to compare with a C++ library, and I ended up trying the lithium framework, which scaled very well. Jake Arkinstall pointed out that he uses Crow to build web applications in C++. So I decided to take Crow for a spin. My simplistic application has only a few lines:
#include "crow.h"

int main() {
  crow::SimpleApp app;
  app.loglevel(crow::LogLevel::Warning);
  CROW_ROUTE(app, "/simple")([]() { return "Hello world"; });
  app.port(18080).multithreaded().run();
}
To build it, I…

https://lemire.me/blog/2024/04/06/c-web-app-with-crow-early-scalability-results/