• Re: A very slow program

    From Vir Campestris@[email protected] to alt.comp.lang.c,comp.lang.c on Wed Oct 2 11:16:37 2024
    From Newsgroup: comp.lang.c

    On 16/09/2024 06:22, Paul wrote:
    Good prime code, comes with its own timing routines 🙂

    http://cr.yp.to/primegen.html

    http://cr.yp.to/primegen/primegen-0.97.tar.gz

    Paul

    I pricked my ears up when I saw this - I've been spending a quite
    ridiculous amount of time tuning prime code.

    Happily for my sanity the program I published over on comp.lang.c++(1)
    is quite a bit faster than that one. On the other hand, I'm tuning for
    modern processors, and the comment in there say inter alia "It generates
    the 50847534 primes up to 1000000000 in just 8 seconds on a Pentium II-350"

    I don't remember the last time I saw one of those.

    <https://en.wikipedia.org/wiki/List_of_Intel_Pentium_II_processors>

    tells me it was release in 1998...

    Andy
    --

    (1) See thread "Sieve of Eratosthenes optimized to the max" and look for
    a post from me which is quite big
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul@[email protected] to alt.comp.lang.c,comp.lang.c on Wed Oct 2 06:37:46 2024
    From Newsgroup: comp.lang.c

    On Wed, 10/2/2024 6:16 AM, Vir Campestris wrote:
    On 16/09/2024 06:22, Paul wrote:
    Good prime code, comes with its own timing routines 🙂

    http://cr.yp.to/primegen.html

        http://cr.yp.to/primegen/primegen-0.97.tar.gz

       Paul

    I pricked my ears up when I saw this - I've been spending a quite ridiculous amount of time tuning prime code.

    Happily for my sanity the program I published over on comp.lang.c++(1) is quite a bit faster than that one. On the other hand, I'm tuning for modern processors, and the comment in there say inter alia "It generates the 50847534 primes up to 1000000000 in just 8 seconds on a Pentium II-350"

    I don't remember the last time I saw one of those.

    <https://en.wikipedia.org/wiki/List_of_Intel_Pentium_II_processors>

    tells me it was release in 1998...

    Andy

    I started on PCs, with a Celery 300 for home use.
    That was the one that didn't have cache.

    And the generation before I got a machine, there
    were cache DIMMs for adding a cache function, and
    the cache DIMM in some cases, only covered half
    the address space, instead of covering the whole
    thing. The things you have to put up with.

    Paul

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Bonita Montero@[email protected] to alt.comp.lang.c,comp.lang.c on Wed Oct 2 14:44:04 2024
    From Newsgroup: comp.lang.c

    Ive developed a Sieve of Erastosthenes in C++20 with an unbeaten
    performance. With this code I get all primes <= 2 ^ 32 in 140ms
    on my AMD 7950X 16-core Zen4-system.
    It calculates first only the primes up to the square root of the
    maximum. The remaining bits are partitioned according to the num-
    ber of CPU-cores. Within each thread the bitmap is partitioned in
    parts that fit into the L2-cache (queried via CPUID).
    All operations are aligned on a cacheline-boundary to avoid false
    sharing and locking of shared words.

    #include <iostream>
    #include <cstdlib>
    #include <charconv>
    #include <cstring>
    #include <vector>
    #include <algorithm>
    #include <cmath>
    #include <bit>
    #include <fstream>
    #include <string_view>
    #include <thread>
    #include <utility>
    #include <new>
    #include <span>
    #include <array>
    #include <cassert>
    #include <sstream>
    #if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    #include <intrin.h>
    #elif (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
    #include <cpuid.h>
    #endif

    #if defined(_MSC_VER)
    #pragma warning(disable: 26495) // uninitialized member
    #endif

    using namespace std;

    #if defined(__cpp_lib_hardware_interference_size)
    constexpr ptrdiff_t CL_SIZE = hardware_destructive_interference_size;
    #else
    constexpr ptrdiff_t CL_SIZE = 64;
    #endif

    size_t getL2Size();
    bool smt();

    int main( int argc, char **argv )
    {
    try
    {
    if( argc < 2 )
    return EXIT_FAILURE;
    auto parse = []( char const *str, auto &value, bool bytes )
    {
    bool hex = str[0] == '0' && tolower( str[1] ) == 'x';
    str += 2 * hex;
    auto [ptr, ec] = from_chars( str, str + strlen( str ), value, !hex ?
    10 : 16 );
    if( ec != errc() )
    return false;
    if( bytes && tolower( *ptr ) == 'x' && !ptr[1] )
    {
    value *= 8;
    --value;
    ++ptr;
    }
    return !*ptr;
    };
    size_t end;
    if( !parse( argv[1], end, true ) )
    return EXIT_FAILURE;
    if( end < 2 )
    return EXIT_FAILURE;
    if( (ptrdiff_t)end++ < 0 )
    throw bad_alloc();
    unsigned
    hc = jthread::hardware_concurrency(),
    nThreads;
    if( argc < 4 || !parse( argv[3], nThreads, false ) )
    nThreads = hc;
    if( !nThreads )
    return EXIT_FAILURE;
    using word_t = size_t;
    constexpr size_t
    BITS_PER_CL = CL_SIZE * 8,
    BITS = sizeof(word_t) * 8;
    size_t nBits = end / 2;
    union alignas(CL_SIZE) ndi_words_cl { word_t words[CL_SIZE / sizeof(word_t)]; ndi_words_cl() {} };
    vector<ndi_words_cl> sieveCachelines( (nBits + BITS_PER_CL - 1) /
    BITS_PER_CL );
    span<word_t> sieve( &sieveCachelines[0].words[0], (nBits + BITS - 1) /
    BITS );
    assert(to_address( sieve.end() ) <= &to_address( sieveCachelines.end()
    words[0]);
    fill( sieve.begin(), sieve.end(), -1 );
    auto punch = [&]( size_t start, size_t end, size_t prime, auto )
    {
    size_t
    bit = start / 2,
    bitEnd = end / 2;
    if( bit >= bitEnd ) [[unlikely]]
    return;
    if( prime >= BITS ) [[likely]]
    do [[likely]]
    sieve[bit / BITS] &= rotl( (word_t)-2, bit % BITS );
    while( (bit += prime) < bitEnd );
    else
    {
    auto pSieve = sieve.begin() + bit / BITS;
    do [[likely]]
    {
    word_t
    word = *pSieve,
    mask = rotl( (word_t)-2, bit % BITS ),
    oldMask;
    do
    {
    word &= mask;
    oldMask = mask;
    mask = rotl( mask, prime % BITS );
    bit += prime;
    } while( mask < oldMask );
    *pSieve++ = word;
    } while( bit < bitEnd );
    }
    };
    size_t sqrt = (ptrdiff_t)ceil( ::sqrt( (ptrdiff_t)end ) );
    [&]
    {
    for( size_t bSqrt = sqrt / 2, bit = 1; bit < bSqrt; ++bit ) [[likely]]
    {
    auto pSieve = sieve.begin () + bit / BITS;
    for( ; ; )
    {
    if( word_t w = *pSieve >> bit % BITS; w ) [[likely]]
    {
    bit += countr_zero ( w );
    break;
    }
    ++pSieve;
    bit = bit + BITS & -(ptrdiff_t)BITS;
    if( bit >= bSqrt ) [[unlikely]]
    return;
    }
    size_t prime = 2 * bit + 1;
    punch( prime * prime, sqrt, prime, false_type () );
    }
    }();
    auto scan = [&]( size_t start, size_t end, auto fn )
    {
    size_t
    bit = start / 2,
    bitEnd = end / 2;
    if( bit >= bitEnd ) [[unlikely]]
    return;
    auto pSieve = sieve.begin() + bit / BITS;
    word_t word = *pSieve++ >> bit % BITS;
    for( ; ; )
    {
    size_t nextWordBit = bit + BITS & -(ptrdiff_t)BITS;
    for( unsigned gap; word; word >>= gap, word >>= 1, ++bit )
    {
    gap = countr_zero( word );
    bit += gap;
    if( bit >= bitEnd ) [[unlikely]]
    return;
    fn( bit * 2 + 1, true_type() );
    }
    bit = nextWordBit;
    if( bit >= bitEnd ) [[unlikely]]
    return;
    word = *pSieve++;
    }
    };
    vector<jthread> threads;
    threads.reserve( nThreads );
    static auto dbl = []( ptrdiff_t x ) { return (double)x; };
    auto initiate = [&]( size_t start, auto fn )
    {
    double threadRange = dbl( end - start ) / nThreads;
    ptrdiff_t t = nThreads;
    size_t trEnd = end;
    size_t prevStart, prevEnd;
    bool prevValid = false;
    do [[likely]]
    {
    size_t trStart = start + (ptrdiff_t)(--t * threadRange + 0.5);
    trStart &= -(2 * CL_SIZE * 8);
    trStart = trStart >= start ? trStart : start;
    if( trStart < trEnd ) [[likely]]
    {
    if( prevValid ) [[likely]]
    threads.emplace_back( fn, prevStart, prevEnd );
    prevStart = trStart;
    prevEnd = trEnd;
    prevValid = true;
    }
    trEnd = trStart;
    } while( t );
    if( prevValid ) [[likely]]
    fn( prevStart, prevEnd );
    threads.resize( 0 );
    };
    double maxCacheRange = dbl( getL2Size() * 8 * 2 ) / 3 * (smt() &&
    nThreads > hc / 2 ? 1 : 2);
    initiate( sqrt,
    [&]( size_t rStart, size_t rEnd )
    {
    double
    thisThreadRange = dbl( rEnd - rStart ),
    nCachePartitions = ceil( thisThreadRange / maxCacheRange ),
    cacheRange = thisThreadRange / nCachePartitions;
    for( size_t p = (ptrdiff_t)nCachePartitions, cacheEnd = rEnd,
    cacheStart; p--; cacheEnd = cacheStart ) [[likely]]
    {
    cacheStart = rStart + (ptrdiff_t)((double)(ptrdiff_t)p * cacheRange
    + 0.5);
    cacheStart &= -(2 * CL_SIZE * 8);
    cacheStart = cacheStart >= sqrt ? cacheStart : sqrt;
    scan( 3, sqrt,
    [&]( size_t prime, auto )
    {
    size_t start = (cacheStart + prime - 1) / prime * prime;
    start += start % 2 ? 0 : prime;
    punch( start, cacheEnd, prime, true_type() );
    } );
    }
    } );
    if( bool count = false; argc >= 3 && (!*argv[2] || (count = strcmp(
    argv[2], "*" ) == 0)) )
    {
    if( !count )
    return EXIT_SUCCESS;
    atomic<size_t> aNPrimes( 1 );
    initiate( 3,
    [&]( size_t rStart, size_t rEnd )
    {
    size_t nPrimes = 0;
    scan( rStart, rEnd, [&]( ... ) { ++nPrimes; } );
    aNPrimes.fetch_add( nPrimes, memory_order::relaxed );
    } );
    cout << aNPrimes.load( memory_order::relaxed ) << " primes" << endl;
    return EXIT_SUCCESS;
    }
    ofstream ofs;
    ofs.exceptions( ofstream::failbit | ofstream::badbit );
    ofs.open( argc >= 3 ? argv[2] : "primes.txt", ofstream::binary |
    ofstream::trunc );
    constexpr size_t
    BUF_SIZE = 0x100000,
    AHEAD = 32;
    union ndi_char { char c; ndi_char() {} };
    vector<ndi_char> rawPrintBuf( BUF_SIZE + AHEAD - 1 );
    span printBuf( &to_address( rawPrintBuf.begin() )->c, &to_address(
    rawPrintBuf.end() )->c );
    auto
    bufBegin = printBuf.begin(),
    bufEnd = bufBegin,
    bufFlush = bufBegin + BUF_SIZE;
    auto print = [&]() { ofs << string_view( bufBegin, bufEnd ); };
    auto printPrime = [&]( size_t prime, auto )
    {
    auto [ptr, ec] = to_chars( &*bufEnd, &bufEnd[AHEAD - 1], prime );
    if( (bool)ec ) [[unlikely]]
    throw system_error( (int)ec, generic_category(), "converson failed" );
    bufEnd = ptr - &*bufBegin + bufBegin; // pointer to iterator - NOP
    *bufEnd++ = '\n';
    if( bufEnd >= bufFlush ) [[unlikely]]
    print(), bufEnd = bufBegin;
    };
    printPrime( 2, false_type() );
    scan( 3, end, printPrime );
    print();
    }
    catch( exception const &exc )
    {
    cout << exc.what() << endl;
    }
    }

    #if (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__)) || defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    array<unsigned, 4> cpuId( unsigned code )
    {
    int regs[4];
    #if defined(_MSC_VER)
    __cpuid( regs, code );
    #elif defined(__GNUC__) || defined(__clang__)
    __cpuid(code, regs[0], regs[1], regs[2], regs[3]);
    #endif
    using u = unsigned;
    return array<u, 4> { (u)regs[0], (u)regs[1], (u)regs[2], (u)regs[3] }; }

    bool smt()
    {
    if( cpuId( 0 )[0] < 1 )
    return false;
    return cpuId( 1 )[3] >> 28 & 1;
    }

    size_t getL2Size()
    {
    if( cpuId( 0x80000000u )[0] < 0x80000006u )
    return 512ull * 1024;
    return (cpuId( 0x80000006u )[2] >> 16) * 1024;
    }
    #else
    size_t getL2Size()
    {
    return 512ull * 1024;
    }

    bool smt()
    {
    return false;
    }
    #endif
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Janis Papanagnou@[email protected] to comp.lang.c on Wed Oct 2 16:41:32 2024
    From Newsgroup: comp.lang.c

    On 16.09.2024 00:45, Student Project wrote:
    [ snip C program ]
    /*
    The result on my slow machine:
    Program takes an average of 0.018000 seconds to find 78498 primes.
    */

    Note, this is the same runtime magnitude that Unix'es 'primes'
    command needs to run on 1000000 numbers to create 78498 primes
    (0m00.01s) [on my very old (and slow) Unix system].

    Is there a point in still writing own programs for that task?
    I see some point when examining performance optimizations (as
    someone downthread seems to have done). For cryptography there's
    certainly demands for yet faster (parallelized) algorithms - if
    quantum system algorithms don't make it superfluous (lately, or
    in near future). But for cryptography you'd anyway need larger
    numeric domains, beyond '[long]* integer' arithmetics. (I don't
    know whether the [downthread posted] C++ code was designed to
    support arbitrary length arithmetics.)

    Optimized algorithms of new methods alone might not be a C topic.
    But given your posting name, "Student Project", I suppose you're
    anyway just using it as example for learning the C language?

    For ordinary users it's probably sufficient to use an existing
    program; on Unix 'primes 1 1000000000' runs in about 8 seconds.
    And it's checking 2^32 numbers in 50 seconds (including 'wc'
    and creating output [suppressed by redirection]), generating
    50847534 primes. A simple prime test of some "arbitrary large"
    number runs in no time, as does factorization of numbers with
    the Unix'es 'factor' program. Just a simple command line call.
    That existing 'primes' program also has some limit; the man page
    documents a value of 4294967295, but no error message is created
    for values up to around 1.8446744*10^19 on my system.

    Janis
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Keith Thompson@[email protected] to comp.lang.c on Wed Oct 2 15:43:45 2024
    From Newsgroup: comp.lang.c

    Janis Papanagnou <[email protected]> writes:
    [...]
    For ordinary users it's probably sufficient to use an existing
    program; on Unix 'primes 1 1000000000' runs in about 8 seconds.
    And it's checking 2^32 numbers in 50 seconds (including 'wc'
    and creating output [suppressed by redirection]), generating
    50847534 primes. A simple prime test of some "arbitrary large"
    number runs in no time, as does factorization of numbers with
    the Unix'es 'factor' program. Just a simple command line call.
    That existing 'primes' program also has some limit; the man page
    documents a value of 4294967295, but no error message is created
    for values up to around 1.8446744*10^19 on my system.

    On my system (Ubuntu 22.04.5 LTS, 64 bits), the primes(1) man page says:

    "The stop value must not be greater than 3825123056546413050."

    4294967295 is of course 2**31-1. I don't know the significance of 3825123056546413050. The change was made by Debian, but the sources
    don't explicitly give a reason. Anyone who wants to look into it can
    see the patches here:

    https://salsa.debian.org/games-team/bsdgames
    --
    Keith Thompson (The_Other_Keith) [email protected]
    void Void(void) { Void(); } /* The recursive call of the void */
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Vir Campestris@[email protected] to alt.comp.lang.c,comp.lang.c on Thu Oct 3 17:06:13 2024
    From Newsgroup: comp.lang.c

    On 02/10/2024 13:44, Bonita Montero wrote:
    Ive developed a Sieve of Erastosthenes in C++20 with an unbeaten
    performance. With this code I get all primes <= 2 ^ 32 in 140ms
    on my AMD 7950X 16-core Zen4-system.
    It calculates first only the primes up to the square root of the
    maximum. The remaining bits are partitioned according to the num-
    ber of CPU-cores. Within each thread the bitmap is partitioned in
    parts that fit into the L2-cache (queried via CPUID).
    All operations are aligned on a cacheline-boundary to avoid false
    sharing and locking of shared words.

    Unbeaten?

    This isn't the same program you posted before over on comp.lang.c++, but
    it builds and runs fine on my machine.

    for 4e9 primes my machine take nearly 10 seconds for 4294967295 primes.
    The one I posted before over there takes 1.5 seconds.

    This version is if anything slightly slower than your other one.

    You might like to have a look.

    (And Keith, yes, I realise there is no point whatsoever in doing this...
    but it's been fun. First time in years I've written anything I'm not
    getting paid for!)

    Andy
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Paul@[email protected] to alt.comp.lang.c,comp.lang.c on Thu Oct 3 17:25:14 2024
    From Newsgroup: comp.lang.c

    On Wed, 10/2/2024 8:44 AM, Bonita Montero wrote:
    Ive developed a Sieve of Erastosthenes in C++20 with an unbeaten
    performance. With this code I get all primes <= 2 ^ 32 in 140ms
    on my AMD 7950X 16-core Zen4-system.

    Why not measure against the programs provided in a Linux distro ?

    $ time primesieve 1e6
    Sieve size = 256 KiB
    Threads = 1
    100%
    Seconds: 0.000
    Primes: 78498

    real 0m0.002s <=== Without printing the numbers, is pretty fast
    user 0m0.000s
    sys 0m0.000s

    $ time primesieve 1e6 --print > NUL

    real 0m0.009s <=== Without Terminal to slow I/O, is in the Unix ballpark user 0m0.005s
    sys 0m0.000s

    $ time primesieve 4294967295
    Sieve size = 256 KiB
    Threads = 16 <=== 5700G Zen3 (8C 16T), stock speed, execution under WSL2
    100%
    Seconds: 0.072
    Primes: 203280221

    real 0m0.073s
    user 0m0.869s
    sys 0m0.009s

    Paul
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Bonita Montero@[email protected] to alt.comp.lang.c,comp.lang.c on Fri Oct 4 06:53:00 2024
    From Newsgroup: comp.lang.c

    Am 03.10.2024 um 18:06 schrieb Vir Campestris:

    for 4e9 primes my machine take nearly 10 seconds for 4294967295
    primes. The one I posted before over there takes 1.5 seconds.

    Show me your code. I'm pretty sure you never posted faster code.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Vir Campestris@[email protected] to alt.comp.lang.c,comp.lang.c on Sun Oct 20 11:49:22 2024
    From Newsgroup: comp.lang.c

    On 04/10/2024 05:53, Bonita Montero wrote:
    Am 03.10.2024 um 18:06 schrieb Vir Campestris:

    for 4e9 primes my machine take nearly 10 seconds for 4294967295
    primes. The one I posted before over there takes 1.5 seconds.

    Show me your code. I'm pretty sure you never posted faster code.


    Try this.
    --
    /*
    Programme to calculate primes using the Sieve or Eratosthenes
    Copyright (C) 2024 the person known as Vir Campestris

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program. If not, see <https://www.gnu.org/licenses/>.
    */

    #include <algorithm>
    #include <cassert>
    #include <chrono>
    #include <climits>
    #include <cmath>
    #include <csignal>
    #include <cstring>
    #include <fstream>
    #include <iomanip>
    #include <iostream>
    #include <list>
    #include <memory>
    #include <new>
    #include <thread>
    #include <vector>

    // If these trigger you have a really odd architecture. static_assert(sizeof(uint8_t) == 1);
    static_assert(CHAR_BIT == 8);

    // The type of a prime number.
    // size_t isn't big enough on a 32-bit machine with 2GB memory allocated
    and 16 primes per byte.
    typedef uint64_t PRIME;

    // This is the type of the word I operate in. The code will run OK with
    uin8_t or uint32_t,
    // but *on my computer* it's slower.
    typedef uint64_t StoreType;

    // tuneable constants
    constexpr size_t tinyCount = 5;
    constexpr size_t smallCount = 10;
    constexpr size_t bigCache = 16384;

    // this is a convenience class for timing.
    class Stopwatch {
    typedef std::chrono::steady_clock CLOCK;
    typedef std::pair<unsigned, std::chrono::time_point<CLOCK> > LAP;
    typedef std::vector<LAP> LAPS;
    LAPS mLaps;
    public:
    typedef std::vector<std::pair<unsigned, double>> LAPTIMES;
    Stopwatch() {}

    // record the start of an interval
    void start()
    {
    mLaps.clear();
    mLaps.emplace_back(std::make_pair(0, CLOCK::now()));
    }

    void lap(unsigned label)
    {
    mLaps.emplace_back(std::make_pair(label, CLOCK::now()));
    }

    // record the end of an interval
    void stop()
    {
    mLaps.emplace_back(std::make_pair(-1, CLOCK::now()));
    }

    // report the difference between the last start and stop
    double delta() const
    {
    assert(mLaps.size() >= 2);
    assert(mLaps.front().first == 0);
    assert(mLaps.back().first == -1);
    return std::chrono::duration_cast<std::chrono::nanoseconds>(mLaps.back().second
    - mLaps.front().second).count() / (double)1e9;
    }

    LAPTIMES const laps() const
    {
    LAPTIMES ret;
    assert(mLaps.size() >= 2);
    ret.reserve(mLaps.size() - 1);
    auto lap = mLaps.begin();
    auto start = lap->second;

    for(++lap; lap != mLaps.end(); ++lap)
    {
    ret.emplace_back(
    std::make_pair(
    lap->first,

    std::chrono::duration_cast<std::chrono::nanoseconds>(lap->second - start).count() / (double)1e9));

    }
    return ret;
    }
    };

    // Internal class used to store the mask data for one or more
    // primes rather than all primes. This has an initial block,
    // followed by a block which contains data which is repeated
    // for all higher values.
    // Example: StoreType== uint8_t, prime == 3, the mask should be
    // 90 meaning 1,3,5,7,11,13 are prime but not 9, 15
    // 24,49,92 covers odds 17-63
    // 24,49,92 covers 65-111, with an identical mask
    // As the mask is identical we only need to store the
    // non-repeat, plus one copy of the repeated block and data to
    // show where it starts and ends. Note that for big primes the
    // initial block may be more than one StoreType.
    // As there are no common factors between any prime and the
    // number of bits in StoreType the repeat is the size of the
    // prime.
    // Note also that a masker for two primes will have an
    // initial block which is as big as the larger of the sources,
    // and a repeat which is the product of the sources.
    class Masker final
    {
    // The length of the initial block.
    size_t initSize;

    // The size of the repeated part. The entire store size is the
    sum of these two.
    size_t repSize;

    // Buffer management local class. I started using std::vector
    for the store,
    // but I found it was slow for large sizes as it insists on
    calling the constructor
    // for each item in turn. And there may be billions of them.
    class Buffer final
    {
    StoreType* mpBuffer; // base of the data referred to
    size_t mSize; // size of the buffer in
    StoreType-s.

    public:
    // Normal constructor
    Buffer(size_t size = 0) : mpBuffer(nullptr), mSize(size)
    {
    if (size > 0)
    {
    mpBuffer = static_cast<StoreType*>(std::malloc(size * sizeof(StoreType)));

    if (!mpBuffer)
    throw std::bad_alloc();
    }
    }

    Buffer(Buffer&) = delete;

    // Move constructor
    Buffer(Buffer&& source)
    : mpBuffer(source.mpBuffer), mSize(source.mSize)
    {
    source.mSize = 0;
    source.mpBuffer = nullptr;
    };

    Buffer& operator=(Buffer&) = delete;

    // Move assignment
    Buffer& operator=(Buffer&& source)
    {
    if (mpBuffer)
    std::free(mpBuffer);
    mpBuffer = source.mpBuffer;
    mSize = source.mSize;
    source.mSize = 0;
    source.mpBuffer = nullptr;
    return *this;
    }

    void resize(size_t newSize)
    {
    mpBuffer = static_cast<StoreType*>(std::realloc(mpBuffer, newSize *
    sizeof(StoreType)));
    if (!mpBuffer)
    throw std::bad_alloc();
    mSize = newSize;
    }

    // Get the data start
    inline StoreType* operator*() const
    {
    return mpBuffer;
    }

    // Get the owned size
    inline size_t const & size() const
    {
    return mSize;
    }

    // clean up.
    ~Buffer()
    {
    if (mpBuffer)
    std::free(mpBuffer);
    }
    } mBuffer;

    // Raw pointer to the store for speed.
    StoreType * mStorePtr;

    // shift from the prime value to the index in mStore
    static constexpr size_t wordshift = sizeof(StoreType) == 1 ? 4
    : sizeof(StoreType) == 4 ? 6 : sizeof(StoreType) == 8 ? 7 : 0;
    static_assert(wordshift);

    // Mask for the bits in the store that select a prime.
    static constexpr size_t wordmask = (StoreType)-1 >> (sizeof(StoreType) * CHAR_BIT + 1 - wordshift);

    // convert a prime to the index of a word in the store
    static inline size_t index(PRIME const & prime)
    {
    return prime >> wordshift;
    }

    // get a mask representing the bit for a prime in a word
    static inline StoreType mask(PRIME const & prime)
    {
    return (StoreType)1 << ((prime >> 1) & wordmask);
    }


    public:
    // This class allows us to iterate through a Masker
    indefinitely, retrieving words,
    // automatically wrapping back to the beginning of the repat
    part when the end is met.
    class MaskIterator
    {
    StoreType const * index, *repStart, *repEnd;
    public:
    MaskIterator(Masker const * origin)
    : index(origin->mStorePtr)
    , repStart(index + origin->initSize)
    , repEnd(repStart + origin->repSize)
    {}

    inline StoreType const & operator*() const
    {
    return *index;
    }

    inline StoreType next() // returns value with iterator post-increment
    {
    auto & retval = *index;
    if (++index >= repEnd)
    index = repStart;
    return retval;
    }
    };

    // Default constructor. Makes a dummy mask for overwriting.
    Masker()
    : initSize(0)
    , repSize(1)
    , mBuffer(1)
    , mStorePtr(*mBuffer)
    {
    *mStorePtr = 0; // nothing marked unprime
    }

    // Move constructor (destroys source)
    Masker(Masker&& source)
    : initSize(source.initSize)
    , repSize(source.repSize)
    , mBuffer(std::move(source.mBuffer))
    , mStorePtr(*mBuffer)
    {
    }

    // Copy constructor (preserves source)
    Masker(Masker& source)
    : initSize(source.initSize)
    , repSize(source.repSize)
    , mBuffer(initSize + repSize)
    , mStorePtr(*mBuffer)
    {
    memcpy(*mBuffer, *source.mBuffer, mBuffer.size() * sizeof(StoreType));
    }

    // move assignment (destroys source)
    Masker& operator=(Masker&& source)
    {
    initSize = source.initSize;
    repSize = source.repSize;
    mStorePtr = source.mStorePtr;
    mBuffer = std::move(source.mBuffer);
    return *this;
    }

    // Construct for a single prime
    Masker(PRIME prime)
    : initSize(this->index(prime)+1)
    , repSize(prime)
    , mBuffer(initSize + prime)
    , mStorePtr(*mBuffer)
    {
    // Pre-zero the buffer, because for bigger primes we
    don't write every word.
    // This is fast enough not to care about excess writes.
    memset(mStorePtr, 0, mBuffer.size() * sizeof(StoreType));

    // The max prime we'll actually store.
    // *2 because we don't store evens.
    PRIME const maxPrime = (initSize + repSize) * CHAR_BIT
    * sizeof(StoreType) * 2;

    // Filter all the bits. Note that although we
    // don't strictly need to filter from prime*3,
    // but only from prime**2, doing from *3 makes
    // the init block smaller.
    PRIME const prime2 = prime * 2;

    for (PRIME value = prime * 3; value < maxPrime; value
    += prime2)
    {
    mStorePtr[this->index(value)] |= this->mask(value);
    }
    }

    // Construct from two others, making a mask that excludes all
    numbers
    // marked not prime in either of them. The output init block
    // will be the largest from either of the inputs; the repeat
    block size will be the
    // product of the input repeat sizes.
    // The output could get quite large - 3.7E12 for all primes up
    to 37
    Masker(const Masker& left, const Masker& right, size_t
    storeLimit = 0)
    : initSize(std::max(left.initSize, right.initSize))
    , repSize (left.repSize * right.repSize)
    , mBuffer()
    {
    if (storeLimit)
    repSize = std::min(initSize + repSize,
    storeLimit) - initSize;

    auto storeSize = initSize + repSize;

    // Actually construct the store with the desired size
    mBuffer = storeSize;
    mStorePtr = *mBuffer;

    // get iterators to the two inputs. These automatically
    wrap
    // when their repsize is reached.
    auto li = left.begin();
    auto ri = right.begin();

    for (size_t word = 0; word < storeSize; ++word)
    {
    mStorePtr[word] = li.next() | ri.next();
    }
    }

    // Construct from several others, making a mask that excludes
    all numbers
    // marked not prime in any of them. The output init block
    // will be the largest from any of the inputs; the repeat size
    will be the
    // product of all the input repeat sizes.
    // The output could get quite large - 3.7E12 for all primes up
    to 37
    // the type iterator should be an iterator into a collection of Masker-s.
    template <typename iterator> Masker(const iterator& begin,
    const iterator& end, size_t storeLimit = 0)
    : initSize(0)
    , repSize(1)
    , mBuffer() // empty for now
    {
    // Iterate over the inputs. We will
    // * Determine the maximum init size
    // * Determine the product of all the repeat sizes.
    // * Record the number of primes we represent.
    size_t nInputs = std::distance(begin, end);
    std::vector<MaskIterator> iterators;
    iterators.reserve(nInputs);

    for (auto i = begin; i != end; ++i)
    {
    initSize = std::max(initSize, i->initSize);
    repSize *= i->repSize;
    iterators.push_back(i->begin());
    }
    if (storeLimit)
    repSize = std::min(initSize + repSize,
    storeLimit) - initSize;
    auto storeSize = initSize + repSize;
    // Actually construct the store with the desired size
    mBuffer = storeSize;
    mStorePtr = *mBuffer;

    // take the last one off (most efficient to remove)
    // and use it as the initial mask value
    auto last = iterators.back();
    iterators.pop_back();
    for (auto word = 0; word < storeSize; ++word)
    {
    StoreType mask = last.next();
    for(auto &i: iterators)
    {
    mask |= i.next();
    }
    mStorePtr[word] = mask;
    }
    }

    template <typename iterator> Masker& andequals(size_t
    cachesize, iterator begin, iterator end)
    {
    static constexpr size_t wordshift = sizeof(StoreType)
    == 1 ? 4 : sizeof(StoreType) == 4 ? 6 : sizeof(StoreType) == 8 ? 7 : 0;
    static constexpr size_t wordmask = (StoreType)-1 >> (sizeof(StoreType) * CHAR_BIT + 1 - wordshift);

    struct value
    {
    PRIME step; // this is the
    prime. The step should be 2P, but only half the bits are stored
    PRIME halfMultiple; // this is the current multiple, _without_ the last bit
    value(PRIME prime): step(prime), halfMultiple((prime*prime) >> 1) {}
    bool operator<(PRIME halfLimit) const
    {
    return halfMultiple < halfLimit;
    }
    void next(StoreType * store)
    {
    static constexpr size_t wsm1 =
    wordshift - 1;
    auto index = halfMultiple >> wsm1;
    auto mask = (StoreType)1 <<
    (halfMultiple & wordmask);
    *(store + index) |= mask;
    halfMultiple += step;
    }
    };
    std::vector<value> values;
    values.reserve(std::distance(begin, end));
    for (auto prime = begin; prime != end; ++prime)
    {
    auto p = *prime;
    values.emplace_back(p);
    }

    size_t limit = 0;
    do
    {
    limit = std::min(limit + cachesize, initSize + repSize + 1);
    PRIME halfMaxPrime = (limit * 16 *
    sizeof(StoreType)) / 2 + 1;
    for (auto & value: values)
    {
    while (value < halfMaxPrime)
    value.next(mStorePtr);
    }
    }
    while (limit < initSize + repSize);

    return *this;
    }

    // duplicates the data up to the amount needed for a prime newSize
    void resize(PRIME newSize)
    {
    size_t sizeWords = this->index(newSize) + 1;
    assert(sizeWords > (initSize + repSize)); // not
    allowed to shrink!
    mBuffer.resize(sizeWords);
    mStorePtr = *mBuffer;

    auto copySource = mStorePtr + initSize;
    do
    {
    auto copySize = std::min(sizeWords - repSize - initSize, repSize);
    auto dest = copySource + repSize;
    memcpy(dest, copySource, copySize * sizeof(StoreType));
    repSize += copySize;
    } while ((initSize + repSize) < sizeWords);
    }

    // returns true if this mask thinks value is prime.
    bool get(PRIME value) const
    {
    assert(value < sizeof(StoreType) * CHAR_BIT * 2 * mBuffer.size());
    auto ret =
    (value <= 3)
    || ((value & 1) &&
    (mStorePtr[this->index(value)] & this->mask(value)) == 0);

    return ret;
    }

    // Get the beginning of the bitmap.
    // Incrementing this iterator can continue indefinitely,
    regardless of the bitmap size.
    MaskIterator begin() const
    {
    return MaskIterator(this);
    }

    size_t repsize() const
    {
    return repSize;
    }

    size_t size() const
    {
    return initSize + repSize;
    }

    // prints a collection to stderr
    void dump(size_t limit = 0) const
    {
    std::cerr << std::dec << repSize;
    std::cerr << std::hex;

    if (limit == 0) limit = initSize + repSize;

    auto iter = begin();
    for (auto i = 0; i < limit; ++i)
    {
    // cast prevents attempting to print uint8_t as
    a char
    std::cerr << ',' << std::setfill('0') << std::setw(sizeof(StoreType) * 2) << uint64_t(mStorePtr[i]);
    }
    std::cerr << std::endl;
    std::cerr << std::dec << repSize;

    for (auto p = 1; p < limit * 16 * sizeof(StoreType); ++p)
    {
    if (get(p))
    {
    std::cerr << ',' << p;
    }
    }
    std::cerr << std::endl;
    }
    };

    int main(int argc, char**argv)
    {
    // find out if the user asked for a different number of primes
    PRIME PrimeCount = 1e9;
    if (argc >= 2)
    {
    std::istringstream arg1(argv[1]);
    std::string crud;
    arg1 >> PrimeCount >> crud;
    if (!crud.empty() || PrimeCount < 3)
    {
    double isitfloat;
    arg1.seekg(0, arg1.beg);
    crud.clear();
    arg1 >> isitfloat >> crud;
    if (!crud.empty() || isitfloat < 3)
    {
    std::cerr << "Invalid first argument
    \"" << argv[1] << "\" Should be decimal number of primes required\n";
    exit(42);
    }
    else PrimeCount = isitfloat;
    }
    }

    std::string outName;
    if (argc >= 3)
    {
    outName = argv[2];
    }

    Stopwatch s; s.start();

    Masker primes;
    PRIME nextPrime;
    {
    nextPrime = 3;
    Masker tiny(nextPrime);
    std::vector<Masker> smallMasks;
    smallMasks.emplace_back(tiny);

    // Make a mask for the really small primes, 3 5 7 11
    size_t count = 1; // allows for the 3
    for ( ; count < tinyCount; ++count)
    {
    do
    {
    nextPrime += 2;
    } while (!tiny.get(nextPrime));
    tiny = Masker(tiny, nextPrime);
    }

    // that masker for three will be correct up until 5
    squared,
    // the first non-prime whose smallest root is not 2 or 3
    assert (nextPrime < 25 && "If this triggers tinyCount
    is too big and the code will give wrong results");

    // this limit is too small, but is easy to calculate
    and big enough
    auto limit = nextPrime*nextPrime;

    smallMasks.clear();
    for (; count < smallCount; ++count)
    {
    do
    {
    nextPrime += 2;
    } while (!tiny.get(nextPrime));
    smallMasks.emplace_back(nextPrime);
    }
    assert(nextPrime <= limit && "If this triggers
    smallCount is too big and the code will give wrong results");

    // Make a single masker for all the small primes. Don't
    make it bigger than needed.
    auto small = Masker(smallMasks.begin(),
    smallMasks.end(), PrimeCount / (2 * CHAR_BIT * sizeof(StoreType)) + 1);
    s.lap(__LINE__);

    // This is the first step that takes an appreciable
    time. 2.5s for primes up to 1e11 on my machine.
    primes = Masker(tiny, small, PrimeCount / (2 * CHAR_BIT
    * sizeof(StoreType)) + 1);
    s.lap(__LINE__);
    }

    // if the buffer isn't big enough yet expand it by duplicating
    early entries.
    if (primes.size() < PrimeCount / (2 * CHAR_BIT *
    sizeof(StoreType)) + 1)
    {
    primes.resize(PrimeCount);
    s.lap(__LINE__);
    }

    do
    {
    std::vector<PRIME> quickies;
    PRIME quickMaxPrime = std::min(nextPrime*nextPrime, PRIME(sqrt(PrimeCount) + 1));
    do
    {
    do
    {
    nextPrime += 2;
    } while (!primes.get(nextPrime));
    quickies.emplace_back(nextPrime);
    } while (nextPrime < quickMaxPrime);
    s.lap(__LINE__);

    // this function adds all the quickies into the mask,
    but does all of them in
    // one area of memory before moving onto the next. The
    area of memory,
    // and the list of primes, both fit into the L1 cache.
    // You may need to adjust bigCache for best performance.
    // This step takes a while - two lots of 20s for qe11
    primes.
    primes.andequals(bigCache, quickies.begin(),
    quickies.end());
    s.lap(__LINE__);
    } while (nextPrime*nextPrime < PrimeCount);

    s.stop();
    std::cerr << primes.size() << "," << s.laps().back().second << std::endl;
    for (auto l: s.laps()) std::cerr << l.first << "," << l.second
    << std::endl;

    if (outName.length())
    {
    std::ofstream outfile(outName);
    outfile << "2\n";
    for (PRIME prime = 3; prime < PrimeCount; prime += 2)
    {
    if (primes.get(prime))
    {
    outfile << prime << '\n';
    }
    }
    }

    return 0;
    }

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From gazelle@[email protected] (Kenny McCormack) to alt.comp.lang.c,comp.lang.c on Sun Oct 20 12:14:42 2024
    From Newsgroup: comp.lang.c

    In article <vf2n7i$cko5$[email protected]>,
    Vir Campestris <[email protected]d> wrote:
    ...
    #include <algorithm>

    What newsgroup(s) is/are we in?

    Let's check:

    Newsgroups: alt.comp.lang.c,comp.lang.c

    Hmmm.
    --
    A pervert, a racist, and a con man walk into a bar...

    Bartender says, "What will you have, Donald!"

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Bonita Montero@[email protected] to alt.comp.lang.c,comp.lang.c on Mon Oct 21 17:36:57 2024
    From Newsgroup: comp.lang.c

    Am 20.10.2024 um 12:49 schrieb Vir Campestris:
    On 04/10/2024 05:53, Bonita Montero wrote:
    Am 03.10.2024 um 18:06 schrieb Vir Campestris:

    for 4e9 primes my machine take nearly 10 seconds for 4294967295
    primes. The one I posted before over there takes 1.5 seconds.

    Show me your code. I'm pretty sure you never posted faster code.


    Try this.


    Still about half the speed of my code for 2 ^ 32.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Richard Harnden@[email protected] to comp.lang.c on Tue Oct 22 17:13:08 2024
    From Newsgroup: comp.lang.c

    On 20/10/2024 11:49, Vir Campestris wrote:
    On 04/10/2024 05:53, Bonita Montero wrote:
    Am 03.10.2024 um 18:06 schrieb Vir Campestris:

    for 4e9 primes my machine take nearly 10 seconds for 4294967295
    primes. The one I posted before over there takes 1.5 seconds.

    Show me your code. I'm pretty sure you never posted faster code.


    Try this.


    I just want to say that those are very good comments.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Vir Campestris@[email protected] to alt.comp.lang.c,comp.lang.c on Sat Oct 26 15:28:24 2024
    From Newsgroup: comp.lang.c

    On 21/10/2024 16:36, Bonita Montero wrote:
    Am 20.10.2024 um 12:49 schrieb Vir Campestris:
    On 04/10/2024 05:53, Bonita Montero wrote:
    Am 03.10.2024 um 18:06 schrieb Vir Campestris:

    for 4e9 primes my machine take nearly 10 seconds for 4294967295
    primes. The one I posted before over there takes 1.5 seconds.

    Show me your code. I'm pretty sure you never posted faster code.


    Try this.


    Still about half the speed of my code for 2 ^ 32.


    Fascinating.

    But after my latest experience perhaps not surprising.

    You're on Windows aren't you? I do know you have a far more powerful
    machine than I do. I'm running Ubuntu with Gnu compiler. (I tried Clang,
    but it's slower)

    When I try your code on my system it takes 10.4s eleapsed, and 12.1s of
    CPU to calculate 4294967296. That code I supplied up there takes only
    1.47s elapsed, and a little less CPU - it's single threaded.

    So on _my_ system my code is about 7 times faster!

    But my recent experience - I've been playing with using a modulo30
    store, not just odds.

    My code makes bitmasks for 2 sets of small primes (these can be small,
    there's a repeat at the product of all the primes) then combines the 2
    to make the initial mask for the desired number of primes. It then masks
    out the bigger primes.

    With a mask of odds-only the mask words can be 64 bit, and combining
    them is way faster than Mod30 - where the mask is a byte. So I put
    together some code so that it would put the two small masks together
    with 64 bit operations. That bit of the code got twice as fast. Then I realised that the next part, for reasons I have been unable to
    establish, runs three times more slowly.

    But only when I'm using GCC (not Clang) and only when -Ofast
    optimisation is turned on. That loop is the same speed in all 0other cases.

    I then tried putting some debug timing statements in, and I could get
    the same effect by inserting some in in various places.

    I looked at the output assembly, and there seem to be random changes (instructions that are not dependent on each other being re-ordered) so
    it's not obvious why this is happening.
    I think I'm bored enough with this project to just leave it. I've been
    more interested in correctness, not speed, in recent years.

    Andy
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Vir Campestris@[email protected] to alt.comp.lang.c,comp.lang.c on Sat Oct 26 15:30:15 2024
    From Newsgroup: comp.lang.c

    On 20/10/2024 13:14, Kenny McCormack wrote:
    In article <vf2n7i$cko5$[email protected]>,
    Vir Campestris <[email protected]d> wrote:
    ...
    #include <algorithm>

    What newsgroup(s) is/are we in?

    Let's check:

    Newsgroups: alt.comp.lang.c,comp.lang.c

    Hmmm.


    And the thread title is... ?

    Just kill the thread if you don't care. I won't start another one.

    Andy

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Bonita Montero@[email protected] to alt.comp.lang.c,comp.lang.c on Sat Oct 26 16:33:45 2024
    From Newsgroup: comp.lang.c

    Am 26.10.2024 um 16:28 schrieb Vir Campestris:

    You're on Windows aren't you? I do know you have a far more powerful
    machine than I do. I'm running Ubuntu with Gnu compiler. (I tried Clang,
    but it's slower)

    I compared your code against mine on a AMD 7950X.


    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Vir Campestris@[email protected] to alt.comp.lang.c,comp.lang.c on Sat Oct 26 21:02:10 2024
    From Newsgroup: comp.lang.c

    On 26/10/2024 15:33, Bonita Montero wrote:
    Am 26.10.2024 um 16:28 schrieb Vir Campestris:

    You're on Windows aren't you? I do know you have a far more powerful
    machine than I do. I'm running Ubuntu with Gnu compiler. (I tried
    Clang, but it's slower)

    I compared your code against mine on a AMD 7950X.


    Which operating system, and which compiler?

    It's probably also worth trying for a much larger number of primes. I
    wrote my command line parser so it will try to read the number as a
    double if it isn't an integer.

    For 100000000000 primes my code on my machine is 39s, and yours is 3:46.

    (I've got the mod30 version down to 28s now, but met an intractable
    optimiser funny)

    Andy
    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Bonita Montero@[email protected] to alt.comp.lang.c,comp.lang.c on Sun Oct 27 06:02:25 2024
    From Newsgroup: comp.lang.c

    Am 26.10.2024 um 22:02 schrieb Vir Campestris:

    Which operating system, and which compiler?

    Win11, clang-cl 17.0.3. MSVC is about the same performance.

    --- Synchronet 3.20a-Linux NewsLink 1.114
  • From Vir Campestris@[email protected] to alt.comp.lang.c,comp.lang.c on Mon Oct 28 16:21:24 2024
    From Newsgroup: comp.lang.c

    On 27/10/2024 05:02, Bonita Montero wrote:
    Am 26.10.2024 um 22:02 schrieb Vir Campestris:

    Which operating system, and which compiler?

    Win11, clang-cl 17.0.3. MSVC is about the same performance.

    The plot thickens.

    I'm in Ubuntu 24.01 LTS.

    I normally use Gnu C++, g++ (Ubuntu 13.2.0-23ubuntu4) 13.2.0.

    Using g++ for 10000000000 primes my code took 3.3s, and yours 22.7.

    If I rebuild them using clang (Ubuntu clang version 18.1.3) then
    my code takes 14.5s (far slower) and yours takes 11.6 (much quicker than
    with Gnu and a bit quicker than mine).

    Andy
    --- Synchronet 3.20a-Linux NewsLink 1.114