Need to let loose a primal scream without collecting footnotes first? Have a sneer percolating in your system but not enough time/energy to make a whole post about it? Go forth and be mid: Welcome to the Stubsack, your first port of call for learning fresh Awful you'll near-instantly regret.

Any awful.systems sub may be subsneered in this subthread, techtakes or no.

If your sneer seems higher quality than you thought, feel free to cut'n'paste it into its own post — there's no quota for posting and the bar really isn't that high.

The post-Xitter web has spawned soo many "esoteric" right wing freaks, but there's no appropriate sneer-space for them. I'm talking redscare-ish, reality challenged "culture critics" who write about everything but understand nothing. I'm talking about reply-guys who make the same 6 tweets about the same 3 subjects. They're inescapable at this point, yet I don't see them mocked (as much as they should be).

Like, there was one dude a while back who insisted that women couldn't be surgeons because they didn't believe in the moon or in stars? I think each and every one of these guys is uniquely fucked up and if I can't escape them, I would love to sneer at them.

(Semi-obligatory thanks to @dgerard for starting this.)

  • self@awful.systems · 22 hours ago

    do you figure it's $1000/query because the algorithms they wrote with their insider knowledge to cheat the benchmark are very expensive to run, or is it $1000/query because they're grifters and all high mode does is use the model trained on frontiermath and allocate more resources to the query? and like any good grifter, they're targeting whales and institutional marks who are so invested that throwing away $1000 on horseshit feels like a bargain

    • froztbyte@awful.systems · edited · 21 hours ago

      so, for an extremely unscientific demonstration, here (warning: AWS may try hard to get you to engage with Explainer[0]) is an instance of an aws pricing estimate for big handwave "some gpu compute"

      and when I say "extremely unscientific", I mean "I largely pulled the numbers out of my ass". even so, they're not entirely baseless, nor just picking absolute maxvals and laughing

      parameter assumptions made:

      • "somewhat beefy" gpu instances (g4dn.4xlarge, selected through the tried and tested "squint until it looks right" method)
      • 6-day traffic pattern, excluding sunday[1]
      • daily "4h peak" total peak load profile[2]
      • 50 instances minimum, 150 maximum (let's pretend we're not openai but are instead some random fuckwit flybynight modelfuckery startup)
      • us west coast
      • spot instances, convertible spot reserves, 3y full prepay commit (yeah I know full vs partial is a big diff; once again, snore)

      (and before we get any fucking ruleslawyering dumb motherfuckers rolling in here about accuracy or whatever: get fucked kthx. this is just a very loosely demonstrative example)

      so you'd have a variable buffer of 50…150 instances, featuring 3.2…9.6TiB of RAM for working set size, 800…2400 vCPU, 50…150 nvidia t4 cores, and 800…2400GiB gpu vram
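
      (for anyone who wants to poke at where those aggregate numbers come from, here's a minimal python sketch. the per-instance spec is the published g4dn.4xlarge one, 16 vCPU / 64 GiB RAM / one nvidia t4 with 16 GiB VRAM, and the sloppy GiB-to-TiB conversion is deliberate, so the output lines up with the figures above)

        # rough aggregation over the assumed 50..150 instance buffer,
        # using the published g4dn.4xlarge per-instance spec
        VCPU_PER_INSTANCE = 16
        RAM_GIB_PER_INSTANCE = 64
        T4_PER_INSTANCE = 1
        VRAM_GIB_PER_INSTANCE = 16

        for instances in (50, 150):
            print(
                f"{instances} instances: "
                f"{instances * RAM_GIB_PER_INSTANCE / 1000:.1f} TiB RAM, "  # loose /1000, not /1024
                f"{instances * VCPU_PER_INSTANCE} vCPU, "
                f"{instances * T4_PER_INSTANCE} t4s, "
                f"{instances * VRAM_GIB_PER_INSTANCE} GiB VRAM"
            )
        # -> 50 instances: 3.2 TiB RAM, 800 vCPU, 50 t4s, 800 GiB VRAM
        # -> 150 instances: 9.6 TiB RAM, 2400 vCPU, 150 t4s, 2400 GiB VRAM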

      let's presume a perfectly spherical ops team of uniform capability[3] and imagine that we have some lovely and capable active instance prewarming and correct host caching and whatnot. y'know, things to reduce user latency. let's pretend we're fully dynamic[4]

      so, by the numbers, then

      1y times 4h daily gives us 1460h (in seconds, that's 5256000). this extremely inaccurate full-of-presumptions number gives us "service-capable lifetime". the times your concierge is at the desk, the times you can get pizza delivered.

      x3 to get to lifetime matching our spot commit, x50…x150 to get to "total possible instance hours". which is the top end of our sunshine and rainbows pretend compute budget. which, of course, we still have exactly no idea how to spend. because we don't know the real cost of servicing a query!

      but let's work backwards from some made-up shit, using numbers The Poor Public gets (vs numbers Free Microsoft Credits will imbue unto you), and see where we end up!

      so that means our baseline:

      • upfront cost: $4,527,400.00
      • monthly: $1460.00 (x3 x12 = $52560)
      • whatever the hell else is incurred (s3, bandwidth, …)
      • >=200k/y per ops/whatever person we have

      3y of 4h-daily at 50 instances = 788400000 seconds. at 150 instances, 2365200000 seconds.
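
      (a minimal python sketch of that time-budget arithmetic, under the same stated assumptions: 4 peak hours a day, 365 days a year, the 3y commit, and the 50…150 instance buffer)

        # "service-capable lifetime", expressed in instance-seconds
        PEAK_HOURS_PER_DAY = 4
        DAYS_PER_YEAR = 365
        YEARS = 3

        hours_per_year = PEAK_HOURS_PER_DAY * DAYS_PER_YEAR   # 1460 h
        seconds_per_year = hours_per_year * 3600               # 5,256,000 s

        for instances in (50, 150):
            total = seconds_per_year * YEARS * instances
            print(f"{instances} instances over {YEARS}y: {total:,} instance-seconds")
        # -> 50 instances over 3y: 788,400,000 instance-seconds
        # -> 150 instances over 3y: 2,365,200,000 instance-seconds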

      so we can say that, for our deeply Whiffs Ever So Slightly values, a second's compute on the low instance-count end is $0.01722755 and $0.00574252 at the higher instance-count end! which gives us a bit of a handle!

      this, of course, entirely ignores parallelism, n-instance job/load/whatever distribution, database lookups, network traffic, allllllll kinds of shit. which we can't really have good information on without some insider infrastructure leaks anyway. if we pretend to look at the compute alone.

      so what does $1000/query mean, in the sense of our very ridiculous and fantastical numbers? since the units are now The Same, we can simply divide things!

      at the 50 instance mark, we'd need to hypothetically spend 174139.68 instance-seconds. that's 2.0154 days of linear compute!

      at the 150 instance mark, 522419.05 instance-seconds! 6.070 days of linear compute!
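
      (minimal python sketch of just the first of those divisions, taking the quoted ~$0.00574252 per instance-second as a given input rather than re-deriving it)

        # how many instance-seconds a $1000 query buys at that rate
        DOLLARS_PER_QUERY = 1000
        DOLLARS_PER_INSTANCE_SECOND = 0.00574252   # low-end figure quoted above

        instance_seconds = DOLLARS_PER_QUERY / DOLLARS_PER_INSTANCE_SECOND
        days_linear = instance_seconds / 86_400    # seconds in a day
        print(f"{instance_seconds:,.0f} instance-seconds ≈ {days_linear:.2f} days of linear compute")
        # -> 174,140 instance-seconds ≈ 2.02 days of linear compute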

      so! what have we learned? well, we've learned that we couldn't deliver responses to prompts in Reasonable Time at these hardware presumptions! which, again, are linear presumptions. and there's gonna be a fair chunk of parallelism and other parts involved here. but even so, turns out it'd be a bit of a sizable chunk of compute allocated. to even a single prompt response.

      [0] - a product/service whose very existence I find hilarious; the entire suite of aws products is designed to extract as much money from every possible function whatsoever, leading to complexity, which they then respond to by… producing a chatbot to "guide users"

      [1] - yes yes I know, the world is not uniform and the fucking promptfans come from everywhere. I'm presuming amerocentric design thinking (which imo is probably not wrong)

      [2] - let's pretend that the calculators' presumption of 4h persistent peak load and our presumption of short-duration load approaching 4h cumulative are the same

      [3] - oh, who am I kidding, you know it's gonna be some dumb motherfuckers with ansible and k8s and terraform and chucklefuckery

      • froztbyte@awful.systems · 21 hours ago

        when digging around I happened to find this thread which has some benchmarks for a diff model

        it's apples to square fenceposts, of course, since one llm is not another. but it gives something to presume from. if g4dn.2xl gave them 214 tok/s, and if we make the extremely generous presumption that tok==word (which, well, no; cf. strawberry), then any Use Deserving Of o3 (let's say 5~15k words) would mean you need a tok-rate of 1000~3000 tok/s for a "reasonable" response latency ("5-ish seconds")

        so you'd need something like 5x g4dn.2xl just to shit out 5000 words with dolphin-llama3 in "quick" time. which, again, isn't even whatever the fuck people are doing with openai's garbage.
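
        (minimal python sketch of that arithmetic, assuming the 214 tok/s benchmark figure, the very generous tok==word handwave, and a "5-ish second" latency target)

          # how many tok/s, and how many g4dn.2xl-equivalents, a response would need
          BENCH_TOK_PER_S = 214       # dolphin-llama3 on one g4dn.2xl, per the linked thread
          TARGET_LATENCY_S = 5        # "5-ish seconds"

          for words in (5_000, 15_000):
              needed_tok_per_s = words / TARGET_LATENCY_S
              instances = needed_tok_per_s / BENCH_TOK_PER_S
              print(f"{words} words: ~{needed_tok_per_s:.0f} tok/s ≈ {instances:.1f}x g4dn.2xl")
          # -> 5000 words: ~1000 tok/s ≈ 4.7x g4dn.2xl
          # -> 15000 words: ~3000 tok/s ≈ 14.0x g4dn.2xl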

        utter, complete, comprehensive clownery. era-redefining clownery.

        but some dumb motherfucker in a bar will keep telling me it's the future. and I get to not boop 'em on the nose. le sigh.