Hacker News — vinext + Cloudflare Workers

new
past
show
ask
show
jobs
submit

▲We found a bug in the hyper HTTP library (blog.cloudflare.com)

160 points by Pop_- 5 days ago | 82 comments

edelbitter 1 days ago [-]

Cloudflare does not notice (until a customer complains) that they are sending broken responses at scale? I would have thought they would notice this from sampling and linting a few replies.. just in case they did something like Cloudbleed again.

ramon156 1 days ago [-]

Can you get reasonable results without exposing sensitive info? I'm asking because I genuinely have no idea what it's like at their scale

1 days ago [-]

Twey 1 days ago [-]

This would have been flagged by Clippy lints `let_underscore_untyped` or `let_underscore_must_use`, which sadly are not enabled by default.

microgpt 1 days ago [-]

Or just by not writing let _ =

Twey 1 days ago [-]

All recurrent people problems are system problems.

microgpt 23 hours ago [-]

As seen by the fact that forcing the programmer to write let _ = to silence the warning did not fix the bug.

You know what might've solved this though? Using threads instead of async

Twey 13 hours ago [-]

I think that's more of a workaround than a fix. Relying on unstructured concurrency does mean you no longer have to understand the scheduling of your program's fibres (… until you do), but it introduces a bunch of new footguns around things like correct cancellation, error propagation, and predicting resource lifetimes.

pwdisswordfishq 1 days ago [-]

Ehh, easy fix

    #[allow(clippy::let_underscore_untyped,clippy::let_underscore_must_use)]
    let _ = self.poll_flush(cx)?;

Twey 1 days ago [-]

I said ‘flagged’, not ‘fixed’ :)

You can always write the wrong code if you want it enough. But hopefully a warning would have prompted someone to think harder about this flow.

pwdisswordfishq 1 days ago [-]

But "let _ =" is already an explicit suppression of a must-use warning. Where does this arms race of "no, I really know what I am doing, compiler" versus "no, this really looks like a mistake, programmer" end?

Twey 1 days ago [-]

That's an excellent question I don't have an answer for in general :)

IMHO the goal is usually for the compiler not to make these decisions but to provide the tools for the APIs people build to make them. That's kind of passing the buck, though.

I guess in this case the core problem is that the API for these I/O calls has no representation in the type system for what's happening to the buffer. Proxying it as ‘the programmer must think about this code path’ is a reasonable best-effort but, evidently, sometimes inadequate.

tialaramex 24 hours ago [-]

I do feel like Rust did enough to allow software engineers and their managers to make an explicit choice here.

Twey 1 hours ago [-]

I think it definitely sits at a reasonable point in the tradeoff space here, but it's not the only reasonable point. And it's the nature of tradeoffs that some cases will slip through the cracks.

PoignardAzur 24 hours ago [-]

And this is why you should warn on `clippy::allow_attributes_without_reason` in your projects.

nesarkvechnep 1 days ago [-]

Yeah, but you must know about them and the possible bug first in order to allow them...

Twey 1 days ago [-]

Hence ‘sadly’. IMNSHO both of these (or at least _untyped) should be enabled by default. Untyped `let _` is too big a footgun during refactorings.

Joker_vD 1 days ago [-]

At which point you wouldn't have written this bug in the first place; or the warnings would trigger immediately, you'd change _ to an actual variable and then remove the warning pragmas because now you don't assign to _.

Twey 1 days ago [-]

`Poll` is marked `#[must_use]` so if you were assigning to something other than `_` you'd get a warning that you're ignoring the `Pending` path. The Clippy lint is only for `_` which Rust considers a use by default.

turboponyy 23 hours ago [-]

Not really. If I'm using a linter, I go and configure the strictest possible ruleset, and only disable rules when justified on a need-by-need basis. It's just a matter of discipline.

lunar_mycroft 1 days ago [-]

You can set the lints to `forbid` instead of `deny`, which means they can't be `allowed` like that.

jimmypk 24 hours ago [-]

[flagged]

worldsavior 1 days ago [-]

> We spent six weeks chasing a nearly invisible bug — a race condition that occurred only under specific conditions — in the hyper library that impacted how the Images binding returned processed image data back to the client. In the end, it took four lines of code to fix it.

That's a long time, must be frustrating.

gmueckl 21 hours ago [-]

It is a long time and it gets frustrating when there is significant time where there is flailing with no visible progress.

I have had long bug hunts (~a month each) and witnessed ones that took much, much longer. But the longest one I witnessed was drawn out because reproduction was initially unreliable and could take weeks to months. Thankfully, reproduction was by letting a box sit in a corner while tje people involved moved on to other tasks. This kept everybody sane.

Atotalnoob 11 hours ago [-]

Sometimes the best option is to enhance observability and see when it happens again

pseudony 1 days ago [-]

So “fearless concurrency” still only happens when one just decides to not be afraid… :)

c0balt 1 days ago [-]

This does not appear to be a concurrency bug though?

microgpt 1 days ago [-]

Of course it's a concurrency bug. It races sending data to the kernel against the kernel sending data to the network. If the wrong one wins the bug occurs.

tetha 20 hours ago [-]

But it did not take 2 threads within the same application to interact in a bad way on data the system controlled to cause this problem.

This reads more like an overly broad transition in a deterministic state machine. The fix was to split up a bad transition to shutdown.

microgpt 20 hours ago [-]

Concurrency bugs don't have to be within a single process.

inexcf 19 hours ago [-]

By that definition, every write() call that doesn't check for EAGAIN is a concurrency bug you're racing the disk controller. The term stops meaning anything.

inexcf 21 hours ago [-]

Isn't that like saying there can never be a language with safe concurrency since the code could interact with C code that segfaults? I dunno this kinda reminds me of the 10/10 Rust CVE that turned out to be cmd.exe on Windows not sanitizing inputs and languages like Java just labeled it "won't fix".

microgpt 20 hours ago [-]

You mean the one where Windows doesn't have argv the way Unix does, and instead just has a single string that is interpreted slightly differently by each executable? That is a language making false assertions about how the underlying platform works, causing an impedance mismatch that is impossible to fix.

inexcf 19 hours ago [-]

Yeah, the one most languages (except for Rust)* decided was not a language problem and did not fix.

*should clarify, Node.js, PHP, and Haskell did ship patches. Python, Ruby, Erlang, and Go opted for documentation updates; Java went "won't fix."

pseudony 1 days ago [-]

“ a race condition that occurred only under specific conditions — in the hyper library”

conradludgate 9 hours ago [-]

Fearless concurrency has always been regarding data races. There's no fear of undefined behaviour due to a race condition.

Rust has never promised to solve race conditions as a whole.

microgpt 1 days ago [-]

Would using Rust have prevented this?

testdelacc1 22 hours ago [-]

I get that it’s fun to dunk on Rust when a Rust bug surfaces. But is it a bit petty to bring this out when there’s any type of bug of any severity in any Rust software?

In this case a small minority of requests were getting truncated responses.

No one said Rust software is bug free. If someone thinks that they’ve been seriously misled.

re-thc 1 days ago [-]

Isn't this already Rust?

pjmlp 1 days ago [-]

That was obviously a joke question, pointing that Rust isn't the solution for everything.

lelanthran 1 days ago [-]

Woosh :-)

Ygg2 24 hours ago [-]

No. Anyone expecting that hasn't read No Silver Bullet essay.

tialaramex 23 hours ago [-]

Actually I suspect that Rust is a Silver Bullet in that sense. That essay seems to be a case where people know of the essay but haven't read it. Normally in English a "Silver Bullet" is something much bigger, a panacea or cure all which entirely solves a problem but in his essay Brooks is talking about order-of-magnitude improvements, and that looks a lot like Rust.

Brooks was expecting such "Silver Bullet" improvements as often as every few decades, we're arguably overdue significantly. He cites Ada as an example of where such an improvement might come from, well, Rust isn't Ada but a lot of the same ideas about correctness are present.

Google reports order of magnitude changes from their Rust work for example.

Ygg2 16 hours ago [-]

> Actually I suspect that Rust is a Silver Bullet in that sense.

Not really. If anything I'm pretty sure Fred would compare it to Ada:

    > Nevertheless, Ada will not prove to be the silver bullet that slays the software productivity monster, It is, after all, just another high-level language, and the biggest payoff from such languages came from the first transition, up from the accidental complexities of the machine into the more abstract statement of step-by-step solutions.

And I think he'd be on the money. Rust and Ada are revolutionary languages in their own right, but 10x holy grail they aren't.

tialaramex 13 hours ago [-]

> And I think he'd be on the money. Rust and Ada are revolutionary languages in their own right, but 10x holy grail they aren't.

Google reports 4x fewer (and still falling) rollbacks for Rust versus C++ in their Android development for example. That's not 10x but it's pretty sizeable.

pjmlp 3 hours ago [-]

Which they could equally achieve by reducing even further the use of NDK, or moving more stuff into Kotlin/Java processes using Android IPC.

Naturally there are several ideological questions between that approach from the anti GC crowd, and adopting Rust instead.

Not that breaks the C++ vs Rust from your comment, though.

tancop 2 hours ago [-]

thats how you get bloated memory use and disk space. you know, the things everyone complains about when it comes to modern tech

pjmlp 2 hours ago [-]

Only by programmers that missed algorithms and data structures classes, unfortunately plenty of them nowadays, fresh out of JavaScript and Python Web bootcamps with zero CS background.

Folks that wonder pre-history tech with amazement and lack of understanding of how it used to work in first place.

https://news.ycombinator.com/item?id=48727323

HumanOstrich 23 hours ago [-]

Order of magnitude more complexity.

Cthulhu_ 1 days ago [-]

The Hyper library in question is a Rust library.

Did you read the article, or are you a "use rust" parrot / bot based on titles?

waysa 1 days ago [-]

Sarcasm. (I guess)

frankharv 1 days ago [-]

Obviously written by a C freak using a BSD

geodel 1 days ago [-]

Agree. This is warning to people who thought Rust is optional at cloud scale.

nopurpose 1 days ago [-]

Nice writeup, but I don't understand how `curl` didn't trigger bug for them (or any other hyper HTTP server out there), given the explanation in the article.

`curl --http1.1` sends `Connection: Close` so sender (hyper) must attempt to shutdown connection after sending whole body. Surely any network is slower than memory copy into socket kernel buffers, so it must reliably trigger condition "buffer flush can't be done in one go" and thus trigger early TCP shutdown.

23 hours ago [-]

100ms 1 days ago [-]

> The failure was caused by a timing-dependent race condition in hyper’s HTTP/1 connection handling. When the reader was slower and the socket buffer filled, poll_flush returned Poll::Pending, but the dispatch loop discarded that result. Hyper then treated the response as complete and shut down the socket while data remained buffered internally, causing the client to receive an EOF before the full body arrived.

https://github.com/hyperium/hyper/issues/4022

Saved you 3000 words

michalc 1 days ago [-]

Reminds me of another “slow client”-related bug in gunicorn: https://github.com/benoitc/gunicorn/issues/3334

microgpt 1 days ago [-]

That's not even a bug. That's how TCP works. If you keep sending data to a socket the other side has closed, you get RST.

edelbitter 1 days ago [-]

In case of plain HTTP over TCP, there is even a hint in the spec about why and how a server might want to avoid fully closing prematurely.

https://datatracker.ietf.org/doc/html/rfc9112#section-9.6 (this was already in https://datatracker.ietf.org/doc/html/rfc7230#section-6.6)

microgpt 20 hours ago [-]

This is relevant if the client sends multiple requests but the server decides to close the connection after one of them. The server should discard the additional requests until the client signals no more requests are coming.

NooneAtAll3 1 days ago [-]

what is RST?

ignoramous 23 hours ago [-]

Connection "reset": https://en.wikipedia.org/wiki/Transmission_Control_Protocol#...

microgpt 20 hours ago [-]

connection reset - TCP says "this connection is too messed up, abort, abort!"

The relevant condition here is where one side closed its socket but the other side didn't and keeps sending data to the closed socket. That's obviously an improper way to end a connection. A graceful shutdown does not send RST and ensures all data is received on both sides.

moralestapia 21 hours ago [-]

Hey, you have to justify three engineers full time's worth of salary.

1 days ago [-]

nopurpose 1 days ago [-]

[dead]

logicchains 1 days ago [-]

[flagged]

fwlr 1 days ago [-]

    let _ = …?

This is the Rust idiom for “I am intentionally ignoring this return value”. The linter would have caught

    self.poll_read()?;

and in fact one of the options the linter itself suggests in this case is exactly this “let underscore equals” idiom. (Arguably, this code exists because of the linter, not due to its absence!)

In any case, the return value is being “handled” - the question mark examines the result and breaks the loop if the result is not `Ok(…)`, ie if the call is not successful.

Intentionally ignoring the successful return value isn’t necessarily terrible, either - you could be calling the function for its side effect, and you don’t care what the specific result of that effect is, just as long as there is some effect. E.g. maybe you have a state machine, and this is the code that repeatedly drives it.

(Not coincidentally, polling is what you do to Futures, and Futures are state machines that you need to repeatedly drive…)

In conclusion, I do not think this is prima facie terrible code, nor is it an obvious bug. Async rust is subtle and complicated, and not always fully understood by those who nevertheless have to use it.

logicchains 1 days ago [-]

>This is the Rust idiom for “I am intentionally ignoring this return value”.

That doesn't make the code any less awful, it just makes idiomatic Rust sound awful. Discarding a return value without even a comment to explain why shouldn't be allowed in any critical project, and the linter should be perfectly capable of ensuring that a comment accompanies the discard and complaining loudly when it doesn't.

dwroberts 24 hours ago [-]

This is missing that it’s a human issue though. If someone is determined to discard an error and not do anything about it, they’ll just put in a dummy comment to appease the linter any way.

Force people to handle errors and you end up with the exception fiasco in eg Java where everything ends up being a runtime exception to avoid it

throw_await 22 hours ago [-]

As the top comment states, there is a lint rule, but you have to turn it on.

lifthrasiir 1 days ago [-]

It is an explicit way to discard return values; `self.poll_read(cx)?` etc. alone would warn. Or in this case, `Poll<Result<(), Error>>` is unwrapped once and `Result<(), Error>` is being discarded. The decision to discard `Result<(), Error>` should have been intentional, albeit turned out to be not always the case.

watt 1 days ago [-]

If they're not going to handle the return values, they should change the function signature to reflect this aspirational contract, that that function "never fails".

I see in the article they did change the poll_flush to run just-in-time at poll_shutdown. So they definitely can make a "best effort" poll_flush version that just does not return any errors for use in that loop.

But all in all? Amateur hour.

a_cul 1 days ago [-]

You're missing how rust works. The function is explicitly allowed to fail, which is why it returns a Result<(), Error>. They're using the function calls within for their side effects. The ? at the end of each line signals that the function will short-circuit return with an error if the function call fails, and only if it is successful it returns the actual value: they just don't care about this value, hence the let _ =. Basically, they are doing the equivalent of:

  let _, err = function_call();
  if err {
    return err
  }
  ...

watt 1 days ago [-]

What I am saying, is make another version of the function, which is explicitly not allowed to fail, if you want to use it in the loop.

1 days ago [-]

QuantumNomad_ 1 days ago [-]

Assigning to _ in Rust specifically means that you intentionally want to discard the value, and the clippy linter and the Rust compiler both know that.

1 days ago [-]

giammbo 1 days ago [-]

[flagged]

r_lee 1 days ago [-]

LLM?

giammbo 1 days ago [-]

why?

Thaxll 1 days ago [-]

So much for Rust forcing you to handle errors.

Matl 1 days ago [-]

Go does force you too, but it also supports _ as a bypass - because sometimes you do know better. Just not in this case.

Rust never promised it'll let programmers turn off their brain, that's what LLMs are for.

wongarsu 1 days ago [-]

You could argue the bug happened exactly because hyper's poll_flush treats flushing some but not all data as a successful return, not an error case.

atoav 1 days ago [-]

You could say the exact same thing about safety belts and airbags in cars after someone has died in a crash.

Why even bother with measures that prevent many problems if they won't prevent all of them, right?

chlorion 1 days ago [-]

This is the argument I like too.

It's the same argument anti-vaxers love to make. "Well you can still get covid after getting the shot", which is something I read and heard quite a lot. That doesn't make the thing useless.

Humans are really dumb.

atoav 19 hours ago [-]

The older I get the more I realise that I have taken way to much of my mathematical intuition for granted.

This I'd an example of people not grasping simple probabilities. Forced to play Russian Roulette they prefer the revolver with 5 bullets in its chambers or the one with just a single bullet, because in both cases people die.

Other concepts many people do not grasp are feedback-loops and exponentials. And by not grasp I mean: You explain it to them, the nod along and when faced with the thing in slightly different clothes they will actively deny it'd existence.

jerf 24 hours ago [-]

There's a hidden equivocation there. "Handling" errors, as far as the language is concerned, mean you do something with them, but explicitly discarding them is most definitely a "something".

From a human perspective we can consider that not handling the error.

But the language has no mechanism for "knowing" that discarding the error is wrong. Discarding errors is a fully valid mechanism that we must be able to do in a program because it is sometimes correct. There really isn't even a sensible way to define a way to "force" a user to "handle" errors. The language can only be designed to make it hard to forget to "handle" them somehow in the way the language sees, but it is always possible for the user to incorrectly handle them, of which discarding them when they shouldn't have is only one particularly cognitively-available option but is hardly the full scope of possibilities. Probably isn't even the most common mistake to make, I would imagine there are far more errors that are not handled "correctly" than ones that are spuriously discarded.

Note I keep saying "language" rather than Rust. All a language can do is surface the issue, and Rust does that. It can't force good code. No language can.

xacky 23 hours ago [-]

Yet Cloudflare relies on bugs in browsers to "verify" you.

algoth1 1 days ago [-]

I wonder if this bug was found via project glasswing

re-thc 1 days ago [-]

> I wonder if this bug was found via project glasswing

Did you read how they said it took weeks? Would run out of tokens at that rate...

Rendered at 13:35:36 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.