Meeting notes 5/31/2018 #42
[TE] Add SQL Server (on linux) as a database type TechEmpower/FrameworkBenchmarks#3807 |
Pipelining in MySQL

Background

The MySQL protocol is packet-oriented (details). (Note: multiple MySQL packets may be contained in a single TCP packet if they're small enough.) Each MySQL packet (up to 16MiB) has a one-byte sequence number. Client/server interactions are grouped into "conversations" in which the first packet has sequence number 0 and each subsequent packet has the next sequence number (with wraparound). If a party in the conversation needs to send multiple packets, it numbers them i, i+1, ..., i+k and the other party replies (if necessary) with i+k+1.

Request/Response

Some conversations have multiple back-and-forth packets, e.g., log on:
Others (e.g., ping) are very simple, one packet from each side:
A query will result in multiple packets being sent back from the server; note that they're numbered consecutively:
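To make the packet framing concrete, here's a minimal sketch (my own illustration, not MySqlConnector's actual code) of encoding and decoding the 4-byte MySQL packet header: a 3-byte little-endian payload length followed by the one-byte sequence number.

```csharp
using System;

static class MySqlPacketHeader
{
    // A MySQL packet starts with a 4-byte header:
    //   bytes 0-2: payload length, little-endian (max 0xFFFFFF, i.e. just under 16MiB)
    //   byte 3:    sequence number (wraps around at 255)
    public static (int Length, byte Sequence) Read(ReadOnlySpan<byte> header)
    {
        int length = header[0] | (header[1] << 8) | (header[2] << 16);
        return (length, header[3]);
    }

    public static byte[] Write(int payloadLength, byte sequence)
    {
        return new byte[]
        {
            (byte)(payloadLength & 0xFF),
            (byte)((payloadLength >> 8) & 0xFF),
            (byte)((payloadLength >> 16) & 0xFF),
            sequence,
        };
    }
}
```

A pipelined buffer would simply concatenate several such header+payload frames, each starting a fresh conversation with sequence number 0.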
Batching

I understand "batching" to mean putting multiple individual SQL statements into one command.
This is done, for example, here: https://github.com/TechEmpower/FrameworkBenchmarks/blob/724a773096f403d149cfaf59b8257465bbf70103/frameworks/CSharp/aspnetcore/Benchmarks/Data/RawDb.cs#L130

Pipelining

By "pipelining" in MySQL, I mean starting multiple separate client conversations before the server response has been received. This is not an official part of the MySQL protocol and some servers (e.g., Amazon Aurora) will simply not respond to it. It looks something like:
It appears to work because MySQL Server reads the first packet with sequence ID 0, sends its response, then pulls the next packet off its receive buffer. It reads that as a new conversation with sequence ID 0, so it sends its response to that, and so on. One reason it works is that very few conversations in the MySQL protocol have multiple replies from both sides; it's generally a client request followed by one or more server packets, after which the conversation is implicitly over.

One core difference between batching and pipelining is that pipelining lets you send packets of different kinds; batching is just one outgoing packet with a larger payload. The only place I've attempted to use it is in the library internals, e.g., sending one TCP packet that includes all the MySQL packets needed to send a "reset connection" request. (As hinted at earlier, this works great with MySQL Server, but not Amazon Aurora, so I reverted it and now all connection resets require multiple network roundtrips.)

This pipelining doesn't map easily onto the ADO.NET API. Conceivably one could let a client execute multiple commands before their results are read. Hope this helps; please let me know if anything's unclear. |
As the authentication is the same, could connection pooling work to some degree down the same connection for multiplexing? (e.g. use fewer "physical connections" than the total "active connections" for fuller TCP packets and less chatter)
|
The current model suggested by ADO.NET is to pool actual connections and associate each with a single ADO.NET connection. What other "fast" frameworks do is open a specific number of database connections and not associate them with the ADO.NET connection itself. When a command is sent, the ADO.NET connection will look for any currently open connection, or one by affinity, and use it. It means that two requests, getting two ADO.NET connections, could be using the same database connection, allowing pipelining. Even without pipelining this could still improve perf, as fewer database connections are active and prepared statements are shared more efficiently. |
Base pooling around data stream rather than ADO connection; like?
Makes sense... |
I think this is theoretically possible, but haven't ever tested it. Presumably each
The assumption of a dedicated connection is pretty common in client code: variables, temp tables, session settings, etc. are all associated with the current session, which is represented by a |
I was suggesting the opposite, a single physical connection shared by multiple ado.net ones. |
It still would cause problems if two non-pipeline-aware clients both execute |
A few comments from me... First, the batching vs. pipelining distinction @bgrainger made above isn't very important for the PostgreSQL case. PostgreSQL does have two sub-protocols - simple and extended. The simple protocol has a single wire message (called Query), which may contain multiple SQL statements separated by semicolons, and which PostgreSQL parses on its end (this is what @bgrainger calls "batching"). Unfortunately the simple protocol is quite crippled - it does not support binary value transfers, parameters or prepared statements.

The extended protocol, on the other hand, requires a series of messages for each SQL statement. Executing an unprepared statement involves sending 5 protocol messages (Parse/Bind/Describe/Execute/Sync), while a prepared statement involves sending 3 (Bind/Execute/Sync). It is not permitted to include multiple SQL statements separated by semicolons, so Npgsql performs rudimentary SQL parsing client-side, splitting the CommandText on semicolons and sending the above set of messages for each parsed SQL statement.

I think the difference between the above two types of batching is rather uninteresting - it's an internal protocol/implementation detail that doesn't affect much, and I tend to think of both of the above as "batching". Note that PostgreSQL messages do not have any sort of ID or running number, so we can send as many messages as we want up-front; PostgreSQL just reads and executes messages as they come in, and it is the client's responsibility to keep track of ordering and parse the results accordingly.

I tend to think of pipelining as the ability of a single client (important!) to send a 2nd request before the 1st one has completed; in ADO.NET terms this would mean executing a 2nd command before the 1st command's results have been fully read. Finally there's multiplexing, which to me describes the ability to combine queries from multiple clients into one or more physical connections - we're talking about something like a multiple-producer/multiple-consumer model.
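As a rough illustration of the client-side splitting described above, here's a much-simplified sketch. Npgsql's real parser also handles comments, dollar-quoting, escaped quotes and more; the class name here is purely illustrative.

```csharp
using System.Collections.Generic;

static class NaiveSqlSplitter
{
    // Splits a CommandText on top-level semicolons, skipping semicolons
    // inside single-quoted string literals. A production parser must also
    // handle comments, dollar-quoting, escaped quotes, and so on.
    public static List<string> Split(string commandText)
    {
        var statements = new List<string>();
        var start = 0;
        var inQuote = false;
        for (var i = 0; i < commandText.Length; i++)
        {
            var c = commandText[i];
            if (c == '\'')
                inQuote = !inQuote;
            else if (c == ';' && !inQuote)
            {
                var stmt = commandText.Substring(start, i - start).Trim();
                if (stmt.Length > 0)
                    statements.Add(stmt);
                start = i + 1;
            }
        }
        var last = commandText.Substring(start).Trim();
        if (last.Length > 0)
            statements.Add(last);
        return statements;
    }
}
```

Each resulting statement would then get its own Parse/Bind/Describe/Execute message group, followed by a single Sync for the batch.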
And now for some perf analysis:
I understand that one (some?) of the top TechEmpower performers apparently multiplexes (have not actually looked into this), but do we have any justification for thinking that this is what makes it faster? |
Oh, that reminds me... MySQL also supports a binary protocol, which you have to use if you want to use prepared statements (which are not currently supported in MySqlConnector). Like PG, it doesn't support multiple statements, so MySqlConnector would have to parse the SQL and break them up. It has a three-phase Bind/Execute/Reset model. Implementing prepared statements would provide an estimated 15% improvement for |
Huh, interesting that the protocols are so similar in that way... I guess both started out from a simple text-based protocol and later added the advanced features as a new sub-protocol... Implementing preparation definitely sounds like a high priority perf gain, you're welcome to help yourself to Npgsql's query parser if it can help you out. |
Interesting... as the final command in data updates is 20 update statements separated by semicolons. What would the best approach be to send this as a batch of commands? (from the user API side) |
Querys (20) would be such a test; also data updates (as that is queries + updates); and spawning them on multiple tasks. I tried it with Mono, the only safe way I saw at the time was to use multiple connections for the concurrency - however multiplying up the connection count made it very unhappy so I had to back it out https://github.com/TechEmpower/FrameworkBenchmarks/pull/3784/files |
As things currently stand, you would pack them all into a single command's CommandText and execute. Npgsql splits them and sends them via separate protocol messages, which potentially fit into a single outgoing TCP packet if the updates are small enough. https://github.com/dotnet/corefx/issues/3688 tracks adding a better API for batching, which would possibly obviate client-side parsing. In addition, preparation should be used; it has a significant impact for PostgreSQL. I admit I haven't looked much at the update tests as the focus was on fortunes; I'll try to take a look and see where we are on the rest. |
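For illustration, assuming the TechEmpower World table schema, a batched CommandText for the updates test might be assembled like this (a sketch; the class name and parameter naming scheme are mine, not from any of the drivers discussed):

```csharp
using System.Text;

static class BatchedUpdateBuilder
{
    // Packs N UPDATE statements into one CommandText, separated by
    // semicolons, with uniquely named parameters per statement.
    // The driver (e.g. Npgsql) then splits and sends them as separate
    // protocol messages, potentially in one outgoing TCP packet.
    public static string Build(int count)
    {
        var sb = new StringBuilder();
        for (var i = 0; i < count; i++)
            sb.Append($"UPDATE world SET randomnumber = @rn{i} WHERE id = @id{i};");
        return sb.ToString();
    }
}
```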
I'll take a look at Querys and the other db benchmarks this weekend to see what can be improved |
So that would be allowed by the updates part of data updates; however it would be disallowed by the queries part, as the queries need to be sent by the app as non-batched (though the driver is allowed to then batch them). However you could, validly, do overlapping queries, e.g.

```csharp
var task1 = cmd1.ExecuteReaderAsync();
var task2 = cmd2.ExecuteReaderAsync();
var task3 = cmd3.ExecuteReaderAsync();
var task4 = cmd4.ExecuteReaderAsync();

var task = await Task.WhenAny(task1, task2, task3, task4);
using (var rdr = await task)
{
    // ....
}
```
|
Does it split the string and create more strings? Or just serialize in chunks to the wire? |
If I understand you correctly, you're saying that this is disallowed:

```csharp
var cmd = new NpgsqlCommand("SELECT 1; SELECT 2", conn);
var reader = cmd.ExecuteReader();
```

While the following is allowed:

```csharp
var cmd1 = new NpgsqlCommand("SELECT 1", conn);
var cmd2 = new NpgsqlCommand("SELECT 2", conn);
var reader1 = cmd1.ExecuteReader();
var reader2 = cmd2.ExecuteReader();
```

If so, that seems a bit ridiculous - they're equivalent in almost every way, at least perf-wise (again, assuming a single producer, no multiplexing, i.e. handling a single HTTP request)... The only user-facing difference would seem to be error handling (i.e. does the 2nd query get executed if the first fails). The first is already supported - Npgsql and PostgreSQL don't care if the batched statements are updates or selects - while the 2nd isn't. Can you point to the part in the TechEmpower docs which says the 1st is disallowed while the 2nd is OK?
At the moment it splits, creating more strings (not ideal). Unfortunately there's another unrelated aspect that requires this: PostgreSQL parameter placeholders are positional ($1, $2, ...) rather than named, so Npgsql has to rewrite the SQL anyway.

There's one more aspect interacting with all this - prepared statements. npgsql/npgsql#1701 was about caching the raw SQL of the entire command (CommandText) rather than individual statements' SQLs, precisely to avoid having to parse CommandText on repeated executions.

The above is quite a lot of Npgsql internals to take in :) Let me know if you want any more details, and looking forward to hearing any ideas. |
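A sketch of the kind of placeholder rewriting described above - mapping ADO.NET-style named placeholders onto PostgreSQL's positional $1/$2 form. This is a toy version that ignores quoting and comments; it is not Npgsql's actual rewriter.

```csharp
using System.Collections.Generic;
using System.Text;

static class PlaceholderRewriter
{
    // Rewrites "@name" placeholders to positional "$1", "$2", ... in
    // order of first appearance, returning the rewritten SQL and the
    // ordered parameter names. Repeated names reuse the same position.
    public static (string Sql, List<string> Names) Rewrite(string sql)
    {
        var names = new List<string>();
        var sb = new StringBuilder();
        for (var i = 0; i < sql.Length; i++)
        {
            if (sql[i] == '@')
            {
                var start = ++i;
                while (i < sql.Length && (char.IsLetterOrDigit(sql[i]) || sql[i] == '_'))
                    i++;
                var name = sql.Substring(start, i - start);
                var index = names.IndexOf(name);
                if (index < 0) { names.Add(name); index = names.Count - 1; }
                sb.Append('$').Append(index + 1);
                i--; // the for-loop's i++ moves past the name
            }
            else
            {
                sb.Append(sql[i]);
            }
        }
        return (sb.ToString(), names);
    }
}
```

The returned name list gives the order in which parameter values must be bound in the Bind message.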
https://www.techempower.com/benchmarks/#section=code “Test type 3: Multiple database queries”, point 6:
“General Requirements”, point 7:
|
@bgrainger hmm, thanks. However, I'm not sure that actually disallows batching at the driver level. I think it can be claimed that the restriction in point 6 is more about using SQL features to express the same thing in one query - see this comment. For example, it's forbidden to collapse the 20 queries into a single SQL statement. This seems supported by the following from the comment linked to above:
Sending a single NpgsqlCommand with multiple statements still results in separate queries being sent to the database. So according to my interpretation this is allowed. I've posted a question to be absolutely sure. |
I agree that the rules are a little ambiguous. If they wanted separate DB queries (and network roundtrips), it would be nice if there were a clear data dependency in the tests, i.e., the second query depended on the results of executing the first query and thus couldn't be pipelined. |
FWIW, I just looked at the source code of some of the fastest MySQL entrants (in the Multiple Queries and Data Updates categories); AFAICT each of them is executing a separate query per statement. |
Nah, wasn't the point I was making. The second example you gave is more or less the same as doing it in a loop (as they are sync queries). So the first is disallowed because it is directly batching in application code, and that's only allowed as a network optimisation by the driver:
What I was suggesting is using async; stashing aside the |
Even in async, if we compare this:

```csharp
var cmd = new NpgsqlCommand("SELECT 1; SELECT 2", conn);
var reader = await cmd.ExecuteReaderAsync();
```

with this:

```csharp
var cmd1 = new NpgsqlCommand("SELECT 1", conn);
var cmd2 = new NpgsqlCommand("SELECT 2", conn);
var task1 = cmd1.ExecuteReaderAsync();
var task2 = cmd2.ExecuteReaderAsync();
var task = await Task.WhenAny(task1, task2);
```

Once again, at the network level the above two fragments produce almost exactly the same thing: a single roundtrip sending two queries, in almost identical wire protocol messages - there's no advantage in the 2nd fragment compared to the first one. This is also why I can't really see any sane reason for the 1st fragment to be disallowed in a benchmark while the 2nd fragment would be allowed. So I don't really see the sync/async aspect as being very important here.

The only advantage of the 2nd fragment is if you don't have all the queries you want to send upfront (e.g. because they're gradually coming in via some HTTP connection or something). The 2nd method would allow you to send each command as it's coming in, whereas the 1st only allows you to batch several upfront, but then wait until that batch is complete. |
One last comment, to reiterate @bgrainger's earlier comment. If we look at this:

```csharp
var task1 = cmd1.ExecuteReaderAsync();
var task2 = cmd2.ExecuteReaderAsync();
var task = await Task.WhenAny(task1, task2);
```

When does it make sense for task2 to complete? If it completes as soon as we're starting to receive the 2nd resultset (which is what the ADO.NET API suggests), then task2 can only complete after the 1st resultset has been consumed. In other words, the two resultsets are still strictly ordered on the wire.

There's nothing necessarily wrong with this, just pointing out that in the end there's a single TCP connection on which we're serially sending queries - there's no actual parallelism going on. |
For the sync version:

```csharp
var cmd1 = new NpgsqlCommand("SELECT 1", conn);
var cmd2 = new NpgsqlCommand("SELECT 2", conn);
var reader1 = cmd1.ExecuteReader();
var reader2 = cmd2.ExecuteReader();
```

The statement `cmd2.ExecuteReader()` can't even start until `reader1` has been consumed and closed; whereas for the async version it can.
Which is the point. Batching the queries as one command in the app is disallowed.

Rephrasing: consider the 20 query test as an http type test. If instead you had to download 20 images from a remote site, it becomes a test of whether you can do it in parallel and how well you can do that, as well as how your download works. Does it take 20x the download time, or 1x the longest download because they are done concurrently?

If the db driver then takes advantage of this and pipelines/batches them, that's great, but it's an implementation detail outside of the code you are writing using the framework, and the tests are about how code you can write with the framework performs (the db driver would count as part of the framework rather than user code).

They are also looking to remove the database component to focus specifically on this concurrency:
https://twitter.com/BrianHauerTSO/status/1000127582627545088
|
Whether it's parallel, concurrent, pipelined or serial sends (request->response->request) is an implementation detail of how it's achieved. The important factor is that it's not the user code doing it - it's the benefit the framework provides for idiomatic(ish) user code. |
Hmm... reading the clarification: https://groups.google.com/forum/#!msg/framework-benchmarks/A78HrYsz4AQ/_zneyvo2CwAJ
ADO allows multiple active result sets, so you could do:

```csharp
var cmd = new NpgsqlCommand("SELECT 1; SELECT 2", conn);
var reader = await cmd.ExecuteReaderAsync();
await reader.ReadAsync();
await reader.NextResultAsync(); // move to next query
await reader.ReadAsync();
```

And that would be valid if an invalid query in the middle of the batch behaved the same as it would for independent commands. Question, would it? How would that present via the API? (The invalid query) |
@benaadams with the current behavior of Npgsql, if a statement in the middle of a batch fails then the rest of the statements are skipped by PostgreSQL. This is achieved via a feature of the PostgreSQL protocol, whereby the moment a protocol message generates an error, PostgreSQL will skip all subsequent messages until it sees a Sync message. Npgsql sends a single Sync at the end of the whole batch, so an error in one statement causes all remaining statements to be skipped. In .NET API terms, you would be able to read resultsets up to the failed statement, at which point an exception is raised.

What I don't understand is why these failure semantics have anything to do with the TechEmpower benchmark requirements - what possible interest is it of TechEmpower whether Npgsql's batching logic skips remaining messages after an error or not? In other words, should a database driver be crippled in the benchmark just because it provides batching semantics which skip after error? What exactly would be the point of that? Once again, I do understand the point of disallowing a single complex SQL statement that expresses the whole workload at once, but that's a different thing.

Out of curiosity, @benaadams, when executing two commands asynchronously with pipelining, would you expect the driver to retain the current streaming behavior, forcing queries' resultsets to be consumed in the same order that they were sent? Or would you expect resultsets to be buffered in memory, allowing the 2nd resultset to be consumed before the 1st? This question is irrelevant for multiplexing, where buffering is clearly mandatory.

Finally, regardless of TechEmpower rules... Just to make sure I'm clear, I don't really have objections to pipelining or even multiplexing - they do have their perf benefits (although pipelining by a single client seems to be helpful only for a scenario which I find relatively rare - requests gradually coming in and being gradually pipelined on the same physical db connection). It definitely is a considerable departure from the current driver model, and I'm still not 100% sure I understand the perf impacts (i.e. what is the perf penalty of switching from the current streaming behavior to buffering?).
But it's definitely an interesting direction. At some point we may want to explore a design where a database driver has a low-level send/receive API (somewhat similar to libpq), on top of which we would be able to build either a classical ADO.NET model or a multiplexing layer. |
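The skip-until-Sync error recovery described in the comment above can be modeled with a toy simulation. This is only an illustration of the control flow, not actual protocol code; the message and result names are simplified stand-ins for the real PostgreSQL protocol messages.

```csharp
using System.Collections.Generic;

static class SkipUntilSync
{
    // Simulates PostgreSQL's error recovery: once a message fails,
    // all subsequent messages are skipped until a "Sync" is seen,
    // at which point normal processing resumes with ReadyForQuery.
    public static List<string> Process(IEnumerable<(string Name, bool Fails)> messages)
    {
        var results = new List<string>();
        var skipping = false;
        foreach (var (name, fails) in messages)
        {
            if (skipping)
            {
                // Everything is discarded until the next Sync.
                if (name == "Sync") { skipping = false; results.Add("ReadyForQuery"); }
                continue;
            }
            if (name == "Sync") { results.Add("ReadyForQuery"); continue; }
            if (fails) { results.Add("ErrorResponse"); skipping = true; continue; }
            results.Add($"{name}Complete");
        }
        return results;
    }
}
```

Because Npgsql sends a single Sync after the whole batch, one failing Execute means none of the later statements in the batch produce results.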
Because the semantics of the test are for a set of independent queries:
If a failure in one query causes the following queries not to execute, then they aren't being run as independent queries; so it's a semantically different behaviour - therefore not a valid optimisation to be equivalent, i.e. it's an observable change in behaviour (though likely the correct behaviour for an actual combined query?)
I was expecting it to fail horribly, as it would likely be stomping over state, which is why for the Mono version I opened multiple connections instead and then distributed the queries amongst them. However, that pushed the connection count from 512 to 5120 and executed them in parallel - which just made everything unhappy - so I had to back it out. |
That's exactly what I have a hard time accepting. So if Npgsql happened to implement its batching by sending Parse1/Bind1/Describe1/Execute1/Sync/Parse2/Bind2/Describe2/Execute2/Sync (note the additional Sync in the middle), then it would be eligible for this optimization in the benchmarks? How does that make sense, and who benefits from this seemingly arbitrary exclusion? Would it make sense for Npgsql to have another batching mode without the skip-on-error logic so that batching can be used in TE?

To be clear, I understand how what you're saying can be understood from the TE rules, but I'm questioning the sense of those rules. The way I see it, forbidding the "complex query" case makes sense to prevent a benchmark implementation from collapsing the workload into a single SQL query.

In any case, this discussion needs to happen with the TechEmpower folks rather than with you @benaadams (although it's interesting to hear your opinion). I'll post the arguments above in the Google discussion tomorrow. |
I think we can leave the TechEmpower eligibility questions to the thread where I posted, but it's still interesting to discuss batching/pipelining/multiplexing here... I'm specifically interested in what people think about the impact of switching to buffering of resultsets as a result of multiplexing and possibly some forms of pipelining (since resultsets aren't consumed in order). I tend to think that streaming is very important compared to buffering, but I know EF Core does buffer in some common cases. What are people's intuitions (or experience) about the degradation caused by buffering? |
Most of my (micro) ORM experience is with Dapper and they buffer by default. |
Attendees: @ajcvickers, @DamianEdwards, @divega, @sebastienros
We discussed today the next steps we can focus on:
Low hanging fruit
Longer term investments