Corax Query Plan visualization
Corax is the new indexing and querying engine in RavenDB, introduced in the recent RavenDB 6.0 release. Our focus when building Corax was on one thing: performance. I gave a full talk explaining how it works from the inside out, available here, as well as a couple of podcasts.
Now that RavenDB 6.0 has been out for a while, we’ve had the chance to complete a host of small Corax features, mostly tasks that didn’t make the cut for the big 6.0 release.
All these features are available in the 6.0.102 release, which went live in late April 2024.
The most important new feature for Corax is query plan visualization.
Let’s run the following query in the RavenDB Studio on the sample data set:
from index 'Orders/ByShipment/Location' where spatial.within(ShipmentLocation, spatial.circle( 10, 49.255, 4.154, 'miles') ) and (Employee = 'employees/5-A' or Company = 'companies/85-A') order by Company, score() include timings()
Note that we are using the include timings() feature. If you configure this index to use Corax, issuing the above query will also give us the full query plan. In this case, you can see it here:
You can see exactly how the query engine has processed your query and the pipeline it has gone through.
We have incorporated many additional features into Corax, including phrase queries, scoring based on spatial results, and more complex sorting pipelines. For the most part, these are small features, but they fulfill specific needs and enable a wider range of scenarios for Corax.
More than six months after Corax went live with 6.0, I can say that it has been a successful feature. It performs its primary job well, being a faster and more efficient querying engine. And the best part is that it isn’t even something that you need to be aware of.
Corax has been the default indexing engine for the Development and Community editions of RavenDB for over 3 months now, and almost no one has noticed.
It’s a strange metric, I know, for a feature to be successful when no one is even aware of its existence, but that is a common theme for RavenDB. The whole point behind RavenDB is to provide a database that works, allowing you to forget about it.
Comments
Can you perhaps show a scenario where this could be used to improve a query?
peter,
It usually comes up with a complex query whose query plan you then analyze. A good example is using a range on a DateTime value, where you end up scanning through a lot of values.
Consider a range query on OrderedAt. Assume that OrderedAt is a DateTime with millisecond precision, but your $start and $end are always day boundaries. You'll see that you are doing an expensive operation. The same query, expressed against OrderedAt.Date, will actually be faster, because of the coarser granularity.
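As a sketch, the two query shapes might look like this (dynamic queries against a hypothetical Orders collection, using the field and parameter names above):

from Orders where OrderedAt between $start and $end

from Orders where OrderedAt.Date between $start and $end

The first form has to walk every distinct millisecond-precision term in the range, while the second only needs to touch one term per day.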
Other things may relate to how we filter. For example, if you move the clause that you know filters out most items to the left, it may change the query plan. (We tend to do that already, but sometimes you can rewrite the query.)
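For instance, taking the fields from the query in the post, and assuming a conjunctive version where Employee matches far fewer documents than Company, you could put the more selective clause first:

from index 'Orders/ByShipment/Location' where Employee = 'employees/5-A' and Company = 'companies/85-A'

The engine usually does this reordering for you, but it is worth checking the query plan when it doesn't.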
Oren, the example you have given is kind of interesting. Do you mean that if we index a DateTime property that could be up to millisecond precision, we can specify during the query that we want to use date precision? Or do you mean indexing with Date precision?

If we talk about a real-life scenario, DateTime.Date is kind of useless. I don't know of any application that would use such a query, unless it is a UK-only application or one that only deals with dates in UTC. In practice, it is often hour or minute precision due to time zones. The majority of time zones are hour-based, while some are minute-based, e.g. 15 or 30 minutes (Newfoundland UTC-03:30, India UTC+5:30, Eucla UTC+8:45).

Additionally, if we use a LastModified property as an example, query precision is often at the hour or minute level, whereas for ordering, people expect millisecond, or at least second, precision. If we use the real-life scenario I have described, what would be the best practice? From my understanding, if what you have shown above means we need to index at date precision, then we either have to index the same property twice, once at millisecond precision for ordering and once at minute precision for querying, if space is not an issue. Or we just have to accept that querying a DateTime is expensive.

Jason,
When you index a DateTime, you can specify at what granularity level. In the query above, I'm cheating a bit, because OrderedAt and OrderedAt.Date in a dynamic query actually specify different fields. And yes, if you reduce the granularity, you give the database engine a far easier time. Date level is easiest, but you can drop the granularity to the max level you need, and that would still be useful.

Let's take OrderedAt as a great example: why do you care about the value at more than a minute's granularity? Is there a business meaning to knowing whether order #1 came before order #2 if they are in the same minute?

As an aside, note that "expensive" is relative. Let's say that you have > 100M orders and you want to index the OrderedAt value. If you are working with dates at millisecond granularity, you likely have roughly 100M unique values. If you are working at date granularity, you have hundreds to low thousands. If you are working with minutes, you have tens of thousands to low hundreds of thousands of values. That matters, because it means that a query such as "give me the most recent values" needs to access one bucket of values (or just a few) versus many different buckets.

Note that this is relevant only at the edge, not for most queries.
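To put rough numbers on it: a year has 365 days but 525,600 minutes, which is why dropping from millisecond to minute or date granularity shrinks the number of distinct terms so dramatically. The kind of query that benefits is something like this (a hypothetical dynamic query, just to illustrate the "most recent values" case):

from Orders order by OrderedAt desc limit 10

At date granularity this reads from a handful of buckets; at millisecond granularity it may have to touch a great many of them.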
Oren,
Thanks for going into the details of the discussion. I would agree with you that everything is situational; it all depends on context and what we use it for. It is also good that you have clarified the query on OrderedAt.Date, as I was curious whether I had missed a feature.

On the topic of date-time precision in the index, I totally agree with your view, and that is what got me thinking about what I can do for our existing product, since we index several date-time properties, such as create time, last modified time, and state change time. Each of them acts like an order ID, as they are unique due to the millisecond precision. For create time and state change time, precision up to seconds is fine, and up to a minute is debatable but should be good as well.
The real impact is on the last modified time, since we have a unique usage: we use that property to implement a delta query. Our product's delta query is inspired by Microsoft's delta query. We use soft delete on a given item and utilize the last modified time to implement such an API. The key aspects of the delta query are deltaLink and nextLink: nextLink is pagination through a change set, and deltaLink is a way to start the next batch of changes.

RavenDB maintains a last-modified property for each document, which is easy to use and also has the lowest maintenance cost. Other similar features that RavenDB has are subscriptions and streams; each of them has its own unique behavior, but they are hard to utilize.
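As a rough sketch, the pattern might look like the query below (assuming the last-modified value is exposed as an indexed LastModified field and the collection is called Items; both names are placeholders):

from Items where LastModified > $deltaToken order by LastModified limit 100

Each nextLink page advances $deltaToken past the last item returned, and the final deltaLink records the high-water mark for the next synchronization round.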
The reason for us to provide such an API is synchronization. A webhook can achieve synchronization but is not reliable. As Microsoft has stated, a delta query is the best way to provide such an external capability. If it is internal or managed synchronization, we will use RavenDB's features instead. My question is whether there is a better way to achieve such external synchronization, or whether the RavenDB team would be able to provide a new feature for that. It depends on whether it is a low-hanging fruit, since you already have a lot of mechanisms to synchronize between nodes. For us, each customer has their own database, so the delta query doesn't have to provide a filter; it is essentially a full collection sync.
On a side note, many of RavenDB's own features would be very handy to expose externally. One example is operations. In RavenDB 3.5, the operation ID was public: we could create an operation, respond through our API right away, and then the client side could use the ID to retrieve progress. In RavenDB 4.x and up, it is no longer a public property, and I had to read the source code to mimic the logic, so that change partially killed a feature. Of course, I am looking at this from a business application development point of view, while you are looking at it from a database security and vulnerability point of view. What I am saying is that many RavenDB capabilities could be exposed by our application developers through our API; it could be a win/win situation. First, we can offer unique features that others cannot; second, it increases our dependency on RavenDB.
Jason,
If you need to sync, be aware that RavenDB already has several features built-in for that. For example, the various ETL features. What do you mean by operation ID in this context? (It may be better to switch to the GitHub discussions, by the way.)

Where are you syncing the data to? You mentioned that this is a full collection sync, so my first guess would be to utilize subscriptions if you need custom code. But to be honest, ETL is likely better.
There are two contexts: one is sync, and one is bulk actions.

For sync, I am aware that RavenDB has several ways to sync, including ETL. What we are looking into is allowing a 3rd-party consumer to sync a specific collection for automation or analytics purposes. We could use ETL, which means we would need to use the RavenDB API to set it up; that would be similar to an integration, instead of setting it up manually in the Studio.

The ideal case is that the database is isolated so that only our server can reach it, and we then provide an alternative way to sync data to any destination. That's why I was looking at Microsoft's delta query approach. If we allow a customer to set up ETL, that means our database needs to reach an external IP address.

As for bulk actions, I mean a set-based dynamic query. For the operation object that is created once the operation starts, in RavenDB 3.5 you could read the ID of the operation, and then create the operation object again by using that ID.

So instead of waiting on the operation, our server initializes the set operation, then returns the operation ID to our web application. Our web application will then use the operation ID to query for the latest progress and display it in the UI. Isn't that a great feature? :D

When we migrated to RavenDB 5, I noticed the operation ID is no longer a public property, so I had to look at the source code to mimic everything the RavenDB client does in order to get the operation ID.
Jason,
Please note, we actually provide a hub & sink model, where the external service will connect to RavenDB, and then RavenDB will push the data to them. See: https://ravendb.net/docs/article-page/6.0/csharp/studio/database/tasks/ongoing-tasks/hub-sink-replication/overview
That way, you don't need to mess with firewall configurations and the like when a new customer goes live. You can also replicate just some parts of the database using Filtered Replication.
Regarding operation IDs - that is because it is a per-node object, but you may be using a different node on the next call, so it isn't as stable as you may think. Note that you can force it to work by getting the operation ID and the node tag, then using both.

That said, we already provide a Changes API for that purpose, which can watch notifications on an operation.
Oren. Hub and Sink is definitely an option, but as I have stated, it is more like an internal tool. Currently, we host our own database, and the database itself isn't reachable from an external IP. My boss even once wanted to completely lock it down, so that the database VM could not reach any external IP. Of course, if we did that, we would run into issues with RavenDB certificate verification.

If we look at this from a different point of view, I certainly don't want to expose raw database data to the customer; it's nothing to do with security but more about the contract. If it goes through the API, then the data transfer object model is consistent: it won't change even if we change the database model to an extent. For now, the delta query style is the easiest way to expose data without too much debt, and then optionally we can set up replication if we don't care about model consistency.
For Operation, when you initialize an Operation from the standard method, the response is an Operation object, where ID and NodeTag are hidden. I have also checked RavenDB 3.5; they were hidden there as well, and I don't know why I was thinking they were exposed. What we have done is write our own OperationExecutor to return OperationIdResult instead.

The use case for operations is batch modification. Take a large collection of tickets as an example: our ticket status is customizable, so if a customer wants to move all tickets with status X to Y, they can initialize the operation and monitor the progress (a sketch of such a query follows this comment). Of course, such a feature is not commonly used.

I can understand that, from a database developer's point of view, such an operation is not stable, since in a cluster the operation is only running on a single node. But on the other hand, if an operation can finish 99.99% of the work and avoid the user going into each object to click, I guess that's good enough. Anyway, whether RavenDB chooses to expose those two properties or not, as long as RavenCommand can be utilized, we are all good.
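For reference, the set-based ticket update described above can be expressed as an RQL patch query, roughly like this (collection and field names are placeholders for the ticket scenario):

from Tickets where Status = 'X' update {
    this.Status = 'Y';
}

Sending that as a patch-by-query operation yields an operation object whose progress can be polled, which is exactly the pattern being discussed.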