An interesting project related to task queue processing came to the company I work for. It had previously been developed by another team. We needed to detect and resolve issues that occurred when the queue was under high load.
In short, the project consisted of several databases and applications located on different servers. A ‘task’ in this project is a stored procedure or a .NET application. Accordingly, each task must be performed on a certain database and on a certain server.
All queue-related data is stored on a dedicated server. The servers on which tasks are performed store only metadata, that is, the procedures, functions, and service data related to that server. All task-related data is retrieved through a Linked Server.
Why is it so?
- Convenience. At any time, we can specify that the data is now stored on server B.
- It was implemented before us.
The two most popular approaches to queue processing are:
- Sending a notification to the task engine when tasks appear in the queue.
- Polling the queue for tasks.
Initially, the second approach was implemented in the project. To minimize the wait time, our application polls the queue every 100–500 ms.
There is nothing wrong with this in itself, except for one thing: with this implementation, the table stays locked for extra time. Leaping ahead, I can tell that the query locks the rows it takes and reads past rows that are already locked, using the following table hints:
READPAST, ROWLOCK, UPDLOCK
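As an illustration of how these three hints are typically combined in a polling query, here is a minimal sketch. The table dbo.TaskQueue and its columns are hypothetical (the project's real schema is not shown in this article), so treat this as one possible shape of such a query rather than the project's actual code.

```sql
-- Hypothetical queue table, for illustration only.
CREATE TABLE dbo.TaskQueue (
    TaskId    INT IDENTITY(1,1) PRIMARY KEY,
    Payload   NVARCHAR(400) NOT NULL,
    Status    TINYINT NOT NULL DEFAULT 0,   -- 0 = pending, 1 = in progress
    StartedAt DATETIME2 NULL
);
GO

-- Claim one pending task in a single statement.
-- UPDLOCK takes an update lock on the row that is read,
-- ROWLOCK asks the engine to lock at row granularity,
-- READPAST skips rows locked by other sessions instead of waiting for them.
WITH next_task AS (
    SELECT TOP (1) TaskId, Payload, Status, StartedAt
    FROM dbo.TaskQueue WITH (READPAST, ROWLOCK, UPDLOCK)
    WHERE Status = 0
    ORDER BY TaskId
)
UPDATE next_task
SET Status = 1,
    StartedAt = SYSUTCDATETIME()
OUTPUT inserted.TaskId, inserted.Payload;   -- returns the claimed task, if any
```

When several workers run this statement at the same time, each one should receive a different row or an empty result, since a row locked by one worker is skipped by the others.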
So, let’s go back to the problem. During analysis, I noticed the Batch Requests/sec counter value in Activity Monitor. With only a small number of tasks in the queue (about 50), this value exceeded 1000, and CPU load was rising steeply.
My first thought was to switch to the first approach (sending a notification to the task engine). This method was implemented using Service Broker and SignalR:
- Service Broker was used for sending a notification about a new task;
- SignalR was used for sending notification to the task engine.
This tool was already in use in the project and I was short of time, so I decided not to introduce a similar tool such as NServiceBus.
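For readers unfamiliar with Service Broker, the database side of such a notification might look roughly like the sketch below. All object names (the //Demo/... message type, contract, services, and the two queues) are hypothetical, and the sketch assumes that Service Broker is enabled on the database and that both services live in the same database. It is not the project's actual implementation.

```sql
-- One-time setup: message type, contract, queues, and services.
CREATE MESSAGE TYPE [//Demo/NewTask] VALIDATION = NONE;
CREATE CONTRACT [//Demo/NewTaskContract] ([//Demo/NewTask] SENT BY INITIATOR);
CREATE QUEUE dbo.NewTaskInitiatorQueue;
CREATE QUEUE dbo.NewTaskTargetQueue;
CREATE SERVICE [//Demo/NewTaskInitiator] ON QUEUE dbo.NewTaskInitiatorQueue;
CREATE SERVICE [//Demo/NewTaskTarget]
    ON QUEUE dbo.NewTaskTargetQueue ([//Demo/NewTaskContract]);
GO

-- Sender side: announce that a task with a (hypothetical) id of 42 was queued.
DECLARE @handle UNIQUEIDENTIFIER;
BEGIN DIALOG CONVERSATION @handle
    FROM SERVICE [//Demo/NewTaskInitiator]
    TO SERVICE '//Demo/NewTaskTarget'
    ON CONTRACT [//Demo/NewTaskContract]
    WITH ENCRYPTION = OFF;
SEND ON CONVERSATION @handle
    MESSAGE TYPE [//Demo/NewTask] (N'42');
GO

-- Listener side: wait up to 5 seconds for a notification instead of polling.
WAITFOR (
    RECEIVE TOP (1)
        conversation_handle,
        message_type_name,
        CAST(message_body AS NVARCHAR(100)) AS TaskId
    FROM dbo.NewTaskTargetQueue
), TIMEOUT 5000;
```

In a setup like the one described, the listener would then forward the received notification to the task engine, for example through SignalR.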
I was really surprised that this solution did not help. Yes, performance improved, but it did not eliminate the problem completely. For debugging, I wrote a stress test that added more than 500 tasks to the queue.
This stress test allowed me to find the root of all evil.
Analysis of the list of active queries and the performance reports during high load revealed some ‘very interesting queries’ that consisted of a single command:
fetch api_cursor0000000000000003
Further analysis showed that these were queries against the Linked Server. A question arose: does a query such as select * from RemoteServer.RemoteDatabase.dbo.RemoteTable where FieldId = @Value result in a query (fetch api_cursor0000000000000003) on RemoteServer? It turns out that it does, even when the Linked Server is MS SQL.
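One way to look at such requests on the remote server is through the dynamic management views. The queries below are a sketch and not necessarily the method used in the analysis above. The first lists the requests that are currently running together with their text, and the second lists open cursors together with the statement behind each one, which is not visible in the fetch command itself.

```sql
-- Currently executing requests and their statement text.
SELECT r.session_id,
       r.status,
       r.cpu_time,
       r.logical_reads,
       t.text AS request_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID;        -- exclude this monitoring session

-- Open cursors on the instance (0 = all sessions) and the statement
-- each cursor was declared for.
SELECT c.session_id,
       c.name,
       c.properties,
       c.creation_time,
       t.text AS cursor_statement
FROM sys.dm_exec_cursors(0) AS c
CROSS APPLY sys.dm_exec_sql_text(c.sql_handle) AS t;
```

Both queries require the VIEW SERVER STATE permission on the instance where they are run.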
For a more striking example, let’s create the ‘Test’ table (the code for creating the table is available in the appendix to the article) on server A, and then execute a query on server B:
select * from dev2.test_db.dbo.test
where dev2 is server A.
On the first execution of the query, we will see a log similar to this one in Profiler on server A. Screenshot: https://bit.ly/2BVWDq6.
The complete log is located here.
Now, let’s execute a query by ID:
select * from dev2.test_db.dbo.test where ID = 3
Screenshot — https://bit.ly/2wtkOWS.
The complete log is available here.
If you liked the article, continue reading here — https://bit.ly/2Pgr9Nj.
Thanks for your time.