Friday, December 30, 2011

Look mom: a mutating table error without a trigger!

Did you ever wonder why it says 'trigger/function' in the error message of ORA-04091?


ORA-04091: table ... is mutating, trigger/function may not see it


We know (and understand) by now that a row trigger cannot read a mutating table, but what's the /function all about in above message text? Well there is a completely different scenario that has nothing to do with triggers, where you can run into this error. I thought to spend a short post on that first, so that you really fully understand that ORA-04091 is your friend. And again in this scenario it prevents you from coding logic that might work differently tomorrow than from what it did today.

Let's quickly show you this scenario. We create the EMP table again and insert some test data into it.


drop table EMP;
create table EMP
(EMPNO    number(3,0)  not null primary key
,ENAME    varchar2(20) not null
,SAL      number(4,0)  not null)
/
insert into emp(empno,ename,sal) values(100,'Toon',4000);
insert into emp(empno,ename,sal) values(101,'Izaak',5000);
insert into emp(empno,ename,sal) values(102,'Marcel',7000);
insert into emp(empno,ename,sal) values(103,'Rene',8000);
commit;


Our HR department decided that above salaries should be aligned a bit more in the year 2012. And in their wisdom they decided that the following update should be run at the end of the year:


update EMP e1 set e1.SAL =
 e1.SAL + ((select avg(e2.SAL) from EMP e2) - e1.SAL)/2
/

This update brings everybody's salary a bit more towards the current average salary of all employees. The dba does a testrun of the script in the acceptance instance:



It runs well, HR confirms that it does what it was intended to do, and dba is fine with this in production.

Here come the QA department though: they need to 'ok' the running of this script in production too. So they inspect the script with all QC guidelines in mind. Result of their review is that they won't allow this. There is distinct 'business logic' in there that needs to be removed from the statement and implemented in a stored function, so that it 'can be reused sometime in the future' (I've seen this happen in real companies...). They mandate that a function be implemented such that the update statement will be of the following form:


update EMP e set e.SAL = f_new_sal(e.SAL)
/

So the developers pump out the following function for this:


create or replace function f_new_sal
(p_current_sal in number) return number as
--
pl_avg_sal number;
--
begin
  --
  select avg(SAL) into pl_avg_sal
    from EMP;
  --
  return p_current_sal + (pl_avg_sal - p_current_sal)/2;
  --
end;
/


Everybody is happy. The dba executes the testrun again:



WTF? Oracle throws a mutating table error? But there's no trigger involved here...

Why is this happening?

If you understood the explanation of this error in my previous posts sofar, you should be able to figure this one out. Oracle again detects that we seem to be selecting from a table that's currently mutating. Think about what is happening here:

  • The update starts identifying the rows that need their SAL column value changed;
  • It does this in some undetermined order (!);
  • When the first rows is found, it invokes our function to determine the new SAL column value for it;
  • Our function queries the EMP table, which is currently mutating: bingo we hit the error.

When Oracle steps out of the SQL-engine that's currently executing the update statement, and invokes the function, then this function -- just like an after row update trigger would -- sees the intermediate states of EMP as they exist during execution of the update statement. This implies that the return value of our function invocations heavily depend on the order in which the rows happen to be updated.

Wanna see this happening? Let's use our database link "workaround" to prevent the mutating error. We create a loopback database link and change our function accordingly:

create or replace function f_new_sal
(p_current_sal in number) return number as
--
pl_avg_sal number;
--
begin
  --
  select avg(SAL) into pl_avg_sal
  from EMP@loopback;                -- Here: added db-link.
  --
  return p_current_sal + (pl_avg_sal - p_current_sal)/2;
  --
end;
/

Dba can now run the update without hitting the error:



But look at the SAL values, only Toon's salary is updated correctly. These are not accepted by HR...

And these values will be different if we force a different order of updating the rows, for instance like this:

update (select * from EMP order by EMPNO desc) e 
  set e.SAL = f_new_sal(e.SAL)
/

The proof is in eating the pudding:



Now Rene's salary is the only one being updated correctly.

So again, don't forget: ORA-04091 is your friend.

Note that in the original version of the update statement (without the function), Oracle ensures that the repeated 'invocations' of the subquery (select AVG(e2.SAL) etc.) all see the same snapshot of the database: the one that existed at the start of the update statement. So there is no indeterministic behavior in that case.

A fix for the issue demonstrated could be to just add the autonomous transaction pragma to the function. We would then get the same (deterministic) behavior as the original version of the update statement.
A better fix in this case though would be to not have the function at all, and just go with the original update statement.

Friday, December 23, 2011

The mutating table error prevents non-deterministic behavior of your code

A short recap of workarounds 1a and 1b from our previous post.
  • In workaround 1a we ensured that the query that was selecting from the mutating table, was being executed from within an autonomous transaction.
  • In workaround 1b we ensured that this query was being executed via a loopback database link.
We also discussed the major difference between the two workarounds:
  • 1a: the query sees the mutating table as it existed prior to the start of the main transaction.
  • 1b: the query sees the mutating table as it is being mutated by the triggering DML statement. Essentially this workaround behaves as if Oracle did not have a mutating table error at all.
Please never implement these workarounds. Especially 1b for the reason I will demonstrate in this post. And do you really want the overhead of going through a database link every time the row-trigger fires? And potentially have twice the number of database sessions in your database instance? I hope not.

Apart from this, you will have planted a booby trap in your application code: your code can now behave non-deterministically.


To discuss this we will investigate how our row-trigger behaves not in the case of a single row insert, but a multi-row insert. Given our business rule, no manager can be without a clerk in the same department, a multi-row insert that inserts both the manager as well as the necessary clerk as the first two employees for a new department in one go, is a valid insert statement that our trigger-based rule-enforcement code should allow.

Let's look at the behavior of our row-trigger using workaround 1b (loopback db-link) first. Suppose we have a table called EMP_STAGE. This (staging) table is holding employee-rows that are to be inserted into the real EMP table. Table EMP_STAGE currently holds two employees: a manager and a clerk both working in department 42.

Now see what our row-trigger does.


Our row-trigger does not allow this insert. Why? Well think about it. How is the statement 'insert into EMP select * from EMP_STAGE' executed? Particularly in what order are the selected from EMP_STAGE and then inserted into EMP? Apparently in the case above the manager was the first one that was selected and inserted into EMP. Which would then fire our row-trigger, which would then see the mutating table (via the loopback db-link) and not find the required clerk in department 24. Hence our row-trigger raised the business rule error. And the multi-row insert statement never reaches inserting the second row (the clerk), it is cancelled by the raise_application_error and rolled back.

But what if the rows happened to be processed in a different order? We can force this by simply adding an order-by clause to the subquery in the insert statement (we re-execute the insert and have added 'order by empno desc'):


Our two-row insert statement now executes succesfully!

So what happened in this case? Due to the order by clause we've ensured that employee 111 (the clerk) got inserted first. The row-trigger fires, sees it's not a manager and therefore nothing needs to be done. Next employee 110 got inserted. The row-trigger fires again, sees it's a manager, and executes the business rule check by calling our procedure. This procedure now queries the EMP table (via the loopback db-link) and it actually sees (!) the clerk that just got inserted by the same statement, and therefore executes successfully.

Now the order in which rows get inserted by a multi-row insert statement is (and rightfully should be) of no concern to you. Think of the subquery instead of being a simple 'select * from emp_stage', being a complex multi-table join: depending on current contents/sizes of the tables involved, statistics might differ which in turn might cause different execution plans for the insert statement, which in turn might change the order in which rows get inserted by a multi-row insert statement. If your code is not immune to this order then in effect your code executes in a non-deterministic fashion. And this is the essence of why ORA-04091 is your friend. It protects you against developing software that might work today, and not work tomorrow.

So how does this all work with the autonomous transaction workaround? Here's the same two-row insert for that scenario, executed without and with the forced order (EMP again has no employees in department 42).


As you can see both versions of the insert statements fail with this workaround. The explanation for this is simple. Since the query that checks for the necessary clerk is run from inside an autonomous transaction, all it sees is the EMP table as it existed prior to start of the insert statement: in which there were no employees for department 42. So in this case the 'workaround' is not a workaround at all: yes it prevents the mutating table error, but no it does not allow a valid insert-statement to execute successfully. Note that it even wouldn't allow two subsequent single row-inserts: first insert a clerk, then insert a manager. With this workaround you are forced to commit the clerk first, and only then insert the manager.

So, in summary:


Again, forget about these two red herrings: don't use them.

In the following posts we will start developing how to correctly work around the mutating table error. And as many of you probably know, it involves a combination of row and statement triggers.

In the meantime happy holidays!

Tuesday, December 13, 2011

"Workarounds" for ORA-04091

In the previous post, we demonstrated running into the mutating table error. The example involved the well-known EMP table and a business rule that we are trying to implement: all departments with a Manager should also employ a Clerk. We started our implementation for that rule by writing a straight-forward 'after-row-insert' trigger on the EMP table.


  • The row trigger detects whenever a Manager is inserted (remember, we can only detect this with a row trigger, not with a statement trigger);
  • It then calls a stored procedure to which it supplies the department number (deptno-column value) of the inserted Manager;
  • The procedure then queries the EMP table to see if a Clerk exists in this department;
    • If it finds no such Clerk, the procedure raises an application error, which causes the after-insert-row trigger to fail. This in turn prevents the insert of the Manager, since it would violate our business rule (the Manager insert will undergo a statement-level rollback due to the row-trigger failing);
    • If it finds such a Clerk, then the procedure finishes successfully, which causes the after-insert-row trigger to finish successfully, which causes the Manager insert to execute successfully.
A straight-forward 'use-case' of using a (row) trigger, right? Of course this setup didn't work, since the cursor in the stored procedure will always hit the mutating table error, thus preventing a Manager insert even if the necessary Clerk would indeed be present.

So, what's the workaround for this? Well there are a two I would like to discuss in this post: one of them is very 'popular', the other one is -- I guess -- not so well-known.

I cannot stress enough upfront right now, that both 'workarounds' are red herrings.

Workaround 1a: use the autonomous_transaction pragma inside the stored procedure.

If we add just one line at the top of the stored procedure, we can fix this baby (see red-arrow line below).


By ensuring that all code inside the P_CHECK_FOR_CLERK procedure will run as an autonomous transaction, we can prevent the mutating table error. Why? Because it causes Oracle to 'step out' of the current transaction context, in which EMP is currently 'mutating', and open a new transaction in which EMP is not mutating, and therefor can be queried without hitting the mutating table error. So all we need to do is add the PRAGMA line highlighted above. We do not need to add an explicit COMMIT (or ROLLBACK) at the end of the procedure, which is usually required to end an autonomous transaction, since the procedure doesn't really start a transaction at all: is just issues a query.

This is the popular workaround. Now for the lesser known workaround.

Workaround 1b: use a loopback db-link to query the mutating table.

By querying the EMP table to look for a Clerk, via a loopback database link, we can fix this baby too.


So we create a loopback database link (which we call loopback above) which connects back to our application owner schema. We then add the text '@loopback' to the table name (EMP) inside the FROM-clause of our query that was hitting the mutating table (second red arrow above). Done. We'll demonstrate in a minute that this 'fixes' our issue as well.

But before we do that you really need to understand a fundamental difference between these two workarounds.


In the preliminaries that we discussed in the previous post, one of the observations that was made about an autonomous transaction was that all code in such transaction cannot see any (uncommitted) changes made by the other (parent) transaction. So what this means for our example is that the cursor in the stored procedure (the one that tries to find a Clerk) which is executed by an 'after-row' trigger, will see the EMP table without the Manager that was inserted by our triggering insert statement.

Yet.

The cursor that uses the loopback database link, which is executed from within a new session created by the database link, *will* see the mutating table as it exists at that point-in-time in the transaction of the 'original' session (the one that issued the Manager insert). Why? Because Oracle detects that both sessions are now part of a distributed transaction, and the database link session will see the same database snapshot as the other session. Now this is pretty radical, I think. When I discovered this behavior I was stunned for a few moments: here we have a database session (OK, it's one that was created by a database link, but still) a session that is able to see uncommitted changes of a different session! I will show you a demo that proves this behavior in a minute.

Now wouldn't that be an in-depth (trick) question for some fancy certification exam:

"When can a database session see uncommitted changes of another session?"
a) Never.
b) Always, this is default behavior.
c) If it's using an autonomous transaction.
d) If the session was opened by a database link.

;-)

Now let's test both workarounds. Remember we are trying to insert a manager into department 42, in which no other employee works yet.



Et voilá: no more mutating table error. Our row-trigger + procedure work, they give us the business rule error.

So what's wrong with both approaches? Can you guess it? I'll give you a hint: think about multi-row inserts. Will they work given workaround 1a (the autonomous transaction) or workaround 1b (the db-link trick)?

I'll discuss this in the sequel, which should follow promptly.