Thursday, 8 March 2007
Dispatch-Merge Database Pattern
I am currently using PostgreSQL to store intrusion detection logging information. For my purposes it is useful to process it on a daily basis but older data is of no use. The difficulty I have is that due to the high volume of data (several million records per day), the delete+vacuum process tends to takes an unworkably long period of time and has a very negative impact upon the logging and analysis process.
This describes a construction that I have found useful in maintaining a running system. The basic idea is to split table T into series of tables and then round-robin dispatch events based on time. In the example, I use two tables T_0 and T_1 and use a trigger to add events destined for T into one of T_0 or T_1 based on the day. In that day that one of the split tables _0 is being written to, I can TRUNCATE the other table thereby avoiding the cost of delete and vacuum. For convenience, I use a VIEW to merge the T_i.
The following is an implementation PostgreSQL:
-- -- execute these two commands as postgres user
-- CREATE FUNCTION plpgsql_call_handler() RETURNS language_handler AS '\$libdir/plpgsql' LANGUAGE C;
-- CREATE TRUSTED PROCEDURAL LANGUAGE plpgsql HANDLER plpgsql_call_handler;
CREATE SCHEMA events;
SET search_path TO "events";
CREATE SEQUENCE "seq_event";
CREATE TABLE event (
seqid integer DEFAULT nextval('seq_event') NOT NULL,
src INET,
description TEXT,
time TIMESTAMP,
PRIMARY KEY (seqid));
-- copy the event database structure to event_0 and event_1
-- NB should actually use create table to keep primary key but
-- want the write-up to remain short
SELECT * into event_0 from event limit 0;
SELECT * into event_1 from event limit 0;
-- a merged view of event_0 and event_1
CREATE VIEW event_merge AS SELECT * FROM event_0 UNION SELECT * from event_1;
-- dispatch function returns NULL so that nothing ever ends up in event
CREATE OR REPLACE FUNCTION dispatch_event () RETURNS TRIGGER AS $$
DECLARE
mydest INTEGER;
BEGIN
-- doy is day of year so that the destination table alternates day by day
-- thus one can truncate the table not being inserted into
mydest := (date_part('doy',NEW.time)::integer)&1;
IF mydest = 0 THEN
INSERT INTO event_0 SELECT NEW.*;
ELSE
INSERT INTO event_1 SELECT NEW.*;
END IF;
RETURN NULL;
END;
$$ LANGUAGE 'plpgsql';
CREATE TRIGGER dispatch_event BEFORE INSERT ON event FOR EACH ROW EXECUTE PROCEDURE dispatch_event ();
Some data to try it out...
-- some test data
INSERT INTO event ("src","time","description") VALUES ('127.0.0.1'::inet, now(),'event 0');
INSERT INTO event ("src","time","description") VALUES ('127.0.0.1'::inet, now(),'event 1');
INSERT INTO event ("src","time","description") VALUES ('127.0.0.1'::inet, now(),'event 2');
INSERT INTO event ("src","time","description") VALUES ('127.0.0.1'::inet, now(),'event 3');
-- wait 24 hours or...
INSERT INTO event ("src","time","description") VALUES ('127.0.0.1'::inet, now()+interval '24 hours','event 4');
INSERT INTO event ("src","time","description") VALUES ('127.0.0.1'::inet, now()+interval '24 hours','event 5');
INSERT INTO event ("src","time","description") VALUES ('127.0.0.1'::inet, now()+interval '24 hours','event 6');
INSERT INTO event ("src","time","description") VALUES ('127.0.0.1'::inet, now()+interval '24 hours','event 7');
