Skip to content

BEST Data Processing

Shekhar Krishnan edited this page Jun 30, 2015 · 5 revisions

BEST Database:

The Database comprises of 6 tables namely:

  • STOP MASTER: It has STOPCD(Serial No.), STOPNM(Stop Name), STOPFL, ROADCD(Road Code), A_CODE(Area Codes)

  • ROAD MASTER: It has ROADCD(Road Code) and ROADNM(Road Name)

  • AREA MASTER: It has A_CODE(Area Code) and AREANM(Area Name)

  • ROUTE DETAILS: It has RNO(Its a 4 digit code denoting the different Routes), STOPSR(Stop Sequence for a particular Route No.), STOPCD(Stop Codes), STAGE(It has true or false value depending on whether the fare is increased from that stop onwards or not) and KM(has 0.0 value where stage is false and non zero if otherwise)

  • **ROUTE MASTER:**It has ROUTE(a 4 digit code denoting the different Routes), ROUTE ALIAS(Describes the various types of buses. eg. LTD, LTD EXTRA, AC, EXTRA .etc.), FROM(First Stop of the Route), TO(Last Stop of the Route), DISTANCE(Maximum span of the Route from the ‘FROM’ Stop to the ‘TO’ Stop) and STAGES(Total no. of Stages in the Route)

  • ROUTE ATLAS:

  • It has ROUTE_NO (The description of the various bus types i.e. Alias),

  • DEPOT(Refers to the Bus Depot),

  • FROM(First Stop of the Route),

  • FIRST BUS(The time for the arrival of the first bus at the ‘From’ Stop),

  • LAST BUS(The time at which the last bus leaves the ‘From’ Stop),

  • TO(The Last Stop of the Route),

  • FIRST BUS(The time for the arrival of the first bus at the ‘To’ Stop),

  • LAST BUS(The time at which the last bus leaves the ‘To’ Stop),

  • ROUTE SPAN(The span of the Route in Kms),

  • RUNNING TIME(It is divided into 4 sub columns-the running time corresponding to 7am-11am,11am-5pm,5pm-8pm and 8pm onwards),

  • HEADWAY(It denotes how frequently the buses arrive at the Stop. It is further divided into 5 sub columns-upto 7am, 7am-11am, 11am-5pm, 5pm-8pm and 8pm onwards) and

  • SCH TYPE(The days on which the corresponding Routes run)

NOTE: Each row denotes the up trip as well the down trip i.e. trip from the ‘From’ Stop to the ‘To’ Stop and trip from ‘To’ Stop to the ‘From’ Stop. We haven’t incorporated the BUSES(Am,Noon,Pm),BUS TYPE,RELIEF POINT,TRAVEL TIME columns because we don’t require that information to convert the data into GTFS.

Pre-Filling Missing Data in the BEST Database

As there was a lot of missing entries in route atlas so we filled the missing values by writing code in c++. This takes input column number of schedule column in the corresponding file.

atlas.cpp

This code copies down the schedule i.e. the last column in atlas.

This takes input the column number of route number,from stop and to stop.

atlas_again.cpp

this code copies down the from and to stops in route_atlas. here the logic used is if we are at a particular route number and we find that either from or to column or both are missing in a row and there is a previous row which have same route number as the current row and also has from and to stop so we copy it down.

for this code to work route number should be the first column in your file.

route_no.cpp

This code copies the route number in route_atlas.

headway_again.cpp

This adds a dummy column named id to the route atlas as the last column of route atlas.the entries in this column are row numbers of atlas. in first row id=1,in second id=2 and so on. this was added to uniquely identify a row as well as order the atlas as when we apply queries on atlas table in postgresql the order doesn't remain same so to order them we added the column id.

For the missing values of the headway columns there postgreSQL queries written.Now there were two cases:

  1. One where a particular headway column for a given route_no have headway given but for another row for same route_no and same from and to stops headway had been missing, for these values the above headway values were just copied down.
  2. For the rest of them we took the maximum of the non null headway columns belonging to the same route_no group and copied them down as per the suggestion of the BEST officer.

Our Project makes use of POSTGRESQL Database and C++ language for the implementation. One of the major advantages of using a PostgreSQL database is that it has native programming interfaces for languages like python,C,C++ etc.

Data Validaton

After studying the data in the database, we realised that a lot of human errors were introduced in the excel sheets. Also the correct data types of the respective columns needed to be matched. Hence data validation was essential to remove the errors. So there were a few queries for pre-data validation. Some of the instances are :

#select span from route_at where span like ('%.%.'); #select first from route_at where first similar to '%([a-z]|[A-Z])%';

Joining RouteAtlas with RouteMaster

The given data as explained above was divided into 6 tables which had to be properly joined for the respective queries to be executed.The first task was to generate the route_id (route_code's) of all the entries of route_atlas.

For this we had to match the route_alias attribute of route_master with route_no attribute of route_atlas along with fuzzy match the from and to stops of route_atlas with those of route_master. We had to do this as to create the final gtfs we had to include the route_id's for all the entries of route_atlas.

For fuzzy-matching we tried for fuzzywuzzy library of python which by means of ratio,partial ratio and token sort ratio used to match the strings partially.But as there couldn't be a proper threshold defined for the ratio thus we sorted out for other means.

In postgreSQL there is an extension known as fuzzystrmatch which performs the desired task only.We considered Soundex algorithm for fuzzy-string matching.

SOUNDEX

Soundex is a phonetic algorithm for indexing names byb sound, as pronounced in English.The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

Soundex consist of two functions: Soundex and Difference. While the soundex method produces the 4 digit encoded form of the given string the difference method takes two strings as input and return the number of matches between the 4 digit codes of the respective strings.

PostgreSQL incorporates soundex in its fuzzstrmatch extension.But there is one lacking as soundex completely ignores the numerals.So we rewrote the code for soundex and difference methods for incorporating numbers too.

The code for soundex method is as follows (written in postgreSQL):

CREATE OR REPLACE FUNCTION soundex(input text) RETURNS text IMMUTABLE STRICT COST 500 LANGUAGE plpgsql AS $$ DECLARE soundex text = ''; char text; symbol text; last_symbol text = ''; pos int = 1; BEGIN WHILE length(soundex) < 4 LOOP char = upper(substr(input, pos, 1)); pos = pos + 1; CASE char WHEN '' THEN -- End of input string IF soundex = '' THEN RETURN ''; ELSE RETURN rpad(soundex, 4, '0'); END IF; WHEN 'B', 'F', 'P', 'V' THEN symbol = '1'; WHEN 'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z' THEN symbol = '2'; WHEN 'D', 'T' THEN symbol = '3'; WHEN 'L' THEN symbol = '4'; WHEN 'M', 'N' THEN symbol = '5'; WHEN 'R' THEN symbol = '6';

The code for incorporating the numbers was included as follows: WHEN '1' THEN symbol = '!'; WHEN '2' THEN symbol = '@'; WHEN '3' THEN symbol = '#'; WHEN '4' THEN symbol = '$'; WHEN '5' THEN symbol = '%'; WHEN '6' THEN symbol = '^'; WHEN '7' THEN symbol = '&'; WHEN '8' THEN symbol = '*'; WHEN '9' THEN symbol = ''; WHEN '0' THEN symbol = '~'; ELSE -- Not a consonant; no output, but next similar consonant will be re-recorded symbol = ''; END CASE;

`IF soundex = '' THEN`
  `-- First character; only accept strictly English ASCII characters`
  `IF char ~>=~ 'A' AND char ~<=~ 'Z' OR char ~>=~ '0' AND char ~<=~ '9' THEN`
    `soundex = char;`
    `last_symbol = symbol;`
  `END IF;`
`ELSIF last_symbol != symbol THEN`
  `soundex = soundex || symbol;`
  `last_symbol = symbol;`
`END IF;`

END LOOP;

RETURN soundex; END; $$;

And the corresponding code of difference method using the above soundex code:

CREATE OR REPLACE FUNCTION difference(input1 text,input2 text) RETURNS int COST 500 LANGUAGE plpgsql AS $$ DECLARE diff int =0; char1 text= soundex(input1); char2 text= soundex(input2); BEGIN if substr(char1,1,1) = substr(char2,1,1) then diff=diff+1; end if; if substr(char1,2,1) = substr(char2,2,1) then diff=diff+1; end if; if substr(char1,3,1) = substr(char2,3,1) then diff=diff+1; end if; if substr(char1,4,1) = substr(char2,4,1) then diff=diff+1; end if; RETURN diff; END; $$;

This was the code for postgreSQL for soundex and difference methods.

Once the methods were ready all we had to do was to apply them on route_atlas and generate the corresponding route_id's for all the entries.

Generating Route IDs for Entries in Route_Atlas

There were a set of queries which were applied to generate the route_id's and there by map all the entries of route_atlas with the corresponding values of route_master.Following step by step one can easily map all the entries:

NOTE:For the schema of the corresponding tables please refer to previously mentioned context.

1.First route_alias attribute of route_master was fuzzy matched with the corresponding rfromsoute_no attribute of route_atlas along with matching the from stops of both the tables and to stops of both the tables along with from of master with to of atlas and to of master with from of atlas so that all the cases are incorporated:

update atlas set route_id = rm.route_cd from rmaster rm where difference(rm.route_alias,atlas.route_no)>3 and ((difference(rm.froms,atlas.froms)>2 and difference(rm.tos,atlas.tos)>2)or(difference(rm.froms,atlas.tos)>2 and difference(rm.tos,atlas.froms)>2));

2.Then the ring routes whose route_codes's end with 4 at the end generally and have a 'R' suffix in the route_no attribute of route_master, were matched explicitly.This is the following query:

update atlas set route_id = rm.route_cd from rmaster rm where (atlas.route_no like '___R') and (rm.route_alias like '%RING%' or rm.route_cd like '___4') and substr(atlas.route_no,1,3) like substr(rm.route_cd,1,3);

3.After this the extra and extra-limited routes were matched explicitly by means of following query:

update atlas set route_id = rm.route_cd from rmaster rm where (rm.route_cd like '___2' OR ROUTE_CD LIKE '___3' OR ROUTE_ALIAS LIKE '%EXTRA%') and substr(atlas.route_no,1,2) like substr(rm.route_alias,1,2) and ((difference(rm.froms,atlas.froms)>2 and difference(atlas.tos,rm.tos)>2) or (difference(rm.froms,atlas.tos)>2 and difference(atlas.froms,rm.tos)>2));

4.Now once this had been done for the rest of the data route_no attribute of route_atlas was fuzzy matched with route_alias of route_master and one of the from stops or to stops were matched to incorporate the rest of the data as follows (note, this applied to only those values which had yet not updated):

UPDATE atlas SET ROUTE_ID = RM.ROUTE_CD FROM RMASTER RM WHERE difference(ROUTE_NO,ROUTE_ALIAS)>3 AND (difference(atlas1.froms,RM.FROMS)>2 OR difference(atlas1.tos,RM.TOS)>2) AND ROUTE_ID ISNULL;

5.Finally those rows where route_alias matched exactly with route_no of atlas and where from and to stops were both not null simultaneously were included as follows:

update atlas set route_id=route_cd from rmaster where route_alias=route_no and froms is not null and tos is not null and route_id isnull;

6.Finally the rest of the rows remaining (around 30) were manually added.

7.Lastly the tuples where the from and to stops were all not given and even route_id was not there were not incorporated in the database.

delete from atlas1 where ROUTE_ID ISNULL and froms isnull and tos isnull;