-
Notifications
You must be signed in to change notification settings - Fork 15
Best database
We continue to be plagued by problems to do with missing values in BEST Atlas, for purpose of calculating schedule and trip times in GTFS and for anything to do with bus schedules and timings.
In December 2011 we last attempted to process and import the BEST Atlas (November 2011 edition) by isolating UniqueRoutes with specific schedule (day of the week or holiday) types. We did not, however, import all the rows in Atlas and lost a great deal of important data for routing and scheduling.
See below the different columns in BEST Atlas and specific descriptions of the problems of missing values and problems encountered with the import scripts.
- Atlas Import http://code.camputer.org/chaloBEST/annotate/head%3A/chaloBEST/imports/import_atlas.py
- Atlas Fix Missing Data http://code.camputer.org/chaloBEST/annotate/head%3A/chaloBEST/imports/fix_missing_atlas_data.py and
- Many rows are missing From and To stops.
- Check with Sanjay how missing From and To stops are treated
- Should take the prior row's value for From stop or To stop.
- There seems to be a mistake in the way that FullRoutes have been imported from Atlas, where if there is more than one row where Span is the maximum, only one is imported.
- This has significant cascading effects when BEST Atlas is parsed into FullRoutes and UniqueRoutes in the ChaloBEST database. This must be urgently corrected.
- Route 259 (Oshiwara to Gorai Depot has same from and to stop) in UniqueRoutes. Route 523 Ltd has the wrong To stop (Dindoshi-Dindoshi).
- For 81 Ltd in in Atlas there are two schedules for "MS" schedule type, but the ChaloBEST database only has one.
- We hypothesise that a simplifying assumption was made at the time of import of UniqueRoutes that the combination of sub-route (unique from/to) and schedule type is unique. In fact, there can be multiple schedule types that share the same sub-route.
- This is instantiated in 81 Ltd where there are two schedules for "MS", one in morning-afternoon and another in afternoon-evening.
- As seen below, there are only three UniqueRoutes for 81LTD, whereas in Atlas there are five (see rows 314-318):
- Or for 174, there are nine rows for unique routes, but only two schedule types (MS&HOL and SUN), and in the ChaloBEST DB only two have been imported:
- Where ScheduleType is null, within each unique route/row, copy the values from the prior row where it is not null.
- Already done in Sanjay's earlier Atlas import. Check for odd strings like "FW", "AW", etc.
- Where RouteSpan is null, do not assume zero, calculate the cumulative distance between From stop and To stop in that row.
- Lookup to RouteDetails (foreign key to StopCode) and add up the Distance between the stops to get the correct distance.
- Where Runtime values are null, calculate the runtime using the speed and time of day and day of the week. To do this:
- Limit by time of day in FirstFrom, FirstTo, LastFrom, LastTo and corresponding Runtime column in Atlas for four runtimes: 7.00-11.00, 11.00-17.00, 17.00-20.00, or 20.00-7.00.
- Calculate the average speed of the full route which has a corresponding Runtime column for the time slab, and multiply that value by the RouteSpan for the route with missing values.
- Populate the missing Runtime values for *all* missing fields within a route, schedule and runtime slab recursively.
- Where Headway values are null, find the nearest headway time of day and day of the week. To do this:
- Limit by time of day in FirstFrom, FirstTo, LastFrom, LastTo and corresponding Headway columns in Atlas for five headways: 0.00-7.00, 7.00-11.00, 11.00-17.00, 17.00-20.00, or 20.00-0.00.
- Find the next nearest Headway value within the schedule type (time and day of the week).
- Populate the missing Headway values for *all* missing fields within a route, schedule and headway slab recursively.
In meetings with the BEST Traffic Planning Department in Nov and Dec 2011, we were able to clarify issues with the earlier datasets they shared with us (documented below) and request cleaner and more raw data. The BEST has also undertaken steps to correct the earlier problems of HardCodedRoutes which did not follow the earlier four-digit routecode logic (first three digits for Route Number and last for Route Type).
As of November 2011, they have now re-organised the RouteType, with unique numbers for nine different types of buses. They have also now sent us entirely new tables for RouteMaster and RoadMaster, and expanded columns inside StopMaster to denote roads and direction. Their new StopMaster dataset contains many more stops than before, as they now show stops between intersecting roads as per RoadMaster.
While the Master tables for Routes, Stops, Areas, Roads, and Fares have a clean logic and relations via codes and keys, the Atlas is revised more frequently (every three to four months) and comes in different versions. Atlas contains all the routes, but no keys to join via RouteCode in the Master and RouteDetails tables. AtlasRaw comes with RouteCodes, but with all the special AS, AC and Express Routes missing (the erstwhile HardCodedRoutes, see below).
The Master tables and RouteDetails tables are beautifully structured and make it easy for us to add new tables for Landmarks, add StopNames and AreaNames, draw AreaPolygons and RoadLines, etc. within a simple relational-spatial database schema. The problem is with matching and joining the data in Atlas and AtlasRaw with their respective RouteCodes and StopCodes, for which we propose to create a new table called AtlasNew (see diagram below).
To rectify these issues with transforming BEST's Atlas into a proper relational database, First we must join RouteCode used in the RouteMaster and RouteDetails tables to RouteAlias in Atlas. Secondly, we must match and join the full name text strings in the From and To fields in Atlas to StopCodes in StopMaster. Both of these will involve some amount of fuzzy text-matching.
1. To join RouteAlias in Atlas to RouteAlias in RouteMaster, we need to do a fuzzy match between the route number and suffix "Ltd", "R", "Exp", etc. in Atlas RouteAlias with the full form of this route in RouteMaster RouteAlias. We noted that in the latter master table there were variations of routes that are not noted in Atlas, which only shows one route number/type for all the rows with variations of the route. However certain routes like 2 have regular, limited and now AC services, so there are routecodes for 0020, 0021, and 0026.
2. Since route types in Atlas are only indicated as a text string RouteAlias, the only way to distinguish the different rows per route in Atlas is by their Span, since distances vary with different route variations. In RouteMaster every RouteCode has a unique Distance which allows us to distinguish between variations in that route. Also in RouteDetails Distance the sum of the values per RouteCode will also equal the RouteMaster Distance. If there is any value less than the maximum Atlas Span, then the route is a variation of the stereotypical full route, i.e. one where Atlas From is the same stop as the first StopCode for any route in RouteDetails and also where Atlas To is the same stop as the last StopCode in RouteDetails for any given route.
3. AtlasRaw is an alternative version of Atlas with different information, but which has RouteCodes for every RouteAlias. Using these two tables, we can perform the same task as in 1. and 2. above much more easily. However, the problem of missing 122 HardCodedRoutes from AtlasRaw remains. In Atlas all BEST routes are given, with a total number of 1755 routes and their variations. In AtlasRaw only 1603 routes and their variations are given.
4. When we last confronted this problem in July 2011, we found it was enough to copy and paste the RouteCode column from AtlasRaw into Atlas and manually enter the remaining 122 RouteCodes by hand, with reference to the HardCodedRoutes table where the remaining routes are indicated. So for the 122 routes leftover (assuming that the rest of the 1603 routes in Atlas and AtlasRaw match perfectly in 1. above) we can populate AtlasNew with the RouteCodes in HardCodedRoutes.
5. Dealing with variations in any full route is the second big problem of matching/replacing Atlas From and Atlas To with their respective StopCodes in StopMaster to produce our cleaner and better NewAtlas. We decided to try this by calculating distances.
6. Since we can assume that the full routes have the same Atlas Span as RouteMaster Distance and match them automatically with the StopCodes for Atlas From and Atlas To contained in RouteDetails, we are left with the others which are shorter versions of the full route.
7. In the latest StopMaster table from BEST we confirmed that the new column Chowki is a true/false binary which indicates if the StopCode is a route terminus or not. From an eyeball inspections there are only around 10% of StopMaster StopCodes which are true Chowki. This means there are only around 250-300 possible string matches for non-full routes in Atlas From and Atlas To which we must match to their StopCodes and put in NewAtlas.
8. AtlasRaw also contains a field called Primary for every "RouteCode", indicating if the route is 1 All Days, 2 for Mon-Fri, 3 for Saturdays, and 4 for Sundays/Holidays. This can be matched with Schedule in Atlas with some massaging of the data, since most variations of full routes are based on the day of the week. Comparison of Atlas with AtlasRaw may yield more insights in how to solve this second problem.
9. In processing Atlas as above into AtlasNew, the question arose as to how to treat the many fields which have null values. The rule is that only RouteCode/RouteAlias (subject to join), Depot and Schedule should inherit the previous value in their respective columns if they are null values.
10. All other fields, mainly Headway, should remain null as in the original Atlas. We noted that null values for Headway only occur in minor variations of the full route, for which BEST does not maintain rigourous schedules.
11. The Depot field in Atlas and StopMaster is a two or three letter code which corresponds to the terminus depots of every bus route. We have manually created a Depot lookup table which contains the depot's letter code, the full form of its name (i.e. Dharavi, Worli, Bandra, Magathana, Dindoshi, etc.) and the corresponding StopCode for the depot. These StopCodes were taken from StopMaster manually, and it appears that for several depots there are more than one StopCode. Dharavi Depot (DH) has four unique StopCodes, Gorai Depot (GRD), Magathane (MGT) and Kurla (KLD) each have two unique StopCodes.
12. We have also manually created lookup tables for Schedule which resolves the letter codes in Atlas to their numbered days of the week, with HOL given the number 8. We have also created a separate Holidays table with a list of the dates and names of state holidays from Maharashtra Govt Gazette in MM.DD.YY format.
The table below shows the schema and design for our integrated ChaloBEST database, which includes:
- BOLD for all primary and foreign keys for joining these tables -- RouteCode, StopCode, RoadCode, AreaCode, Schedule and Holiday:
- RED for the BEST Atlas and AtlasRaw, AreaMaster, StopMaster, RoadMaster, FareMaster, RouteMaster, RouteDetails provided by the BEST;
- ORANGE for fields in Atlas needed rectify the above problems and join RouteAlias to RouteCodes and From and To to StopCodes;
- GREY for redundant or currently unused fields in the original BEST datasets;
- YELLOW for all fields and lookup tables for timing (headway/frequency) and schedule calculation;
- PINK for all fields concerned with distance (span/km) calculation;
- GREEN for our new Area, Stop, Road and Landmark tables for Marathi, alternative names, storing location (point, line or polygon), and the slug for the permanent URL at http://chalobest.in/*;
- BLUE for spatial fields/geometries of lat-lon points for StopCodes, polygons for AreaCodes, and lines for RoadCodes;
Atlas | AtlasRaw | StopMaster | AreaMaster | RoadMaster | FareMaster | RouteMaster | RouteDetails | RouteTypes | HardCodedRoutes | AtlasNew | Schedule | Holidays | Depot | Stop | Area | Road | Landmark |
RouteAlias | RouteAlias | StopCode | AreaCode | RoadCode | Slab | RouteCode | RouteCode | RouteCode | RouteCode | RouteCode | Schedule | Day | DepotCode | StopCode | AreaCode | RoadCode | LandmarkName |
BusesAM | RouteCode | StopName | AreaName | RoadName | ORD | RouteAlias | StopSerial | RouteType | RouteAlias | RouteType | Days | Name | DepotName | StopNameMarathi | AreaNameMarathi | RoadNameMarathi | LandmarkMarathi |
BusesNoon | From | StopFlag | LTD | From | StopCode | FareType | FareType | FromStopCode | Slug | Slug | Slug | Slug | |||||
BusesPM | FirstFrom | Chowki | EXP | To | Stage | FirstFrom | AlternativeName | AlternativeName | AlternativeName | LandmarkLatLon | |||||||
RouteType | LastFrom | RoadCode | AS | Distance | Km | LastFrom | LatLonUp | AreaPolygon | RoadLine | ||||||||
Depot | To | AreaCode | ACEXP | Stages | ToStopCode | LatLonDown | |||||||||||
From | FirstTo | Depot | FirstTo | ||||||||||||||
FirstFrom | LastTo | LastTo | |||||||||||||||
LastFrom | AM | Span | |||||||||||||||
To | Noon | HeadwayUpTo7 | |||||||||||||||
FirstTo | PM | Headway7to11 | |||||||||||||||
LastTo | Schedule | Headway11to17 | |||||||||||||||
Span | Primary | Headway17to20 | |||||||||||||||
Runtime7to11 | Headway20toLast | ||||||||||||||||
Runtime11to17 | Schedule | ||||||||||||||||
Runtime17toEnd | Depot | ||||||||||||||||
RuntimeNight | Primary | ||||||||||||||||
HeadwayUpto7 | |||||||||||||||||
Headway7to11 | |||||||||||||||||
Headway11to17 | |||||||||||||||||
Headway17to20 | |||||||||||||||||
Headway20toLast | |||||||||||||||||
ReliefPoint | |||||||||||||||||
TravelTime | |||||||||||||||||
Schedule |
This is the design schema for our integrated database, which includes:
- The core Atlas, AreaMaster, StopMaster and Route tables provided by the BEST (coloured in RED);
- Supplementary tables for routes and scheduling to augment the BEST data, derived from their additional tables (coloured in YELLOW);
- New AreaLocation, StopLocation, and AreaName, StopName tables created by Gnowledge and CAMP for Marathi translations and alternative names (coloured in GREEN);
- Data contributed by MESN for storing landmarks, streets, and features in PostGIS for lat-lon coordinates for StopCodes and polygons for AreaCodes (coloured in BLUE);
- All primary and foreign keys for joining these tables -- RouteCode, AreaCode, StopCode, Schedule and Holiday (indicated as BOLD):
Atlas | StopMaster | AreaMaster | ExpressRoutes | AC-ExpressRoutes | AS-Routes | RegularRoutes | RouteType | HardCodedRoutes | Schedule | Holidays | StopLocation | AreaLocation | StopName | AreaName |
RouteCode | StopCode | AreaCode | RouteCode | RouteCode | RouteCode | RouteCode | RouteCode | RouteCode | Schedule | Holiday | StopCode | AreaCode | StopCode | AreaCode |
RouteAlias | StopName | AreaName | StopSerial | StopSerial | StopSerial | StopSerial | RouteType | RouteAlias | Days | Days | LatLonUp | LatLonPolygon | StopNameMarathi | AreaNameMarathi |
From | AreaCode | StageNo | StageNo | StageNo | StageNo | LatLonDown | ShortName | ShortName | ||||||
FirstFrom | DisplayName | StopCode | StopCode | StopCode | StopCode | AlternativeName1 | AlternativeName1 | |||||||
LastFrom | Length | Stage | Stage | Stage | Stage | AlternativeName2 | AlternativeName2 | |||||||
To | Km | Km | Km | Km | AlternativeName3 | AlternativeName3 | ||||||||
FirstTo | StopName | StopName | StopName | StopName | Street | |||||||||
LastTo | Landmark1 | |||||||||||||
RouteSpan | Landmark2 | |||||||||||||
RunTime7to11 | Landmark3 | |||||||||||||
RunTime11to17 | ||||||||||||||
RunTime17toEnd | ||||||||||||||
RunTimeNight | ||||||||||||||
HeadwayBefore7 | ||||||||||||||
Headway7to11 | ||||||||||||||
Headway11to17 | ||||||||||||||
Headway17to20 | ||||||||||||||
Headway20toLast | ||||||||||||||
ReliefPoint | ||||||||||||||
TravelTime | ||||||||||||||
Schedule |
Courtesy of Mr Datta Ghogare of BEST Traffic Planning Dept, on 10 May 2011 we received two excel files, "best_atlas" is the same as the earlier atlas.dbf and covers time and schedule for all routes, "best_route_data" has seven sheets, all eight tables are shown below. RouteCodes is a key to the different route codes, StopMaster is the earlier stopmast.dbf, AreaMaster is areamast.dbf, and these tables were earlier all in routedetl.dbf but are now separated by type: ExpressRoutes, AC-ExpressRoutes, AS-Routes, and Regular Routes.
Atlas | RouteCodes* | StopMaster | AreaMaster | ExpressRoutes | AC-ExpressRoutes | AS-Routes | RegularRoutes |
Route | RouteCodes | StopCode | A_Code (AreaCode) | RNO (RouteCode) | Route | Route | RNO (RouteCode) |
From | RouteType | StopName | AreaName | StopSr (StopSerial) | StopSerial | StopSerial | StopSr (StopSerial) |
FirstFrom | FareType | AreaCode | StageNo | StageNo | StageNo | StageNo | |
LastFrom | HardCodedRoutes | DisplayName | StopCD (StopCode) | StopCode | StopCode | StopCD (StopCode) | |
To | RouteAlias | Length | Stage | Stage | Stage | Stage | |
FirstTo | FareType | Km | Km | Km | Km | ||
LastTo | StopName | StopName | StopName | StopName | |||
RouteSpan | |||||||
RunTime7to11 | |||||||
RunTime11to17 | |||||||
RunTime17toEnd | |||||||
RunTimeNight | |||||||
HeadwayBefore7 | |||||||
Headway7to11 | |||||||
Headway11to17 | |||||||
Headway17to20 | |||||||
Headway20toLast | |||||||
ReliefPoint | |||||||
TravelTime | |||||||
ScheduleType |
- The table above called RouteCodes contains two tables which divide route codes into normal codes which follow a consistent logic, and HardCodedRoutes which have been added on an ad-hoc basis (see explanation of logic below):
RouteCodes | SpecialRoutes |
RouteCode | HardCodedRoutes |
RouteType | RouteAlias |
FareType | FareType |
- The original BEST Atlas spreadsheet above did not contain RouteCodes, by which the Atlas can be joined to the four route tables (AS, Express, AC-Express and Regular Routes). When we approached BEST for a revised Atlas, they sent us one without RunTime or Headway as in the original Atlas. However we were able to manually join the RouteCode in the new Atlas with the old one by simple copy and paste of the 1559 routes. The remaining 123 routes had to be coded manually using the HardCodedRoutes table in the SpecialRoutes table.
-
RouteCode (also named as RNO or Route in the four route tables) is the main key linking Atlas to the four route tables and always will contain four digits. The first three indicate the bus route number (if it is double digit the first digit will be 0). The last digit indicates the RouteType:
- 0 Ordinary
- 1 Limited
- 2 Ordinary Extra
- 3 Limited Extra
- 4 Ring Ordinary
- 5 Ring Limited
- 6 Express
- 9 AS Regular or AC Express
- Since the fourth digit of RouteCode always indicates type, we should always query RouteCode directly rather than the Route in Atlas which often has sufffixes like Ltd., Exp., R, etc.
- This works for 1559 routes which follow the route coding system explained above.
- It appears that since the introduction of so many new categories of bus, the old route coding system is being bypassed in favour of hard coding numbers for new routes on an ad hoc basis, hence the need for the HardCodedRoutes table. However except for these 123 hard coded routes, all other routes follow the above system.
This has been superseded and absorbed in our newly integrated database with spatial coordinates, landmarks, and Marathi translation. See database integration notes above.
Atlas | RouteType | HardCodedRoutes | Schedule | Holidays | StopMaster | AreaMaster | ExpressRoutes | AC-ExpressRoutes | AS-Routes | RegularRoutes |
RouteAlias | RouteCode | RouteCode | Schedule | Holidays | StopCode | AreaCode | RouteCode | RouteCode | RouteCode | RouteCode |
From | RouteType | RouteAlias | Days | StopName | AreaName | StopSerial | StopSerial | StopSerial | StopSerial | |
FirstFrom | AreaCode | StageNo | StageNo | StageNo | StageNo | |||||
LastFrom | DisplayName | StopCode | StopCode | StopCode | StopCode | |||||
To | Length | Stage | Stage | Stage | Stage | |||||
FirstTo | Km | Km | Km | Km | ||||||
LastTo | StopName | StopName | StopName | StopName | ||||||
RouteSpan | ||||||||||
RunTime7to11 | ||||||||||
RunTime11to17 | ||||||||||
RunTime17toEnd | ||||||||||
RunTimeNight | ||||||||||
HeadwayBefore7 | ||||||||||
Headway7to11 | ||||||||||
Headway11to17 | ||||||||||
Headway17to20 | ||||||||||
Headway20toLast | ||||||||||
ReliefPoint | ||||||||||
TravelTime | ||||||||||
Schedule |
- In the revised schema we have split the RouteCodes table into RouteType and HardCodedRoutes. We have discarded the FareType data as it is not needed.
- Note that all RouteCodes are four digits, where the first three digits are the bus route number (except for HardCodedRoutes) and the fourth digit of both RouteCode and HardCodedRoute always corresponds RouteType.
- HardCodedRoutes is an anomaly in the BEST's own route coding logic since the first three digits are just an internal ID, hence the need for a lookup table.
-
Schedule and Holiday will always be the first to check in any query depending on the day of the week when the SMS is received.
- Schedule decodes the days of the week, numbered from 1-7 for Monday to Sunday, and "H" for holidays.
- The fixed dates for state holidays in 2011 taken from Maharashtra Govt Gazette are in MM.DD.YY format.
- In this revised Atlas, we have renamed Route as RouteAlias to avoid confusion.
- Also note the following fields have been renamed as they are keys for linking between tables:
- A_Code as AreaCode.
- StopSr as StopSerial and StopCd as StopCode.
- RNO and Route as RouteCode.
In the workshop on 30 April 2011 we explored mapping the schema from the BEST database to the models used by http://busroutes.in in the hope of borrowing their interface for developing a web and OpenStreetMap frontend to the service.
- This is the data model used by http://busroutes.in which is based on scrapers for the Chennai and Delhi Transit websites.
- They have lat-lon coordinates for every Stage/Stop, which we do not have. Without this we cannot use OpenStreetMap.
- Their model for Stage (what they call a Stop) is at https://github.com/yuvipanda/wtfimb/blob/master/stages/models.py
- Their model for Routes and RouteStage is at https://github.com/yuvipanda/wtfimb/blob/master/routes/models.py
- This schema is considerably less robust than we require to incorporate all the data provided to us by BEST.
- An asterisk (*) denotes redundant or anomalous objects in the schema or categories which can be reworked (see below).
Stage | Routes | RouteStage |
---|---|---|
display.name | display.name | route |
lat | slug | stage |
lon | mtc_name* | sequence |
location* | types | |
mtc.name* | start | |
importance | end | |
is.terminus | stages | |
softlinks | time | |
city | fare | |
type | ||
city |
This is the mapping of the BEST to busroutes.in schemas we devised in the workshop on 30 April 2011. We will have to rework this schema to bring in the BEST data which was lost in the import to the BusRoutes model, or design a new model ourselves. We decided in July 2011 against any such attempt and will build our own web frontend.
Stage | Routes | RouteStage |
---|---|---|
display.name = stopname | display.name = route | route = route_code |
lat = null | slug = assigned by django | stage |
lon = null | mtc_name = area_code | sequence |
location = null | types = "b" (constant) | |
mtc.name = area_code | start = from/stopcode | |
importance = null | end = to/stopcode | |
is.terminus = null | stages = link via RouteStage | |
softlinks = null | time = null | |
city = "mumbai" (constant) | fare = null | |
type = "b" (constant) | ||
city = mumbai (constant) |
This section is very old and is now deprecated. Only reproduced here for reference.
These are the four linked tables provided by the BEST Traffic Planning Dept as of August 2005, which we imported into postgresql database in the 30 December 2010 workshop, and have used for all testing and queries up to May 2010 when we received new data from BEST.
- Note that the keys linking the four tables are area code, route code, and stop code. Area codes are spelled both as "area_code" and "areacode", so are stop codes spelled both as "stop_code" and "stopcode" in different tables.
- Also note that routes include both the bus route number and suffixes such as "LTD.", "EXP", for special buses. All routes of the same number share a route code but have different route numbers. We need a separate linked column for "types" (see below) to denote these special busses (as also "AC", "AS", and other new prefixes since 2005).
areamast.dbf | routdetl.dbf | stopmast.dbf | atlas.dbf |
---|---|---|---|
area_code | rout_code | stopcode | routecode |
area_name | stop_sr | stopname | route |
stop_code | areacode | from | |
stage_code | areaname | first_from | |
kms | last_from | ||
area_code | to | ||
first_to | |||
last_to | |||
routedist | |||
run_time1 | |||
run_time2 | |||
run_time3 | |||
run_time4 | |||
frequency1 | |||
frequency2 | |||
frequency3 | |||
frequency4 | |||
schedule |