FESTA handbook Guidelines for FOT Databases and Analysis Tools

8. Guidelines for Databases and Analysis Tools

As state-of-the-art FOTs have shown that different types of studies demand different data models and hardware specifications, this section does not describe a generic solution for all types of FOTs.

The following sections focus on databases for large FOTs, where thousands of hours of raw data are collected, possibly across different server locations. Such a considerable amount of data, especially if video data is also collected, will test hardware to the limit. A smaller FOT can still use the guidelines in this chapter and apply them to a less complex database.

Recommendations for data handling, especially in connection to data sharing, can be found in FOT-Net 2 D3.2.

When designing data analysis it is important to consider the following general recommendations:

• Create documentation of developed tools and processes in order to enable their use after the project ends (e.g. in a follow-up project).

• Provide a detailed test protocol covering all relevant test scenarios and potential risks.

• Make sure that the complete data analysis process (data processing and analysis tools) is developed before the piloting phase, so that all data analysis steps can be tested and assessed during piloting.

8.1 Database design and implementation

8.1.1 Preferred database models

The main challenge of an FOT database is to make thousands of driving hours manageable from a storage perspective and available for ad-hoc analysis. FOTs often have very specific demands, making it difficult to recommend one generic database model. However, storing the time history data in a single table has been found to be more efficient from the perspectives of analysis and database performance. There is of course also a need for metadata, user-added data (such as manual annotations) and events that give value to the raw data, but the time history data is the part that puts high demands on the software and hardware.

Strategy: a measure equals a column in a table in the database. In order to avoid costly join operations when performing analysis, it is also worth considering keeping the database as de-normalized as storage allows. Some data from the trips table, such as driver ID or vehicle ID, can also be included in the time history tables. From a storage or relational-database perspective this is not preferred, but for analysis it reduces the complexity of SQL queries and saves computing time.
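
As an illustration, a minimal SQL sketch of such a de-normalized time-history table is given below. The table and column names (time_history, time_s, speed_ms, brake_pressure) are hypothetical; in a real FOT they would follow the project's sensor matrix and naming conventions (see Section 8.3.3).

  -- One row per sample, one column per measure (hypothetical names).
  -- driver_id and vehicle_id are deliberately duplicated from the trips
  -- table so that typical analysis queries need no joins.
  CREATE TABLE time_history (
      trip_id        INTEGER       NOT NULL,
      time_s         NUMERIC(12,2) NOT NULL,  -- seconds from trip start, common sample rate
      driver_id      INTEGER,
      vehicle_id     INTEGER,
      speed_ms       REAL,                    -- vehicle speed [m/s]
      brake_pressure REAL,                    -- one column per additional measure
      PRIMARY KEY (trip_id, time_s)
  );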

8.1.2 Data filling in the database

Besides the sensor and video data collected by the DAS, other data needs to be transcribed into the database in order to prepare it for data analysis. The documentation of the data shall include both descriptions and formats of the data, as well as descriptions of the different metadata. This is especially important if the data is planned to be shared and re-used after the project; see FOT-Net 2 D3.2.

8.1.2.1 Trip ID and time data transcription into the database

Trip ID and time are cornerstone indices in the database design. This means that sensor data must be time-stamped when inserted in the database. State-of-the-art FOT relational databases use a common sample rate to ensure the validity of trip ID and time. If different sample rates are needed within the same FOT database, the different datasets should be organised into different tables. Furthermore, data with a frequency differing from the default one (e.g. 10 Hz) should be clearly marked as potentially incompatible with the main data. If such data needs to be joined with the main data, it is suggested that the data from the deviant table is extracted, resampled and inserted into a table with the common sample rate.
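
A possible way of keeping a deviant-rate dataset in its own table, and of joining it to the common 10 Hz time base when needed, is sketched below. The gps_1hz table and the join-by-truncated-timestamp approach are assumptions for illustration only; any real resampling method should be defined by the project.

  -- Hypothetical table for data logged at 1 Hz instead of the default 10 Hz.
  CREATE TABLE gps_1hz (
      trip_id   INTEGER NOT NULL,
      time_s    INTEGER NOT NULL,          -- whole seconds from trip start
      latitude  DOUBLE PRECISION,
      longitude DOUBLE PRECISION,
      PRIMARY KEY (trip_id, time_s)
  );

  -- Naive join onto the 10 Hz time base: each 1 Hz sample is repeated for
  -- all 10 Hz samples falling within the same second.
  SELECT th.trip_id, th.time_s, th.speed_ms, g.latitude, g.longitude
  FROM time_history AS th
  JOIN gps_1hz AS g
    ON g.trip_id = th.trip_id
   AND g.time_s  = FLOOR(th.time_s);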

8.1.2.2 Additional data

Additional data can be collected, especially if other actors, such as infrastructure, have a bearing on the hypotheses and performance indicators defined (see Chapters 4 and 5). This data (e.g., traffic status or road event information in the case of cooperative systems) is usually provided by external entities (e.g., Traffic Management Centres). Collection of this data is necessary to address the FOT needs. In this case, it is highly important to synchronise this data with the data collected in the trip (e.g., via the FOT time stamp). The data to be collected and its specific requirements should be defined according to Chapters 4 and 5.

8.1.2.3 Transitional data transcription into the database

Transitional data can be stored separately in tables that only contain data when transitions occur. Despite the potentially large storage overhead, the trend is to handle transition data in the same way as measure data to simplify analysis. Event data can be described as shortcuts or pointers to specific events within the database. It is up to each individual FOT to define what an event is and the algorithm that identifies it. Manual annotations are another way to create pointers to events in the database.
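
A hypothetical events table of this kind, holding pointers to time segments rather than the data itself, could look as follows (the names and the SERIAL type are PostgreSQL-flavoured assumptions).

  -- Each row is a pointer to a time segment within a trip; a point event
  -- simply has end_time_s equal to start_time_s.
  CREATE TABLE events (
      event_id     SERIAL PRIMARY KEY,      -- auto-incrementing ID (PostgreSQL)
      trip_id      INTEGER       NOT NULL,
      start_time_s NUMERIC(12,2) NOT NULL,
      end_time_s   NUMERIC(12,2) NOT NULL,
      event_type   TEXT          NOT NULL,  -- e.g. 'wiper_on', 'lane_change'
      source       TEXT                     -- 'trigger_algorithm' or 'manual_annotation'
  );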

8.1.2.4 Background data transcription into the database

There are two types of background data which should be stored in the database: 1) driver background data and 2) vehicle background data. Driver data should be stored in the driver table, but any data that identifies the driver should be kept securely and separately (see Chapter 3).
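
A minimal sketch of such background tables is shown below; all column names are assumptions. Note that only anonymous IDs are stored here, while the mapping to personal identity is kept in a separate, access-restricted store.

  -- Background tables keyed on anonymous IDs only (hypothetical columns).
  CREATE TABLE driver_background (
      driver_id    INTEGER PRIMARY KEY,  -- anonymous ID; no name or contact data
      birth_year   INTEGER,
      gender       TEXT,
      licence_year INTEGER
  );

  CREATE TABLE vehicle_background (
      vehicle_id  INTEGER PRIMARY KEY,
      make        TEXT,
      model       TEXT,
      model_year  INTEGER
  );
  -- The driver_id-to-identity mapping is stored outside this database,
  -- with restricted access (see Chapter 3).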

8.1.2.5 Subjective data transcription into the database

To reduce errors, automatic transcription of subjective data is always preferable. Transcription of audio/voice messages is recommended.

8.1.2.6 Time history data transcription into the database

Most of the data in a typical FOT is stored in time history tables. For database model 1 (see description above) it is important not to create tables with too many columns. When using database model 2 (see description above), specific database tools and functions could be considered, for example table partitioning, block compression and index pre-definition.
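
As an illustration of table partitioning and index pre-definition, a PostgreSQL-flavoured sketch is given below; the range boundaries are arbitrary assumptions, and block compression is engine specific and therefore not shown.

  -- Time-history table partitioned by ranges of trip_id (PostgreSQL 10+ syntax).
  CREATE TABLE time_history_part (
      trip_id  INTEGER       NOT NULL,
      time_s   NUMERIC(12,2) NOT NULL,
      speed_ms REAL
  ) PARTITION BY RANGE (trip_id);

  CREATE TABLE time_history_p0 PARTITION OF time_history_part
      FOR VALUES FROM (0) TO (10000);
  CREATE TABLE time_history_p1 PARTITION OF time_history_part
      FOR VALUES FROM (10000) TO (20000);

  -- Pre-defined index supporting the typical access pattern (per trip, by time).
  CREATE INDEX ON time_history_part (trip_id, time_s);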

8.1.2.7 Events/classifications transcription in the database

An FOT database can consist entirely of events if a triggered data collection approach is adopted (as opposed to continuous data collection; see Chapter 6). In other FOTs, where data collection is continuous, the ability to find and classify events of interest is of central importance. Classification and use of “events” (classified time periods) is an important aspect of FOT analyses (see Section 7.3.10). Some events are straightforward and simple to identify, for example hard braking defined as peak deceleration > 0.7 g, and may not need to be saved as a discrete or transition variable. However, many events require a considerable amount of effort to find and validate and are worth saving into a discrete-variable database or index to facilitate data query and analysis. Event pointers should be saved to speed up analyses and can be used in combination to describe more complex situations with multiple events (see Chapter 6).
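
The query below is a simplified sketch of such a trigger: it flags every sample where the longitudinal deceleration exceeds 0.7 g (about 6.87 m/s2) and stores it as a point event. The column name long_accel_ms2 is an assumption, and a real implementation would typically merge consecutive samples into a single event.

  -- Store hard-braking samples as point events (hypothetical column names).
  INSERT INTO events (trip_id, start_time_s, end_time_s, event_type, source)
  SELECT trip_id, time_s, time_s, 'hard_braking', 'trigger_algorithm'
  FROM time_history
  WHERE long_accel_ms2 < -6.87;   -- deceleration stronger than 0.7 g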

8.1.2.8 Adding tables to the database

Tables are initially created manually based on information in the sensor matrix (see Chapter 5). Any change to the database design must be documented thoroughly.

8.1.3 User data spaces and data sharing

It is vital to keep track of where data is created and manipulated. A copy of the collected raw data should be kept as original data in a read-only space to prevent accidental data loss. In addition, it might be that only the person responsible for data uploading can insert new data and only the FOT database owner can delete data. As the analysis work begins, analysts will need to store new data (coming from combination or processing of the raw data; see Chapter 6) in a private user space. If this new data is also relevant for other users of the database, a solution to share it in a project-internal space should be implemented. The approval process for sharing data should be described, and a meta description of the data is needed (such as the data origin and the function/method/algorithm applied to the data). Some of the data could be of public interest, in some cases at an aggregated level, and could therefore be exported or made accessible via a web interface. Other data, such as video and GPS, could be made available to researchers using more rigorous procedures. Sharing this data must be approved by all stakeholders in the FOT, preferably early on in the project. More detailed information on data sharing can be found in FOT-Net 2 D3.2.
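
One possible way to realise such user spaces is with database schemas and privileges, as in the PostgreSQL-flavoured sketch below; the schema names and the analyst role are assumptions.

  -- Separate spaces for original data, one analyst's private work,
  -- and approved project-internal shared results.
  CREATE SCHEMA raw_data;
  CREATE SCHEMA user_jsmith;
  CREATE SCHEMA project_shared;

  -- Analysts may read the original data but not change it.
  GRANT USAGE ON SCHEMA raw_data TO analyst;
  GRANT SELECT ON ALL TABLES IN SCHEMA raw_data TO analyst;

  -- The analyst may create and use objects in the private user space.
  GRANT USAGE, CREATE ON SCHEMA user_jsmith TO analyst;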

8.1.4 Hardware and storage

It is important to consider that the database server will rest inactive during data collection and then run at 100 % of its capacity during analysis. If the supporting organisation can provide flexible solutions, such as server virtualisation and/or clustering, the FOT study's access to the database can be prioritised when analysis is running. When the project ends, machine usage can be set to a minimum until a subsequent study needs the data.

Very fast and reliable disks can be used even with a limited budget. In most cases storage in some kind of disk cabinet, NAS (Network Attached Storage) or SAN (Storage Area Network) is most appropriate. A storage setup with some kind of RAID configuration should be considered, in order to be better prepared if a disk crashes or data blocks are corrupted. The database should use faster disks than the file server, and using a disk cache is recommended to increase the performance of the system.

8.1.5 Risk management

An FOT study can generate huge amounts of data, especially when video is used, and management must decide on the need for backup and the acceptable downtime for recovery of the FOT database. It is up to the steering committee of the study to have a documented backup policy and crash recovery strategy. Further, the backup strategy might need to vary over the lifecycle of the study (collection phase, analysis phase); if so, each phase and strategy should be documented. A disaster recovery strategy (for the case when both the local database and the backup hardware are destroyed) must also be in place, and there should be an offsite backup of the data.

8.2 Database and data storage implementation

8.2.1.1 Database

All data except video should be stored in a relational database.

The implementation must consider what to do when data from a sensor is lost. Various strategies can be employed: if a sensor gives no data, a NULL value can be inserted. Experience from state-of-the-art FOTs suggests that carrying the last known sensor value forward makes analysis easier. The problem of data that is actually not valid still has to be dealt with (see Chapter 6).
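
A hedged sketch of the "carry the last known value forward" strategy, expressed as a query on one hypothetical measure (speed_ms), is shown below; the LIMIT syntax is engine dependent.

  -- Fill gaps in speed_ms with the last known value within the same trip.
  SELECT trip_id,
         time_s,
         COALESCE(speed_ms,
                  (SELECT t2.speed_ms
                   FROM time_history AS t2
                   WHERE t2.trip_id = t1.trip_id
                     AND t2.time_s < t1.time_s
                     AND t2.speed_ms IS NOT NULL
                   ORDER BY t2.time_s DESC
                   LIMIT 1)) AS speed_ms_filled
  FROM time_history AS t1;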

8.2.1.2 Video data storage

A common way of storing video data in an FOT context is to store the video files on a file server and to store links to the video files in the database (to connect the videos to the trips).
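
A minimal sketch of such a link table is given below; the table and column names are assumptions.

  -- Links video files on the file server to trips in the database.
  CREATE TABLE video_files (
      trip_id    INTEGER NOT NULL,
      camera     TEXT    NOT NULL,   -- e.g. 'forward', 'driver_face'
      file_path  TEXT    NOT NULL,   -- path on the video file server
      size_bytes BIGINT,
      PRIMARY KEY (trip_id, camera)
  );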

For very large FOTs, the amount of video data can exceed the limits of file systems or storage appliances. This can cause extra complexity, for example additional logic (scripts) to present a single mount point, which is preferable from a data management point of view. Different file systems and appliances should be evaluated, for instance ZFS or equivalent for practically unlimited shares.

It is worth examining different video codecs; using an optimal codec can reduce the storage need significantly. The cost (mainly CPU) of re-encoding is very high, and a large FOT might need a high-performance cluster to complete it within an acceptable time frame.

8.2.1.3 Hardware

It is recommended to separate the database and video file server in order to configure the hardware individually. Outsourcing system operations is possible; however, the costs for network bandwidth, backup, and administration can be very high.

8.2.1.4 Distributed system at various locations

It is strongly recommended that the database is not distributed: use a single common database. For video storage, other options can also be considered (see section 5.2.4 in D2.2).

However, due to the location and size of the FOT, it might be necessary to establish a distributed solution for data storage. This is especially true when deploying an FOT in different countries with different parties responsible for local data. In this case, it is recommended to establish one central data server per FOT, in charge of gathering all data from the individual databases.

Connection to this central database should be guaranteed so that information can be easily transferred to and from it. A broadband IP connection with simultaneous access is then recommended.

Synchronisation between the central database and the individual databases should be considered, with automatic synchronisation at regular intervals. Manual synchronisation is also allowed, but only as a complement to automatic synchronisation.

8.2.1.5 Physical access

Physical access as well as the approval process for access to the hardware must be documented.

8.2.1.6 Logical access

Logical access, as well as the approval process for access to the database, must be documented. Role-based access is advised, where users assigned to a certain role obtain the corresponding access rights; this also applies to the supporting operating system. Any FOT must define the roles and permissions of the database (a minimal SQL sketch follows the list below). These roles can be:

  • Database administrator (DBA): Unrestricted access to the database.
  • FOT database owner: Unrestricted access to FOT database data and permitted to distribute role access to users.
  • Uploader: Allowed to insert and update data in the FOT database.
  • Analyst: Allowed to read data from the FOT database and to manipulate data in private user space.
  • Publisher: Permitted to insert/update/delete data in shared user space.
  • Web application: Permitted to read data from specific user space containing aggregated data.
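
A minimal, PostgreSQL-flavoured sketch of how some of these roles could be granted access is given below; the role, table and schema names are assumptions, and the DBA and web-application roles are omitted.

  CREATE ROLE uploader  LOGIN;
  CREATE ROLE analyst   LOGIN;
  CREATE ROLE publisher LOGIN;

  -- The uploader may insert and update FOT data, the analyst may only read it.
  GRANT INSERT, UPDATE ON time_history, events TO uploader;
  GRANT SELECT         ON time_history, events TO analyst;

  -- The publisher may maintain data in the shared user space.
  GRANT INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA project_shared TO publisher;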

8.2.1.7 Personal integrity and sensitive data

Driver data must be stored according to the access restrictions defined by the steering committee. In a collaborative study, some data may be classified as sensitive by one partner or even by a supplier of measurement equipment.

8.2.1.8 Private vs. public data

Private data should be kept in a private “user space” (database or schema), in order not to risk inadvertent confusion with original project data.

8.2.1.9 Backup

An FOT database backup strategy should be based on “acceptable downtime”. Off-site backups are mandatory for managing a disaster scenario. The majority of the data (video and raw data in the database) is never edited, and data mirroring should be sufficient for it. For data created in private, organisational or public user spaces, a daily backup strategy should be applied.

8.2.1.10 Video data (file server)

Please refer to section 5.2.9 in D2.2 for a list of potential standard backup solutions.

8.2.1.11 Database data

The backup policy must be based on the time it takes to recover data and the acceptable loss of data. Even though some studies may use the original logger data as backup, any private or published data created afterwards must have valid continuous backups.

8.2.1.12 Database acceptance

Before an FOT is launched the FOT database architecture should be reviewed by a system evaluator to ensure that all requirements are fulfilled and to verify policy documents.

8.3 Off-line quality management procedures

8.3.1 Quality assurance of objective data

8.3.1.1 Quality assurance before data is uploaded to database

Before uploading objective data from a vehicle, a well-defined algorithm should be applied to all the data in order to verify data consistency and validity.
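
As an example, a simple consistency check of this kind could run against a staging table before the data is moved into the main database; the staging table, column name and plausible speed range below are assumptions.

  -- Count implausible or missing speed samples per trip in a staging table.
  SELECT trip_id,
         COUNT(*) AS suspect_samples
  FROM staging_time_history
  WHERE speed_ms IS NULL
     OR speed_ms NOT BETWEEN 0 AND 70    -- plausible speed range [m/s]
  GROUP BY trip_id
  ORDER BY suspect_samples DESC;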

8.3.1.2 Quality assurance of video data

To catch problems with camera failure or other video related problems, a video checking strategy should be implemented. A tool for viewing one or several images per trip can be useful. Moreover, a function to verify at least the size of video files is necessary — the size is somewhat proportional to recording duration.
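
A hedged sketch of such a size check is shown below, assuming the hypothetical video_files table above and a trips table with a duration_s column; the threshold of 0.5 MB per second is an arbitrary assumption.

  -- Flag video files whose size is implausibly small for the trip duration.
  SELECT v.trip_id, v.camera, v.size_bytes, t.duration_s
  FROM video_files AS v
  JOIN trips AS t ON t.trip_id = v.trip_id
  WHERE v.size_bytes < 0.5 * 1024 * 1024 * t.duration_s;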

8.3.1.3 Driver ID verification/input

Again, it may be necessary to have a process that allows the analysts to view, for example, one image per trip and match this with the IDs of the drivers allowed to drive a specific vehicle. If a driver is unknown, then the data for a particular trip may have to be neglected. A software tool for doing this manual identification of drivers is preferable. Be advised that some eye trackers (if available) provide DriverID functionality.

8.3.2 Quality assurance of subjective data

In order to address the validity of the data, the formulation of the questions (and possible answers) is a key issue, especially when designing a questionnaire to be distributed to respondents. Questions must obviously be formulated in a clear and unambiguous way. In addition, questions should, for example, be specific, not too complicated, and formulated in simple terms that can be understood by the interviewee. Hypothetical questions are the most difficult and should be avoided.

Regardless of data source, missing data is a threat to quality (see Chapter 9). In the case of missing questionnaires, efforts must be made to ensure that data collection is as complete as possible, and reminders must be administered. Furthermore, the number of questions should be thought through and kept limited. In addition, the number of open questions should be kept as low as possible in order to reduce the effort required of respondents. The interviewer plays an important role when collecting data in an interview situation. Interviewer bias, that is, the influence of the interviewer on the respondents’ answers, can be avoided by administering a questionnaire instead. However, the interviewer may also increase the quality of the data collected, for instance by answering questions and by using probing questions.

8.3.3 Measures naming guidelines

It is recommended that the FOT project decides on and adheres to a set of naming conventions for measurements. The strategy used should be well documented and thoroughly enforced. Motivations for a clear naming convention include: 1) project-wide consistency, 2) clarity for direct understanding of the measures used in analysis, 3) differentiation of non-comparable measurements, and 4) avoidance of confusion.

When specific measurements are named, references to the following measure attributes are recommended: indicative name, associated source, sample rate, and any other FOT-specific descriptor. The compounds should be joined consistently to create a single word. Possible strategies are: “camel case” (SomeSignal), underscores (some_signal), or hyphens (some-signal). Depending on context and FOT-specific requirements, all or only a subset of the compounds can be used.

Examples: GroundSpeed_GPS_1Hz, GroundSpeedGps1Hz

The aim is to clearly understand what a measurement “is”, where it comes from, and how it relates to other measurements. To avoid the risk of making faulty comparisons, measurements that are non-comparable should be named differently.

8.3.4 Automatic pre-processing and performance indicator calculations during data upload

It is recommended to define procedures and implementation schemes for adding pre-processing and performance indicator calculations to the upload process (see Chapter 9). The results of these calculations should preferably be read-only for the users. The actual algorithms for the pre-processing and performance indicator calculations in this step have to be well defined and tested (for example on pilot test data), or based on previous experience. Since the estimation of some specific performance indicators may set specific requirements on the raw data (see Chapter 5), these constraints have to be taken into account when implementing the automatic pre-processing.
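
A simplified sketch of storing upload-time performance indicator results in a table that analysts can read but not modify is given below; the indicators (mean speed, strongest deceleration), the column names and the placeholder trip ID are assumptions.

  -- Per-trip performance indicators computed during upload.
  CREATE TABLE trip_indicators (
      trip_id       INTEGER PRIMARY KEY,
      mean_speed_ms REAL,    -- mean speed over the trip [m/s]
      min_accel_ms2 REAL     -- most negative longitudinal acceleration [m/s2]
  );

  INSERT INTO trip_indicators (trip_id, mean_speed_ms, min_accel_ms2)
  SELECT trip_id, AVG(speed_ms), MIN(long_accel_ms2)
  FROM time_history
  WHERE trip_id = 12345          -- the trip just uploaded (placeholder ID)
  GROUP BY trip_id;

  -- Analysts can read, but not modify, the computed indicators.
  GRANT SELECT ON trip_indicators TO analyst;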

8.4 Data analysis tools

The focus of this section is to describe analysis tools, not to describe analysis procedures or methods.

8.4.1 Data classification/transcription

The exact coding scheme/syntax for events (time segments) will vary widely across FOTs. However, the following features have been identified as important software functionalities:

  • organising or categorising subjects into groups and subgroups;
  • defining any set and structure of codes, and associating software buttons and keyboard keys to each category;
  • editing or updating the coding scheme;
  • defining events as either a state event (e.g. glance left, glance right) or a point event (e.g. stop light);
  • defining if state events are mutually exclusive or start/stop and set a default state;
  • defining if codes are nominal (e.g. road types) or rating scales (e.g. observer ratings of drowsiness);
  • defining if codes are compulsory or optional;
  • logging freely written comments created by a human analyst (no coding scheme); and
  • support for inter- and intra-rater reliability analyses.

8.4.2 Time history and event analysis with video

This section describes the basic functionality of tools for viewing numerical time history data and the associated environment sensing data which includes video data, map data (e.g. GPS), and traffic state data (e.g. radar).

8.4.2.1 Recommended functionalities for visualisation and interaction with data

  • replay single-participant data (numerical time-history data, video data, map data, and traffic state data) simultaneously. Multiple windows for the different plots and illustrations, which can be freely arranged and resized (often spread across multiple computer screens), provide maximal flexibility. Recommended visualisation functionality:
    • Video recordings synchronised with other raw data plots;
    • Continuous variables and performance indicators which can be plotted (and zoomed) on graphs;
    • General information (FOT reference, subject ID, event lists, etc.);
  • aggregate and visualise multiple participants’ data at once to compare flows of events.

8.4.2.2 Recommended functionality to support data analysis

  • database query functionality (e.g. SQL)
  • signal processing of numerical data (see also Chapter 9)
  • fully customizable mathematical computation, analysis, and algorithm development functionality, automatic or semi-automatic calculation of performance indicators, and application of trigger algorithms to find events of interest (e.g. lane changes, near crashes, jerks)
  • image processing of video data (e.g. machine vision algorithms to detect traffic signal status)
  • grouped analysis of data (e.g. scripts)
  • a function to export results to tabular formats or statistical packages.

A general recommendation for an analysis package is to use SQL software for database queries, mathematical analysis software for computation (such as Matlab), and common statistical software packages (such as SPSS). If huge datasets have to be analysed or more specific requirements exist, then more specialised or proprietary solutions may be necessary. SQL and some software tools may require a fairly high level of knowledge to use, so it may be advantageous to develop proprietary easy-to-use graphical user interfaces.

8.5 Database usage

Further use of the database in other FOTs can be considered. In this case, data availability should be addressed. Data acquisition should also take this possible further use of the database into account, so that additional data which might be useful for other FOTs can be selected and stored, reducing time and cost constraints for upcoming FOTs.