Laboratory for Smart City and Spatial Big Data Analytics


Urban Spatiotemporal Big Data Analytics

Spatiotemporal big data analytics is an emerging technology that employs various kinds of spatial big data (e.g., location-based social network data, GPS-associated documents, cellphone data and Wi-Fi data) for geographical modelling, analysis and mining. Our team is devoted to improving the reliability of these analytics and applying the resulting knowledge to real-world scenarios.

Pattern mining and knowledge discovery from spatiotemporal data

Pattern mining extracts patterns from data that reveal implicit regularities, abnormalities and other interactions. The resulting patterns ultimately lead to knowledge for understanding research subjects and making practical decisions. We develop spatiotemporal pattern mining methods that improve the reliability of both the mining procedure and the resulting knowledge. The algorithms we have focused on include spatiotemporal association rule mining (SARM) and sequential pattern mining from human movement trajectories.

Crisp-fuzzy SARM. SARM extracts patterns in the form of rules such as “X → Y” from spatial data. For example, the rule “(house) near water & young age → high unit price” means that a house that is close to a waterscape and was built recently tends to have a high unit price. The usefulness of the resulting rules is evaluated with quantitative rule interestingness measures (RIMs), and statistical testing is a key technique for avoiding spurious rules, i.e., rules that occur purely by chance rather than reflecting real-world associations between the features of entities.
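As a minimal illustration of how a rule of this form can be scored, the sketch below computes three common RIMs (support, confidence and lift) for the house-price rule. The records, the field names and the choice of these three measures are assumptions made for illustration; the text does not say which RIMs are used.

```python
# Toy house records; every value here is invented for illustration only.
houses = [
    {"near_water": True,  "young": True,  "high_price": True},
    {"near_water": True,  "young": True,  "high_price": True},
    {"near_water": True,  "young": False, "high_price": False},
    {"near_water": False, "young": True,  "high_price": False},
    {"near_water": False, "young": False, "high_price": False},
    {"near_water": True,  "young": True,  "high_price": False},
    {"near_water": False, "young": False, "high_price": True},
    {"near_water": False, "young": True,  "high_price": False},
]


def support(records, items):
    """Fraction of records in which every item in `items` holds."""
    hits = sum(all(r[i] for i in items) for r in records)
    return hits / len(records)


def rule_measures(records, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    supp_x = support(records, antecedent)
    supp_y = support(records, consequent)
    supp_xy = support(records, antecedent + consequent)
    confidence = supp_xy / supp_x   # assumes the antecedent occurs at least once
    lift = confidence / supp_y      # > 1 suggests a positive association
    return {"support": supp_xy, "confidence": confidence, "lift": lift}


measures = rule_measures(houses, ["near_water", "young"], ["high_price"])
print(measures)
```

A lift above 1 on such a small sample can easily arise by chance, which is why a significance test is needed before a rule is accepted.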

We have proposed crisp-fuzzy SARM, a novel SARM method that enhances the reliability of the resulting rules. The method first prunes dubious rules using statistically sound tests and the crisp supports of the patterns involved, and then evaluates the RIMs of the accepted rules using fuzzy supports. Crisp-fuzzy SARM enhances the reliability of SARM results in three respects: a) it increases the number of significant rules by 50% or more compared with conventional fuzzy SARM; b) it keeps the probability that the entire result contains any spurious rule (i.e., the familywise error rate) below an arbitrary user-specified level, e.g., 5%; c) it avoids the large positive errors in RIM values committed by crisp SARM. The method has been applied to investigating the locational factors of popular business sites using social media data.

Crisp-fuzzy spatiotemporal association rule mining for business site evaluation.
Left: map of the study area. Right: top-5 rules (locational factors) for successful food business sites.
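The two-stage idea behind crisp-fuzzy SARM (test on crisp supports, then score with fuzzy supports) can be sketched roughly as follows. This is not the published implementation: the one-sided Fisher exact test, the Bonferroni correction, the 0.5 membership cut and the min-based fuzzy support are all assumptions chosen for illustration, and the records are invented.

```python
from math import comb


def fisher_right_tail(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    a = records with X and Y, b = X without Y, c = Y without X, d = neither.
    """
    n, row_x, col_y = a + b + c + d, a + b, a + c
    tail = sum(comb(row_x, k) * comb(n - row_x, col_y - k)
               for k in range(a, min(row_x, col_y) + 1))
    return tail / comb(n, col_y)


def fuzzy_support(records, items):
    """Mean over records of the minimum membership degree among `items`."""
    return sum(min(r[i] for i in items) for r in records) / len(records)


def crisp_fuzzy_rule(records, antecedent, consequent,
                     alpha=0.05, n_tests=1, cut=0.5):
    # Stage 1: crisp counts (membership >= cut) feed the significance test.
    has_x = [all(r[i] >= cut for i in antecedent) for r in records]
    has_y = [all(r[i] >= cut for i in consequent) for r in records]
    a = sum(x and y for x, y in zip(has_x, has_y))
    b = sum(x and not y for x, y in zip(has_x, has_y))
    c = sum(y and not x for x, y in zip(has_x, has_y))
    d = len(records) - a - b - c
    p_value = fisher_right_tail(a, b, c, d)
    if p_value > alpha / n_tests:   # Bonferroni bound on the familywise error
        return None                 # rule pruned as possibly spurious
    # Stage 2: interestingness of the accepted rule from fuzzy supports.
    supp_xy = fuzzy_support(records, antecedent + consequent)
    supp_x = fuzzy_support(records, antecedent)
    return {"p_value": p_value, "fuzzy_support": supp_xy,
            "fuzzy_confidence": supp_xy / supp_x}


# Invented membership degrees in [0, 1] for 20 houses.
records = ([{"near_water": 0.9, "young": 0.8, "high_price": 0.7}] * 8
           + [{"near_water": 0.1, "young": 0.2, "high_price": 0.1}] * 10
           + [{"near_water": 0.9, "young": 0.8, "high_price": 0.2}]
           + [{"near_water": 0.2, "young": 0.9, "high_price": 0.8}])

result = crisp_fuzzy_rule(records, ["near_water", "young"], ["high_price"],
                          n_tests=10)
print(result)
```

Dividing `alpha` by the number of rules tested is the simplest way to bound the familywise error rate; it is conservative, and the tests actually used by the method may differ.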

Differential Evolution algorithm for mining SIGnificant Fuzzy Association Rules (DESigFAR). The DESigFAR algorithm we proposed uses differential evolution (DE), one of the best-performing evolutionary algorithms (EAs), to mine optimized and statistically significant fuzzy association rules. DESigFAR can a) obtain 2-10 times as many rules as conventional non-evolutionary SARM, with RIM values that are just as high; b) for the first time control the familywise error rate and the percentage of spurious rules at an arbitrary user-specified level (e.g., 5%) in an EA environment, through two new statistically sound significance tests on the rules, namely the experimentwise and generationwise adjustment approaches. The method has been applied to investigating hotel room price determinants and wildfire risk factors.

Procedure of the DESigFAR algorithm
Application of DESigFAR to investigating hotel room price determinants
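For readers unfamiliar with differential evolution, the sketch below shows a generic DE/rand/1/bin loop on a toy objective. It is not DESigFAR itself: how DESigFAR encodes rules, which RIM it uses as fitness, and how its experimentwise and generationwise adjustments are applied are not shown here. The function name, parameter defaults and the quadratic fitness are hypothetical.

```python
import random


def differential_evolution(fitness, bounds, pop_size=20, generations=100,
                           f=0.8, cr=0.9, seed=0):
    """Maximise `fitness` over the box `bounds` with the DE/rand/1/bin scheme."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [fitness(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the target i.
            r1, r2, r3 = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = []
            for k, (lo, hi) in enumerate(bounds):
                if rng.random() < cr or k == forced:
                    gene = pop[r1][k] + f * (pop[r2][k] - pop[r3][k])
                    gene = min(max(gene, lo), hi)  # clip to the search box
                else:
                    gene = pop[i][k]
                trial.append(gene)
            trial_score = fitness(trial)
            if trial_score >= scores[i]:  # greedy one-to-one replacement
                pop[i], scores[i] = trial, trial_score
    best_index = max(range(pop_size), key=scores.__getitem__)
    return pop[best_index], scores[best_index]


def toy_fitness(candidate):
    """Hypothetical stand-in for a RIM; peaks at (0.3, 0.7)."""
    x, y = candidate
    return -((x - 0.3) ** 2 + (y - 0.7) ** 2)


best, best_score = differential_evolution(toy_fitness, [(0.0, 1.0), (0.0, 1.0)])
print(best, best_score)
```

In a rule-mining setting each candidate vector would presumably describe a rule or its fuzzy membership parameters, and only candidates that pass the significance tests would be kept.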

Related Publications:
[1] Zhang, A., Shi, W., 2019. Mining significant fuzzy association rules with differential evolution algorithm. Applied Soft Computing, DOI: 10.1016/j.asoc.2019.105518. (IF = 5.472, Q1 in Artificial Intelligence)
[2] Shi, W., Zhang, A., Webb, G.I., 2018. Mining significant crisp-fuzzy spatial association rules. International Journal of Geographical Information Science, 30(4), 928–963. (IF = 3.545, Q1 in Geography)
[3] Zhang, A., Shi, W., Webb, G.I., 2016. Mining significant association rules from uncertain data. Data Mining and Knowledge Discovery, 30(4), 928–963. (IF = 3.16, Q1 in Artificial Intelligence)

Recommending Desirable Thematic Regions

The increasing availability of location-based social networks (LBSNs) provides researchers with new tools and data sources for detecting and recommending desirable locations, which facilitates users’ travel and social interactions. However, social media data suffer from short, noisy text and a lack of a priori knowledge, which limits the usefulness of traditional semantic modelling methods. Another challenge is the need for an effective strategy for selecting and recommending candidate regions. To address these challenges, we propose a comprehensive workflow that combines the semantic and location information of social media data to recommend thematic urban regions to users with specific interests. The key modules are described below:

1. Geographical Topic Discovery

Geographical topic discovery aims to discover the topics of spatial big data (e.g., GPS-associated documents and geo-tagged social media data) in their geographical context. Traditional statistical methods are of limited effectiveness for social media data because short and noisy texts predominate. We therefore developed an unsupervised, data-driven method to investigate the geographic patterns of social media topics. Specifically, an undirected ‘hashtag network’ is built in which each hashtag is a node and the co-occurrence frequency of two hashtags is the weight of the edge between them. A greedy optimization method is then applied to detect communities in the hashtag network, and a common topic is assigned to the hashtags of each community. The method requires no well-organized training data or a priori knowledge such as the number of topics, thus reducing potential perceptual biases.

Geographical Topic Discovery based on Hashtag Community Detection
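The hashtag-network construction and a greedy community search can be sketched as follows. The text does not name the quantity being optimized; this sketch assumes greedy merging by modularity gain, and the posts, hashtags and function names are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def build_hashtag_network(posts):
    """Edge weights = number of posts in which two hashtags co-occur."""
    edges = Counter()
    for tags in posts:
        for a, b in combinations(sorted(set(tags)), 2):
            edges[(a, b)] += 1
    return edges


def greedy_communities(edges):
    """Repeatedly merge the two communities with the largest modularity gain."""
    m = sum(edges.values())                 # total edge weight
    degree = Counter()
    for (a, b), w in edges.items():
        degree[a] += w
        degree[b] += w
    label = {node: node for node in degree}  # each hashtag starts alone
    while True:
        between, comm_degree = Counter(), Counter()
        for (a, b), w in edges.items():
            if label[a] != label[b]:
                between[tuple(sorted((label[a], label[b])))] += w
        for node, d in degree.items():
            comm_degree[label[node]] += d
        best_gain, best_pair = 0.0, None
        for (ca, cb), w in between.items():
            # Modularity change if communities ca and cb were merged.
            gain = w / m - comm_degree[ca] * comm_degree[cb] / (2 * m * m)
            if gain > best_gain:
                best_gain, best_pair = gain, (ca, cb)
        if best_pair is None:                # no merge improves modularity
            break
        keep, drop = best_pair
        for node in label:
            if label[node] == drop:
                label[node] = keep
    groups = {}
    for node, community in label.items():
        groups.setdefault(community, set()).add(node)
    return sorted(groups.values(), key=sorted)


# Invented posts, each reduced to its list of hashtags.
posts = [
    ["food", "dimsum", "yum"], ["food", "dimsum"], ["dimsum", "yum"],
    ["hiking", "peak", "trail"], ["hiking", "trail"], ["peak", "trail"],
    ["food", "hiking"],
]
communities = greedy_communities(build_hashtag_network(posts))
print(communities)
```

Each resulting community would then be treated as one topic. Because merging stops on its own when no merge improves modularity, the number of topics does not have to be fixed in advance.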

2. Regional Desirability Prediction with HITS-based Model

Crowd flow analysis covers the modelling, analysis and mining of human footprints. As a fundamental approach to understanding social-space phenomena, it plays an important role in the economy, transportation, urban planning and other fields. Traditional spatial data (e.g., census data and night-light data) are limited in spatiotemporal granularity and object classification. We employ spatial big data such as GPS-associated documents, Wi-Fi data and geo-tagged social media data for finer-grained crowd flow analysis, both outdoors and indoors.

HITS-based Model

3. Regional Desirability Prediction with Neural Network Model

The increasing availability of location-based social networks (LBSNs) provides researchers with new tools and data sources for detecting and recommending attractive locations, which facilitates users’ travel and social interactions. However, developing an effective model to select and recommend candidate regions remains a challenge. We propose a new region ranking and recommendation strategy so that regions with high desirability can be predicted more accurately.

Neural Network-based Model for Predicting Regional Desirability

Related Publications:
[1] Shi W, Liu Z*, An Z, et al. RegNet: a neural network model for predicting regional desirability with VGI data[J]. International Journal of Geographical Information Science (accepted).
[2] Liu Z, Zhou X, Shi W*, et al. Recommending attractive thematic regions by semantic community detection with multi-sourced VGI data[J]. International Journal of Geographical Information Science, 2019, 33(8): 1520-1544.