We have built openGauss, an enterprise-grade open-source database system. openGauss meets its design goals of high performance, high availability, high security, and high intelligence. For high performance, it leverages NUMA (non-uniform memory access)-aware data access among multiple cores to enable efficient concurrent transaction processing, and symmetric multi-processing to use parallel processing resources adaptively. Moreover, memory-optimized tables (MOTs) are designed to keep all data in memory. For high availability, a three-tier pooling architecture, comprising both a distributed memory service (DMS) and a distributed storage service (DSS), shares storage among the master and standby instances to achieve 99.99% availability. For high security, it is a fully encrypted database that offers secure storage, efficient complex querying, and tamper-proofing. For high intelligence, an AI-based optimizer in the kernel and a self-driving platform named DBMind are demonstrated to achieve better performance and greater user-friendliness. openGauss has served over 150 enterprises and institutions since its release in 2020. We share the lessons we have learned from its development and operation, and from our customers.
The rapid development of social networks has resulted in a proliferation of user-generated content (UGC), which can benefit many applications. In this paper, we study the problem of identifying a user’s locations from microblogs, to facilitate effective location-based advertisement and recommendation. Since the location information in a single microblog is incomplete, we cannot obtain an accurate location from that microblog alone. We therefore propose a global location identification method, GLITTER. GLITTER combines multiple microblogs of a user and uses them to identify the user’s locations. GLITTER not only improves the quality of identifying a user’s locations but also supplements the location information of each microblog so that an accurate location can be obtained for it. To facilitate location identification, GLITTER organizes points of interest (POIs) into a tree structure where leaf nodes are POIs and non-leaf nodes are segments of POIs, e.g., countries, cities, and streets. Using the tree structure, GLITTER first extracts candidate locations from each microblog of a user; these candidates correspond to tree nodes. Then GLITTER aggregates these candidate locations and identifies the top-k locations of the user. Using the identified top-k user locations, GLITTER refines the candidate locations and computes the top-k locations of each microblog. To achieve high recall, we enable fuzzy matching between locations and microblogs. We propose an incremental algorithm to support dynamic updates of microblogs. We also study how to identify users’ trajectories based on the extracted locations, and propose an effective algorithm to extract high-quality trajectories. Experimental results on real-world datasets show that our method achieves high quality and good performance, and scales well.
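The extract-then-aggregate idea in the abstract above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than GLITTER's actual method: the tiny POI tree, the names `PARENT`, `extract_candidates`, and `top_k_user_locations`, the exact substring matching (the paper uses fuzzy matching), and the vote-counting score with a depth tie-break.

```python
from collections import Counter

# Hypothetical POI tree, stored as child -> parent (None marks the root).
# Leaves are POIs; inner nodes are segments such as country, city, street.
PARENT = {
    "usa": None,
    "new york": "usa",
    "broadway": "new york",
    "times square": "broadway",
    "boston": "usa",
    "fenway park": "boston",
}


def extract_candidates(microblog):
    """Return tree nodes whose names occur in the text (exact match only)."""
    text = microblog.lower()
    return [node for node in PARENT if node in text]


def path_to_root(node):
    """Return the node followed by all of its ancestors."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def top_k_user_locations(microblogs, k):
    """Aggregate candidates over a user's microblogs and keep the k best nodes.

    Assumed scoring: each microblog gives one vote to every node on the path
    from a matched node to the root; ties go to the deeper (more specific) node.
    """
    votes = Counter()
    for blog in microblogs:
        supported = set()
        for candidate in extract_candidates(blog):
            supported.update(path_to_root(candidate))
        votes.update(supported)
    depth = {node: len(path_to_root(node)) - 1 for node in votes}
    ranked = sorted(votes, key=lambda n: (-votes[n], -depth[n], n))
    return ranked[:k]


blogs = [
    "Lunch near Times Square",
    "Walking down Broadway tonight",
    "Great game at Fenway Park",
]
print(top_k_user_locations(blogs, 3))
```

In this toy run, no single microblog names the city, yet "new york" still surfaces among the user's top locations because two microblogs support nodes beneath it. That is the intuition behind combining microblogs globally instead of treating each one in isolation.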
Data visualization transforms data into images to aid the understanding of data; it is therefore an invaluable tool for explaining the significance of data to visually inclined people. Given a (big) dataset, the essential task of visualization is to tell compelling stories with the data by selecting, filtering, and transforming it, and by picking the right visualization type, such as a bar chart or a line chart. Our ultimate goal is to automate this task, which currently requires heavy user intervention in existing visualization systems. Such an automated system faces three main challenges: (1) Visualization verification: determining whether a visualization of a given dataset is interesting from the viewpoint of human understanding; (2) Visualization search space: a "boring" dataset may become interesting after an arbitrary combination of operations such as selections, joins, and aggregations; (3) On-time responses: results must arrive before the user's patience is depleted. In this paper, we present the DeepEye system to address these challenges. The system addresses the first challenge by training a binary classifier to decide whether a particular visualization is good for a given dataset, and by using a supervised learning-to-rank model to rank the visualizations classified as good. For the second challenge, it considers popular visualization operations that manipulate the data, such as grouping and binning, and these operations determine the search space. The system tackles the third challenge by incorporating database optimization techniques for sharing computations and pruning.
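The enumerate, filter, and rank pipeline described in the abstract above can be sketched as follows. This is a toy illustration under stated assumptions, not DeepEye's implementation: the table `ROWS`, the hand-written rule in `looks_good` (a stand-in for the trained binary classifier), and the variance-based `score` (a stand-in for the learning-to-rank model) are all hypothetical.

```python
from itertools import product
from statistics import mean, pvariance

# Hypothetical toy table.
ROWS = [
    {"region": "north", "quarter": "Q1", "sales": 10},
    {"region": "north", "quarter": "Q2", "sales": 30},
    {"region": "south", "quarter": "Q1", "sales": 20},
    {"region": "south", "quarter": "Q2", "sales": 20},
]
GROUP_COLUMNS = ["region", "quarter"]
AGGREGATES = {"sum": sum, "avg": mean}
CHART_TYPES = ["bar", "pie"]


def group_values(rows, column, measure="sales"):
    """Group the measure values by one column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[measure])
    return groups


def enumerate_candidates(rows):
    """Enumerate the search space: group column x aggregate x chart type."""
    for column in GROUP_COLUMNS:
        # Shared computation: group once, reuse for every aggregate and chart.
        groups = group_values(rows, column)
        for (agg_name, agg), chart in product(AGGREGATES.items(), CHART_TYPES):
            series = {key: agg(values) for key, values in groups.items()}
            yield {"x": column, "agg": agg_name, "chart": chart, "series": series}


def looks_good(candidate):
    """Assumed rule standing in for the trained binary classifier."""
    if not 2 <= len(candidate["series"]) <= 20:
        return False
    # Averages are not parts of a whole, so a pie chart of them is rejected.
    return not (candidate["chart"] == "pie" and candidate["agg"] == "avg")


def score(candidate):
    """Assumed ranking signal: spread of the plotted values."""
    return pvariance(list(candidate["series"].values()))


def recommend(rows, k):
    """Keep candidates that pass the filter and return the k highest-scoring."""
    good = [c for c in enumerate_candidates(rows) if looks_good(c)]
    return sorted(good, key=score, reverse=True)[:k]


for candidate in recommend(ROWS, 3):
    print(candidate["chart"], candidate["agg"], "by", candidate["x"])
```

Grouping once per column and reusing the result across aggregates and chart types is a small-scale analogue of the computation sharing mentioned in the abstract. Rejecting candidates before scoring them plays the role of pruning.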