Feature-Centric Video Transmission and Analytics in Large-Scale Internet of Video Things

The Internet of Video Things (IoVT), the interconnection of large-scale visual sensors, brings a qualitative leap to the exchange of urban information. However, communication delay and resource allocation pose challenges to the development of IoVT. In this paper, we propose a novel city-surveillance IoVT architecture to improve performance. The paradigm consists of front-end target region capture, edge computing, and cloud-end feature matching; it can flexibly adapt the allocation ratio of channel and computing resources, avoiding the communication-link congestion caused by unnecessary video uploading. Simulation results show that the proposed scheme is feasible and realizes efficient data transmission and analysis in an IoVT-based smart city.


With the vigorous development of the Internet of Things (IoT) [1] over the past two decades, the efficiency of information transmission has greatly improved. The number of visual sensors in urban areas has also increased significantly [2]. Compared with other sensors, visual sensors provide more comprehensive and versatile information, making them the cornerstone of applications such as smart cities and intelligent transportation.
The increasing number of surveillance cameras has also resulted in a dramatic surge in monitoring video data. Traditional surveillance systems, however, rely on the "Compress Then Analyze" (CTA) strategy for transmission and computation. This approach uploads compressed video from many channels, much of it with unnecessary content, and relies on powerful back-end processing capacity for analysis, wasting a significant amount of transmission resources.
Numerous researchers have addressed these shortcomings. Compared with the original image or video data, feature information is obtained by processing the raw data; it represents the main characteristics and patterns of the data with low redundancy, reduces the consumption of storage and computing resources, and still provides valuable support for downstream applications and tasks. Duan et al. [3] therefore proposed an AI-oriented large-scale video management paradigm based on the "Analyze Then Compress" (ATC) strategy. Unlike the CTA mode, it first extracts features from the original video and uses a compact feature stream to connect the edge and the cloud, which significantly saves communication resources. Lou et al. [4] developed the idea of "feature aggregation in real time, video uploading on demand" to support big data analysis. These works promoted the birth of the Internet of Video Things (IoVT) [5].
IoVT represents the interconnection of large-scale visual sensors and plays a unique role in city-surveillance transmission, storage, and analysis. Inspired by IoVT, Chen et al. [6] proposed a three-phase resource-effective solution to perform surveillance operations in a large-scale wireless IoVT. Kochan et al. [7] proposed a new IoVT platform that uses Software-Defined Networking (SDN) to address challenges such as flexible management, control, and maintenance of IoVT devices. It has become a consensus in the research community that information technology is moving toward humanoid development, and IoVT is also expected to enable a new future through humanoid technology [8].
As shown in Fig. 1, the model provides a straightforward IoVT paradigm for urban monitoring. At the front-end, part of the raw videos generated by thousands of cameras is selectively stored and processed, and valuable information is uploaded to the edge. The computing task is partly offloaded to the edge server, and the rest is uploaded through the core network to the cloud server for analysis on advanced cloud computing equipment.
Given the unique advantages of IoVT systems, the rich visual sensor data they collect can be combined with modern technologies such as deep learning and applied in medicine, traffic, manufacturing, and other fields. Taking industrial manufacturing as an example, sensors are extensively deployed throughout the production process, enabling intelligent industrial operations through data analysis. This integration of data and intelligence constitutes the foundation of the Industrial Internet of Things (I-IoT) [9]. Among these sensors, the visual sensor is particularly important in industrial automation, notably for defect detection: by analyzing and transmitting visual data, it facilitates the timely identification and resolution of production issues, leading to cost reduction and increased efficiency.
Although IoVT can effectively solve the problem of information isolation in previous surveillance systems, the accompanying communication and computational overhead should not be underestimated [10, 11]. Existing methods do not ensure the real-time performance of the system by analyzing the allocation ratio of communication and computational resources between the edge and the cloud. In pursuit of higher efficiency and real-time capability for multiplexed transmission, we introduce a versatile IoVT resource allocation paradigm.
This approach follows a feature-centric paradigm: the original videos are converted into compact features, and subsequent computational tasks revolve around these features, enabling smooth real-time video transmission and analysis. In the following sections, we demonstrate the feasibility of this model through mathematical modeling and experiments. The main contributions can be summarized as follows:
• A high-efficiency IoVT framework for urban surveillance is proposed by analyzing the characteristics of real-time transmission with video uploading on demand. Based on the proposed framework, we build a mathematical model of IoVT and treat real-time transmission as a constraint of the related problems.
• Efficient utilization of channel resources is achieved under video-on-demand transmission. The allocation ratio of communication and computing resources is determined according to the traffic flow and the demand for surveillance video in a given area.
• The maximum number of videos that can be uploaded by a given number of cameras within one edge control range is studied, and real-time transmission is achieved under reasonable resource allocation. In addition, the maximum number of cameras that the system can support is discussed.

System Model
The proposed IoVT architecture primarily consists of three components, as shown in Fig. 2: the camera front-end, the edge-end, and the cloud-end. We assume that a large cloud center with a large database and a powerful cloud server is located in a city. Each Mobile Edge Computing (MEC) server supports the cameras within its subordinate range, which contains a number of Base Stations (BS) directly connected to the cameras, each BS serving a number of cameras. The proposed model focuses on the overall time consumed so as to meet the demand for real-time transmission. The detailed contents are elaborated as follows.

Target region capture at the front-end
There are multiple MEC servers at the edge side, and each MEC server processes the data generated by the cameras within its subordinate region. Each camera corresponds to the area within its monitoring range, and the average number of pedestrians passing through this area per unit time follows a Poisson distribution [12] with a camera-specific intensity. We divide the working process of the whole system into minimum time units. The original video bit rate is assumed to be fixed, and the number of pedestrians captured passing through the intersection within one time unit determines the per-slot workload.
At the front-end, intelligent cameras are deployed within each edge service area, and each camera is equipped with a small chip. These intelligent cameras capture Target Regions (TRs) through a built-in YOLO-v5 network [13]. The computation time cost at each front-end camera, given in Eq. (1), is determined by the time required to process one frame and the raw video frame rate.
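As a minimal illustration of this front-end stage (the variable names below are hypothetical, and the paper's own symbols are not reproduced), the per-slot detection cost can be approximated as the per-frame inference time multiplied by the number of frames in a time unit, while pedestrian counts follow a Poisson draw:

```python
import numpy as np

def front_end_capture_time(t_frame: float, fps: float, tau: float) -> float:
    """Approximate YOLO-v5 detection time per time unit at one camera.

    t_frame: average inference time for one frame (s)
    fps:     raw video frame rate (frames/s)
    tau:     length of the minimum time unit (s)
    """
    return t_frame * fps * tau  # every frame in the slot is processed

def pedestrian_counts(lam: float, tau: float, n_slots: int, seed: int = 0):
    """Sample the number of pedestrians captured in each time unit,
    assuming Poisson arrivals with intensity lam (persons per second)."""
    rng = np.random.default_rng(seed)
    return rng.poisson(lam * tau, size=n_slots)

# Example: 20 ms per frame, 25 fps, 1 s slots, 0.8 pedestrians/s on average
print(front_end_capture_time(0.02, 25, 1.0))   # 0.5 s of on-camera compute
print(pedestrian_counts(0.8, 1.0, 5))
```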

Transmission from the front-end to the edge
Each TR corresponds to a fixed amount of data; thus, the data of a camera that needs to be uploaded to the edge within one time unit is proportional to the number of captured TRs. According to Shannon theory [14], the uplink transmission rate of a camera is given by Eq. (2), where it depends on the bandwidth of the wireless channel, the proportion of communication resources allocated to that front-end camera under its BS, the transmission power, the wireless channel gain between the camera and the BS, and the noise power.
To ensure the operation of the traditional monitoring center, we also analyze the transmission of the raw surveillance videos. Thus, the transmission task of the front-end involves sending the target region information along with the corresponding primary-view video.
For the original videos, only those with analytical value are transmitted in order to reduce the transmission load on the system and enhance data availability. A binary indicator, taking the value 0 or 1, determines whether the video data of a camera should be uploaded. Assuming a fixed transmission rate for the wired channel, an uploaded video is transmitted directly through the wired channel to the cloud after reaching the local BS. The MEC server does not need to process the original video data, so the video merely traverses it as an intermediate hop; the data actually transmitted from the front-end to the edge for processing is only the target region data. Consequently, the overall transmission delay from the front-end to the edge can be formulated as in Eq. (3) [15].
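A small sketch of the uplink model, under the assumption that the rate follows the standard Shannon capacity of the allocated channel share; the names and numbers below are illustrative only:

```python
import math

def uplink_rate(alpha: float, bandwidth_hz: float, tx_power: float,
                channel_gain: float, noise_power: float) -> float:
    """Shannon capacity of the share of the wireless channel given to one
    camera; alpha is the fraction of communication resources it receives."""
    return alpha * bandwidth_hz * math.log2(1 + tx_power * channel_gain / noise_power)

def front_to_edge_delay(tr_bits: float, n_pedestrians: int, rate_bps: float) -> float:
    """Delay for uploading the TR data captured in one time unit."""
    return n_pedestrians * tr_bits / rate_bps

# Example: 10 MHz channel, 10% share, 23 dBm tx power, -100 dB gain, -90 dBm noise
r = uplink_rate(0.1, 10e6, 0.2, 1e-10, 1e-12)
print(front_to_edge_delay(tr_bits=200e3, n_pedestrians=12, rate_bps=r))  # ~0.55 s
```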

Feature compression at the edge
At the edge side, each uploaded TR is fed into a Convolutional Neural Network (CNN) for feature extraction, which requires a certain average time per TR. A fraction of the MEC computational resources is allocated to the uploaded data of each camera, and the time required to extract features from the target patches follows from these two quantities. The target region patches are then transformed into features with an average compression ratio.
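The edge-side cost can be sketched in the same spirit, assuming the extraction time scales inversely with the computing share a camera receives; names and values are illustrative:

```python
def edge_feature_time(t_extract: float, n_trs: int, beta: float) -> float:
    """Time for the MEC server to extract features from all TRs of one camera.

    t_extract: average CNN feature-extraction time per TR at full capacity (s)
    n_trs:     number of target regions uploaded in the time unit
    beta:      fraction of MEC computing resources allocated to this camera
    """
    return n_trs * t_extract / beta  # a smaller share slows extraction down

def feature_volume(tr_bits: float, n_trs: int, compression_ratio: float) -> float:
    """Data volume after the TR patches are compressed into features."""
    return n_trs * tr_bits * compression_ratio

print(edge_feature_time(t_extract=0.01, n_trs=12, beta=0.25))            # 0.48 s
print(feature_volume(tr_bits=200e3, n_trs=12, compression_ratio=0.05))   # 120 kb
```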

Transmission from the edge to the cloud
Since the link between the edge and the cloud is wired, we assume that the communication distance from each edge to the cloud-end is the same. The transfer rate is fixed, and each MEC server has a separate wired channel linked to the cloud. The bandwidth of this wired channel is also allocated to the data from each camera according to a ratio, and the total time cost of sending data from the MEC server to the cloud can be formulated as Eq. (5). When the data arrives at the cloud-end, feature matching is performed against the requested target images in the database; based on the similarity of the match, we can determine whether the corresponding video needs to be uploaded. Because the processing power of the cloud server is extremely high, we assume its computational power is infinite and ignore its calculation time. Combining Eqs. (1)-(3) and (5), the data generated by a camera in one time slot eventually reaches the cloud with the total time delay given by Eq. (6).

Real-Time Transmission Analysis

According to Eq. (6), the total transmission delay of any camera can be obtained. To satisfy the real-time transmission requirement, this delay must not exceed the length of one time unit for every camera; the resulting constraint is given in Formula (7). Taking Formula (7) as the delay constraint, we solve the problem of how many cameras can be installed at most under different video uploading densities. We set appropriate resource allocation ratios, and since the per-slot pedestrian count is a random variable, we use the properties of the Poisson distribution in the following calculations; the number of pedestrians per unit time is replaced by its expectation to simplify the analysis.
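For intuition, the end-to-end budget can be sketched as a sum of stage delays checked against the slot length; this is a simplification with hypothetical names and does not reproduce the paper's Eqs. (1)-(7) exactly:

```python
def total_delay(t_capture: float, t_up: float, t_extract: float, t_cloud: float) -> float:
    """End-to-end delay for one camera's data in a single time unit:
    on-camera detection + uplink to the edge + edge feature extraction
    + wired transfer of features (and, on demand, raw video) to the cloud."""
    return t_capture + t_up + t_extract + t_cloud

def is_real_time(delay: float, tau: float) -> bool:
    """Real-time constraint: a slot's data must be delivered and processed
    within the slot length (delay <= tau)."""
    return delay <= tau

print(is_real_time(total_delay(0.5, 0.25, 0.1, 0.1), tau=1.0))  # True
```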
To take full advantage of the network, we need to ensure that the channel is always highly utilized. However, when real-time transmission is satisfied, the computing resources do not need to be fully exhausted. We first discuss the situation where computing resources are relatively abundant. To simplify the conditions, we assume that all cameras have the same video uploading probability and the same average video bit rate. Then, using Eqs. (8) and (9), one can determine the communication resource allocation ratios of the wireless and wired channels.
Communication resources are allocated in proportion to the transmission volume of each camera to ensure transmission efficiency and make full use of the channels. The computing resource allocation ratio of each camera is then determined from the delay constraint in Formula (7), as expressed in Formula (10).
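A possible reading of this allocation rule, sketched with hypothetical helper functions: channel shares proportional to per-camera traffic, and the smallest computing share that still fits the remaining delay budget:

```python
def comm_allocation(traffic_bits: list[float]) -> list[float]:
    """Allocate channel resources in proportion to each camera's traffic
    (TR/feature data plus raw video weighted by its upload probability)."""
    total = sum(traffic_bits)
    return [b / total for b in traffic_bits]

def compute_allocation_from_deadline(t_extract: float, n_trs: list[int],
                                     slack: list[float]) -> list[float]:
    """Pick the smallest computing share that still meets the per-camera
    delay budget left after capture and transmission (slack, in seconds)."""
    return [n * t_extract / s for n, s in zip(n_trs, slack)]

print(comm_allocation([2.4e6, 1.2e6, 0.4e6]))                        # [0.6, 0.3, 0.1]
print(compute_allocation_from_deadline(0.01, [12, 6], [0.5, 0.4]))   # [0.24, 0.15]
```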
If communication resources are sufficient, the resulting allocation must satisfy the condition of Formula (13). Otherwise, the computing resource ratio is set according to the workload of each camera; under this allocation strategy, its expression is given in Formula (14). At this point the edge server has reached the limit of its computing power, so when the communication channel is also fully utilized, the only remaining degree of freedom is the probability of video uploading. Substituting Eqs. (8), (9), and (14) back into the delay constraint of Formula (7), we obtain the maximum value of this probability. Given the numbers of BSs and front-end cameras, the maximum video uploading ratio that still meets real-time transmission is obtained, as displayed in Formula (15).
Based on the above formulas, the entire IoVT system can be made to meet the real-time conditions by adjusting the video upload probability even when more cameras are deployed. Conversely, when the maximum crowd density under a given edge is known and the number of cameras under each BS is given, the maximum number of cameras for a given upload probability is obtained as in Eq. (18). Equations (19) and (20) provide the expressions of the auxiliary parameters used above.
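Since the closed forms of Formulas (13)-(20) are not reproduced here, the two thresholds can also be located numerically, assuming a feasibility test for the real-time constraint is available; the sketch below is illustrative only:

```python
def max_upload_probability(n_cams: int, budget_fn, p_grid=None) -> float:
    """Largest common upload probability p for which every camera still meets
    the real-time constraint; budget_fn(n_cams, p) should return True when
    the delay constraint (Formula (7)) holds for the whole system."""
    if p_grid is None:
        p_grid = [i / 100 for i in range(100, -1, -1)]  # 1.00 down to 0.00
    for p in p_grid:
        if budget_fn(n_cams, p):
            return p
    return 0.0

def max_cameras(upload_prob: float, budget_fn, n_max: int = 500) -> int:
    """Largest number of cameras the edge can support at a given upload
    probability while the real-time constraint still holds."""
    n = 0
    while n + 1 <= n_max and budget_fn(n + 1, upload_prob):
        n += 1
    return n
```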

Simulation Results and Discussion
In this section, a series of experiments are performed to verify the soundness of the proposed mathematical model.

Simulation setup
We choose the MOT20 [16] dataset as the surveillance video resource. MOT20 is a MOTChallenge benchmark for extremely crowded scenes consisting of 8 video sequences. From the MOT20 training set, videos with different lengths and scenes are selected to represent front-end cameras 1−8 served by the same edge server. Taking sub-video MOT20-01 as an example, its duration is 17 s; it corresponds to camera 1, and the human traffic in the video is taken as its pedestrian count. From such counts we make a rough estimate of the average passenger flow in a given area and use it as the Poisson intensity of the pedestrian flow for that camera. The relevant information of the video sequences used in the experiment and the results are shown in Table 1.
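The Poisson intensity of each camera can be estimated simply as the observed pedestrian count divided by the clip duration; the count used below is a made-up value, not the one reported in Table 1:

```python
def poisson_intensity(pedestrian_count: int, duration_s: float) -> float:
    """Estimate the Poisson arrival intensity of one camera as the observed
    pedestrian count divided by the clip duration (persons per second)."""
    return pedestrian_count / duration_s

# Example in the spirit of MOT20-01 (17 s long), with a hypothetical count:
print(poisson_intensity(pedestrian_count=26, duration_s=17.0))  # ~1.53 persons/s
```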

Simulation analysis
In the experiment, we measured the average size of the TRs, the average bit rate and frame rate of the test videos, the average computation speed of the YOLO-v5 network, the average processing speed of 10 RTX 2060 graphics cards for a single image, and the compression ratio of the features extracted by the VGG16 network [17]. The experimental results and the corresponding parameters are shown in Table 2.

Analysis of resource allocation
For the experiments, we assume that the eight videos in MOT20 correspond to eight different surveillance cameras and divide them into two groups, cameras 1−4 and cameras 5−8, which belong to BS 1 and BS 2, respectively. These cameras are under the control of one edge server, and the video segments are passed through the YOLO-v5 network for pedestrian detection and TR extraction. Here, we select the cameras under BS 1 as an example. The resource allocation curves are shown in Figs. 3 and 4.
The experimental results indicate that when the system has sufficient communication and computing resources, the communication resources of the entire system tend to be evenly distributed; only when the probability of video upload is low do TR or feature data become the main transmission objects. If the number of cameras at the edge approaches saturation, the communication resources still tend toward an even distribution as the probability of video upload increases. At this point, the allocation ratio of computing resources becomes the primary means of adjusting the delay, and it rises sharply. However, if there are too many cameras, the probability of video upload can only be reduced to meet the real-time requirements.
In Fig. 4a, the total amount of computing resources is still abundant, but when the communication channel resources are fully loaded, the uploading probability of the videos needs to be adjusted. With a given communication resource allocation ratio, the relationship between the probability of video upload and the computing allocation is shown in Fig. 4b. It is clear that communication resources limit the number of computational tasks uploaded to the edge server; accordingly, the computing resources allocated to each device rise. Moreover, according to Formula (10), a higher passenger flow increases the proportion of computational resources and therefore lowers the allowable upload probability compared with a lower flow, as shown in Fig. 5a.

Analysis of system threshold
As the number of cameras increases, the allowable maximum video upload probability diminishes. To ensure the system's real-time performance and reliability, we also examine the maximum number of cameras that the system can support. In the simulation experiment, we choose the video MOT20-05, which records a very dense pedestrian flow in a public place and has the largest intensity. Based on it, we obtain the curve of the camera configuration number versus the video uploading probability shown in Fig. 5b. The general consensus is that transmitting only the crucial feature information considerably reduces data redundancy, thereby easing the burden on the system's transmission and computation. When the system can only support feature transmission, as shown in Eq. (18), the uploading ratio reaches zero. At this point, with the given resources, the maximum number of cameras depends only on the number of cameras under each base station. The maximum numbers of cameras under different configurations are shown in Table 3. For large-scale urban surveillance camera networks, the sheer volume of monitoring data is immense. Therefore, in a realistic IoVT scene, we recommend that the system assign an appropriate weight to each camera according to the importance of its district: when the total video upload rate of the system decreases, cameras with higher weights upload their videos first, realizing reasonable and efficient information transmission.
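A minimal sketch of this weighted upload policy, with hypothetical camera weights: when the system-wide upload budget shrinks, higher-weight districts keep uploading first:

```python
def select_uploads(weights: dict[str, float], budget: int) -> list[str]:
    """Pick which cameras may upload raw video when only `budget` uploads
    fit within the real-time constraint: higher-weight districts go first."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:budget]

# Hypothetical district weights; only the two most important cameras upload.
print(select_uploads({"cam1": 0.9, "cam2": 0.4, "cam3": 0.7}, budget=2))
```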

Performance comparison
Many researchers have conducted in-depth studies on IoVT, and some IoVT trials have connected hundreds of cameras to process the surveillance videos recorded in the urban region of Hangzhou. The following elaborates on the advantages of the proposed framework.
An SDN-based IoVT platform was proposed by Kochan et al. [7] to address issues with flexible administration, control, and maintenance of IoVT devices. Different from the framework proposed in this paper, it focuses on task allocation at the edge and in the cloud by analyzing the priority and complexity of tasks; it does not rationally allocate communication and computational resources, and the system is not very flexible. Ren et al. [19] investigated the collaboration between cloud computing and edge computing and formulated a joint communication and computation resource allocation problem to minimize the weighted-sum delay of all mobile devices; their experimental results show the effectiveness of the cloud-edge collaboration architecture. Because the emphases of existing works differ from ours, we do not compare our work directly with these methods; instead, we demonstrate the superiority of the system through a comparison with the traditional approach.
Traditional methods require large bandwidth and high computation capability. For urban surveillance, assume that video is transmitted at an average bit rate of 10 Mbps according to the reported test results; to achieve real-time transmission, a bandwidth of at least 600 Mbps is then required for 60 camera channels [20]. Compared with this traditional transmission method, our method reduces bandwidth consumption considerably. Moreover, in the traditional way, at least 6 servers equipped with multiple high-performance GPUs are needed to keep the smart city running; we estimate that a modest server with 4 RTX 3090 graphics cards installed provides at least 2880 megahash per second (MH/s) of processing power. The performance improvement is shown in Fig. 6. In summary, our system can flexibly allocate resources according to the density of uploaded videos, optimizing system resource allocation and providing an effective solution paradigm for an IoVT-based smart city.
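The bandwidth comparison can be reproduced as rough arithmetic; the upload probability and feature fraction below are assumed values, not measurements from the paper:

```python
def raw_bandwidth(bitrate_mbps: float, n_channels: int) -> float:
    """Aggregate uplink bandwidth for the traditional CTA approach, where
    every channel streams its raw video continuously."""
    return bitrate_mbps * n_channels

def feature_centric_bandwidth(bitrate_mbps: float, n_channels: int,
                              upload_prob: float, feature_fraction: float) -> float:
    """Bandwidth when only a fraction of channels upload raw video on demand
    and the rest send compact TR/feature data (feature_fraction of raw rate)."""
    return bitrate_mbps * n_channels * (upload_prob + (1 - upload_prob) * feature_fraction)

print(raw_bandwidth(10, 60))                                                       # 600 Mbps, as in [20]
print(feature_centric_bandwidth(10, 60, upload_prob=0.2, feature_fraction=0.05))   # 144 Mbps
```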

Conclusion
In this work, an information transmission paradigm adapted to urban IoVT networks is proposed by combining target capture, feature transmission, and analysis. A resource allocation scheme based on pedestrian flow density and video uploading probability is adopted. This strategy alleviates communication channel congestion by transmitting video according to the video uploading density, thereby guaranteeing the real-time requirement. The rationality of the proposed system framework is proved through mathematical modeling and experiments. In the future, we plan to integrate this system with advanced technologies to build a more powerful smart-city IoVT network and eventually realize smart management of cities.

Fig. 2 Overall architecture of the proposed IoVT framework for urban surveillance.

Fig. 3 Communication resource allocation for wireless and wired channel.

Fig. 6 Comparison of communication and computation source consumption in traditional way and ours.