In this paper, we build a $TC$TC-$Stream$Stream, a high-performance graph processing system specific for a triangle counting algorithm on graph data with up to tens of billions of edges, which significantly exceeds the device memory capacity of Graphics Processing Units (GPUs). The triangle counting problem is a broad research topic in data mining and social network analysis in the graph processing field. As the scale of the graph data grows, a portion of the graph data must be loaded iteratively. In the existing literature, graphs with billions of edges need to be done distributively, which is cost-intensive. Also, many disk-based triangle counting systems are proposed for CPU architectures, but their tackling performances are inefficient. To solve the above problem, we propose $TC$TC-$Stream$Stream, and it focuses on three issues: 1) For power-law graphs, because the amount of tasks of each vertex or edge is inconsistent, it is bound to cause different demands of computing and memory resources for different task types. We propose a parallel vertex approach and the reordering of vertices for graph data that can be placed in the GPU device memory to ensure the maximum workload balancing; 2) A binary-search-based set intersection method is designed to achieve the maximum parallelism in GPU; 3) For the graph data that exceeds the GPU device memory capacity, we develop a novel vertical partition algorithm to guarantee the independent computing on each partition so that the three computation processes, i.e., the computation on GPU, the data transmission between main memory of CPU and SSD, and the communication between the CPU and the GPU can be perfectly overlapped. Moreover, the $TC$TC-$Stream$Stream optimizes edge-iterator models and benefits from multi-thread parallelism. Extensive experiments conducted on large-scale datasets showed that the $TC$TC-$stream$stream running on a single Tesla V100 GPU performs $2.4-6times$2.4-6× and $1.8-4.4times$1.8-4.4× faster than the state-of-the-art single-machine in-memory triangle counting system and GPU-based triangle counting system, respectively, and achieves $2.4times$2.4× faster than the state-of-the-art out-of-core distributed system PDTL running on an 8-node cluster when processing the graph data with 42.5 billion edges, which demonstrates the high performance and cost-effectiveness of the $TC$TC-$Stream$Stream.