MixQ
Outliers are the main challenge when quantizing activation tensors for inference. Previous work has shown that outliers are concentrated in fixed channels, but few studies have examined how regularly outliers occur while decoding tokens. Existing open-source implementations also fall short of the ideal speedup over the FP16 baseline. In this project, we show that outliers exhibit locality during token decoding, and we design a mixed-precision kernel to achieve state-of-the-art performance.
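
To make the idea of mixed-precision computation concrete, here is a minimal NumPy sketch of the general technique: channels that contain outliers stay in floating point, and the remaining channels go through an INT8 path. This is not the MixQ kernel. It runs on the CPU, and the function name `mixed_precision_matmul`, the threshold of 6.0, and the per-token and per-column scaling scheme are all assumptions made for illustration.

```python
import numpy as np


def mixed_precision_matmul(x, w, threshold=6.0):
    """Illustrative only. x: (tokens, in_features), w: (in_features, out_features).

    The threshold value and the scaling scheme are assumptions, not MixQ's settings.
    """
    # A channel counts as an outlier channel if any token exceeds the threshold in it.
    outlier = np.abs(x).max(axis=0) > threshold
    normal = ~outlier

    # Outlier channels are multiplied in floating point.
    y_outlier = x[:, outlier].astype(np.float32) @ w[outlier, :].astype(np.float32)

    # Remaining channels: symmetric INT8, one scale per token and one per output column.
    xn, wn = x[:, normal], w[normal, :]
    sx = np.maximum(np.abs(xn).max(axis=1, keepdims=True, initial=0.0), 1e-8) / 127.0
    sw = np.maximum(np.abs(wn).max(axis=0, keepdims=True, initial=0.0), 1e-8) / 127.0
    xq = np.clip(np.rint(xn / sx), -127, 127).astype(np.int8)
    wq = np.clip(np.rint(wn / sw), -127, 127).astype(np.int8)

    # Accumulate in INT32, then rescale back to floating point.
    acc = xq.astype(np.int32) @ wq.astype(np.int32)
    y_normal = acc.astype(np.float32) * sx * sw

    return y_normal + y_outlier, outlier


# Usage: 4 tokens, 16 input channels, with channel 3 forced to hold outliers.
rng = np.random.default_rng(0)
x = np.clip(rng.standard_normal((4, 16)), -3.0, 3.0).astype(np.float32)
x[:, 3] = 40.0
w = np.clip(0.1 * rng.standard_normal((16, 8)), -0.3, 0.3).astype(np.float32)

y, outlier_mask = mixed_precision_matmul(x, w)
reference = x.astype(np.float64) @ w.astype(np.float64)
print("outlier channels:", np.flatnonzero(outlier_mask))
print("max abs error vs. full precision:", np.abs(y - reference).max())
```

Because the outlier mask here is recomputed on every call, the sketch does not take advantage of the locality described above. A kernel that does could, in principle, reuse the outlier channel set across decoding steps instead of detecting it each time.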