-
Notifications
You must be signed in to change notification settings - Fork 158
Description
It appears that when encountering invalid an UTF-8 sequence in decodeUtf8With
, it calls decodeUtf8With2
. In the inner
loop there, it tries to decode the very first character, and then recursively calls itself on the rest.
This is an issue when the invalid code sequence appears at the (almost very end) of the string: For each character popped off the beginning, we try to decode the entire remainder of the string.
I wrote the following reproducer (using GHC 9.4.4 with text-2.0.1
and bytestring-0.11.3.1
).
module Main where
import Control.Exception
import Data.Foldable (for_)
import System.CPUTime (getCPUTime)
import qualified Data.Text.Encoding as Enc
import qualified Data.ByteString as BS
main :: IO ()
main = do
putStrLn "Running"
for_ [1..75] $ \factor -> do
let len = factor * 10000
-- The string must not end with the invalid sequence, otherwise it'll get cut off eagerly
let invalid = BS.replicate len 32 <> BS.replicate 1 216 <> BS.singleton 32
evaluate invalid
before <- getCPUTime
let numIterations = 10
for_ [1..numIterations] $ \_ -> do
evaluate $ Enc.decodeUtf8Lenient invalid
after <- getCPUTime
let nanos = (after - before) `div` 1000 `div` numIterations
putStrLn $ show (BS.length invalid) <> "\t" <> show nanos
Plotting the length vs run time records it prints clearly shows a quadratic growth:
Raw data of my run
Length (Bytes) | Nanoseconds |
---|---|
10002 | 363155 |
20002 | 440389 |
30002 | 758905 |
40002 | 1278086 |
50002 | 2079332 |
60002 | 2975872 |
70002 | 4092472 |
80002 | 5322233 |
90002 | 6755837 |
100002 | 8353817 |
110002 | 10090751 |
120002 | 12221735 |
130002 | 14223879 |
140002 | 16317896 |
150002 | 19183989 |
160002 | 21774090 |
170002 | 24199864 |
180002 | 26894248 |
190002 | 30935749 |
200002 | 34761685 |
210002 | 37555264 |
220002 | 40420152 |
230002 | 45140731 |
240002 | 50637179 |
250002 | 54801430 |
260002 | 58834538 |
270002 | 63470996 |
280002 | 71390759 |
290002 | 73341154 |
300002 | 78781234 |
310002 | 87749863 |
320002 | 93538205 |
330002 | 96206465 |
340002 | 104967901 |
350002 | 112452576 |
360002 | 118074408 |
370002 | 124928813 |
380002 | 133200718 |
390002 | 139740339 |
400002 | 147652567 |
410002 | 157100436 |
420002 | 163282491 |
430002 | 171497972 |
440002 | 182591830 |
450002 | 191219840 |
460002 | 198834990 |
470002 | 211232776 |
480002 | 220616658 |
490002 | 228667562 |
500002 | 241215551 |
510002 | 249927112 |
520002 | 260673296 |
530002 | 272971252 |
540002 | 282285464 |
550002 | 292999742 |
560002 | 307085056 |
570002 | 319529945 |
580002 | 329887608 |
590002 | 342599272 |
600002 | 351595686 |
610002 | 367887033 |
620002 | 380861250 |
630002 | 393925018 |
640002 | 409024668 |
650002 | 423207592 |
660002 | 434496271 |
670002 | 450354392 |
680002 | 464690475 |
690002 | 476479403 |
700002 | 493291074 |
710002 | 506146007 |
720002 | 519209847 |
730002 | 538656736 |
740002 | 549421069 |
750002 | 568419934 |
This issue seems to be mainly caused by the fact that isValidBS
(regardless of its actual implementation as chosen by the preprocessor defines) only returns a boolean. That way, in the failure case, the error recovery code around it has no way of knowing where the faulty sequence starts.
If instead it would return an Int
indicating the number of valid bytes found in the beginning, the error-recovery code would immediately know how many bytes can be used as-is for the decoded text, and where to start error recovery (like inserting replacement characters). It seems to me that the decoder used in isValidBS
should already have this information (after all, it needs to know what part of the string it's currently processing), so maybe just exposing this would be an option.