Decoding of bytestrings with invalid UTF-8 has quadratic run time

It appears that when encountering invalid an UTF-8 sequence in `decodeUtf8With`, it calls `decodeUtf8With2`. In the `inner` loop there, it tries to decode the very first character, and then recursively calls itself on the rest.

This is an issue when the invalid code sequence appears at the (almost very end) of the string: For each character popped off the beginning, we try to decode the entire remainder of the string.

I wrote the following reproducer (using GHC 9.4.4 with `text-2.0.1` and `bytestring-0.11.3.1`).

```haskell
module Main where

import Control.Exception
import Data.Foldable (for_)
import System.CPUTime (getCPUTime)
import qualified Data.Text.Encoding as Enc
import qualified Data.ByteString as BS

main :: IO ()
main = do
  putStrLn "Running"
  for_ [1..75] $ \factor -> do
    let len = factor * 10000
    -- The string must not end with the invalid sequence, otherwise it'll get cut off eagerly
    let invalid = BS.replicate len 32 <> BS.replicate 1 216 <> BS.singleton 32
    evaluate invalid
    before <- getCPUTime
    let numIterations = 10
    for_ [1..numIterations] $ \_ -> do
      evaluate $ Enc.decodeUtf8Lenient invalid
    after <- getCPUTime
    let nanos = (after - before) `div` 1000 `div` numIterations
    putStrLn $ show (BS.length invalid) <> "\t" <> show nanos
```

Plotting the length vs run time records it prints clearly shows a quadratic growth:

![image](https://user-images.githubusercontent.com/3857449/214569639-9330ca8a-2dec-4435-ac0c-312c01b18b14.png)

<details>

<summary>Raw data of my run</summary>

Length (Bytes) | Nanoseconds
-- | --
10002 | 363155
20002 | 440389
30002 | 758905
40002 | 1278086
50002 | 2079332
60002 | 2975872
70002 | 4092472
80002 | 5322233
90002 | 6755837
100002 | 8353817
110002 | 10090751
120002 | 12221735
130002 | 14223879
140002 | 16317896
150002 | 19183989
160002 | 21774090
170002 | 24199864
180002 | 26894248
190002 | 30935749
200002 | 34761685
210002 | 37555264
220002 | 40420152
230002 | 45140731
240002 | 50637179
250002 | 54801430
260002 | 58834538
270002 | 63470996
280002 | 71390759
290002 | 73341154
300002 | 78781234
310002 | 87749863
320002 | 93538205
330002 | 96206465
340002 | 104967901
350002 | 112452576
360002 | 118074408
370002 | 124928813
380002 | 133200718
390002 | 139740339
400002 | 147652567
410002 | 157100436
420002 | 163282491
430002 | 171497972
440002 | 182591830
450002 | 191219840
460002 | 198834990
470002 | 211232776
480002 | 220616658
490002 | 228667562
500002 | 241215551
510002 | 249927112
520002 | 260673296
530002 | 272971252
540002 | 282285464
550002 | 292999742
560002 | 307085056
570002 | 319529945
580002 | 329887608
590002 | 342599272
600002 | 351595686
610002 | 367887033
620002 | 380861250
630002 | 393925018
640002 | 409024668
650002 | 423207592
660002 | 434496271
670002 | 450354392
680002 | 464690475
690002 | 476479403
700002 | 493291074
710002 | 506146007
720002 | 519209847
730002 | 538656736
740002 | 549421069
750002 | 568419934

</details>

This issue seems to be mainly caused by the fact that `isValidBS` (regardless of its actual implementation as chosen by the preprocessor defines) only returns a boolean. That way, in the failure case, the error recovery code around it has no way of knowing where the faulty sequence starts.

If instead it would return an `Int` indicating the number of valid bytes found in the beginning, the error-recovery code would immediately know how many bytes can be used as-is for the decoded text, and where to start error recovery (like inserting replacement characters). It seems to me that the decoder used in `isValidBS` should already have this information (after all, it needs to know what part of the string it's currently processing), so maybe just exposing this would be an option.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Decoding of bytestrings with invalid UTF-8 has quadratic run time #495

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Length (Bytes)	Nanoseconds
10002	363155
20002	440389
30002	758905
40002	1278086
50002	2079332
60002	2975872
70002	4092472
80002	5322233
90002	6755837
100002	8353817
110002	10090751
120002	12221735
130002	14223879
140002	16317896
150002	19183989
160002	21774090
170002	24199864
180002	26894248
190002	30935749
200002	34761685
210002	37555264
220002	40420152
230002	45140731
240002	50637179
250002	54801430
260002	58834538
270002	63470996
280002	71390759
290002	73341154
300002	78781234
310002	87749863
320002	93538205
330002	96206465
340002	104967901
350002	112452576
360002	118074408
370002	124928813
380002	133200718
390002	139740339
400002	147652567
410002	157100436
420002	163282491
430002	171497972
440002	182591830
450002	191219840
460002	198834990
470002	211232776
480002	220616658
490002	228667562
500002	241215551
510002	249927112
520002	260673296
530002	272971252
540002	282285464
550002	292999742
560002	307085056
570002	319529945
580002	329887608
590002	342599272
600002	351595686
610002	367887033
620002	380861250
630002	393925018
640002	409024668
650002	423207592
660002	434496271
670002	450354392
680002	464690475
690002	476479403
700002	493291074
710002	506146007
720002	519209847
730002	538656736
740002	549421069
750002	568419934

Decoding of bytestrings with invalid UTF-8 has quadratic run time #495

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions