i like this idea, with one modification: report byte positions in hex, e.g. 0x23ab5. it is more obviously not a line number (since we conventionally use decimal for line numbers); it is already commonplace to use hex for byte offsets in hex editors, debuggers, and so on; and as a side benefit, in large files it will end up being more compact anyway.
for tools talking to tools, byte offsets are a clear win. but i would prefer to see tools reporting byte offsets in addition to line numbers as a default (with switches, if you like, to choose only line numbers or only byte offsets). because when a human has to be in the loop interpreting the result, line numbers have the convenience of being shorter (fewer errors in reading and typing them); they are easy to correlate by eye when an editor shows line numbers in the gutter (such as when having to make do with less-capable text editors that can't parse the output); and they are often easier to communicate to another human (such as when helping someone over screenshare, where referring to line numbers is valuable in a way that byte offsets cannot be).
A small benefit for line numbers is that if I don’t have my editor hooked up to read compiler messages, it’s easier to type in a line number than a byte position, especially if I’m not looking at the errors and the code side by side. Perhaps that speaks more to a workflow problem though…
It might be reasonable to consider reporting both a byte offset and a line/column number, especially for the transitional period.
Yes please. Give a byte offset to the editor to jump automatically, or for your `M-x goto-char`.
And give me, the human, line & column so if I want to, I can just scroll to the location manually. Also, it's much easier to type the line number than the byte offset.
Now that you've said it, it feels wrong that this isn't already standard. That said, maybe the other benefit of the line + column format is that it doesn't *require* support in the text editor, by which I mean it is more feasible for a human to navigate to a certain line and column than a byte position? (But for this to work in practice to any realistic degree, you need a text editor that displays line numbers, which seems if anything more complicated than a "jump to offset" capability.)
On a completely trivial note: when you said that "line 4201, column 14" would correspond to at least "byte 4215", are you sure (assuming these numbers are all indexed from 1) that it wouldn't be byte 4214?
Line 1, column 1 would correspond to byte 1, right? So line 2, column 3 would correspond to (at least) byte 4? I may not be thinking clearly; someone let me know if I'm wrong.
You are definitely right - I have made this mistake many times, in fact, and I am very happy that the byte positioning version doesn't have this error since it starts at byte 0 as god intended :)
[Edit: I have now updated the article with corrected numbers, but in doing the calculation, I guess I would say, it's byte 4213, not 4214, right? Because you _also_ subtract one from the column index, not just the line index. This kind of confusion is yet another reason not to use this scheme IMO!!]
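For anyone checking the arithmetic, here's a minimal sketch of the lower-bound calculation (assuming 1-indexed line/column, 0-indexed byte offsets, and at least one byte, the newline, per preceding line):

```c
#include <stdio.h>

// Lower bound on the 0-indexed byte offset of a 1-indexed (line, column):
// each of the (Line - 1) preceding lines contributes at least its newline
// byte, plus (Column - 1) bytes into the current line.
static unsigned long MinByteOffset(unsigned long Line, unsigned long Column)
{
    return (Line - 1) + (Column - 1);
}

int main(void)
{
    printf("%lu\n", MinByteOffset(4201, 14)); // prints 4213
    return 0;
}
```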
- Casey
That is what COBOL mainframe programs would do... they would report byte offsets. You were given the exact byte offset, then you had to look up a relative byte offset in a table to retrieve a relative line number in the file.
I can completely understand why line numbers are the standard. When you were working with printed code, it made plenty of sense to use them. Even though we don't need to do that anymore, the line number convention unfortunately stuck :(
- Casey
I've now read through the article and I hope I didn't miss anything. If I did, I'm about to make an ass of myself; oh well.
I often use UTF-16 or custom encodings for the in-memory representation in text/code editors, so trying to find the byte offset to point the user to is not very practical. Line/column is not a perfect solution either, but since I already have to deal with line-ending parsing and cluster breaks, it's easier to get there.
Despite the good intentions of the UTF-8 designers and their advice not to read or write invalid UTF-8 sequences, there are files out there that would make it quite terrible to find the right spot. That said, I haven't implemented it yet and it might be easier than I think (I've been accused of exaggeration once or twice before).
However, I can see how this gets extremely practical with tooling (especially for inherently binary files). But why stop there, and not have all tooling also offer a "mechanical" mode that returns any kind of logs or errors in a binary (or even JSON) format, so we don't have to implement a custom parser for each tool separately (I'm looking at you, Linux)?
Is UTF-8 really a problem? Since a UTF-8 code point encoding can only be a few bytes at most, isn't it just a case of searching a few bytes backwards? If there's an invalid sequence, there's no requirement to find the start of it anyway, since you can't display it.
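(For concreteness, a minimal sketch of that backward search, assuming well-formed UTF-8 where continuation bytes match the bit pattern 10xxxxxx:)

```c
#include <stddef.h>

// Step backwards from an arbitrary byte offset to the start of the
// enclosing UTF-8 sequence. Continuation bytes have the form 10xxxxxx,
// so at most 3 steps are needed for valid UTF-8.
static size_t Utf8SequenceStart(const unsigned char *Buffer, size_t Offset)
{
    while(Offset > 0 && (Buffer[Offset] & 0xC0) == 0x80)
    {
        --Offset;
    }
    return Offset;
}
```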
- Casey
The problem with UTF-8 is that it can encode the same <128 codepoint with up to 7 different UTF-8 sequences. So if you decode such a sequence to 0x0041 in UTF-16 for your processing format, you can't simply convert it back to a plain 0x41 in UTF-8, since you don't store the information about which of the variants it originally was in the file.
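(As an illustration of the round-trip loss: a hypothetical lenient decoder collapses the canonical one-byte 'A' and an invalid overlong two-byte form to the same codepoint, so re-encoding can't recover the original bytes:)

```c
// 'A' (U+0041) canonically:              0x41
// 'A' as an (invalid) overlong two-byte: 0xC1 0x81
// A lenient two-byte decode with no overlong check:
//   110xxxxx 10yyyyyy -> xxxxxyyyyyy
static unsigned int LenientDecode2(unsigned char B0, unsigned char B1)
{
    return ((B0 & 0x1Fu) << 6) | (B1 & 0x3Fu);
}
// LenientDecode2(0xC1, 0x81) == 0x41, indistinguishable from the
// canonical form after decoding; re-encoding always emits 0x41.
```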
Further, you can have files in obscure extended-ASCII encodings like Windows-1252, 357, 850 and KOI, which does happen, since some domain-specific hardware still uses these.
It's not just text. This happens whenever the processing format differs from the storage one: audio, video or images. The larger the data set, the more you'd want a different, more efficient processing format for your use case.
Obviously it makes sense to standardize your project-specific tooling on a single error reporting model, but I'm not sure it's a universal thing we can ask of every developer everywhere.
However, unless you have the errors fixed by a machine, it'll probably be a human looking at them, and knowing that a video has a problem at offset 0xabc883 won't help you unless you have another tool that converts that to "Frame 5, Rectangle [10, 10, 34, 44]".
This would imply that you were OK with a text editor which loads a file and then saves the same file, but produces a totally different file as a result. That seems like something that would already be a bug, does it not? It would certainly lead to some surprises for the user if they weren't expecting that...
- Casey
That actually happens even now. There are different normalization forms in Unicode (and of course Windows prefers different ones than Mac) that end up with a different binary representation, and there are nuances in UTF-8 encoding (like the ones I mentioned). It's rather rare that editors detect and keep these. The only way forward is to either never use any character beyond value 127 (which is impossible when you deal with translation data), or globally agree on all of these across all the editors people use (at least within your organization).
It happens with PNGs and JPEGs and everything else too; there's extra information (EXIF and similar) that gets chopped off depending on which library you use. For instance, opening and saving PNGs via stb_image/_write will never look back (the API doesn't even have a way of passing the data through). Even if we did pass the data, we'd have to do extra work to maintain the binary representation.
Hi, hope all is well.
I have a question, can you clarify what you mean by this? "The problem with UTF-8 is that it can encode the same <128 codepoint with up to 7 different UTF-8 sequences".
I'm assuming you aren't talking about overlong encodings, because overlong encodings are always invalid.
And in your next reply you talk about Unicode normalization, but normalization isn't a UTF-8 problem, as it can happen in UTF-16 as well.
So I'm curious as to what you're referencing, cheers!
In the first instance I meant overlong sequences in UTF-8. Most writers out there will write the shortest form, but there's an endless number of readers (meaning libraries) that have no problem reading the overlong ones (which, sure, is incorrect by the book). I myself implemented a bunch of UTF-8 readers that were able to handle corrupted files, using many others for reference. Hence you can end up with those quite easily. What the spec says is valid is one thing; what's out there in actual files is another. (Yet another funny part is the BOM; hopefully solved by now, but I'm sure you'd find UTF-8 with a BOM out there too.)
In the second instance we discussed how the binary representation of a file can change just by loading and saving it. In the case of a text file, editors tend to convert the file to some internal representation, which not only means a different encoding but can also re-interpret the file. Often you might not even have a practical chance to avoid it if you're relying on native strings. This can happen with line endings, indentation, and Unicode normalization.
Hope this makes it clear, and I'm sorry for the confusion.
Sadly I'm quite busy with other things, but it'd be nice to actually check which editors accept invalid UTF-8, which do normalization, etc. I've always been interested in how text editors are implemented, and I like to think I speak from experience here, but it would be much more valuable to get actual data on this.
However, the change in binary form happens with other file formats too (images, videos, etc.), and I doubt paying customers (and hence the companies) actually care about binary consistency (sadly).
Ahh, I see what you mean.
First of all, thank you for taking the time to reply and explain!
On overlong encodings:
What I do/plan to do in my next text editor is just to leave them as-is, treat them as an invalid sequence, and display them as such (U+FFFD, or preferably by using an inverted question mark in my terminal, as U+FFFD is technically a valid character).
Mainly because they might not actually be overlong encodings, but corruption (bit flips) that happened to look like an overlong encoding. That should be fairly rare, but it's not impossible; most likely it's just another application that messed up the encoding.
You could use heuristics to figure out what went wrong and write code to correct it back to what codepoint it most likely was, but I think it's better to leave such an error as-is, and let someone with a hex editor and knowledge about the document (and its intentions) figure out what went wrong, rather than blindly make assumptions.
In fact, I prefer when editors don't make changes without being asked to do so.
If I want it to convert indentation or line endings, I'll tell it via a command or something, but my preference is that it should just use the document as-is, without any further changes. It can, however, alert me to inconsistencies or to how the text is stored (like line endings).
I don't prefer this because of this article and how it would make byte offsets in error messages work; I prefer it because it respects my intentions. If I open a file as a text file, it should interpret the file as text, as-is, even if it gets displayed wrong (because maybe it isn't a normal text file).
The same goes for Unicode normalization: only do it if asked to. This is also more consistent with what hex editors do, and since I want to embed a basic hex editor into my editor, the two would play nicely with each other.
On the note of the BOM, I actually have come across UTF-8 documents with a BOM, notably a year or two ago with Visual Studio. I had to wrangle with VS after it insisted on saving my files with Unicode comments in codepages (yes, really). Once I told it to just use UTF-8, it did... and it added a BOM... But such is to be expected from VS, where nothing is ever easy, clean, or sensible :')
Edit Addendum:
I want to add that the reasoning behind my decisions is just my preference, not something I believe to be objectively the best way to go about things. While I prefer things to work this way, there are plenty of situations where my preferences here are just wrong. That's why I want to do this for my own editor, not for something aimed at others.
Hey, I'm on Discord; happy to talk more there if you want, so we don't pollute Casey's comments. ;)
The UTF-8 correction thing can be dangerous; it's recommended to reject any incorrectly encoded UTF-8. It's mostly impossible to write UTF-8 incorrectly by accident, so many of those cases should be treated as intentional attempts at breaching security. Some of this is covered here: https://unicode.org/reports/tr36/
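(As a sketch of the strict approach, here's what a minimal-length check for the two-byte case might look like; it rejects overlong forms like 0xC1 0x81 outright:)

```c
#include <stdbool.h>

// Strict validation of a two-byte UTF-8 sequence: leading byte must be
// 110xxxxx, continuation byte 10xxxxxx, and the decoded codepoint must
// actually need two bytes (>= 0x80), which rules out overlong forms.
static bool IsValidUtf8Pair(unsigned char B0, unsigned char B1)
{
    if((B0 & 0xE0) != 0xC0 || (B1 & 0xC0) != 0x80) return false;
    unsigned int Codepoint = ((B0 & 0x1Fu) << 6) | (B1 & 0x3Fu);
    return Codepoint >= 0x80; // overlong if it would fit in one byte
}
```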
It's for sure good to provide a service for the users, but I was thinking mostly about a separate tool and/or "workflow" initiated by the editor that would correct the UTF-8 and re-open it.
> In fact, I prefer when editors don't make changes without being asked to do so.
I prefer this too. That's why my latest iteration was using UTF-8 internally and experimenting with buffers that would let you mem-map files. However, I'd probably run a validator on the file first, to avoid having to deal with weirdness in the code, and to inform the user that they are trying to edit an incorrect UTF-8 file, which may lead to more changes.
It's also perfectly doable; I got quite far without too much code. I basically just separated the storage form from the editing/visual one.
However, as with everything, there are other editors producing the files, and the more inconsistencies they produce, the more you have to deal with. That can lead to overly complex code, and the more code you have, the more buggy and possibly slow you are.
Generally, having multiple strategies would help (based on file size, invalid sequences, inconsistent normalization/line-endings/indentation). Most of the files people edit will be smallish (< 64 kB), correctly encoded, and consistent. But from time to time you'll get invalid ones (for example, you yourself editing an intentionally broken file you use for testing), or overly large ones (several GBs), etc.
In any case, fingers crossed for your next editor! Sounds like you know your stuff and care about the right things.
When you mention your "potential downside", I don't get why end of line is a real problem. Are you talking about the situation where more than one file is embedded? (as is the case most of the time)
Line 12000, column 14 on Linux could be byte ~150,000 while being byte ~162,000 on Windows, because CRLF line endings add one extra byte per line (roughly 12,000 extra bytes by line 12,000). That can leave byte locations so far off that they fall out of scope (in large files). I think that's what he means. I'm not sure what you're referring to with "more than one file is embedded", though.
My bad, "end of line" was, in my mind, for whatever reason the end of the file... It's obvious now.
I am currently writing a text editor based on some of the lessons I've learned from this course, and I have come across this exact problem with line & column reporting, but I've also found that reporting a "column" can be ambiguous in another, annoying way.
Say I get an error in my source code on line 411, column 78. How were the tabs counted toward the reported column number? Well, I use an indentation width of 8, but the compiler doesn't know that (and won't make assumptions), so it counts each tab as a width of 1.
That's fine, reasonable, and understandable. Unfortunately, since my editor isn't done, I'm using another editor... and it does _not_ count tabs as a column width of 1; rather, it respects my settings and uses a width of 8 (that is, unless I select my text, in which case it counts codepoints... not bytes, nor columns... codepoints. Not sure why).
So any time I have an error and I need to trace to the column, I have to replace the tabs with spaces, or manually count/correct the column number given to me. This is silly, and just giving me an index into the file would have been a _lot_ easier.
In fact, for my own editor, my code for converting an index into a line number and column is trivial and fast, so jumping to an index is almost zero work.
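(For what it's worth, a minimal sketch of that conversion, assuming a plain byte buffer, LF line endings, and 1-indexed results; real code would also handle CRLF and multi-byte columns:)

```c
#include <stddef.h>

// Convert a byte offset into a 1-indexed (line, column) by scanning
// for newlines; O(n), but trivially fast for typical file sizes.
static void OffsetToLineColumn(const char *Buffer, size_t Offset,
                               size_t *Line, size_t *Column)
{
    size_t LineNumber = 1;
    size_t LineStart = 0;
    for(size_t Index = 0; Index < Offset; ++Index)
    {
        if(Buffer[Index] == '\n')
        {
            ++LineNumber;
            LineStart = Index + 1;
        }
    }
    *Line = LineNumber;
    *Column = (Offset - LineStart) + 1;
}
```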
I always interpret this as "number of valid caret positions", or in Unicode vernacular, "offset of the grapheme cluster break". It has to be independent of your visual representation and closer to a logical position that everyone can count the same way. So the editor you're using in the meantime is not doing it correctly.