From f9eda5ab28f1c18be178505773aaf951d5762a19 Mon Sep 17 00:00:00 2001 From: Price Hiller Date: Fri, 1 Mar 2024 11:26:53 -0600 Subject: [PATCH] feat(post): add dsa blog post --- posts/dsa-loop-midterm-question-analysis.md | 358 ++++++++++++++++++++ 1 file changed, 358 insertions(+) create mode 100644 posts/dsa-loop-midterm-question-analysis.md diff --git a/posts/dsa-loop-midterm-question-analysis.md b/posts/dsa-loop-midterm-question-analysis.md new file mode 100644 index 0000000..97f8631 --- /dev/null +++ b/posts/dsa-loop-midterm-question-analysis.md @@ -0,0 +1,358 @@ +--- +name: Proving a Data Structures and Algorithms Midterm Question Wrong +summary: An anthill to die upon. +tags: + - college + - datastructures + - c +published: 2024-03-01 +updated: 2024-03-01 +--- + +# Context + +I'm currently enrolled at a university working towards a Computer Science degree. As part of my degree plan I have to +take a Data Structures and Algorithms (DSA) course. + +At the time of writing we had our midterm yesterday morning, and I missed marks on a question that I know for a fact had +no correct answer. This post is an analysis as to _why_ the question's multiple choice answers were all incorrect. + +I'm writing this knowing I'm being a bit of a pedant and knowing that it's _only_ a single question. But for whatever +reason this keeps circling around in my mind throwing around all the chairs and spray painting gang markers on the +walls. As such, I'm writing to get the poltergeist out. + +# The Question + +Here's roughly the question that was asked: + +> Given the following code: +> +> ```c +> #include +> +> int main(int argc, char *argv[]) { +> int arr[2][2] = {{1, 2}, {3, 4}}; +> for (int i = 0; i < 2; i++) { +> for (int j = 0; j < 2; j--) { +> printf("arr[%d][%d] = %d, ", i, j, arr[i][j]); +> } +> } +> } +> ``` +> +> What will the program do? +> +> **a.)** There will be a warning\ +> **b.)** Infinitely loop\ +> **c.)** Output: `arr[0][0] = 1, arr[0][1] = 2, arr[1][0] = 3, arr[1][1] = 4, `\ +> **d.)** Output: `[0][-1] = 1, arr[0][-2] = 2, arr[1][-1] = 3, arr[1][-2] = 4, ` + +I chose `a`, after realizing that's the _closest_ answer to being correct. Unfortunately this question is asking about +the most horrid thing on Earth and a large reason languages like Rust exist β€” **undefined behavior** (UB)[^1]. + +Answer `a`, of course, is not correct. The exam had answer `b` as correct, which is also wrong. In fact, none of the +above answers are correct. + +# Why is the above code undefined behavior (UB)? + +The code given to us has an out-of-bounds array access. Take a look at the `for` loops. The first `for` loop is all +good, no problem there. The inner one, however, is our problem child. + +```c +int arr[2][2] = {{1, 2}, {3, 4}}; +for (int i = 0; i < 2; i++) { + for (int j = 0; j < 2; j--) { // This line is bad! + // Eventually the sub-indexing of j will be out of bounds! + printf("arr[%d][%d] = %d, ", i, j, arr[i][j]); + } +} +``` + +The inner `for` loop decrements the variable `j` and will continue looping until `j` is not less than 2 (so loop until +`j` is more than or equal to 2). This means the array will eventually be indexed by whatever integer is stored into `j` +and since `j` is decremented from 0 it will be negative. + +As a result the array will be sub-indexed by a negative number, which is outside the bounds of the array. This is our +undefined behavior. + +Now that we've identified the issue with this problem let's get into why none of the provided answers are correct. + +# Why is answer `a` wrong? + +To save you some scrolling, answer `a` was `There will be a warning`. In this course, questions have been asked with +identical (or extremely close at least) answers like `a` which were interested in compiler behavior. + +When I was marked wrong on this answer, I was actually quite surprised. Even more so when I recreated the identical code +and attempted to compile it with `gcc` and got the following compilation _error_: + +``` +t.c: In function β€˜main’: +t.c:7:13: error: iteration 1 invokes undefined behavior [-Werror=aggressive-loop-optimizations] + 7 | printf("arr[%d][%d] = %d, ", i, j, arr[i][j]); + | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +t.c:6:27: note: within this loop + 6 | for (int j = 0; j < 2; j--) { + | ~~^~~ +cc1: all warnings being treated as errors +``` + +It seemed apparent to me that the question had no right answers. Imagine my surprise then when the professor compiled +the same code in an online compiler and received neither an error or warning. That made me lose a bit of sanity and a +few hours later I realized the culprit, I have a _gat dang_ alias hiding out in my `zsh` config ruining property values: + +```zsh +alias gcc="gcc -Werror -Wall -Wpedantic -Warray-bounds -O3" +``` + +I bring this up because I spent perhaps a few too many hours trying to figure out what the heck was going on. A simple +`command -v gcc` would've found it _immediately_, but I was too dumb to check that for quite some time 😞. + +I also bring it up because my problem with a difference in compiler settings being a problem _is_ a problem. By having +potential answers be about compiler behavior the question immediately invokes a _ton_ of implicit compiler configuration +and flags that were (to my knowledge) never clearly stated in my course. The problem then is no longer about DSA β€” +instead it's about knowing the quirks of a compiler that may have differences on every single system and configurations +out there. + +Ok, but let's just assume we're rocking standard `gcc` with no flags, and no configuration elsewhere. Thus, there +shouldn't be any compiler warnings or errors. With that caveat in mind, answer `a` can't be right. + +# Why are answers `c` & `d` wrong? + +I'm lumping these two together because they're wrong for similar reasons and I figure I should cover all of my bases. + +Answers `c` & `d` were: + +> **c.)** Output: `arr[0][0] = 1, arr[0][1] = 2, arr[1][0] = 3, arr[1][1] = 4, `\ +> **d.)** Output: `[0][-1] = 1, arr[0][-2] = 2, arr[1][-1] = 3, arr[1][-2] = 4, ` + +`c` is wrong because the inner loop is being continuously decremented. In theory the sub-indexing integer `j` should _never_ get a positive value. Purely _in theory_. + +`d` is wrong because by indexing negatively into the array, the output values are incredibly unlikely to match the +actual values stored in the array; on top of that, indexing negatively into an array in C invokes UB, making this +necessarily wrong anyhow. + +# Why is the supposedly correct answer, `b`, also wrong? + +Answer `b` stated that "**the program will infinitely loop**". + +I'll paste the code again here for your reference so you don't have to scroll back up to see it. + +```c +#include + +int main(int argc, char *argv[]) { + int arr[2][2] = {{1, 2}, {3, 4}}; + for (int i = 0; i < 2; i++) { + for (int j = 0; j < 2; j--) { + printf("arr[%d][%d] = %d, ", i, j, arr[i][j]); + } + } +} +``` + +On the surface level it certainly seems this code would infinitely loop. Our inner loop integer `j` is always being set +more and more negatively and it appears it will never be more than two. After all, subtracting one from a negative will +still leave you with a negative. + +There are two reasons that come to my mind at a glance why the above code actually won't run until the Sun swallows the +Earth and time becomes meaningless. + +## Why `b` is Wrong, Part Uno + +The first reason that code won't run infinitely is the realities of undefined behavior. + +On my laptop running Linux, kernel version 6.7.6 at the time of writing, the code above will have a segmentation fault +within the first few seconds of running. The Linux kernel actually has quite a few built-in security features and it +won't let my code willy nilly access random addresses which sometimes happens with the negative indexing. + +Outside of the kernel security mechanisms, since we're in UB land who's to say what will actually happen as its +_undefined_. So at best saying it runs forever is a guess β€” not actually something that can be proven easily. + +## Why `b` is Wrong, Part Dos + +The second reason is the much stronger of the two. Let's, the for the sake of argument, say the above code never has a +segmentation fault. It will always be able to decrement `j` and sub-index into the array. + +Even with these concessions, the program will still not run infinitely. + +To understand why, we have to dive into how integers work in C. + +### ISO/IEC 9899:2023 Purgatory [(Oh the misery)](https://youtu.be/vW23W0aDCjQ?t=32) + +So let's go see what the actual published standard has to say on what an integer is. If we go on over to +[open-std.org](https://open-std.org) and go check out the latest published draft of "ISO/IEC 9899 - Revision of the C +standard" we end up with this [big ol' document](https://open-std.org/JTC1/SC22/WG14/www/docs/n3096.pdf). Crack 'er open +and head down to page 41, section `6.2.6.2` (fun to pronounce as "sixty-two sixty-two"). + +An integer, according to the standard, can be _signed_ or _unsigned_. Since the integer we're interested in, `j`, is a +signed integer, let's go through how a signed integer is standardized as in C. + +> For signed integer types, the bits of the object representation shall be divided into three groups: value bits, +> padding bits, and the sign bit. ... the signed type uses the same number of _N_ bits as values bits and sign bit. _N - +> 1_ are value bits and the remaining bit is the sign bit. ... If the sign bit is zero, it shall not affect the +> resulting value. If the sign bit is one, it has value `-(2^(n-1))`. + +Note one more thing, the standard does _not_ define the actual size of an integer. That's left to each implementation. +Since we're using the `gcc` compiler, let's see what the [`gcc` docs have to say on the size of a +integer](https://www.gnu.org/software/gnu-c-manual/gnu-c-manual.html#Integer-Types). + +> The integer data types range in size from at least 8 bits to at least 32 bits. The C99 standard extends this range to +> include integer sizes of at least 64 bits. You should use integer types for storing whole number values (and the char +> data type for storing characters). The sizes and ranges listed for these types are minimums; depending on your +> computer platform, these sizes and ranges may be larger. + +Ah... that's, uhhh, a pain β€” it's gonna vary by architecture. + +On my system, integers get 32 bits when using `gcc` and thus can represent `4,294,967,295` distinct numbers. Remember, +the integer we're working with is signed, thus half of those numbers will be negative. That means the largest and +smallest numbers my integers can be are `2,147,483,647` and `-2,147,483,648` respectively. + +Hmmm... what happens when we exceed either of those numbers? Let's find out. + +```c +#include +#include +#include + +int main() +{ + printf("Starting number is: %d\n", INT_MIN); + printf("New number is: %d\n", INT_MIN - 1); + return EXIT_SUCCESS; +} +``` + +`INT_MIN` in this case is the smallest integer representable. + +`gcc` warns about an "integer overflow"... interesting. + +``` +❯ gcc program.c + +t.c: In function β€˜main’: +t.c:8:43: warning: integer overflow in expression of type β€˜int’ results in β€˜2147483647’ [-Woverflow] + 8 | printf("New number is: %d\n", INT_MIN - 1); + | ^ +``` + +Well let's run it and see what the output is. + +``` +❯ ./a.out + +Starting number is: -2147483648 +New number is: 2147483647 +``` + +Oh, huh. Neat. + +So if we exceed the bounds the sign flips, literally an overflow or, in this case, an underflow. + +### Actually for real this time why `b` is wrong + +With the overflow and underflow behavior now covered, let's take a look back at the code given to me in class: + +```c +#include + +int main(int argc, char *argv[]) { + int arr[2][2] = {{1, 2}, {3, 4}}; + for (int i = 0; i < 2; i++) { + for (int j = 0; j < 2; j--) { + printf("arr[%d][%d] = %d, ", i, j, arr[i][j]); + } + } +} +``` + +Now with everything we've covered let's go point by point: + +1. So long as `j` is less than 2 this will continue to run. +2. Integer `j` will be decremented for every execution of the inner loop. +3. Integer types have a maximum capacity and if we exceed that capacity the sign changes. +4. Eventually, `j` will reach the smallest value an integer can represent and on the next loop it will underflow and + become a massively positive number. +5. Integer `j` will then eventually be more than 2 in value and thus the condition `j < 2` will be invalidated and the loop + will end. + +To prove the code won't infinitely loop, let's actually run the code with a few small modifications to speed things up: + +```c +#include +#include + +int main(int argc, char *argv[]) { + int arr[2][2] = {{1, 2}, {3, 4}}; + // Move j's initializer up here so we can access it after the loop. + // Also, set j to `INT_MIN` so our loop doesn't have to decrement + // 2147483648 times before the theorized underflow occurs which + // just cuts down on how long it takes to underflow. + int j = INT_MIN; + for (int i = 0; i < 2; i++) { + for (; j < 2; j--) { + // Comment out the arr access to avoid seg faults + // printf("arr[%d][%d] = %d, ", i, j, arr[i][j]); + printf("J -> %d\n", j); + } + } + // If the loop isn't infinite, we should see this message + printf("THE LOOP WAS NOT INFINITE, J VALUE: %d\n", j); +} +``` + +Running the above outputs: + +``` +❯ gcc prog.c && ./a.out + +J -> -2147483648 +THE LOOP WAS NOT INFINITE, J VALUE: 2147483647 +``` + +Y u p. That's underflow for sure. + +We can clearly see where `j` was the smallest possible integer and then wrapped around to the largest possible integer +and in doing so caused the inner loop to end. + +Thus, answer `b` is also wrong. + +# A quick summary + +As a result of all the above points, we can readily see that the potential answers we could pick from are all wrong. + +> **a.)** There will be a warning \ +> **b.)** Infinitely loop <- An integer underflow will eventually occur causing the looping to end \ +> **c.)** Output: `arr[0][0] = 1, arr[0][1] = 2, arr[1][0] = 3, arr[1][1] = 4, ` \ +> **d.)** Output: `[0][-1] = 1, arr[0][-2] = 2, arr[1][-1] = 3, arr[1][-2] = 4, ` + +A quick summary of why each is wrong: + +**a.)** We won't get a warning unless we specify compiler flags/config for `gcc`. \ +**b.)** An integer underflow will occur causing the looping to end and thus the program has a finite execution period, +and also the program will almost certainly have a segmentation fault causing the looping to end. \ +**c.)** `j` will never sub-index into 1 or 2. \ +**d.)** Indexing `j` negatively invokes undefined behavior, so we can't know the output. As well as the fact that the +pattern for accessing the first index is wrong anyhow. + +# Is My Pedantry (πŸ€“) Satisfied Now? + +Somewhat, to be wholly honest I am still annoyed I got marked off on that question. At least now I can point somewhere +and say "See, look! I was right!" and then stick my fingers in my ears and cry in the corner. + +I actually skipped how integers _really_ work under the hood in terms of their binary representation. If you're +interested, here's the binary of the earlier underflow code example: + +``` +Starting number is: -2147483648 | Binary: 10000000000000000000000000000000 + New number is: 2147483647 | Binary: 01111111111111111111111111111111 +``` + +That opens up a whole 'nother can of worms of _two's complement_ and actually how the bits are stored and manipulated. +A deeper look at why the underflow behavior occurs through how integers are added at the binary level can also be used +to disprove answer `b`. I cut a full break down of this out, because I realized it was somewhat irrelevant. The quick +integer whole number example makes the same point without nearly as much technicality. + +The main takeaway: anytime you end up in undefined behavior land with C things get wacky real quick and all numeric +types are actually black magic pretending to be approachable. + +[^1]: If you don't know what undefined behavior is, Wikipedia has an article on it [here](https://en.wikipedia.org/wiki/Undefined_behavior).