feat(post): add dsa blog post

This commit is contained in:
Price Hiller 2024-03-01 11:26:53 -06:00
parent d068f69b42
commit f9eda5ab28
Signed by: Price
GPG Key ID: C3FADDE7A8534BEB

View File

@ -0,0 +1,358 @@
---
name: Proving a Data Structures and Algorithms Midterm Question Wrong
summary: An anthill to die upon.
tags:
- college
- datastructures
- c
published: 2024-03-01
updated: 2024-03-01
---
# Context
I'm currently enrolled at a university working towards a Computer Science degree. As part of my degree plan I have to
take a Data Structures and Algorithms (DSA) course.
At the time of writing we had our midterm yesterday morning, and I missed marks on a question that I know for a fact had
no correct answer. This post is an analysis as to _why_ the question's multiple choice answers were all incorrect.
I'm writing this knowing I'm being a bit of a pedant and knowing that it's _only_ a single question. But for whatever
reason this keeps circling around in my mind throwing around all the chairs and spray painting gang markers on the
walls. As such, I'm writing to get the poltergeist out.
# The Question
Here's roughly the question that was asked:
> Given the following code:
>
> ```c
> #include <stdio.h>
>
> int main(int argc, char *argv[]) {
> int arr[2][2] = {{1, 2}, {3, 4}};
> for (int i = 0; i < 2; i++) {
> for (int j = 0; j < 2; j--) {
> printf("arr[%d][%d] = %d, ", i, j, arr[i][j]);
> }
> }
> }
> ```
>
> What will the program do?
>
> **a.)** There will be a warning\
> **b.)** Infinitely loop\
> **c.)** Output: `arr[0][0] = 1, arr[0][1] = 2, arr[1][0] = 3, arr[1][1] = 4, `\
> **d.)** Output: `[0][-1] = 1, arr[0][-2] = 2, arr[1][-1] = 3, arr[1][-2] = 4, `
I chose `a`, after realizing that's the _closest_ answer to being correct. Unfortunately this question is asking about
the most horrid thing on Earth and a large reason languages like Rust exist — **undefined behavior** (UB)[^1].
Answer `a`, of course, is not correct. The exam had answer `b` as correct, which is also wrong. In fact, none of the
above answers are correct.
# Why is the above code undefined behavior (UB)?
The code given to us has an out-of-bounds array access. Take a look at the `for` loops. The first `for` loop is all
good, no problem there. The inner one, however, is our problem child.
```c
int arr[2][2] = {{1, 2}, {3, 4}};
for (int i = 0; i < 2; i++) {
for (int j = 0; j < 2; j--) { // This line is bad!
// Eventually the sub-indexing of j will be out of bounds!
printf("arr[%d][%d] = %d, ", i, j, arr[i][j]);
}
}
```
The inner `for` loop decrements the variable `j` and will continue looping until `j` is not less than 2 (so loop until
`j` is more than or equal to 2). This means the array will eventually be indexed by whatever integer is stored into `j`
and since `j` is decremented from 0 it will be negative.
As a result the array will be sub-indexed by a negative number, which is outside the bounds of the array. This is our
undefined behavior.
Now that we've identified the issue with this problem let's get into why none of the provided answers are correct.
# Why is answer `a` wrong?
To save you some scrolling, answer `a` was `There will be a warning`. In this course, questions have been asked with
identical (or extremely close at least) answers like `a` which were interested in compiler behavior.
When I was marked wrong on this answer, I was actually quite surprised. Even more so when I recreated the identical code
and attempted to compile it with `gcc` and got the following compilation _error_:
```
t.c: In function main:
t.c:7:13: error: iteration 1 invokes undefined behavior [-Werror=aggressive-loop-optimizations]
7 | printf("arr[%d][%d] = %d, ", i, j, arr[i][j]);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
t.c:6:27: note: within this loop
6 | for (int j = 0; j < 2; j--) {
| ~~^~~
cc1: all warnings being treated as errors
```
It seemed apparent to me that the question had no right answers. Imagine my surprise then when the professor compiled
the same code in an online compiler and received neither an error or warning. That made me lose a bit of sanity and a
few hours later I realized the culprit, I have a _gat dang_ alias hiding out in my `zsh` config ruining property values:
```zsh
alias gcc="gcc -Werror -Wall -Wpedantic -Warray-bounds -O3"
```
I bring this up because I spent perhaps a few too many hours trying to figure out what the heck was going on. A simple
`command -v gcc` would've found it _immediately_, but I was too dumb to check that for quite some time 😞.
I also bring it up because my problem with a difference in compiler settings being a problem _is_ a problem. By having
potential answers be about compiler behavior the question immediately invokes a _ton_ of implicit compiler configuration
and flags that were (to my knowledge) never clearly stated in my course. The problem then is no longer about DSA —
instead it's about knowing the quirks of a compiler that may have differences on every single system and configurations
out there.
Ok, but let's just assume we're rocking standard `gcc` with no flags, and no configuration elsewhere. Thus, there
shouldn't be any compiler warnings or errors. With that caveat in mind, answer `a` can't be right.
# Why are answers `c` & `d` wrong?
I'm lumping these two together because they're wrong for similar reasons and I figure I should cover all of my bases.
Answers `c` & `d` were:
> **c.)** Output: `arr[0][0] = 1, arr[0][1] = 2, arr[1][0] = 3, arr[1][1] = 4, `\
> **d.)** Output: `[0][-1] = 1, arr[0][-2] = 2, arr[1][-1] = 3, arr[1][-2] = 4, `
`c` is wrong because the inner loop is being continuously decremented. In theory the sub-indexing integer `j` should _never_ get a positive value. Purely _in theory_.
`d` is wrong because by indexing negatively into the array, the output values are incredibly unlikely to match the
actual values stored in the array; on top of that, indexing negatively into an array in C invokes UB, making this
necessarily wrong anyhow.
# Why is the supposedly correct answer, `b`, also wrong?
Answer `b` stated that "**the program will infinitely loop**".
I'll paste the code again here for your reference so you don't have to scroll back up to see it.
```c
#include <stdio.h>
int main(int argc, char *argv[]) {
int arr[2][2] = {{1, 2}, {3, 4}};
for (int i = 0; i < 2; i++) {
for (int j = 0; j < 2; j--) {
printf("arr[%d][%d] = %d, ", i, j, arr[i][j]);
}
}
}
```
On the surface level it certainly seems this code would infinitely loop. Our inner loop integer `j` is always being set
more and more negatively and it appears it will never be more than two. After all, subtracting one from a negative will
still leave you with a negative.
There are two reasons that come to my mind at a glance why the above code actually won't run until the Sun swallows the
Earth and time becomes meaningless.
## Why `b` is Wrong, Part Uno
The first reason that code won't run infinitely is the realities of undefined behavior.
On my laptop running Linux, kernel version 6.7.6 at the time of writing, the code above will have a segmentation fault
within the first few seconds of running. The Linux kernel actually has quite a few built-in security features and it
won't let my code willy nilly access random addresses which sometimes happens with the negative indexing.
Outside of the kernel security mechanisms, since we're in UB land who's to say what will actually happen as its
_undefined_. So at best saying it runs forever is a guess — not actually something that can be proven easily.
## Why `b` is Wrong, Part Dos
The second reason is the much stronger of the two. Let's, the for the sake of argument, say the above code never has a
segmentation fault. It will always be able to decrement `j` and sub-index into the array.
Even with these concessions, the program will still not run infinitely.
To understand why, we have to dive into how integers work in C.
### ISO/IEC 9899:2023 Purgatory [(Oh the misery)](https://youtu.be/vW23W0aDCjQ?t=32)
So let's go see what the actual published standard has to say on what an integer is. If we go on over to
[open-std.org](https://open-std.org) and go check out the latest published draft of "ISO/IEC 9899 - Revision of the C
standard" we end up with this [big ol' document](https://open-std.org/JTC1/SC22/WG14/www/docs/n3096.pdf). Crack 'er open
and head down to page 41, section `6.2.6.2` (fun to pronounce as "sixty-two sixty-two").
An integer, according to the standard, can be _signed_ or _unsigned_. Since the integer we're interested in, `j`, is a
signed integer, let's go through how a signed integer is standardized as in C.
> For signed integer types, the bits of the object representation shall be divided into three groups: value bits,
> padding bits, and the sign bit. ... the signed type uses the same number of _N_ bits as values bits and sign bit. _N -
> 1_ are value bits and the remaining bit is the sign bit. ... If the sign bit is zero, it shall not affect the
> resulting value. If the sign bit is one, it has value `-(2^(n-1))`.
Note one more thing, the standard does _not_ define the actual size of an integer. That's left to each implementation.
Since we're using the `gcc` compiler, let's see what the [`gcc` docs have to say on the size of a
integer](https://www.gnu.org/software/gnu-c-manual/gnu-c-manual.html#Integer-Types).
> The integer data types range in size from at least 8 bits to at least 32 bits. The C99 standard extends this range to
> include integer sizes of at least 64 bits. You should use integer types for storing whole number values (and the char
> data type for storing characters). The sizes and ranges listed for these types are minimums; depending on your
> computer platform, these sizes and ranges may be larger.
Ah... that's, uhhh, a pain — it's gonna vary by architecture.
On my system, integers get 32 bits when using `gcc` and thus can represent `4,294,967,295` distinct numbers. Remember,
the integer we're working with is signed, thus half of those numbers will be negative. That means the largest and
smallest numbers my integers can be are `2,147,483,647` and `-2,147,483,648` respectively.
Hmmm... what happens when we exceed either of those numbers? Let's find out.
```c
#include <stdio.h>
#include <limits.h>
#include <stdlib.h>
int main()
{
printf("Starting number is: %d\n", INT_MIN);
printf("New number is: %d\n", INT_MIN - 1);
return EXIT_SUCCESS;
}
```
`INT_MIN` in this case is the smallest integer representable.
`gcc` warns about an "integer overflow"... interesting.
```
gcc program.c
t.c: In function main:
t.c:8:43: warning: integer overflow in expression of type int results in 2147483647 [-Woverflow]
8 | printf("New number is: %d\n", INT_MIN - 1);
| ^
```
Well let's run it and see what the output is.
```
./a.out
Starting number is: -2147483648
New number is: 2147483647
```
Oh, huh. Neat.
So if we exceed the bounds the sign flips, literally an overflow or, in this case, an underflow.
### Actually for real this time why `b` is wrong
With the overflow and underflow behavior now covered, let's take a look back at the code given to me in class:
```c
#include <stdio.h>
int main(int argc, char *argv[]) {
int arr[2][2] = {{1, 2}, {3, 4}};
for (int i = 0; i < 2; i++) {
for (int j = 0; j < 2; j--) {
printf("arr[%d][%d] = %d, ", i, j, arr[i][j]);
}
}
}
```
Now with everything we've covered let's go point by point:
1. So long as `j` is less than 2 this will continue to run.
2. Integer `j` will be decremented for every execution of the inner loop.
3. Integer types have a maximum capacity and if we exceed that capacity the sign changes.
4. Eventually, `j` will reach the smallest value an integer can represent and on the next loop it will underflow and
become a massively positive number.
5. Integer `j` will then eventually be more than 2 in value and thus the condition `j < 2` will be invalidated and the loop
will end.
To prove the code won't infinitely loop, let's actually run the code with a few small modifications to speed things up:
```c
#include <limits.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
int arr[2][2] = {{1, 2}, {3, 4}};
// Move j's initializer up here so we can access it after the loop.
// Also, set j to `INT_MIN` so our loop doesn't have to decrement
// 2147483648 times before the theorized underflow occurs which
// just cuts down on how long it takes to underflow.
int j = INT_MIN;
for (int i = 0; i < 2; i++) {
for (; j < 2; j--) {
// Comment out the arr access to avoid seg faults
// printf("arr[%d][%d] = %d, ", i, j, arr[i][j]);
printf("J -> %d\n", j);
}
}
// If the loop isn't infinite, we should see this message
printf("THE LOOP WAS NOT INFINITE, J VALUE: %d\n", j);
}
```
Running the above outputs:
```
gcc prog.c && ./a.out
J -> -2147483648
THE LOOP WAS NOT INFINITE, J VALUE: 2147483647
```
Y u p. That's underflow for sure.
We can clearly see where `j` was the smallest possible integer and then wrapped around to the largest possible integer
and in doing so caused the inner loop to end.
Thus, answer `b` is also wrong.
# A quick summary
As a result of all the above points, we can readily see that the potential answers we could pick from are all wrong.
> **a.)** There will be a warning \
> **b.)** Infinitely loop <- An integer underflow will eventually occur causing the looping to end \
> **c.)** Output: `arr[0][0] = 1, arr[0][1] = 2, arr[1][0] = 3, arr[1][1] = 4, ` \
> **d.)** Output: `[0][-1] = 1, arr[0][-2] = 2, arr[1][-1] = 3, arr[1][-2] = 4, `
A quick summary of why each is wrong:
**a.)** We won't get a warning unless we specify compiler flags/config for `gcc`. \
**b.)** An integer underflow will occur causing the looping to end and thus the program has a finite execution period,
and also the program will almost certainly have a segmentation fault causing the looping to end. \
**c.)** `j` will never sub-index into 1 or 2. \
**d.)** Indexing `j` negatively invokes undefined behavior, so we can't know the output. As well as the fact that the
pattern for accessing the first index is wrong anyhow.
# Is My Pedantry (🤓) Satisfied Now?
Somewhat, to be wholly honest I am still annoyed I got marked off on that question. At least now I can point somewhere
and say "See, look! I was right!" and then stick my fingers in my ears and cry in the corner.
I actually skipped how integers _really_ work under the hood in terms of their binary representation. If you're
interested, here's the binary of the earlier underflow code example:
```
Starting number is: -2147483648 | Binary: 10000000000000000000000000000000
New number is: 2147483647 | Binary: 01111111111111111111111111111111
```
That opens up a whole 'nother can of worms of _two's complement_ and actually how the bits are stored and manipulated.
A deeper look at why the underflow behavior occurs through how integers are added at the binary level can also be used
to disprove answer `b`. I cut a full break down of this out, because I realized it was somewhat irrelevant. The quick
integer whole number example makes the same point without nearly as much technicality.
The main takeaway: anytime you end up in undefined behavior land with C things get wacky real quick and all numeric
types are actually black magic pretending to be approachable.
[^1]: If you don't know what undefined behavior is, Wikipedia has an article on it [here](https://en.wikipedia.org/wiki/Undefined_behavior).