Blog/posts/dsa-loop-midterm-question-analysis.md

15 KiB
Raw Permalink Blame History

name summary tags published updated
Proving a Data Structures and Algorithms Midterm Question Wrong An anthill to die upon.
college
datastructures
c
2024-03-01 2024-03-01

Context

I'm currently enrolled at a university working towards a Computer Science degree. As part of my degree plan I have to take a Data Structures and Algorithms (DSA) course.

At the time of writing we had our midterm yesterday morning, and I missed marks on a question that I know for a fact had no correct answer. This post is an analysis as to why the question's multiple choice answers were all incorrect.

I'm writing this knowing I'm being a bit of a pedant and knowing that it's only a single question. But for whatever reason this keeps circling around in my mind throwing around all the chairs and spray painting gang markers on the walls. As such, I'm writing to get the poltergeist out.

The Question

Here's roughly the question that was asked:

Given the following code:

#include <stdio.h>

int main(int argc, char *argv[]) {
    int arr[2][2] = {{1, 2}, {3, 4}};
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 2; j--) {
            printf("arr[%d][%d] = %d, ", i, j, arr[i][j]);
        }
    }
}

What will the program do?

a.) There will be a warning
b.) Infinitely loop
c.) Output: arr[0][0] = 1, arr[0][1] = 2, arr[1][0] = 3, arr[1][1] = 4,
d.) Output: [0][-1] = 1, arr[0][-2] = 2, arr[1][-1] = 3, arr[1][-2] = 4,

I chose a, after realizing that's the closest answer to being correct. Unfortunately this question is asking about the most horrid thing on Earth and a large reason languages like Rust exist — undefined behavior (UB)1.

Answer a, of course, is not correct. The exam had answer b as correct, which is also wrong. In fact, none of the above answers are correct.

Why is the above code undefined behavior (UB)?

The code given to us has an out-of-bounds array access. Take a look at the for loops. The first for loop is all good, no problem there. The inner one, however, is our problem child.

int arr[2][2] = {{1, 2}, {3, 4}};
for (int i = 0; i < 2; i++) {
    for (int j = 0; j < 2; j--) { // This line is bad!
        // Eventually the sub-indexing of j will be out of bounds!
        printf("arr[%d][%d] = %d, ", i, j, arr[i][j]);
    }
}

The inner for loop decrements the variable j and will continue looping until j is not less than 2 (so loop until j is more than or equal to 2). This means the array will eventually be indexed by whatever integer is stored into j and since j is decremented from 0 it will be negative.

As a result the array will be sub-indexed by a negative number, which is outside the bounds of the array. This is our undefined behavior.

Now that we've identified the issue with this problem let's get into why none of the provided answers are correct.

Why is answer a wrong?

To save you some scrolling, answer a was There will be a warning. In this course, questions have been asked with identical (or extremely close at least) answers like a which were interested in compiler behavior.

When I was marked wrong on this answer, I was actually quite surprised. Even more so when I recreated the identical code and attempted to compile it with gcc and got the following compilation error:

t.c: In function main:
t.c:7:13: error: iteration 1 invokes undefined behavior [-Werror=aggressive-loop-optimizations]
    7 |             printf("arr[%d][%d] = %d, ", i, j, arr[i][j]);
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
t.c:6:27: note: within this loop
    6 |         for (int j = 0; j < 2; j--) {
      |                         ~~^~~
cc1: all warnings being treated as errors

It seemed apparent to me that the question had no right answers. Imagine my surprise then when the professor compiled the same code in an online compiler and received neither an error or warning. That made me lose a bit of sanity and a few hours later I realized the culprit, I have a gat dang alias hiding out in my zsh config ruining property values:

alias gcc="gcc -Werror -Wall -Wpedantic -Warray-bounds -O3"

I bring this up because I spent perhaps a few too many hours trying to figure out what the heck was going on. A simple command -v gcc would've found it immediately, but I was too dumb to check that for quite some time 😞.

I also bring it up because my problem with a difference in compiler settings being a problem is a problem. By having potential answers be about compiler behavior the question immediately invokes a ton of implicit compiler configuration and flags that were (to my knowledge) never clearly stated in my course. The problem then is no longer about DSA — instead it's about knowing the quirks of a compiler that may have differences on every single system and configurations out there.

Ok, but let's just assume we're rocking standard gcc with no flags, and no configuration elsewhere. Thus, there shouldn't be any compiler warnings or errors. With that caveat in mind, answer a can't be right.

Why are answers c & d wrong?

I'm lumping these two together because they're wrong for similar reasons and I figure I should cover all of my bases.

Answers c & d were:

c.) Output: arr[0][0] = 1, arr[0][1] = 2, arr[1][0] = 3, arr[1][1] = 4,
d.) Output: [0][-1] = 1, arr[0][-2] = 2, arr[1][-1] = 3, arr[1][-2] = 4,

c is wrong because the inner loop is being continuously decremented. In theory the sub-indexing integer j should never get a positive value. Purely in theory.

d is wrong because by indexing negatively into the array, the output values are incredibly unlikely to match the actual values stored in the array; on top of that, indexing negatively into an array in C invokes UB, making this necessarily wrong anyhow.

Why is the supposedly correct answer, b, also wrong?

Answer b stated that "the program will infinitely loop".

I'll paste the code again here for your reference so you don't have to scroll back up to see it.

#include <stdio.h>

int main(int argc, char *argv[]) {
    int arr[2][2] = {{1, 2}, {3, 4}};
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 2; j--) {
            printf("arr[%d][%d] = %d, ", i, j, arr[i][j]);
        }
    }
}

On the surface level it certainly seems this code would infinitely loop. Our inner loop integer j is always being set more and more negatively and it appears it will never be more than two. After all, subtracting one from a negative will still leave you with a negative.

There are two reasons that come to my mind at a glance why the above code actually won't run until the Sun swallows the Earth and time becomes meaningless.

Why b is Wrong, Part Uno

The first reason that code won't run infinitely is the realities of undefined behavior.

On my laptop running Linux, kernel version 6.7.6 at the time of writing, the code above will have a segmentation fault within the first few seconds of running. The Linux kernel actually has quite a few built-in security features and it won't let my code willy nilly access random addresses which sometimes happens with the negative indexing.

Outside of the kernel security mechanisms, since we're in UB land who's to say what will actually happen as its undefined. So at best saying it runs forever is a guess — not actually something that can be proven easily.

Why b is Wrong, Part Dos

The second reason is the much stronger of the two. Let's, the for the sake of argument, say the above code never has a segmentation fault. It will always be able to decrement j and sub-index into the array.

Even with these concessions, the program will still not run infinitely.

To understand why, we have to dive into how integers work in C.

ISO/IEC 9899:2023 Purgatory (Oh the misery)

So let's go see what the actual published standard has to say on what an integer is. If we go on over to open-std.org and go check out the latest published draft of "ISO/IEC 9899 - Revision of the C standard" we end up with this big ol' document. Crack 'er open and head down to page 41, section 6.2.6.2 (fun to pronounce as "sixty-two sixty-two").

An integer, according to the standard, can be signed or unsigned. Since the integer we're interested in, j, is a signed integer, let's go through how a signed integer is standardized as in C.

For signed integer types, the bits of the object representation shall be divided into three groups: value bits, padding bits, and the sign bit. ... the signed type uses the same number of N bits as values bits and sign bit. N - 1 are value bits and the remaining bit is the sign bit. ... If the sign bit is zero, it shall not affect the resulting value. If the sign bit is one, it has value -(2^(n-1)).

Note one more thing, the standard does not define the actual size of an integer. That's left to each implementation. Since we're using the gcc compiler, let's see what the gcc docs have to say on the size of a integer.

The integer data types range in size from at least 8 bits to at least 32 bits. The C99 standard extends this range to include integer sizes of at least 64 bits. You should use integer types for storing whole number values (and the char data type for storing characters). The sizes and ranges listed for these types are minimums; depending on your computer platform, these sizes and ranges may be larger.

Ah... that's, uhhh, a pain — it's gonna vary by architecture.

On my system, integers get 32 bits when using gcc and thus can represent 4,294,967,295 distinct numbers. Remember, the integer we're working with is signed, thus half of those numbers will be negative. That means the largest and smallest numbers my integers can be are 2,147,483,647 and -2,147,483,648 respectively.

Hmmm... what happens when we exceed either of those numbers? Let's find out.

#include <stdio.h>
#include <limits.h>
#include <stdlib.h>

int main()
{
    printf("Starting number is: %d\n", INT_MIN);
    printf("New number is: %d\n", INT_MIN - 1);
    return EXIT_SUCCESS;
}

INT_MIN in this case is the smallest integer representable.

gcc warns about an "integer overflow"... interesting.

 gcc program.c

t.c: In function main:
t.c:8:43: warning: integer overflow in expression of type int results in 2147483647 [-Woverflow]
    8 |     printf("New number is: %d\n", INT_MIN - 1);
      |                                           ^

Well let's run it and see what the output is.

 ./a.out

Starting number is: -2147483648
New number is: 2147483647

Oh, huh. Neat.

So if we exceed the bounds the sign flips, literally an overflow or, in this case, an underflow.

Actually for real this time why b is wrong

With the overflow and underflow behavior now covered, let's take a look back at the code given to me in class:

#include <stdio.h>

int main(int argc, char *argv[]) {
    int arr[2][2] = {{1, 2}, {3, 4}};
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 2; j--) {
            printf("arr[%d][%d] = %d, ", i, j, arr[i][j]);
        }
    }
}

Now with everything we've covered let's go point by point:

  1. So long as j is less than 2 this will continue to run.
  2. Integer j will be decremented for every execution of the inner loop.
  3. Integer types have a maximum capacity and if we exceed that capacity the sign changes.
  4. Eventually, j will reach the smallest value an integer can represent and on the next loop it will underflow and become a massively positive number.
  5. Integer j will then eventually be more than 2 in value and thus the condition j < 2 will be invalidated and the loop will end.

To prove the code won't infinitely loop, let's actually run the code with a few small modifications to speed things up:

#include <limits.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int arr[2][2] = {{1, 2}, {3, 4}};
    // Move j's initializer up here so we can access it after the loop.
    // Also, set j to `INT_MIN` so our loop doesn't have to decrement
    // 2147483648 times before the theorized underflow occurs which
    // just cuts down on how long it takes to underflow.
    int j = INT_MIN;
    for (int i = 0; i < 2; i++) {
        for (; j < 2; j--) {
            // Comment out the arr access to avoid seg faults
            // printf("arr[%d][%d] = %d, ", i, j, arr[i][j]);
            printf("J -> %d\n", j);
        }
    }
    // If the loop isn't infinite, we should see this message
    printf("THE LOOP WAS NOT INFINITE, J VALUE: %d\n", j);
}

Running the above outputs:

 gcc prog.c && ./a.out

J -> -2147483648
THE LOOP WAS NOT INFINITE, J VALUE: 2147483647

Y u p. That's underflow for sure.

We can clearly see where j was the smallest possible integer and then wrapped around to the largest possible integer and in doing so caused the inner loop to end.

Thus, answer b is also wrong.

A quick summary

As a result of all the above points, we can readily see that the potential answers we could pick from are all wrong.

a.) There will be a warning
b.) Infinitely loop <- An integer underflow will eventually occur causing the looping to end
c.) Output: arr[0][0] = 1, arr[0][1] = 2, arr[1][0] = 3, arr[1][1] = 4,
d.) Output: [0][-1] = 1, arr[0][-2] = 2, arr[1][-1] = 3, arr[1][-2] = 4,

A quick summary of why each is wrong:

a.) We won't get a warning unless we specify compiler flags/config for gcc.
b.) An integer underflow will occur causing the looping to end and thus the program has a finite execution period, and also the program will almost certainly have a segmentation fault causing the looping to end.
c.) j will never sub-index into 1 or 2.
d.) Indexing j negatively invokes undefined behavior, so we can't know the output. As well as the fact that the pattern for accessing the first index is wrong anyhow.

Is My Pedantry (🤓) Satisfied Now?

Somewhat, to be wholly honest I am still annoyed I got marked off on that question. At least now I can point somewhere and say "See, look! I was right!" and then stick my fingers in my ears and cry in the corner.

I actually skipped how integers really work under the hood in terms of their binary representation. If you're interested, here's the binary of the earlier underflow code example:

Starting number is: -2147483648 | Binary: 10000000000000000000000000000000
      New number is: 2147483647 | Binary: 01111111111111111111111111111111

That opens up a whole 'nother can of worms of two's complement and actually how the bits are stored and manipulated. A deeper look at why the underflow behavior occurs through how integers are added at the binary level can also be used to disprove answer b. I cut a full break down of this out, because I realized it was somewhat irrelevant. The quick integer whole number example makes the same point without nearly as much technicality.

The main takeaway: anytime you end up in undefined behavior land with C things get wacky real quick and all numeric types are actually black magic pretending to be approachable.


  1. If you don't know what undefined behavior is, Wikipedia has an article on it here. ↩︎