Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] [Question] How to detect taint on elements in a collection #18098

Open
JustusAdam opened this issue Nov 25, 2024 · 8 comments
Open

[C++] [Question] How to detect taint on elements in a collection #18098

JustusAdam opened this issue Nov 25, 2024 · 8 comments
Labels
question Further information is requested

Comments

@JustusAdam
Copy link

I am trying to detect the flow into potential_leak in the following, simplified code. This is just the minimal example, the vector can be constructed any way, e.g. with a series if push_back or via iterator etc and I’m trying to find a way to reliably detect taint on any elements at the sink location. Also assume that I do not have access to the source code of potential_leak and thus could detect the taint when the elements are accessed.

std::vector<int> v { sensitive_data };
potential_leak(v);

My simplified query is

import cpp
import semmle.code.cpp.dataflow.new.TaintTracking

module TaintConfig implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node source) {
    exists(VariableAccess v | 
      v.getTarget().getName() = "sensitive_data" 
    }
  }

  predicate isSink(DataFlow::Node sink) {
    exists(Call c |
      c.getTarget().getName() = "potential_leak" and
      c.getArgument(0) = e
    )
  }
}

module Flow = TaintTracking::Global<TaintConfig>;

from DataFlow::Node src, DataFlow::Node sink
where Flow::flow(src, sink)
select src, sink

However this does not detect the flow. Is there some way to select the elements inside of v as sinks for this query?

CodeQL version: 2.19.3

@JustusAdam JustusAdam added the question Further information is requested label Nov 25, 2024
@redsun82
Copy link
Contributor

👋 @JustusAdam

I'm guessing you might have edited your code snippet leaving out some information (the exists in isSource is not closed, and there's an undefined e in isSink).

However, trying out this example, it would indeed seem we don't currently track taint through vectors. I will ask my colleagues if it's really the case.

In the meantime, this seems to cover your simple example, by defining additional flow steps:

import cpp
import semmle.code.cpp.dataflow.new.TaintTracking

module TaintConfig implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node source) {
    source.asExpr().(VariableAccess).getTarget().getName() = "sensitive_data"
  }

  predicate isAdditionalFlowStep(DataFlow::Node lhs, DataFlow::Node rhs) {
    exists(ConstructorCall c | c.getTarget().getName() = ["vector", "initializer_list"]
       and c = rhs.asExpr() and c.getAnArgument() = lhs.asExpr())
  }

  predicate isSink(DataFlow::Node sink) {
    exists(Call c | c.getTarget().getName() = "potential_leak" and
    c.getArgument(0) = sink.asExpr())
  }
}

module Flow = TaintTracking::Global<TaintConfig>;

from DataFlow::Node src, DataFlow::Node sink
where Flow::flow(src, sink)
select src, "flow to $@", sink, sink.toString()

notice however that modelling all ways in which an element can be inserted into a vector might be tricky (push_back, emplace, assign, insert at the minimum, but then also via iterators like std::back_inserter or std::iota or the iterator overload of constructors, assign and insert, and probably other ways...). I will circle back after I'll ask if there's no better way.

@redsun82
Copy link
Contributor

👋 @JustusAdam

  • It turns out we do model flow through many vector constructors and member functions, the full list of things covered can be found here. You may notice that:
    • push_back, emplace_back, insert, assign from an iterator are covered
    • construction from an initializer_list is not. I've opened an internal issue to cover that, so we may have that in the future (though I cannot commit on any specific roadmap). In the meantime the isAdditionalFlowStep I provided should get that covered.
  • If a sink is a function call argument on a container, like potential_leak in your example, then the container containing tainted elements will indeed be implicitly tainted as well. So nothing needs to be done to get that.
  • One has, however, to pay attention when defining sources, sinks and additional flow steps, to use the correct predicate among Node.asExpr and Node.asIndirectExpr. The former deal with taint propagating through values, the latter with taint propagating through references. In your case we should for example use asIndirectExpr in isSink if potential_leak takes the vector in by reference. As also push_back and emplace_back take arguments by reference we may want to cover both cases in isSource using [source.asExpr(), source.asIndirectExpr()] instead of just source.asExpr().

@JustusAdam
Copy link
Author

Hi @redsun82,

Thank you very much for looking into this. I did know that you modeled collections but I didn't realize they overtainted to the container itself. Nice.

Sorry for the buggy code, I was trying to fix the indentation after copying it and I deleted too much by accident.

So it seems I just ran into one of the unsupported methods 😅. Anyway your workaround appears to work, thanks a lot.

Since I have you I have two related questions:

  1. Using models is the only way to achieve this taint propagation right? There is no way to taint the entire object at the src expression right? And there's no general way of tainting at src or doing this taint propagation, you have to model for every data structure? Because I also have this same issue with unordered_map where, when iterating over it, the taint propagates to the iterators and tuples, but not to their fields.
  2. Just for my understanding, why exactly do you need the models? I was playing around with it yesterday by creating a vector manually and iteration and indexing worked with the taint analysis but push_back didn't. Do you know why this is?

@redsun82
Copy link
Contributor

Hi @JustusAdam

Sorry for the buggy code, I was trying to fix the indentation after copying it and I deleted too much by accident.

No worries!

Using models is the only way to achieve this taint propagation right? There is no way to taint the entire object at the src expression right? And there's no general way of tainting at src or doing this taint propagation, you have to model for every data structure? Because I also have this same issue with unordered_map where, when iterating over it, the taint propagates to the iterators and tuples, but not to their fields.

Well, not really: everything that models do can be done with explicit and careful coding of the ConfigSig implementation. It's just that that is extremely tedious and error-prone to get right for all cases, and models provide a much faster shortcut (and in some cases we also have ways of generating them).

About

There is no way to taint the entire object at the src expression right?

Well, again, you could do that with additional taint steps. For example this (untested) snippet:

import semmle.code.cpp.ir.dataflow.internal.DataFlowPrivate

predicate isAdditionalFlowStep(DataFlow::Node node1, DataFlow::Node node2) {
  // anything that generates a store step into a vector type is now also an additional flow step that taints the whole vector
  storeStep(node1, _, node2) and
  node2.getType().hasQualifiedName("std", "", "vector")
}

but in general we wouldn't recommend it.

Just for my understanding, why exactly do you need the models? I was playing around with it yesterday by creating a vector manually and iteration and indexing worked with the taint analysis but push_back didn't. Do you know why this is?

How did push_back not work?

Just for clarification, the fact that potentially_leak(v) works currently, is not given by the fact that taint spreads to the vector in general, but just when a vector with tainted elements flows into an argument sink. Technically this happens here.

As for unordered_map, I'll file an issue to extend modeling to more STL containers, I think that makes a lot of sense.

@JustusAdam
Copy link
Author

Just for clarification, the fact that potentially_leak(v) works currently, is not given by the fact that taint spreads to the vector in general, but just when a vector with tainted elements flows into an argument sink. Technically this happens here.

So this is kind of what I meant by "taint the whole object". I didn't know this is what happened already internally. But this is only for elements that are stored specially via the Element[...] model right? This does not just look for any contained pointers?

How did push_back not work?

So I hand rolled this vector implementation with a couple of test cases (all below) and ran the taint analysis. The long and short of it is that using indexing worked, taint inflicted via push_back is lost.

template <typename T>
class vec_iter
{
    T *__p;

public:
    vec_iter(T *p) : __p(p) {}

    bool operator!=(vec_iter<T> &other)
    {
        return __p != other.__p;
    }

    vec_iter<T> &operator++()
    {
        __p++;
        return *this;
    }

    T &operator*()
    {
        return *__p;
    }
};

template <typename T>
class vec
{
    T *__start;
    T *__end;
    uint64_t cap;

public:
    vec()
    {
        __start = 0;
        __end = __start;
        cap = 0;
    };

    void ensure_cap(uint64_t additional)
    {
        auto current = (uint64_t)(__end - __start);
        auto target = current + additional;
        if (target > cap)
        {
            auto new_cap = cap * 2;
            if (new_cap == 0)
            {
                new_cap = 4;
            }

            while (new_cap < target)
                new_cap *= 2;
            auto old_start = __start;
            __start = new T[new_cap];
            __end = __start + current;
            memcpy(__start, old_start, cap * sizeof(T));
            free(old_start);
            cap = new_cap;
        }
    }

    void push_back(const T &elem)
    {
        ensure_cap(1);
        *__end = elem;
        __end++;
    }

    vec_iter<T> begin()
    {
        return vec_iter(__start);
    }

    vec_iter<T> end()
    {
        return vec_iter(__end);
    }

    T &operator[](uint64_t n)
    {
        return *(__start + n);
    }
};

int main(int argv, char **argc)
{
    int result = 0;

    // Indexing, this works
    vec<int> v1 = vec<int>();
    v1[0] = source();
    result += target(v1[0]);

    // Push_back, this doesn't work
    vec<int> v2;
    v2.push_back(source());
    auto elem = v2[0];
    result += target(elem);

    return result;
}

Query:

import cpp
import semmle.code.cpp.dataflow.new.TaintTracking

module SourceSinkCallConfig implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node source) { isSourceCall(source.asExpr()) }

  predicate isSink(DataFlow::Node sink) {
    isTargetOperand(sink.asExpr()) or isTargetOperand(sink.asIndirectExpr())
  }
}

predicate isTargetOperand(Expr o) {
  exists(FunctionCall call | isTargetFunction(call.getTarget()) and call.getArgument(0) = o)
}

predicate isSourceCall(Expr e) { e.(FunctionCall).getTarget().getName() = "source" }

module SourceSinkCallTaint = TaintTracking::Global<SourceSinkCallConfig>;

from DataFlow::Node source, DataFlow::Node sink, int source_line, int sink_line
where
  SourceSinkCallTaint::flow(source, sink) and
  source_line = source.getLocation().getStartLine() and
  sink_line = sink.getLocation().getStartLine()
select source, source_line, sink, sink_line

@redsun82
Copy link
Contributor

redsun82 commented Nov 28, 2024

hi @JustusAdam

I see. It's probably the case that taint analysis cannot see through the logic of ensure_cap, so modelling (either through careful ConfigSig programming or using models as data) would indeed be the way to go.

I couldn't help but notice a couple of oddities (though I understand yours is just demo code):

  • in
    vec<int> v1 = vec<int>();
    v1[0] = source();
    result += target(v1[0]);
    you are hitting undefined behaviour (and in practice a segfault) as you are assigning source() to a nullptr (you didn't yet make room in vec)
  • you should not pair new with free. new T[size] will by default allocate sizeof(T) * size bytes and call the default T constructor, while free will free the memory but not call T destructors. Not to mention that the new operator can actually be overloaded and do whatever, in which case free may really wreak havoc. You should always pair new with delete (and new[] with delete[], without mixing them!), and only use free if you used malloc/calloc

@JustusAdam
Copy link
Author

Hye, thanks for the explanation. I don't use the manual memory management in C++ too much and I also just created this example for understanding CodeQL, not actually to use it and I only tested that pushing into the vector works 😅. With proper std::vectors I had versions too that constructed vectors of at least one element, but that made no difference (and it shouldn't typically the static analysis doesn't model the actual value of capacity or the pointers).

Anyway I don't think the problem is ensure_cap. I tried the same vector with just immediately allocating an array with 10 elements and pushing onto that with no resizing (e.g. commenting out the call to ensure_cap) and it has the same issue. I think the problem is actually that the analysis doesn't realize that __start is an alias for __end and refers to the same memory location. It is starting to look to me as though the taint analysis in CodeQL doesn't perform any alias analysis? I have confirmed this behavior with a simpler test as well but since this appears to me like a separate issue so I opened one #18151 where I mention the test example too.

@rvermeulen
Copy link
Contributor

I haven't fully read the thread so if this is already mentioned you can skip it, but when working with tainted elements of a collection or fields of a struct, you typically want to implement allowImplicitRead to allow our analysis to assume arbitrary read steps at the sink when the collection or structure is passed to the sink (the collection/struct might not be tainted, but a member could be).

Here is an example implementation

predicate allowImplicitRead(DataFlow::Node node, DataFlow::ContentSet cs) {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants