Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added functionality to track overlapping accesses #258

Merged
merged 17 commits into from
Nov 25, 2024

Conversation

sadrabt
Copy link
Contributor

@sadrabt sadrabt commented Oct 14, 2024

This pull request address the issue of overlapping accesses for DSA as discussed by issue #254

Remaining:

  • create a test for global overlapping (can use clang/jumptable)
  • create a test for stack overlapping
  • test indirect overlapping access

@l-kent
Copy link
Contributor

l-kent commented Oct 14, 2024

I don't think this is working correctly yet.

For indirect_calls/jumptable/clang:BAP, these are the relevant instructions with overlapping stack accesses:

00000413: mem := mem with [R31 + 0x10, el]:u128 <- V0

00000429: R8 := mem[R31 + 0x10, el]:u64

00000438: R8 := mem[R31 + 0x18, el]:u64

The 128 bit access to R31 + 16 overlaps with the 64 bit access to R31 + 16 and the 64 bit access to R31 + 24.

However, the analysis doesn't correctly determine that these overlap. The results it gives are that:

  • 00000429_R8 points to node 438 which contains pointers to add_six and add_two and is collapsed.
  • 00000438_R8 points to node 447 which is empty.

There is also node 450 which contains Stack_16 (of size 16) and Stack_12 (of size 4). It has three cells:

  • 0 points to node 438
  • 8 points to node 447
  • 12 points to node 439 (which is also empty).

This is incorrect - accesses to R31 + 16 and R31 + 12 never overlap, and these offsets don't make any sense.

Stack_12 and Stack_16 do not overlap and the analysis should not find that they do.
00000429_R8 should contain a pointer to add_two (this is already fine)
00000438_R8 should contain a pointer to add_six (this is not correct as-is)

It's alright if they both point to a collapsed node containing pointers both add_two and add_six, but it's even better if we can maintain the offsets if possible?

@sadrabt sadrabt force-pushed the points_to_analysis_overlapping_access branch from 4751106 to 159c5d9 Compare October 15, 2024 00:32
@l-kent
Copy link
Contributor

l-kent commented Oct 15, 2024

This seems to be working now so I'll merge it soon.

It would be nice to be able to improve the precision so it can tell that add_two is at offset 0 and add_six is at offset 8 in the combined node and then tell that 00000429_R8 is pointing to add_two, 00000438_R8 is pointing to add_six, and 0000040b_V0 is pointing to both, but that would probably be a fair bit more complex and isn't a must have.

@l-kent
Copy link
Contributor

l-kent commented Oct 15, 2024

Oh you were planning to write tests weren't you? I'll wait for you do to those before merging.

@sadrabt
Copy link
Contributor Author

sadrabt commented Oct 15, 2024

// 0x10DC0 contains a pointer to the procedure add_two: separate node with a single cell at the start which points to add_two
// 0x10DC8 contains a pointer to the procedure add_six: separate node with a single cell at the start which points to add_six

once the analysis process
V0 := mem[0x10DC0, el]:u128
it will merge the nodes for 0x10DC0 and 0x10DC8 with offset 8 so that the resulting node will have a cell at offset 0 which corresponds to 0x10DC0 and a cell at offset 8 corresponding to 0x10DC8. Each cell still has a distinct pointee

here currently the analysis will grow the cell corresponding to 0x10DC0 the size of the access. it causes that cell to overlap the cell corresponding to 0x10DC8 and as a result their outgoing edges (their pointees) are also merged. This causes the collapse

an alternative would be to not grow 0x10DC0's cell but instead do a multi load/write.
instead of growing the 0x10DC0's cell, we start from that cell and continue to merge each consecutive cell's pointee with a cell at a different offset in V0's Node. The problem here is that we don't have a way of telling which offsets are the correct ones in V0's node to be merged with the pointees. I think one possibility would be to increment a set number (64 bits or one pointer's length) since we are only concerned with pointers. However, I'm unsure of the full implications of this for soundness

@l-kent

This comment was marked as outdated.

@l-kent

This comment was marked as resolved.

@l-kent
Copy link
Contributor

l-kent commented Oct 15, 2024

instead of growing the 0x10DC0's cell, we start from that cell and continue to merge each consecutive cell's pointee with a cell at a different offset in V0's Node. The problem here is that we don't have a way of telling which offsets are the correct ones in V0's node to be merged with the pointees.

Can you explain this further? I don't quite follow what the problem is, why don't we have a way of telling which offsets to use?

@sadrabt
Copy link
Contributor Author

sadrabt commented Oct 15, 2024

after V0 := mem[0x10DC0, el]:u128
we want V0 node to look something like
V0 = {Cell1 == pointee(0x10DC0), cell2 == pointee(0x10DC8)},
I don't think we have a way of determining what offset cell2 begins at because we don't know the size of the pointee(0x10DC0)

@l-kent
Copy link
Contributor

l-kent commented Oct 15, 2024

We do actually know the sizes. The way it works is that:

  • add_two has address 0x754
  • add_six has address 0x768
  • 0x10DC0 is a location that contains 0x754, so it is a pointer to a pointer to add_two
  • 0x10DC8 is a location that contains 0x768, so it is a pointer to a pointer to add_six
  • These are all pointers which are 64-bit/8-byte values

Because 0x10DC8 is 8 bytes after 0x10DC0 and they are loaded together, we know that the offset for cell 2 is 8 just from that.

@sadrabt
Copy link
Contributor Author

sadrabt commented Oct 15, 2024

we would know the size if the pointee is a single (not merged with anything else) global object (possibly single stack objects aswell), because we can look it up in the symbol table and know its size and also know the object isn't merged with something else so we can rely on our known size. But I don't think we can tell the size in general. if pointee is not global or stack we won't be able to tell it's size, same thing follows if a stack or global object has been merged with some other nontrivial (not a place holder for SSA variables) object. But you are right we are able to get more precision in certain cases

@l-kent
Copy link
Contributor

l-kent commented Oct 15, 2024

We know the size because there is an object at 0x10DC0 and an object at 0x10DC8, and we are accessing both via a 128-bit access to 0x10DC0. 0x10DC8 is offset from 0x10DC0 by 8 bytes, so cell2 should also have an offset of 8 bytes. This applies broadly, whether we are accessing global data, or the stack, or a heap location. We don't need to know anything about the symbol table for this (except that those objects exist at those addresses in this example).

Of course, if those objects have already been merged with anything else that has caused their nodes to collapse then you can't possibly do anything except collapse these together too, but that's fine.

If you don't know the sizes, you can't even know that there are overlapping accesses to begin with.

@sadrabt
Copy link
Contributor Author

sadrabt commented Oct 15, 2024

You are so correct. OF COURSE we have sizes of the pointees because we use them to determine if there is an overlapping access. Thank you for clarifying that. My symbol table to global objects (Data Structure Nodes) conversion step is missing the middle indirection level (cell for 754 pointing to add_two) from 0x10DC0 to 754 to add_two. I'll fix that next and then should be able to get more precise overlapping accesses implemented.

@l-kent
Copy link
Contributor

l-kent commented Oct 15, 2024

Thanks, that makes sense. I agree that we need to add that extra indirection layer that isn't in there at present.

@sadrabt sadrabt force-pushed the points_to_analysis_overlapping_access branch from 0185072 to 879508a Compare October 21, 2024 01:11
@l-kent
Copy link
Contributor

l-kent commented Oct 21, 2024

You accidentally overwrote some of your previous fixes with this force push

@sadrabt sadrabt force-pushed the points_to_analysis_overlapping_access branch from 879508a to 0185072 Compare October 21, 2024 02:20
@l-kent
Copy link
Contributor

l-kent commented Oct 28, 2024

I'm not sure if this extra indirection actually makes sense either - I think I'd gotten a bit confused when I agreed that it was needed, so my apologies for that. The issue is that 0x754 is the address of the function add_two, so the value 0x754 is a function pointer to add_two. This means that any register that contains the value 0x754 is just pointing to add_two - it isn't pointing to a pointer to add_two. That would only be the case if a register held the value 0x10DC0 (which is an address that contains the value (0x754).

The difficulty we're dealing with is that we want to be able to have a node where cell 0 is add_two, and cell 8 is add_six, but I don't think that's really supported at the moment - would it be possible to implement something like that and does it make any sense as an approach?

Using negative addresses doesn't really seem like a sensible way to represent this either, if we wanted indirection like that a better option would probably be a new sort of node to represent functions, rather than negative addresses which would only be confusing as they don't actually mean anything.

@sadrabt
Copy link
Contributor Author

sadrabt commented Oct 28, 2024

I agree that the indirection doesn't seem correct specially because the global address range for the functions are shrunk but I think it makes sense to represent 0x754 as a cell pointing to add_two rather than add_two. This way we want a node where cell 0 is a pointing to add_two and cell 8 is a pointing to add_six which seems more correct than having add_two with size 20 bytes at cell 0 and add_six with size 20 bytes at cell 8.

we need a better way of representing function pointers. I think it is suitable to represent them as single cell nodes with the single cell pointing to the function (e.g. add_two). However, I think here the appropriate address range is [0x754, 0x754] but we want the cell to be 8 bytes long

I also agree with the negative address not being very sensible, currently all the globals have an address range but the current approach represents 0x754 to 0x754 + 8 as a pointer to add_two and hence had to put add_two itself in some range that won't be addressed. I'm still unclear on how I want to represent the function pointers tho, so I think this should be addressed after the above issue is resolved

@l-kent
Copy link
Contributor

l-kent commented Oct 28, 2024

It depends on the semantics - varToCell maps each SSA Register to a Slice (a Cell and an offset into that Cell) - does the Cell represent the value of the register, or what the register points to?

@sadrabt
Copy link
Contributor Author

sadrabt commented Oct 28, 2024

The cell represents the value. The cell may or may not be pointing to something

@l-kent
Copy link
Contributor

l-kent commented Oct 28, 2024

Ah, ok, the indirection is needed then, that clears things up. At the moment it feels like it's the other way around - a DataLocation with start 0x754 seems like it is representing the memory location 0x754, not the value 0x754 (which points to said memory location), so more work may need to be done to make this clear & consistent.

I think the simplest solution is that the cell with value 0x754 (which should have size 8 then) should then point to a cell containing a new type of MemoryLocation that represents a function?

@l-kent
Copy link
Contributor

l-kent commented Oct 28, 2024

Similar work may need to be done with DataLocations that represent the addresses of global variables too - there may need to be a new MemoryLocation that represents a global variable's contents (the value that is actually at the address)?

@l-kent
Copy link
Contributor

l-kent commented Oct 28, 2024

However, I think here the appropriate address range is [0x754, 0x754] but we want the cell to be 8 bytes long

The value 0x754 is a function pointer and is 8 bytes long, so the cell should be 8 bytes long if it represents that as a value. What does the address range actually mean? If cells represent values, then I'm not sure the concept of an address range is needed at all?

@l-kent
Copy link
Contributor

l-kent commented Oct 28, 2024

This pull request also seems to incorrectly identify overlapping accesses in some cases. An example of this is for correct/malloc_with_local/clang:BAP. There are no overlapping stack accesses in main(), but it incorrectly finds accesses to R31 + 4, R31 + 8 and R31 + 16 to be overlapping, even though the accesses to R31 + 4 are always 32-bit (so 4 bytes) and the other accesses are always 64-bit (so 8 bytes), meaning that there is no overlap.

This is the .bir annotated with the accesses and allocations if it assists:

0000075d: sub main(main_argc, main_argv, main_result)
00000772: main_argc :: in u32 = low:32[R0]
00000773: main_argv :: in out u64 = R1
00000774: main_result :: out u32 = low:32[R0]

00000352:
00000356: R31 := R31 - 0x30
0000035c: #4 := R31 + 0x20
// access to R31 + 32
00000362: mem := mem with [#4, el]:u64 <- R29
// access to R31 + 40
00000368: mem := mem with [#4 + 8, el]:u64 <- R30
0000036e: R29 := R31 + 0x20
00000373: R8 := 0
// access to R31
0000037b: mem := mem with [R31, el]:u32 <- 31:0[R8]

// access to R31 + 28
00000382: mem := mem with [R29 - 4, el]:u32 <- 0
00000387: R0 := 1
0000038c: R30 := 0x834
// allocate heap1
0000038f: call @malloc with return %00000391

00000391:
// access to R31 + 16
00000397: mem := mem with [R31 + 0x10, el]:u64 <- R0
0000039c: R0 := 4
000003a1: R30 := 0x840
// allocate heap2
000003a3: call @malloc with return %000003a5

000003a5:
// access to R31 + 8
000003ab: mem := mem with [R31 + 8, el]:u64 <- R0
000003b0: R8 := 0xA
// access to R31 + 4
000003b8: mem := mem with [R31 + 4, el]:u32 <- 31:0[R8]
// access to R31 + 16
000003bf: R9 := mem[R31 + 0x10, el]:u64
000003c4: R8 := 0x41
// access to heap1
000003cc: mem := mem with [R9] <- 7:0[R8]
// access to R31 + 8
000003d3: R9 := mem[R31 + 8, el]:u64
000003d8: R8 := 0x2A
// access to heap2
000003e0: mem := mem with [R9, el]:u32 <- 31:0[R8]
// access to R31 + 16
000003e7: R8 := mem[R31 + 0x10, el]:u64
// access to heap1
000003ee: R1 := pad:64[mem[R8]]
000003f3: R0 := 0
000003f9: R0 := R0 + 0x8D4
000003fe: R30 := 0x878
00000401: call @printf with return %00000403

00000403:
// access to R31 + 8
00000408: R8 := mem[R31 + 8, el]:u64
// access to heap2
0000040f: R1 := pad:64[mem[R8, el]:u32]
00000414: R0 := 0
0000041a: R0 := R0 + 0x8E5
0000041f: R30 := 0x88C
00000421: call @printf with return %00000423

00000423:
// access to R31 + 4
00000428: R1 := pad:64[mem[R31 + 4, el]:u32]
0000042d: R0 := 0
00000433: R0 := R0 + 0x8F5
00000438: R30 := 0x89C
0000043a: call @printf with return %0000043c

0000043c:
// access to R31 + 16
00000441: R0 := mem[R31 + 0x10, el]:u64
00000446: R30 := 0x8A4
00000449: call @free with return %0000044b

0000044b:
// access to R31 + 8
00000450: R0 := mem[R31 + 8, el]:u64
00000455: R30 := 0x8AC
00000457: call @free with return %00000459

00000459:
// access to R31 
0000045e: R0 := pad:64[mem[R31, el]:u32]
00000464: #5 := R31 + 0x20
// access to R31 + 32
00000469: R29 := mem[#5, el]:u64
// access to R31 + 40
0000046e: R30 := mem[#5 + 8, el]:u64
00000474: R31 := R31 + 0x30
00000479: call R30 with noreturn

@sadrabt
Copy link
Contributor Author

sadrabt commented Oct 28, 2024

The cells represent values but only values that matter are addresses. The values don't matter themselves, all we care about is if the value is an address or not and where the address points to if it is. Therefore, I don't think we need to do this for non function globals.

Same with 0x754 the value doesn't matter as much as that it is represented by a cell that points to add_two. I agree that the address range is not needed, however, it is convenient to be able to accesses globals with some unique key like their address for testing. And since the addresses are needed for finding appropriate offsets and overlapping accesses, we can represent the value 0x754 with address range 0x754 to 0x754 which is a cell that points to add_two like in the current iteration.

@l-kent
Copy link
Contributor

l-kent commented Oct 28, 2024

You do need to change how you represent other memory locations as well because otherwise it would be treating function pointers inconsistently with other pointers. For instance, currently the size of a DataLocation is the size of the object that the cell containing it points to, but since you are changing how you represent function pointers, then the size of a DataLocation will mean the size of the object pointed to for an object pointer, but the size of the pointer itself (or just 0 currently?) for a function pointer. 'DataLocation' is also a fairly unclear name given what it actually means - it should probably be something more like 'DataPointer'.

If you just want a unique key for testing, then there isn't a need for an address range at all - you could just use the value itself as a key? The value 0x754 isn't at the address 0x754, it's at the address 0x10DC0, so we need to make sure this is something that's handled consistently.

@sadrabt
Copy link
Contributor Author

sadrabt commented Nov 18, 2024

I just realized that the overlapping accesses only work the address being read/written to is directly a stack/global but not if there is a level of indirection. If there is any indirection same issue with inter procedural overlapping access occur. This issue is solved for interprocedural stack overlapping accesses and i'm currently working on the globals. Same solution can be applied to indirect overlapping accesses.

Effectively every merge (other than direct accesses to global/stack) has to be wrapped in extra functionality that checks for overlapping accesses to global/stack nodes through indirection. It's also possible to incorporate this into merge itself, but since we may require extra merges, have to find a suitable base case to end the recursion.

The current solution would be somewhat inefficient. To know if a node is a reference to a particular stack, or global. We have to run find() on all the stack/globals and see if they match. Through flags we can eliminate nodes that aren't stacks or globals. Alternatively could implement a more accurate tracking of stack and globals through merges

@l-kent
Copy link
Contributor

l-kent commented Nov 18, 2024

I think we're going to want the more accurate tracking, even if that's more work to implement?

@l-kent
Copy link
Contributor

l-kent commented Nov 18, 2024

But if there's an easier, less efficient way, it's fine to implement that first I think?

@sadrabt
Copy link
Contributor Author

sadrabt commented Nov 18, 2024

Yeah I'm gonna try to fix it, with the current inefficient way and then implement tracking separately

@sadrabt
Copy link
Contributor Author

sadrabt commented Nov 24, 2024

This pull request also seems to incorrectly identify overlapping accesses in some cases. An example of this is for correct/malloc_with_local/clang:BAP. There are no overlapping stack accesses in main(), but it incorrectly finds accesses to R31 + 4, R31 + 8 and R31 + 16 to be overlapping, even though the accesses to R31 + 4 are always 32-bit (so 4 bytes) and the other accesses are always 64-bit (so 8 bytes), meaning that there is no overlap.

Is this still an issue, I'm not sure if it was fixed during the other updates or if I'm missing the problem but for me R31 + 4, R31 + 8 and R31 + 16 are separate nodes with a single cell and correct accessed sizes

@l-kent
Copy link
Contributor

l-kent commented Nov 25, 2024

I think you fixed that last week but I'll have another look to make sure now.

@l-kent
Copy link
Contributor

l-kent commented Nov 25, 2024

Yes, that's working now. The results generally seem to make sense now from what I can see, including in the jumptable/clang test.

This is the main thing that still needs to be addressed:

You do need to change how you represent other memory locations as well because otherwise it would be treating function pointers inconsistently with other pointers. For instance, currently the size of a DataLocation is the size of the object that the cell containing it points to, but since you are changing how you represent function pointers, then the size of a DataLocation will mean the size of the object pointed to for an object pointer, but the size of the pointer itself (or just 0 currently?) for a function pointer. 'DataLocation' is also a fairly unclear name given what it actually means - it should probably be something more like 'DataPointer'.

If you just want a unique key for testing, then there isn't a need for an address range at all - you could just use the value itself as a key? The value 0x754 isn't at the address 0x754, it's at the address 0x10DC0, so we need to make sure this is something that's handled consistently.

Currently a DataLocation can mean a location (which can even be a procedure) or a pointer to a location, and there isn't any consistency to what it means or what its size parameter means, which should be fixed. Right now the node with DataLocation(add_two's pointer, 1876, 0) (which represents a location containing the value 1876, which is a pointer to add_two) will point to the node with DataLocation(add_two, 1876, 20), which represents the actual procedure add_two with address 1876. This is inconsistent with the approach for global variables, where DataLocation(x, 69680, 4) represents a location containing the value 69680 (which is a pointer to x, not the actual location at address 69680), and points to another cell which represents the actual contents of x.

@sadrabt
Copy link
Contributor Author

sadrabt commented Nov 25, 2024

function pointers are handled similar to other globals now. The only difference is that the first cell will point to a special function Node instead of empty/placeholder node as its content.

@sadrabt
Copy link
Contributor Author

sadrabt commented Nov 25, 2024

I'm happy to merge this unless if you find other problems with it. I found an issue with the cloning of stack cells discussed in #274. I will address that in a general refactoring of how the stack handling works (remove SAA, track stack offsets via cells themselves, negative/overflow issues and cloning of stack cells)

@l-kent
Copy link
Contributor

l-kent commented Nov 25, 2024

I still think it would be good to change 'DataLocation' to 'DataPointer' as it represents a pointer, not a location.

function pointers are handled similar to other globals now. The only difference is that the first cell will point to a special function Node instead of empty/placeholder node as its content.

Thanks, this makes sense.

I'll merge this soon once I run some final checks. It's fine to leave the stack overhaul to another pull request.

@l-kent
Copy link
Contributor

l-kent commented Nov 25, 2024

I guess I'm still a bit confused by why the globalMappings table works like it does, why do we need to have a concept of an AddressRange when the globalMappings table is a mapping from AddressRanges to Fields, which are node + offset pairs that represent a cell containing the value that is the base of the address range? It's just a bit convoluted as-is. I feel like either it would make more sense for globalMappings to map to the cells representing the contents of the address range, or just have a mapping from addresses to the cell that contains that value?

scripts/lift.sh Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I think I may have just changed permissions on this file but I haven't changed any of its content

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strange!

Comment on lines 185 to 201
def multiAccess(lhsOrValue: Cell, pointer: Cell, size: Int): Unit = {
// TODO there should be another check here to see we cover the bytesize
// otherwise can fall back on expanding

val diff = pointer.offset - lhsOrValue.offset
val startPointerOffset = pointer.offset
val lhsNode = lhsOrValue.node.get


val pointers = pointer.node.get.cells.filter((offset, _) => offset >= startPointerOffset && offset < startPointerOffset + size).toSeq.sortBy((offset, cell) => offset)
for (((offset, cell), i) <- pointers.zipWithIndex) {
val lhs = lhsNode.addCell(offset - diff, 0) // todo check if 0 is the right size
graph.find(cell).growSize(if i < pointers.size - 1 then (pointers(i + 1)._1 - offset - diff).toInt else (size - (offset - diff)).toInt)
val res = graph.mergeCells(lhs, graph.adjust(graph.find(cell).getPointee))
graph.handleOverlapping(res)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method could really use a comment explaining what it is doing, as it is very convoluted as-is

@sadrabt
Copy link
Contributor Author

sadrabt commented Nov 25, 2024

I guess I'm still a bit confused by why the globalMappings table works like it does, why do we need to have a concept of an AddressRange when the globalMappings table is a mapping from AddressRanges to Fields, which are node + offset pairs that represent a cell containing the value that is the base of the address range? It's just a bit convoluted as-is. I feel like either it would make more sense for globalMappings to map to the cells representing the contents of the address range, or just have a mapping from addresses to the cell that contains that value?

The field is because we could have a global begin in the middle of a cell. We could instead have a cell/internal offset of the cell. It's just easier to get the node and call getCell(offset)/addCell(offset) on it rather than get a cell and then calculate the total offset by adding the cell's offset to the internal

The address ranges map to the base of the range. Having the end allows us to know whether an address in the middle of the range is pointing to the same memory location but at an offset. We will not be able to determine this otherwise (for example from size of nodes/cells since those sizes are also affected by other merges) The address range in effect stores the size of the actual "global" which starts at the node + offset pair we are pointing to

@l-kent
Copy link
Contributor

l-kent commented Nov 25, 2024

The address range in effect stores the size of the actual "global" which starts at the node + offset pair we are pointing to
My point is that the node + offset pair refers to the cell containing the pointer to the memory location described by the address range. Why not just map directly to the cell representing the contents of the memory location described by the address range?

Like if we have a global variable x with address 69680 and size 4, then we will get the following entry in globalMappings:

AddressRange(69680, 69684) -> Field(Node(558, Data(x, 69680, 4)), 0)

Here, the cell at offset 0 in Node 558 is a cell that contains a pointer to the global variable x, not the contents of the global variable x, so why don't we just map directly from AddressRange(69680, 69684) to the cell that represents the contents of global variable x?

Copy link
Contributor

@l-kent l-kent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is all fine except 'DataLocation' should be renamed as discussed

@sadrabt
Copy link
Contributor Author

sadrabt commented Nov 25, 2024

Here, the cell at offset 0 in Node 558 is a cell that contains a pointer to the global variable x, not the contents of the global variable x, so why don't we just map directly from AddressRange(69680, 69684) to the cell that represents the contents of global variable x?

This is true for the local graphs where the range always has a value at the beginning of the first cell in the node but after a bunch of merges in the subsequent graphs the address range may start in the middle of a cell. So even if we have mapping like
AddressRange(69680, 69684) -> Cell(Node(558, Data(x, 69680, 4)), 0)
we will need to keep track of an internalOffset into the above cell to get
AddressRange(69680, 69684) -> Slice(Cell(Node(558, Data(x, 69680, 4)), 0), 1)

This is an alternative way we could track it but it's easier to track the offset directly instead of having to get the cell's offset and add it to it's internal offset to create "auxiliary" cells

So instead we have
AddressRange(69680, 69684) -> Field(Node(558, Data(x, 69680, 4)), 1)

@l-kent
Copy link
Contributor

l-kent commented Nov 25, 2024

So even if we have mapping like
AddressRange(69680, 69684) -> Cell(Node(558, Data(x, 69680, 4)), 0)
we will need to keep track of an internalOffset into the above cell to get
AddressRange(69680, 69684) -> Slice(Cell(Node(558, Data(x, 69680, 4)), 0), 1)

This isn't what I'm suggesting, and seem to be largely equivalent to what we already have. What I'm saying is that that Cell is one that contains a pointer to the global variable x - it doesn't represent the contents of global variable x. I'm wondering why we don't directly map from the AddressRange to the Cell (or Slice, or Field, whichever works) that represents the contents of the memory location represented by that AddressRange?

@sadrabt
Copy link
Contributor Author

sadrabt commented Nov 25, 2024

This isn't what I'm suggesting, and seem to be largely equivalent to what we already have. What I'm saying is that that Cell is one that contains a pointer to the global variable x - it doesn't represent the contents of global variable x. I'm wondering why we don't directly map from the AddressRange to the Cell (or Slice, or Field, whichever works) that represents the contents of the memory location represented by that AddressRange?

Sorry I misunderstood. I think we need a pointer to the location. That way if a global address is stored in the heap/stack or another global we can treat it like any other pointer. we could load a global pointer through indirection and not worry about it being a global (we kind of have to check because of overlapping accesses currently but we can treat them like any other dereference, otherwise we would have to merge the global location with its content but merge the pointees of any other access)

@l-kent l-kent merged commit 7ef566f into main Nov 25, 2024
2 checks passed
@l-kent
Copy link
Contributor

l-kent commented Nov 27, 2024

I think we need to revert this for the time being, the unsoundness issues are more significant than I thought.

@l-kent
Copy link
Contributor

l-kent commented Nov 27, 2024

I've reverted this, so make a new pull request for points_to_analysis_overlapping_access

I will outline the issues I've discovered in #276 now that I have time to do so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants