auto-sync progress tracker: Refactor and implement architectures #2015

Open
21 of 47 tasks
Tracked by #2001
Rot127 opened this issue May 9, 2023 · 28 comments
Labels
tracking Tracking issue

Rot127 (Collaborator) commented May 9, 2023

Note to x86: x86 is not part of this list, because we can not generate all tables in C.
Refer to capstone-engine/llvm-capstone#13 for details.

Note about changes introduced with auto-sync:
For a preview of the changes coming in v6, please take a look at the WIP release guide.


This issue tracks the auto-sync refactoring and implementation effort of architecture modules.

The table below lists the responsible developers for each architecture.

In progress

| Arch | CS PR | llvm-capstone PR | Part of (planned) release | Assigned developer(s) | Based on LLVM repo |
|---|---|---|---|---|---|
| SPARC | None yet | None yet | v6 | @DMaroo | LLVM-project |
| BPF | #2568 | Not applicable | v6 | @Roeegg2 | RFC, Linux kernel docs |
| ARC | None yet | None yet | v6 | @R33v0LT | LLVM-project |

.td edits upstreamed

Most LLVM .td files lack some information about instructions (memory reads/writes, operands incorrectly marked as in/out, etc.). Since we rely on this information, we need to fix it. Those fixes should be upstreamed to LLVM.

Done

| Arch | PR | Part of release | Assigned developer(s) | LLVM repo |
|---|---|---|---|---|
| Alpha | #2071 | v6 | @R33v0LT | LLVM-project (release v3.0) |
| AArch64 | #2026 | v6 | @Rot127 | LLVM-project |
| ARM | #1949 | v6 | @Rot127 | LLVM-project |
| PPC | #2013 | v6 | @Rot127 | LLVM-project |
| TriCore | #1973 | v5 | @imbillow | TriDis |
| HPPA | #2265 | v6 | @R33v0LT | Not Auto-sync based |
| LoongArch | #2349 | v6 | @jiegec | LLVM-project |
| MIPS | #2410 | v6 | @wargio | LLVM-project |
| SystemZ | #2462 | v6 | @Rot127 | LLVM-project |
| Xtensa | #2380 | v6 | @imbillow | LLVM-project |

Arch extensions

Adding CPU extensions which are not part of upstream LLVM is easier now.
They are tracked here.

| Arch | Extension name | Issue | Previous attempt/notes | Done |
|---|---|---|---|---|
| PPC | VLE | #2241 | https://lists.llvm.org/pipermail/llvm-dev/2014-July/074613.html | No |
| PPC | PS (Paired-Single) | None | https://reviews.llvm.org/D85137 | Yes |
| Mips | NanoMips | None | Mediatek LLVM: https://github.com/MediaTek-Labs/llvm-project/tree/mtk-pub/nanomips-llvm16, more context: rizinorg/ideas#5 | Yes |
| Mips | EE | None | Not in LLVM, see: #940 (comment) | No |

Effort level of not refactored/implemented archs

| Arch | Number of operand groups | Generates | Note | Implementation type | Difficulty level |
|---|---|---|---|---|---|
| AVR | ~3 | Yes | None | New | Easy |
| CSKY | ~7 | Yes | None | New | Medium |
| DirectX | ~1 | Yes | Deviates from common design. | New | Medium-Hard |
| EVM | ~2 | Not tested | Very small module, llvm repo: https://github.com/etclabscore/evm_llvm | New | Easy |
| Hexagon | ~2 | No | Deviates from common design. | New | Hard |
| Lanai | ~10 | Yes | None | New | Easy |
| M68k | ~28 | Yes | None | Refactor | Medium |
| MSP430 | ~6 | Yes | None | New | Easy |
| SPIRV | ~9 | No | td files faulty | New | Medium |
| VE | ~8 | Yes | None | New | Medium |
| XCore | ~15 | No | td files faulty | Refactor | Medium |

Note to RISC-V: RISC-V will not be generated via LLVM because the LLVM architecture definitions are not precise enough for our use case. Instead, a SAIL based generator will be used (#2392).

Legend

  • Number of operand groups: Operand groups which have distinct print functions. Indicates the effort needed to implement the LLVM <-> CS mapping code (filling cs_detail and the like).
  • Generates: Whether the .inc files generate with the most recent backends.
  • Note: Anything worth noting.
  • Implementation type: Refactor the current implementation or implement a new arch module.
  • Difficulty level: Guessed difficulty of this arch (based on the points above and on complexity, such as the number of instructions). Note that even "Easy" means you have to familiarize yourself with how the LLVM definitions and the updater work. My guess is that it will take at least a week of work.

Getting started

  • If you would like to refactor an architecture module or implement a new one, please comment here and we will add you. We can also point you to important information.
  • Please open a draft PR once you have made the first commit, so the progress is visible and there is a place for discussion.
  • Please refer to the auto-sync documentation to learn how to refactor or implement an architecture with auto-sync.

TODO for refactored archs

List of missing things which should be done before v6 to get a nice round package.

Capstone

  • Missing alias support for SystemZ
  • Missing alias support for LoongArch
  • Update docs with ASUpdater.py instructions.
  • Modernize Capstone testing  #1984
  • Update all archs to LLVM 18
  • Remove tablegen files from suite.
  • Add CS assert version and add the asserts to the LLVM files again.
  • Wrap all possible code into CAPSTONE_DIET.
  • Run 0x0 to 0xffffffff as input once on ARM, PPC, AArch64 (with details enabled) to check for segfaults.
  • name2id docs. Parameter max should be changed to table size and in the loop be max - 1
  • Consider letting alias details and real details live side by side, so users do not need to decide on one (how would this play together with CAPSTONE_DIET?).
  • Possibly [Auto-Sync] Generate general instruction encoding format #2152
  • AArch64 missing details tasks #2196
  • Expose PPC instruction formats on the public interface

LLVM revisions

Auto-Sync

  • add refactor setting to auto-sync updater.
  • Add auto-sync unit tests
  • Translate template functions as functions, not as macros.

Backends

  • Generate decoding/printing macros as functions, if there is only a single version (allows proper debugging, which would be a blessing).

ARM

PPC

  • Encoding info

AArch64

  • Encoding info

XVilka (Contributor) commented Jun 19, 2023

@kabeor @aquynh @Rot127 I propose making the next release, the one with the auto-sync changes, 6.0 rather than 5.1, because:

  • There are slight API changes
  • The amount of code changes is HUGE

https://github.com/orgs/capstone-engine/projects/1 - then it would need to be updated too.

aquynh (Collaborator) commented Jun 19, 2023

please can you summarize the API changes here?

Rot127 (Collaborator, Author) commented Jun 19, 2023

@aquynh

ARM

  • Enum changes:
    • ARM_CC_ -> ARMCC
    • System registers are renamed to match C++ namespaces. Banked and system registers are also grouped into different groups.
    • Some instruction enum entries no longer exist (e.g. VPUSH, VPOP).
  • Some instruction groups which are not part of LLVM were removed (e.g. GROUP_INT).
    • Groups like RET, INT should be added via Mapper separately.
  • Feature groups like ARM_GRP_CRC are renamed to match LLVM naming: ARM_FEATURE_HasCRC
  • Features are now checked more strictly (V8, MCLASS, ARM, THUMB) because instruction aliases are now supported, and those aliases might change depending on the enabled features.
  • The memory offset register or immediate is now always part of the memory operand. Offset or index operands are no longer separated. Before, only offset operands within the [] brackets were added.
  • writeback is part of detail and no longer of detail.arm.
  • Register aliases not defined in LLVM (r15 = pc etc.) are no longer printed by default. They must be enabled via CS_OPT_SYNTAX_CS_REG_ALIAS or -a for cstool.
  • The immediate value of operands is now of type uint32_t, no longer int32_t.

PPC

  • Predicate enum members are renamed. They now use the LLVM name (e.g. PPC_BC_NU_PLUS -> PPC_PRED_NU_PLUS).
  • Branch conditions are now saved in more detail in cs_ppc.bc.
  • The base register of a PPC memory operand was not present if reg = r0. This is fixed now.
  • ppc_ops_crx is removed (wasn't used).

AArch64

  • Renamed all ARM64 -> AArch64 (for filenames, enums, variable names). Necessary to be consistent with LLVM.
  • SME operands changed (contain more detail, terminology is closer to the docs).
  • System operands changed (now categorized into SysAlias, SysImm, SysReg).

This list is also part of the PR.

aquynh (Collaborator) commented Jun 26, 2023

Can't we avoid breaking compatibility?

Rot127 (Collaborator, Author) commented Jun 26, 2023

@aquynh The short answer is no.

But let me go into more detail, also for others:

The problem with automatic Capstone updates is that, due to the differences between C++ and C, we have many cases to handle where C is not equivalent to C++.

To reduce those cases we need to stay as close to the original C++ syntax and semantics as possible, because every renaming (e.g. of enum values) and every semantic or overall design change almost always adds manual work during an update.

This is why those breaking changes are needed.
Each of them moves Capstone code semantically or syntactically closer to the LLVM definitions.

This is of course a pain for compatibility, but it is definitely worth it in the long run.
Because:

  • All auto-sync archs are semantically pretty much equivalent to LLVM.
  • It unifies how modules work, so we can share code between them (see the new Mapping.* files).
  • It reduces the effort to update (less manual work, due to fewer edge cases), so hopefully more people will update their archs.

Here is more detail on each breaking change.

Enum changes:

Done to match LLVM naming. It saves us from changing enum names across several files whenever we update.

ARM

Some instruction groups which are not part of LLVM were removed (e.g. GROUP_INT)

Capstone-specific instruction groups (like RET, INT) are added via Mapper separately (this is on the todo list). Because they are not defined in LLVM, we cannot generate them without adding exceptions again.

Features are now checked more strictly (V8, MCLASS, ARM, THUMB) because instruction aliases are now supported, and those aliases might change depending on the enabled features.

A simple necessity: with the new instructions, how the same bytes decode depends on the enabled features.

The memory offset register or immediate is now always part of the memory operand. Offset or index operands are no longer separated. Before, only offset operands within the [] brackets were added.

This moves closer to the LLVM logic. The displacement of a memory access doesn't need to be within the [] brackets (e.g. strt fp, [sp], 4), but it is still defined as part of the memory operand. This was incorrectly represented in CS before.
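To make the operand change concrete, here is a small toy model in Python. It is only a sketch: the `Reg`, `Imm`, and `Mem` classes and the `fold_post_index` helper are hypothetical names invented for this illustration and are not Capstone's `cs_arm_op` layout or API. It assumes the old representation listed a post-indexed offset as a separate operand after the memory operand, as described above.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Union


# Hypothetical operand types for illustration only (not Capstone structs).
@dataclass(frozen=True)
class Reg:
    name: str


@dataclass(frozen=True)
class Imm:
    value: int


@dataclass(frozen=True)
class Mem:
    base: str
    index: Optional[str] = None
    disp: int = 0


Operand = Union[Reg, Imm, Mem]


def fold_post_index(ops: List[Operand]) -> List[Operand]:
    """Merge an offset that directly follows a memory operand into it."""
    out: List[Operand] = []
    for op in ops:
        prev = out[-1] if out else None
        if isinstance(prev, Mem) and isinstance(op, Imm):
            # Immediate offset written outside the brackets becomes the displacement.
            out[-1] = replace(prev, disp=op.value)
        elif isinstance(prev, Mem) and isinstance(op, Reg):
            # Register offset written outside the brackets becomes the index.
            out[-1] = replace(prev, index=op.name)
        else:
            out.append(op)
    return out


# strt fp, [sp], 4
legacy_ops = [Reg("fp"), Mem(base="sp"), Imm(4)]  # offset as its own operand
new_ops = fold_post_index(legacy_ops)             # offset inside the memory operand
```

In this model the instruction goes from three operands to two, so code that indexes operands by position would need to be adjusted accordingly.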

writeback is part of detail and no longer of detail.arm.

We now support the concept of tied operands (the way LLVM describes writeback registers). So writeback information is now known for all auto-sync archs.

Register aliases not defined in LLVM (r15 = pc etc.) are no longer printed by default. They must be enabled via CS_OPT_SYNTAX_CS_REG_ALIAS or -a for cstool.

As said before, modules will match the llvm-objdump output more closely, including the register naming and the decoded asm string.

The immediate value of operands is now of type uint32_t, no longer int32_t.

See: #2056

PPC

Branch conditions are now saved in more detail in cs_ppc.bc.

Just a nice feature we are now able to provide.

The base register of a PPC memory operand was not present if reg = r0. This is fixed now.

A semantic fix. The base register should have been set.

ppc_ops_crx is removed (wasn't used).

Wasn't used.

AArch64

Renamed all ARM64 -> AArch64 (for filenames, enums, variable names). Necessary to be consistent with LLVM.

This is a big one. But having two names for the same architecture in the code is a nightmare for generation. Also it just doesn't bring any value. Being closer to LLVM is the choice here.

SME operands changed (contain more detail, terminology is closer to the docs).

Again a nice feature addition because we save more detail. Being closer to the official docs when it comes to naming eases integrating Capstone in other projects.

System operands changed (now categorized into SysAlias, SysImm, SysReg).

Again, a move to LLVM semantics because:

  • It wasn't correct before (system immediates and other aliases were incorrectly identified as system registers or not categorized at all).
  • This mimics the inheritance of system operands within the LLVM code.
  • Eases generation.

Personally, I think that Capstone will become more and more irrelevant as a disassembler engine if we:

  • Do not modernize it (update archs, update testing)
  • Do not provide people a relatively easy way to add more features and architectures (e.g. the instruction encoding, instruction form information (for PPC) and others).

If we do not go through the pain of breaking compatibility once to gain long-term improvements, Capstone just won't be competitive with other disassemblers in the future.

aquynh (Collaborator) commented Jun 26, 2023

lets take one example: we want to rename ARM_CC to ARMCC.

can we have compatibility by keeping ARM_CC, and add (new) ARMCC, so everyone is happy?

Rot127 (Collaborator, Author) commented Jun 26, 2023 via email

aquynh (Collaborator) commented Jun 26, 2023

I will answer your suggestions later; on the phone it is difficult to write well. As a general note, though: I second @XVilka here. I have been working on this for half a year now and opened the ARM PR as a draft after two to three months, exactly to ask for this kind of feedback and for suggestions on design and other choices. This is also why I added the list of breaking changes to the PR description, updated it continuously, and asked for feedback on the big ones, so that no one needed to read through the code and everyone could save time. I also think I made clear that I am more than happy to provide detailed answers and more details if requested. As I stated in the ARM PR, the time I can spend on this is limited (until the end of July), and we want and need to start building on it in Rizin. With all due respect, there were at least four and a half months to discuss a big decision like this, and we really need to carry on.

26 Jun 2023 17:23:51, Nguyen Anh Quynh wrote:

I never say it is not merged, Anton.

On Mon, Jun 26, 2023, 23:12, Anton Kochkov wrote:

If auto-sync work is not merged, I am afraid we have to fork the capstone. It's your choice - you want updated architectures or not.

Understood - we are all short of time, especially those who are maintaining this project unpaid, in their spare time.

we will try to merge this in July.

Rot127 (Collaborator, Author) commented Jun 26, 2023

@aquynh I agree that the compatibility concern is very valid. But since I asked in the PRs and got no pushback, I assumed modernizing was more important.

I would propose to finish up the v5 release and add a note that v6 will bring big changes.

If people try out the next branch and find that they rely desperately on some of the old behavior, we can think about how to make it compatible for them in a different branch, or guide them to do this on their own.

can we have compatibility by keeping ARM_CC, and add (new) ARMCC, so everyone is happy?

This specifically is not just a syntax change; the values change as well (ARM_CC_invalid = 0 is replaced by ARMCC_UNDEF = 15). Reversing this would also mean:

  1. Having two different CC enums (for CS and LLVM).
  2. Having to translate between those enums.

But there is little point in keeping this complexity (other than for compatibility reasons, of course).
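As a rough illustration of what that translation layer would involve, here is a minimal Python sketch. The class and function names (`LegacyArmCC`, `ArmCC`, `legacy_to_llvm`, `llvm_to_legacy`) are hypothetical, and the numbering is an assumption based on the values mentioned above: the legacy enum reserves 0 for the invalid marker, while the LLVM-style enum starts its conditions at 0 and uses 15 for UNDEF. Mapping the legacy invalid marker onto UNDEF is likewise only an assumption made for this example.

```python
from enum import IntEnum

# The fifteen ARM condition codes in their usual order.
_CONDITIONS = ["EQ", "NE", "HS", "LO", "MI", "PL", "VS", "VC",
               "HI", "LS", "GE", "LT", "GT", "LE", "AL"]

# Assumed legacy numbering: INVALID = 0, EQ = 1, ..., AL = 15.
LegacyArmCC = IntEnum("LegacyArmCC", ["INVALID"] + _CONDITIONS, start=0)

# Assumed LLVM-style numbering: EQ = 0, ..., AL = 14, UNDEF = 15.
ArmCC = IntEnum("ArmCC", _CONDITIONS + ["UNDEF"], start=0)


def legacy_to_llvm(cc):
    """Translate a legacy condition code to the LLVM-style enum by name."""
    if cc is LegacyArmCC.INVALID:
        return ArmCC.UNDEF
    return ArmCC[cc.name]


def llvm_to_legacy(cc):
    """Translate an LLVM-style condition code back to the legacy enum."""
    if cc is ArmCC.UNDEF:
        return LegacyArmCC.INVALID
    return LegacyArmCC[cc.name]
```

Under these assumptions the raw integer 15 means AL in the legacy enum and UNDEF in the new one, so the two enums cannot simply be aliased; every value crossing the boundary would have to go through a mapping like the one sketched here.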

XVilka (Contributor) commented Jul 4, 2023

Because this is taking longer than I expected, I suggest targeting the upcoming LLVM 17.0 release, which brings a few nice updates to ARMv9 and the RISC-V extensions: https://discourse.llvm.org/t/llvm-17-0-0-release-planning-and-update/71762

  • July 25th - release/17.x branch created
  • July 27th - 17.0.0-rc1 released
  • August 9th - 17.0.0-rc2 released

https://llvm.org/docs/ReleaseNotes.html#non-comprehensive-list-of-changes-in-this-release

kabeor (Member) commented Jul 6, 2023

@XVilka In that case, should we merge #1949 after the 17.x release?

I suggest continuing this topic at capstone-engine/llvm-capstone#11

0pendev commented Jan 30, 2024

@Rot127 As discussed in the rizin Mattermost I have some free time and would like to update the BPF support to auto-sync 🙌

Rot127 (Collaborator, Author) commented Jan 30, 2024

Great! Please go ahead. Start with the documentation and let me know if you need help with something. If something is not clearly written or needs clarification in your opinion, please let me know as well. The docs haven't been read by many people yet, so a fresh look at them is welcome.

Also notify me when you have a fork pushed and a draft PR opened, so we can link it here. A draft PR is preferred, because we can comment on it more easily.
