similarity-d ~main

Command line utilities for text similarity experiments.


Japanese README

similarity-d is a command-line tool written in D for finding similar functions in a project. It uses a Tree Edit Distance (TED) algorithm to compare normalized abstract syntax trees.

Prerequisites

The tool requires DMD 2.111.0 or newer and expects dub to be available in your PATH. If the compiler is not installed, use the official install.sh script or the Windows installer provided by the D language project. After installation, run dub --version to confirm that the toolchain is accessible.

Usage

dub fetch similarity-d
dub run similarity-d -- [options]

Options

  • --dir <path> Directory to search for .d source files (defaults to current directory).
  • --threshold <float> Similarity threshold used to decide matches (default 0.85).
  • --min-lines <integer> Minimum number of lines in a function to be considered.
  • --min-tokens <integer> Minimum number of normalized AST nodes (default 20).
  • --no-size-penalty Disable length penalty when computing similarity.
  • --print Print the snippet of each function when reporting results.
  • --cross-file[=true|false] Allow comparison across different files (default true). Use --cross-file=false to limit comparisons within each file.
  • --exclude-unittests Skip unittest blocks when collecting functions.
  • --exclude-nested Ignore nested functions and only collect top-level declarations.
  • --version Print the package version and exit.

The CLI compares all functions it finds in the specified directory and prints every match whose similarity score exceeds the threshold. Each result lists the two locations and the calculated similarity. The TED algorithm normalizes identifiers and literals before calculating an edit-distance-based score, so functions that differ only in names or constant values still compare as structurally identical.

When computing similarity, a length penalty is applied so that short functions do not dominate the results. This behaviour can be disabled with --no-size-penalty if you want the raw tree edit distance score.
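
The scoring idea described above can be sketched in Python. This is an illustration only, not the tool's implementation: it uses a simplified top-down (Selkow-style) tree edit distance rather than whichever TED variant similarity-d uses, and the Node type, the "Kind:text" label convention, and the min/max size ratio used as the length penalty are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical AST node: label is "Kind" or "Kind:text" (e.g. "Identifier:foo").
    label: str
    children: list["Node"] = field(default_factory=list)


def size(n):
    """Number of nodes in the subtree rooted at n."""
    return 1 + sum(size(c) for c in n.children)


def normalize(n):
    """Drop identifier names and literal values, keeping only the node kind."""
    kind = n.label.split(":", 1)[0]
    return Node(kind, [normalize(c) for c in n.children])


def distance(a, b):
    """Top-down tree edit distance: relabel the root, then align the children."""
    xs, ys = a.children, b.children
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        dp[i][0] = dp[i - 1][0] + size(xs[i - 1])  # delete a whole subtree
    for j in range(1, len(ys) + 1):
        dp[0][j] = dp[0][j - 1] + size(ys[j - 1])  # insert a whole subtree
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(xs[i - 1]),
                dp[i][j - 1] + size(ys[j - 1]),
                dp[i - 1][j - 1] + distance(xs[i - 1], ys[j - 1]),
            )
    relabel = 0 if a.label == b.label else 1
    return relabel + dp[-1][-1]


def similarity(a, b, size_penalty=True):
    """Score in [0, 1]; the penalty formula is an assumption for illustration."""
    a, b = normalize(a), normalize(b)
    small, large = sorted((size(a), size(b)))
    score = max(0.0, 1.0 - distance(a, b) / large)
    if size_penalty:
        score *= small / large  # shrink the score when the sizes differ
    return score
```

In this sketch, two functions that differ only in identifier names and literal values score 1.0, while comparing a four-node function with a two-node one gives 0.5 without the penalty and 0.25 with it.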

Example:

$ similarity-d --threshold=0.8 --min-lines=3 --dir=source --exclude-nested
# disable cross-file comparisons
$ similarity-d --threshold=0.8 --cross-file=false
$ similarity-d --version
0.1.0

Remarks

This project adapts the ideas from mizchi/similarity, a multi-language code duplication detector written in Rust and TypeScript. While the original repository focuses on various languages, similarity-d implements the same tree edit distance approach specifically for D source code.

Sample Usage

Several small examples are available under the samples/ folder. Each folder contains a couple of .d files and a short README describing the scenario.

samples/basic

This directory has two almost identical functions. Lower the token filter to see the match:

$ dub run -- --dir samples/basic --min-tokens=0
samples/basic\file_a.d:3-9 <-> samples/basic\file_b.d:3-9 score=1 priority=7
samples/basic\file_a.d:20-26 <-> samples/basic\file_b.d:20-26 score=1 priority=7

Cross-file comparison is enabled by default, so functions from file_a.d and file_b.d match. Restrict the tool to compare only within each file:

$ dub run -- --dir samples/basic --min-tokens=0 --cross-file=false
No similar functions found.

Running without --min-tokens=0 reports no matches because the default value of 20 filters out these tiny functions.
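
The way --min-tokens and --cross-file interact in this sample can be pictured with a short Python sketch. It is not taken from the tool's source: the Func record, the candidate_pairs helper, the node counts, and the min_lines default of 0 are hypothetical, chosen only to mirror the options described above.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Func:
    # Hypothetical record for a collected function.
    file: str
    start: int  # first line
    end: int    # last line
    nodes: int  # normalized AST node count


def candidate_pairs(funcs, min_lines=0, min_tokens=20, cross_file=True):
    """Yield the pairs that would be compared under the given filters."""
    kept = [
        f for f in funcs
        if f.end - f.start + 1 >= min_lines and f.nodes >= min_tokens
    ]
    for a, b in combinations(kept, 2):
        if not cross_file and a.file != b.file:
            continue  # --cross-file=false: only compare within one file
        yield a, b


funcs = [Func("file_a.d", 3, 9, 12), Func("file_b.d", 3, 9, 12)]
print(len(list(candidate_pairs(funcs))))                # default filter drops both
print(len(list(candidate_pairs(funcs, min_tokens=0))))  # one cross-file pair
```

With the made-up node count of 12, both functions fall below the default of 20 and no pair is compared; lowering the filter to 0 leaves a single cross-file pair, which disappears again when cross_file is set to False.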

samples/threshold

Two functions with different lengths live in a.d. The default threshold of 0.85 hides the pair:

$ dub run -- --dir samples/threshold
No similar functions found.

Lowering the threshold reveals a partial match:

$ dub run -- --dir samples/threshold --threshold=0.3 --min-tokens=0 --cross-file=false
samples/threshold\a.d:1-7 <-> samples/threshold\a.d:9-17 score=0.346939 priority=3.12245
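
The result lines in these samples follow a regular shape, so they are easy to post-process. The Python sketch below parses one line with a regular expression; the pattern and the field names are assumptions inferred from the sample output, not a documented format. It also checks an observation that holds for the sample outputs in this README: priority equals score multiplied by the line count of the longer function. Treat that relation as an inference from the samples, not as a specification.

```python
import re

# Assumed shape: <file>:<start>-<end> <-> <file>:<start>-<end> score=<x> priority=<y>
RESULT = re.compile(
    r"^(?P<file_a>.+):(?P<a_start>\d+)-(?P<a_end>\d+) <-> "
    r"(?P<file_b>.+):(?P<b_start>\d+)-(?P<b_end>\d+) "
    r"score=(?P<score>\S+) priority=(?P<priority>\S+)$"
)


def parse_result(line):
    """Return a dict of fields for one result line, or None if it does not match."""
    m = RESULT.match(line.strip())
    if m is None:
        return None
    d = m.groupdict()
    for key in ("a_start", "a_end", "b_start", "b_end"):
        d[key] = int(d[key])
    d["score"] = float(d["score"])
    d["priority"] = float(d["priority"])
    return d


def longer_line_count(d):
    """Line count of the longer of the two functions (inclusive ranges)."""
    return max(d["a_end"] - d["a_start"] + 1, d["b_end"] - d["b_start"] + 1)


# Raw string: the sample output contains Windows-style backslashes.
line = r"samples/threshold\a.d:1-7 <-> samples/threshold\a.d:9-17 score=0.346939 priority=3.12245"
row = parse_result(line)
print(row["score"] * longer_line_count(row), row["priority"])
```

For the line above the longer function spans 9 lines, and 0.346939 × 9 ≈ 3.12245, which agrees with the reported priority.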

samples/nested

This folder illustrates how nested functions influence the results. The files contain identical nested addOne functions but different outer functions.

$ dub run -- --dir samples/nested --min-tokens=0
samples/nested\file_a.d:3-9 <-> samples/nested\file_b.d:3-9 score=1 priority=7

Ignoring nested functions removes the match:

$ dub run -- --dir samples/nested --min-tokens=0 --exclude-nested
No similar functions found.

Development

Run the full test suite before sending a pull request. The project expects coverage information to be generated and kept above 70% for each module.

dub test --coverage --coverage-ctfe

After running the tests, run the coverage check script to ensure each source-*.lst file reports at least 70% coverage:

rdmd ./scripts/check_coverage.d

The script exits with an error if any coverage file is below the threshold.
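
For readers who want to see what such a check involves, here is a minimal Python sketch. It is not a port of scripts/check_coverage.d: it assumes that each source-*.lst file ends with the summary line that DMD writes to coverage listings, of the form "<file> is N% covered", and the function names are hypothetical.

```python
import re
from pathlib import Path

# Assumed summary line at the end of a coverage listing: "<file> is N% covered"
COVERED = re.compile(r"is (\d+)% covered\s*$")


def coverage_percent(lst_text):
    """Return the percentage from the last non-empty line, or None if absent."""
    lines = [ln for ln in lst_text.splitlines() if ln.strip()]
    if not lines:
        return None
    m = COVERED.search(lines[-1])
    return int(m.group(1)) if m else None


def files_below(directory, minimum=70):
    """List (file name, percent) for every source-*.lst under the minimum."""
    failing = []
    for path in sorted(Path(directory).glob("source-*.lst")):
        pct = coverage_percent(path.read_text(encoding="utf-8"))
        if pct is not None and pct < minimum:
            failing.append((path.name, pct))
    return failing
```

A caller could run files_below(".") after the tests and return a non-zero exit status when the list is not empty. Files without a percentage line are skipped in this sketch; whether the real script treats them as failures is not stated here.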

To verify the command line interface still works, invoke it with a minimal configuration:

dub run -- --dir source/lib --exclude-unittests --threshold=0.9 --min-lines=3

Dependency Maintenance

Refresh dependencies with dub upgrade and then run the full test suite with coverage. If the tests pass, verify the CLI works as shown above and commit the updated manifest files. See AGENTS.md for the detailed procedure.

Contributing

See CONTRIBUTING.md for workflow and guidelines.

Authors:
  • lempiji
Dependencies:
dmd:frontend
Versions:
0.1.0 2025-Jun-29
~zmjwhu-codex/実施する定期的メンテナンスタスクの選定 2025-Jul-04
~wmna58-codex/実施する定期的メンテナンスタスク選定 2025-Jul-04
~main 2025-Jul-07
~codex/実施する定期的メンテナンスタスクの選定 2025-Jul-04
Short URL:
similarity-d.dub.pm