similarity-d ~main
Command line utilities for text similarity experiments.
similarity-d
similarity-d is a command-line tool written in D for finding similar functions in a project. It uses a Tree Edit Distance (TED) algorithm to compare normalized abstract syntax trees.
Prerequisites
The tool requires DMD 2.111.0 or newer and relies on dub being available in your PATH.
If the compiler is not installed, use the official install.sh script or the Windows installer provided by the D language project.
After installation, run dub --version to confirm the toolchain is accessible.
Usage
dub fetch similarity-d
dub run similarity-d -- [options]
Options
- --dir <path>: Directory to search for .d source files (defaults to the current directory).
- --threshold <float>: Similarity threshold used to decide matches.
- --min-lines <integer>: Minimum number of lines in a function to be considered.
- --min-tokens <integer>: Minimum number of normalized AST nodes (default 20).
- --no-size-penalty: Disable the length penalty when computing similarity.
- --print: Print the snippet of each function when reporting results.
- --cross-file[=true|false]: Allow comparison across different files (default true). Use --cross-file=false to limit comparisons to within each file.
- --exclude-unittests: Skip unittest blocks when collecting functions.
- --exclude-nested: Ignore nested functions and only collect top-level declarations.
- --version: Print the package version and exit.
The CLI compares all functions it finds in the specified directory and prints any matches whose similarity score exceeds the threshold. Each result lists the two locations and the calculated similarity. The TED algorithm normalizes identifiers and literals before calculating an edit-distance based score.
When computing similarity, a length penalty is applied so that short functions do not dominate the results. This behaviour can be disabled with --no-size-penalty if you want the raw tree edit distance score.
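The comparison pipeline described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's D implementation: the tuple-based tree encoding, the "id:" and "lit:" label prefixes, the unit edit costs, the formula 1 - distance / max(size), and every function name here are hypothetical. The length penalty is left out because its formula is not given, so the score corresponds to what --no-size-penalty is described as producing.

```python
from functools import lru_cache
from itertools import combinations


def node(label, *children):
    """Build a hashable tree node: (label, tuple_of_children)."""
    return (label, tuple(children))


def normalize(tree):
    """Replace identifier and literal labels with placeholders (assumed prefixes)."""
    label, children = tree
    if label.startswith("id:"):
        label = "id"
    elif label.startswith("lit:"):
        label = "lit"
    return (label, tuple(normalize(child) for child in children))


def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f1:
        return sum(size(t) for t in f2)  # insert everything left in f2
    if not f2:
        return sum(size(t) for t in f1)  # delete everything left in f1
    (l1, c1), (l2, c2) = f1[-1], f2[-1]
    return min(
        forest_distance(f1[:-1] + c1, f2) + 1,  # delete rightmost root of f1
        forest_distance(f1, f2[:-1] + c2) + 1,  # insert rightmost root of f2
        forest_distance(f1[:-1], f2[:-1])       # match or rename the two roots
        + forest_distance(c1, c2) + (l1 != l2),
    )


def similarity(a, b):
    """Raw score in [0, 1]; no length penalty is modelled here."""
    a, b = normalize(a), normalize(b)
    return 1.0 - forest_distance((a,), (b,)) / max(size(a), size(b))


def find_matches(functions, threshold=0.85, min_tokens=20, cross_file=True):
    """functions: list of (file, name, tree). Returns (name_a, name_b, score)."""
    kept = [f for f in functions if size(f[2]) >= min_tokens]
    matches = []
    for (file_a, name_a, tree_a), (file_b, name_b, tree_b) in combinations(kept, 2):
        if not cross_file and file_a != file_b:
            continue
        score = similarity(tree_a, tree_b)
        if score > threshold:  # "exceeds the threshold"
            matches.append((name_a, name_b, score))
    return matches
```

Two functions that differ only in identifier names and literal values normalize to the same tree, so their distance is 0 and their score is 1. With min_tokens left at 20, such tiny trees are filtered out before any comparison takes place.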
Example:
$ similarity-d --threshold=0.8 --min-lines=3 --dir=source --exclude-nested
# disable cross-file comparisons
$ similarity-d --threshold=0.8 --cross-file=false
$ similarity-d --version
0.1.0
Remarks
This project adapts the ideas from mizchi/similarity, a multi-language code duplication detector written in Rust and TypeScript. While the original repository focuses on various languages, similarity-d implements the same tree edit distance approach specifically for D source code.
Sample Usage
Several small examples are available under the samples/ folder. Each folder contains a couple of .d files and a short README describing the scenario.
samples/basic
This directory has two almost identical functions. Lower the token filter to see the match:
$ dub run -- --dir samples/basic --min-tokens=0
samples/basic\file_a.d:3-9 <-> samples/basic\file_b.d:3-9 score=1 priority=7
samples/basic\file_a.d:20-26 <-> samples/basic\file_b.d:20-26 score=1 priority=7
Cross-file comparison is enabled by default, so functions from file_a.d and file_b.d match. Restrict the tool to compare only within each file:
$ dub run -- --dir samples/basic --min-tokens=0 --cross-file=false
No similar functions found.
Running without --min-tokens=0 prints nothing because the default value of 20 filters out these tiny functions.
samples/threshold
Two functions with different lengths live in a.d. The default threshold of 0.85 hides the pair:
$ dub run -- --dir samples/threshold
No similar functions found.
Lowering the threshold reveals a partial match:
$ dub run -- --dir samples/threshold --threshold=0.3 --min-tokens=0 --cross-file=false
samples/threshold\a.d:1-7 <-> samples/threshold\a.d:9-17 score=0.346939 priority=3.12245
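When post-processing results in a script, the lines can be split into their fields. The sketch below assumes the layout seen in the sample output (path:start-end <-> path:start-end score=... priority=...); parse_result is a hypothetical helper, not part of the tool. In the samples shown here, the printed priority is consistent with the score multiplied by the line count of the longer function (0.346939 × 9 ≈ 3.12245, and 1 × 7 = 7). That relation is an observation from these outputs, not a documented formula.

```python
import re

# Layout assumed from the sample output; paths may contain backslashes or colons.
_RESULT = re.compile(
    r"^(.+):(\d+)-(\d+) <-> (.+):(\d+)-(\d+) score=(\S+) priority=(\S+)$"
)


def parse_result(line):
    """Return a dict of fields, or None for lines that are not results."""
    match = _RESULT.match(line.strip())
    if match is None:
        return None
    path_a, a0, a1, path_b, b0, b1, score, priority = match.groups()
    return {
        "a": (path_a, int(a0), int(a1)),
        "b": (path_b, int(b0), int(b1)),
        "score": float(score),
        "priority": float(priority),
    }


def line_count(span):
    """Number of lines covered by a (path, start, end) span, inclusive."""
    _, start, end = span
    return end - start + 1
```

Lines such as "No similar functions found." do not match the pattern and yield None, so a caller can skip them.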
samples/nested
This folder illustrates how nested functions influence the results. The files contain identical nested addOne functions but different outer functions.
$ dub run -- --dir samples/nested --min-tokens=0
samples/nested\file_a.d:3-9 <-> samples/nested\file_b.d:3-9 score=1 priority=7
Ignoring nested functions removes the match:
$ dub run -- --dir samples/nested --min-tokens=0 --exclude-nested
No similar functions found.
Development
Run the full test suite before sending a pull request. The project expects coverage information to be generated and kept at or above 70% for each module.
dub test --coverage --coverage-ctfe
After running the tests, run the coverage check script to ensure each source-*.lst file reports at least 70% coverage:
rdmd ./scripts/check_coverage.d
The script exits with an error if any coverage file is below the threshold.
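For readers unfamiliar with coverage listings, the kind of check involved can be sketched as follows. This is not a port of scripts/check_coverage.d, whose contents are not shown here. It assumes that each .lst file ends with a summary line of the form "<file> is N% covered", and it skips files without such a line. All function names are hypothetical.

```python
import re
from pathlib import Path

# Assumed summary line, e.g. "source/app.d is 85% covered".
_SUMMARY = re.compile(r"is (\d+)% covered\s*$")


def coverage_percent(lst_text):
    """Return the percentage from the last non-empty line, or None if absent."""
    lines = [line for line in lst_text.splitlines() if line.strip()]
    if not lines:
        return None
    match = _SUMMARY.search(lines[-1])
    return int(match.group(1)) if match else None


def files_below(reports, minimum=70):
    """reports maps file name -> .lst text; returns names under the minimum."""
    failing = []
    for name, text in sorted(reports.items()):
        percent = coverage_percent(text)
        if percent is not None and percent < minimum:
            failing.append(name)
    return failing


def check_directory(directory=".", minimum=70):
    """Read every source-*.lst file in a directory and list the failing ones."""
    reports = {p.name: p.read_text() for p in Path(directory).glob("source-*.lst")}
    return files_below(reports, minimum)
```

A wrapper could call check_directory() after dub test and exit with a non-zero status when the returned list is non-empty, mirroring the behaviour described above.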
To verify the command line interface still works, invoke it with a minimal configuration:
dub run -- --dir source/lib --exclude-unittests --threshold=0.9 --min-lines=3
Dependency Maintenance
Refresh dependencies with dub upgrade and then run the full test suite with coverage. If the tests pass, verify the CLI works as shown above and commit the updated manifest files. See AGENTS.md for the detailed procedure.
Contributing
See CONTRIBUTING.md for workflow and guidelines.
- Repository: lempiji/similarity-d
- License: MIT
- Dependencies: dmd:frontend
- Versions:
  - 0.1.0 (2025-Jun-29)
  - ~zmjwhu-codex/実施する定期的メンテナンスタスクの選定 (2025-Jul-04)
  - ~wmna58-codex/実施する定期的メンテナンスタスク選定 (2025-Jul-04)
  - ~main (2025-Jul-07)
  - ~codex/実施する定期的メンテナンスタスクの選定 (2025-Jul-04)