If you want to count lines of code by language in a repository, start with this:
cloc .
That is the useful answer most people are looking for. It gives you a language breakdown, separates blank lines from comments and code, and works well enough for a quick read on most repositories.
The more honest answer is only slightly longer:
cloc . --exclude-dir=.git,node_modules,dist,build,vendor
That version is closer to what you usually want in a real codebase. Counting
node_modules, generated build output, vendored dependencies, or checked-in
distribution artifacts will make the report look precise while quietly lying to
you. The command is still simple, but the exclusion list matters.
Lines of code are not a measure of engineering value. They do not tell you whether the system is well designed, whether the tests are useful, or whether a team is productive. But a language-level line count is still a handy diagnostic. It tells you what kind of repository you are actually dealing with before you start arguing about build tools, staffing, migration plans, or CI time.
I use line counts as a map, not a scoreboard.
The Fast Path
For a normal repository, these are the commands worth knowing.
| Goal | Command |
|---|---|
| Count lines by language | cloc . |
| Exclude common dependency and build directories | cloc . --exclude-dir=.git,node_modules,dist,build,vendor |
| Get machine-readable output | cloc . --json |
| Get a per-file breakdown | cloc . --by-file |
| Use a faster Rust-based counter | tokei . |
| Use no extra dependencies | Read Counting Lines of Code by Language Using Only Unix Tools |
If you are doing this once by hand, cloc . is fine. If you are putting the
result in a report, dashboard, migration plan, or CI job, slow down and decide
what should be excluded.
Why Count Lines Of Code By Language?
The useful question is not "How big is this repository?"
The useful question is "What kinds of engineering work does this repository contain?"
A repository that is 70% TypeScript, 20% Go, and 10% Terraform has a different operating shape from one that is mostly C++, Python, and generated protobuf bindings. The line count will not tell you whether the code is good, but it will help you ask better questions:
- Which languages dominate the maintenance burden?
- Are generated files inflating the apparent size of the repo?
- Is a "small Python helper" quietly becoming a real subsystem?
- Does the language mix explain why CI is slow?
- Are migration claims backed by measurable movement over time?
- Does the team have the right review expertise for the code that actually exists?
That is why line counts show up in onboarding docs, architecture reviews, build system migrations, and technical due diligence. They are crude, but crude does not mean useless. A hammer is crude too. It still does a job.
Use cloc For The Default Answer
cloc is the boring, reliable default for
counting lines of code by language. It counts blank lines, comment lines, and
code lines across many languages, and it can emit results in formats such as
plain text, JSON, XML, YAML, CSV, and Markdown.
Install it with your usual package manager:
brew install cloc
On Debian or Ubuntu:
sudo apt-get install cloc
Then run it at the root of the repository:
cloc .
Example output looks like this:
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
Python 42 920 610 5210
TypeScript 31 700 280 4300
Go 18 360 190 2800
YAML 16 80 40 620
-------------------------------------------------------------------------------
SUM: 107 2060 1120 12930
-------------------------------------------------------------------------------
The important columns are:
- files: how many files cloc classified as that language
- blank: blank lines
- comment: comment-only lines
- code: lines counted as code
For most engineering conversations, the code column is the number people care
about. The files column is often just as interesting. A language with a small
line count and a large file count might be configuration, generated stubs, or a
thin layer spread across the system.
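To make that files-versus-code comparison concrete, here is a rough sketch that prints average code lines per file for each language, smallest first. It assumes you have saved JSON output with cloc . --json > cloc.json (covered below) and that jq is installed; lines_per_file is a hypothetical helper name, not something cloc provides.

```shell
# Print "language<TAB>average code lines per file", smallest average first.
# Assumes the cloc JSON layout used elsewhere in this article:
# one object per language with nFiles and code fields, plus header and SUM.
lines_per_file() {
  jq -r '
    to_entries[]
    | select(.key != "header" and .key != "SUM")
    | select(.value.nFiles > 0)
    | "\(.key)\t\(.value.code / .value.nFiles | floor)"
  ' "${1:-cloc.json}" \
  | sort -t "$(printf '\t')" -k2,2n
}
```

Run it as lines_per_file cloc.json. Languages at the top of the output average very few lines per file, which is usually where configuration and generated stubs show up.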
Exclude The Stuff That Should Not Count
The easiest way to get a misleading line count is to count everything.
For JavaScript and TypeScript repositories, node_modules can dwarf the code
your team owns. For compiled projects, build, dist, target, or bazel-bin
can pull generated output into the report. For older projects, vendor may be a
mix of third-party dependencies and locally patched code.
Start with an exclusion list:
cloc . --exclude-dir=.git,node_modules,dist,build,target,vendor
For Python projects, you may also want:
cloc . --exclude-dir=.git,.venv,__pycache__,build,dist
For Bazel repositories, be explicit about generated output:
cloc . --exclude-dir=.git,bazel-bin,bazel-out,bazel-testlogs,bazel-myrepo
Replace bazel-myrepo with the bazel-<directory name> symlink that Bazel creates
in your workspace root. Bazel users should also think carefully before counting generated
sources. Sometimes generated code is an operational reality you need to measure.
Sometimes it is noise. Decide which conversation you are having.
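If you would rather not hard-code those names, one option is to discover them. This is a minimal sketch, assuming Bazel's convenience outputs appear as bazel-* symbolic links in the workspace root and that your find supports -maxdepth (GNU and BSD find do). bazel_exclude_arg is a hypothetical helper name, not part of cloc or Bazel.

```shell
# Build an --exclude-dir argument from the bazel-* symlinks in a directory.
# Only symlinks are listed; a real directory named bazel-something is left alone.
bazel_exclude_arg() {
  dirs=$(find "${1:-.}" -maxdepth 1 -type l -name 'bazel-*' -exec basename {} \; \
    | sort | paste -sd, -)
  if [ -n "$dirs" ]; then
    printf '%s\n' "--exclude-dir=.git,$dirs"
  else
    printf '%s\n' "--exclude-dir=.git"
  fi
}
```

You would then run cloc . "$(bazel_exclude_arg .)" from the workspace root. That handles the symlinks. Whether checked-in generated sources belong in the count is still the judgment call described above.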
This is the same kind of judgment that shows up in build tooling decisions. If you are choosing between command runners and real build systems, the language mix matters. I wrote more about that in Bazel vs. Make vs. Just: Choosing Build Tools for Real Engineering Teams.
Count Only Git-Tracked Files
One practical trick is to count only files tracked by Git. cloc can do that
directly:
cloc --vcs git
That avoids local scratch files, build artifacts, downloaded test data, and whatever else happens to be sitting in a developer's working tree.
If you need a more explicit file list, use --list-file:
git ls-files > /tmp/cloc-files.txt
cloc --list-file=/tmp/cloc-files.txt
If you need exclusions before writing that list, use Git pathspecs:
git ls-files -- ':!:node_modules' ':!:dist' ':!:build' > /tmp/cloc-files.txt
cloc --list-file=/tmp/cloc-files.txt
This is especially useful for CI jobs because the result is less dependent on whatever the checkout directory happens to contain after previous steps.
I would not make line counting a required CI gate unless there is a very specific reason. But it can be useful as a reporting job, especially when you are tracking a migration from one language or framework to another. If you do wire it into a developer workflow, treat it like any other local command surface: boring, repeatable, and easy to run. That is the same principle behind Making Local CI Commands Boring Enough for Humans and AI Agents.
Save JSON For Scripts And Dashboards
If you want to graph the language mix over time, do not scrape the text table. Ask for JSON:
cloc . --json > cloc.json
Then use jq to inspect it:
jq '.SUM.code' cloc.json
Or pull out the language totals:
jq 'to_entries[]
| select(.key != "header" and .key != "SUM")
| {language: .key, files: .value.nFiles, code: .value.code}' cloc.json
That gives you a clean path to dashboards, scheduled reports, pull request comments, or trend snapshots. If the point is historical tracking, store the result with a timestamp and the Git commit SHA:
commit=$(git rev-parse HEAD)
date=$(date -u +%Y-%m-%dT%H:%M:%SZ)
cloc . --json | jq --arg commit "$commit" --arg date "$date" \
'. + {slaptijack_meta: {commit: $commit, date: $date}}' \
> "loc-$commit.json"
That is more useful than a spreadsheet someone updates twice and forgets.
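Once you have two of those snapshots, comparing them is a short jq expression. The sketch below is one way to do it, assuming both files use the cloc JSON layout shown above and a jq version with --slurpfile (1.5 or later). loc_delta is a hypothetical helper name.

```shell
# Print "language<TAB>change in code lines" between two cloc JSON snapshots.
# Languages that exist in only one snapshot are treated as 0 in the other.
loc_delta() {
  jq -n -r --slurpfile old "$1" --slurpfile new "$2" '
    ($old[0] | del(.header, .SUM, .slaptijack_meta)) as $a
    | ($new[0] | del(.header, .SUM, .slaptijack_meta)) as $b
    | ($a + $b | keys)[]
    | "\(.)\t\(($b[.].code // 0) - ($a[.].code // 0))"
  '
}
```

Call it as loc_delta old.json new.json, with the older snapshot first. A negative number means that language shrank between the two commits, which is the kind of evidence a migration claim should be able to produce.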
Use tokei When Speed Matters
tokei is a fast Rust-based alternative
to cloc. It is a good choice when the repository is large, when you want a
quick local command, or when you prefer its default behavior.
Install it with Cargo:
cargo install tokei
Run it:
tokei .
One nice operational detail: tokei respects .gitignore and .ignore files,
and it supports additional exclusions with --exclude.
tokei . --exclude node_modules --exclude dist --exclude build
It can also emit JSON:
tokei . --output json > tokei.json
My usual recommendation:
- Use cloc when you want the conventional answer and broad familiarity.
- Use tokei when speed and ignore-file behavior matter more.
- Use a small Unix pipeline when installing tools is not an option.
The Unix-only approach is covered in the companion article, Counting Lines of Code by Language Using Only Unix Tools.
Do Not Worship The Number
Line counts are easy to collect and easy to misuse.
A 500-line module can be worse than a 5,000-line module if the smaller one hides more coupling. Generated code can be large and boring. Configuration can be tiny and terrifying. Tests can inflate line counts in a way that is actually healthy. Deleting code is often good, but deleting the wrong abstraction can make the next six changes harder.
Use lines of code by language for questions like:
- "What languages need first-class build and test support?"
- "What code should new engineers learn first?"
- "Is this migration actually moving?"
- "Which generated directories are polluting our repository metrics?"
- "Does CI time correlate with the parts of the repo that are growing?"
Avoid using it for questions like:
- "Which team is most productive?"
- "Which engineer wrote the most value?"
- "Is this codebase good?"
- "Should we reward deletion without understanding what changed?"
Metrics are tools. The minute they become targets, people start optimizing the wrong thing.
My Practical Default
For a quick local read:
cloc . --exclude-dir=.git,node_modules,dist,build,target,vendor
For a Git-tracked report:
cloc --vcs git --json > cloc.json
For a fast interactive check:
tokei .
For a no-dependency fallback:
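find . -type f -name '*.py' -exec cat {} + \
| wc -l
That last command is intentionally less clever than a real language counter. It concatenates the files before counting, because wc -l prints an extra total row when given several files and summing that row would double the result. It selects files by extension, not by language heuristics, and it counts every line, including blanks and comments. Sometimes that is enough. Sometimes it is exactly the wrong abstraction. Pick the tool based on the decision you need to make.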
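If one extension is not enough, the same idea stretches to a per-extension tally. This is a rough sketch using only find, awk, and sort. loc_by_ext is a hypothetical helper name, and the two excluded directories are just examples to adjust. It treats the text after the last dot as the "language", counts every line including blanks and comments, and will happily count binary files too.

```shell
# Tally lines per file extension under a directory (default: current directory).
# The first awk counts lines per extension for each batch of files find hands it;
# the second awk merges those batches into one total per extension.
loc_by_ext() {
  find "${1:-.}" -type f -name '*.*' \
    ! -path '*/.git/*' ! -path '*/node_modules/*' \
    -exec awk '
      FNR == 1 { n = split(FILENAME, parts, /\./); ext = parts[n] }
      { count[ext]++ }
      END { for (e in count) print e, count[e] }
    ' {} + \
  | awk '{ total[$1] += $2 } END { for (e in total) print total[e], e }' \
  | sort -k1,1nr -k2,2
}
```

Run loc_by_ext . at the repository root to get one line per extension, largest first. It will not tell a C header from a C++ header, so treat it as a rough map rather than a replacement for cloc or tokei.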
Conclusion
The best way to count lines of code by language in a repository is usually
cloc ., with exclusions for dependencies, build output, generated files, and
other noise. If speed matters, try tokei. If you cannot install anything, use
find, wc, and awk with a clear understanding of their limits.
The real value is not the number. The value is the conversation the number makes more concrete.
When you know the repository is mostly TypeScript, or that the Go service is now larger than the Python system it was supposed to replace, or that half your "code" is generated output, you can make better engineering decisions. You can choose better build tools, design better CI jobs, plan migrations more honestly, and stop arguing from vibes when a simple measurement would do.
Count the lines. Then do the engineering judgment part.