How to Count Lines of Code by Language in a Git Repository


If you want to count lines of code by language in a repository, start with this:

cloc .

That is the useful answer most people are looking for. It gives you a language breakdown, separates blank lines from comments and code, and works well enough for a quick read on most repositories.

The more honest answer is only slightly longer:

cloc . --exclude-dir=.git,node_modules,dist,build,vendor

That version is closer to what you usually want in a real codebase. Counting node_modules, generated build output, vendored dependencies, or checked-in distribution artifacts will make the report look precise while quietly lying to you. The command is still simple, but the exclusion list matters.

Lines of code are not a measure of engineering value. They do not tell you whether the system is well designed, whether the tests are useful, or whether a team is productive. But a language-level line count is still a handy diagnostic. It tells you what kind of repository you are actually dealing with before you start arguing about build tools, staffing, migration plans, or CI time.

I use line counts as a map, not a scoreboard.

The Fast Path

For a normal repository, these are the commands worth knowing.

Goal                                             Command
Count lines by language                          cloc .
Exclude common dependency and build directories  cloc . --exclude-dir=.git,node_modules,dist,build,vendor
Get machine-readable output                      cloc . --json
Get a per-file breakdown                         cloc . --by-file
Use a faster Rust-based counter                  tokei .
Use no extra dependencies                        Read Counting Lines of Code by Language Using Only Unix Tools

If you are doing this once by hand, cloc . is fine. If you are putting the result in a report, dashboard, migration plan, or CI job, slow down and decide what should be excluded.

Why Count Lines Of Code By Language?

The useful question is not "How big is this repository?"

The useful question is "What kinds of engineering work does this repository contain?"

A repository that is 70% TypeScript, 20% Go, and 10% Terraform has a different operating shape from one that is mostly C++, Python, and generated protobuf bindings. The line count will not tell you whether the code is good, but it will help you ask better questions:

  • Which languages dominate the maintenance burden?
  • Are generated files inflating the apparent size of the repo?
  • Is a "small Python helper" quietly becoming a real subsystem?
  • Does the language mix explain why CI is slow?
  • Are migration claims backed by measurable movement over time?
  • Does the team have the right review expertise for the code that actually exists?

That is why line counts show up in onboarding docs, architecture reviews, build system migrations, and technical due diligence. They are crude, but crude does not mean useless. A hammer is crude too. It still does a job.

Use cloc For The Default Answer

cloc is the boring, reliable default for counting lines of code by language. It counts blank lines, comment lines, and code lines across many languages, and it can emit results in formats such as plain text, JSON, XML, YAML, CSV, and Markdown.

Install it with your usual package manager:

brew install cloc

On Debian or Ubuntu:

sudo apt-get install cloc

Then run it at the root of the repository:

cloc .

Example output looks like this:

-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          42            920            610           5210
TypeScript                      31            700            280           4300
Go                              18            360            190           2800
YAML                            16             80             40            620
-------------------------------------------------------------------------------
SUM:                           107           2060           1120          12930
-------------------------------------------------------------------------------

The important columns are:

  • files: how many files cloc classified as that language
  • blank: blank lines
  • comment: comment-only lines
  • code: lines counted as code

For most engineering conversations, the code column is the number people care about. The files column is often just as interesting. A language with a small line count and a large file count might be configuration, generated stubs, or a thin layer spread across the system.
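
To turn the code column into a language mix, one option is a small awk filter over cloc's CSV output. This is a minimal sketch, not a cloc feature: the function name loc_share is made up for illustration, and it assumes the CSV columns arrive in the order files, language, blank, comment, code, which is worth checking against your installed version.

```shell
# loc_share: read cloc CSV on stdin and print each language's share of
# code lines, largest first.
# Hypothetical helper; assumes columns are files,language,blank,comment,code.
loc_share() {
  awk -F, '
    # Keep only data rows: a numeric file count, and not the SUM row.
    $1 ~ /^[0-9]+$/ && $2 != "SUM" { code[$2] = $5; total += $5 }
    END {
      if (total == 0) exit
      for (lang in code)
        printf "%.1f%% %s\n", 100 * code[lang] / total, lang
    }
  ' | sort -rn
}

# Usage (requires cloc):
#   cloc . --csv --quiet | loc_share
```

Percentages are often easier to discuss than raw totals, but they inherit every exclusion decision made upstream, so the same caveats apply.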

Exclude The Stuff That Should Not Count

The easiest way to get a misleading line count is to count everything.

For JavaScript and TypeScript repositories, node_modules can dwarf the code your team owns. For compiled projects, build, dist, target, or bazel-bin can pull generated output into the report. For older projects, vendor may be a mix of third-party dependencies and locally patched code.

Start with an exclusion list:

cloc . --exclude-dir=.git,node_modules,dist,build,target,vendor

For Python projects, you may also want:

cloc . --exclude-dir=.git,.venv,__pycache__,build,dist

For Bazel repositories, be explicit about generated output:

cloc . --exclude-dir=.git,bazel-bin,bazel-out,bazel-testlogs,bazel-myrepo

Replace bazel-myrepo with whatever output symlink exists in your repository. Bazel users should also think carefully before counting generated sources. Sometimes generated code is an operational reality you need to measure. Sometimes it is noise. Decide which conversation you are having.
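
Because the exclusion list tends to grow, it can help to keep it in one place rather than retyping it. The following is a rough sketch under stated assumptions: exclude_dirs_arg and the file name loc-exclude-dirs.txt are hypothetical names invented for this example, and the file is assumed to hold one directory name per line.

```shell
# exclude_dirs_arg: read directory names (one per line) on stdin and print
# them comma-separated, ready to pass to cloc --exclude-dir.
# Blank lines and lines starting with # are skipped.
exclude_dirs_arg() {
  awk '
    /^[[:space:]]*$/ || /^[[:space:]]*#/ { next }
    { printf "%s%s", sep, $0; sep = "," }
    END { print "" }
  '
}

# Usage (requires cloc; loc-exclude-dirs.txt is a hypothetical file name):
#   cloc . --exclude-dir="$(exclude_dirs_arg < loc-exclude-dirs.txt)"
```

Checking the list into the repository means the report and the reasoning behind each exclusion can be reviewed together.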

This is the same kind of judgment that shows up in build tooling decisions. If you are choosing between command runners and real build systems, the language mix matters. I wrote more about that in Bazel vs. Make vs. Just: Choosing Build Tools for Real Engineering Teams.

Count Only Git-Tracked Files

One practical trick is to count only files tracked by Git. cloc can do that directly:

cloc --vcs git

That avoids local scratch files, build artifacts, downloaded test data, and whatever else happens to be sitting in a developer's working tree.

If you need a more explicit file list, use --list-file:

git ls-files > /tmp/cloc-files.txt
cloc --list-file=/tmp/cloc-files.txt

If you need exclusions before writing that list, use Git pathspecs:

git ls-files -- ':!:node_modules' ':!:dist' ':!:build' > /tmp/cloc-files.txt
cloc --list-file=/tmp/cloc-files.txt

This is especially useful for CI jobs because the result is less dependent on whatever the checkout directory happens to contain after previous steps.
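
Before trusting a tracked-file list, it can be worth a quick look at what is in it. The sketch below tallies paths by file extension so that checked-in generated or vendored files stand out. ext_histogram is a hypothetical helper name, and extension is only a rough stand-in for language.

```shell
# ext_histogram: read file paths on stdin and print "count extension" pairs,
# most common first. Files without an extension are grouped as (none).
ext_histogram() {
  awk '
    {
      name = $0
      sub(/.*\//, "", name)          # strip directories, keep the file name
      if (name ~ /.\.[^.]+$/) {      # a dot that is not the first character
        ext = name
        sub(/.*\./, "", ext)         # keep only what follows the last dot
      } else {
        ext = "(none)"
      }
      count[ext]++
    }
    END { for (e in count) print count[e], e }
  ' | sort -k1,1nr -k2,2
}

# Usage (inside a Git repository):
#   git ls-files | ext_histogram
```

A surprising extension near the top of that list is usually a sign that the pathspec exclusions need another entry.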

I would not make line counting a required CI gate unless there is a very specific reason. But it can be useful as a reporting job, especially when you are tracking a migration from one language or framework to another. If you do wire it into a developer workflow, treat it like any other local command surface: boring, repeatable, and easy to run. That is the same principle behind Making Local CI Commands Boring Enough for Humans and AI Agents.

Save JSON For Scripts And Dashboards

If you want to graph the language mix over time, do not scrape the text table. Ask for JSON:

cloc . --json > cloc.json

Then use jq to inspect it:

jq '.SUM.code' cloc.json

Or pull out the language totals:

jq 'to_entries[]
  | select(.key != "header" and .key != "SUM")
  | {language: .key, files: .value.nFiles, code: .value.code}' cloc.json

That gives you a clean path to dashboards, scheduled reports, pull request comments, or trend snapshots. If the point is historical tracking, store the result with a timestamp and the Git commit SHA:

commit=$(git rev-parse HEAD)
date=$(date -u +%Y-%m-%dT%H:%M:%SZ)
cloc . --json | jq --arg commit "$commit" --arg date "$date" \
  '. + {slaptijack_meta: {commit: $commit, date: $date}}' \
  > "loc-$commit.json"

That is more useful than a spreadsheet someone updates twice and forgets.
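
A flat-file variant of the same idea may be easier to graph than a directory of JSON files. This is a sketch under assumptions, not a cloc feature: loc_history_rows and loc-history.csv are hypothetical names, and the filter assumes cloc's CSV columns are files, language, blank, comment, code.

```shell
# loc_history_rows COMMIT DATE: read cloc CSV on stdin and print one
# "date,commit,language,code" row per language, skipping the header and SUM.
# Hypothetical helper; assumes columns are files,language,blank,comment,code.
loc_history_rows() {
  awk -F, -v commit="$1" -v date="$2" '
    $1 ~ /^[0-9]+$/ && $2 != "SUM" { print date "," commit "," $2 "," $5 }
  '
}

# Usage (requires cloc and Git; loc-history.csv is a hypothetical file name):
#   commit=$(git rev-parse HEAD)
#   date=$(date -u +%Y-%m-%dT%H:%M:%SZ)
#   cloc . --csv --quiet | loc_history_rows "$commit" "$date" >> loc-history.csv
```

Appending rows keeps every snapshot in one file, at the cost of losing the blank and comment counts that the JSON snapshot preserves.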

Use tokei When Speed Matters

tokei is a fast Rust-based alternative to cloc. It is a good choice when the repository is large, when you want a quick local command, or when you prefer its default behavior.

Install it with Cargo:

cargo install tokei

Run it:

tokei .

One nice operational detail: tokei respects .gitignore and .ignore files, and it supports additional exclusions with --exclude.

tokei . --exclude node_modules --exclude dist --exclude build

It can also emit JSON:

tokei . --output json > tokei.json

My usual recommendation:

  • Use cloc when you want the conventional answer and broad familiarity.
  • Use tokei when speed and ignore-file behavior matter more.
  • Use a small Unix pipeline when installing tools is not an option.

The Unix-only approach is covered in the companion article, Counting Lines of Code by Language Using Only Unix Tools.

Do Not Worship The Number

Line counts are easy to collect and easy to misuse.

A 500-line module can be worse than a 5,000-line module if the smaller one hides more coupling. Generated code can be large and boring. Configuration can be tiny and terrifying. Tests can inflate line counts in a way that is actually healthy. Deleting code is often good, but deleting the wrong abstraction can make the next six changes harder.

Use lines of code by language for questions like:

  • "What languages need first-class build and test support?"
  • "What code should new engineers learn first?"
  • "Is this migration actually moving?"
  • "Which generated directories are polluting our repository metrics?"
  • "Does CI time correlate with the parts of the repo that are growing?"

Avoid using it for questions like:

  • "Which team is most productive?"
  • "Which engineer created the most value?"
  • "Is this codebase good?"
  • "Should we reward deletion without understanding what changed?"

Metrics are tools. The minute they become targets, people start optimizing the wrong thing.

My Practical Default

For a quick local read:

cloc . --exclude-dir=.git,node_modules,dist,build,target,vendor

For a Git-tracked report:

cloc --vcs git --json > cloc.json

For a fast interactive check:

tokei .

For a no-dependency fallback:

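find . -type f -name '*.py' -exec wc -l {} + \
  | awk '$2 != "total" {sum += $1} END {print sum + 0}'

That last command is intentionally less clever than a real language counter. It selects files by extension, not by language heuristics, and it counts every line, including blanks and comments. The awk filter skips the "total" rows that wc prints when it is given more than one file, so those rows are not added in a second time. Sometimes that is enough. Sometimes it is exactly the wrong abstraction. Pick the tool based on the decision you need to make.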
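
The same fallback can be stretched a little further. This is a minimal sketch, assuming the exclusion directories used elsewhere in this article; count_ext and loc_by_ext are hypothetical helper names, and the extensions in the loop are chosen only for illustration. It prunes dependency and build directories and reports a raw line count per extension, blanks and comments included.

```shell
# count_ext ROOT EXT: total lines in files under ROOT ending in .EXT,
# skipping common dependency and build directories.
count_ext() {
  find "$1" \
    \( -name .git -o -name node_modules -o -name dist -o -name build \) -prune \
    -o -type f -name "*.$2" -exec cat {} + \
    | wc -l | tr -d ' '
}

# loc_by_ext ROOT: print "extension lines" for a few example extensions.
loc_by_ext() {
  for ext in py ts go; do
    printf '%s %s\n' "$ext" "$(count_ext "$1" "$ext")"
  done
}

# Usage:
#   loc_by_ext .
```

Because wc -l counts newline characters, a file that lacks a trailing newline is undercounted by one line. That is acceptable for a rough map and another reason to prefer a real counter for anything that ends up in a report.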

Conclusion

The best way to count lines of code by language in a repository is usually cloc ., with exclusions for dependencies, build output, generated files, and other noise. If speed matters, try tokei. If you cannot install anything, use find, wc, and awk with a clear understanding of their limits.

The real value is not the number. The value is the conversation the number makes more concrete.

When you know the repository is mostly TypeScript, or that the Go service is now larger than the Python system it was supposed to replace, or that half your "code" is generated output, you can make better engineering decisions. You can choose better build tools, design better CI jobs, plan migrations more honestly, and stop arguing from vibes when a simple measurement would do.

Count the lines. Then do the engineering judgment part.

Slaptijack's Koding Kraken