In a previous article, we explored how to count lines of code by language using
tools like cloc. While cloc and similar tools are convenient, sometimes you may
prefer to avoid additional dependencies and rely solely on standard Unix
utilities. Maybe you’re working in a minimal environment, or you just want to
understand how it can be done “by hand.”
In this article, we’ll walk through a method of counting lines of code by
language using only tools found on most Unix-like systems. We’ll assume that you
categorize files by their extensions (e.g., .py for Python, .java for Java, .js
for JavaScript), since language detection is tricky without specialized tools.
Why Use Only Unix Tools?
- No Extra Dependencies: In restricted or minimal environments, installing cloc or similar tools might not be possible.
- Flexibility: By using find, wc, awk, and related tools, you can customize exactly how you count and summarize files.
- Learning Experience: Understanding how to chain simple tools together can deepen your mastery of Unix pipelines and scripting.
Basic Approach
The general strategy is:
- Identify files belonging to a particular language by their extension.
- Use wc -l to count how many lines each file has.
- Summarize the results to get a total count for each language.
This relies on a consistent mapping from file extensions to languages.
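One way to keep that mapping consistent is to define it in a single place. The following is a minimal sketch; the helper name lang_for_ext and the labels it prints are assumptions made for illustration, so extend the case arms to match your own project:

```shell
# lang_for_ext EXT: print a language label for a file extension.
# The function name and the label choices are illustrative assumptions.
lang_for_ext() {
    case $1 in
        py)   echo "Python" ;;
        java) echo "Java" ;;
        js)   echo "JS" ;;
        c|h)  echo "C" ;;
        *)    echo "Other" ;;   # anything the mapping does not know about
    esac
}
```

Calling lang_for_ext py prints Python, and any unlisted extension falls through to Other.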
Example Directory Structure
Suppose you have a repository with various languages:
myrepo/
├── src/
│   ├── main.py
│   ├── utils.py
│   ├── app.java
│   └── helper.js
└── tests/
    ├── test_main.py
    ├── test_app.java
    └── test_helper.js
We want to produce a summary like:
Python: XXX lines
Java: YYY lines
JS: ZZZ lines
Counting Lines by Extension
Let’s break it down by language extensions. We’ll handle each language separately, then combine results.
Python (".py") example:
find . -type f -name '*.py' -print0 \
| xargs -0 cat \
| wc -l
What’s happening here?
- find . -type f -name '*.py' -print0: Lists all Python files in the current directory and its subdirectories, using -print0 to produce a null-terminated list (safer for filenames that contain spaces).
- xargs -0 cat: Feeds the file list to cat, which concatenates all the files.
- wc -l: Counts the total lines across all .py files.
The output is a single number: the total line count of all Python files, including blank lines and comments.
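To check that the null-terminated list really copes with awkward names, the pipeline can be tried on a throwaway directory. This is only a small sketch; the file names and contents are made up for the demonstration:

```shell
# Scratch directory so the demo does not touch a real repository.
demo=$(mktemp -d)
printf 'print("a")\nprint("b")\n' > "$demo/my script.py"   # 2 lines, name has a space
printf 'print("c")\n' > "$demo/plain.py"                   # 1 line

# Null-terminated names keep "my script.py" as one argument to cat.
safe=$(find "$demo" -type f -name '*.py' -print0 | xargs -0 cat | wc -l | tr -d ' ')
echo "total: $safe"    # prints: total: 3

rm -rf "$demo"
```

The tr -d ' ' step is there because some wc implementations pad the number with spaces.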
Java (".java") example:
find . -type f -name '*.java' -print0 \
| xargs -0 cat \
| wc -l
JavaScript (".js") example:
find . -type f -name '*.js' -print0 \
| xargs -0 cat \
| wc -l
You can repeat this pattern for as many extensions as you need.
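Rather than retyping the pipeline, the pattern can be wrapped in a small shell function. The sketch below assumes a helper named count_ext (a made-up name, not a standard command) that takes an extension and an optional directory:

```shell
# count_ext EXT [DIR]: print the total number of lines in all *.EXT files
# under DIR (default: current directory). Hypothetical helper name.
count_ext() {
    # tr strips the padding that some wc implementations add.
    find "${2:-.}" -type f -name "*.$1" -print0 |
        xargs -0 cat |
        wc -l |
        tr -d ' '
}
```

With this defined, count_ext py counts Python lines under the current directory, and count_ext java src restricts the count to src.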
Combining Results in One Go
If you have multiple languages, you might want a summarized report. One approach is to run each command separately and then print them together:
echo "Python: $(find . -type f -name '*.py' -print0 | xargs -0 cat | wc -l) lines"
echo "Java: $(find . -type f -name '*.java' -print0 | xargs -0 cat | wc -l) lines"
echo "JS: $(find . -type f -name '*.js' -print0 | xargs -0 cat | wc -l) lines"
This prints out a neat summary.
Handling Multiple Extensions per Language
Some languages may have multiple file extensions (e.g., .hpp and .h for C++
headers, .c and .cc for C and C++ sources). You can list multiple patterns with
-o (OR) conditions in find:
find . -type f \( -name '*.c' -o -name '*.h' \) -print0 \
| xargs -0 cat \
| wc -l
This counts lines for .c and .h files together.
Dealing With Large Codebases
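For very large codebases, the cat | wc -l approach might be slow, since it pipes
the contents of every file through a single stream. An alternative is to count
each file’s lines individually, then sum:
find . -type f -name '*.py' -exec wc -l {} + \
| awk '$2 != "total" {sum += $1} END {print sum + 0}'
Here:
- -exec wc -l {} + runs wc -l on multiple files at once.
- awk '$2 != "total" {sum += $1} END {print sum + 0}' sums up the first column (line counts) from wc’s output. It skips the total line that wc prints whenever it receives more than one file, because adding that line as well would double the result. The + 0 makes awk print 0 instead of an empty line when no files match.
This avoids passing every file’s contents through cat, since wc reads each file directly.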
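One detail is worth checking before trusting a per-file sum: when wc -l receives more than one file, it appends a total line, and adding that line to the per-file counts doubles the result. The sketch below shows the effect on a scratch directory (the file names are made up):

```shell
demo=$(mktemp -d)
printf '1\n2\n' > "$demo/a.py"       # 2 lines
printf '1\n2\n3\n' > "$demo/b.py"    # 3 lines

# Naive sum: also adds the "total" line that wc prints for several files.
naive=$(find "$demo" -type f -name '*.py' -exec wc -l {} + |
    awk '{sum += $1} END {print sum + 0}')

# Filtered sum: skips any line whose second field is exactly "total".
fixed=$(find "$demo" -type f -name '*.py' -exec wc -l {} + |
    awk '$2 != "total" {sum += $1} END {print sum + 0}')

echo "naive: $naive, filtered: $fixed"   # prints: naive: 10, filtered: 5
rm -rf "$demo"
```

Because find prints paths with a leading directory (such as ./), a real file is never reported as a bare total, so the filter does not drop genuine entries.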
Example Using awk to Summarize Multiple Languages
Imagine you define a small script to handle multiple extensions in one go. If
your repository only has .py, .java, and .js files, you could do something like:
for ext in py java js; do
    # Skip wc's "total" line so the lines are not counted twice.
    lines=$(find . -type f -name "*.$ext" -exec wc -l {} + | awk '$2 != "total" {sum += $1} END {print sum + 0}')
    echo "$ext: $lines lines"
done
This loops over each extension, runs wc -l on all matching files, sums them up
with awk, and prints the result.
Sample Output:
py: 1200 lines
java: 850 lines
js: 300 lines
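If you would rather not list the extensions up front, awk can group the counts by whatever extensions it encounters in a single pass. This is a sketch under a few assumptions: the helper name lines_by_ext is made up, file names contain no newline characters, and every file with a dot in its name is counted (including dotfiles and binary files, which you may want to filter out):

```shell
# lines_by_ext [DIR]: print "ext count" pairs for every extension under DIR.
# Hypothetical helper; assumes file names contain no newline characters.
lines_by_ext() {
    find "${1:-.}" -type f -name '*.*' -exec wc -l {} + |
        awk '
            # Skip the "total" line that wc adds when given several files.
            $2 == "total" && NF == 2 { next }
            {
                n = $1
                # Extension = text after the last dot in the path.
                ext = $0
                sub(/.*\./, "", ext)
                count[ext] += n
            }
            END { for (e in count) print e, count[e] }
        ' |
        sort
}
```

Running lines_by_ext in a repository prints one line per extension, sorted alphabetically, which can then be piped into further filters.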
Limitations of the Extension-Based Approach
Relying purely on file extensions is a heuristic. Some projects might not adhere
strictly to naming conventions. Also, certain languages share extensions or have
multiple variants. Without a tool like cloc or tokei that attempts language
detection, you’re limited to patterns you define yourself.
Despite these limitations, this approach is sufficient for many codebases that follow conventional naming practices.
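One common gap is scripts that have no extension at all. A rough workaround is to look at the first line of extensionless files for a #! line. The sketch below is a heuristic rather than real language detection; the helper name count_shebang is an assumption, and a search word such as sh will also match bash or zsh:

```shell
# count_shebang INTERP [DIR]: total lines in extensionless files under DIR
# whose first line starts with "#!" and mentions INTERP (e.g. python).
# Hypothetical helper; a rough heuristic, not real language detection.
count_shebang() {
    # awk prints every line of each matching file; wc then counts them.
    find "${2:-.}" -type f ! -name '*.*' -exec awk -v interp="$1" '
        FNR == 1 { keep = (substr($0, 1, 2) == "#!" && index($0, interp) > 0) }
        keep
    ' {} + |
        wc -l |
        tr -d ' '
}
```

For instance, count_shebang python adds up the lines of extensionless scripts that start with a line like #!/usr/bin/env python3.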
Interpreting the Results
- Context is Key: Line counts don’t inherently measure complexity or quality.
- Focus on Trends: Running these commands periodically can show growth or reduction in certain language codebases.
- Combine With Other Metrics: Pair line counts with test coverage, code complexity, or commit frequency data to get a fuller picture of code health.
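To follow trends over time, the counts can be appended to a log on each run. Here is a minimal sketch; the function name log_line_counts, the date,ext,lines CSV layout, and the hard-coded extension list are all illustrative choices:

```shell
# log_line_counts LOGFILE [DIR]: append "date,ext,lines" rows to LOGFILE.
# The CSV layout, function name, and extension list are illustrative choices.
log_line_counts() {
    log=$1
    dir=${2:-.}
    today=$(date +%Y-%m-%d)
    for ext in py java js; do
        lines=$(find "$dir" -type f -name "*.$ext" -print0 |
            xargs -0 cat | wc -l | tr -d ' ')
        printf '%s,%s,%s\n' "$today" "$ext" "$lines" >> "$log"
    done
}
```

Run from cron or a CI job, for example as log_line_counts counts.csv, this builds up a small history that can later be charted or compared.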
Conclusion
Counting lines of code by language using only Unix tools is straightforward if
you rely on file extensions and standard utilities like find, wc, awk, and
xargs. While not as convenient or feature-rich as specialized tools like cloc,
these Unix pipelines let you get results without installing extra dependencies.
As you refine your approach—adding more extensions, filtering certain directories, or integrating into scripts—you can create a custom, lightweight solution that fits your environment perfectly.
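As one example of filtering directories, find can skip whole subtrees with -prune. The sketch below assumes a helper named count_ext_pruned and an example list of directory names (.git, node_modules, vendor); adjust both for your project:

```shell
# count_ext_pruned EXT [DIR]: total lines in *.EXT files under DIR, skipping
# directories that usually hold vendored or generated files.
# The helper name and the directory names below are examples only.
count_ext_pruned() {
    find "${2:-.}" \
        \( -type d \( -name .git -o -name node_modules -o -name vendor \) -prune \) \
        -o \( -type f -name "*.$1" -print0 \) |
        xargs -0 cat |
        wc -l |
        tr -d ' '
}
```

Here count_ext_pruned js counts JavaScript lines in your own sources while leaving out anything under node_modules.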
Armed with these Unix-only techniques, you can quickly assess the distribution of code in your repository and track changes over time—even in the most minimal environments.