Counting Lines of Code by Language Using Only Unix Tools

In a previous article, we explored how to count lines of code by language using tools like cloc. While cloc and similar tools are convenient, sometimes you may prefer to avoid additional dependencies and rely solely on standard Unix utilities. Maybe you’re working in a minimal environment, or you just want to understand how it can be done “by hand.”

In this article, we’ll walk through a method of counting lines of code by language using only Unix tools commonly found on most Unix-like systems. We’ll assume that you categorize files by their extensions—e.g., .py for Python, .java for Java, .js for JavaScript—since language detection is tricky without specialized tools.

Why Use Only Unix Tools?

No Extra Dependencies: In restricted or minimal environments, installing cloc or similar tools might not be possible.
Flexibility: By using find, wc, awk, and related tools, you can customize exactly how you count and summarize files.
Learning Experience: Understanding how to chain simple tools together can deepen your mastery of Unix pipelines and scripting.

Basic Approach

The general strategy is:

Identify files belonging to a particular language by their extension.
Use wc -l to count how many lines each file has.
Summarize the results to get a total count for each language.

This relies on a consistent mapping from file extensions to languages.
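
One way to make that mapping explicit is a small lookup function. The sketch below uses a shell case statement; the function name lang_for_file and the set of extensions are illustrative assumptions, so adjust them to your repository.

```shell
# lang_for_file PATH: print the language a file is assumed to belong to,
# based only on its extension. (Hypothetical helper; extend as needed.)
lang_for_file() {
    case "$1" in
        *.py)      echo "Python" ;;
        *.java)    echo "Java" ;;
        *.js)      echo "JavaScript" ;;
        *.c|*.h)   echo "C" ;;
        *)         echo "Other" ;;   # anything the table does not cover
    esac
}

lang_for_file src/main.py    # prints: Python
lang_for_file README         # prints: Other
```

Keeping the table in one place means a new extension only has to be added once.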

Example Directory Structure

Suppose you have a repository with various languages:

myrepo/
├── src/
│   ├── main.py
│   ├── utils.py
│   ├── app.java
│   └── helper.js
└── tests/
    ├── test_main.py
    ├── test_app.java
    └── test_helper.js

We want to produce a summary like:

Python: XXX lines
Java:   YYY lines
JS:     ZZZ lines

Counting Lines by Extension

Let’s break it down by language extensions. We’ll handle each language separately, then combine results.

Python (".py") example:

find . -type f -name '*.py' -print0 \
  | xargs -0 cat \
  | wc -l

What’s happening here?

find . -type f -name '*.py' -print0: Lists all Python files in the current directory and its subdirectories, using -print0 to separate names with null bytes, which keeps filenames containing spaces or newlines intact.
xargs -0 cat: Feeds the file list to cat, concatenating all files.
wc -l: Counts total lines across all .py files.

The output is a single number: the total line count across all Python files, including blank lines and comments.
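
If blank lines should not count toward the total, one possible variation swaps wc -l for grep -c with a pattern that requires at least one non-whitespace character. This is only a sketch: the function name count_nonblank is made up for illustration, and comment lines are still counted.

```shell
# count_nonblank DIR EXT: count lines in DIR's *.EXT files that contain
# at least one non-whitespace character. (Hypothetical helper name.)
count_nonblank() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # that number is 0, so "|| true" keeps the function's status successful.
    find "$1" -type f -name "*.$2" -print0 \
        | xargs -0 cat \
        | grep -c '[^[:space:]]' || true
}

count_nonblank . py
```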

Java (".java") example:

find . -type f -name '*.java' -print0 \
  | xargs -0 cat \
  | wc -l

JavaScript (".js") example:

find . -type f -name '*.js' -print0 \
  | xargs -0 cat \
  | wc -l

You can repeat this pattern for as many extensions as you need.

Combining Results in One Go

If you have multiple languages, you might want a summarized report. One approach is to run each command separately and then print them together:

echo "Python: $(find . -type f -name '*.py' -print0 | xargs -0 cat | wc -l) lines"
echo "Java: $(find . -type f -name '*.java' -print0 | xargs -0 cat | wc -l) lines"
echo "JS: $(find . -type f -name '*.js' -print0 | xargs -0 cat | wc -l) lines"

This prints out a neat summary.
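
To avoid repeating the pipeline and to line the numbers up in a column, the three echo commands could be wrapped in a helper. The following is a minimal sketch; report is a hypothetical function name, and the column width of 8 is an arbitrary choice.

```shell
# report LABEL EXT: print one aligned summary row for files ending in .EXT
# under the current directory. (Hypothetical helper, for illustration.)
report() {
    # tr strips the padding that some wc implementations put before the number.
    count=$(find . -type f -name "*.$2" -print0 | xargs -0 cat | wc -l | tr -d ' ')
    # %-8s left-aligns the label in an 8-character column.
    printf '%-8s %s lines\n' "$1:" "$count"
}

report Python py
report Java java
report JS js
```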

Handling Multiple Extensions per Language

Some languages use several file extensions (e.g., .cc, .cpp, and .cxx for C++ sources, or .c and .h for C sources and headers). You can combine multiple patterns in find with -o (OR); the escaped parentheses group the patterns so that -type f and -print0 apply to all of them:

find . -type f \( -name '*.c' -o -name '*.h' \) -print0 \
  | xargs -0 cat \
  | wc -l

This counts lines for .c and .h files together.
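
When the list of extensions grows, writing each -o clause by hand gets tedious. One way to generate the clauses, sketched here in POSIX shell, is to rebuild the positional parameters. The function name count_exts is an assumption made for this example.

```shell
# count_exts EXT...: total lines across files in the current directory tree
# that match any of the given extensions. (Hypothetical helper name.)
count_exts() {
    [ "$#" -gt 0 ] || { echo "usage: count_exts EXT..." >&2; return 1; }
    n=$#
    # Walk the original arguments once, appending "-o -name *.EXT" for each
    # and dropping the bare EXT from the front of the list.
    while [ "$n" -gt 0 ]; do
        set -- "$@" -o -name "*.$1"
        shift
        n=$((n - 1))
    done
    shift   # remove the leading -o so the expression starts with -name
    find . -type f \( "$@" \) -print0 | xargs -0 cat | wc -l
}

count_exts c h
count_exts cc cpp cxx hpp
```

Because each pattern stays quoted inside "$@", the shell never expands the wildcards itself; find receives them unchanged.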

Dealing With Large Codebases

For very large codebases, the cat | wc -l approach can be slower than necessary, because every file is copied through a pipe before wc counts it. An alternative:

Count each file’s lines individually, then sum:
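```
find . -type f -name '*.py' -exec wc -l {} + \
    | awk '$2 != "total" {sum += $1} END {print sum + 0}'
```
Here:
- -exec wc -l {} + runs wc -l on many files per invocation.
- When wc receives more than one file, it appends a "total" row. Summing that row along with the per-file rows would double the count, so the awk condition $2 != "total" skips it. Paths printed by find . start with ./, so a real file is never mistaken for that row. Some locales may translate the label; running the command with LC_ALL=C avoids this.
- awk sums the first column (the per-file line counts) and prints the result; sum + 0 makes it print 0 when no files match.
This skips the cat step, so wc reads each file directly instead of through a pipe.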
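
A quick way to check that the per-file approach agrees with the cat-based one is to run both on a throwaway directory. The sketch below builds one with mktemp; the file names and contents are invented for the demonstration.

```shell
# Create two small files with a known number of lines (3 + 2 = 5).
demo=$(mktemp -d)
printf 'a\nb\nc\n' > "$demo/one.py"
printf 'd\ne\n'    > "$demo/two.py"

# Approach 1: concatenate everything, count once.
via_cat=$(find "$demo" -type f -name '*.py' -print0 | xargs -0 cat | wc -l)

# Approach 2: per-file counts, skipping the "total" row that wc prints
# whenever it receives more than one file.
via_wc=$(find "$demo" -type f -name '*.py' -exec wc -l {} + \
    | awk '$2 != "total" {sum += $1} END {print sum + 0}')

# $(( )) normalizes any whitespace padding in the wc output.
echo "cat: $((via_cat)), wc+awk: $((via_wc))"
rm -rf "$demo"
```

Both numbers should come out as 5. If the second one reads 10, the "total" row has been included in the sum.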

Example Using `awk` to Summarize Multiple Languages

Imagine you define a small script to handle multiple extensions in one go. If your repository only has .py, .java, and .js files, you could do something like:

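for ext in py java js; do
    lines=$(find . -type f -name "*.$ext" -exec wc -l {} + | awk '$2 != "total" {sum += $1} END {print sum + 0}')
    echo "$ext: $lines lines"
done

This loops over each extension, runs wc -l on all matching files, sums the per-file counts with awk, and prints the result. The awk condition skips the "total" row that wc prints when given several files, which would otherwise double the result.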

Sample Output:

py: 1200 lines
java: 850 lines
js: 300 lines
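
The loop walks the directory tree once per extension. As an alternative sketch, a single find pass can feed awk, which then groups the per-file counts by extension on its own. The function name summarize_by_ext is hypothetical, and the approach assumes file names contain no newlines.

```shell
# summarize_by_ext DIR: one find pass, one awk pass, one row per extension.
# (Hypothetical helper; assumes file names contain no newlines.)
summarize_by_ext() {
    find "$1" -type f \( -name '*.py' -o -name '*.java' -o -name '*.js' \) \
        -exec wc -l {} + \
    | awk '
        $2 == "total" { next }            # skip the summary row wc adds
        {
            n = split($NF, parts, ".")    # extension = text after the last dot
            lines[parts[n]] += $1         # accumulate lines per extension
        }
        END { for (ext in lines) printf "%s: %d lines\n", ext, lines[ext] }
    ' | sort
}

summarize_by_ext .
```

The trailing sort is there because awk does not guarantee any particular order when iterating over an array.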

Limitations of the Extension-Based Approach

Relying purely on file extensions is a heuristic. Some projects might not adhere strictly to naming conventions. Also, certain languages share extensions or have multiple variants. Without a tool like cloc or tokei that attempts language detection, you’re limited to patterns you define yourself.

Despite these limitations, this approach is sufficient for many codebases that follow conventional naming practices.
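
Since the patterns are yours to define, it helps to see which extensions a repository actually contains before choosing them. A rough inventory might look like the sketch below; list_extensions is a made-up name, and files without a dot in their name are simply left out.

```shell
# list_extensions DIR: count files per extension, most common first.
# (Hypothetical helper; assumes file names contain no newlines.)
list_extensions() {
    find "$1" -type f -name '*.*' \
        | sed 's/.*\.//' \
        | sort | uniq -c | sort -rn
}

list_extensions .
```

Here sed keeps only the text after the last dot of each path, and uniq -c counts how often each extension occurs. Note that files under directories such as .git are included unless you filter them out.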

Interpreting the Results

Context is Key: A line count doesn’t inherently measure complexity or quality.
Focus on Trends: Running these commands periodically can show growth or reduction in certain language codebases.
Combine With Other Metrics: Pair line counts with test coverage, code complexity, or commit frequency data to get a fuller picture of code health.
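
To follow trends, the counts need to be recorded somewhere. One simple option, sketched here under the assumption that a CSV file is good enough, is to append a dated row per extension each time the script runs. The function name log_counts, the extension list, and the file layout are all illustrative.

```shell
# log_counts DIR LOGFILE: append one "date,extension,lines" row per
# extension to LOGFILE. (Hypothetical helper; extensions are hard-coded.)
log_counts() {
    today=$(date +%Y-%m-%d)
    for ext in py java js; do
        # tr strips the padding some wc implementations add.
        lines=$(find "$1" -type f -name "*.$ext" -exec cat {} + | wc -l | tr -d ' ')
        echo "$today,$ext,$lines" >> "$2"
    done
}
```

Running log_counts . loc-history.csv from a scheduled job would build up a history that can later be plotted or compared.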

Conclusion

Counting lines of code by language using only Unix tools is straightforward if you rely on file extensions and standard utilities like find, wc, awk, and xargs. While not as convenient or feature-rich as specialized tools like cloc, these Unix pipelines let you get results without installing extra dependencies.

As you refine your approach—adding more extensions, filtering certain directories, or integrating into scripts—you can create a custom, lightweight solution that fits your environment perfectly.
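
As an example of filtering directories, find can be told not to descend into certain directories with -prune. The sketch below skips .git and node_modules; both directory names and the function name count_skipping are assumptions chosen for illustration.

```shell
# count_skipping DIR EXT: total lines in DIR's *.EXT files, without ever
# descending into .git or node_modules. (Hypothetical helper name.)
count_skipping() {
    # Entries named .git or node_modules are pruned; everything else falls
    # through to the right-hand side of -o and is tested as before.
    find "$1" \( -name .git -o -name node_modules \) -prune \
        -o -type f -name "*.$2" -print0 \
        | xargs -0 cat | wc -l
}

count_skipping . js
```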

Armed with these Unix-only techniques, you can quickly assess the distribution of code in your repository and track changes over time—even in the most minimal environments.