Counting Lines of Code by Language in a Code Repository

Posted in programming

Understanding the composition of a codebase can provide valuable insights: Which languages dominate the repository? Where might maintenance efforts be concentrated? Counting lines of code by language is a common task when assessing project complexity, preparing for audits, or reporting progress in multi-language environments.

In this article, we’ll explore how to use existing tools—particularly cloc—to count lines of code by language. We’ll also discuss alternatives and best practices for interpreting results.

Why Count Lines of Code by Language?

  • Complexity Assessment: A repository with millions of lines of C++ may require different tooling and expertise than one mostly in Python.
  • Resource Allocation: Identifying which languages are most prominent can guide training, hiring, or budgeting for particular skill sets.
  • Historical Trends: Tracking how your codebase evolves over time reveals shifts in technology usage or the impact of refactoring efforts.

Using cloc

cloc (Count Lines of Code) is a popular open-source tool that counts blank lines, comment lines, and lines of source code across many programming languages. It is distributed as a single Perl script, so it is easy to install and runs on multiple platforms.

Installation Examples:

  • macOS (Homebrew):

    brew install cloc
    
  • Ubuntu/Debian:

    sudo apt-get install cloc
    
  • Direct Download: Download the single Perl script from GitHub and run it directly.

Basic Usage:

cloc .

This command runs cloc on the current directory (.) and prints a summary of lines of code by language.

Example Output:

      20 text files.
      20 unique files.
       2 files ignored.

github.com/AlDanial/cloc v 1.90  T=0.50 s (36.0 files/s, 2040.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Java                            5             40             30            500
Python                          8             20             10            200
JavaScript                      3             15              5            150
JSON                            2              0              0             50
-------------------------------------------------------------------------------
SUM:                           18             75             45            900
-------------------------------------------------------------------------------

This shows that the repository has code in Java, Python, JavaScript, and JSON, with a breakdown of blank, comment, and code lines.
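
To answer "which language dominates?" from a table like this, the code column can be converted into percentage shares. The following is a minimal sketch in Python; the `language_shares` helper is a hypothetical name introduced here for illustration, and the counts are copied from the example output above.

```python
# Code-line counts copied from the example cloc output above.
code_lines = {"Java": 500, "Python": 200, "JavaScript": 150, "JSON": 50}


def language_shares(counts):
    """Return each language's share of the total code lines, as a percentage."""
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero for an empty or code-free repository.
        return {lang: 0.0 for lang in counts}
    return {lang: round(100 * n / total, 1) for lang, n in counts.items()}


shares = language_shares(code_lines)

# Print the languages from largest share to smallest.
for lang, pct in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{lang:<12}{pct:>6.1f}%")
```

For the example table this puts Java at a little over half of the code lines, with Python, JavaScript, and JSON making up the rest.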

Customizing the Analysis

You can refine what cloc analyzes using various flags:

  • Exclude Certain Directories:

    cloc . --exclude-dir=node_modules,build
    
  • Exclude Certain File Types:

    cloc . --exclude-ext=txt,md
    
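  • Include Only Files Matching a Pattern:

    cloc . --match-f='\.py$'
    

    (This example counts only files whose names end in .py. Because --match-f filters on file names rather than on detected languages, use --include-lang=Python if you want to restrict the report by language instead.)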
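
When these filters are driven from a script, it can help to assemble the command in one place. The sketch below is one possible way to do that in Python. `build_cloc_command` is a hypothetical helper, not part of cloc; it only builds the argument list and assumes the `--exclude-dir`, `--exclude-ext`, and `--include-lang` flags accept comma-separated values, as in the commands above.

```python
def build_cloc_command(path=".", exclude_dirs=(), exclude_exts=(), include_langs=()):
    """Assemble a cloc argument list from optional filters (hypothetical helper)."""
    cmd = ["cloc", path]
    if exclude_dirs:
        # Directory names to skip, e.g. node_modules or build.
        cmd.append("--exclude-dir=" + ",".join(exclude_dirs))
    if exclude_exts:
        # File extensions to skip, written without the leading dot.
        cmd.append("--exclude-ext=" + ",".join(exclude_exts))
    if include_langs:
        # Language names as cloc reports them, e.g. Python or Java.
        cmd.append("--include-lang=" + ",".join(include_langs))
    return cmd


command = build_cloc_command(
    exclude_dirs=["node_modules", "build"],
    exclude_exts=["txt", "md"],
)
print(" ".join(command))
```

Returning a list rather than a single string means the result can be passed directly to `subprocess.run` without shell quoting concerns.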

Machine-Readable Output: For integration with scripts or CI/CD pipelines, cloc can produce outputs in JSON, YAML, or XML:

cloc . --json

This produces a JSON summary that you can parse programmatically.
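
As a rough illustration of parsing that summary, the Python sketch below extracts the code-line count per language. It assumes the JSON report is an object with one entry per language (each holding `nFiles`, `blank`, `comment`, and `code`) plus `header` and `SUM` entries; check the output of your cloc version before relying on that shape. The function names and the sample data are hypothetical.

```python
import json
import subprocess


def parse_cloc_json(text):
    """Map language name -> code-line count, skipping 'header' and 'SUM'."""
    report = json.loads(text)
    return {
        name: stats["code"]
        for name, stats in report.items()
        if name not in ("header", "SUM")
    }


def count_code_lines(path="."):
    """Run `cloc --json` on path (requires cloc on PATH) and parse the result."""
    result = subprocess.run(
        ["cloc", path, "--json"], capture_output=True, text=True, check=True
    )
    return parse_cloc_json(result.stdout)


# Sample data shaped like the assumed report format; the numbers are made up.
sample = """{
  "header": {"cloc_version": "1.90", "n_files": 13},
  "Java": {"nFiles": 5, "blank": 40, "comment": 30, "code": 500},
  "Python": {"nFiles": 8, "blank": 20, "comment": 10, "code": 200},
  "SUM": {"nFiles": 13, "blank": 60, "comment": 40, "code": 700}
}"""
print(parse_cloc_json(sample))
```

Keeping the parsing separate from the `subprocess` call makes the parsing logic easy to test without cloc installed.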

Alternative Tools

  • sloccount: An older tool for counting lines of code by language. It is less flexible than cloc but is still used in some legacy workflows.
  • tokei: A Rust-based alternative to cloc, known for speed and wide language support.

    cargo install tokei
    tokei .
    
  • loc: A minimalistic tool with fewer features, but easy to use:

    loc
    

Each tool varies in language detection heuristics, performance, and reporting style. Experiment with a few to find the one that fits your workflow best.

Best Practices for Interpreting Results

  1. Context Matters:
    Lines of code is a rough metric. More code isn’t always worse, and fewer lines don’t always mean simplicity. Combine line counts with complexity metrics, code review feedback, and test coverage data for a full picture.
  2. Ignore Vendor or Third-Party Code:
    Directories like node_modules or vendor often contain external code that can skew results. Use exclusion flags to focus only on your project’s source files.
  3. Track Over Time:
    Running cloc or tokei periodically and storing results can reveal trends—perhaps Python code is slowly replacing Java code, or the amount of JavaScript is growing as the front-end expands.
  4. Integrate into CI/CD:
    Incorporating line counts into build pipelines provides ongoing monitoring. If a PR massively increases lines of code in a particular language, it might need extra scrutiny.
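
Tracking and CI checks like these come down to comparing two snapshots of per-language counts. The sketch below shows one way this might look in Python. The `flag_growth` helper, the 20% default threshold, and the sample numbers are all assumptions made for illustration; a real pipeline would load the baseline from stored results.

```python
def flag_growth(baseline, current, threshold_pct=20.0):
    """Return languages whose code-line count grew by more than threshold_pct.

    Values are the growth percentage, or None for a language that did not
    appear in the baseline at all.
    """
    flagged = {}
    for lang, now in current.items():
        before = baseline.get(lang, 0)
        if before == 0:
            # A language that is new to the repository is always worth a look.
            if now > 0:
                flagged[lang] = None
            continue
        growth = 100 * (now - before) / before
        if growth > threshold_pct:
            flagged[lang] = round(growth, 1)
    return flagged


# Hypothetical snapshots: code lines per language before and after a change.
baseline = {"Java": 500, "Python": 200, "JavaScript": 150}
current = {"Java": 510, "Python": 260, "JavaScript": 150, "Go": 40}
print(flag_growth(baseline, current))
```

A check like this is a prompt for a closer review rather than a pass/fail gate, since a large increase can be entirely legitimate.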

Example Use Cases

  • Project Onboarding:
    A new team member can quickly understand a codebase’s composition: “Mostly Java with a small Python utility and a few JavaScript tools.”
  • Refactoring Plans:
    If a directory intended to be a small support utility now has thousands of lines of code, it may be a candidate for restructuring or splitting into multiple services.
  • Budgeting and Resource Planning:
    Identifying that 80% of code is in C++ might justify hiring a C++ specialist or investing in better C++ tooling.

Conclusion

Counting lines of code by language is a straightforward yet powerful way to gain insights into your repository’s structure. Tools like cloc and tokei make it simple to generate language-specific statistics, which can inform decision-making, resource allocation, and long-term maintenance strategies.

While lines of code isn’t a perfect metric, used wisely it can help you track complexity, understand your codebase’s composition, and guide future improvements. With a single command, you can turn an opaque repository into a quantified breakdown by language, giving you clarity and direction as your project grows and evolves.

Slaptijack's Koding Kraken