Project Ideas
For ClamAV library & application projects, submit pull-requests to: https://github.com/Cisco-Talos/clamav
For ClamAV documentation projects, submit pull-requests to: https://github.com/Cisco-Talos/clamav-faq/pulls
Tip: If you find that any of the bugs or projects have already been completed, you can help out simply by updating the list in a pull-request to update this document.
- Project Ideas
- Bugs
- Larger Projects
- CMake: -D MAINTAINER_MODE=ON
- CMake: -D CODE_COVERAGE=ON
- Develop New Detection Capabilities for PE/ELF/MachO Executables
- Develop Memory Scanning Capabilities for Unix
- WebAssembly Runtime
- Add Unpacking Support for New Packers
- Add Support for Matching on .NET Internals
- Extract Macros from OXML docs
- Dynamically add new file types simply by adding file type magic (.ftm) signatures
- Register scanners for each file type, Write bytecode "signature" scanners.
- Limit logical signature alerts based on file type
- libclamav Callback Function to Request Additional File
Bugs
There's only so much our core dev team can schedule into each release. Many bugs probably won't be fixed without your help! Feel free to browse our open GitHub Issues if you're looking for project ideas!
Larger Projects
The following is a list of project ideas for someone looking to work on a larger project. Any projects labeled "Risky" or "Exploratory" are thought to be more likely to fail, or to have significant drawbacks that will result in the new feature being ultimately rejected.
Please don't take it personally if the ClamAV team decides not to merge your implementation due to perceived complexity, stability, or other such concerns.
Contributors are expected to implement ample documentation for any new code or feature. Directions on how to test the contribution as well as unit and/or system tests will significantly help with PR review and will improve the likelihood that your contribution will be accepted.
Unstable or incomplete work is not likely to be accepted. The core development team has a long backlog of tasks and a curated roadmap for the next 6-12 months and will not have time to complete an unfinished project for you.
Contributors submitting a sizeable new feature will be asked to sign a Contributors License Agreement (CLA) before the contribution can be accepted.
CMake: -D MAINTAINER_MODE=ON
The purpose of "maintainer" build-mode is to update source generated by tools like Flex, Bison, and GPerf which are not readily accessible on every platform.
In this case, the project is to add GNU gperf support to our CMake build system's Maintainer-Mode (-D MAINTAINER_MODE=ON). To complete this task, you'll need to detect gperf when Maintainer-Mode is enabled and treat it as a required dependency. When the build runs, it should regenerate and overwrite the libclamav/jsparse/generated files in the source directory using gperf with jsparse-keywords.gperf.
The contributor should add the new option to CMakeOptions.cmake and document the feature in INSTALL.cmake.md as well as in the clamav-faq repo's development.md developer documentation, after the feature has merged.
Category: Low-hanging fruit, Development
What you will learn from this project:
- CMake C/C++ build system skills
Required skills:
- Linux/Unix familiarity. Familiarity with compiling C/C++ projects.
Project Size: Small
CMake: -D CODE_COVERAGE=ON
Add a -D CODE_COVERAGE=ON option to the CMake build system which will build ClamAV with code coverage features enabled.
An ideal solution would support code coverage when using GCC, Clang, and MSVC.
See development.md in the clamav-faq repo for additional insight on how gcov, lcov, and genhtml can be used today with the Autotools build system.
The contributor should add the new option to CMakeOptions.cmake and document the feature in INSTALL.cmake.md as well as in the clamav-faq repo's development.md developer documentation, after the feature has merged.
Category: Low-hanging fruit, Development
What you will learn from this project:
- CMake C/C++ build system skills
- Familiarity with C/C++ code coverage
Required skills:
- Linux/Unix familiarity. Familiarity with compiling C/C++ projects.
Project Size: Small
Develop New Detection Capabilities for PE/ELF/MachO Executables
ClamAV parses the PE/ELF/MachO headers on executables that it scans, but doesn't make all of the data that it extracts available for use by NDB/LDB signatures. Some features that would be great to have include:
- The ability to distinguish between regular executables and DLLs/SOs/DYLIBs (add new keywords?)
- Subsignature modifiers that can limit subsigs to only being evaluated against sections with memory permission flags (Read/Write/Execute). This would allow signatures to be evaluated more efficiently and also would decrease the chance of signature false positives.
- Parsing digital signatures in signed MachO exes and evaluating against the certificate trust / block .crb rules
- Other features that might be helpful?
As PE, ELF, and MachO parsing features already exist in C, C is the most likely language of choice. However, any major new self-contained code would ideally be written in Rust.
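As a rough illustration of the section-permission idea in the list above, the sketch below tests a PE section's Characteristics field against a required permission mask. The IMAGE_SCN_MEM_* values are the standard PE/COFF constants; the perm_mask enum and section_has_perms() are hypothetical names invented for this example, not existing libclamav API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Standard PE section characteristic flags (PE/COFF specification). */
#define IMAGE_SCN_MEM_EXECUTE 0x20000000u
#define IMAGE_SCN_MEM_READ    0x40000000u
#define IMAGE_SCN_MEM_WRITE   0x80000000u

/* Hypothetical permission mask that a subsignature modifier might carry. */
enum perm_mask {
    PERM_R = 1 << 0,
    PERM_W = 1 << 1,
    PERM_X = 1 << 2
};

/* Returns true when the section grants every permission in `required`. */
static bool section_has_perms(uint32_t characteristics, unsigned required)
{
    unsigned granted = 0;

    /* Only the three known permission bits may be requested. */
    assert((required & ~(unsigned)(PERM_R | PERM_W | PERM_X)) == 0);

    if (characteristics & IMAGE_SCN_MEM_READ)    granted |= PERM_R;
    if (characteristics & IMAGE_SCN_MEM_WRITE)   granted |= PERM_W;
    if (characteristics & IMAGE_SCN_MEM_EXECUTE) granted |= PERM_X;

    return (granted & required) == required;
}
```

A matcher could call a check like this once per section and skip subsignatures whose modifier does not apply, which is where the efficiency and false-positive benefits described above would come from. ELF and MachO would need their own flag mappings.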
Category: Core Development
What you will learn from this project:
- The PE, ELF, and MachO file formats
- How ClamAV parses executable headers and performs signature matching, and what capabilities it provides
- How to write ClamAV signatures to match on executable files
Required skills:
- Strong C development experience
- Rust development experience (as needed)
Project Size: Large
Develop Memory Scanning Capabilities for Unix
Today, ClamAV works by scanning files on disk for malware. It'd be great if ClamAV could also be used to scan process memory on a system it's running on in order to detect malware that isn't present on disk.
The ClamAV team is already looking into integrating such a feature from clamav-win32, a project by Gianluigi Tiesi who has graciously agreed to allow us to include this memory scanning feature and others in the upstream clamav project.
This project would be to develop a similar capability for use on Linux and/or macOS and/or BSD Unix scanning clients.
As this is a relatively large new feature, an ideal solution would be written in Rust.
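Although the finished feature would ideally be in Rust, a small C sketch shows the kind of first step involved on Linux: enumerating a process's mapped regions by parsing lines of /proc/<pid>/maps. The mem_region struct and parse_maps_line() are hypothetical names for this example; macOS and the BSDs expose this information through different interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* One mapped region of a process address space. */
struct mem_region {
    unsigned long start;
    unsigned long end;
    bool readable;
    bool writable;
    bool executable;
};

/*
 * Parse one line of Linux's /proc/<pid>/maps, for example:
 *   "00400000-0040b000 r-xp 00000000 08:01 1234 /bin/cat"
 * Returns true on success and fills `out`.
 */
static bool parse_maps_line(const char *line, struct mem_region *out)
{
    char perms[5] = {0};

    assert(line != NULL && out != NULL);

    if (sscanf(line, "%lx-%lx %4s", &out->start, &out->end, perms) != 3)
        return false;
    if (out->end <= out->start)
        return false;

    out->readable   = (perms[0] == 'r');
    out->writable   = (perms[1] == 'w');
    out->executable = (perms[2] == 'x');
    return true;
}
```

A scanner would then need to read the contents of the readable regions, which is where the operating system's restrictions on inspecting other processes come into play.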
Category: Fun/Peripheral
What you will learn from this project:
- The techniques and OS APIs related to inspecting the memory of running processes
- The security mechanisms in place to limit arbitrary access to process memory
Required skills:
- Strong Rust development experience.
- Linux/Unix operating systems experience.
Project Size: Large
WebAssembly Runtime
Background: ClamAV has for a long time had runtime support for running portable plugins we call "bytecode signatures". ClamAV has a custom bytecode compiler to compile these plugins from a C-like language and uses LLVM or a homegrown "bytecode interpreter" to run the plugins. This solution is strikingly similar to a newer portable plugin technology: WebAssembly!
The goal of this project would be to create a proof-of-concept WebAssembly (wasm) runtime in ClamAV so that "wasm signatures" could be written in Rust and executed in a wasm sandbox. As with our current bytecode signature technology, the wasm signatures would run at specific hooks in the ClamAV scanning process. They would need access to the file map (buffer) being scanned, and would be given a limited API to call into ClamAV functions.
For a proof-of-concept, executing a local wasm plugin that has access to the file being scanned (without copying the data) would be fine. A production solution would need to convert the wasm plugin to an ASCII text encoding so it can be distributed much the same way the current bytecode signature .cbc plugins are distributed. As with the bytecode signatures, clamscan and clamd must not load the plugins unless they've been digitally signed or the --bytecode-unsigned / BytecodeUnsigned options are set, which would disable this safety precaution.
Important Notes: The ClamAV bytecode compiler project is currently undergoing a major re-write. Once complete, the new bytecode compiler will effectively be a Python script that invokes clang with a collection of custom compiler passes that compile C code into ClamAV-bytecode plugins. This project would have you extend that project to instead use rustc to compile Rust ClamAV-WASM plugins.
Category: Core Development, Fun
What you will learn from this project:
- Compilers
- LLVM, WebAssembly JIT
- Executable plugin sandboxing
- Rust
Required skills:
- C/C++ development experience.
- Rust development experience.
Project Size: Large
Add Unpacking Support for New Packers
ClamAV includes support for unpacking executables generated by several software packers so that malware can't use them to easily evade detection. The list of packers currently supported can be found in the Introduction of the ClamAV Manual. There are many packers out there, though, so there is always a need to write unpacking code for ones that are frequently used by malware authors. Some that are currently needed include:
- UPX for ELF
- MPRESS (although we do have some bytecode signatures for MPRESS - those might be sufficient)
- If anyone is interested in this, we can analyze thousands of samples and identify more candidates for this list
Improvements to existing executable (PE/ELF/MachO) parsing code would likely be in C, but any new standalone modules would ideally be written in Rust.
Category: Fun/Peripheral
What you will learn from this project:
- How packers function, the steps involved in run-time loading and fixing memory maps, and a general approach to unpacking
- You'll gain experience reverse-engineering real-world malware
Required skills:
- C development experience.
- Rust development experience.
Project Size: Large
Add Support for Matching on .NET Internals
YARA extracts certain properties of .NET executables and makes them available for signatures to use for detection: https://yara.readthedocs.io/en/v3.6.0/modules/dotnet.html
Can ClamAV do something similar? For instance, extract the GUIDs and allow matching on those the way we do entries in the PE VersionInfo section?
Tip: An ideal solution for this and any new file parsing feature should be written in Rust and called by our existing C code.
Category: Fun/Peripheral
What you will learn from this project:
- How .NET executables are structured, and how they work internally
- How to write .NET applications (for testing)
- You'll also test your code against real-world malware, and perform reverse-engineering of samples as needed (if they break your code).
Required skills:
- C development experience.
- Rust development experience.
- Any prior experience in the areas listed above is a plus.
Project Size: Large
Extract Macros from OXML docs
ClamAV and SigTool currently support parsing OLE Office files to decompress and extract macros for scanning. The newer OOXML Office files do not have this support, so macros in these documents cannot currently be detected. The ability to both extract and scan macros would enable better coverage. This might mean creating a new target type so that analysts don't need to create two signatures, one for OLE macros and another for OOXML macros.
Tip: An ideal solution for this and any new file parsing feature should be written in Rust and called by our existing C code.
Category:
What you will learn from this project:
- ClamAV and SigTool internals
- Office document macro compression (RLE compression)
- Macro storage in OOXML files
Required skills:
- C development experience.
- Rust development experience.
- Any prior experience in the areas listed above is a plus.
Project Size: Medium
Dynamically add new file types simply by adding file type magic (.ftm) signatures
Known file types are currently baked into each ClamAV version along with file type magic signatures. See filetypes_int.h, filetypes.h, and filetypes.c. The hardcoded signature definitions for these hardcoded types are generally overridden by daily.ftm, a component of daily.cvd used to tweak file type identification definitions after release.
This project would be to re-architect how file types are stored in libclamav so new file types can be dynamically added when daily.ftm (or some other .ftm file) is loaded. Supplemental .ftm files should supplement the existing file type definitions, allowing an extra.ftm file to be tested alongside daily.cvd.
This new capability, when combined with the ability to register bytecode signatures as new file type scanners, will dramatically increase the ability to extend ClamAV functionality between major version updates. When also combined with logical signatures that target specific file types (using the proposed new Type: keyword instead of Target:, see the project idea below), it will allow creative analysts to write more compact and efficient logical signatures.
Category: Fun, Core Development
What you will learn from this project:
- Software architecture experience.
Required skills:
- C development experience.
Project Size: Medium
Register scanners for each file type, Write bytecode "signature" scanners.
Bytecode signatures are the portable executable plugin format for ClamAV. If ClamAV file types each had one or more* linked lists of file type handlers ("scanners"), then a bytecode API could be added to register a bytecode signature as a new scanner for a file type.
This project should be completed after the project to dynamically add new file types with new file type magic signatures (above). This new scanning architecture would be a really powerful way to add features to the product without requiring a major version update. When combined with the project to run WebAssembly signatures written in Rust (project idea above) -- this plugin-based scanner feature would have the potential to become the fastest and safest way to add new capabilities to ClamAV.
Example use case:
One example use case of this feature would be to alert on the malicious use of crypto miner wallet IDs.
Cryptomining malware has become increasingly prevalent with the rise in cryptocurrency prices, and we have thousands of wallet identifiers known to be associated with malicious cryptomining campaigns. We don't have a robust way of using these IDs for detection, though, because we only want to raise an alert if the ID appears to be used in a malicious way (Ex: hardcoded into a mining application or as part of a coin miner configuration file) and not in legitimate ways (Ex: blog posts about campaigns or wallet block lists used by the mining pools).
The two use-cases that we want to alert on are miner config files and executables with the embedded wallet identifier. We could have two .ftm rules (one for each case) that indicate a CL_TYPE_MINER or something like that, and then scanning execution for CL_TYPE_MINER can go to the bytecode sig to perform any other checks that may be necessary.
* Additional Considerations: ClamAV has several locations in the scanning process for invoking file type scanners:
1. After initial file type identification, and before the "raw scan". In cli_magic_scan().
2. Once for each embedded file type found when using scanraw() to also match on embedded type recognition signatures*. In scanraw().
   * Embedded type recognition signature matching is a feature used to identify self-extracting archives and some harder to identify file formats, like XML-based office document formats, DMG files, master boot records (MBR), etc. It isn't used for some archive and disk image formats that we'll unpack later anyway because they cause excessive type false positives and duplicate file scanning. A common example without this safety measure was duplicate file extraction and scanning of zip file entries found in a tarball.
3. After scanning all of the found embedded types (above). At the end of scanraw(). These could probably be moved to (4) if it is deemed safe to remove the 1st "safety measure" call to scanraw() in cli_magic_scan() (i.e., we'd only call scanraw() once, ever).
4. Again, after the call to scanraw() at the bottom of cli_magic_scan(), for types that have bytecode hooks that won't execute unless a logical signature matches, requiring scanraw() to perform matching first.
Considering that there are 3 or 4 placement options for scanners, it may be necessary to have 3 (or 4) different lists per file type, so that registering a new scanner also indicates when in the scanning process it should run. An enum argument for the registration function would indicate which list to add it to. If each new scanner for a given type is inserted at the front of the list, and the next scanner is only invoked when the previous one returns CL_EPARSE or CL_EFORMAT, then a scanner registration could be used to override an existing/built-in one or supplement it, whichever is desired.
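The registration scheme described above can be sketched as follows. Everything here is hypothetical and only illustrates the front-insertion and fall-through behavior: the demo_status values stand in for libclamav's status codes (DEMO_EPARSE and DEMO_EFORMAT for CL_EPARSE and CL_EFORMAT), and the stage names, struct layout, and function names are assumptions rather than existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Placeholder status codes standing in for libclamav's return values. */
typedef enum { DEMO_CLEAN = 0, DEMO_VIRUS, DEMO_EPARSE, DEMO_EFORMAT } demo_status;

/* Hypothetical stages mirroring the placement options listed above. */
typedef enum {
    STAGE_BEFORE_RAW_SCAN = 0,
    STAGE_EMBEDDED_TYPE,
    STAGE_AFTER_EMBEDDED,
    STAGE_AFTER_RAW_SCAN,
    STAGE_COUNT
} scan_stage;

typedef demo_status (*scanner_fn)(const unsigned char *data, size_t len);

struct scanner_node {
    scanner_fn fn;
    struct scanner_node *next;
};

/* One file type with a separate scanner list per stage. */
struct file_type {
    const char *name;
    struct scanner_node *scanners[STAGE_COUNT];
};

/* Push a scanner onto the front of a list so it runs before older ones. */
static int register_scanner(struct file_type *ft, scan_stage stage, scanner_fn fn)
{
    struct scanner_node *node;

    assert(ft != NULL && fn != NULL && stage < STAGE_COUNT);

    node = malloc(sizeof(*node));
    if (node == NULL)
        return -1;
    node->fn = fn;
    node->next = ft->scanners[stage];
    ft->scanners[stage] = node;
    return 0;
}

/* Run scanners in list order; fall through only on parse/format errors. */
static demo_status run_scanners(const struct file_type *ft, scan_stage stage,
                                const unsigned char *data, size_t len)
{
    demo_status ret = DEMO_CLEAN;
    const struct scanner_node *node;

    for (node = ft->scanners[stage]; node != NULL; node = node->next) {
        ret = node->fn(data, len);
        if (ret != DEMO_EPARSE && ret != DEMO_EFORMAT)
            break;
    }
    return ret;
}

/* Example built-in scanner: accepts any input. */
static demo_status builtin_scanner(const unsigned char *data, size_t len)
{
    (void)data;
    (void)len;
    return DEMO_CLEAN;
}

/* Example override: only handles data starting with 'M', defers otherwise. */
static demo_status override_scanner(const unsigned char *data, size_t len)
{
    if (len == 0 || data[0] != 'M')
        return DEMO_EFORMAT; /* not ours: let the next scanner try */
    return DEMO_VIRUS;
}
```

Registering builtin_scanner first and override_scanner second means the override sees each file first and the built-in only runs when the override declines, which is the "override or supplement" behavior the paragraph describes.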
This project would require coming up with a common file-type-scanner API for all scanners (including bytecode scanners), and would enable moving all file-type-scanners out of scanners.c and into a new file for each in a scanners subdirectory. A separate parsers subdirectory should be added at this time and each file type parser would be moved there. The distinction between a "scanner" and a "parser" is this: a scanner uses a parser to extract bits to be scanned. A parser may simply be something like an archive extraction library. In some cases, particularly in internally developed code, the distinction may be less clear and so the entire thing may be better placed under the scanners directory as the entry-point will doubtless need to use the common file-type-scanner API.
This project will also require creating lots of regression tests for file type identification to ensure that the new architecture doesn't accidentally misclassify or fail to scan certain files.
The majority of the work won't actually change ClamAV's behavior, which may seem frustrating, but the end goal is super cool. Code cleanup and organization along the way will also make a meaningful difference. This project could be split into pieces:
- Establish a common file type scanner function API and reorganize the scanners and parsers as described above.
- Convert the API into a callback function pointer definition and create a registration API. Add a set of scanner callback lists to each file type. The built-in scanners should be initialized either at compile time or at least when libclamav is initialized, depending on the chosen design.
Category: Very Fun, Core Development
What you will learn from this project:
- Software architecture experience
- How to write ClamAV signatures (bytecode and LDB sigs)
- You'll test your code against real-world malware, and can do reverse engineering if you'd like to expand the initial coinminer classification logic.
Required skills:
- Strong C development experience.
- Any prior experience in the areas listed above is a plus.
Project Size: Very Large
Limit logical signature alerts based on file type
ClamAV signatures have a "Target Type", an integer that can be used in signatures to limit signature matches to specific file types. ClamAV also categorizes signature patterns into two different Aho-Corasick pattern-matching tries by Target Type. Target Type 1 (Windows executables (EXE/DLL/SYS/etc.)) goes in one trie, and everything else goes in the other trie. Unfortunately, not every file type has an associated target type. In addition, while it's conceivable to be able to add new text-based file types dynamically (see the above project idea about file type magic signatures), it is less feasible to dynamically add new numerical target types.
For some advanced reading, see:
This project is to add a new "FileTypes:" keyword to the TargetDescriptionBlock for Logical Signature (.ldb) to limit logical signature alerts to specific file types, much like you currently can do with Target Types ("Target:"), Container File Types ("Container:"), and Container Intermediate Types ("Intermediates:"). While this isn't expected to improve scan times, it should reduce overall signature size as analysts will no longer need to duplicate the file-type-magic signature in order to limit alerting on a signature match by file type.
To illustrate, this is the file type magic signature for a Microsoft Shortcut File, aka CL_TYPE_LNK:
0:0:4C0000000114020000000000C000000000000046:Microsoft Windows Shortcut File:CL_TYPE_ANY:CL_TYPE_LNK:100
Though we can classify a file as CL_TYPE_LNK and even unpack the file with a custom scanner using that type, there is presently no way to write a signature for CL_TYPE_LNK files without duplicating the 0:4C0000000114020000000000C000000000000046 bit.
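For readers unfamiliar with what that duplicated bit does, the sketch below shows the underlying check in its simplest form: comparing a hex magic string against a buffer at a fixed offset. It is a hypothetical, simplified helper (magic_matches() is not a libclamav function) and it ignores wildcards and every other feature of the real signature syntax.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = (char)tolower((unsigned char)c);
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

/*
 * Check whether `buf` contains the bytes described by the hex string
 * `magic` at `offset`. Plain hex only; no wildcards are supported.
 */
static bool magic_matches(const unsigned char *buf, size_t len,
                          size_t offset, const char *magic)
{
    size_t hexlen, i;

    assert(buf != NULL && magic != NULL);

    hexlen = strlen(magic);
    if (hexlen == 0 || hexlen % 2 != 0)
        return false;
    if (offset > len || hexlen / 2 > len - offset)
        return false;

    for (i = 0; i < hexlen / 2; i++) {
        int hi = hex_value(magic[2 * i]);
        int lo = hex_value(magic[2 * i + 1]);

        if (hi < 0 || lo < 0)
            return false;
        if (buf[offset + i] != (unsigned char)(hi * 16 + lo))
            return false;
    }
    return true;
}
```

With a type-based keyword, this comparison would happen once during file type identification instead of being repeated as a subsignature in every logical signature that targets the type.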
At present a signature to alert on a "malicious" shortcut containing 0xdeadbeef might look like this:
SignatureName;Target:0;(0&1);0:4C0000000114020000000000C000000000000046;deadbeef
After this change, the signature could instead read:
SignatureName;Target:0,FileTypes:CL_TYPE_LNK;(0);deadbeef
Extending this, we would really like to build this new option to replace "Container" and "Intermediates". We would like to also specify parent file types, and use a logical condition supporting alternative file types for each layer.
Some examples:
- FileTypes:(ZIP|RAR)>PDF to say "a PDF in a ZIP or a RAR"
- FileTypes:(ZIP|RAR)>(PDF|HTML) to say "a PDF or HTML file in a ZIP or a RAR"
- FileTypes:EML>ZIP>* to say "any file in a ZIP in an email"
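One possible reading of the expressions above can be sketched in C. The semantics are an assumption, since the syntax is only a proposal: each layer is a type name, a parenthesized list of alternatives, or *, and the expression is aligned to the innermost end of the container chain. The function names and the short type names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_LAYERS 16

/* Does one layer ("PDF", "(ZIP|RAR)" or "*") accept the type name `type`? */
static bool layer_accepts(const char *layer, size_t layer_len, const char *type)
{
    size_t type_len = strlen(type);
    const char *p = layer;
    const char *end = layer + layer_len;

    if (layer_len == 1 && layer[0] == '*')
        return true;

    /* Strip optional surrounding parentheses. */
    if (layer_len >= 2 && p[0] == '(' && end[-1] == ')') {
        p++;
        end--;
    }

    /* Walk the '|'-separated alternatives. */
    while (p < end) {
        const char *bar = memchr(p, '|', (size_t)(end - p));
        const char *alt_end = bar ? bar : end;

        if ((size_t)(alt_end - p) == type_len && memcmp(p, type, type_len) == 0)
            return true;
        p = bar ? bar + 1 : end;
    }
    return false;
}

/*
 * Match an expression such as "(ZIP|RAR)>PDF" against a container chain.
 * chain[0] is the outermost file and chain[depth - 1] is the file being
 * scanned; the expression is aligned to the innermost end of the chain.
 */
static bool filetypes_match(const char *expr, const char *const *chain, size_t depth)
{
    const char *starts[MAX_LAYERS];
    size_t lens[MAX_LAYERS];
    size_t n = 0, i;
    const char *p = expr;

    assert(expr != NULL && chain != NULL);

    /* Split the expression on '>' without copying it. */
    for (;;) {
        const char *gt = strchr(p, '>');

        if (n == MAX_LAYERS)
            return false;
        starts[n] = p;
        lens[n] = gt ? (size_t)(gt - p) : strlen(p);
        n++;
        if (gt == NULL)
            break;
        p = gt + 1;
    }

    if (n > depth)
        return false;

    for (i = 0; i < n; i++) {
        if (!layer_accepts(starts[i], lens[i], chain[depth - n + i]))
            return false;
    }
    return true;
}
```

A real implementation would more likely parse the expression once at database load time and compare numeric type identifiers rather than strings during scanning.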
Category: Low-hanging Fruit, Core Development
What you will learn from this project:
- Knowledge of ClamAV's signature databases, and logical signature evaluation.
Required skills:
- C development experience.
Project Size: Medium
libclamav Callback Function to Request Additional File
Add a callback function to give libclamav file parsers the ability to request additional file data from the scanning application -- i.e. clamscan and clamd (and by extension clamdscan & clamonacc).
This feature would enable support for split-archive scans, if all components of the split archive are present and available to the scanning application. To make this work for clamdscan+clamd, or clamonacc+clamd, the request would also have to be relayed by clamd over the socket API to the scanning client, and the client would have to respond with additional data, a filepath, or a file descriptor for clamd to provide via the callback to the file parser.
Disclaimer: It's entirely likely that this idea is bogus and wouldn't work over the clamd+clamdscan socket API. This task would require a fair amount of exploratory coding.
When a file is scanned, the scanner (e.g. cli_scanrar) may call a callback function provided by clamscan or clamd to request scan access to other files by name, with the expectation that it would receive an fmap in response. Specifically, when the first file in a split archive is scanned, the parser could request fmaps for subsequent files to provide to the archive extraction library. Direct scanning of files other than the first file in a split archive will be skipped, because they are split and are not the first file.
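The callback flow described above might look roughly like the sketch below. All of it is hypothetical: demo_map is a stand-in for an fmap, request_file_cb is an invented callback signature, and the "<base>.001", "<base>.002" naming is only one assumed split-archive convention. The relay over the clamd socket is not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an fmap: a named, read-only buffer. */
struct demo_map {
    const char *name;
    const unsigned char *data;
    size_t len;
};

/*
 * Hypothetical callback type: the scanning application returns a map for
 * `filename`, or NULL when the file is unavailable.
 */
typedef const struct demo_map *(*request_file_cb)(const char *filename, void *ctx);

/*
 * Request the volumes "<base>.002", "<base>.003", ... that follow the first
 * volume and return the combined size of every volume that was available.
 */
static size_t collect_split_volumes(const struct demo_map *first, const char *base,
                                    request_file_cb request, void *ctx)
{
    size_t total;
    unsigned part;

    assert(first != NULL && base != NULL && request != NULL);

    total = first->len;
    for (part = 2; part < 1000; part++) {
        char name[256];
        const struct demo_map *next;
        int n = snprintf(name, sizeof(name), "%s.%03u", base, part);

        if (n < 0 || (size_t)n >= sizeof(name))
            break; /* formatting failed or name too long */
        next = request(name, ctx);
        if (next == NULL)
            break; /* no more volumes available */
        total += next->len;
    }
    return total;
}

/* Example application-side callback backed by an in-memory table. */
struct demo_table {
    const struct demo_map *maps;
    size_t count;
};

static const struct demo_map *lookup_in_table(const char *filename, void *ctx)
{
    const struct demo_table *table = ctx;
    size_t i;

    for (i = 0; i < table->count; i++) {
        if (strcmp(table->maps[i].name, filename) == 0)
            return &table->maps[i];
    }
    return NULL;
}
```

In clamscan the callback could simply open the named file from disk, while for clamd it would have to forward the request to the client, which is the part the disclaimer above flags as uncertain.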
Category: Risky/Exploratory, Core Development
What you will learn from this project:
- ClamAV and SigTool internals
- Socket programming
Required skills:
- C and C++ development experience.
Project Size: Large