Data Recovery Algorithm
In 2013 there are many file systems: FAT, NTFS, HFS, exFAT, ext2/ext3 and many others, used by many different operating systems. And yet the oldest and simplest of them all is still going strong. The FAT system is outdated and imposes tight limits on the maximum disk size and the size of a single file. It is rather simple by today’s standards: it provides no permission management, no built-in mechanism for rolling back and recovering transactions, and no built-in compression or encryption. And yet it remains very popular for many applications. The FAT system is so easy to implement and requires so few resources that it is irreplaceable for a variety of mobile applications.
FAT is used in most digital cameras. Most memory cards used in media players, smartphones and tablets are formatted with FAT. Even Android devices can use FAT-formatted memory cards. In other words, FAT is alive and well despite its age.
Restore information from FAT volumes
Because the FAT system is so popular, there is a need for data recovery tools that support it. This article shares the experience gained in developing such a tool.
Before we go into the file system internals, let’s take a quick look at why data recovery is possible at all. The operating system (Windows, Android, or whatever system runs in a digital camera or media player) does not erase the information when a file is deleted. Instead, the system marks the file’s record in the file system as deleted and flags the space previously occupied by the file as available. This is much faster than erasing the contents on the disk, and it also reduces wear.
As you can see, the actual content of a file remains somewhere on the disk, and this is what makes data recovery tools possible. The question now is how to identify which sectors on the disk contain information related to a particular file. To do this, a data recovery tool can either parse the file system or scan the content area of the disk for deleted files by matching the raw content against a database of predefined persistent signatures.
This second method is often referred to as “signature search” or “content-aware analysis”. In forensic applications, the same approach is called “carving”. Regardless of the name, the algorithms are very similar. They read the entire disk surface and look for distinctive signatures that identify files in the supported formats. Once a known signature is found, the algorithm performs a secondary check and then reads and analyzes the apparent header of the file. In many formats, the header allows the algorithm to determine the exact length of the file. By reading the disk sectors that follow the start of the file, the algorithm restores the contents of the deleted file.
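To make the first step of a signature search concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the code of any particular tool: the `SIGNATURES` table and the function name `find_signatures` are hypothetical, the sector size is assumed to be 512 bytes, and the secondary check and header analysis described above are left out. The magic bytes listed for JPEG, PNG and PDF are the well-known leading bytes of those formats.

```python
# Hypothetical signature table: format name -> leading "magic" bytes.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "pdf": b"%PDF-",
}

SECTOR_SIZE = 512  # assumption: 512-byte sectors


def find_signatures(image: bytes, sector_size: int = SECTOR_SIZE):
    """Return (sector_number, format_name) for each sector that starts
    with a known signature.

    Files start on cluster boundaries, and clusters are made of whole
    sectors, so only the first bytes of each sector need to be tested.
    """
    hits = []
    for offset in range(0, len(image), sector_size):
        for name, magic in SIGNATURES.items():
            if image.startswith(magic, offset):
                hits.append((offset // sector_size, name))
                break  # one format per sector is enough
    return hits
```

A real tool would follow each hit with a format-specific validation of the header before attempting to read the file body.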
If you have been following carefully, you may have already spotted some issues with this approach. It is extremely slow and can only identify a limited number of known (supported) file formats. Most importantly, this approach assumes that the disk sectors following the header belong to that particular file, which is not always true. Files are not always stored one after the other. Instead, the operating system can write blocks to the first available clusters on the disk. As a result, a file can be fragmented into multiple parts. Recovering fragmented files with a signature search is hit or miss: short, unfragmented files can normally be recovered without any problem, while long, fragmented files may not be recoverable at all or may come out damaged.
In practice, the signature search works pretty well. Most of the files that matter to the user are documents, images, and other similarly small files. Granted, a long video may not be recoverable, but a typical document or JPEG image is usually small enough to be stored without fragmentation and recovers quite well.
However, if fragmented files need to be recovered, the tool must combine the information obtained from the file system with the information collected during the disk scan. For example, this makes it possible to exclude clusters that are already in use by other files, which, as we’ll see in the next section, greatly improves the chance of a successful recovery.
Use information from the file system to improve recovery quality
As we have seen, signature searching alone works great if there is no file system left on the hard disk, or if the file system is so badly damaged that it becomes unusable. In all other cases, information from the file system can greatly improve the quality of recovery.
Let’s take a big file that we need to recover. Suppose the file was fragmented (as is typical for larger files). If you simply use the signature search, only the first fragment of the file will be restored. The other fragments are not restored correctly. It is therefore important to determine which sectors on the disk belong to that particular file.
Windows and other operating systems determine which sectors belong to which file by listing records in the file system. File system records contain information about which sectors belong to which file.
Search for a file system: the partition system
Before we analyze the file system, we first have to identify and locate one. But before we search for a file system, let’s look at how Windows deals with partitions.
Windows describes disks with a partition table that contains one or more records. Each record describes a single partition: it contains the start address of the partition, its length, and the partition type.
The hard drive is divided into three partitions with corresponding volume labels.
This table contains information about the type, beginning, and end of each partition.
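As an illustration, the sketch below reads such records from the first sector of a disk. It assumes the classic MBR partitioning scheme, in which four 16-byte records start at byte offset 446, with the type at byte 4 of each record and the starting sector and sector count stored as little-endian 32-bit values at bytes 8 and 12. Disks partitioned with GPT are laid out differently and are not covered here. The function name `parse_mbr_partitions` and the dictionary keys are hypothetical.

```python
import struct

PARTITION_TABLE_OFFSET = 446  # position of the first record in an MBR sector
ENTRY_SIZE = 16               # size of one partition record
ENTRY_COUNT = 4               # an MBR holds four primary records


def parse_mbr_partitions(sector0: bytes):
    """Return type, start and length (in sectors) of each used MBR record."""
    if len(sector0) < 512:
        raise ValueError("expected at least one full 512-byte sector")
    partitions = []
    for i in range(ENTRY_COUNT):
        base = PARTITION_TABLE_OFFSET + i * ENTRY_SIZE
        ptype = sector0[base + 4]
        start_lba, num_sectors = struct.unpack_from("<II", sector0, base + 8)
        if ptype == 0 or num_sectors == 0:
            continue  # unused slot
        partitions.append(
            {"type": ptype, "start": start_lba, "length": num_sectors}
        )
    return partitions
```

The `start` value tells the recovery tool where to look for the boot sector of the file system on that partition.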
To locate the file system, the data recovery tool must parse the partition table, if one is still available. But what if no partition table is left or the hard disk has been repartitioned and the new partition table no longer contains any information about the deleted volume? In this case, the tool searches the hard disk to identify all available file systems.
When searching for a file system, the algorithm assumes that each partition contains a file system. Most file systems can be identified by searching for a particular persistent signature. For example, a FAT file system is identified by the values recorded at byte offsets 510 and 511 of its first sector. If the values recorded at these addresses are “0x55” and “0xAA”, the tool starts a secondary check.
The secondary check allows the tool to make sure that a real file system has been found rather than a chance match. It verifies certain values used by the file system. For example, one of the records in the FAT system indicates the number of sectors per cluster. This value is always a power of two: it can be 1, 2, 4, 8, 16, 32, 64 or 128. If this address contains any other value, the structure is not a FAT file system.
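The two checks can be sketched as follows. This is a minimal example, not a complete validator: it tests only the signature bytes and the sectors-per-cluster value, which the FAT boot sector stores in the single byte at offset 13. The function name `looks_like_fat_boot_sector` is hypothetical, and a production tool would verify further fields before trusting the result.

```python
VALID_SECTORS_PER_CLUSTER = {1, 2, 4, 8, 16, 32, 64, 128}


def looks_like_fat_boot_sector(sector: bytes) -> bool:
    """Primary and secondary check for a candidate FAT boot sector."""
    if len(sector) < 512:
        return False
    # Primary check: the signature bytes at offsets 510 and 511.
    if sector[510] != 0x55 or sector[511] != 0xAA:
        return False
    # Secondary check: sectors per cluster (offset 13) must be a
    # power of two between 1 and 128.
    return sector[13] in VALID_SECTORS_PER_CLUSTER
```

When no partition table is available, the same function can be applied to every sector of the disk in turn to build a list of candidate file systems.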
After we’ve found the file system, we can start analyzing the records. Our goal is to identify addresses of the physical sectors on the disk that contain data belonging to a deleted file. To do this, a data recovery algorithm searches the file system and enumerates its records.
In the FAT system, each file and directory has a corresponding record in the file system, a so-called directory entry. Directory entries contain information about the file, including name, attributes, start address, and length.
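The sketch below decodes one such record. It assumes the standard 32-byte short directory entry: an 8.3 name in bytes 0 to 10, attributes in byte 11, the starting cluster in bytes 26 and 27 (with the high 16 bits in bytes 20 and 21 on FAT32), and the file size in bytes 28 to 31. A deleted entry has its first name byte replaced with 0xE5, which is why the first character of a deleted file name is lost. The `DirEntry` class and `parse_dir_entry` function are hypothetical names, and long file name entries are ignored here.

```python
import struct
from dataclasses import dataclass

DELETED_MARKER = 0xE5  # first name byte of a deleted entry


@dataclass
class DirEntry:
    name: str
    attributes: int
    first_cluster: int
    size: int
    deleted: bool


def parse_dir_entry(raw: bytes) -> DirEntry:
    """Decode a 32-byte short (8.3) FAT directory entry."""
    if len(raw) != 32:
        raise ValueError("a FAT directory entry is 32 bytes long")
    deleted = raw[0] == DELETED_MARKER
    # The first character of a deleted name is gone; show '?' instead.
    name_bytes = (b"?" + raw[1:8]) if deleted else raw[0:8]
    base = name_bytes.decode("ascii", errors="replace").rstrip()
    ext = raw[8:11].decode("ascii", errors="replace").rstrip()
    # High word is meaningful on FAT32 and normally zero on FAT12/FAT16.
    (cluster_high,) = struct.unpack_from("<H", raw, 20)
    (cluster_low,) = struct.unpack_from("<H", raw, 26)
    (size,) = struct.unpack_from("<I", raw, 28)
    return DirEntry(
        name=f"{base}.{ext}" if ext else base,
        attributes=raw[11],
        first_cluster=(cluster_high << 16) | cluster_low,
        size=size,
        deleted=deleted,
    )
```

For a deleted file, `first_cluster` and `size` are the two values a recovery tool has to work with.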
The contents of a file or directory are stored in blocks of the same length. These data blocks are called clusters. Each cluster contains a certain number of disk sectors. This number is a fixed value for each FAT volume. It is recorded in the appropriate file system structure.
The tricky part comes when a file or directory occupies more than a single cluster. Subsequent clusters are identified by a data structure called the FAT (File Allocation Table). This structure is used to find the next cluster that belongs to a particular file and to determine whether a particular cluster is in use or available.
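For a file that has not been deleted, following the chain can be sketched like this. The example assumes FAT16, where each table entry is a 16-bit little-endian value: 0x0000 marks a free cluster, 0xFFF7 a bad cluster, and values from 0xFFF8 upward mark the end of a chain. The function names are hypothetical; FAT12 and FAT32 use different entry widths and are not handled.

```python
import struct

FAT16_FREE = 0x0000
FAT16_BAD = 0xFFF7
FAT16_END_MIN = 0xFFF8  # 0xFFF8..0xFFFF mean "last cluster of the file"


def read_fat16(fat_bytes: bytes) -> list:
    """Decode a raw FAT16 table into a list of 16-bit entries."""
    count = len(fat_bytes) // 2
    return list(struct.unpack_from(f"<{count}H", fat_bytes))


def cluster_chain(fat, first_cluster):
    """Follow the chain starting at first_cluster and return all clusters."""
    chain = []
    seen = set()  # protects against loops in a damaged table
    cluster = first_cluster
    while 2 <= cluster < len(fat) and cluster not in seen:
        chain.append(cluster)
        seen.add(cluster)
        nxt = fat[cluster]
        if nxt >= FAT16_END_MIN or nxt in (FAT16_FREE, FAT16_BAD):
            break
        cluster = nxt
    return chain


def is_free(fat, cluster):
    """True if the table marks the cluster as available."""
    return fat[cluster] == FAT16_FREE
```

The `is_free` helper matters most for recovery: it tells the tool which clusters are not claimed by any existing file.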
Before analyzing the file system, it is important to identify the three system areas.
The first area is reserved; it contains important information about the file system. In FAT12 and FAT16 this area is one sector long. FAT32 can use more than one sector. The size of this area is specified in the boot sector.
The second area is the FAT area, which contains the primary and secondary FAT structures. This area immediately follows the reserved area. Its size is defined by the size and number of FAT structures.
Finally, the last area contains the actual data. The contents of files and directories are stored in this special area.
When analyzing the file system, the FAT area is of particular interest. This area contains information about the physical location of the files on the disk.
When analyzing the file system, it is important to correctly determine the three system areas. The reserved area always starts at the very beginning of the file system (sector number 0). The size of this area is specified in the boot sector. In FAT12 and FAT16, the size of this area is exactly one sector. In FAT32, this area can occupy multiple sectors.
The FAT area immediately follows the reserved area. The FAT area contains one or more FAT structures. The size of this area is calculated by multiplying the number of FAT structures by the size of each structure. These values are also stored in the boot sector.
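The calculation can be sketched as below. It reads the standard boot sector fields: the number of reserved sectors (16-bit value at offset 14), the number of FAT structures (byte at offset 16), and the size of one FAT in sectors (16-bit value at offset 22, or the 32-bit value at offset 36 on FAT32, where the 16-bit field is zero). The function name `fat_layout` and the dictionary keys are hypothetical.

```python
import struct


def fat_layout(boot_sector: bytes) -> dict:
    """Return start and size, in sectors, of the system areas of a volume."""
    (reserved,) = struct.unpack_from("<H", boot_sector, 14)
    num_fats = boot_sector[16]
    (fat_size,) = struct.unpack_from("<H", boot_sector, 22)
    if fat_size == 0:
        # FAT32 stores the size of one FAT in a 32-bit field instead.
        (fat_size,) = struct.unpack_from("<I", boot_sector, 36)
    fat_area_size = num_fats * fat_size
    return {
        "reserved_start": 0,
        "reserved_size": reserved,
        "fat_start": reserved,
        "fat_area_size": fat_area_size,
        # Note: on FAT12/FAT16 the fixed-size root directory occupies the
        # first sectors after the FATs, before the first data cluster.
        "data_start": reserved + fat_area_size,
    }
```

All values are sector numbers relative to the start of the volume, so the partition's start address has to be added to get positions on the physical disk.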
We are finally close to recovering our first file. For example, suppose the file was recently deleted and no part of the file was overwritten with other data. This means that all clusters previously used by this file are now marked as available.
It is important to note that the system also clears the corresponding FAT records when a file is deleted. This means we still have the starting address of the file, its attributes, and its size, but we have no way to look up its subsequent clusters.
At this point, we cannot recover the entire list of clusters that belonged to the deleted file. However, we can still try to recover the contents of the file by reading the first cluster. If the file is relatively small and fits into a single cluster, that’s great! We have just recovered the file. However, if the file is larger than a single cluster, we must develop an algorithm to recover the rest of the file.
The FAT system offers no easy way to determine which clusters belonged to a deleted file, so this task is always a guessing game. The easiest way is to read the clusters that follow the first one and ignore whether those clusters are occupied by other files or not. As silly as it sounds, this is the only method available if no file system is available or if the file system is empty (for example, after formatting the disk).
The other method is more complex and only reads information from clusters that are not populated with data from other files. This method considers information about clusters that are being used by other files specified in the file system.
It is logical to assume that the second method provides better results than the first method (assuming that the file system is available and not empty). The second method can even restore fragmented files.
Consider three different scenarios for restoring a deleted file, shown across a span of six file system clusters (56 through 61). The file size is 7094 bytes and the cluster size is 2048 bytes, which means that the deleted file originally occupied 4 clusters (7094 / 2048, rounded up). In addition, we know the address of the initial cluster (cluster 56). Red color indicates clusters that are occupied by other data, while empty clusters are filled white.
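The cluster count follows from dividing the file size by the cluster size and rounding up. A short sketch of this arithmetic, with the hypothetical helper name `clusters_needed`:

```python
def clusters_needed(file_size: int, cluster_size: int) -> int:
    """Number of clusters required to hold file_size bytes (rounded up)."""
    return -(-file_size // cluster_size)  # ceiling division on integers


file_size = 7094
cluster_size = 2048

count = clusters_needed(file_size, cluster_size)
unused_tail = count * cluster_size - file_size

print(count, unused_tail)
```

For the values in this example the result is 4 clusters, with 1098 unused bytes at the end of the last cluster. A recovery tool reads four whole clusters and then trims the output to the recorded file size.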
In scenario A, the file occupies 4 subsequent clusters (that is, the file is not fragmented). In this case, the file can be correctly restored by both algorithms. Both algorithms read clusters 56 through 59 correctly.
In scenario B, the file was fragmented and stored in 3 fragments. Clusters 57 and 60 are used by another file. In this scenario, the first algorithm restores clusters 56 through 59, which yields a corrupted file. The second method correctly restores clusters 56, 58, 59, and 61.
In the last scenario C, the deleted file was also fragmented (the same clusters as in scenario B). However, clusters 57 and 60 are not used by any other file. In this scenario, both algorithms restore clusters 56 through 59 and both return a corrupted file.
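The three scenarios can be replayed with a small sketch of both methods. The function names are hypothetical, and the set of clusters used by other files stands in for what a real tool would read from the FAT.

```python
def recover_sequential(first_cluster, count):
    """First method: take `count` consecutive clusters, ignoring the FAT."""
    return list(range(first_cluster, first_cluster + count))


def recover_skipping_used(first_cluster, count, used, total_clusters):
    """Second method: take the next `count` clusters not used by other files."""
    picked = []
    cluster = first_cluster
    while len(picked) < count and cluster < total_clusters:
        if cluster not in used:
            picked.append(cluster)
        cluster += 1
    return picked


# (clusters used by other files, clusters the file really occupied)
SCENARIOS = {
    "A": (set(), [56, 57, 58, 59]),
    "B": ({57, 60}, [56, 58, 59, 61]),
    "C": (set(), [56, 58, 59, 61]),
}

for label, (used, truth) in SCENARIOS.items():
    first = recover_sequential(56, 4) == truth
    second = recover_skipping_used(56, 4, used, 64) == truth
    print(label, "sequential ok:", first, "skip-used ok:", second)
```

Both methods succeed in scenario A, only the second succeeds in scenario B, and both fail in scenario C, because the gaps left by the fragmented file are indistinguishable from free space.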
As we can see, neither method is perfect, but the second algorithm has a higher chance of successful recovery than the first one.
In our simple scenarios, we assumed that all parts of the file are still available and have not been overwritten with other data. In real life this is not always the case. If some parts of a file have been overwritten by other files, no algorithm can completely restore the file.